Wednesday, July 24, 2013

How to Infer Ancestry from SNP Genotypes

Self-reported ancestry is a poor metric to use when statistically adjusting for the effects of ancestry, since some individuals misreport their ancestry or are simply unaware of their true ancestry.  Worse yet, sometimes no ancestry information was collected at all.  As many of you know, if you have SNP genotyping data you can estimate an individual's ancestry rather precisely.  Classically, this meant combining genotypes from your study sample with genotypes from a reference panel (e.g., HapMap or 1000 Genomes), finding the intersection of SNPs in the two datasets, and then running a clustering program to see which samples clustered with the reference ancestral populations.  Not a ton of work, but an annoyance all the same.  Luckily, a relatively new program does a lot of this groundwork for you.  It is called SNPWEIGHTS and can be downloaded here.  Essentially, the program takes SNP genotypes as input, finds the intersection of the sample genotypes with reference genotypes, weights them based on pre-configured parameters to construct the first couple of principal components (aka eigenvectors), and then calculates an individual's percentage ancestry for each ancestral population.  Here is how to run the program.
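As a quick aside before the steps: the SNP-intersection idea mentioned above is easy to picture in code. The sketch below is only an illustration, not how SNPWEIGHTS does it internally. It assumes EIGENSTRAT-style .snp lines, where the SNP name is the first whitespace-separated field, and the rs IDs and function names are made up for the example.

```python
from typing import Iterable, Set


def snp_ids(snp_lines: Iterable[str]) -> Set[str]:
    """Collect SNP names from the lines of an EIGENSTRAT-style .snp file.

    The SNP name is taken to be the first whitespace-separated field.
    """
    ids = set()
    for line in snp_lines:
        fields = line.split()
        if fields:  # skip blank lines
            ids.add(fields[0])
    return ids


def shared_snps(study_lines: Iterable[str], reference_lines: Iterable[str]) -> Set[str]:
    """Return the SNP names present in both the study and the reference data."""
    return snp_ids(study_lines) & snp_ids(reference_lines)


# Made-up example records: name, chromosome, genetic position, physical position, alleles
study = [
    "rs0001 1 0.0 1000 A G",
    "rs0002 1 0.0 2000 C T",
    "rs0003 2 0.0 1500 A C",
]
reference = [
    "rs0002 1 0.0 2000 C T",
    "rs0003 2 0.0 1500 A C",
    "rs0004 3 0.0 900 G T",
]

print(sorted(shared_snps(study, reference)))  # ['rs0002', 'rs0003']
```

With real data you would pass open file handles instead of the in-memory lists.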

First, make sure you have Python installed on your system and that your genotyping data is in EIGENSTRAT format.  A brief tutorial to convert to EIGENSTRAT format using the convertf tool is here.
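If you want a quick sanity check that a conversion produced what you expect, something like the following may help. It is a minimal sketch that assumes the plain-text EIGENSTRAT .geno layout (one line per SNP, one character per individual, with 0, 1, or 2 for the allele count and 9 for missing); the function name is my own and is not part of any package.

```python
from typing import List

VALID_CODES = set("0129")  # 0/1/2 = allele count, 9 = missing genotype


def check_geno(geno_lines: List[str], n_snps: int, n_individuals: int) -> List[str]:
    """Return a list of human-readable problems found in .geno lines.

    An empty list means the lines look consistent with the expected
    number of SNPs (rows) and individuals (characters per row).
    """
    problems = []
    rows = [line.rstrip("\n") for line in geno_lines]
    if len(rows) != n_snps:
        problems.append(f"expected {n_snps} SNP rows, found {len(rows)}")
    for i, row in enumerate(rows, start=1):
        if len(row) != n_individuals:
            problems.append(
                f"row {i}: expected {n_individuals} genotypes, found {len(row)}"
            )
        bad = set(row) - VALID_CODES
        if bad:
            problems.append(f"row {i}: unexpected characters {sorted(bad)}")
    return problems


# Two SNPs genotyped on three individuals
print(check_geno(["012\n", "909\n"], n_snps=2, n_individuals=3))  # []
```

Here n_snps and n_individuals would come from counting the lines of your .snp and .ind files.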

Next, download the SNPWEIGHTS package here and a reference panel. I usually use the European, West African and East Asian ancestral populations, but there are other options on the SNPWEIGHTS webpage as well.

Then, create a parameter file listing the locations of your input files, your input population (designated "AA", "CO", or "EA"), and an output file.  An example of one is below:
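As a rough sketch of what such a file might contain, the Python snippet below builds one as text. Please treat the details as assumptions: the key names (geno, snp, ind, snpwt, predpcoutput) are my recollection of the README included in the SNPWEIGHTS zip folder and should be checked against it, and every path shown is hypothetical.

```python
from typing import Dict

# ASSUMPTION: key names recalled from the SNPWEIGHTS README; verify before use.
PAR_KEYS = ["geno", "snp", "ind", "snpwt", "predpcoutput"]


def render_par_file(paths: Dict[str, str]) -> str:
    """Render 'key: value' lines for a parameter file, one per expected key."""
    missing = [key for key in PAR_KEYS if key not in paths]
    if missing:
        raise ValueError(f"missing parameter(s): {', '.join(missing)}")
    return "".join(f"{key}: {paths[key]}\n" for key in PAR_KEYS)


# Hypothetical paths, for illustration only
par_text = render_par_file({
    "geno": "/path/to/study.geno",
    "snp": "/path/to/study.snp",
    "ind": "/path/to/study.ind",
    "snpwt": "/path/to/snpwt.co",
    "predpcoutput": "/path/to/study.ancestry.out",
})
print(par_text)

# To save it under the name used in the run command:
# with open("par.SNPWEIGHTS", "w") as handle:
#     handle.write(par_text)
```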


Finally, run the program using the command inferancestry.py --par par.SNPWEIGHTS.  For the program to run correctly, make sure the inferancestry.info and snpwt.co files are in the same directory as your inferancestry.py file.
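If you run this often, a small wrapper can catch the missing-file problem before the program starts. This is just a sketch under a few assumptions: the helper names are mine, the `python` on your PATH is the interpreter the package expects, and the parameter file path is either absolute or relative to the SNPWEIGHTS directory.

```python
import subprocess
from pathlib import Path
from typing import List

# Files the post says must sit together in one directory
REQUIRED = ["inferancestry.py", "inferancestry.info", "snpwt.co"]


def missing_files(script_dir: Path) -> List[str]:
    """List which of the required files are absent from script_dir."""
    return [name for name in REQUIRED if not (script_dir / name).is_file()]


def run_inferancestry(script_dir: Path, par_file: str) -> int:
    """Run inferancestry.py from script_dir and return its exit code."""
    missing = missing_files(script_dir)
    if missing:
        raise FileNotFoundError(f"missing from {script_dir}: {', '.join(missing)}")
    # Same command as in the post, with an explicit interpreter in front
    cmd = ["python", "inferancestry.py", "--par", par_file]
    return subprocess.run(cmd, cwd=script_dir).returncode


# Example call (the directory is hypothetical):
# exit_code = run_inferancestry(Path("/path/to/SNPWEIGHTS"), "par.SNPWEIGHTS")
```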

For more details, see the SNPWEIGHTS paper or the README file included in the SNPWEIGHTS zip folder.  For code on generating eigenvector plots with overlaid ancestry percentages and triangle plots according to percent ancestry, see this post.