Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Showing posts with label EIGENSTRAT. Show all posts

Wednesday, July 24, 2013

Converting Genotype Files to EIGENSTRAT format

Need to convert your genotyping files into EIGENSTRAT format?  Here is a quick primer to do so.  If you haven't already, download the EIGENSTRAT package and extract the contents.  We will be using the convertf tool to convert your genotypes to EIGENSTRAT format.  The convertf tool can convert to/from ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED, and PACKEDANCESTRYMAP files.

First, check whether your convertf program works out of the box by going to the /bin directory and typing ./convertf.  Mine did not (error message: ./convertf: /lib64/libc.so.6: version `GLIBC_2.7' not found (required by ./convertf)), which means the precompiled binary expects a newer GLIBC than my system has.  My quick fix was to try recompiling from the included C code, but going to the src directory and typing make convertf did not do the trick.  Skimming through the README file, I saw that you can rebuild the package from scratch by again going to the src directory and typing make clobber and then make convertf (the recommended make install produced errors on my system).  Typing ./convertf finally returned the "parameter p compulsory" output, indicating the program was now working.  (Note: make clobber clears out the previously compiled files, so you will likely need to rebuild the other programs you use, e.g., make mergeit, make eigenstrat, etc.)  I am not a UNIX guru, so if you know of a better way to get convertf running, please let me know in the comments.

Now that we have a running version of convertf, we can make a parameter file (i.e., the par.PED.EIGENSTRAT file) that tells it how to convert from one file format to another.  The README file included in the CONVERTF directory is instrumental for doing this.  Since .ped and .map files are ubiquitous, I will use them as the example input to convert to EIGENSTRAT format.  Below is a parameter file to do so, but again, converting to/from other formats is possible as well.
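A convertf parameter file is plain text with one "key: value" pair per line.  Here is a minimal Python sketch that builds one for a PED to EIGENSTRAT conversion.  The key names follow the sample par.PED.EIGENSTRAT in the CONVERTF directory as I understand it, while the example.* file names and the make_convertf_par helper are hypothetical placeholders, so check everything against the README in your own copy.

```python
def make_convertf_par(prefix, out_prefix):
    """Return the text of a convertf parameter file (PED -> EIGENSTRAT).

    prefix and out_prefix are placeholder file stems, not names that
    convertf requires.
    """
    pairs = [
        ("genotypename", prefix + ".ped"),     # input genotypes
        ("snpname", prefix + ".map"),          # input SNP list
        ("indivname", prefix + ".ped"),        # individuals are read from the .ped
        ("outputformat", "EIGENSTRAT"),
        ("genotypeoutname", out_prefix + ".eigenstratgeno"),
        ("snpoutname", out_prefix + ".snp"),
        ("indivoutname", out_prefix + ".ind"),
        ("familynames", "NO"),                 # do not prepend family IDs to sample IDs
    ]
    return "".join("{}: {}\n".format(key, value) for key, value in pairs)


# Print the file contents; redirect to par.PED.EIGENSTRAT to save them.
print(make_convertf_par("example", "example"), end="")
```

You could just as easily type these eight lines into a text editor; the function is only handy if you have many datasets to convert.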


Finally, save this file as par.PED.EIGENSTRAT and run it by going into the directory with your working convertf program (mine was in the /EIG5.0/src directory) and entering the command ./convertf -p dir/par.PED.EIGENSTRAT, where dir is the directory your par.PED.EIGENSTRAT file is saved in.  You can give the parameter file any name you like, but this one matches the naming used in the samples that come with EIGENSTRAT.  Hope this is helpful, and feel free to comment with questions.
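Once convertf finishes, a quick sanity check on the output can catch a bad conversion early.  In EIGENSTRAT format the genotype file has one line per SNP and one character per individual (0, 1, or 2 copies of the reference allele, or 9 for missing), so its dimensions should match the line counts of the .snp and .ind files.  The sketch below checks exactly that; the function name is hypothetical and it is not part of the EIGENSTRAT package.

```python
def check_eigenstrat(geno_lines, n_snps, n_inds):
    """Return a list of problems found in EIGENSTRAT genotype lines.

    geno_lines: iterable of lines from the genotype file
    n_snps:     number of lines in the .snp file
    n_inds:     number of lines in the .ind file
    """
    problems = []
    rows = [line.rstrip("\n") for line in geno_lines]
    if len(rows) != n_snps:
        problems.append("expected {} SNP rows, found {}".format(n_snps, len(rows)))
    for number, row in enumerate(rows, start=1):
        if len(row) != n_inds:
            problems.append("row {}: expected {} genotypes, found {}".format(
                number, n_inds, len(row)))
        elif set(row) - set("0129"):
            # Only 0, 1, 2, and 9 are valid genotype codes.
            problems.append("row {}: unexpected genotype code".format(number))
    return problems
```

To run it on real files, pass an open file handle for the genotype file along with the line counts of the .snp and .ind files.  An empty list back means the dimensions and genotype codes look fine.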

How to Infer Ancestry from SNP Genotypes

Self-reported ancestry is a poor metric to use when attempting to statistically adjust for the effects of ancestry, since some individuals misreport their ancestry or are simply unaware of their true ancestry.  Worse yet, sometimes you don't even have information collected on an individual's ancestry.  As many of you know, if you have SNP genotyping data you can rather precisely estimate the ancestry of an individual.  Classically, to do this you needed to combine genotypes from your study sample with genotypes from a reference panel (e.g., HapMap or 1000 Genomes), find the intersection of SNPs in the two datasets, and then run a clustering program to see which samples clustered with the reference ancestral populations.  Not a ton of work, but a minor annoyance nonetheless.  Luckily, a relatively new program was just released that, in essence, does a lot of this groundwork for you.  It is called SNPWEIGHTS and can be downloaded here.  Essentially, the program takes SNP genotypes as input, finds the intersection of the sample genotypes with reference genotypes, weights them based on pre-configured parameters to construct the first couple of principal components (a.k.a. eigenvectors), and then calculates an individual's percentage ancestry for each ancestral population.  Here is how to run the program.
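Before the walkthrough, a quick aside on the classical approach mentioned above: the SNP intersection step is simple to sketch.  Assuming both datasets have EIGENSTRAT-style .snp files with the SNP ID in the first column, the shared SNPs are just the IDs that appear in both.  The function names are hypothetical, and a real merge would also need to reconcile strand and allele coding, which this sketch ignores.

```python
def read_snp_ids(lines):
    """Collect SNP IDs from the first whitespace-delimited column."""
    ids = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            ids.append(fields[0])
    return ids


def shared_snps(study_lines, reference_lines):
    """Return SNP IDs present in both datasets, in study order."""
    reference_ids = set(read_snp_ids(reference_lines))
    return [snp for snp in read_snp_ids(study_lines) if snp in reference_ids]
```

SNPWEIGHTS handles this matching for you, so the sketch is only meant to show what is being saved.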

First, make sure you have Python installed on your system and that your genotyping data is in EIGENSTRAT format.  A brief tutorial to convert to EIGENSTRAT format using the convertf tool is here.

Next, download the SNPWEIGHTS package here and a reference panel. I usually use the European, West African and East Asian ancestral populations, but there are other options on the SNPWEIGHTS webpage as well.

Then, create a parameter file that lists the paths to your input files, the SNP weights file for the reference panel you chose (designated "AA", "CO", or "EA"), and an output file.  An example of one is below:
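Like the convertf file, the SNPWEIGHTS parameter file is a short list of "key: value" lines.  Here is a minimal Python sketch that builds one.  The key names (geno, snp, ind, snpwt, predpcoutput) are my recollection of the README in the SNPWEIGHTS zip folder and should be treated as an assumption to verify there; the file names and the make_snpweights_par helper are hypothetical.

```python
def make_snpweights_par(geno, snp, ind, snpwt, output):
    """Return the text of a SNPWEIGHTS parameter file.

    Key names are assumed from the SNPWEIGHTS README; confirm them
    against the copy that ships with your download.
    """
    pairs = [
        ("geno", geno),            # EIGENSTRAT genotype file
        ("snp", snp),              # EIGENSTRAT SNP file
        ("ind", ind),              # EIGENSTRAT individual file
        ("snpwt", snpwt),          # SNP weights file for the reference panel
        ("predpcoutput", output),  # where the results are written
    ]
    return "".join("{}: {}\n".format(key, value) for key, value in pairs)


# Print the file contents; redirect to par.SNPWEIGHTS to save them.
print(make_snpweights_par("example.eigenstratgeno", "example.snp",
                          "example.ind", "snpwt.co", "example.predpc"), end="")
```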


Finally, run the program using the command inferancestry.py --par par.SNPWEIGHTS.  For the program to run correctly, make sure the inferancestry.info and snpwt.co files are in the same directory as your inferancestry.py file.

For more details, see the SNPWEIGHTS paper or the README file included in the SNPWEIGHTS zip folder.  For code on generating eigenvector plots with overlaid ancestry percentages and triangle plots according to percent ancestry, see this post.