A repository of programs, scripts, and tips essential to
genetic epidemiology, statistical genetics, and bioinformatics
Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Friday, January 24, 2014
15K and Growing
Today marks the 15,000 page view milestone for Genome Toolbox since its beginnings on May 1, 2013. It's great to see so much interest, and it's encouragement to keep posting new tips I find useful. Keep coming back for more!
Thursday, November 21, 2013
10K
Today Genome Toolbox surpassed 10,000 page views! A great milestone which has yet to be achieved by any of my other blogs. Thanks to all who have visited the site and for the continued encouragement to keep adding content. Stay tuned for more tools and tips relevant to genomic analysis.
Wednesday, July 24, 2013
Converting Genotype Files to EIGENSTRAT format
Need to convert your genotyping files into EIGENSTRAT format? Here is a quick primer to do so. If you haven't already, download the EIGENSTRAT package and extract the contents. We will be using the convertf tool to convert your genotypes to EIGENSTRAT format. The convertf tool can convert to/from ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED, and PACKEDANCESTRYMAP files.
First, check whether your convertf program works out of the box by going to the /bin directory and typing ./convertf. Mine was not working (error message: ./convertf: /lib64/libc.so.6: version `GLIBC_2.7' not found (required by ./convertf)). My quick fix was to try recompiling from the included C code, but going to the src directory and typing make convertf did not do the trick. Skimming through the README file, I saw that you can remake the entire package by again going to the src directory and typing make clobber and then make convertf (the recommended make install produced errors on my system). Typing ./convertf finally returned the "parameter p compulsory" output, indicating the program is now working. (Note: make clobber will remake the entire package and will likely require you to remake other tools you need, e.g. make mergeit, make eigenstrat, etc.) I am not a UNIX guru, so if you know of a better way to get convertf running, please let me know in the comments. The sequence that ended up working for me is below.
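In short (run from wherever you extracted the EIG5.0 package):

    cd EIG5.0/src
    make clobber       # clears the previously built binaries (you will need to re-make other tools afterwards)
    make convertf      # rebuild convertf; "make install" errored on my system
    ./convertf         # "parameter p compulsory" means it is now working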
Now that we have a running version of convertf, we can make a parameter file (i.e., the par.PED.EIGENSTRAT file) for converting from one file format to another. The README file included in the CONVERTF directory is instrumental for this. Since .ped and .map files are ubiquitous, I will use them as the example input for converting to EIGENSTRAT format. Below is a parameter file that does this, but converting to/from the other formats is possible as well.
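A sketch of what that parameter file can look like (data.ped and data.map are placeholders for your own file names; per the CONVERTF README, the .map file can serve as snpname and the .ped file itself as indivname):

    genotypename:    data.ped
    snpname:         data.map
    indivname:       data.ped
    outputformat:    EIGENSTRAT
    genotypeoutname: data.eigenstratgeno
    snpoutname:      data.snp
    indivoutname:    data.ind
    familynames:     NO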
Finally, save this file as par.PED.EIGENSTRAT and run it by going into the directory with your working convertf program (mine was in the /EIG5.0/src directory) and entering the command ./convertf -p dir/par.PED.EIGENSTRAT, where dir is the directory your par.PED.EIGENSTRAT file is saved in. You can give the parameter file almost any name, but this one matches the examples that come with EIGENSTRAT. Hope this is helpful; feel free to comment with questions.
How to Infer Ancestry from SNP Genotypes
Self-reported ancestry is a poor metric to use when attempting to statistically adjust for the effects of ancestry, since some individuals misreport their ancestry or are simply unaware of it. Worse yet, sometimes you don't even have information collected on an individual's ancestry. As many of you know, if you have SNP genotyping data you can estimate the ancestry of an individual rather precisely. Classically, to do this one needed to combine genotypes from the study sample with genotypes from a reference panel (e.g., HapMap or 1000 Genomes), find the intersection of SNPs in each dataset, and then run a clustering program to see which samples clustered with the reference ancestral populations. Not a ton of work, but a minor annoyance nonetheless. Luckily, a relatively new program was just released that, in essence, does a lot of this groundwork for you. It is called SNPWEIGHTS and can be downloaded here. Essentially, the program takes SNP genotypes as input, finds the intersection of the sample genotypes with reference genotypes, weights them based on pre-configured parameters to construct the first couple of principal components (a.k.a. eigenvectors), and then calculates each individual's percentage ancestry from each ancestral population. Here is how to run the program.
First, make sure you have Python installed on your system and that your genotyping data is in EIGENSTRAT format. A brief tutorial to convert to EIGENSTRAT format using the convertf tool is here.
Next, download the SNPWEIGHTS package here and a reference panel. I usually use the European, West African and East Asian ancestral populations, but there are other options on the SNPWEIGHTS webpage as well.
Then, create a parameter file with the paths to your input files, your input population (designated "AA", "CO", or "EA"), and an output file. An example is below:
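A sketch of such a parameter file, saved as par.SNPWEIGHTS to match the command below. The field names follow my reading of the SNPWEIGHTS README, so verify them against the version you download; the file names are placeholders:

    geno:         study.eigenstratgeno
    snp:          study.snp
    ind:          study.ind
    snpwt:        snpwt.co
    predpcoutput: study.predpc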
Finally, run the program using the command inferancestry.py --par par.SNPWEIGHTS. For the program to run correctly, make sure the inferancestry.info and snpwt.co files are in the same directory as your inferancestry.py file.
For more details, see the SNPWEIGHTS paper or the README file included in the SNPWEIGHTS zip folder. For code on generating eigenvector plots with overlaid ancestry percentages and triangle plots according to percent ancestry, see this post.
Wednesday, July 17, 2013
2K
Today Genome Toolbox hit a cumulative total of 2,000 page views over the life of the blog! It's exciting to see so much interest in genomic analysis tools. Thanks for stopping by, and keep coming back to check out what's new.
How to LD Thin a List of SNPs
Okay, you have a list of SNPs, many of them in high linkage disequilibrium (highly correlated) with other SNPs in the list. Your goal is to generate a new list of more or less independent SNPs. To do this, you need to LD prune the original list: set a threshold, say R2 > 0.50, and remove SNPs that are correlated with other SNPs above this threshold, keeping at least one SNP to tag each region of high LD.
I feel like there should be an online resource that does this based on SNP correlations in a reference population (say HapMap or 1000 Genomes), but I cannot find one. Please leave a comment below if you are aware of one. So instead, you need an actual dataset (for example, .ped and .map files) to feed as input to an LD thinning program. Current programs use the LD structure within the input population and return a list of LD-thinned SNPs.
PLINK is the first place I usually go to do this. The program has two options for LD thinning: based on variance inflation factor (by regressing a SNP on all other SNPs in the window simultaneously) and based on pairwise correlation (R2). These are the --indep and --indep-pairwise options, respectively. Here is code I use:
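For example, with .ped/.map input (the window size, step, and thresholds below are just reasonable starting values to adjust for your own data):

    plink --file data_in --indep-pairwise 50 5 0.5 --out data_out    # pairwise r2 pruning
    plink --file data_in --indep 50 5 2 --out data_out               # VIF-based pruning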
For details on the options click here. Essentially this takes a data_in.ped and a data_in.map file and outputs two files: a data_out.prune.in file (the LD-pruned SNPs to keep) and a data_out.prune.out file (the SNPs to remove). PLINK is easy and relatively quick, taking only a few minutes on a list of SNPs spanning 4 megabases.
Alternatively, GLU also has an LD thinning feature built in. It requires converting your .ped and .map files into a .ldat file (for details click here). It also requires a snplist file with a header containing "Locus" and "score p-value" columns. It is a little more work up front and seems to be optimized for analyzing GWAS association results, but I created such a file with the score p-value set to one for all SNPs. I assume that if real p-values were supplied, the program would prioritize the pruning to keep the SNPs with the lowest p-values. There also appears to be an option to force in certain loci (--includeloci). To run the thinning procedure in GLU, I used the following code:
For more details and options, click here. While GLU takes a bit longer to run and the documentation is rather sparse on how the thinning is performed (it seems to be pairwise R2 within windows), it does provide a nice output file that clearly details which SNPs are kept and why each removed SNP was dropped.
Recently, I became aware of PriorityPruner (thanks Christopher). This is a Java program that runs on any OS (Windows, UNIX, Mac) as long as you have Java installed. It has a lot of the same options and functionality as GLU, but doesn't require the .ldat format and has much better documentation. The program can be downloaded here and is run with a command like this:
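Based on its documentation, the invocation is along these lines; treat the option names and the r2 threshold below as assumptions to verify against the manual rather than a tested command:

    java -jar PriorityPruner.jar --tped data.tped --tfam data.tfam --snp_table snp_table.txt --r2 0.5 --out thinned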
I have yet to test PriorityPruner, but so far I am impressed with what I see.
I hope this post is useful for others looking to thin a set of SNPs. If you are aware of other LD thinning approaches, please describe it in the comments below. Happy pruning!
Tuesday, July 9, 2013
Calculate Minor Allele Frequencies from VCF File Variants
Today I needed to calculate minor allele frequencies (MAFs) for sequence variants called in a .vcf file. I couldn't find any programs that would do this for me, so I wrote a quick script to do it in Python.
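A minimal version of such a script looks something like this (the file name calculate_maf.py and the output columns are arbitrary choices): it counts reference and alternate alleles from each sample's GT field for bi-allelic SNPs and writes one MAF per variant.

    import sys

    vcf_path = sys.argv[1]                              # e.g. project.vcf
    out_path = vcf_path.rsplit(".", 1)[0] + ".txt"      # e.g. project.txt

    with open(vcf_path) as vcf, open(out_path, "w") as out:
        out.write("CHROM\tPOS\tID\tREF\tALT\tMAF\n")
        for line in vcf:
            if line.startswith("#"):
                continue                                # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            chrom, pos, snp_id, ref, alt = fields[0:5]
            if len(ref) != 1 or len(alt) != 1:
                continue                                # SNPs only: skip indels and multi-allelic sites
            ref_count = alt_count = 0
            for sample in fields[9:]:
                gt = sample.split(":")[0]               # genotype (GT) is the first colon-separated subfield
                for allele in gt.replace("|", "/").split("/"):
                    if allele == "0":
                        ref_count += 1
                    elif allele == "1":
                        alt_count += 1                  # missing alleles (".") are ignored
            total = ref_count + alt_count
            if total == 0:
                continue
            maf = min(ref_count, alt_count) / float(total)
            out.write("%s\t%s\t%s\t%s\t%s\t%.4f\n" % (chrom, pos, snp_id, ref, alt, maf))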
This can be run in python from the command prompt by typing:
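    python calculate_maf.py project.vcf    # script name taken from the sketch above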
where project.vcf is the vcf file you want to calculate MAFs for. It will return a project.txt file that contains the calculated MAF values. This script will only work for SNPs and does not work on insertions and deletions.
Alternatively, if Python scares you, there is a somewhat roundabout way to do this too. First use vcftools to convert your .vcf file into PLINK-compatible .ped and .map files.
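For example, using project.vcf from above:

    vcftools --vcf project.vcf --plink --out project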
Then, open Plink and run the --freq option on the newly created .ped file.
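Along these lines (PLINK writes the allele frequencies to project.frq):

    plink --file project --freq --out project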
**UPDATE**
Today I found a way to have vcftools calculate the MAF values for you directly. It just takes the simple --freq option. Here is some example code:
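    vcftools --vcf project.vcf --freq --out project    # per-site allele frequencies end up in project.frq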
Friday, July 5, 2013
Best Notepad++ Alternatives for Mac OS
I ❤ Notepad++. It's a powerful, fully-loaded, and free text editing application that has been an invaluable tool for writing code in a variety of programming languages. The only caveat: it's only available for Windows operating systems. With the acquisition of my shiny, new Macbook Pro, I was incredibly disappointed to find out Notepad++ could not be installed on Macs; so much so that I almost returned the Macbook. Since I couldn't find a better laptop to meet my needs (and aesthetic desires), the quest began to find a comparable and preferably free text editor that runs on a Mac operating system. I was surprised to find the list of candidates quite long. Here are the options I found; unfortunately, not all of them are free:
Komodo Edit
BBedit ($50)
Coda ($75)
Crossover (Windows emulator, $60) + Notepad++
Espresso ($75)
jEdit
TextEdit (the basic text editor pre-loaded on your Mac)
TextMate (€39 or about $53)
TextWrangler (free lite version of BBedit)
Smultron ($5)
SubEthaEdit (€29, or about $43)
Sublime ($70)
Tincta (free, Pro version for $16)
WINE (Windows emulator) + Notepad++
Apparently the market is saturated with Notepad++ "replacement" text editors for Macs. The text editors most recommended online are highlighted in bold. While looking into the options, it became apparent there really is no one best Notepad++ replacement. It all depends on what the user needs Notepad++ for and which features they require (plus a bit of personal preference in user interface). I have tried a few of the above options and am still not completely satisfied. I am secretly hoping the folks responsible for Notepad++ are cooking up a way to install it on Macs. The emulator approach to installing Notepad++ on a Mac also seems interesting; I will have to try it when I have some free time. In the meantime, I am curious what has been working best for you other Notepad++ lovers who have made the switch to a Mac. Also, if you are aware of other text editors not mentioned here, please share!
Sunday, June 30, 2013
1,000 Page Views
Today, Genome Toolbox is on track to reach 1,000 page views! What an awesome milestone. I am thrilled so many people have found the blog and are using it as a resource. Thank you all for your interest and keep the feedback coming.
Tuesday, June 25, 2013
Install SRA Toolkit on UNIX OS
SRA Toolkit is the set of programmatic tools the Sequence Read Archive (SRA) provides for accessing information in its .sra files. The toolkit can be downloaded for a variety of platforms here, but in this post I will focus on installing it on a UNIX platform.
The first step is to download the toolkit to the directory you want it saved in and extract it:
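For example, for the 64-bit Linux build (the file name and URL change between releases, so substitute whatever the download page currently lists):

    wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-ubuntu64.tar.gz
    tar -xzf sratoolkit.current-ubuntu64.tar.gz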
Next, you need to configure the toolkit by going into the bin directory and opening the java configuration script (make sure you have X11 forwarding enabled):
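The version I downloaded shipped a Java configuration script; newer toolkit releases have replaced it with the text-based vdb-config tool, so depending on your version this step looks something like (directory name is a placeholder):

    cd sratoolkit.2.x.x-ubuntu64/bin
    ./vdb-config -i       # interactive configuration in current releases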
For the most part you can accept the defaults as you walk through the setup; just make note of the directory you specify during configuration. You should be all set up! Check out other blog entries for SRA here.
Monday, June 24, 2013
Only Download Regional BAM File from Sequence Read Archive
Sequence Read Archive (SRA) stores raw next-generation sequencing data from a wide variety of sequencing platforms and projects. It is the NIH's primary archive for high-throughput sequencing data containing well over 741,690,391,503,250 bases of publicly available sequence reads (that's a really big number). Data is stored in one basic archived format, the SRA format, and can be downloaded for public use. A toolkit is provided that supports conversion from .sra files to several popular data formats (see post for installation). SRA recommends downloading .sra files using Aspera.
Since I did not have the disk space to download the entire .sra files I wanted regional .bam files from, I found two ways of downloading much smaller regional .bam files.
Web Based Interface
A somewhat cumbersome but practical solution is through one of SRA's web interfaces. SRA offers a Run Browser which can be used to visualize data in the .sra files. After searching for the accession number and then choosing the reference chromosome and range, the Run Browser has the option of creating a .bam (as well as .fasta, .fastq, .sam, or .pileup) file for that region. Just click to output the format to file and the download will begin in your web browser.
SRA Toolkit
This is a very useful set of programs to access data in .sra files and can be used to access data remotely (without having to download .sra files locally!) through the NCBI site. It is easy to install and provides clear documentation. An essential included module for extracting sequence read data is sam-dump. A genomic region can be selected and then output to .bam format using Samtools. Here is an example script:
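For example, to pull the reads aligned to one region of chromosome 8 from a run and write them as a BAM file (the accession, coordinates, and output name are placeholders; the reference name must match how the run was submitted, e.g. 8 versus chr8):

    sam-dump --aligned-region 8:1000-50000 SRR000001 | samtools view -bS - > SRR000001_region.bam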
I used these approaches to get .bam files I could use to investigate depth of coverage by Complete Genomics in a few regions of interest. These methods allowed me to get complete (rather than evidenceOnly or evidenceSupport) .bam files for a few genomic regions for a subset of individuals that had been whole-genome sequenced by Complete Genomics in the 1000 Genomes Project. To get their accession numbers needed for Run Browser and SRA Toolkit, extract the SRR number out of the index file. Hope this is helpful for those wanting to use SRA, but not having loads of space to store the .sra files locally.
Thursday, May 23, 2013
Download One Thousand Genome Data for Haploview
Haploview has a built-in portal to download HapMap data, but its development hasn't kept pace, and there is no automatic download for 1000 Genomes (1000G) SNP data. Looking for a way to visualize the higher-density SNP coverage of the 1000G project, I found it is not all too difficult to do this manually; it just involves a couple of extra steps.
First, determine the genomic coordinates of the region you are interested in. These need to be hg19 coordinates. If you have hg18 coordinates, liftOver is a useful tool to convert coordinates from one human genome build to another (liftOver format is chr:start-end, for example chr8:1000-50000).
Next, go to this 1000G website, and plug in your genomic coordinates of interest. Here the coordinates should not include chr in the chromosome name (for example: 8:1000-50000). Then on the next page select ancestral populations you are interested in (you can select multiple populations by holding down Ctrl). Give the website a few seconds to generate the files. Eventually a link to a marker information file (.info) and linkage pedigree file (.ped) will appear. Right click on each of these files and save them to your computer.
Now, fire up Haploview and select Open new data. Go to the Linkage Format tab and browse for your .ped file in the Data File field and your .info file in the Locus Information File field (the .info file field is usually automatically generated after selecting your .ped file if your .ped and .info files have the same prefix). Haploview will load the files and you should be ready to visualize the LD structure. Enjoy using 1000G data in Haploview!
Monday, May 20, 2013
How to Install Vcftools
Vcftools is a handy program to manipulate .vcf files. This page describes how to install vcftools. Here is a brief summary of what to do.
1) Download the most recent version of vcftools.
2) Extract vcftools using the extract command or the following line of code.
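For example (replace 0.1.XX with the version you downloaded):

    tar -xzf vcftools_0.1.XX.tar.gz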
3) To build vcftools, cd into the vcftools directory and type make.
4) Add the following two lines to your .bashrc file and then type source .bashrc.
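The two lines are along these lines, with /path/to/vcftools_0.1.XX replaced by your extracted directory; the first lets the bundled Perl scripts find their modules and the second puts the executables on your PATH:

    export PERL5LIB=/path/to/vcftools_0.1.XX/perl/
    export PATH=$PATH:/path/to/vcftools_0.1.XX/bin/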
One final note: vcftools requires .vcf files be zipped by bgzip and indexed by Tabix. To install Tabix, see these installation instructions. Enjoy using vcftools!
Wednesday, May 15, 2013
Install Haploview on Vista / Windows 7 / Windows 8
I wanted to use Haploview to visualize the haplotype structure around a genomic area of interest. Thinking it would be a quick, simple task I downloaded the Windows installer and installed it on my computer. When I tried to open the program, I kept getting the error: Cannot run program "c:\program": CreateProcess error=2, The system cannot find the file specified. After tinkering with things myself and getting some assistance from the IT staff, here is the workaround we developed to install Haploview on newer Windows operating systems.
1) Download the newest Haploview.jar file and save it to a location where you want to permanently keep it, for example: "C:\Haploview\". This .jar file contains all the Java code needed to run Haploview.
2) For most 64-bit Windows operating systems, the newest version of Java will usually run Haploview. The most current version is available for download here. On 32-bit versions of Windows, Haploview seems to be pickier about which version of Java it will run under. I have had the best luck with Java version 6 update 43 and earlier. These can be downloaded from the Oracle Java archive. Click the radio button to accept the license agreement and select the Windows x86 Offline version for download. The next page will ask you to sign in or sign up for a free account. Just complete the form and the download will begin after you log in.
3) Use Notepad (or Notepad++) to create a Haploview.bat file in the same directory where you placed Haploview.jar: paste the code below into Notepad and save it as Haploview.bat. This is just a quick and easy way to open a command prompt in the background and run the Haploview.jar file in Java.
For 64-bit Windows operating systems use the code:
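A minimal version, assuming Haploview.jar was saved to C:\Haploview\ as in step 1 (the memory flag is optional):

    :: launch Haploview with 512 MB allotted to Java
    java -Xmx512m -jar "C:\Haploview\Haploview.jar"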
For 32-bit Windows operating systems use the code:
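The same thing, but pointing explicitly at the Java 6 executable described in the note below:

    :: use the 32-bit Java 6 install directly, no PATH variable needed
    "C:\Program Files\Java\jre1.6.0_43\bin\java.exe" -Xmx512m -jar "C:\Haploview\Haploview.jar"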
(Note: The code for 32-bit Windows explicitly states the version of Java to run, i.e. "C:\Program Files\Java\jre1.6.0_43\bin\java.exe". This allows us to avoid setting up a path variable in Windows (which I have found buggy and difficult to set up). The "jre1.6.0_43" portion of the path is an example of where your version 6 Java is located. Lower versions of Java 6 will be "jre1.6.0_##", where ## is the update number. If you only have one version 6 of Java on your computer, the folder will be named "jre6".)
4) Right click on the Haploview.bat file you just created and choose create a shortcut. This shortcut is now what you can use to open Haploview. Cut and paste this shortcut into a location that is easy to access. I put it in my Start Menu under All Programs, but pasting it on the Desktop works well too.
Hope this was helpful and saves those interested in using Haploview on newer 32- and 64-bit Windows operating systems a lot of time. While these tips should work on a majority of newer Windows operating systems, it may still take a bit of trial and error to get Haploview up and running. If you are still having difficulty getting Haploview to work with newer versions of Java, try older versions from the Oracle Java archive and follow the above instructions for 32-bit Windows operating systems. Also, check out the comments section below to see what has worked for others. If something has worked for you and it's not posted below, please share!
Friday, May 10, 2013
Install Samtools on UNIX System
Samtools is a very useful tool for manipulating and visualizing .bam files. Here is a quick guide on how to install Samtools. First download the most current version from the Samtools website. Unzip the file using either tar xvjf or the extract command. Go into the newly created directory and compile the code by typing "make". Your code should look something like this.
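The steps look roughly like this (0.1.19 stands in for whichever release you downloaded from the Samtools website):

    tar xvjf samtools-0.1.19.tar.bz2
    cd samtools-0.1.19
    make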
The next step is then to modify your .bashrc file so that when you type "samtools" it calls the program. To do this open your .bashrc file and add the following line of code, where directory is equal to the directory that you have installed Samtools in.
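One line along these lines does it, with directory replaced by the folder you compiled Samtools in; adding the folder to your PATH or defining an alias both work:

    export PATH=$PATH:/directory/samtools-0.1.19
    # or, equivalently: alias samtools='/directory/samtools-0.1.19/samtools'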
Make sure as a final step to source the .bashrc file by entering source .bashrc into the command line.
Thursday, May 9, 2013
Merge Multiple .bam Files
I had multiple .bam files from different subjects that I wanted to merge into one master .bam file. It seemed like Samtools could merge these files with its merge option, but after reading a few online posts it looks like there are some issues with creating an appropriate header for the merged .bam file from all the original .bam files. Picard seemed to be the better way to do this. Here's some code that uses Picard to merge multiple .bam files.
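A sketch with placeholder file names; recent Picard releases bundle everything in a single picard.jar, while older releases ship a separate MergeSamFiles.jar (in that case drop the tool name after -jar and use the same INPUT=/OUTPUT= arguments):

    java -jar picard.jar MergeSamFiles \
        I=subject1.bam \
        I=subject2.bam \
        I=subject3.bam \
        O=merged.bam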
Create .ped and .map files from .vcf file
Here is a quick and easy script to convert .vcf files into a PLINK compatible .ped and .map file using vcftools. Only bi-allelic loci will be output. The --plink option can be very slow on large datasets in which case it is recommended to use the --chr option to output individual chromosomes or the --plink-tped option to output transposed PLINK files.
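For example (input.vcf and the output prefixes are placeholders):

    vcftools --vcf input.vcf --plink --out output
    vcftools --vcf input.vcf --chr 1 --plink --out output_chr1    # one chromosome at a time for large files
    vcftools --vcf input.vcf --plink-tped --out output            # transposed PLINK output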
Sunday, May 5, 2013
Samtools Coverage Depth for Multiple Regions of Interest
Here's some code to use Samtools to extract the sequencing coverage depth from a BAM file for multiple regions of interest as specified in a BED file. The resulting filename_depth.txt gives coverage data for each base in the region that is in the BAM file.
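With placeholder file names, the command is along these lines; the -b option restricts samtools depth to the regions listed in the BED file:

    samtools depth -b regions.bed filename.bam > filename_depth.txt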