Genome Toolbox: sequencing

Showing posts with label sequencing. Show all posts

Friday, June 13, 2014

Find Total, Mapped, and Unmapped Alignments in a BAM File

Often one of the first descriptive statistics of interest for a .bam file is the total number of alignments included in the BAM file. An alignment is where a read from a next-generation sequencing approach maps to the reference genome. There are a few ways of calculating the total number of mapped, unmapped, and overall number of alignments, but in my opinion samtools provides the most powerful and efficient means of doing this. Here are some simple example scripts to count total alignments and total reads in a bam file.

Wednesday, June 26, 2013

Remove a List of Reads from a BAM File

Sometimes it is necessary to remove a subset of reads from a .bam file. In my case, I wanted to remove a few chimeric reads where it appeared reads from different amplicons were fusing together before entering the sequencer. Here is a line of code where I use Samtools and grep to remove a list of read ID's from the original .bam file and create a new filtered .bam file. Hope it is useful for other applications as well.

Note the trailing hyphen at the end.

Tuesday, June 25, 2013

Install SRA Toolkit on UNIX OS

SRA Toolkit is the set of programatic tools Sequence Read Archive (SRA) provides for accessing information in their .sra files. The toolkit can be downloaded for a variety of platforms here, but in this post I will focus on installing the toolkit on a UNIX platform.

The first step is to download to the directory you want the program saved in and extract the program:

Next, you need to configure the toolkit by going into the bin directory and opening the java configuration script (make sure you have X11 forwarding enabled):

For the most part you can accept defaults when you walk through the setup, just make note of the directory you specified during configuration. You should be all setup! Check out other blog entries for SRA here.

Monday, June 24, 2013

Only Download Regional BAM File from Sequence Read Archive

Sequence Read Archive (SRA) stores raw next-generation sequencing data from a wide variety of sequencing platforms and projects. It is the NIH's primary archive for high-throughput sequencing data containing well over 741,690,391,503,250 bases of publicly available sequence reads (that's a really big number). Data is stored in one basic archived format, the SRA format, and can be downloaded for public use. A toolkit is provided that supports conversion from .sra files to several popular data formats (see post for installation). SRA recommends downloading .sra files using Aspera.

Not having sufficient space resources to download the entire .src files I would like regional .bam files for, I found two ways of downloading much smaller regional .bam files.

Web Based Interface
A somewhat cumbersome but practical solution is through one of SRA's web interfaces. SRA offers a Run Browser which can be used to visualize data in the .src files. After searching for the accession number and then choosing the reference chromosome and range, the run browser has the option of creating a .bam (as well as .fasta, .fastq, .sam, and .pileup) file for that region. Just click to output the format to file and the download will begin in your web browser.

SRA Toolkit
This is a very useful set of programs to access data in .sra files and can be used to access data remotely (without having to download .sra files locally!) through the NCBI site. It is easy to install and provides clear documentation. An essential included module for extracting sequence read data is sam-dump. A genomic region can be selected and then output to .bam format using Samtools. Here is an example script:

I used these approaches to get .bam files I could use to investigate depth of coverage by Complete Genomics in a few regions of interest. These methods allowed me to get complete (rather than evidenceOnly or evidenceSupport) .bam files for a few genomic regions for a subset of individuals that had been whole-genome sequenced by Complete Genomics in the 1000 Genomes Project. To get their accession numbers needed for Run Browser and SRA Toolkit, extract the SRR number out of the index file. Hope this is helpful for those wanting to use SRA, but not having loads of space to store the .sra files locally.

Thursday, June 20, 2013

Does BAM File Use Hg18 or HG19 Coordinates?

How do you tell which coordinates are used in a .bam file? Well, its pretty easy. Just pull the .bam file header up using Samtools

Then check out the rows that begin with @SQ followed by SN:chr## and LN:########. Compare the lengths of a few of the chromosomes to the below list of lengths. Whichever list the lengths match will indicate which coordinates are being used in the .bam file.

Hope this is helpful in determining which coordinates are used in your .bam files.

Wednesday, May 29, 2013

Find Palindromic Sub-sequences in DNA Nucleotide Sequence

A quick and easy online tool I found to search a DNA sequence for palindromic sub-sequences is the DNA Palindrome Finder. Just paste your nucleotide sequence into the text box and press Send. Results will show up on the next page with palindromic sequences displayed as capital letters. Its pretty basic, but does what it needs to do.

Thursday, May 23, 2013

Download One Thousand Genome Data for Haploview

Haploview has a built-in portal to download HapMap data, but Haploview development hasn't kept pace with developing a way to automatically download 1000G SNP data. Searching for a way to visualize the higher density SNP coverage of the 1000G project, I found it was not all too difficult to do this manually. It involves a couple of extra steps.

First, determine the genomic coordinates of region you are interested in. This needs to be in hg19 coordinates. If you have hg18 coordinates, liftOver is a useful tool to convert coordinates from one human genome build to another (liftOver format is chr:start-end, for example: chr8:1000-50000).

Next, go to this 1000G website, and plug in your genomic coordinates of interest. Here the coordinates should not include chr in the chromosome name (for example: 8:1000-50000). Then on the next page select ancestral populations you are interested in (you can select multiple populations by holding down Ctrl). Give the website a few seconds to generate the files. Eventually a link to a marker information file (.info) and linkage pedigree file (.ped) will appear. Right click on each of these files and save them to your computer.

Now, fire up Haploview and select Open new data. Go to the Linkage Format tab and browse for your .ped file in the Data File field and your .info file in the Locus Information File field (the .info file field is usually automatically generated after selecting your .ped file if your .ped and .info files have the same prefix). Haploview will load the files and you should be ready to visualize the LD structure. Enjoy using 1000G data in Haploview!

Monday, May 20, 2013

Merge VCF Files

VCF tools can be used to merge .vcf files using the following commnad.

How to Install Vcftools

Vcftools is a handy program to manipulate .vcf files. This page describes how to install vcftools. Here is a brief summary of what to do.

1) Download the most recent version of vcftools.
2) Extract vcftools using the extract command or the following line of code.

3) To build vcftools, cd into the vcftools directory and type make.
4) Add the following two lines to you .bashrc file and then type source .bashrc.

One final note: vcftools requires .vcf files be zipped by bgzip and indexed by Tabix. To install Tabix, see these installation instructions. Enjoy using vcftools!

Friday, May 17, 2013

Convert Complete Genomics Data to BAM File

Complete Genomics provides cgatools as a program suite to help manage the data they produce for investigators. Although still in its beta form, cgatools does provide some functionality for converting evidence dnbs files (.tsv.bz2) to .bam format. Below is a script that does so, the only caveat being that I can't seem to get the samtools depth command to work with the generated .bam files.

Here are links to documentation and a handy command reference for cgatools.

Download Region of Indexed VCF File

Tabix is a handy tool to download a region of interest from a .vcf file on an ftp site. Below is an example of how to extract a region on chromosome 1 from a Complete Genomics .vcf file hosted on the 1000 Genomes ftp site. The .vcf.gz file must have a matching .vcf.gz.tbi index file for tabix to work and the chromosome region of interest should not include "chr" with the chromosome number.

Thursday, May 16, 2013

Download Complete Genomics Reference Files

A reference genome is needed when using cgatools on Complete Genomics data. Here are links to the ftp sites that contain the reference compact randomly accessible reference (.crr) files. Just use the wget command from a UNIX cluster to download.

NCBI Build 36:
ftp://ftp.completegenomics.com/ReferenceFiles/build36.crr

NCBI Build 37:
ftp://ftp.completegenomics.com/ReferenceFiles/build37.crr

The next step is to verify that the file downloaded completely. Run one of the following commands at the command prompt depending on the version of the .crr file you downloaded.

The file output should look like this for build 36.

And this for build 37.

Thursday, May 9, 2013

Change .bam File Header

Each .bam file has an important header that describes a number of characteristics about the read sequences it contains. The header is usually multiple lines and has information no chromosomes and samples included in the .bam file. Samtools can be used to view the header of a .bam file with the following command.

If the need arises, Samtools can also be used to modify the header of a .bam file. Samtools uses the reheader command to do this. Below is an example where I changed the length of chrM in the header from 16569 to 16571 base pairs.

Merge Multiple .bam Files

I had multiple .bam files from different subjects I wanted to merge into one master .bam file. It seemed like Samtools would easily convert these files using the merge option, but after reading a few online posts there looks like there are some issues with creating an appropriate header for the new merged .bam file from all the original .bam files. Picard seemed to be the optimal method to do this. Here's some code that uses Picard to merge multiple .bam files.

Create Reference FASTA File

A lot of sequencing programs and analyses require a reference genome FASTA file to run. Since I didn't have a reference genome in my possession, I looked into making one of my own. Here is the code I used to create an indexed reference sequence file from the UCSC ftp site that would be compatible with GATK. When concatenating the chromosomes together, make sure they are in the same order and the same length as the .bam file you want to use it with.

After running GATK for the first time, a hg.19.dict file is also created.

Create .ped and .map files from .vcf file

Here is a quick and easy script to convert .vcf files into a PLINK compatible .ped and .map file using vcftools. Only bi-allelic loci will be output. The --plink option can be very slow on large datasets in which case it is recommended to use the --chr option to output individual chromosomes or the --plink-tped option to output transposed PLINK files.

Thousand Genomes Complete Genomics Information

Recently I have been using the Complete Genomics high coverage sequencing data from the 1000 Genomes project. Its a bit difficult to find information on the populations, samples, and available sequencing data since they are all stored in different places on their ftp server. I decided to make a post that tried to combine all the useful information into one spot. Here are links to files that may be of interest.

population file: gives a key of abbreviations used for the 1000 Genomes populations

pedigree file: provides relationship information on 1000 Genomes individuals that are related as well as gender and 1000 Genomes population

sample file: detailed spreadsheet that offers information on sample id, accession number, population, family, gender, relationship, sequencing center, and coverage for each 1000G sample.

Sunday, May 5, 2013

Samtools Coverage Depth for Multiple Regions of Interest

Here's some code to use Samtools to extract the sequencing coverage depth from a BAM file for multiple regions of interest as specified in a BED file. The resulting filename_depth.txt gives coverage data for each base in the region that is in the BAM file.

Wednesday, May 1, 2013

1000 Genomes Complete Genomics Data

1000 Genomes now has Complete Genomics high coverage sequencing data available for 57 samples online at the ftp sites: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/data and ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data. The first ftp is best for US access and the second is best for access from Europe. Here's a link to the 1000 Genomes page for more details: http://www.1000genomes.org/announcements/complete-genomics-data-available-2012-12-19.

Pages