Genome Toolbox: next-generation

Friday, June 13, 2014

Find Total, Mapped, and Unmapped Alignments in a BAM File

Often one of the first descriptive statistics of interest for a .bam file is the total number of alignments it contains. An alignment is a record describing where a read from a next-generation sequencing run maps to the reference genome; reads that could not be placed are stored as unmapped records. There are a few ways of calculating the number of mapped, unmapped, and total alignments, but in my opinion samtools provides the most powerful and efficient means of doing this. A few simple samtools commands are enough to count total alignments and total reads in a BAM file.
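As a rough sketch of the counting described above, the Python snippet below wraps `samtools view -c`, which counts records instead of printing them. The `-F 4` and `-f 4` filters exclude or require the "read unmapped" flag. The function names, the `primary` entry (which skips secondary and supplementary records as an approximation of a read count), and the file name `sample.bam` are my own illustrative assumptions, and samtools is assumed to be on the PATH.

```python
import subprocess


def count_commands(bam_path):
    """Build samtools command lines that count records in a BAM file.

    Hypothetical helper for illustration; only the samtools flags are real.
    """
    return {
        # Every alignment record in the file.
        "total": ["samtools", "view", "-c", bam_path],
        # Exclude records with flag 0x4 (read unmapped).
        "mapped": ["samtools", "view", "-c", "-F", "4", bam_path],
        # Keep only records with flag 0x4 (read unmapped).
        "unmapped": ["samtools", "view", "-c", "-f", "4", bam_path],
        # Exclude secondary (0x100) and supplementary (0x800) records,
        # so each sequenced read is counted once. 0x900 == 2304.
        "primary": ["samtools", "view", "-c", "-F", "2304", bam_path],
    }


def count_alignments(bam_path):
    """Run each command and return the counts as integers."""
    counts = {}
    for name, cmd in count_commands(bam_path).items():
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        counts[name] = int(result.stdout.strip())
    return counts
```

Calling `count_alignments("sample.bam")` would return a dictionary with the four counts. The mapped and unmapped counts should sum to the total, because every record either has the unmapped flag set or does not.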

Monday, June 24, 2013

Only Download Regional BAM File from Sequence Read Archive

The Sequence Read Archive (SRA) stores raw next-generation sequencing data from a wide variety of sequencing platforms and projects. It is the NIH's primary archive for high-throughput sequencing data and contains well over 741,690,391,503,250 bases of publicly available sequence reads (that's a really big number). Data are stored in a single archive format, the SRA format, and can be downloaded for public use. A toolkit is provided that converts .sra files to several popular data formats (see post for installation). SRA recommends downloading .sra files using Aspera.

I did not have enough disk space to download the entire .sra files for the samples I wanted regional .bam files from, so I found two ways of downloading much smaller regional .bam files instead.

Web-Based Interface
A somewhat cumbersome but practical solution is to use one of SRA's web interfaces. SRA offers a Run Browser that can be used to visualize data in the .sra files. After you search for the accession number and choose the reference chromosome and range, the Run Browser offers the option of creating a .bam (as well as .fasta, .fastq, .sam, and .pileup) file for that region. Just click to output the format to file and the download will begin in your web browser.

SRA Toolkit
This is a very useful set of programs for accessing data in .sra files, and it can access data remotely through the NCBI site (without having to download .sra files locally!). It is easy to install and provides clear documentation. An essential included module for extracting sequence read data is sam-dump. A genomic region can be selected with sam-dump and its SAM output converted to .bam format using Samtools.
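The sam-dump and Samtools steps described above can be sketched in Python as a two-stage pipe. This is only an illustration under stated assumptions: the accession `SRR0000000`, the region `chr20:1000000-2000000`, the output name `region.bam`, and both function names are hypothetical placeholders. The reference name in the region has to match the names used in the run, and both sam-dump and samtools are assumed to be on the PATH.

```python
import subprocess


def region_commands(accession, region, out_bam):
    """Build the two command lines: sam-dump for a region, samtools to write BAM."""
    # --aligned-region restricts output to alignments in name[:from-to];
    # --header asks sam-dump to emit a SAM header, which samtools needs.
    dump_cmd = ["sam-dump", "--header", "--aligned-region", region, accession]
    # Read SAM from stdin ("-") and write compressed BAM to out_bam.
    bam_cmd = ["samtools", "view", "-b", "-o", out_bam, "-"]
    return dump_cmd, bam_cmd


def fetch_region(accession, region, out_bam):
    """Stream a region from SRA into a local BAM file without saving the .sra."""
    dump_cmd, bam_cmd = region_commands(accession, region, out_bam)
    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    subprocess.run(bam_cmd, stdin=dump.stdout, check=True)
    dump.stdout.close()
    if dump.wait() != 0:
        raise subprocess.CalledProcessError(dump.returncode, dump_cmd)
```

A call such as `fetch_region("SRR0000000", "chr20:1000000-2000000", "region.bam")` would then write only the requested region to disk. Older samtools releases may need `-S` added to read SAM input.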

I used these approaches to get .bam files I could use to investigate depth of coverage by Complete Genomics in a few regions of interest. These methods allowed me to get complete (rather than evidenceOnly or evidenceSupport) .bam files for a few genomic regions for a subset of individuals that had been whole-genome sequenced by Complete Genomics in the 1000 Genomes Project. To get the accession numbers needed for Run Browser and SRA Toolkit, extract the SRR number from the index file. Hope this is helpful for those wanting to use SRA without having loads of space to store the .sra files locally.
