Sequence Read Archive (SRA) stores raw next-generation sequencing data from a wide variety of sequencing platforms and projects. It is the NIH's primary archive for high-throughput sequencing data containing well over 741,690,391,503,250 bases of publicly available sequence reads (that's a really big number). Data is stored in one basic archived format, the SRA format, and can be downloaded for public use. A toolkit is provided that supports conversion from .sra files to several popular data formats (see post for installation). SRA recommends downloading .sra files using Aspera.
Not having sufficient space resources to download the entire .src files I would like regional .bam files for, I found two ways of downloading much smaller regional .bam files.
Web Based Interface
A somewhat cumbersome but practical solution is through one of SRA's web interfaces. SRA offers a Run Browser which can be used to visualize data in the .src files. After searching for the accession number and then choosing the reference chromosome and range, the run browser has the option of creating a .bam (as well as .fasta, .fastq, .sam, and .pileup) file for that region. Just click to output the format to file and the download will begin in your web browser.
SRA Toolkit
This is a very useful set of programs to access data in .sra files and can be used to access data remotely (without having to download .sra files locally!) through the NCBI site. It is easy to install and provides clear documentation. An essential included module for extracting sequence read data is sam-dump. A genomic region can be selected and then output to .bam format using Samtools. Here is an example script:
I used these approaches to get .bam files I could use to investigate depth of coverage by Complete Genomics in a few regions of interest. These methods allowed me to get complete (rather than evidenceOnly or evidenceSupport) .bam files for a few genomic regions for a subset of individuals that had been whole-genome sequenced by Complete Genomics in the 1000 Genomes Project. To get their accession numbers needed for Run Browser and SRA Toolkit, extract the SRR number out of the index file. Hope this is helpful for those wanting to use SRA, but not having loads of space to store the .sra files locally.
A repository of programs, scripts, and tips essential to
genetic epidemiology, statistical genetics, and bioinformatics
Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Subscribe to:
Post Comments (Atom)
Is this still working for you? I'm also trying to extract regions from Complete Genomics files and when I run your example, nothing happens.
ReplyDeleteHi Aaron,
DeleteThe code returns the line:
[samopen] SAM header is present: 25 sequences.
Then is hangs forever. Not sure what it going on. It makes a SRR819317_chr1.bam file, but it is a truncated file. Let me know if you get it running.