Lacking the disk space to download the entire .sra files from which I wanted regional .bam files, I found two ways of downloading much smaller regional .bam files.
Web Based Interface
A somewhat cumbersome but practical solution is one of SRA's web interfaces. SRA offers a Run Browser, which can be used to visualize data in the .sra files. After you search for the accession number and choose the reference chromosome and range, the Run Browser offers the option of creating a .bam (as well as .fasta, .fastq, .sam, or .pileup) file for that region. Just click to output the format to file and the download will begin in your web browser.
SRA Toolkit
This is a very useful set of programs for accessing data in .sra files, and it can access data remotely (without having to download .sra files locally!) through the NCBI site. It is easy to install and provides clear documentation. An essential included module for extracting sequence read data is sam-dump. Its --aligned-region option selects a genomic region, and the resulting SAM output can be piped to Samtools for conversion to .bam format. Here is an example script:
./sam-dump --aligned-region chr1:1-1000000 --header SRR819317 | samtools view -Sb -o SRR819317_chr1.bam -
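The same pipeline can be repeated for several accessions or regions. What follows is a minimal sketch rather than a tested script: the helper names region_bam_name and region_bam_command are made up for illustration, the second region is an arbitrary placeholder, and it assumes sam-dump and samtools are both on your PATH. It only prints the commands, so they can be reviewed before anything is downloaded.

```shell
#!/usr/bin/env bash
# Sketch: print one sam-dump | samtools pipeline per accession/region pair.
set -euo pipefail

# Hypothetical helper: derive an output name such as SRR819317_chr1_1_1000000.bam
region_bam_name() {
    local accession="$1" region="$2"
    # Replace ':' and '-' in the region with '_' to get a filesystem-friendly name
    printf '%s_%s.bam\n' "$accession" "${region//[-:]/_}"
}

# Hypothetical helper: print (not run) the pipeline for one accession and region
region_bam_command() {
    local accession="$1" region="$2"
    printf 'sam-dump --aligned-region %s --header %s | samtools view -Sb -o %s -\n' \
        "$region" "$accession" "$(region_bam_name "$accession" "$region")"
}

accessions=(SRR819317)                          # accession from the example above
regions=(chr1:1-1000000 chr1:2000000-3000000)   # second region is a placeholder

for accession in "${accessions[@]}"; do
    for region in "${regions[@]}"; do
        region_bam_command "$accession" "$region"
    done
done
```

If the printed commands look right, they can be run by piping the script's output to bash. Including the coordinates in the file name keeps several regions from the same run from overwriting one another.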
I used these approaches to get .bam files I could use to investigate depth of coverage by Complete Genomics in a few regions of interest. These methods allowed me to get complete (rather than evidenceOnly or evidenceSupport) .bam files for a few genomic regions for a subset of individuals that had been whole-genome sequenced by Complete Genomics in the 1000 Genomes Project. To get the accession numbers needed for the Run Browser and SRA Toolkit, extract the SRR numbers from the index file. I hope this is helpful for those wanting to use SRA without having loads of space to store the .sra files locally.
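Pulling the SRR numbers out of an index file can be sketched as below. This assumes only that the index file is plain text with the accessions appearing somewhere on each line; the function name extract_srr and the file name in the usage comment are hypothetical, and the column layout of the real index file is not relied on.

```shell
#!/usr/bin/env bash
# Sketch: list the unique SRR accessions mentioned anywhere in a text file.

# Hypothetical helper: print each distinct SRR accession found in the file "$1"
extract_srr() {
    # -o prints only the matching part, -E enables extended regular expressions
    grep -oE 'SRR[0-9]+' "$1" | sort -u
}

# Usage (file name is a placeholder):
#   extract_srr my_index_file.txt
```

Each accession printed this way can be entered in the Run Browser or passed to sam-dump.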
Is this still working for you? I'm also trying to extract regions from Complete Genomics files and when I run your example, nothing happens.
Hi Aaron,
The code returns the line:
[samopen] SAM header is present: 25 sequences.
Then it hangs forever. Not sure what is going on. It makes a SRR819317_chr1.bam file, but it is a truncated file. Let me know if you get it running.