Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Showing posts with label Samtools. Show all posts
Showing posts with label Samtools. Show all posts

Wednesday, July 23, 2014

Sort BAM File in Samtools

Samtools does a host of useful operations for .bam files.  One such operation is sorting.  Below is a simple example script to show how to use samtools to sort an unsorted BAM file.


This script will sort the unsorted.bam using 8 threads that allocates 12G of memory per thread.  The resulting sorted .bam file will be called sorted.bam.

To confirm a .bam file is sorted, check the header (samtools view -H sorted.bam) for the line:
SO:coordinate.

Additionally, here is the usage information for Samtools sort:

Usage:   samtools sort [options]

Options: -n        sort by read name
         -f        use as full file name instead of prefix
         -o        final output to stdout
         -l INT    compression level, from 0 to 9 [-1]
         -@ INT    number of sorting and compression threads [1]
         -m INT    max memory per thread; suffix K/M/G recognized [768M]

Friday, June 13, 2014

Find Total, Mapped, and Unmapped Alignments in a BAM File

Often one of the first descriptive statistics of interest for a .bam file is the total number of alignments included in the BAM file.  An alignment is where a read from a next-generation sequencing approach maps to the reference genome.  There are a few ways of calculating the total number of mapped, unmapped, and overall number of alignments, but in my opinion samtools provides the most powerful and efficient means of doing this.  Here are some simple example scripts to count total alignments and total reads in a bam file.

Friday, May 16, 2014

Create Index for BAM File

Sometimes databases or contributors share .BAM files, but fail to provide the associated BAM index file (the .BAI file). The BAM index file, usually named filename.bam.bai, is needed to visualize the reads in IGV as well as several other applications. Samtools can easily generate an index file for your sequence bam file. The below code is an example of how to do so for a bam file called filename.bam.

Thursday, February 20, 2014

Zero Fill Coverage Gaps in Samtools Depth Output

Merging depth output from multiple .bam files can be difficult since Samtools only outputs depth counts for coordinates with non-zero coverage.  If you want to merge depth output from these .bam files you first need to fill in the base pair positions of no coverage with zero values so the depth output for all .bam files is the same length.  Then using a simple UNIX cat you can merge multiple .bam file depth output into one file for comparison and analysis.

Here is a simple Python script to zero fill Samtools depth output:



And below is an example of how to use it:

Wednesday, June 26, 2013

Remove a List of Reads from a BAM File

Sometimes it is necessary to remove a subset of reads from a .bam file.  In my case, I wanted to remove a few chimeric reads where it appeared reads from different amplicons were fusing together before entering the sequencer.  Here is a line of code where I use Samtools and grep to remove a list of read ID's from the original .bam file and create a new filtered .bam file.  Hope it is useful for other applications as well.


Note the trailing hyphen at the end.

Thursday, June 20, 2013

Clean BAM Files Generated by CGATools

Tired of seeing the error "samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed."  in those .bam files you generated using CGATools.  Here's a way to remove those pesky lines that Samtools does not like.  Basically, we are going to remove any line where the cigar string fulfills any of the following:

starts with \d+N\dD
starts with \d+P
starts with \d+I
ends with \d+P

where \d+ is the Perl regular expression meaning any integer containing an unspecified number of digits.  Here is a pipeline that uses Samtools and AWK to do this for us.


Enjoy nice, clean (and hopefully problem free) .bam files.

Does BAM File Use Hg18 or HG19 Coordinates?

How do you tell which coordinates are used in a .bam file?  Well, its pretty easy.  Just pull the .bam file header up using Samtools


Then check out the rows that begin with @SQ followed by SN:chr## and LN:########.  Compare the lengths of a few of the chromosomes to the below list of lengths.  Whichever list the lengths match will indicate which coordinates are being used in the .bam file.


Hope this is helpful in determining which coordinates are used in your .bam files.

Monday, June 17, 2013

Index FASTA File

Some programs require indexed FASTA files (*.fa.fai files).  Indexing a FASTA file is easy to do with Samtools.  Just run the below script on your FASTA file.

Friday, May 17, 2013

Convert Complete Genomics Data to BAM File

Complete Genomics provides cgatools as a program suite to help manage the data they produce for investigators.  Although still in its beta form, cgatools does provide some functionality for converting evidence dnbs files (.tsv.bz2) to .bam format.  Below is a script that does so, the only caveat being that I can't seem to get the samtools depth command to work with the generated .bam files.


Here are links to documentation and a handy command reference for cgatools.

Friday, May 10, 2013

Install Samtools on UNIX System

Samtools is a very useful tool for manipulating and visualizing .bam files.  Here is a quick guide on how to install Samtools.  First download the most current version from the Samtools website.  Unzip the file using either tar xvjf or the extract command.  Go into the newly created directory and compile the code by typing "make".  Your code should look something like this.

The next step is then to modify your .bashrc file so that when you type "samtools" it calls the program.  To do this open your .bashrc file and add the following line of code, where directory is equal to the directory that you have installed Samtools in.

Make sure as a final step to source the .bashrc file by entering source .bashrc into the command line.

Thursday, May 9, 2013

Change .bam File Header

Each .bam file has an important header that describes a number of characteristics about the read sequences it contains.  The header is usually multiple lines and has information no chromosomes and samples included in the .bam file.   Samtools can be used to view the header of a .bam file with the following command.

If the need arises, Samtools can also be used to modify the header of a .bam file.  Samtools uses the reheader command to do this.  Below is an example where I changed the length of chrM in the header from 16569 to 16571 base pairs.

Create Reference FASTA File

A lot of sequencing programs and analyses require a reference genome FASTA file to run.  Since I didn't have a reference genome in my possession, I looked into making one of my own.  Here is the code I used to create an indexed reference sequence file from the UCSC ftp site that would be compatible with GATK.  When concatenating the chromosomes together, make sure they are in the same order and the same length as the .bam file you want to use it with.



After running GATK for the first time, a hg.19.dict file is also created.

Sunday, May 5, 2013

Samtools Coverage Depth for Multiple Regions of Interest

Here's some code to use Samtools to extract the sequencing coverage depth from a BAM file for multiple regions of interest as specified in a BED file.  The resulting filename_depth.txt gives coverage data for each base in the region that is in the BAM file.

Samtools Download BAM Region Only

Often times publicly available sequencing data can serve as a useful reference for a sequencing project. The 1000 Genomes project is a great source, especially with their newly released high coverage Complete Genomics data.  Here is an example UNIX script that shows how BAM files with genomic regions of interest can be created from a whole-genome BAM file that is hosted on an FTP server, without having to download the entire BAM file first.  The bai_file_list.txt file is a file that contains unique identifiers for each BAM file extracted from a previously selected list of BAI files of interest.  Here I am just extracting the BRCA1 and BRCA2 regions of the genome.  The extracted reads are sorted and then saved as a BAM file and an associated BAI index file is also created.  The final step removes excess BAI files that are downloaded and used by Samtools to extract the region of interest from the BAM files on the FTP server.