Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Showing posts with label cgatools. Show all posts

Thursday, June 20, 2013

Clean BAM Files Generated by CGATools

Tired of seeing the error "samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed." when working with those .bam files you generated using CGATools?  Here's a way to remove those pesky lines that Samtools does not like.  Basically, we are going to remove any alignment line whose CIGAR string matches any of the following:

starts with \d+N\dD
starts with \d+P
starts with \d+I
ends with \d+P

where \d+ is the Perl regular expression for one or more digits (that is, an integer of any length).  Here is a pipeline that uses Samtools and AWK to do this for us.
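For anyone who would rather do the filtering step in Python, here is a minimal sketch of the same CIGAR test. It is not the Samtools/AWK pipeline itself. It assumes SAM text on input (tab-separated, CIGAR in the sixth column), and the names `BAD_CIGAR`, `is_clean` and `filter_sam` are made up for this example. The four patterns are copied literally from the list above.

```python
import re

# The four patterns from the list above, taken literally.
# A record is dropped if its CIGAR string matches any of them.
BAD_CIGAR = re.compile(
    r"^\d+N\dD"   # starts with \d+N\dD
    r"|^\d+P"     # starts with \d+P
    r"|^\d+I"     # starts with \d+I
    r"|\d+P$"     # ends with \d+P
)


def is_clean(sam_line):
    """Return True for header lines and for records with an acceptable CIGAR."""
    if sam_line.startswith("@"):
        return True  # always keep SAM header lines
    fields = sam_line.rstrip("\n").split("\t")
    if len(fields) < 6:
        return False  # assumption: drop lines too short to hold a CIGAR column
    return BAD_CIGAR.search(fields[5]) is None


def filter_sam(lines):
    """Keep only the header lines and the records that pass is_clean()."""
    return [line for line in lines if is_clean(line)]
```

To use it, you would feed it the text output of `samtools view -h` and pass whatever `filter_sam` returns back into `samtools view -bS -` to get a .bam file again.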


Enjoy nice, clean (and hopefully problem free) .bam files.

Friday, May 17, 2013

Convert Complete Genomics Data to BAM File

Complete Genomics provides cgatools as a program suite to help manage the data they produce for investigators.  Although still in its beta form, cgatools does provide some functionality for converting evidence dnbs files (.tsv.bz2) to .bam format.  Below is a script that does so.  The only caveat is that I can't seem to get the samtools depth command to work with the generated .bam files.


Here are links to documentation and a handy command reference for cgatools.

Thursday, May 16, 2013

Download Complete Genomics Reference Files

A reference genome is needed when using cgatools on Complete Genomics data.  Here are links to the ftp sites that contain the compact randomly accessible reference (.crr) files.  Just use the wget command from a UNIX cluster to download.

NCBI Build 36:
ftp://ftp.completegenomics.com/ReferenceFiles/build36.crr

NCBI Build 37:
ftp://ftp.completegenomics.com/ReferenceFiles/build37.crr
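
If wget is not handy, the download can also be scripted. Here is a minimal Python sketch that uses only the standard library and the two URLs listed above. The helper names `crr_url` and `download_crr` are made up for this example, and the sketch assumes the ftp site is reachable from your machine.

```python
import os
import urllib.request

# Base location of the .crr files, taken from the links above.
BASE_URL = "ftp://ftp.completegenomics.com/ReferenceFiles"


def crr_url(build):
    """Return the ftp URL of the .crr file for NCBI build 36 or 37."""
    if build not in (36, 37):
        raise ValueError("only builds 36 and 37 are listed here")
    return "%s/build%d.crr" % (BASE_URL, build)


def download_crr(build, dest_dir="."):
    """Download the .crr file for the given build and return the local path."""
    url = crr_url(build)
    dest = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, dest)  # urllib handles ftp:// URLs
    return dest
```

Calling `download_crr(37)` would save build37.crr in the current directory, much like running wget on the second link.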

The next step is to verify that the file downloaded completely.  Run one of the following commands at the command prompt depending on the version of the .crr file you downloaded.


The file output should look like this for build 36.


And this for build 37.