Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Showing posts with label extract. Show all posts

Tuesday, August 26, 2014

Easy Access to TCGA Data in R

Today I discovered that the Memorial Sloan Kettering folks at cBioPortal have made it super easy to access The Cancer Genome Atlas (TCGA) data in R.  There is no need to download massive amounts of data, extract the needed files, and link data types together. Instead, they have a public data portal that R can easily interact with using the cgdsr library.  Data queries can be made for all TCGA cancer tissue types across a multitude of different data types (e.g., genotyping, sequencing, copy number, mutation, methylation). Clinical data is even available for several cancer samples.

To install the library, open your R console and type: install.packages("cgdsr"). Alternatively, to install on a UNIX cluster account, see this link.

The main tools you will use are:
getCancerStudies: shows all the studies included in cBioPortal (not limited to TCGA data) and includes descriptions. Use this to find a study you are interested in and its respective study id.
getCaseLists: divides a study into case sets based on the data available for its samples. Use this to select the cases you are interested in querying.
getGeneticProfiles: shows the available genetic data for a study.  Use this to select the type of data you are interested in querying.
getProfileData: performs the query on the case samples and genetic data of interest.
getClinicalData: extracts clinical data for a list of cases.

Here is an example script that can be used as a scaffold to design queries of the TCGA (as well as data from the Broad Institute and other places). It can be used to access all types of genetic data stored in the cBioPortal repository, but this particular example focuses on finding mutations within the BRCA1 gene in lung cancer cases.

Tuesday, June 10, 2014

Easily Parse XML Files in Python

While XML is not the most popular means of storing data, there are instances where data is stored in this format.  The TCGA project is one such group that keeps a record of samples and analysis information in XML.  It can be difficult to extract data from these files without the appropriate tools.  In some cases a simple grep command can be used, but even then the output usually needs to be cleaned.  Luckily, Python has some packages that aid in parsing XML files so that the desired data can be extracted.  The library I found most useful was ElementTree.  This package combined with the urllib2 package enables one to download an XML file from the internet, parse it, and extract the desired information.  Below is an example Python script that downloads an XML description file from TCGA (link) and then places each extracted element into a Python dictionary object.  Of course this could be easily modified for your particular application of interest, but at least it provides a simple backbone to build other scripts from.
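For readers on Python 3, here is a minimal sketch of the same idea (it is not the original script). It assumes urllib.request in place of urllib2, and the XML snippet, its tag names, and the helper names parse_results and fetch_xml are hypothetical stand-ins made up for illustration, not the real TCGA file layout.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen  # Python 3 counterpart of urllib2.urlopen

# Hypothetical stand-in for a TCGA description file; the tag names and
# values here are illustrative only, not the real TCGA schema.
SAMPLE_XML = """<ResultSet>
  <Result id="1">
    <analysis_id>example-analysis-0001</analysis_id>
    <disease_abbr>LUAD</disease_abbr>
  </Result>
  <Result id="2">
    <analysis_id>example-analysis-0002</analysis_id>
    <disease_abbr>LUSC</disease_abbr>
  </Result>
</ResultSet>"""


def fetch_xml(url):
    """Download an XML document and return it as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_results(xml_text, record_tag="Result"):
    """Return one dictionary per record element, mapping child tag to text."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root.iter(record_tag):
        # Each direct child of the record becomes a key/value pair.
        records.append({child.tag: (child.text or "").strip() for child in element})
    return records


if __name__ == "__main__":
    for record in parse_results(SAMPLE_XML):
        print(record)
```

To work on a remote file instead of the inline sample, pass the text returned by fetch_xml(your_url) to parse_results and set record_tag to whatever element wraps each record in your file.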

Below is the expected output:

Wednesday, January 15, 2014

Split a Variable in R into Components

I had a TCGA barcode from which I wanted to extract the sample type (i.e., cut out characters 14-15).  This would have been easy to do in Python (e.g., id[13:15] or id.split("-")[3][0:2]), but I wanted to be able to do this inside R.  To do this I found a handy little base function called substr, which returns a subset of a string.  Here is the code to extract characters 14-15 from a string:

substr(x=id, start=14, stop=15)

or from a column called ids in a data frame called data:

substr(x=data$ids, start=14, stop=15)
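For comparison, here is a small sketch of the two Python one-liners mentioned above. The barcode value and the function names are just illustrative assumptions. Note the indexing difference: R's substr counts from 1 and includes the stop position, while Python slices count from 0 and exclude the end, so characters 14-15 become [13:15].

```python
def sample_type_by_position(barcode):
    """Return characters 14-15 (1-based), like substr(x, start=14, stop=15) in R."""
    return barcode[13:15]


def sample_type_by_field(barcode):
    """Return the first two characters of the fourth dash-separated field."""
    return barcode.split("-")[3][:2]


if __name__ == "__main__":
    barcode = "TCGA-02-0001-01C-01D-0182-01"  # illustrative barcode
    print(sample_type_by_position(barcode))
    print(sample_type_by_field(barcode))
```

The field-based version is a bit more forgiving if the earlier parts of an identifier ever vary in length, while the position-based one mirrors the R call exactly.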

Tuesday, December 3, 2013

Extract Only Desired Files from Compressed File

Lately I have been working with some large tarballs (tar.gz files).  To access the compressed contents, I would usually extract every file in the archive.  Decompressing all the files was time consuming, filtering through them to find the ones I wanted was a pain, and deleting everything I didn't want took even more time.  I knew there had to be a better way to get just the files I wanted.  Sure enough, the basic tar command has options that allow you to extract a single file of interest, or even a set of files using wildcards.  The command looks like this:

where,
big_file.tar.gz is the tarball you are extracting from
path is the path to your file in the tarball
your_file.txt is the file you want to extract

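This can also be done using wildcards.  For example, if I wanted all text files in the above path I could substitute *.txt for your_file.txt.  With GNU tar, this requires the --wildcards option, and the pattern should be quoted so the shell does not expand it first.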
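If you would rather do the same thing from a script, here is a minimal Python sketch using the standard tarfile and fnmatch modules. It is not the tar command itself, and the function name extract_matching is my own invention. It assumes the archive comes from a source you trust.

```python
import fnmatch
import tarfile


def extract_matching(tar_path, pattern, dest="."):
    """Extract only the regular files whose names match a shell-style pattern.

    Note that fnmatch lets "*" match "/" as well, so "path/*.txt" also
    matches text files in subdirectories of path.
    """
    with tarfile.open(tar_path, "r:gz") as archive:
        wanted = [
            member
            for member in archive.getmembers()
            if member.isfile() and fnmatch.fnmatch(member.name, pattern)
        ]
        # Only the selected members are written to disk.
        archive.extractall(path=dest, members=wanted)
    return [member.name for member in wanted]


if __name__ == "__main__":
    # File names follow the placeholders used in the post.
    print(extract_matching("big_file.tar.gz", "path/*.txt", dest="extracted"))
```

Passing "path/your_file.txt" as the pattern pulls out a single file. The archive still has to be read through to find the matching members, but nothing unwanted is written to disk.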

Sunday, May 5, 2013

Samtools Download BAM Region Only

Publicly available sequencing data can often serve as a useful reference for a sequencing project. The 1000 Genomes project is a great source, especially with their newly released high coverage Complete Genomics data.  Here is an example UNIX script that shows how BAM files for genomic regions of interest can be created from a whole-genome BAM file hosted on an FTP server, without having to download the entire BAM file first.  The bai_file_list.txt file contains a unique identifier for each BAM file, extracted from a previously selected list of BAI files of interest.  Here I am just extracting the BRCA1 and BRCA2 regions of the genome.  The extracted reads are sorted and saved as a BAM file, and an associated BAI index file is also created.  The final step removes the excess BAI files that Samtools downloads and uses to extract the region of interest from the BAM files on the FTP server.
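As a rough illustration of the steps just described (not the original UNIX script), here is a Python sketch that only builds the Samtools command lines for one BAM file, so they can be inspected before running. The FTP URL, output prefix, and function name are hypothetical. The region strings are my assumption of approximate GRCh37 coordinates for BRCA1 and BRCA2 and should be checked against your reference build and chromosome naming. The -o options assume a reasonably recent Samtools.

```python
import shlex


def region_bam_commands(bam_url, regions, out_prefix):
    """Build the view, sort, and index commands for a remote BAM file."""
    unsorted_bam = f"{out_prefix}.unsorted.bam"
    sorted_bam = f"{out_prefix}.bam"
    return [
        # Fetch only the reads overlapping the requested regions.
        ["samtools", "view", "-b", "-o", unsorted_bam, bam_url, *regions],
        # Sort the extracted reads by coordinate.
        ["samtools", "sort", "-o", sorted_bam, unsorted_bam],
        # Create the BAI index for the sorted file.
        ["samtools", "index", sorted_bam],
    ]


if __name__ == "__main__":
    # Assumed, approximate GRCh37 coordinates; verify before use.
    regions = ["17:41196312-41277500", "13:32889617-32973809"]
    url = "ftp://example.org/path/sample.bam"  # hypothetical URL
    for command in region_bam_commands(url, regions, "sample_brca"):
        print(shlex.join(command))
```

Each command list can be handed to subprocess.run once it looks right. The BAI file that Samtools downloads alongside the remote BAM can then be deleted, as in the final step described above.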