Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Showing posts with label TCGA. Show all posts

Tuesday, August 26, 2014

Easy Access to TCGA Data in R

Today I discovered that the Memorial Sloan Kettering folks at cBioPortal have made it super easy to access The Cancer Genome Atlas (TCGA) data in R. There is no need to download massive amounts of data, extract the needed files, and link data types together. Instead, they have a public data portal that R can query directly using the cgdsr package. Queries can be made for all TCGA cancer tissue types across a multitude of data types (e.g., genotyping, sequencing, copy number, mutation, methylation). Clinical data is even available for several cancer samples.

To install the package, open your R console and type: install.packages("cgdsr"). Alternatively, to install on a UNIX cluster account, see this link.

The main tools you will use are:
getCancerStudies: shows all the studies included in cBioPortal (not limited to TCGA data) and includes descriptions. Use this to find a study you are interested in and its study id.
getCaseLists: divides a study into case sets based on the data available for its samples. Use this to select the cases you are interested in querying.
getGeneticProfiles: shows the genetic data available for a study. Use this to select the type of data you are interested in querying.
getProfileData: performs the query on the case set and genetic data of interest.
getClinicalData: extracts clinical data for a list of cases.

Here is an example script that can be used as a scaffold to design queries of TCGA data (as well as data from the Broad Institute and other places). It can be used to access all types of genetic data stored in the cBioPortal repository, but this particular example focuses on finding mutations within the BRCA1 gene in lung cancer cases.

Monday, June 16, 2014

Decode TCGA Cancer Abbreviations

The Cancer Genome Atlas (TCGA) uses abbreviations in a lot of filenames and directories for the cancer types they study. Most of these abbreviations are pretty self-explanatory, but a few are hard to guess. To decipher them, I refer to the table below:


Abbreviation  Cancer Type
ACC           Adrenocortical carcinoma
BLCA          Bladder Urothelial Carcinoma
BRCA          Breast invasive carcinoma
CESC          Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL          Cholangiocarcinoma
CNTL          Controls
COAD          Colon adenocarcinoma
DLBC          Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
ESCA          Esophageal carcinoma
GBM           Glioblastoma multiforme
HNSC          Head and Neck squamous cell carcinoma
KICH          Kidney Chromophobe
KIRC          Kidney renal clear cell carcinoma
KIRP          Kidney renal papillary cell carcinoma
LAML          Acute Myeloid Leukemia
LCML          Chronic Myelogenous Leukemia
LGG           Brain Lower Grade Glioma
LIHC          Liver hepatocellular carcinoma
LUAD          Lung adenocarcinoma
LUSC          Lung squamous cell carcinoma
MESO          Mesothelioma
MISC          Miscellaneous
OV            Ovarian serous cystadenocarcinoma
PAAD          Pancreatic adenocarcinoma
PCPG          Pheochromocytoma and Paraganglioma
PRAD          Prostate adenocarcinoma
READ          Rectum adenocarcinoma
SARC          Sarcoma
SKCM          Skin Cutaneous Melanoma
STAD          Stomach adenocarcinoma
TGCT          Testicular Germ Cell Tumors
THCA          Thyroid carcinoma
THYM          Thymoma
UCEC          Uterine Corpus Endometrial Carcinoma
UCS           Uterine Carcinosarcoma
UVM           Uveal Melanoma
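
Since these abbreviations mostly turn up in filenames and directories, it can be handy to decode them in a script. Below is a minimal Python sketch of that idea. The dictionary holds only a few rows copied from the table, and the decode_path helper and the example path are hypothetical, made up for illustration rather than taken from any TCGA layout.

```python
# A few rows copied from the table above; extend with the remaining entries as needed.
TCGA_ABBREVIATIONS = {
    "LAML": "Acute Myeloid Leukemia",
    "LGG": "Brain Lower Grade Glioma",
    "OV": "Ovarian serous cystadenocarcinoma",
    "UCEC": "Uterine Corpus Endometrial Carcinoma",
}


def decode_path(path):
    """Return (abbreviation, cancer type) for the first path component
    that matches a known abbreviation, or None if nothing matches."""
    for part in path.replace("\\", "/").split("/"):
        key = part.upper()
        if key in TCGA_ABBREVIATIONS:
            return key, TCGA_ABBREVIATIONS[key]
    return None


# Hypothetical directory layout, used only to exercise the helper.
print(decode_path("tcga/ucec/clinical/sample.xml"))
```

Matching is done on whole path components, so an abbreviation embedded inside a longer filename would need its own splitting rule.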

Tuesday, June 10, 2014

Easily Parse XML Files in Python

While not the most popular means of storing data, there are instances where data is stored in the XML file format. The TCGA project is one such group that keeps a record of samples and analysis information in this format. It can be difficult to extract data from these files without the appropriate tools. In some cases a simple grep command can be used, but even then the output usually needs to be cleaned. Luckily, Python has some packages that aid in parsing XML files so that the desired data can be extracted. The library I found most useful was ElementTree. Combined with the urllib2 package, it lets you download an XML file from the internet, parse it, and extract the desired information. Below is an example Python script that downloads an XML description file from TCGA (link) and then places each extracted element into a Python dictionary object. Of course this could easily be modified for your particular application, but at least it provides a simple backbone to build other scripts on.
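
As a rough illustration of the parse-into-a-dictionary step, here is a minimal Python 3 sketch using ElementTree. It is not the original script: the XML snippet, its tag names, and the parse_results helper are all invented for the example, and the download step is left out so the code runs without network access.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML standing in for a downloaded description file.
# The tag names and values are invented for illustration only.
SAMPLE_XML = """<ResultSet>
  <Result id="1">
    <analysis_id>abc-123</analysis_id>
    <disease_abbr>LUAD</disease_abbr>
    <sample_type>01</sample_type>
  </Result>
  <Result id="2">
    <analysis_id>def-456</analysis_id>
    <disease_abbr>BRCA</disease_abbr>
    <sample_type>11</sample_type>
  </Result>
</ResultSet>"""


def parse_results(xml_text):
    """Return one dictionary per <Result> element, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    records = []
    for result in root.findall("Result"):
        # Map each child tag to its text content.
        record = {child.tag: (child.text or "").strip() for child in result}
        # Keep the element's id attribute alongside the child values.
        record["id"] = result.get("id")
        records.append(record)
    return records


records = parse_results(SAMPLE_XML)
for record in records:
    print(record)
```

In Python 3 the functionality of urllib2 lives in urllib.request, so a remote file could be read with urllib.request.urlopen(url).read() and the resulting bytes passed to ET.fromstring in the same way.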

Below is the expected output:

Wednesday, January 15, 2014

Split a Variable in R into Components

I had a TCGA barcode from which I wanted to extract the sample type (i.e., characters 14-15). This would have been easy to do in Python (e.g., id[13:15] or id.split("-")[3][0:2]), but I wanted to be able to do it inside R. For this I found a handy little base function called substr, which returns the part of a string between a start and a stop position (both 1-based and inclusive). Here is the code to extract characters 14-15 from a string:

substr(x=id, start=14, stop=15)

or from a column called ids (substr is vectorized, so every element is processed):

substr(x=data$ids, start=14, stop=15)
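
For comparison, the two Python approaches mentioned above can be sketched as follows. The barcode is only an illustrative value in the usual TCGA layout, and sample_type_code is a hypothetical helper name. Note that Python slices are 0-based with an exclusive end, which is why characters 14-15 become [13:15].

```python
def sample_type_code(barcode):
    """Return the two-character sample type code from a TCGA-style barcode
    by taking the first two characters of the fourth dash-separated field."""
    fields = barcode.split("-")
    if len(fields) < 4 or len(fields[3]) < 2:
        raise ValueError("barcode has no sample field: %r" % barcode)
    return fields[3][:2]


barcode = "TCGA-02-0001-01C-01D-0182-01"  # illustrative barcode

by_position = barcode[13:15]           # fixed character positions 14-15
by_field = sample_type_code(barcode)   # field-based, tolerant of width changes

print(by_position, by_field)
```

The positional slice mirrors substr exactly, while the field-based version keeps working if an earlier field ever has a different width.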