Genome Toolbox: TCGA

Tuesday, August 26, 2014

Easy Access to TCGA Data in R

Today I discovered that the Memorial Sloan Kettering folks at cBioPortal have made it super easy to access The Cancer Genome Atlas (TCGA) data in R. There is no need to download massive amounts of data, extract the needed files, and link data types together. Instead, they have a public data portal that R can easily interact with using the cgdsr library. Data queries can be made for all TCGA cancer tissue types across a multitude of data types (e.g., genotyping, sequencing, copy number, mutation, methylation). Clinical data is even available for several cancer samples.

To install the library, open your R terminal and type: install.packages("cgdsr"). Alternatively, to install on a UNIX cluster account, see this link.

The main tools you will use are:
getCancerStudies: lists all the studies included in cBioPortal (not limited to TCGA data), with descriptions. Use this to find a study you are interested in and its study id.
getCaseLists: divides a study into case sets based on the data available for its samples. Use this to select the cases you are interested in querying.
getGeneticProfiles: shows the genetic data available for a study. Use this to select the type of data you are interested in querying.
getProfileData: performs the query on the cases and genetic data of interest.
getClinicalData: extracts clinical data for a list of cases.

Here is an example script that can be used as a scaffold to design queries of the TCGA (as well as data from the Broad Institute and other places). It can be used to access all types of genetic data stored in the cBioPortal repository, but this particular example focuses on finding mutations within the BRCA1 gene in lung cancer cases.

Monday, June 16, 2014

Decode TCGA Cancer Abbreviations

The Cancer Genome Atlas (TCGA) uses abbreviations in a lot of filenames and directories for the cancer types they study. Most of these abbreviations are pretty self-explanatory, but a few are hard to guess. To decipher them, I refer to the table below:

Abbreviation	Cancer Type
ACC	Adrenocortical carcinoma
BLCA	Bladder Urothelial Carcinoma
BRCA	Breast invasive carcinoma
CESC	Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL	Cholangiocarcinoma
CNTL	Controls
COAD	Colon adenocarcinoma
DLBC	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
ESCA	Esophageal carcinoma
GBM	Glioblastoma multiforme
HNSC	Head and Neck squamous cell carcinoma
KICH	Kidney Chromophobe
KIRC	Kidney renal clear cell carcinoma
KIRP	Kidney renal papillary cell carcinoma
LAML	Acute Myeloid Leukemia
LCML	Chronic Myelogenous Leukemia
LGG	Brain Lower Grade Glioma
LIHC	Liver hepatocellular carcinoma
LUAD	Lung adenocarcinoma
LUSC	Lung squamous cell carcinoma
MESO	Mesothelioma
MISC	Miscellaneous
OV	Ovarian serous cystadenocarcinoma
PAAD	Pancreatic adenocarcinoma
PCPG	Pheochromocytoma and Paraganglioma
PRAD	Prostate adenocarcinoma
READ	Rectum adenocarcinoma
SARC	Sarcoma
SKCM	Skin Cutaneous Melanoma
STAD	Stomach adenocarcinoma
TGCT	Testicular Germ Cell Tumors
THCA	Thyroid carcinoma
THYM	Thymoma
UCEC	Uterine Corpus Endometrial Carcinoma
UCS	Uterine Carcinosarcoma
UVM	Uveal Melanoma
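
For scripted use, the same lookup can be kept in a small dictionary. This is only a sketch: the dictionary holds a handful of rows copied from the table above, and the function name decode_tcga_abbreviation is a hypothetical helper, not part of any TCGA tool.

```python
# Partial lookup table: a few rows copied from the table above.
TCGA_ABBREVIATIONS = {
    "BRCA": "Breast invasive carcinoma",
    "GBM": "Glioblastoma multiforme",
    "LAML": "Acute Myeloid Leukemia",
    "LUAD": "Lung adenocarcinoma",
    "LUSC": "Lung squamous cell carcinoma",
    "OV": "Ovarian serous cystadenocarcinoma",
}


def decode_tcga_abbreviation(abbreviation):
    """Return the cancer type for an abbreviation, or None if it is not listed."""
    # Normalize whitespace and case so "luad" and "LUAD" are treated alike.
    return TCGA_ABBREVIATIONS.get(abbreviation.strip().upper())


print(decode_tcga_abbreviation("luad"))
```

Returning None for unknown codes leaves it to the caller to decide whether a missing abbreviation is an error.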

Tuesday, June 10, 2014

Easily Parse XML Files in Python

While not the most popular means of storing data, there are instances where data is stored in the XML file format. The TCGA project is one such group that keeps a record of samples and analysis information in this format. It can be difficult to extract data from these files without the appropriate tools. In some cases a simple grep command can be used, but even then the output usually needs to be cleaned. Luckily, Python has some packages that aid in parsing XML files so that the desired data can be extracted. The library I found most useful was ElementTree. Combined with the urllib2 package (Python 2 only; in Python 3 the equivalent is urllib.request), it lets you download an XML file from the internet, parse it, and extract the desired information. Below is an example Python script that downloads an XML description file from TCGA (link) and then places each extracted element into a Python dictionary object. Of course this could be easily modified for your particular application, but at least it provides a simple backbone to build other scripts on.

Below is the expected output:
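
As a separate, minimal sketch of the parse-into-a-dictionary idea described in this post (not the original script), the following uses ElementTree on an inline XML string. The XML content, its tag names, and the helper name xml_to_dict are all made up for illustration and do not reflect the real TCGA schema.

```python
import xml.etree.ElementTree as ET

# Made-up XML standing in for a downloaded file; the tag names are
# illustrative assumptions, not the real TCGA schema.
SAMPLE_XML = """<Result>
    <analysis_id>example-analysis-001</analysis_id>
    <disease_abbr>LUAD</disease_abbr>
    <sample_type>01</sample_type>
</Result>"""


def xml_to_dict(xml_text):
    """Parse an XML string and map each child tag of the root to its text."""
    root = ET.fromstring(xml_text)
    # Iterating over an element yields its direct children only.
    return {child.tag: (child.text or "").strip() for child in root}


record = xml_to_dict(SAMPLE_XML)
print(record["disease_abbr"])  # LUAD
```

To work on a remote file instead, the text passed to xml_to_dict could come from urllib.request.urlopen(url).read() in Python 3. Note that this flat mapping keeps only the last value when a tag repeats and ignores nested elements, so deeper documents would need a recursive walk.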

Wednesday, January 15, 2014

Split a Variable in R into Components

I had a TCGA barcode from which I wanted to extract the sample type (i.e., characters 14-15). This would have been easy to do in Python (e.g., id[13:15] or id.split("-")[3][0:2]), but I wanted to be able to do this inside R. To do this I found a handy little base function called substr, which returns a subset of a string. Here is the code to extract characters 14-15 from a string:

substr(x=id, start=14, stop=15)

or, to do the same for every value in a data frame column called ids:

substr(x=data$ids, start=14, stop=15)
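
For comparison, the two Python approaches mentioned above can be sketched as follows. The barcode values are illustrative examples in the usual TCGA layout, not data from the original analysis.

```python
# Example barcode in the usual TCGA layout (illustrative value).
barcode = "TCGA-02-0001-01C-01D-0182-01"

# Fixed-position slice: Python is 0-based and end-exclusive, so [13:15]
# picks the same characters as R's substr(start=14, stop=15).
by_position = barcode[13:15]

# Field-based alternative: split on "-" and take the first two characters
# of the fourth field.
by_field = barcode.split("-")[3][:2]

print(by_position, by_field)  # 01 01

# The same slice applied to many barcodes, analogous to substr on data$ids.
barcodes = [barcode, "TCGA-02-0001-10A-01D-0182-01"]
sample_types = [b[13:15] for b in barcodes]
print(sample_types)  # ['01', '10']
```

The field-based version is a little more forgiving if the leading fields ever vary in length, while the fixed-position slice mirrors the substr call directly.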
