Genome Toolbox: package


Tuesday, August 26, 2014

Easy Access to TCGA Data in R

Today I discovered that the Memorial Sloan Kettering folks at cBioPortal have made it super easy to access The Cancer Genome Atlas (TCGA) data in R. There is no need to download massive amounts of data, extract the needed files, and link data types together. Instead, they have a public data portal that R can easily interact with using the cgdsr library. Queries can be made for all TCGA cancer tissue types across many different data types (e.g., genotyping, sequencing, copy number, mutation, methylation). Clinical data are even available for several cancer samples.

To install the library, open your R terminal and type: install.packages("cgdsr"). Alternatively, to install it on a UNIX cluster account, see the post "Install R Package in User Account on UNIX Cluster".

The main tools you will use are:
getCancerStudies: lists all the studies included in cBioPortal (not limited to TCGA data), with descriptions. Use this to find a study you are interested in and its study id.
getCaseLists: divides a study into case sets based on the data available for its samples. Use this to select the cases you want to query.
getGeneticProfiles: shows the genetic data available for a study. Use this to select the type of data you want to query.
getProfileData: runs the query for the case set and genetic data of interest.
getClinicalData: extracts clinical data for a list of cases.

Here is an example script that can be used as a scaffold for designing queries of TCGA (as well as data from the Broad Institute and other places). It can be used to access all types of genetic data stored in the cBioPortal repository, but this particular example focuses on finding mutations within the BRCA1 gene in lung cancer cases.
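As a rough scaffold for that kind of query, the sketch below wraps the five functions listed above in one helper. It assumes the cgdsr package is installed and that the cBioPortal web service it talks to is still reachable at the URL shown (both may have changed since this post). The helper name query_gene_profile and the three IDs in the usage comment are my own placeholders, not values taken from the portal, so check them against the tables the helper returns.

```r
# Sketch only: assumes cgdsr is installed and the portal URL is reachable.
# query_gene_profile is a hypothetical helper name, not part of cgdsr.
query_gene_profile <- function(genes, study_id, case_list_id, profile_id,
                               url = "http://www.cbioportal.org/") {
  if (!requireNamespace("cgdsr", quietly = TRUE)) {
    stop("Install cgdsr first, e.g. install.packages(\"cgdsr\")")
  }

  # Connection object for the portal
  cgds <- cgdsr::CGDS(url)

  list(
    # Look-up tables: browse these to confirm the IDs you passed in
    studies   = cgdsr::getCancerStudies(cgds),
    case_sets = cgdsr::getCaseLists(cgds, study_id),
    profiles  = cgdsr::getGeneticProfiles(cgds, study_id),
    # The actual query: one row per case, one column per gene
    data      = cgdsr::getProfileData(cgds, genes, profile_id, case_list_id),
    # Clinical annotations for the same case set
    clinical  = cgdsr::getClinicalData(cgds, case_list_id)
  )
}

# Hypothetical usage (all three IDs are assumptions to verify first):
# res <- query_gene_profile("BRCA1", "luad_tcga", "luad_tcga_sequenced",
#                           "luad_tcga_mutations")
# head(res$data)
```

Nothing touches the network until the helper is called, so the definition can sit at the top of a script and the IDs can be filled in once you have looked through the study, case list, and profile tables.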

Monday, June 2, 2014

Make Venn Diagram in R with Correctly Weighted Areas

Venn diagrams are incredibly intuitive plots that visually display the overlap between groups. There are a host of programs out there that make Venn diagrams, but very few actually weight the areas correctly to scale. In my opinion, this is unacceptable in the 21st century. I did stumble across one useful R package that calculates the appropriate area for intersections. It is the Vennerable package. While documentation and options are rather light for this package, it does what it needs to do: correctly size overlapping regions.

It's a little tricky to install this Venn diagram package. To do so, follow the steps below:
(1) Type setRepositories() in the R command console.
(2) Select R-forge, rforge.net, CRAN extras, BioC software, and BioC extras in the pop-up window and then press OK. I just have all of them selected.
(3) Install the Vennerable package by typing install.packages("Vennerable", repos="http://R-Forge.R-project.org")
(4) Install the dependencies by typing install.packages(c("graph", "RBGL"), dependencies=TRUE).

This should have you up and running. The R Vennerable help page can be accessed here, but it's really not that useful. Below are some examples you can run as a bit of a primer. The trickiest thing to learn is how the weights are assigned in the Venn plot. As a general rule of thumb, if SetNames=c("A","B","C"), then Weight=c(notABC, A, B, AB, C, AC, BC, ABC), where each entry is the count for that region alone (for example, AB is the count of items in A and B but not C). It's a bit frustrating to have no control over general plotting details such as color, label location, and rotation, but I guess I can live with these things for now. Also of note, the plotting algorithm tries its best to converge on a Venn diagram whose circle areas fit your data, but if it can't, there is no guarantee the plot will match your data, so double check things thoroughly! If you know of some new or better options out there, please let me know in the comments section. Enjoy.
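To make the weight ordering concrete, here is a small sketch with made-up counts. It builds the eight region codes in the order described above and names the weight vector with them, so each count is tied to its region rather than to its position. The counts are invented for illustration, and the plotting step only runs if Vennerable is installed.

```r
set_names <- c("A", "B", "C")

# Enumerate the 8 regions. expand.grid varies the first set fastest, which
# gives 000, 100, 010, 110, 001, 101, 011, 111 (digit order is A, B, C).
membership <- expand.grid(A = 0:1, B = 0:1, C = 0:1)
region_codes <- unname(apply(membership, 1, paste, collapse = ""))

# Made-up counts, in the order: notABC, A, B, AB, C, AC, BC, ABC
weights <- c(0, 40, 30, 10, 20, 5, 8, 3)
names(weights) <- region_codes
print(weights)

# Draw the area-weighted diagram only when the package is available
if (requireNamespace("Vennerable", quietly = TRUE)) {
  library(Vennerable)
  v <- Venn(SetNames = set_names, Weight = weights)
  plot(v, doWeights = TRUE)
}
```

Printing the named vector before plotting is a cheap way to confirm that, say, the count you meant for "A and B only" really sits under code 110.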


Tuesday, March 25, 2014

Calculate P-value for Linear Mixed Model in R

The lme4 R package is a powerful tool for fitting mixed models in R, where you can specify both fixed and random effects. One oddity of the package is that it returns t statistics but no p-values. Getting a p-value takes a little extra coding. Here is a quick example of fitting a linear mixed model in R (using lmer), followed by the added code to calculate p-values from the t statistic. Either p1 or p2 is an acceptable p-value to use.
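A minimal sketch of that calculation follows. I am assuming here that p1 is the two-sided p-value from a normal approximation and p2 is the two-sided p-value from a t distribution; the helper name p_from_t and the degrees of freedom used (observations minus the number of fixed effects) are my own rough choices, and appropriate degrees of freedom for mixed models are a matter of debate. The model fit uses the sleepstudy data that ships with lme4 and only runs if lme4 is installed.

```r
# Two-sided p-values from t statistics.
# p1: normal approximation; p2: t distribution with caller-supplied df.
# p_from_t is a hypothetical helper, not an lme4 function.
p_from_t <- function(t_value, df) {
  data.frame(
    t.value = t_value,
    p1 = 2 * pnorm(abs(t_value), lower.tail = FALSE),
    p2 = 2 * pt(abs(t_value), df = df, lower.tail = FALSE)
  )
}

if (requireNamespace("lme4", quietly = TRUE)) {
  # Reaction time by days of sleep deprivation, random intercept per subject
  fit <- lme4::lmer(Reaction ~ Days + (1 | Subject), data = lme4::sleepstudy)
  coefs <- coef(summary(fit))  # columns: Estimate, Std. Error, t value

  # Assumed df: observations minus number of fixed effects (a rough choice)
  df_approx <- nrow(lme4::sleepstudy) - nrow(coefs)
  pvals <- p_from_t(coefs[, "t value"], df_approx)
  print(cbind(coefs, pvals[, c("p1", "p2")]))
} else {
  # Fallback demo on two arbitrary t statistics
  print(p_from_t(c(1.96, 2.5), df = 20))
}
```

With many observations the two columns are nearly identical; with few, p2 is the more conservative of the two.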

Monday, January 27, 2014

Produce SAS Proc Freq Output in R

I have never been a huge fan or advocate of SAS, and actually recommend using alternatives to SAS, but I am somewhat addicted to the SAS Proc Freq procedure and its output tables. It's a nice way not only to visualize the data but also to get some useful summary statistics. I have been using the R table command for a while, and in most cases, combined with margin.table or prop.table, it suffices to summarize the data. Recently, I have needed summary statistics for the tables as well. This can be accomplished easily enough using the R chisq.test or fisher.test commands, but that still doesn't quite provide the fluid integration of data visualization and summary statistics that the SAS Proc Freq output provides. Today I came across the R function CrossTable. This function, part of the gmodels package, provides formatted output very similar to that of Proc Freq. So much so that it even uses the same ASCII characters to delineate cell boundaries. It takes a bit of playing around with the options to get the exact statistics and percentages you would like, but overall it is a very nice (and, not to mention, free) alternative to SAS's Proc Freq output. Below is an example table I created.
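For comparison, here is a short sketch of both routes on R's built-in mtcars data (my choice of example, not the table from the original post). The first half is the piecemeal base R approach; the second half is the single CrossTable call, which only runs if gmodels is installed.

```r
# Base R: counts, margins, proportions, and a test, one call at a time
tab <- table(cyl = mtcars$cyl, am = mtcars$am)
print(tab)
print(margin.table(tab, 1))          # row totals
print(round(prop.table(tab, 1), 3))  # row proportions
print(fisher.test(tab)$p.value)

if (requireNamespace("gmodels", quietly = TRUE)) {
  # One call: counts, row/column/total proportions, plus the tests,
  # laid out in the SAS-style cell format
  gmodels::CrossTable(mtcars$cyl, mtcars$am,
                      chisq = TRUE, fisher = TRUE, format = "SAS",
                      dnn = c("cyl", "am"))
}
```

Because some cells here are small, the chi-square approximation may draw a warning; the Fisher exact test is the safer number to read in that case.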

Thursday, December 12, 2013

Create Triangle Plot from Inferred Genetic Ancestry

I previously posted on how to infer ancestry for a group of study participants using SNP genotypes. Today, I want to visually plot some of the output in R. Two informative plots that can be generated from the output are a standard plot of the two eigenvectors with percent ancestry overlaid and a triangle (or ternary) plot with each axis representing the percentage of one of the three ancestral populations (e.g., European, African, and Asian).

Here is some simple code to plot this in R. Base R has no function for drawing a triangle plot, so the plotrix package needs to be installed first. The ancestry.txt file is the output file from SNPWEIGHTS, but other output could be formatted to work as well.
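Since the layout of ancestry.txt depends on how SNPWEIGHTS was run, the sketch below uses simulated ancestry fractions as a stand-in; with real data you would read the file with read.table and pull out the three percent-ancestry columns instead. The column names and simulated values are assumptions for illustration. The plot is drawn with triax.plot from plotrix, and only if that package is installed.

```r
set.seed(1)
n <- 60

# Simulated stand-in for the three percent-ancestry columns (hypothetical
# data): gamma draws normalised so that each row sums to 1.
raw <- matrix(rgamma(n * 3, shape = rep(c(4, 1, 1), each = n)), ncol = 3)
ancestry <- raw / rowSums(raw)
colnames(ancestry) <- c("European", "African", "Asian")

if (requireNamespace("plotrix", quietly = TRUE)) {
  # Each point is one participant; each axis is one ancestral population
  plotrix::triax.plot(ancestry, main = "Inferred ancestry",
                      show.grid = TRUE, pch = 16, col.symbols = "steelblue")
}
```

The one requirement worth checking on real data is that the three columns for each participant sum to 1 (or 100 if they are percentages), since a ternary plot has no place to put anything else.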


Friday, November 15, 2013

Install R Package in User Account on UNIX Cluster

R is a wonderful open-source statistical program with new and exciting packages constantly being released. Often UNIX cluster administrators don't have the time (or interest) to update or add new packages as they are released. Here's where this blog post comes in. By installing R packages in a directory on your UNIX account and loading them from there you can circumvent the need to have cluster administrators install these packages. Briefly, here is how to do this:

(1) Install the R package either from the UNIX console or R console to a directory within your user account.
From UNIX console:
or

From R console:

In the above example, pkg is the R package you want to install and /R_packages/ is the directory you would like to install it in.

(2) To load the package for use in R, simply put the following command in your script.
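The two steps above can be sketched from the R console as follows. The helper name use_user_library is my own, and the directory and package name in the usage comments are placeholders. The idea is to create a personal library directory and put it first on R's library search path, after which install.packages installs there by default and library finds packages there.

```r
# Create (if needed) a personal library directory and put it first on the
# library search path. use_user_library is a hypothetical helper name.
use_user_library <- function(lib_dir) {
  lib_dir <- path.expand(lib_dir)
  dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)
  .libPaths(c(lib_dir, .libPaths()))
  invisible(.libPaths()[1])
}

# Typical use (directory and package name are placeholders):
#   use_user_library("~/R_packages")
#   install.packages("pkg")   # installs into the first entry of .libPaths()
#   library(pkg)              # found via the same search path
#
# Without the helper, the library can be named explicitly each time:
#   install.packages("pkg", lib = "/path/to/R_packages")
#   library(pkg, lib.loc = "/path/to/R_packages")
```

Calling the helper at the top of each script (or from your .Rprofile) keeps the personal library in effect across sessions without any help from the cluster administrators.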
