Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.

Friday, November 21, 2014

Convert 2 by 2 Contingency Table to Data Frame in R

Contingency tables are useful shorthand ways of inputting and visualizing data. I have yet to find an easy way to convert between contingency tables and data frames in R. Below is a short script in which I input a contingency table, create a function to convert the 2 by 2 table to a data frame, and convert the data frame back to a table. These operations are useful for running statistical procedures that only work on tables or only work on data frames. Hope the example below is useful.
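Here is a sketch of that script (the counts and the table_to_df function name are made up for illustration):

```r
# Hypothetical 2 by 2 contingency table: exposure vs. outcome
tab <- matrix(c(10, 20, 30, 40), nrow = 2,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("case", "control")))

# Expand the table into one row per observation
table_to_df <- function(tab) {
  df <- as.data.frame(as.table(tab))        # counts in long format
  df[rep(seq_len(nrow(df)), df$Freq), 1:2]  # replicate each row by its count
}

df <- table_to_df(tab)

# And back again: table() on the data frame recovers the original counts
tab2 <- table(df$exposure, df$outcome)
```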

Tuesday, August 26, 2014

Easy Access to TCGA Data in R

Today I discovered that the Memorial Sloan Kettering folks at cBioPortal have made it super easy to access The Cancer Genome Atlas (TCGA) data in R.  No need to download massive amounts of data, extract needed files, and link data types together. Rather, they have a public data portal that R can easily interact with using the cgdsr library.  Data queries can be made for all TCGA cancer tissue types across a multitude of different data types (ex: genotyping, sequencing, copy number, mutation, methylation). Clinical data is even available for several cancer samples.

To install the library, open your R terminal and type: install.packages("cgdsr"). Alternatively, to install on a UNIX cluster account, see this link.

The main tools you will use are:
getCancerStudies - shows all the studies included in cBioPortal (not limited to TCGA data) and includes descriptions. Use this to find a study you are interested in and its respective study id.
getCaseLists - divides a study into case sets based on the data available for samples. Use this to select the cases you are interested in querying.
getGeneticProfiles - shows the available genetic data for a study. Use this to select the type of data you are interested in querying.
getProfileData - performs the query on the case samples and genetic data of interest.
getClinicalData - extracts clinical data for a list of cases.

Here is an example script that can be used as a scaffold to design queries of the TCGA (as well as data from the BROAD and other places). It can be used to access all types of genetic data stored in the cBioPortal repository, but this particular example focuses on finding mutations within the BRCA1 gene in lung cancer cases.
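Here is a sketch of such a query (the study, case-list, and profile ids below are assumed examples; check getCancerStudies and getGeneticProfiles to find the exact ids you need):

```r
library(cgdsr)

# Connect to the public cBioPortal web service
mycgds <- CGDS("http://www.cbioportal.org/public-portal/")

# Browse available studies to find the study id of interest
studies <- getCancerStudies(mycgds)

# Assumed ids for TCGA lung adenocarcinoma; verify against the study list
caselist <- getCaseLists(mycgds, "luad_tcga")[1, "case_list_id"]
profile  <- "luad_tcga_mutations"

# Query BRCA1 mutations across the selected cases
muts <- getProfileData(mycgds, "BRCA1", profile, caselist)
head(muts)
```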

Friday, August 22, 2014

Create Multiple Y Axes for an R Plot

R base graphics is a powerful tool for plotting data.  Sometimes it is convenient to visualize combinations of variables on the same plot, and often different variables require different scales.  This can be accomplished by using different Y axes, such as a plot with a second Y axis on the right-hand side.  In R, adding a Y axis is very easy to do.  Here are a few simple example scripts you can use to build off of.  The first two examples are two different ways of showing different scales for the same variable (i.e. temperature in Fahrenheit and Celsius).  The third example is two different variables overlaid on the same R plot with two Y axes used to show the scales of each variable.
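A minimal sketch of the dual-axis approach, using made-up temperature data:

```r
# Made-up Fahrenheit temperatures over five days
x <- 1:5
temps_f <- c(32, 50, 68, 86, 104)

par(mar = c(5, 4, 4, 4))  # leave room on the right for the second axis
plot(x, temps_f, type = "b", xlab = "Day", ylab = "Fahrenheit")

# Overlay the same data on a Celsius scale without redrawing points
par(new = TRUE)
plot(x, (temps_f - 32) * 5 / 9, type = "n", axes = FALSE, xlab = "", ylab = "")
axis(4)                          # draw the second Y axis on the right
mtext("Celsius", side = 4, line = 3)
```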


Monday, July 7, 2014

Creating and Accessing SQL Databases with Python

Python has numerous ways of outputting and storing data.  Recently, I investigated using shelves in Python. This is a way to store indexed data that uses command syntax similar to that of Python dictionaries, but I found it too time consuming to create shelves of large datasets.  Searching for a way to efficiently build databases in Python, I came across its SQL functionality. The sqlite3 library is an indispensable way to create databases in Python.  This package permits Python users to create and query large databases using syntax borrowed from SQL.  SQL stands for Structured Query Language and is used for managing data held in a relational database management system. The sqlite3 module exposes a SQLite database through an interface compliant with the DB-API 2.0 specification, using a slight variant of standard SQL. As a quick reference, I thought I would create an example script that could be used to build a SQL database using the Python programming language.  Below is a simple tutorial to follow that hopefully is useful for learning how to use the sqlite3 package.
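A minimal sketch of the workflow (the table name and SNP records here are made up):

```python
import sqlite3

# Create an in-memory database; pass a filename instead to persist to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Build a table of hypothetical SNP association results
cur.execute("CREATE TABLE snps (rsid TEXT, chrom TEXT, pvalue REAL)")
rows = [("rs1", "chr1", 0.04), ("rs2", "chr2", 0.51), ("rs3", "chr8", 0.003)]
cur.executemany("INSERT INTO snps VALUES (?, ?, ?)", rows)
conn.commit()

# Query with standard SQL syntax
cur.execute("SELECT rsid FROM snps WHERE pvalue < 0.05 ORDER BY pvalue")
hits = [r[0] for r in cur.fetchall()]
print(hits)  # ['rs3', 'rs1']
conn.close()
```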

Tuesday, June 10, 2014

Easily Parse XML Files in Python

While not the most popular means of storing data, there are instances where data is stored in the XML file format.  The TCGA project is one such group that keeps a record of samples and analysis information in this format.  It can be difficult to extract data from these files without the appropriate tools.  In some cases a simple grep command can be used, but even then the output usually needs to be cleaned.  Luckily, Python has some packages that aid in parsing XML files so that the desired data can be extracted.  The library I found most useful was ElementTree.  This package combined with the urllib2 package enables one to download an XML file remotely from the internet, parse the XML file, and extract the desired information from the data stored within the XML file.  Below is an example Python script that downloads an XML description file from TCGA (link) and then places each extracted element into a Python dictionary object.  Of course this could be easily modified for your particular application of interest, but at least it provides a simple backbone to easily build other scripts off of.
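A minimal sketch of the parsing step (the XML document here is a made-up stand-in for the TCGA description file, inlined so the example is self-contained; in the full script the file is first fetched from the TCGA URL):

```python
import xml.etree.ElementTree as ET

# Made-up XML standing in for a downloaded TCGA sample description
xml_text = """
<sample>
  <barcode>TCGA-XX-0001</barcode>
  <disease>LUAD</disease>
  <platform>Illumina</platform>
</sample>
"""

root = ET.fromstring(xml_text)

# Place each child element's tag and text into a Python dictionary
info = {child.tag: child.text for child in root}
print(info)
```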


Monday, June 2, 2014

Make Venn Diagram in R with Correctly Weighted Areas

Venn diagrams are incredibly intuitive plots that visually display the overlap between groups.  There are a host of programs out there that make Venn diagrams, but very few actually weight the areas correctly to scale.  In my opinion, this is unacceptable in the 21st century.  I did stumble across one useful R package that calculates the appropriate area for intersections.  It is the Vennerable package.  While documentation and options are rather light for this package, it does what it needs to do: correctly size overlapping regions.

It's a little tricky to install this Venn diagram package.  To do so, follow the steps below:
(1) Type setRepositories() in the R command console.
(2) Select R-forge, rforge.net, CRAN extras, BioC software, and BioC extras in the pop-up window and then press OK.  I just have all of them selected.
(3) Install the Vennerable package by typing install.packages("Vennerable", repos="http://R-Forge.R-project.org").
(4) Install the dependencies by typing install.packages(c("graph", "RBGL"), dependencies=TRUE).

This should have you up and running.  The R Vennerable help page can be accessed here, but it's really not that useful.  Below are some examples you can run as a bit of a primer.  The trickiest thing to learn is how the weights are assigned in the Venn plot.  As a general rule of thumb, if SetNames=c("A","B","C"), then Weight=c(notABC, A, B, AB, C, AC, BC, ABC).  It's a bit frustrating to have no control over general plotting details such as color, label location, and rotation, but I guess I can live with these things for now.  Also of note, the plotting algorithm tries its best to converge on a Venn diagram that fits circle areas to your data, but if for some reason it can't, there is no guarantee the plot will match your data, so double check things thoroughly!  Here are some examples below.  If you know of some new or better options out there, please let me know in the comments section.  Enjoy.
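A sketch of the two- and three-set cases (the weights are made-up counts):

```r
library(Vennerable)

# Two-set Venn: for SetNames=c("A","B"), Weight=c(notAB, A, B, AB)
v2 <- Venn(SetNames = c("A", "B"), Weight = c(0, 40, 25, 10))
plot(v2, doWeights = TRUE)

# Three-set Venn: Weight=c(notABC, A, B, AB, C, AC, BC, ABC)
v3 <- Venn(SetNames = c("A", "B", "C"),
           Weight = c(0, 30, 20, 10, 15, 5, 8, 3))
plot(v3, doWeights = TRUE)
```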





Tuesday, April 1, 2014

Create Empty Data Frame in R with Specified Dimensions

Sometimes it is necessary to create an empty data frame in R to fill with output.  An example would be output from a for loop that loops over SNPs and calculates an association p-value to fill into the empty data frame.  The matrix functionality in R makes this easy to do.  The code below will create a 100 by 2 matrix and convert it into a data frame with 100 rows and 2 column variables.  The nrow and ncol options are what designate the size of the resulting data frame.  The data frame is filled with NA values so that missing values are handled appropriately.  Below is some example code:
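A sketch of the approach (the column names are arbitrary):

```r
# Create a 100 by 2 data frame pre-filled with NA values
results <- data.frame(matrix(NA, nrow = 100, ncol = 2))
names(results) <- c("snp", "pvalue")

dim(results)  # 100 rows, 2 columns

# Fill rows as a loop progresses, e.g.:
results$snp[1] <- "rs123"
results$pvalue[1] <- 0.05
```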

Wednesday, January 29, 2014

Bar Plot with 95% Confidence Interval Error Bars in R

R is a great plotting program for making publication quality bar plots of your data.  I often need to make quick bar plots to help collaborators quickly visualize data, so I thought I would put together a very generalized script that can be modified and built on as a template to make future bar plots in R.

In this script you manually enter your data (as you would see it in a 2x2 table) and then calculate the estimated frequency and 95% CI around that frequency using the binom.confint function in the binom package.  Next, parametric and non-parametric p-values are calculated with the binom.test and fisher.test commands, respectively.  These statistics are then plotted using R's barplot function.  The example code is below along with what the plot will look like.  P-values and N's are automatically filled and the y limits are calculated to ensure the graph includes all the plotted details.
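A sketch of the core steps (the counts are made up):

```r
library(binom)

# Made-up 2x2 data: successes and totals for two groups
x <- c(30, 45)
n <- c(100, 100)

# Estimated frequencies with exact 95% confidence intervals
ci <- binom.confint(x, n, methods = "exact")

# Non-parametric p-value for the group difference
p <- fisher.test(matrix(c(x, n - x), nrow = 2))$p.value

# Bar plot with error bars drawn as capped arrows
bp <- barplot(ci$mean, names.arg = c("Group 1", "Group 2"),
              ylab = "Frequency", ylim = c(0, max(ci$upper) * 1.2))
arrows(bp, ci$lower, bp, ci$upper, angle = 90, code = 3, length = 0.1)
text(mean(bp), max(ci$upper) * 1.1, paste("P =", signif(p, 2)))
```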



Wednesday, January 15, 2014

Statistical Package Alternatives to SAS

Let's face it, SAS (originally Statistical Analysis System) is an archaic program that is difficult to learn, cumbersome to code, and slow to implement calculations.   While graphics are improving, graphs are still hard to customize and lag far behind those of other programs.  In addition, yearly licensing fees were more than enough to scare me away, especially since there are a lot of open source statistical programming packages that are just as good, if not better than SAS.

So what are these alternatives to SAS?  Here is my attempt to make a list of valid alternatives to SAS.  Some may be more powerful or better tailored to specific niches than others, but all are capable of carrying out basic statistical tests.  The list is alphabetical with major SAS contenders highlighted in bold.

Epi Info - simple CDC tool for clinicians
Excel - basic, but there are some useful plugins
gretl - open source with econometrics focus
KXEN - seems more geared towards industry
Mathematica - used it doing physics research, seemed powerful at the time, but a long time ago
MATLAB - again, maybe more for engineers and such, but has statistical functionality
Minitab - quite accessible to those with little computer knowledge
Octave - open source, similar to MATLAB
OpenEpi - online, Java-based statistical applets
Python - more of a programming language, but has statistical packages you can load
R - open source, thousands of libraries and active community of users
S-Plus - similar to R, S based dialect, could not find a link to this
Sage - unified interface of open source packages
SPSS - point and click, with some scripting options, crashes a lot
Stata - command line or point and click, powerful but not bloated
STATISTICA - markets themselves as a user friendly SAS alternative

That's about as exhaustive of a list as I can come up with.  If I missed something important please comment and let me know about it.  Hopefully I have been able to convince you there are plenty of excellent alternative statistical programming packages to SAS that are worth your time looking into.  In my opinion, R and Stata are the two clear front runners, particularly R since it is free and open source.  Happy hunting for a SAS replacement!

Tuesday, January 14, 2014

Remove Rows with NA Values From R Data Frame

Rows with NA values can be a pesky nuisance when trying to analyze data in R. Here is a short primer on how to remove them.

There are two primary options when getting rid of NA values in R, the na.omit/is.na commands and the complete.cases command.  Both are part of the base stats package and require no additional library or package to be loaded.  Below are examples of how the two work with a data frame called data and a variable called var.


The na.omit/is.na commands work as follows:
na.omit(data) - will only select rows with complete data in all columns
data[rowSums(is.na(data[,c(2,3,5)]))==0,] - will only select rows with complete data in columns 2, 3, and 5
var[!is.na(var)] - will only select values of a variable not equal to NA


The complete.cases command works as follows:
data[complete.cases(data),] - will only select rows with complete data in all columns
data[complete.cases(data[,c(2,3,5)]),] - will only select rows with complete data in columns 2, 3, and 5
var[complete.cases(var)] - will only select values of a variable not equal to NA
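A quick made-up example showing both approaches in action:

```r
# Small data frame with scattered NA values
data <- data.frame(id = 1:5,
                   x  = c(1, NA, 3, 4, NA),
                   y  = c(NA, 2, 3, 4, 5))

na.omit(data)                 # keeps rows 3 and 4 only
data[complete.cases(data), ]  # same rows, complete.cases syntax
data[!is.na(data$x), ]        # keeps rows where x alone is non-missing
```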


I use both commands at times, but ultimately prefer the complete.cases command for the cleaner syntax and generalizability.  Hope this helps you remove those NA's from your data.  If you have additional tips or questions please leave a comment below.

Thursday, December 12, 2013

Create Triangle Plot from Inferred Genetic Ancestry

I previously posted on how to infer ancestry for a group of study participants using SNP genotypes.  Today, I want to visually plot some of the output in R.  Two informative plots that can be generated from the output are a standard plot of the two eigenvectors with percent ancestry overlaid and a triangle (or ternary) plot with each axis representing percentage of one of the three ancestral populations (ex: European, African, and Asian).

Here is some simple code to plot this in R.  There is no base package to plot the triangle plot, so the plotrix package will need to first be installed.  The ancestry.txt file is the output file from SNPWEIGHTS, but other output could be formatted to work as well.
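A sketch using the triax.plot function from plotrix (the ancestry fractions below are made up; in practice read them from the SNPWEIGHTS output, e.g. with read.table("ancestry.txt", header=TRUE), adjusting column names to match your file):

```r
library(plotrix)

# Made-up ancestry fractions, one row per subject; each row sums to 1
anc <- data.frame(EUR = c(0.90, 0.20, 0.40),
                  AFR = c(0.05, 0.70, 0.30),
                  ASN = c(0.05, 0.10, 0.30))

# Triangle (ternary) plot with one axis per ancestral population
triax.plot(as.matrix(anc), main = "Inferred genetic ancestry", pch = 16)
```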





Wednesday, November 20, 2013

Remove Duplicate Records from Excel Spreadsheet

I stray away from using Microsoft Excel for data management, but sometimes it is necessary to do quick manipulations on a relatively small dataset.  For this purpose, Excel shines.  One task I assumed Excel could do, but never knew exactly how to do, was filtering out duplicate records from a dataset.  After some Googling, I found how to do this:

(1) Select the column headers and rows in the worksheet you want to remove duplicate records from.
(2) Go to the Data tab and select Advanced Filter.
(3) For action click the Filter the list, in-place radio button and check the box for Unique records only.
(4) Press OK.

You should now have a filtered worksheet with duplicate rows removed.  This can be copied and pasted to a new worksheet for further manipulation.  Hope this was helpful.  If you have improvements on the method or further questions, please add them in the comments section below.

Wednesday, August 14, 2013

UCSC Tools for BigWig Files

Several UCSC data tracks are distributed in BigWig format (.bw).  This is a compressed format that cannot be read by common text editors.  Rather, you need to use utilities built to handle these files.  Several are available at the FTP site: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/.  Here are a few UNIX programs I have found useful:

bigWigAverageOverBed - averages bigWig values over each region in a BED file
bigWigSummary - extracts summary statistics from a region of a bigWig file
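Typical invocations look like this (the file names and coordinates are made up; run each tool with no arguments to see its full usage message):

```shell
# Summary statistics over a region: file, chrom, start, end, dataPoints
bigWigSummary track.bw chr8 128362121 129551294 1

# Average signal over each interval in a BED file, written to out.tab
bigWigAverageOverBed track.bw regions.bed out.tab
```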

Download Data Track from UCSC Genome Browser

Did you know you can download tracks you visualize in the UCSC Genome Browser for personal use and analysis?  Here's how to do it.

(1) Click on the grey bar on the far left side of the UCSC data track.  This will bring you to the track settings page.
(2) Click on the link called View table schema.  This will bring up a new page with track information and a description of data fields.
(3) Look for the field called Primary Table and copy the name.
(4) Go to the UCSC FTP site (link) and find the correct genome build you are after.  Usually you will want to select hg19 or hg18.
(5) Click on the database link and then search for the name of the field you copied from the Primary Table field in step 3.
(6) There will usually be a .sql and a .txt.gz for most tracks.  You are interested in the .txt.gz file.  You can click on it to download via your internet web browser or right click on the link to copy the web address and use the wget command to download it.  Here's an example script to download the NHGRI GWAS catalog using the wget command:
(7) Extract the compressed .txt.gz file with the following command, where filename is the name of the file you downloaded.
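Steps (6) and (7) together might look like this for the GWAS catalog (the hg19 path shown is the usual location of that table, but verify it against the FTP site):

```shell
# Download the NHGRI GWAS catalog table from the UCSC FTP site (hg19)
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gwasCatalog.txt.gz

# Extract the compressed file, leaving gwasCatalog.txt
gunzip gwasCatalog.txt.gz
```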

This method should work for downloading the majority of the UCSC data tracks.  Sometimes it takes a bit of digging around the UCSC FTP site to find the dataset you are looking for, but in most cases I have been successful in finding it on the UCSC FTP site.

One final note:  If you are interested in downloading only a small portion of the track (for example, just a region on chromosome 8), you can download this region using the UCSC Table browser.  Here's how to do this:

(1) Follow steps (1) and (2) above.
(2) Once on the Table Schema page for the track of interest go to the link bar on the top of the page and select Tools > Table Browser.  This will take you to the UCSC Table Browser where all the fields will already be filled in with the track you are interested in.
(3) To download your region of interest click on the radio button next to position and type in your desired coordinates (ex: chr8:128362121-129551294).
(4) Make sure all the filters are cleared and give your output a filename.  Select get output and your file will be downloaded.  There is no need to unzip unless you chose the gzip compressed option.

Best wishes and good luck analyzing UCSC data tracks!