Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Showing posts with label R. Show all posts

Tuesday, March 15, 2016

R Function to Calculate Confidence Interval from Odds Ratio and P-value

Every once in a while I need a confidence interval for a published estimate that reports only the odds ratio and its p-value. With a little bit of algebra it is simple to build an R function to calculate the confidence interval. Below is a short, easy example script.
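The algebra works like this: a minimal sketch of such a function, assuming the p-value is two-sided and comes from a Wald test of the log odds ratio, so the implied z-score is qnorm(1 - p/2) and the standard error is |log(OR)| divided by that z-score.

```r
# Recover a Wald confidence interval from an odds ratio and its
# two-sided p-value (assumes the p-value is from a Wald test of log(OR))
or_ci <- function(or, p, conf = 0.95) {
  z    <- qnorm(1 - p / 2)            # z-score implied by the p-value
  se   <- abs(log(or)) / z            # standard error of log(OR)
  crit <- qnorm(1 - (1 - conf) / 2)   # critical value for the CI
  exp(log(or) + c(-1, 1) * crit * se)
}

or_ci(1.5, 0.04)   # approximately 1.02 to 2.21
```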

Friday, January 29, 2016

Normal Distribution Functions in R

I always need to look up how to use the distributional functions in R. Rather than it always being a guessing game, I made a quick primer with visual example plots of what each command means and what each command actually returns.

The below examples assume a standard normal distribution, N(0,1). A link to the R manual is here.

dnorm - returns the height of the normal density curve at a specified value along the x-axis
pnorm - the cumulative distribution function (CDF); returns the area to the left of a specified value
qnorm - returns quantiles, or "critical values", for specified probabilities
rnorm - generates random numbers from the normal distribution
These are examples for the normal distribution, but as you might imagine, R has analogous commands for numerous other distributions, such as the chi-squared, beta, uniform, and Poisson.
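A few one-liners show what each of the four commands returns:

```r
# The four normal distribution functions, for N(0, 1)
dnorm(0)       # density height at x = 0: 0.3989423
pnorm(1.96)    # area to the left of 1.96: ~0.975
qnorm(0.975)   # critical value cutting off the upper 2.5% tail: ~1.96
rnorm(5)       # five random draws from N(0, 1)
```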

If you are curious, here are the commands I used to plot these figures:
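The original commands are not reproduced here, but a hypothetical reconstruction along these lines produces similar figures:

```r
# Sketch of plots like the originals (not the author's exact code)
curve(dnorm(x), from = -3, to = 3, main = "dnorm: density of N(0,1)")
curve(pnorm(x), from = -3, to = 3, main = "pnorm: CDF of N(0,1)")
hist(rnorm(1000), main = "rnorm: 1000 random draws from N(0,1)")
```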

Thursday, May 7, 2015

Break Age Variable into Age Groups in R

I recently discovered the cut function in R, a useful tool for dividing a variable into groups. This is really handy for dividing age into age groups, but it can also be used for a variety of other variable types.
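A minimal sketch with hypothetical ages and break points:

```r
# Hypothetical ages
age <- c(5, 17, 18, 34, 35, 64, 65, 90)

# breaks define the intervals (0,17], (17,34], (34,64], (64,Inf]
age_group <- cut(age,
                 breaks = c(0, 17, 34, 64, Inf),
                 labels = c("0-17", "18-34", "35-64", "65+"))
table(age_group)   # two ages fall in each group
```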

Friday, November 21, 2014

Convert 2 by 2 Contingency Table to Data Frame in R

Contingency tables are useful shorthand ways of inputting and visualizing data. I have yet to find an easy way to convert between contingency tables and data frames in R. Below is a short script in which I input a contingency table, create a function to convert the 2 by 2 table to a data frame, and convert the data frame back to a table. These operations are useful for running statistical procedures that only work on tables or only work on data frames. Hope the example below is useful.
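A sketch of the round trip, using a hypothetical 2 by 2 table of counts:

```r
# Hypothetical 2x2 contingency table
tab <- as.table(matrix(c(10, 20, 30, 40), nrow = 2,
                dimnames = list(exposure = c("yes", "no"),
                                outcome  = c("case", "control"))))

# Expand the table to a data frame with one row per observation
tab_to_df <- function(t) {
  df <- as.data.frame(t)                     # counts in long format
  df[rep(seq_len(nrow(df)), df$Freq), 1:2]   # repeat each row by its count
}
df <- tab_to_df(tab)

# And back again: table() on the two factors recovers the counts
tab2 <- table(df$exposure, df$outcome)
```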

Tuesday, November 4, 2014

Fix Origin of R Plot Axes at Zero

By default, R plots do not place the origin of the x and y axes at zero. To fix the origin of the plot at (0,0), simply use the xaxs and yaxs parameters. Here is a description from the R graphical parameters help page.

The style of axis interval calculation to be used for the x-axis. Possible values are "r", "i", "e", "s", "d". The styles are generally controlled by the range of data or xlim, if given.
Style "r" (regular) first extends the data range by 4 percent at each end and then finds an axis with pretty labels that fits within the extended range.
Style "i" (internal) just finds an axis with pretty labels that fits within the original data range.
Style "s" (standard) finds an axis with pretty labels within which the original data range fits.
Style "e" (extended) is like style "s", except that it also ensures that there is room for plotting symbols within the bounding box.
Style "d" (direct) specifies that the current axis should be used on subsequent plots.
(Only "r" and "i" styles have been implemented in R.)

Finally, here is a brief example code snippet to demonstrate how the syntax works.
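A minimal sketch of the syntax, using the "i" (internal) style so the axes meet exactly at the data limits:

```r
x <- 0:10
y <- x^2

# Default "r" style: axes extended 4% beyond the data range
plot(x, y)

# "i" style: axes meet at the origin when the limits start at 0
plot(x, y, xaxs = "i", yaxs = "i", xlim = c(0, 10), ylim = c(0, 100))
```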

R Syntax for a Simple For Loop

For loops in R are useful tidbits of code for keeping track of a counter, iterating over a vector, or performing an operation on a subset of variables.  Below is an example R script showing the code and syntax needed to do some simple tasks with R for loops.
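A sketch covering the three tasks mentioned above:

```r
# Keep track of a counter
total <- 0
for (i in 1:10) {
  total <- total + i
}
total   # 55

# Iterate over the elements of a vector
for (name in c("a", "b", "c")) {
  print(name)
}

# Operate on a subset of columns in a data frame
df <- data.frame(x = 1:5, y = 6:10, z = 11:15)
for (col in c("x", "z")) {
  print(mean(df[[col]]))
}
```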

Friday, October 31, 2014

Import Variable into R Script

I always forget the syntax for passing a variable into an R script. It is relatively easy to import an external variable (such as a Unix environment variable or a user-defined value) into R. The option to feed R a variable is incredibly useful on a cluster system where you want to farm out a large job in smaller pieces. Here is example code for doing so:
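A minimal sketch using commandArgs, with a hypothetical script name and argument:

```r
# contents of a hypothetical my_script.R
args  <- commandArgs(trailingOnly = TRUE)  # arguments after the script name
chunk <- as.numeric(args[1])               # e.g. which piece of the job to run
```

It could then be called as, for example, `Rscript my_script.R 3`, in which case args[1] is the character string "3".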



Tuesday, August 26, 2014

Easy Access to TCGA Data in R

Today I discovered that the Memorial Sloan Kettering folks at cBioPortal have made it super easy to access The Cancer Genome Atlas (TCGA) data in R.  No need to download massive amounts of data, extract needed files, and link data types together. Rather, they have a public data portal that R can easily interact with using the cgdsr library.  Data queries can be made for all TCGA cancer tissue types across a multitude of different data types (e.g., genotyping, sequencing, copy number, mutation, methylation). Clinical data is even available for many of the cancer samples.

To install the library, open your R terminal and type: install.packages("cgdsr"). Alternatively, to install on a UNIX cluster account, see this link.

The main tools you will use are:
getCancerStudies - shows all the studies included in cBioPortal (not limited to TCGA data), with descriptions. Use this to find a study you are interested in and its study id.
getCaseLists - divides a study into case sets based on the data available for its samples. Use this to select the cases you are interested in querying.
getGeneticProfiles - shows the genetic data types available for a study.  Use this to select the type of data you are interested in querying.
getProfileData - performs the query on the case samples and genetic data of interest.
getClinicalData - extracts clinical data for a list of cases.

Here is an example script that can be used as a scaffold to design queries of the TCGA (as well as data from the BROAD and other places). It can be used to access all types of genetic data stored in the cBioPortal repository, but this particular example focuses on finding mutations within the BRCA1 gene in lung cancer cases.
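A sketch of the query sequence, assuming the cgdsr package is installed and the cBioPortal web service is reachable; "luad_tcga" (TCGA lung adenocarcinoma) is a hypothetical study id of the kind getCancerStudies would reveal. Note the cgdsr package and the classic cBioPortal API have since been retired, so treat this as illustrative:

```r
library(cgdsr)

mycgds <- CGDS("http://www.cbioportal.org/")

studies  <- getCancerStudies(mycgds)                 # browse studies and ids
caselist <- getCaseLists(mycgds, "luad_tcga")[1, 1]  # first case set
profiles <- getGeneticProfiles(mycgds, "luad_tcga")  # available data types

# BRCA1 mutations in the lung adenocarcinoma cases
mut  <- getProfileData(mycgds, "BRCA1", "luad_tcga_mutations", caselist)
clin <- getClinicalData(mycgds, caselist)
```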

Friday, August 22, 2014

Create Multiple Y Axes for an R Plot

R base graphics is a powerful tool for plotting data.  Sometimes it is convenient to visualize combinations of variables on the same plot, and often different variables require different scales.  This can be facilitated by using different Y axes, such as a second Y axis on the right-hand side.  In R, adding a Y axis is very easy to do.  Here are a few simple example scripts you can build off of.  The first two examples show two different ways of displaying different scales for the same variable (i.e., temperature in Fahrenheit and Celsius).  The third example overlays two different variables on the same R plot, with two Y axes showing the scale of each.
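A sketch of the overlay approach with two hypothetical variables on different scales:

```r
# Two variables on one plot with a second Y axis on the right
x  <- 1:10
y1 <- x^2       # first variable, left axis
y2 <- log(x)    # second variable, much smaller scale, right axis

par(mar = c(5, 4, 4, 4))               # leave room in the right margin
plot(x, y1, type = "l", ylab = "y1")
par(new = TRUE)                        # overlay a second plot
plot(x, y2, type = "l", lty = 2, axes = FALSE, xlab = "", ylab = "")
axis(4)                                # draw the second axis on side 4
mtext("y2", side = 4, line = 2)        # label the right-hand axis
```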


Monday, August 18, 2014

Rounding in R: How to Keep Trailing Zeros

The round function in R is an easy way to round a number to a desired number of digits after the decimal point.  When the final digit of the number happens to be a zero, R gets a little aggressive and drops the last zero as well.  If you want to neatly print output, this behavior can be a bit frustrating. I wish there was an option for the round function to fix this behavior for values ending in zero, but I couldn't find anything. Luckily, there is an easy workaround: the base R sprintf function.  Here is a quick example script of how to round a number and keep trailing zeros (note: this converts the number to a character format).
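For example:

```r
round(3.098, 2)          # 3.1  -- the trailing zero is dropped
sprintf("%.2f", 3.098)   # "3.10" -- trailing zero kept, now a character
sprintf("%.3f", 2)       # "2.000" -- pads with zeros as needed
```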

Tuesday, August 5, 2014

How to Calculate Genomic Inflation Factor and λgc for GWAS

You have conducted your genome-wide association study (GWAS) and have tested each genetic variant for an association with your trait of interest. Now it is time to investigate whether there are any systematic biases present in your association results. A common way to do this is to calculate the genomic inflation factor, also known as lambda gc (λgc). λgc is defined as the median of the observed chi-squared test statistics divided by the expected median of the chi-squared distribution (the median of a chi-squared distribution with one degree of freedom is 0.4549364). A λgc value can be calculated from z-scores, chi-squared statistics, or p-values, depending on the output you have from the association analysis. Follow these simple steps to calculate lambda gc using the R programming language.

(1) Convert your output to chi-squared values
(2) Calculate lambda gc (λgc)
If your association results follow the expected chi-squared distribution, the λgc value will be approximately 1. If λgc is greater than 1, this may be evidence of systematic bias that needs to be corrected in your analysis.
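The two steps can be sketched as follows, here with hypothetical p-values simulated under the null:

```r
# Hypothetical GWAS p-values under the null
set.seed(42)
pvals <- runif(100000)

# (1) Convert p-values to 1-df chi-squared statistics
chisq <- qchisq(pvals, df = 1, lower.tail = FALSE)
# (from z-scores instead, use: chisq <- z^2)

# (2) Lambda gc: observed median over expected median (0.4549364)
lambda <- median(chisq) / qchisq(0.5, df = 1)
lambda   # approximately 1 for null p-values
```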

Wednesday, June 4, 2014

Easy Forest Plots in R


Forest plots are great ways to visualize individual group estimates as well as investigate heterogeneity of effect.  Most forest plot programs will display combined effect estimates and give you an indicator of whether there is evidence for heterogeneity among subgroups.  Fortunately, the R metafor package makes meta-analysis and plotting forest plots relatively easy.  Below is a sample R script that will combine beta and standard error estimates from an input file (input.txt) and create simple forest plots with an overall estimate as well as p-values for association and heterogeneity.

In general, the input.txt file should either

(1) Have the columns:
group - name for the study or group
beta - the log odds ratio for the effect of interest
se - the standard error for the log odds ratio estimate

or

(2) Have the columns:
group - name for the study or group
OR - odds ratio for the effect of interest
LCL - lower confidence interval for the odds ratio
UCL - upper confidence interval for the odds ratio

For the second case, where you have an odds ratio and 95% confidence estimates, beta and se need to be estimated.  This is done by uncommenting lines 8 and 9 of the script.  Of note: due to rounding error, the final forest plot may have 95% CI limits that are off by one in the last digit.

The R script that uses the metafor package as well as an example input.txt are below.
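A sketch of the approach, with hypothetical inline data in place of input.txt (columns group, beta, and se as described above), assuming the metafor package is installed:

```r
library(metafor)

# Hypothetical data; normally read from input.txt with read.table
dat <- data.frame(group = c("Study 1", "Study 2", "Study 3"),
                  beta  = c(0.20, 0.50, 0.30),
                  se    = c(0.10, 0.20, 0.15))

# If you have OR, LCL, UCL instead, derive beta and se first:
# dat$beta <- log(dat$OR)
# dat$se   <- (log(dat$UCL) - log(dat$LCL)) / (2 * qnorm(0.975))

res <- rma(yi = beta, sei = se, data = dat, method = "FE")  # fixed effect
summary(res)   # overall estimate, association and heterogeneity p-values
forest(res, slab = dat$group, atransf = exp)   # plot on the OR scale
```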




Monday, June 2, 2014

Wald Test Statistic

Wald tests are simple statistical tests for determining if an estimated parameter value is significantly different from zero.  The test can take two forms.  The first is where the estimate (minus zero) is divided by its standard error.  The resulting statistic is a z-score that follows a standard normal distribution from which a p-value can be calculated.  The second form is where the difference between the parameter estimate and zero is squared and divided by the variance of the parameter.  This produces a statistic that follows a chi-squared distribution with one degree of freedom.  The two forms of the Wald test statistic are below.

(1)
W = (β̂ − β₀) / se(β̂)  ~  N(0, 1)
(2)
W² = (β̂ − β₀)² / Var(β̂)  ~  χ²(1)

Programs like R, Microsoft Excel, or even online widgets can be used to calculate p-values from a z-score or chi-squared statistic.  Here's how to do it.

R:
P-value=2*(1-pnorm(abs(W)))
P-value=1-pchisq(W2,1)

MS Excel:
P-value=2*(1-NORMDIST(ABS(W),0,1,TRUE))
P-value=CHIDIST(W2,1)
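A quick worked example in R, with a hypothetical estimate β̂ = 0.40 and standard error 0.15, shows that the two forms agree:

```r
W <- 0.40 / 0.15                   # z-score

p_z   <- 2 * (1 - pnorm(abs(W)))   # p-value from the normal form
p_chi <- 1 - pchisq(W^2, df = 1)   # p-value from the chi-squared form

p_z   # ~0.0077; identical to p_chi
```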

Make Venn Diagram in R with Correctly Weighted Areas

Venn diagrams are incredibly intuitive plots that visually display the overlap between groups.  There are a host of programs out there that make Venn diagrams, but very few actually weight the areas correctly to scale.  In my opinion, this is unacceptable in the 21st century.  I did stumble across one useful R package that calculates the appropriate area for intersections.  It is the Vennerable package.  While documentation and options are rather light for this package, it does what it needs to do: correctly size overlapping regions.

It's a little tricky to install this Venn diagram package.  To do so, follow the steps below:
(1) Type setRepositories() in the R command console.
(2) Select R-forge, rforge.net, CRAN extras, BioC software, and BioC extras in the pop-up window and then press OK.  I just have all of them selected.
(3) Install Vennerable package by typing install.packages("Vennerable", repos="http://R-Forge.R-project.org")
(4) Install the dependencies by typing install.packages(c("graph", "RBGL"), dependencies=TRUE).

This should have you up and running.  The R Vennerable help page can be accessed here, but it's really not that useful.  Below are some examples you can run as a bit of a primer.  The trickiest thing to learn is how the weights are assigned in the Venn plot.  As a general rule of thumb, if SetNames=c("A","B","C"), then Weight=c(notABC, A, B, AB, C, AC, BC, ABC).  It's a bit frustrating to have no control over general plotting details such as color, label location, and rotation, but I guess I can live with these things for now.  Also of note: the plotting algorithm tries its best to converge on a Venn diagram that fits the circle areas to your data, but if for some reason it can't, there is no guarantee the plot will match your data, so double check things thoroughly!  Here are some examples below.  If you know of newer or better options out there, please let me know in the comments section.  Enjoy.
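A sketch of the weight ordering rule with hypothetical counts, assuming Vennerable is installed per the steps above:

```r
library(Vennerable)

# For SetNames = c("A", "B", "C") the weights are ordered:
# c(notABC, A, B, AB, C, AC, BC, ABC)
V <- Venn(SetNames = c("A", "B", "C"),
          Weight = c(0, 10, 8, 4, 6, 3, 2, 1))
plot(V, doWeights = TRUE)   # regions scaled to the weights
```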



The output looks like this:


Tuesday, April 1, 2014

Create Empty Data Frame in R with Specified Dimensions

Sometimes it is necessary to create an empty data frame in R to fill with output.  An example would be output from a for loop that loops over SNPs and calculates an association p-value to fill into the empty data frame.  The matrix functionality in R makes this easy to do.  The code below will create a 100 by 2 matrix and convert it into a data frame with 100 rows and 2 column variables.  The nrow and ncol options are what designate the size of the resulting data frame.  The data frame is filled with NA values so that missing values are handled appropriately.  Below is some example code:
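A minimal sketch; the column names are hypothetical:

```r
# 100-row, 2-column data frame pre-filled with NA
df <- data.frame(matrix(NA, nrow = 100, ncol = 2))
names(df) <- c("snp", "pvalue")   # hypothetical column names
dim(df)   # 100 2

# Fill rows as results are produced, e.g. inside a loop over SNPs
df[1, ] <- c("rs123", 0.05)
```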

Tuesday, March 25, 2014

Calculate P-value for Linear Mixed Model in R

The lme4 R package is a powerful tool for fitting mixed models in R where you can specify fixed and random effects.  One oddity about the program is that it returns t statistics, but no p-values.  Getting the p-value takes a little extra coding.  Here is a quick example for fitting a linear mixed model in R (using lmer) and then the added code to calculate p-values from the t statistics.  Either p1 or p2 is an acceptable p-value to use.
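A sketch using lme4's built-in sleepstudy data, assuming the package is installed; the degrees-of-freedom choice for p2 is a rough approximation, not lme4's recommendation:

```r
library(lme4)

m <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)
tvals <- summary(m)$coefficients[, "t value"]   # t statistics, no p-values

# p1: treat the t statistic as a z-score (reasonable with many observations)
p1 <- 2 * (1 - pnorm(abs(tvals)))

# p2: t distribution with a rough df (observations minus fixed effects)
p2 <- 2 * (1 - pt(abs(tvals), df = nrow(sleepstudy) - length(tvals)))
```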

Wednesday, March 12, 2014

Estimate Combined Percent Familial Risk Explained by GWAS Loci

Most current genome-wide association studies (GWAS) include a calculation of the percent familial risk the discovered loci explain.  This calculation indicates how much of the familial risk can be accounted for by the known loci and is usually used as evidence that additional, as yet undetected, loci remain to be discovered.  Looking through references, it can be a bit difficult to find exactly how this calculation is performed.  One reference I found that includes a formula for the calculation is Cox et al. 2007 (PMID: 17293864), but I am sure there are plenty of others that also include a formula.  While the Cox et al. formula for calculating the familial relative risk due to each locus is arranged differently than the one below, the two are equivalent; I just find this arrangement less cumbersome to use.  The overall equation compares the cumulative risk of the known loci (the sum of log lambda k) to the estimated risk of a first-degree relative (log lambda 0),
where:
      p is the risk allele frequency for locus k
      r is the per allele odds ratio for locus k.

To make calculations easy, I made a simple R script that does all the calculations automatically.  The input for the script is a file with 3 columns:

(1) Annotation for the SNP - this can be anything, for example: RS number, chromosomal coordinates, etc.
(2) Risk allele frequency - this is the frequency of the risk allele (range: 0-1) equal to p in the above equation.
(3) Per allele odds ratio - odds ratio for every one unit increase in the number of risk alleles.

Note, the risk allele frequency is the frequency of the risk allele and not the minor allele frequency.  The program also needs an estimate of the familial relative risk (lambda 0).  This can usually be obtained from previous familial studies of the disease.

Here is the R script:

It can be run from the command line by the example command:

Rscript familial_risk_snps.R snp_lst.txt 4

where:
      familial_risk_snps.R is the name of the script.
      snp_lst.txt is the input file with three columns described above.
      4 is the estimate of the familial relative risk of the disease.

Wednesday, January 29, 2014

Bar Plot with 95% Confidence Interval Error Bars in R

R is a great plotting program for making publication quality bar plots of your data.  I often need to make quick bar plots to help collaborators quickly visualize data, so I thought I would put together a very generalized script that can be modified and built on as a template to make future bar plots in R.

In this script you manually enter your data (as you would see it in a 2x2 table) and then calculate the estimated frequency and 95% CI around that frequency using the binom.confint function in the binom package.  Next, parametric and non-parametric p-values are calculated with the binom.test and fisher.test commands, respectively.  These statistics are then plotted using R's barplot function.  The example code is below along with what the plot will look like.  P-values and Ns are automatically filled in, and the y limits are calculated to ensure the graph includes all the plotted details.
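A sketch of the approach with hypothetical counts; base R's binom.test is swapped in here for the 95% CIs so the example is self-contained (the post itself uses binom.confint from the binom package):

```r
# Hypothetical counts: x successes out of n in two groups
x <- c(30, 45)
n <- c(100, 100)
freq <- x / n

# Exact 95% CIs from base binom.test
ci <- sapply(seq_along(x), function(i) binom.test(x[i], n[i])$conf.int)

bp <- barplot(freq, names.arg = c("Group A", "Group B"),
              ylim = c(0, max(ci) * 1.2), ylab = "Frequency")
arrows(bp, ci[1, ], bp, ci[2, ], angle = 90, code = 3, length = 0.1)

# Non-parametric p-value comparing the two frequencies
fisher.test(rbind(x, n - x))$p.value
```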



Tuesday, January 28, 2014

Fisher Exact Test on 2 by N Table in R

Fisher exact tests are non-parametric tests of association that are usually recommended when cell counts fall below 5, since distributional assumptions of parametric tests typically do not hold in these scenarios.  Here are quick code snippets of how to set up a 2 by N table in R and conduct a Fisher exact test (fisher.test) on the table.

First off, the simplest scenario is a 2 by 2 table.  The first example shows how to manually set up a 2x2 table and then run the Fisher exact test.  The code generates a p-value and 95% confidence interval testing whether the observed odds ratio differs from the hypothesized odds ratio of 1.
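A sketch with hypothetical counts:

```r
# Manually set up a 2x2 table and run the Fisher exact test
tab <- matrix(c(8, 2, 1, 5), nrow = 2,
              dimnames = list(exposure = c("yes", "no"),
                              outcome  = c("case", "control")))
fisher.test(tab)   # p-value, odds ratio estimate, and 95% CI
```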

Alternatively, one can use the table command in R to circumvent the first step above and then apply the fisher.test command to the resulting table. Below is an example.
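A sketch starting from hypothetical raw observations:

```r
# table() builds the 2x2 table directly from the raw vectors
exposure <- c(rep("yes", 10), rep("no", 7))
outcome  <- c(rep("case", 8), rep("control", 2),
              rep("case", 1), rep("control", 6))
tab <- table(exposure, outcome)
fisher.test(tab)
```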


In addition, the fisher.test procedure can be expanded to 2xN tables by using a hybrid approximation of the exact distribution.  The command follows that used above and requires the additional option hybrid=TRUE to indicate R should use the hybrid approximation.  Without this option you will get the error:

Error in fisher.test(a) : FEXACT error 7.
LDSTP is too small for this problem.
Try increasing the size of the workspace.

 See the example below.


Note: This hybrid option does not seem to work on all versions of R.  If you still get the above error after using the hybrid option on a 2xN contingency table, try another version.  For example, on my PC R 3.0.1 (x64) does not work, but R 2.15.1 (i386) does.  This may have something to do with the default memory settings for each version.  If you know of other versions that do or don't work, please comment below.

A workaround when the hybrid method fails on 2xN contingency tables is to use the built-in simulation abilities of the fisher.test function to get an estimated p-value.  This uses Monte Carlo simulation and, as with all simulations, will be closer to the expected p-value with more iterations.  To use this simulated p-value approach, set the option simulate.p.value=T and define B, the number of simulations you wish to conduct.  I recommend 1e7 as a good starting point, which should take about a minute or so depending on the speed of your machine.  Below is example code to follow.
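A sketch of both options on a hypothetical 2x4 table:

```r
# Hypothetical 2x4 table
tab <- matrix(c(12, 5, 9, 14, 7, 11, 6, 13), nrow = 2)

# Hybrid approximation of the exact distribution
fisher.test(tab, hybrid = TRUE)

# Monte Carlo simulated p-value; larger B gives a more stable estimate
fisher.test(tab, simulate.p.value = TRUE, B = 1e5)
```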

Monday, January 27, 2014

Produce SAS Proc Freq Output in R

I have never been a huge fan or advocate of SAS, and actually recommend using other SAS alternatives, but I am somewhat addicted to the SAS Proc Freq procedure and its output tables.  Its a nice way to not only visualize the data but also to get some useful summary statistics.  I have been using the R table command for a while and in most cases combined with margin.table or prop.table it suffices to summarize the data.  Recently, I have been in need of summary statistics for the tables as well.  This can be accomplished easy enough using the R chisq.test or fisher.test commands, but still doesn't quite provide the fluid integration of data visualization and summary statistics that the SAS Proc Freq output provides.  Today I came across the R package CrossTable.  This procedure, part of the gmodels library, provides formatted output very similar to that of Proc Freq.  So much so, it even uses the same ascii characters to delineate cell boundaries.  There is a bit of playing around with options and such to get the exact statistics and percentages and such you would like, but overall a very nice (a not to mention free) alternative to SAS's Proc Freq output.  Below is an example table I created.