Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.

Tuesday, August 5, 2014

How to Calculate Genomic Inflation Factor and λgc for GWAS

You have conducted your genome-wide association study (GWAS) and tested each genetic variant for an association with your trait of interest. Now it is time to investigate whether any systematic biases may be present in your association results. A common way to do this is to calculate the genomic inflation factor, also known as lambda gc (λgc). λgc is defined as the median of the observed chi-squared test statistics divided by the expected median of the chi-squared distribution. The median of a chi-squared distribution with one degree of freedom is 0.4549364. A λgc value can be calculated from z-scores, chi-squared statistics, or p-values, depending on the output you have from the association analysis. Follow these simple steps to calculate λgc using the R programming language.

(1) Convert your output to chi-squared values
(2) Calculate lambda gc (λgc)
If your association results follow the expected chi-squared distribution, the λgc value will be approximately 1. If the λgc value is greater than 1, this may be evidence of systematic bias that needs to be corrected in your analysis.
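The two steps above can be sketched in R as follows. The p-values here are simulated under the null hypothesis purely for illustration; substitute the p-values from your own association results.

```r
# Minimal sketch: compute lambda GC from a vector of GWAS p-values.
# The p-values below are simulated under the null for illustration only.
set.seed(42)
pvalues <- runif(10000)

# (1) Convert p-values to 1-df chi-squared statistics
chisq <- qchisq(1 - pvalues, df = 1)

# (2) Lambda GC = median observed statistic / expected median (0.4549364)
lambda <- median(chisq) / qchisq(0.5, df = 1)
lambda   # close to 1 when there is no systematic bias
```

If you start from z-scores instead, square them to get the chi-squared statistics; if you start from chi-squared statistics, skip step (1).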

Wednesday, June 18, 2014

Test for a Difference in Two Odds Ratios

Testing for a statistical difference between two odds ratio estimates can be useful in determining if an association has statistically different effects in different groups or strata of a variable.  For example, maybe an association is stronger for older individuals than younger individuals.  To test for such a difference we need the odds ratio estimate (or more precisely the natural log of the odds ratio estimate, aka the beta estimate from a logistic regression) and the standard error of the log odds ratio.

If you don't have access to the primary data and need to estimate the standard error from a 95% confidence interval (95% CI), see this blog entry.  If you have forgotten how to calculate the standard error of the log odds ratio use this formula:
SE(logOR) = √(1/n1 + 1/n2 + 1/n3 + 1/n4), where n1 through n4 are the four cell counts of the 2x2 table.

To test if two odds ratios are significantly different and get a p-value for the difference follow these steps:
(1) Take the absolute value of the difference between the two log odds ratios. We will call this value δ.
(2) Calculate the standard error for δ, SE(δ), using the formula:
SE(δ) = √(SE1² + SE2²)


(3) Calculate the Z score for the test: z=δ/SE(δ)
(4) Calculate the p-value from the z score. The p-value can be easily calculated in R or Microsoft Excel using the below formulas.

R: P-value=2*(1-pnorm(Z))
MS Excel: P-value=2*(1-(NORMDIST(Z,0,1,TRUE)))
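The four steps above can be sketched in R as follows. The odds ratios and standard errors below are invented numbers for illustration; plug in the estimates from your own strata.

```r
# Minimal sketch: test whether two log odds ratios differ significantly.
# The ORs and SEs are made-up values for illustration.
logOR1 <- log(1.80); se1 <- 0.15   # stratum 1 (e.g., older individuals)
logOR2 <- log(1.25); se2 <- 0.12   # stratum 2 (e.g., younger individuals)

delta    <- abs(logOR1 - logOR2)   # (1) absolute difference of log ORs
se_delta <- sqrt(se1^2 + se2^2)    # (2) standard error of the difference
z        <- delta / se_delta       # (3) z score
p        <- 2 * (1 - pnorm(z))     # (4) two-sided p-value
```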

Monday, June 2, 2014

Wald Test Statistic

Wald tests are simple statistical tests for determining if an estimated parameter value is significantly different from zero.  The test can take two forms.  The first is where the estimate (minus zero) is divided by its standard error.  The resulting statistic is a z-score that follows a standard normal distribution from which a p-value can be calculated.  The second form is where the difference in the parameter estimate and zero is squared and divided by the variance of the parameter.  This produces a statistic that follows a chi-squared distribution with one degree of freedom.  The two forms of the Wald test statistic are below.

(1)
W = (β̂ - β₀) / se(β̂) ~ N(0, 1)
(2)
W² = (β̂ - β₀)² / Var(β̂) ~ χ²(1)

Programs like R, Microsoft Excel, or even online widgets can be used to calculate p-values from a z-score or chi-squared statistic.  Here's how to do it.

R:
P-value=2*(1-pnorm(W))
P-value=1-pchisq(W2,1)

MS Excel:
P-value=2*(1-(NORMDIST(W,0,1,TRUE)))
P-value=CHIDIST(W2,1)
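Both forms can be sketched in R in a few lines. The parameter estimate and standard error below are invented values for illustration.

```r
# Minimal sketch of both forms of the Wald test; beta and its standard
# error are made-up values for illustration.
beta <- 0.35; se <- 0.10

W  <- (beta - 0) / se              # form (1): z score
p1 <- 2 * (1 - pnorm(abs(W)))

W2 <- (beta - 0)^2 / se^2          # form (2): chi-squared with 1 df
p2 <- 1 - pchisq(W2, 1)
# p1 and p2 are identical, confirming the two forms are equivalent
```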

Tuesday, January 28, 2014

Fisher Exact Test on 2 by N Table in R

Fisher exact tests are non-parametric tests of association that are usually recommended when cell counts fall below 5, since distributional assumptions of parametric tests typically do not hold in these scenarios.  Here are quick code snippets of how to set up a 2 by N table in R and conduct a Fisher exact test (fisher.test) on the table.

First off, the simplest scenario is a 2 by 2 table.  The first example shows how to manually set up a 2x2 table and then run the Fisher exact test.  The code generates a p-value and 95% confidence interval testing whether the observed odds ratio differs from the hypothesized odds ratio of 1.
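A minimal sketch, with invented cell counts standing in for your own data:

```r
# Minimal sketch: manually build a 2x2 table and run the Fisher exact test.
# The cell counts are invented for illustration.
a <- matrix(c(12, 5,
               3, 9),
            nrow = 2, byrow = TRUE,
            dimnames = list(Exposure = c("Yes", "No"),
                            Outcome  = c("Case", "Control")))
fisher.test(a)   # prints the p-value, odds ratio estimate, and 95% CI
```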

Alternately, one can also use the table command in R to circumvent the first step above and then apply the fisher.test command on the output from the R table output. Below is an example.
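A sketch of this approach, using invented data vectors that reproduce the same 2x2 counts as the manual example:

```r
# Minimal sketch: let table() build the 2x2 table from raw data vectors.
# The data vectors are invented for illustration.
exposure <- c(rep("Yes", 17), rep("No", 12))
outcome  <- c(rep("Case", 12), rep("Control", 5),
              rep("Case", 3),  rep("Control", 9))
tab <- table(exposure, outcome)
fisher.test(tab)
```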


In addition, the fisher.test procedure can be expanded to 2xN tables by using a hybrid approximation of the exact distribution.  The command follows that used above and requires the additional option hybrid=TRUE to indicate R should use the hybrid approximation.  Without this option you will get the error:

Error in fisher.test(a) : FEXACT error 7.
LDSTP is too small for this problem.
Try increasing the size of the workspace.

 See the example below.
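A sketch with an invented 2x5 table:

```r
# Minimal sketch: Fisher exact test on a 2xN (here 2x5) table with the
# hybrid approximation. Counts are invented for illustration.
a <- matrix(c(10, 4, 15, 22, 7,
               6, 9,  3, 11, 13),
            nrow = 2, byrow = TRUE)
fisher.test(a, hybrid = TRUE)
```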


**Note: This hybrid option does not seem to work in all versions of R.  If you still get the above error after using the hybrid option on a 2xN contingency table, try another version.  For example, on my PC R 3.0.1 (x64) does not work, but R 2.15.1 (i386) does.  This may have something to do with the default memory settings for each version.  If you know of other versions that work or don't work, please comment below.

A workaround when the hybrid method fails on 2xN contingency tables is to use the built-in simulation abilities of the fisher.test function to get an estimated p-value.  This uses Monte Carlo simulation, and as with all simulations, the estimate will be closer to the true p-value with more iterations.  To use this simulated p-value approach, set the option simulate.p.value=TRUE and define B, the number of simulations you wish to conduct.  I recommend 1e7 as a good starting point, which should take about a minute or so depending on the speed of your machine.  Below is example code to follow.
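A sketch on the same invented 2x5 table. B is set to 1e5 here so the example runs quickly; increase it toward 1e7, as recommended above, for a more precise estimate.

```r
# Minimal sketch: Monte Carlo simulated p-value on an invented 2x5 table.
# B = 1e5 keeps the example fast; raise it for a more precise p-value.
a <- matrix(c(10, 4, 15, 22, 7,
               6, 9,  3, 11, 13),
            nrow = 2, byrow = TRUE)
fisher.test(a, simulate.p.value = TRUE, B = 1e5)
```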

Wednesday, January 15, 2014

Statistical Package Alternatives to SAS

Let's face it: SAS (aka Statistical Analysis System) is an archaic program that is difficult to learn, cumbersome to code, and slow at computation.   While its graphics are improving, graphs are still hard to customize and lag far behind those of other programs.  In addition, the yearly licensing fees were more than enough to scare me away, especially since there are a lot of open source statistical programming packages that are just as good as, if not better than, SAS.

So what are these alternatives to SAS?  Here is my attempt to make a list of valid alternatives to SAS.  Some may be more powerful or better tailored to specific niches than others, but all are capable of carrying out basic statistical tests.  The list is alphabetical with major SAS contenders highlighted in bold.

Epi Info - simple CDC tool for clinicians
Excel - basic, but there are some useful plugins
gretl - open source with econometrics focus
KXEN - seems more geared towards industry
Mathematica - I used it doing physics research; it seemed powerful at the time, but that was long ago
MATLAB - again, maybe more for engineers and such, but has statistical functionality
Minitab - quite accessible to those with little computer knowledge
Octave - open source, similar to MATLAB
OpenEpi - online, JAVA based statistical applets
Python - more of a programming language, but has statistical packages you can load
R - open source, thousands of libraries and active community of users
S-Plus - similar to R, an S-based dialect; could not find a link to this
Sage - unified interface to open source packages
SPSS - point and click, with some scripting options, crashes a lot
Stata - command line or point and click, powerful but not bloated
STATISTICA - markets itself as a user-friendly SAS alternative

That's about as exhaustive a list as I can come up with.  If I missed something important, please comment and let me know about it.  Hopefully I have been able to convince you there are plenty of excellent alternative statistical packages to SAS that are worth your time looking into.  In my opinion, R and Stata are the two clear front-runners, particularly R since it is free and open source.  Happy hunting for a SAS replacement!