Contingency tables are a useful shorthand for inputting and visualizing data. I have yet to find an easy way to convert between contingency tables and data frames in R. Below is a short script in which I input a contingency table, create a function to convert the 2 by 2 table to a data frame, and convert the data frame back to a table. These conversions are useful because some statistical functions only work on tables and others only work on data frames. I hope the example below is useful.
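The original script is not reproduced here, so the following is only a minimal sketch of the round trip described above. The counts, the dimension names, and the helper name table_to_df are all made up for illustration.

```r
# Hypothetical 2 by 2 table of counts (values and labels are illustrative only)
tab <- as.table(matrix(c(12, 5, 7, 9), nrow = 2, byrow = TRUE,
                       dimnames = list(exposure = c("yes", "no"),
                                       outcome = c("case", "control"))))

# Expand a contingency table into a data frame with one row per observation.
# table_to_df is a hypothetical helper name, not a base R function.
table_to_df <- function(tab) {
  counts <- as.data.frame(tab)                       # one row per cell, counts in Freq
  rows <- rep(seq_len(nrow(counts)), times = counts$Freq)
  out <- counts[rows, names(counts) != "Freq", drop = FALSE]
  rownames(out) <- NULL
  out
}

df <- table_to_df(tab)   # one row per subject
tab2 <- table(df)        # cross-tabulate the data frame to get a table again

print(head(df))
print(tab2)
```

The as.data.frame call gives one row per cell with a Freq column, and repeating each row Freq times yields the per-observation layout that data frame based functions expect.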
A repository of programs, scripts, and tips essential to
genetic epidemiology, statistical genetics, and bioinformatics
Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Showing posts with label table. Show all posts
Friday, November 21, 2014
Tuesday, January 28, 2014
Fisher Exact Test on 2 by N Table in R
Fisher exact tests are non-parametric tests of association that are usually recommended when expected cell counts fall below 5, since the large-sample approximations behind tests such as the chi-square test typically do not hold in these scenarios. Here are quick code snippets showing how to set up a 2 by N table in R and conduct a Fisher exact test (fisher.test) on the table.
First off, the simplest scenario is a 2 by 2 table. The first example shows how to manually set up a 2x2 table and then run the Fisher exact test. The code generates a p-value and a 95% confidence interval, testing whether the observed odds ratio differs from the hypothesized odds ratio of 1.
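A minimal sketch of the manual 2x2 setup, assuming made-up counts and labels (they are not from the original post):

```r
# Hypothetical 2x2 table entered by hand (counts are illustrative only)
tab <- matrix(c(3, 1,
                1, 3),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("exposed", "unexposed"),
                              status = c("case", "control")))

res <- fisher.test(tab)   # two-sided test of odds ratio = 1 by default

print(res$p.value)    # p-value
print(res$estimate)   # conditional maximum likelihood estimate of the odds ratio
print(res$conf.int)   # 95% confidence interval for the odds ratio
```

Printing res on its own shows all of these pieces together in the usual htest layout.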
Alternatively, one can use the table command in R to build the table directly from the raw data, skipping the first step above, and then apply fisher.test to its output. Below is an example.
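As a sketch of the table route, assuming a small made-up data set with one row per subject (the column names genotype and status are hypothetical):

```r
# Hypothetical per-subject data (values are illustrative only)
dat <- data.frame(
  genotype = c("carrier", "carrier", "carrier", "noncarrier",
               "noncarrier", "noncarrier", "noncarrier", "carrier"),
  status   = c("case", "case", "control", "control",
               "control", "case", "control", "case")
)

# table() cross-tabulates the two columns into a 2x2 contingency table
tab <- table(dat$genotype, dat$status)
print(tab)

# fisher.test accepts the table object directly
res <- fisher.test(tab)
print(res)
```

fisher.test also accepts the two factors directly, as in fisher.test(dat$genotype, dat$status), which builds the same table internally.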
In addition, the fisher.test procedure can be extended to 2xN tables by using a hybrid approximation of the exact distribution. The command follows that used above and requires the additional option hybrid=TRUE to indicate that R should use the hybrid approximation rather than the full exact computation. Without this option, tables with large counts can fail with the error:
Error in fisher.test(a) : FEXACT error 7.
LDSTP is too small for this problem.
Try increasing the size of the workspace.
See the example below.
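A minimal sketch of the 2xN case, assuming a made-up 2 by 5 table. This table is small enough that the default exact computation also runs, so it only illustrates the syntax and does not reproduce the error. The workspace argument shown last is the fisher.test option that the error message itself points to; whether raising it resolves a given failure depends on the table and the R version.

```r
# Hypothetical 2 by 5 table of counts (values are illustrative only)
tab <- matrix(c(8, 6, 5, 7, 9,
                4, 7, 6, 3, 5),
              nrow = 2, byrow = TRUE)

# Default: exact computation (this is the call that can fail on large tables)
res_exact <- fisher.test(tab)

# Hybrid approximation of the exact probabilities
res_hybrid <- fisher.test(tab, hybrid = TRUE)

# Alternative: keep the exact computation but give it a larger workspace
# (the default workspace is 200000)
res_ws <- fisher.test(tab, workspace = 2e6)

print(res_exact$p.value)
print(res_hybrid$p.value)
print(res_ws$p.value)
```

For tables larger than 2x2, fisher.test reports only a p-value; the odds ratio estimate and confidence interval are specific to the 2x2 case.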
**Note: This hybrid option does not seem to work on all versions of R. If you still get the above error after using the hybrid option on a 2 x N contingency table, try another version. For example, on my PC, R 3.0.1 (x64) does not work, but R 2.15.1 (i386) does. This may have something to do with the default memory settings for each version. If you know of other versions that do or do not work, please comment below.
A workaround when the hybrid method does not work on a 2xN contingency table is to use the built-in simulation option of the fisher.test function to get an estimated p-value. This uses Monte Carlo simulation and, as with all simulations, the estimate gets closer to the exact p-value as the number of iterations increases. To use this simulated p-value approach, set the option simulate.p.value=TRUE and set B to the number of simulations you wish to run. I recommend 1e7 as a good starting point, which should take about a minute or so depending on the speed of your machine. Below is example code to follow.
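A minimal sketch of the simulated p-value approach, again on a made-up 2 by 5 table. B is set to 1e5 here only so the sketch runs quickly; the 1e7 suggested above would replace it in a real analysis. The set.seed call is an addition to make the estimate reproducible.

```r
# Hypothetical 2 by 5 table of counts (values are illustrative only)
tab <- matrix(c(8, 6, 5, 7, 9,
                4, 7, 6, 3, 5),
              nrow = 2, byrow = TRUE)

set.seed(1)  # fix the random number stream so the estimate is reproducible

# Monte Carlo estimate of the p-value based on B simulated tables
res_sim <- fisher.test(tab, simulate.p.value = TRUE, B = 1e5)

print(res_sim$p.value)
```

Because the p-value is estimated, repeated runs without a fixed seed will differ slightly, and the differences shrink as B grows.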
Monday, January 27, 2014
Produce SAS Proc Freq Output in R
I have never been a huge fan or advocate of SAS, and I actually recommend using SAS alternatives, but I am somewhat addicted to the SAS Proc Freq procedure and its output tables. It's a nice way not only to visualize the data but also to get some useful summary statistics. I have been using the R table command for a while, and in most cases, combined with margin.table or prop.table, it suffices to summarize the data. Recently, I have needed summary statistics for the tables as well. This can be accomplished easily enough using the R chisq.test or fisher.test commands, but that still doesn't quite provide the fluid integration of data visualization and summary statistics that the SAS Proc Freq output provides. Today I came across the R function CrossTable. This function, part of the gmodels package, provides formatted output very similar to that of Proc Freq. So much so, it even uses the same ASCII characters to delineate cell boundaries. It takes a bit of playing around with the options to get the exact statistics and percentages you would like, but overall it is a very nice (and, not to mention, free) alternative to SAS's Proc Freq output. Below is an example table I created.
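The original example table is not reproduced here, so this is only a sketch of a CrossTable call on made-up data. It assumes the gmodels package is installed (install.packages("gmodels") if not), and the particular options chosen are just one possible combination.

```r
# Hypothetical per-subject data (values are illustrative only)
dat <- data.frame(
  genotype = rep(c("carrier", "noncarrier"), times = c(20, 30)),
  status   = c(rep(c("case", "control"), times = c(12, 8)),
               rep(c("case", "control"), times = c(10, 20)))
)

ct <- NULL
if (requireNamespace("gmodels", quietly = TRUE)) {
  # Proc Freq style table: counts plus row, column, and total proportions,
  # with a Fisher exact test printed underneath
  ct <- gmodels::CrossTable(dat$genotype, dat$status,
                            prop.chisq = FALSE,  # drop per-cell chi-square contributions
                            fisher = TRUE,       # append Fisher exact test results
                            format = "SAS",      # SAS-like cell layout
                            dnn = c("Genotype", "Status"))
}
```

Setting chisq = TRUE adds the chi-square test, and prop.r, prop.c, and prop.t switch the row, column, and total percentages on or off.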