Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.

Thursday, January 16, 2014

Remove Outliers from R Variable

R's boxplot function is an easy way to visualize a variable and get a sense of the distribution of its values, as well as any potential outlier data points. By saving the output to a variable (e.g., bxplt <- boxplot(data$expression)), you can see a list of outlier points (e.g., bxplt$out) that you may wish to exclude from an analysis. To identify extreme values, R calculates 1.5 times the interquartile range (i.e., third quartile minus first quartile) and creates limits by subtracting this value from the first quartile and adding it to the third quartile. Any point that is less than the lower limit or greater than the upper limit is considered an outlier by this method.

I created a simple function, ro, to remove outliers from an R variable. The script replaces the outlier values identified by the boxplot function with NA and returns the modified variable for analysis. So, for example, to find the mean of data$expression with outliers removed, first define the function and then run mean(ro(data$expression), na.rm = TRUE); the na.rm = TRUE argument is needed because mean returns NA whenever any value is NA. Overall, a pretty simple way to remove outliers if you do indeed choose to do so.
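The function itself is not reproduced in this text, so here is a minimal sketch of what such a helper could look like. It is a reconstruction under assumptions, not necessarily the original script: the name ro comes from the post, while the coef argument and the example vector are made up for illustration. It relies on boxplot.stats(), which returns the same outliers that boxplot() reports in $out without drawing a plot. Note that these functions work from the hinges returned by fivenum(), which can differ slightly from the quartiles given by quantile().

```r
# Hypothetical reconstruction of an outlier-removal helper named ro().
# Assumes x is a numeric vector; coef = 1.5 matches boxplot()'s default range.
ro <- function(x, coef = 1.5) {
  # Outliers as flagged by the boxplot rule (no plot is drawn)
  out <- boxplot.stats(x, coef = coef)$out
  # Replace flagged values with NA so the vector keeps its original length
  x[x %in% out] <- NA
  x
}

# Made-up example data: six similar values and one extreme value
expression <- c(2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 9.8)

ro(expression)                       # the extreme value 9.8 becomes NA
mean(ro(expression), na.rm = TRUE)   # mean of the remaining six values
```

Replacing outliers with NA rather than dropping them keeps the vector aligned with the other columns of a data frame, at the cost of needing na.rm = TRUE in summary functions such as mean.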
