Genome Toolbox: Remove Outliers from R Variable

Thursday, January 16, 2014

Remove Outliers from R Variable

R's boxplot function is an easy way to visualize a variable and get a sense of the distribution of its values, as well as any potential outliers. By saving the output to a variable (ex: bxplt <- boxplot(data$expression)), you can see a list of the outlier points (ex: bxplt$out) that you may wish to exclude from an analysis. To identify extreme values, R calculates 1.5 times the interquartile range (ie: third quartile minus first quartile; strictly speaking, boxplot uses the hinges, which are very close to the quartiles) and creates limits by subtracting this value from the first quartile and adding it to the third quartile. Any point that is less than the lower limit or greater than the upper limit is considered an outlier by this method.

Here is a simple function I created to remove outliers from an R variable. The script removes the outliers identified by the boxplot function by replacing them with NA and returning the modified variable for analysis. So, for example, if you wanted to find the mean of data$expression with outliers removed, all you would need to do is first define the function and then run mean(ro(data$expression), na.rm = TRUE); the na.rm = TRUE argument is needed because mean returns NA whenever its input contains NA values. Overall, a pretty simple way to remove outliers, if you do indeed choose to do so.
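As a minimal sketch of what such a function could look like (an illustration, not necessarily the exact original script), the version below uses boxplot.stats(), the routine boxplot relies on to flag outliers, so no plot is drawn. The coef argument and the example vector named expression are assumptions added here for illustration.

```r
# Sketch of an outlier-removal helper (illustrative, not the original script).
# boxplot.stats() applies the same 1.5 * IQR rule as boxplot() without plotting.
ro <- function(x, coef = 1.5) {
  # Values lying more than coef * IQR beyond the hinges are returned in $out
  outliers <- boxplot.stats(x, coef = coef)$out
  # Replace those values with NA so the vector keeps its original length
  x[x %in% outliers] <- NA
  x
}

# Hypothetical data: 100 lies far outside the rest of the values
expression <- c(2.1, 2.4, 2.2, 2.6, 2.3, 2.5, 100)
cleaned <- ro(expression)

# The flagged value is now NA, so it has to be dropped explicitly
mean(cleaned, na.rm = TRUE)
```

Replacing outliers with NA rather than dropping them keeps the vector the same length as the original, which is convenient when the variable is a column in a data frame.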
