Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.
Showing posts with label code. Show all posts

Tuesday, November 4, 2014

R Syntax for a Simple For Loop

For loops in R are useful tidbits of code to keep track of a counter, iterate through a variable, or do a complex operation on a subset of variables.  Below is an example R script showing the code and syntax needed to do some simple tasks with R for loops.

Friday, October 31, 2014

Import Variable into R Script

I always forget the syntax for defining a variable to read into an R script. It is relatively easy to import an external variable into R (such as a Unix variable or user-defined variable). The option to feed R a variable is incredibly useful on a cluster system where you want to farm out a large job into smaller sized pieces. Here is example code for doing so:



Monday, July 7, 2014

Creating and Accessing SQL Databases with Python

Python has numerous ways of outputting and storing data.  Recently, I investigated using shelves in Python. This is a way to store indexed data that uses syntax similar to that of Python dictionaries, but I found it too time consuming to create shelves of large datasets.  Searching for a more efficient way to build databases in Python, I came across its SQL functionality. The sqlite3 module, part of the Python standard library, is an indispensable way to create databases in Python.  It lets Python users create and query large databases using SQL.  SQL stands for Structured Query Language and is used for managing data held in a relational database management system. SQLite, the database engine behind the module, uses a nonstandard variant of the SQL query language, and the sqlite3 module provides an interface to it that is compliant with the DB-API 2.0 specification. As a quick reference, I thought I would create an example script that could be used to build a SQL database using Python.  Below is a simple tutorial that is hopefully useful for learning how to use the sqlite3 module.
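As a rough illustration of the workflow described above, here is a minimal sketch using only the standard sqlite3 module. The table name (snps), its columns, and the rows are all invented for this example, and the database is kept in memory so nothing is written to disk.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind; pass a file path
# (for example "snps.db", a hypothetical name) to persist the data instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table of SNP positions, invented for illustration.
cur.execute("CREATE TABLE snps (rsid TEXT PRIMARY KEY, chrom TEXT, pos INTEGER)")

rows = [
    ("rs0001", "chr1", 10500),
    ("rs0002", "chr1", 20750),
    ("rs0003", "chr2", 5000),
]
# The ? placeholders let sqlite3 handle quoting of the values.
cur.executemany("INSERT INTO snps VALUES (?, ?, ?)", rows)
conn.commit()

# Query the table with ordinary SQL and fetch the matching rows as tuples.
cur.execute("SELECT rsid, pos FROM snps WHERE chrom = ? ORDER BY pos", ("chr1",))
chr1_snps = cur.fetchall()
print(chr1_snps)

conn.close()
```

For large datasets, inserting many rows with executemany inside a single transaction, as done here, is usually much faster than committing after every row.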

Wednesday, March 26, 2014

Convert Word Document Field Codes into Formatted Text

Reference managers such as EndNote and Mendeley are great time savers when inserting citations in a manuscript typed in Microsoft Word.  Sometimes it is necessary to modify or remove the field codes these programs place in a document, for example when you need to edit some of the fields or submit a text-only article to a journal.  In these instances, the fields need to be removed and replaced with the appropriately formatted text.  How is this done?  It's incredibly easy...as long as you know the keyboard shortcut.  Here are the two simple steps:

(1) Select the text you want to remove the field codes from.  This can be done by highlighting a section of interest or pressing Ctrl + A if you want to replace the field codes in the entire document.

(2) Press Ctrl + Shift + F9.  This is the actual step that converts field codes into formatted text.

That's it.  You're done!  All your MS Word field codes in your .doc file should now be removed and the appropriate formatted text should be inserted in their place.  Hope this works for you as easily as it did for me.  If you find this post particularly helpful, please help me out by clicking the +1 link on the bottom of the post.

Tuesday, March 25, 2014

Calculate P-value for Linear Mixed Model in R

The lme4 R package is a powerful tool for fitting mixed models in R where you can specify fixed and random effects.  One oddity of the package is that it returns t statistics but no p-values.  Getting the p-value takes a little extra coding.  Here is a quick example for fitting a linear mixed model in R (using lmer) and then the added code to calculate p-values from the t statistic.  Either p1 or p2 is an acceptable p-value to use.
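The calculations behind p1 and p2 are not reproduced here, so the following is not a reconstruction of them. It is only a sketch of one common large-sample approach: treat the t statistic as approximately standard normal and take the two-sided tail area, which in R would be written 2 * (1 - pnorm(abs(t))). The function name two_sided_p_from_t is made up for this example, and the approximation is an assumption that becomes less reasonable with few observations or groups.

```python
import math

def two_sided_p_from_t(t_stat):
    """Two-sided p-value, treating the t statistic as standard normal.

    Assumption: the sample is large enough for the normal approximation.
    """
    # 2 * (1 - Phi(|t|)) simplifies to erfc(|t| / sqrt(2)).
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

# Hypothetical t statistic, invented for illustration.
p_value = two_sided_p_from_t(2.5)
print(p_value)
```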

Wednesday, January 29, 2014

Bar Plot with 95% Confidence Interval Error Bars in R

R is a great plotting program for making publication quality bar plots of your data.  I often need to make quick bar plots to help collaborators quickly visualize data, so I thought I would put together a very generalized script that can be modified and built on as a template to make future bar plots in R.

In this script you manually enter your data (as you would see it in a 2x2 table) and then calculate the estimated frequency and the 95% CI around that frequency using the binom.confint function in the binom package.  Next, a parametric and a non-parametric p-value are calculated with the binom.test and fisher.test commands, respectively.  These statistics are then plotted using R's barplot function.  The example code is below along with what the plot will look like.  P-values and N's are filled in automatically, and the y limits are calculated to ensure the graph includes all the plotted details.



Friday, January 24, 2014

15K and Growing

Today marks the 15,000 page view milestone for Genome Toolbox since its beginnings on May 1, 2013.  It's great to see so much interest, and it is an encouragement to keep posting new tips I find useful.  Keep coming back for more to come!

Thursday, January 16, 2014

Remove Outliers from R Variable

R's boxplot function is an easy way to visualize a variable and get a sense of the distribution of values as well as potential outlier data points that may exist.  By saving the output to a variable (ex: bxplt <- boxplot(data$expression)), you can see a list of outlier points (ex: bxplt$out) that you may wish to exclude from an analysis.  The method R uses to identify extreme values is to calculate 1.5 times the interquartile range (ie: third quartile minus first quartile) and create limits by subtracting this value from the first quartile and adding it to the third quartile.  Any point that is less than the smaller limit or greater than the larger limit is considered an outlier by this method.  Here is a simple function I created to remove outliers from an R variable.  The script essentially removes outliers identified by the boxplot function by replacing outlier values with NA and returning this modified variable for analysis.  So, for example, if you wanted to find the mean of data$expression with outliers removed, all you would need to do is first run the below function and then use the command mean(ro(data$expression)).  Overall, a pretty simple way to remove outliers if you do indeed choose to do so.
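To make the 1.5 x IQR rule concrete, here is a minimal Python sketch of the same idea. It is not the R function ro from this post: the name remove_outliers and the example values are invented, None stands in for R's NA, and the quartiles come from Python's statistics.quantiles, whose convention can differ slightly from the hinges R's boxplot uses, so borderline points may be classified differently.

```python
import statistics

def remove_outliers(values, k=1.5):
    """Return a copy of values with points outside the k * IQR limits set to None."""
    # Quartiles by linear interpolation over the data (inclusive method).
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Keep points inside the limits; replace everything else with None.
    return [v if lower <= v <= upper else None for v in values]

# Hypothetical expression values with one obvious extreme point.
expression = [2.1, 2.4, 2.2, 2.8, 2.5, 9.7, 2.3]
cleaned = remove_outliers(expression)

# Drop the None placeholders before summarizing, much like na.rm = TRUE in R.
kept = [v for v in cleaned if v is not None]
print(cleaned)
print(statistics.mean(kept))
```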

Wednesday, January 15, 2014

Syntax for a User-Defined R Function

R functions are incredibly handy ways to have R carry out repetitive tasks for you without having to copy and paste lines in your code over and over again.  Since I don't write new R functions every day, I sometimes forget the syntax for creating these custom functions.  Here's an example function I made to tell if a number is even or odd:


To run a custom R function, simply use the function name with all the needed input variables in parentheses (ex: even_num(413)).

Add an Overall Title to an Array of R Plots

Adding a top-level title to a group of R plots is relatively easy...as long as you know the correct commands and sequence to put them in.  Here is some quick code to place a title on the top of multiple plots in R:



Note: The command to include an overall array title needs to be at the end of the code after the other plots have been generated.

Thursday, December 12, 2013

Sum Overlapping Base Pairs of Features from Chromosomal BED File

I had a .bed file of genomic features on a chromosome and wanted to measure how much the features overlap, both to investigate commonly covered genes and to find positions where features were likely to form.  I wanted to generate a plot similar to a coverage depth plot from next-generation sequencing reads.  I am sure more efficient methods exist, but here is some Python code that takes in a .bed file of features (features.bed) and creates an output file (features.depth) with the feature overlap "depth" every 5,000 base pairs across the areas that contain features in your chromosomal .bed file.
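The approach described above can be sketched roughly as follows. This is a simplified illustration rather than the original script: the function names are invented, all features are assumed to sit on a single chromosome, positions are sampled at multiples of the step size, and the output layout (position, tab, depth) is an assumption.

```python
def read_bed(path):
    """Read (start, end) pairs from the 2nd and 3rd columns of a BED file."""
    features = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Skip blank lines, comments, and track/browser header lines.
            if len(fields) < 3 or line.startswith("#") or fields[0] in ("track", "browser"):
                continue
            features.append((int(fields[1]), int(fields[2])))
    return features

def depth_at_steps(features, step=5000):
    """Count the features covering each sampled position across the feature span."""
    first = min(start for start, _ in features)
    last = max(end for _, end in features)
    # First multiple of step at or above the smallest start coordinate.
    position = -(-first // step) * step
    depths = []
    while position < last:
        # BED intervals are 0-based and half-open: start <= position < end.
        depth = sum(1 for start, end in features if start <= position < end)
        depths.append((position, depth))
        position += step
    return depths

def write_depth(depths, path):
    """Write one 'position<TAB>depth' line per sampled position."""
    with open(path, "w") as out:
        for position, depth in depths:
            out.write(f"{position}\t{depth}\n")

# Small in-memory example with made-up intervals.
example = [(0, 12000), (4000, 16000), (11000, 21000)]
example_depths = depth_at_steps(example)
print(example_depths)
```

With real files, the calls would be depth_at_steps(read_bed("features.bed")) followed by write_depth(..., "features.depth"). Counting overlaps by scanning every feature at every position is simple but slow for large files; a sweep over sorted start and end coordinates would scale better.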



Thursday, July 18, 2013

What Are SNP Ambiguity Codes and What Do They Mean?

That's a good question.  In fact one that I had myself.  Here's what I found:

Apparently single nucleotide polymorphism (SNP) ambiguity codes were constructed by the International Union of Pure and Applied Chemistry (IUPAC) to denote nucleotide changes in SNPs.  Here is a table of the meaning of each code taken from the Ensembl SNPView website.

IUPAC Code   Mnemonic      Meaning           Complement
A            Adenine       A                 T
C            Cytosine      C                 G
G            Guanine       G                 C
T/U          Thymidine     T                 A
K            Keto          G or T            M
M            Amino         A or C            K
S            Strong        C or G            S
W            Weak          A or T            W
R            Purine        A or G            Y
Y            Pyrimidine    C or T            R
B            not A         C, G, or T        V
D            not C         A, G, or T        H
H            not G         A, C, or T        D
V            not T or U    A, C, or G        B
N            any           G, A, T, or C     N
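
For use in code, the table can be written as a lookup. The sketch below is only an illustration: the dictionary and function names are invented, and U is left out, so RNA sequences would need U converted to T first. The complement of each code is derived by complementing its member bases, which gives the same result as the Complement column.

```python
# Bases represented by each IUPAC code (DNA only; U is not handled here).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

# Reverse lookup from a set of bases to the code that represents it.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}

def complement_code(code):
    """Return the IUPAC code for the complement of every base the code covers."""
    bases = IUPAC[code.upper()]
    return CODE_FOR[frozenset(BASE_COMPLEMENT[base] for base in bases)]

def snp_code(allele1, allele2):
    """Return the ambiguity code for a SNP with the two given alleles."""
    return CODE_FOR[frozenset((allele1.upper(), allele2.upper()))]

print(snp_code("A", "G"))
print(complement_code("K"))
```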

Friday, July 5, 2013

Best Notepad++ Alternatives for Mac OS

I ❤ Notepad++.  It's a powerful, fully-loaded, and free text editing application that has been an invaluable tool for writing code in a variety of programming languages.  The only caveat: it's only available for Windows operating systems.  With the acquisition of my shiny, new MacBook Pro, I was incredibly disappointed to find out Notepad++ could not be installed on Macs; so much so I almost returned the MacBook.  Since I couldn't find a better laptop to meet my needs (and aesthetic desires), the quest has begun to find a comparable and preferably free text editor that runs on a Mac operating system.  I was surprised to find the list of candidates quite long.  Here are the options I found.  Unfortunately, not all of them are free:

BBEdit ($50)
Coda ($75)
CrossOver (Windows emulator, $60) + Notepad++
Espresso ($75)
jEdit
Komodo Edit
TextEdit (the basic text editor pre-loaded on your Mac)
TextMate (€39 or about $53)
TextWrangler (free lite version of BBEdit)
Smultron ($5)
SubEthaEdit (€29, or about $43)
Sublime Text ($70)
Tincta (free, Pro version for $16)
Wine (Windows emulator) + Notepad++

Apparently the market is saturated with Notepad++ "replacement" text editors for Macs.  The predominant text editors most recommended online are highlighted in bold.  While looking into the options, it became apparent there really is no one best Notepad++ replacement text editor.  It all depends on what the user is using Notepad++ for and the features they need (plus a bit of personal preference in user interface).  I have tried a few of the above options and am still not completely satisfied.  I am secretly hoping the folks responsible for Notepad++ are cooking up a way to install it on Macs.  The emulator approach to installing Notepad++ on a Mac also seems interesting.  I will have to try it when I have some free time.  In the meantime, I am curious what has been working best for you other Notepad++ lovers who have made the switch to a Mac.  Also, if you are aware of other text editors not mentioned here, please share!

Thursday, May 2, 2013

GitHub

When starting this blog, I found it difficult to share snippets of code easily through the Blogger interface.  I stumbled across an easy tool to help do this: GitHub.  It's a pretty easy-to-use interface that lets you paste your code into a text box, select which coding language you are using, and then create an easy one-line link you can embed into your Blogger blog.  It even highlights the commands and syntax that are relevant to that coding language.  Pretty slick.