Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.

Tuesday, March 15, 2016

R Function to Calculate Confidence Interval from Odds Ratio and P-value

Every once in a while I need a confidence interval for a published estimate that reports only the odds ratio and its p-value. With a little bit of algebra it is simple to build an R function to calculate the confidence interval. Below is a short, easy example script.
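Assuming the published p-value is two-sided and Wald-based (so the z-statistic can be recovered with qnorm), a minimal sketch of such a function (or_to_ci is an illustrative name) is:

# Recover a 95% CI from an odds ratio and its two-sided Wald p-value
or_to_ci <- function(or, p, alpha = 0.05) {
  beta <- log(or)                # log odds ratio
  z <- qnorm(1 - p / 2)          # z-statistic implied by the p-value
  se <- abs(beta) / z            # back out the standard error
  ci <- exp(beta + c(-1, 1) * qnorm(1 - alpha / 2) * se)
  c(OR = or, lower = ci[1], upper = ci[2])
}

or_to_ci(1.57, 1e-4)             # e.g., a published OR of 1.57 with p = 0.0001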

Friday, November 21, 2014

Convert 2 by 2 Contingency Table to Data Frame in R

Contingency tables are useful shorthand ways of inputting and visualizing data. I have yet to find an easy way to convert between contingency tables and data frames in R. Below is a short script in which I input a contingency table, create a function to convert the 2 by 2 table to a data frame, and convert the data frame back to a table. These operations are useful for running statistical procedures that work only on tables or only on data frames. Hope the example below is useful.
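A minimal sketch of that round trip (the table values and names are illustrative):

# Input a 2 by 2 contingency table
tab <- as.table(matrix(c(10, 20, 30, 40), nrow = 2,
                       dimnames = list(exposure = c("yes", "no"),
                                       outcome = c("case", "control"))))

# Convert the table to a data frame with one row per observation
table_to_df <- function(t) {
  cells <- as.data.frame(t)                          # one row per cell, with a Freq column
  cells[rep(seq_len(nrow(cells)), cells$Freq), 1:2]  # replicate each cell by its count
}

df <- table_to_df(tab)                   # 100 rows: exposure, outcome

# Convert the data frame back to a table
tab2 <- table(df$exposure, df$outcome)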

Tuesday, November 4, 2014

R Syntax for a Simple For Loop

For loops in R are useful tidbits of code for keeping track of a counter, iterating over the elements of a vector, or doing an operation on a subset of variables. Below is an example R script showing the code and syntax needed to do some simple tasks with R for loops.
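A few common patterns, as a sketch:

# Keep track of a counter
for (i in 1:5) {
  print(i)
}

# Iterate over the elements of a vector
chrs <- c("chr1", "chr2", "chrX")
for (chr in chrs) {
  cat("Processing", chr, "\n")
}

# Do an operation on a subset of variables in a data frame
df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
for (col in c("a", "c")) {
  df[[col]] <- df[[col]] * 2   # double only columns a and c
}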

Friday, October 31, 2014

Import Variable into R Script

I always forget the syntax for defining a variable to read into an R script. It is relatively easy to import an external variable into R (such as a Unix variable or user-defined variable). The option to feed R a variable is incredibly useful on a cluster system where you want to farm out a large job into smaller pieces. Here is example code for doing so:
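One common pattern (a sketch; the shell variable CHR and file name script.R are illustrative) passes the variable as a command-line argument and reads it with commandArgs():

# At the UNIX prompt
CHR=22
Rscript script.R $CHR

# Inside script.R
args <- commandArgs(trailingOnly = TRUE)   # character vector of arguments
chr <- args[1]                             # the value of $CHR, as a string
cat("Analyzing chromosome", chr, "\n")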



Wednesday, September 10, 2014

How to Index a VCF File

There are two simple ways to create an index for a VCF file of sequence variants. The first is a command-line approach using Tabix. For directions on installing Tabix, see this post. Here is the code needed for indexing the VCF file (either .vcf or .vcf.gz). First you need to make sure the VCF file is compressed as a .vcf.gz file; this is done in the first line of code below. Next, create a new .tbi index file in the same directory as your .vcf.gz file. The -f flag will overwrite an old index file that may be outdated or corrupted, and the -p vcf option tells Tabix the file is in VCF format.
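With an illustrative file name, the two steps look like:

bgzip -f my_variants.vcf             # compress: produces my_variants.vcf.gz
tabix -f -p vcf my_variants.vcf.gz   # index: produces my_variants.vcf.gz.tbi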


The second way to index a VCF file is a point-and-click approach using the Broad Institute's Integrated Genomics Viewer (IGV), a Java-based program that runs on a variety of operating systems. To index a VCF file, open IGV, click on the Tools menu, and select Run igvtools... A dialogue box will pop up. In the Command drop-down menu select Index, then click Browse to select your desired .vcf file. Click Run and a new .tbi index file will be created in the same folder.

There are probably other ways to index a VCF file, but these are the ones I am aware of. If you are aware of another method, please share in the comments.

Monday, August 4, 2014

Run Unix Commands and Other Programs within Python

The Python subprocess module is part of the standard library and enables a user to run a variety of subprocesses within Python and collect their output and error pipes. It is a wonderfully handy tool that really expands what Python is capable of doing. The module is intended to build upon, and essentially replace, much of the functionality of the os module, which I have found cumbersome to use. The main commands of the subprocess module include:

subprocess.call - runs a command and returns its exit code
subprocess.check_call - runs a command; returns 0 if the exit code is 0, otherwise raises CalledProcessError

Here are some simple example scripts to help get started using subprocess inside Python code:
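A minimal sketch (input.txt is a placeholder file name):

import subprocess

# call() runs a command and returns its exit code (0 on success)
code = subprocess.call(["ls", "-l"])

# check_call() raises CalledProcessError if the exit code is nonzero
subprocess.check_call(["sort", "input.txt", "-o", "input.sorted.txt"])

# Popen() plus communicate() collects the output and error pipes
p = subprocess.Popen(["wc", "-l", "input.txt"],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
print(out)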

Wednesday, July 16, 2014

RNA-seq: RPKM, FPKM, Formulas, and Scripts

Reads per kilobase per million mapped reads (RPKM) is a common metric used when investigating RNA expression of a gene transcript in sequencing data from RNA-seq experiments.  The RPKM measure of read density reflects the molar concentration of a transcript in a starting sample by normalizing for RNA length and the total read number.  By doing so, RPKM values facilitate transparent comparison of transcript levels both within and between samples.  The formula for RPKM is as follows:

RPKM = (10^9 × ER) / (EL × MR)

where ER is the number of mapped reads in the gene's exons, EL is the sum of the exon lengths in base pairs, and MR is the total number of mapped reads.
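As a quick check of the formula, a minimal sketch in Python:

def rpkm(er, el, mr):
    # ER = reads in the gene's exons, EL = summed exon length (bp),
    # MR = total mapped reads
    return (1e9 * er) / (el * mr)

print(rpkm(500, 2000, 20e6))   # 500 exonic reads, 2 kb of exon, 20M reads -> 12.5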

The number of transcript copies (TC) can be derived from RPKM as well.  Essentially, the fraction of all mapped reads that fall in a gene's exons equals the fraction of the transcriptome contributed by that gene's transcript copies:

ER / MR = (TC × EL) / TL

where TL is the length of the transcriptome in base pairs.  This can then be rearranged, and RPKM substituted in, as follows:

TC = (TL × ER) / (EL × MR) = (RPKM × TL) / 10^9

The difficult part is getting an estimate of TL.  TL can be estimated from spike-in data or can be derived from the starting amount of mRNA if you are willing to assume 100% efficiency of cDNA synthesis.  A great paper on RPKM is by Mortazavi et al. (Nature Methods 2008).

Fragments per kilobase per million mapped fragments (FPKM) is essentially analogous to RPKM.  The only difference is that abundance is estimated in terms of observed fragments rather than reads.  In paired-end RNA-seq experiments (e.g., Illumina sequencing), fragments are sequenced from both ends, providing two reads per fragment.  In single-end data one fragment yields one read (RPKM); in paired-end data one fragment yields two reads, and FPKM counts the fragment once.  A common misconception is that RPKM values are twice those of FPKM.  That is untrue, since FPKM is fragments per kilobase per million mapped fragments, not fragments per kilobase per million mapped reads.  RPKM is approximately equal to FPKM.

Here are links to some programs and scripts that are useful:
rpkmforgenes.py



Monday, July 7, 2014

Creating and Accessing SQL Databases with Python

Python has numerous ways of outputting and storing data.  Recently, I investigated using shelves in Python. This is a way to store indexed data using command syntax similar to that of Python dictionaries, but I found it too time-consuming to create shelves of large datasets.  Searching for a way to efficiently build databases in Python, I came across its SQL functionality. The sqlite3 library is an indispensable way to create databases in Python.  This package permits Python users to create and query large databases using syntax borrowed from SQL.  SQL stands for Structured Query Language and is used for managing data held in a relational database management system. SQLite uses a slightly nonstandard variant of the SQL query language, and the sqlite3 module is compliant with the DB-API 2.0 specification. As a quick reference, I thought I would create an example script that could be used to build a SQL database using the Python programming language.  Below is a simple tutorial to follow that hopefully is useful for learning how to use the sqlite3 package.
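A minimal sketch (the database, table, and column names are illustrative):

import sqlite3

conn = sqlite3.connect("variants.db")   # creates the file if it does not exist
cur = conn.cursor()

# Build a table and load a few rows
cur.execute("CREATE TABLE IF NOT EXISTS snps (rsid TEXT, chrom TEXT, pos INTEGER)")
cur.executemany("INSERT INTO snps VALUES (?, ?, ?)",
                [("rs123", "chr1", 10583), ("rs456", "chr2", 45895)])
conn.commit()

# Query with a parameterized SELECT
for row in cur.execute("SELECT * FROM snps WHERE chrom = ?", ("chr1",)):
    print(row)

conn.close()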

Tuesday, June 24, 2014

Quick Primer on Python Shelve

Python shelve is a convenient means of storing a Python data object to disk for later use. A shelve behaves similarly to a Python dictionary object and uses much of the same syntax.  Each shelved object is stored under a key so it can be quickly and efficiently retrieved from disk. A shelved object must be something that can be pickled with the pickle module, essentially making shelves an easy way to organize and store pickled objects. "Pickling" is the process of turning a Python object into a byte stream, and the inverse, "unpickling", restores the byte stream back into the original Python object. Pickling is synonymous with serialization, marshalling, and flattening of data. Example Python objects that can be pickled include integers, lists, tuples, sets, dictionaries, and classes. In addition to saving items to disk, shelves allow for quick access to portions of large data objects and can store them in binary format by specifying a pickle protocol greater than 0.  Here are some simple Python scripts that hopefully serve as a useful tutorial on shelves and their syntax.  The example focuses on saving dictionaries into a shelf, but it can easily be extended to other objects as well.
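A minimal sketch of saving and retrieving a dictionary (the file and key names are illustrative):

import shelve

genes = {"TP53": "chr17", "BRCA1": "chr17", "EGFR": "chr7"}

# Store the dictionary on disk under the key "genes";
# protocol=2 pickles in binary format
with shelve.open("mydata.shelve", protocol=2) as db:
    db["genes"] = genes

# Later, even in another script, read it back from disk
with shelve.open("mydata.shelve") as db:
    print(db["genes"]["TP53"])   # chr17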

Tuesday, June 10, 2014

Easily Parse XML Files in Python

While not the most popular means of storing data, there are instances where data is stored in the XML file format.  The TCGA project is one such group that keeps a record of samples and analysis information in this format.  It can be difficult to extract data from these files without the appropriate tools.  In some cases a simple grep command can be used, but even then the output usually needs to be cleaned.  Luckily, Python has some packages that aid in parsing XML files so that the desired data can be extracted.  The library I found most useful was ElementTree.  This package, combined with the urllib2 package, enables one to download an XML file remotely from the internet, parse it, and extract the desired information from the data stored within.  Below is an example Python script that downloads an XML description file from TCGA (link) and then places each extracted element into a Python dictionary object.  Of course this could easily be modified for your particular application of interest, but at least it provides a simple backbone to build other scripts from.
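A sketch along those lines, using Python 3's urllib.request in place of the post's urllib2 and a placeholder URL rather than the original TCGA link:

import urllib.request
import xml.etree.ElementTree as ET

url = "http://example.com/description.xml"   # placeholder URL
root = ET.fromstring(urllib.request.urlopen(url).read())

# Walk every element in the tree and keep those that carry text
info = {}
for elem in root.iter():
    if elem.text and elem.text.strip():
        info[elem.tag] = elem.text.strip()

print(info)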

Below is the expected output:

Wednesday, June 4, 2014

Easy Forest Plots in R


Forest plots are great ways to visualize individual group estimates as well as investigate heterogeneity of effect.  Most forest plot programs will display combined effect estimates and give you an indicator of whether there is evidence for heterogeneity among subgroups.  Fortunately, the R metafor package makes meta-analysis and plotting forest plots relatively easy.  Below is a sample R script that will combine beta and standard error estimates from an input file (input.txt) and create simple forest plots with an overall estimate as well as p-values for association and heterogeneity.

In general, the input.txt file should either

(1) Have the columns:
group - name for the study or group
beta - the log odds ratio for the effect of interest
se - the standard error for the log odds ratio estimate

or

(2) Have the columns:
group - name for the study or group
OR - odds ratio for the effect of interest
LCL - lower confidence limit for the odds ratio
UCL - upper confidence limit for the odds ratio

For the second case, where you have an odds ratio and 95% confidence limits, beta and se need to be estimated.  This is done by uncommenting lines 8 and 9 of the script.  Of note: due to rounding error, the final forest plot may have 95% CI limits that are off by one in the last digit.

The R script that uses the metafor package as well as an example input.txt are below.
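The original script is not reproduced here, but a minimal sketch with metafor (the commented lines do the OR/LCL/UCL-to-beta/se conversion mentioned above) might look like:

# Example input.txt (whitespace-delimited; values illustrative):
# group  beta  se
# StudyA 0.25  0.10
# StudyB 0.10  0.08
# StudyC 0.40  0.15

library(metafor)
dat <- read.table("input.txt", header = TRUE)

# Uncomment if input.txt has OR/LCL/UCL columns instead of beta/se:
# dat$beta <- log(dat$OR)
# dat$se <- (log(dat$UCL) - log(dat$LCL)) / (2 * qnorm(0.975))

fit <- rma(yi = beta, sei = se, data = dat, method = "FE")   # fixed-effect model
forest(fit, slab = dat$group,
       mlab = sprintf("Overall (p = %.2g, heterogeneity p = %.2g)",
                      fit$pval, fit$QEp))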




Sunday, June 1, 2014

Generate Random Genomic Positions

Generating random genomic positions or coordinates can be useful for comparing characteristics of a set of genomic loci to what would be expected from permutations of the underlying genomic distribution.  Below is a Python script to aid in selecting random genomic positions.  The script chooses a chromosome based on probabilities assigned by chromosome length and then chooses a chromosomal position from a uniform distribution over the chromosome's length.  An added gap-checking statement is included to ensure the chosen position lies within the accessible genome.  You can choose the number of positions you want, the number of permutations to conduct, the size of the genomic positions, and the genomic build of interest.  A UNIX shell script is included as a wrapper to automatically download the needed chromosomal gap and cytoband files as well as run the Python script.  Usage for the UNIX script can be seen by typing ./make_random.sh from the command line after giving the script executable privileges.  An example command would be ./make_random.sh 100 10 1000 hg19.  This command would make 10 .bed files, each with 100 random 1 kb genomic regions from the hg19 genome build.  Below are the make_random.sh and make_random.py scripts.
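The full scripts are not reproduced here; a minimal sketch of the core sampling logic (a few hg19 chromosome lengths only, and no gap checking) is:

import random

chrom_lengths = {"chr1": 249250621, "chr2": 243199373, "chrX": 155270560}  # hg19

def random_region(size=1000):
    chroms = list(chrom_lengths)
    # Choose a chromosome with probability proportional to its length
    chrom = random.choices(chroms, weights=[chrom_lengths[c] for c in chroms])[0]
    # Choose a start uniformly, leaving room for the region size
    start = random.randint(0, chrom_lengths[chrom] - size)
    return chrom, start, start + size

for _ in range(5):
    print("%s\t%d\t%d" % random_region())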



Wednesday, January 29, 2014

Bar Plot with 95% Confidence Interval Error Bars in R

R is a great plotting program for making publication quality bar plots of your data.  I often need to make quick bar plots to help collaborators quickly visualize data, so I thought I would put together a very generalized script that can be modified and built on as a template to make future bar plots in R.

In this script you manually enter your data (as you would see it in a 2x2 table) and then calculate the estimated frequency and 95% CI around that frequency using the binom.confint function from the binom package.  Next, parametric and non-parametric p-values are calculated with the binom.test and fisher.test commands, respectively.  These statistics are then plotted using R's barplot function.  The example code is below along with what the plot will look like.  P-values and N's are filled in automatically, and the y limits are calculated to ensure the graph includes all the plotted details.
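A minimal sketch of the approach (the counts and group labels are illustrative):

library(binom)

# Manually entered 2x2 data: events and totals for two groups
x <- c(30, 45)     # events
n <- c(100, 100)   # totals

ci <- binom.confint(x, n, methods = "exact")   # frequencies and 95% CIs

p.par <- binom.test(x[1], n[1], p = x[2] / n[2])$p.value         # parametric
p.nonpar <- fisher.test(matrix(c(x, n - x), nrow = 2))$p.value   # non-parametric

ylim <- c(0, max(ci$upper) * 1.2)   # leave room for the error bars
bp <- barplot(ci$mean, names.arg = c("Group 1", "Group 2"),
              ylab = "Frequency", ylim = ylim)
arrows(bp, ci$lower, bp, ci$upper, angle = 90, code = 3, length = 0.1)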



Thursday, January 23, 2014

Genetic Simulation Resources

I just came across a useful repository of genetic simulation resources that I thought would make a good addition to Genome Toolbox.  The site, aptly called Genetic Simulation Resources (GSR), provides a detailed listing of over 80 useful software applications available for genetic simulations.  The NCI-sponsored catalog aids in scanning through available simulation programs, comparing similar applications, and quickly identifying the most appropriate software application for a particular study.  In addition to providing a description and external link, literature citations are also listed for many of the simulation packages.  If you are the developer of a genetic simulation resource that is not listed in the GSR repository, you can submit a request to add it.  Overall, a great resource for simulating data for genetic studies that may help you avoid reinventing the wheel by programming a de novo simulation routine.  A good first place to check.


Thursday, December 12, 2013

Sum Overlapping Base Pairs of Features from Chromosomal BED File

I had a .bed file of genomic features on a chromosome and wanted to determine the extent to which the features overlap, both to investigate commonly covered genes and to find positions where features were likely to form.  I wanted to generate a plot similar to a coverage depth plot from next-generation sequencing reads.  I am sure more efficient methods exist, but here is some Python code that takes in a .bed file of features (features.bed) and creates an output file (features.depth) with the feature overlap "depth" every 5,000 base pairs across the areas of your chromosomal .bed file that contain features.
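A minimal sketch of that binning approach:

from collections import defaultdict

BIN = 5000   # base pairs per bin
depth = defaultdict(int)

with open("features.bed") as bed:
    for line in bed:
        chrom, start, end = line.split()[:3]
        # Add one unit of depth to every 5 kb bin the feature touches
        for b in range(int(start) // BIN, int(end) // BIN + 1):
            depth[(chrom, b)] += 1

with open("features.depth", "w") as out:
    for (chrom, b), d in sorted(depth.items()):
        out.write("%s\t%d\t%d\n" % (chrom, b * BIN, d))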



Friday, November 15, 2013

Install R Package in User Account on UNIX Cluster

R is a wonderful open-source statistical program with new and exciting packages constantly being released.  Often UNIX cluster administrators don't have the time (or interest) to update or add new packages as they are released.  Here's where this blog post comes in.  By installing R packages in a directory on your UNIX account and loading them from there you can circumvent the need to have cluster administrators install these packages.  Briefly, here is how to do this:


(1) Install the R package either from the UNIX console or R console to a directory within your user account.
From the UNIX console:

R CMD INSTALL --library=/R_packages/ pkg.tar.gz

or

export R_LIBS=/R_packages/
R CMD INSTALL pkg.tar.gz

From the R console:

install.packages("pkg", lib = "/R_packages/")

In the above example, pkg is the R package you want to install and /R_packages/ is the directory you would like to install them in.


(2) To load the package for use in R, simply put the following command in your script.
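library(pkg, lib.loc = "/R_packages/")   # load pkg from your user-level library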

Thursday, November 7, 2013

How to Fully Utilize All Cores of a UNIX Compute Node

Parallelizing tasks can drastically improve the computation time of a program.  One way to do this is to ensure your code is utilizing all available cores of a processor.  To do this you need to write your code so that background tasks are carried out simultaneously, which is done by inserting an ampersand (&) at the end of a line of code.  If the number of background tasks running equals the number of cores of the compute node, then you are efficiently and fully utilizing the resource.  The final necessary piece of code is the wait command.  This tells the computer to wait until all the background tasks are completed before moving on to the next line of code.  The wait command comes in handy to ensure the number of background tasks submitted does not exceed the number of processor cores; if it does, you will likely overwhelm the processor with too many tasks.  To prevent this, the idea is to simultaneously submit a number of tasks equal to the number of cores and then use the wait command to wait for the jobs to finish before submitting more tasks.  Here is some example code I put together to utilize all 8 cores of a node when running 1,000 permutations of a program:
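A sketch (run_permutation.sh is a hypothetical stand-in for whatever command performs one permutation):

#!/bin/bash
# 125 batches of 8 simultaneous background tasks = 1,000 permutations
for i in $(seq 1 125); do
  for j in $(seq 1 8); do
    perm=$(( (i - 1) * 8 + j ))
    ./run_permutation.sh $perm &   # send this task to the background
  done
  wait   # block until all 8 background tasks finish
done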


Each background task runs a script (here the hypothetical run_permutation.sh) with commands along the following lines:
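#!/bin/bash
# run_permutation.sh (hypothetical): performs permutation number $1
i=$1
./my_program --seed $i --out perm_${i}.txt   # my_program is a placeholder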

Of course, this could be parallelized further by running permutations simultaneously on different compute nodes to speed up run time even more. Hope this is a helpful example to help you fully utilize compute nodes and cut down your processing time.

Wednesday, August 14, 2013

UCSC Tools for BigWig Files

Several UCSC data tracks are distributed in BigWig format (.bw).  This is a compressed format that cannot be read by common text editors; rather, you need to use utilities built to handle these files.  Several are available from the UCSC downloads site: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/.  Here are a few UNIX programs I have found useful:

bigWigAverageOverBed - computes the average signal over each region in a BED file
bigWigSummary - extracts summary statistics from a bigWig file over a specified region
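Typical invocations (file names and coordinates are illustrative):

bigWigAverageOverBed signal.bw regions.bed out.tab   # mean signal over each BED region
bigWigSummary signal.bw chr1 1000000 2000000 10      # 10 summary values for chr1:1000000-2000000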

Friday, May 31, 2013

How To Import Command Line Arguments into a Python Script

It is often handy to be able to feed an argument from the UNIX command line to a Python script.  This is very simple to do with the built-in sys module.  Below is an example of feeding three arguments (chr, start, end) from the command line into Python to be used as variables in a script.

In the script.py file:
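# script.py -- a minimal sketch; chrom is used to avoid shadowing Python's built-in chr()
import sys

chrom = sys.argv[1]        # first argument after the script name
start = int(sys.argv[2])
end = int(sys.argv[3])
print(chrom, start, end)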

At the command line:
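python script.py chr7 55086714 55275031   # illustrative chromosome and coordinates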