Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.

Wednesday, December 18, 2013

Copy and Transfer Files To or From SFTP Site

Transferring files to and from an SFTP site is relatively simple once you have generated a public/private key pair and had your account set up.

To log in to the SFTP server, type sftp username@server at the UNIX command prompt.  Once logged in, you can change directories on the SFTP site just as you would in UNIX: with the cd command.  You can also change directories on your local account using the lcd command.  Before transferring files from one server to the other, make sure you are in the correct local and remote directories that you want the files transferred between.

To copy a file from the SFTP server to your local host, use the get command.  For example, if you wanted to get the file ids.txt, you would type get ids.txt at the command prompt.  Conversely, to transfer a file from your local host to the SFTP site, use the put command.  sftp also supports wildcards (*), so if, for example, you wanted to transfer all .jpg files to the SFTP server, you would type put *.jpg at the command prompt.  Below are a few other useful commands.
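
A typical session might look something like this (the host and file names here are hypothetical):

sftp user@sftp.example.edu
sftp> cd /incoming/data
sftp> lcd ~/project
sftp> get ids.txt
sftp> put *.jpg
sftp> exit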

sftp Command   Description
cd dir         Change directory on the SFTP server to dir.
lcd dir        Change directory on your machine to dir.
ls             List files in the current directory on the SFTP server.
lls            List files in the current directory on your machine.
pwd            Print the current directory on the SFTP server.
lpwd           Print the current directory on your machine.
get file       Download file from the SFTP server to the current local directory.
put file       Upload file from your machine to the SFTP server.
exit           Exit from the sftp program.

Also, remember it is always a good idea to check the checksums of files you have downloaded against those posted on the SFTP site to ensure you downloaded the files in their entirety.

Generate Public/Private RSA Key Pair for SFTP

To access an SFTP server, an SSH public key must be created and shared with the SFTP host.  To generate a public/private key pair, use the ssh-keygen utility in UNIX.  It will calculate a key pair for you by following these steps:

(1) At the UNIX prompt, type ssh-keygen.  A message will appear indicating the application is generating the public/private RSA key pair.

(2) You will next be prompted to enter the location where you want to save the key.  You can specify a filename or just leave the space blank and press Enter.

(3) Afterwards you will be prompted to enter a passphrase to use as a password.  Enter a passphrase here, or simply hit Enter to proceed without specifying one.

(4) Finally, you will be prompted to re-enter your passphrase.
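
A typical session looks something like this (the paths will reflect your own account):

$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.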

That is all there is to creating a public and private key pair.  To set up your account on the remote SFTP server, the administrator will need you to forward a copy of your public key.  The public key will be a file with one line that includes "ssh-rsa", a long key code, followed by your account name and host name (ex: user@host.school.edu).  The file extension will be .pub.

Thursday, December 12, 2013

Generate Coverage Plot in R from Depth Data

Here is a simple way to take coverage data (coverage.depth) and make a plot in R to visualize it.  All you need is a column with base pair coordinates (V1) and a column with the respective depth (V2), and you are all set to plug it into the R code below.
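
A minimal sketch, assuming coverage.depth is whitespace-delimited with no header:

# read the depth file and plot depth against position
cov <- read.table("coverage.depth", header = FALSE)
plot(cov$V1, cov$V2, type = "h", col = "steelblue",
     xlab = "Base pair position", ylab = "Depth",
     main = "Coverage plot")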

Sum Overlapping Base Pairs of Features from Chromosomal BED File

I had a .bed file of genomic features on a chromosome and wanted to figure out the extent to which the features overlapped, both to investigate commonly covered genes and to find positions where features tended to cluster.  I wanted to generate a plot similar to a coverage depth plot from next-generation sequencing reads.  I am sure more efficient methods exist, but here is some Python code that takes in a .bed file of features (features.bed) and creates an output file (features.depth) with the feature overlap "depth" every 5,000 base pairs across the areas of your chromosomal .bed file that contain features.
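
A rough sketch of the idea (assuming a three-column, tab-delimited features.bed for a single chromosome):

# read feature intervals from the .bed file
features = []
with open("features.bed") as f:
    for line in f:
        fields = line.split()
        features.append((int(fields[1]), int(fields[2])))

bin_size = 5000
start = min(s for s, e in features)
end = max(e for s, e in features)

# for each 5,000 bp bin, count how many features overlap its midpoint
with open("features.depth", "w") as out:
    pos = start
    while pos < end:
        mid = pos + bin_size // 2
        depth = sum(1 for s, e in features if s <= mid < e)
        out.write("%d\t%d\n" % (mid, depth))
        pos += bin_size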



Create Triangle Plot from Inferred Genetic Ancestry

I previously posted on how to infer ancestry for a group of study participants using SNP genotypes.  Today, I want to visually plot some of the output in R.  Two informative plots that can be generated from the output are a standard plot of the two eigenvectors with percent ancestry overlaid and a triangle (or ternary) plot with each axis representing percentage of one of the three ancestral populations (ex: European, African, and Asian).

Here is some simple code to plot this in R.  Base R has no function for triangle plots, so the plotrix package will need to be installed first.  The ancestry.txt file is the output file from SNPWEIGHTS, but other output could be formatted to work as well.
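
A sketch of the triangle plot; which columns of ancestry.txt hold the three ancestry fractions depends on your SNPWEIGHTS output, so columns 5-7 here are an assumption to adjust:

library(plotrix)
anc <- read.table("ancestry.txt", header = FALSE)
# three columns of ancestry fractions (rows should sum to 1)
fractions <- as.matrix(anc[, 5:7])
colnames(fractions) <- c("European", "African", "Asian")
triax.plot(fractions, main = "Inferred ancestry", pch = 20)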


The output should look similar to the plots below.



Tuesday, December 3, 2013

List Contents of a .tar.gz File

Here is a quick one liner to see the contents of a .tar.gz file:
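
tar -tzvf file.tar.gz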

This will list all the files in the .tar.gz tarball.  The v option can be left out to suppress the file details.

Extract Only Desired Files from Compressed File

Lately I have been working with some large tarballs (tar.gz files).  To access the compressed contents, I used to extract every file in the archive.  This was time consuming, made it a pain to hunt through everything for the files I wanted, and required even more time to delete what I didn't need.  I knew there was a better way to get just the files I wanted.  Sure enough, the basic tar command has options that allow you to extract just a file of interest, or even a set of files using wildcards.  The command looks like this:
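
tar -zxvf big_file.tar.gz path/your_file.txt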

where,
big_file.tar.gz is the tarball you are extracting from
path is the path to your file in the tarball
your_file.txt is the file you want to extract

This can also be done using wildcards.  For example, if I wanted all text files in the above path I could substitute *.txt for your_file.txt.
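
With GNU tar, quote the pattern so the shell does not expand it, and add the --wildcards option if your version requires it:

tar -zxvf big_file.tar.gz --wildcards 'path/*.txt'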

Thursday, November 28, 2013

Happy Thanksgiving

                               .-"'-. _
                             ,'     /`\ .-"` `'.
                            /      /  '`\       \
                           .     ,'      `.      ;
                          _|   .'          `.    |_
                       ,'` '-'`              `'-.' \
                       |     \        Oo .-,_   /   | ___
                  .--. '      `..--'"""''"~-.\-'   ,-'   `.
                /`    `.\     (             | ;   /        |
               ;        `.     \            ;  |           '
               |                `.         .\   ;         /_,.
               '            ,-''-.`'-...-'`.'   |'-,    ,-'   `\
                `._        |      \| |  |  ||  /    |           |
              _.--"'`   __ \       ' |  |   `-'     /.--.       /
            ,'       ,-'  `'.        |  ;          .'    '    _'
            |       /        `     ,-;   \        '      |  '"---.
            ;      |          ,--'` /    |`'----.,       '        `\
             \_    \        _(     /     |        )    .'          |
              .7'-, '.    (`      /      |'      `-,-'`           /
             /        `-._/     ,'       ;          )          _.'
            |            |     /          \        r,    ___.-'
             \           \    /            `.        )--''
              `-._____,.-'`---|              `-.__.-'
                              \                 /
  _   _  __  ___   ___ __   __ '.             .'
 | |_| |/  \|  _`\|  _`\ \ / /   `--._____.--'
 |  _  | |\ | |_) | |_) \ ' / ___    || ||    ___
 | | | |  _ |  __/|  __/ ' |_(_..`'--'| |'--'`.._)_
 |_|_|_|_||_|_|__ |_|  _ |(___,.----'"` `"'----.,__)  _  __   _
 |_   _| | | |/  \| \ | | |/ / / _|/ _| | | || | | \ | |/ _| | |
   | | | |_| | |\ |  \| | ' / | | | /   | | || | |  \| | /   | |
   | | |  _  | || |   ' |  ;   \ \| | _ | | || | |   ' | | _ | |
   | | | | | |  _ | .   | . \   | | || \| | '/ | | .   | || \|_|
   | | | | | | || | |\  | |\ \ _/ | |_| | |\   | | |\  | |_| |_
   |_| |_| |_|_||_|_| \_|_| \_|__/ \___/|_| \_/|_|_| \_|\___/|_| 

Monday, November 25, 2013

Installing Tabix on UNIX

Tabix is an incredibly useful tool that indexes large .vcf files and makes it very efficient to find a particular genomic region in the file.  It is particularly useful for downloading small portions of files from publicly available ftp data repositories.  Here is a brief tutorial on how to install Tabix on a UNIX operating system.

(1) Go here to download the newest release.
(2) Extract the file:
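
# assuming the tabix-0.2.6 release; adjust to the version you downloaded
tar -xjf tabix-0.2.6.tar.bz2
cd tabix-0.2.6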

(3) Compile the program by typing make on the UNIX command line.
(4) Export the path by adding the following line to your .bashrc file, saving your .bashrc file, and typing source ~/.bashrc on the UNIX command line.  Note: path_to_tabix is the directory where tabix is installed.
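
export PATH=$PATH:/path_to_tabix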

Friday, November 22, 2013

Find SNP Overlap from Different SNP Arrays

If you have .bed files for array manifests it is relatively easy to use UNIX to compare the overlap between SNP positions in the two files.  Here is a quick one-liner to do so:
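
A minimal sketch using comm, which prints the rows common to both files (the file names here are placeholders; both inputs must be sorted):

comm -12 <(sort array1.bed) <(sort array2.bed) > shared_positions.bed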


This essentially finds all rows that are identical in the two files.  It requires that the rows within each file be unique, and it will only work for single-base-pair features, since the positions must match exactly.

One Line Command to Remove Header from File in UNIX

Sometimes you need to remove a header, footer, or range of lines from a file to better manipulate it in UNIX.  Here are some quick one-liners to do so:
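
A few standard variants (file names are placeholders):

# remove the first line (header)
tail -n +2 file.txt > file_noheader.txt
# or, equivalently, with sed
sed '1d' file.txt > file_noheader.txt
# remove the last line (footer)
sed '$d' file.txt > file_nofooter.txt
# remove a range of lines (ex: lines 2 through 4)
sed '2,4d' file.txt > file_trimmed.txt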



Install liftOver Locally on UNIX

Many of the UCSC genome tools are available for download for use locally on your UNIX system.  liftOver is an example of one such tool.  To download, go to their apps download page, select your operating system, and then click on the liftOver link.  Here are links for liftOver_linux.x86_64 and liftOver_linux.x86_64.v287.  For some reason, I could only get the linux.x86_64.v287 version to work on my system.  Once downloaded, make it executable.
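
For example:

chmod +x liftOver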

The next step is to get the required chain files needed to convert from one genome build to another.  All UCSC genome builds are listed here and you can select the desired "LiftOver files" link under the genome build that you want to convert from.  This will take you to a downloads page with links to the chain files.  Chain files are appropriately named so you know what builds you are converting to and from.  For example,  hg19ToHg18.over.chain.gz is the chain file needed to convert from hg19 to hg18.  Once downloaded, unzip the file for use.
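
For example:

gunzip hg19ToHg18.over.chain.gz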

To run liftOver, the usage is:
liftOver oldFile map.chain newFile unMapped

where:
oldFile is the file you want to convert from
map.chain is the chain file used to convert from one build to another
newFile is the converted file you want to create
unMapped is a file that contains all the unmapped positions

For more details on usage, just type liftOver in the command line.

Make BED File from Illumina Manifest

To compare coverage across SNP arrays, it is handy to convert Illumina manifest files into .bed files for easy manipulation.  To do this, the first step is to download the desired manifest file from the Illumina website.  Here is a site that lists the downloads available for each array they support.  Once you find the array, select the manifest file (I usually choose the CSV file) and download it from the browser, or copy the link and use wget.  Here is an example for the Methylation450 array:
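
# the URL is a placeholder; paste the actual link copied from the Illumina downloads page
wget 'http://<link-from-illumina-downloads-page>/HumanMethylation450_manifest.csv'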


Next, look at the manifest file and determine which columns hold the chromosome and the base pair coordinate (usually named something like Chr and Coordinate).  Put these column numbers into the line of UNIX code below and you will have a .bed file from the manifest file.  Here I am making a file only for the X chromosome, but this can be modified to fit your respective needs as well.
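
A sketch, assuming a comma-separated manifest with the chromosome in column 12 and the coordinate in column 13 (check your file and adjust; the numeric test on $13 skips the manifest's header lines):

awk -F',' '$12 == "X" && $13 ~ /^[0-9]+$/ {print "chr"$12"\t"$13-1"\t"$13}' manifest.csv > manifest_chrX.bed

Note that BED start coordinates are 0-based, hence the $13-1 for the start column.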


You should be all set with a .bed file from your manifest file.  Just ensure that when you are making comparisons you are using the correct genome build.

Thursday, November 21, 2013

10K

Today Genome Toolbox surpassed 10,000 page views!  A great milestone which has yet to be achieved by any of my other blogs.  Thanks to all who have visited the site and for the continued encouragement to keep adding content.  Stay tuned for more tools and tips relevant to genomic analysis.

Wednesday, November 20, 2013

Remove Duplicate Records from Excel Spreadsheet

I stray away from using Microsoft Excel for data management, but sometimes it is necessary to do quick manipulations on a relatively small dataset.  For this purpose, Excel shines.  One task I assumed Excel could do, but never knew exactly how to do, was filtering duplicate records out of a dataset.  After some Googling, I found out how:

(1) Select the column headers and rows in the worksheet you want to remove duplicate records from.
(2) Go to the Data tab and select Advanced Filter.
(3) For action click the Filter the list, in-place radio button and check the box for Unique records only.
(4) Press OK.

You should now have a filtered worksheet with duplicate rows removed.  This can be copied and pasted to a new worksheet for further manipulation.  Hope this was helpful.  If you have improvements on the method or further questions, please add them in the comments section below.

Tuesday, November 19, 2013

Confirm Checksum of a Downloaded File

When copying data from SFTP, FTP, or other websites you may have noticed that a lot of these large files and program packages have checksum values posted.  These are essentially "digital fingerprints" to ensure what you have downloaded is truly what you wanted, complete and intact.  So how exactly do you find the checksum value of the file you downloaded?  This is relatively simple to do in UNIX.  First download the file and go to the directory the file is stored in.  Then see what type of checksum was published for the file.  Here are a few popular checksum types and the UNIX commands used to calculate them:

Checksum - UNIX command
md5 - md5sum
sha1 - sha1sum
sha256 - sha256sum

Pretty simple command structure.  Usually you just add "sum" after the checksum name and there is an aptly named UNIX command that will perform the function.  As far as usage goes, simply type the command followed by the file name you want to check the checksum for.
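
For example:

md5sum filename.txt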


Alternatively, to have the checksum program check the sum for you, use the -c (or --check) option with the known checksum value pasted into the command.  See the below example (Note the two spaces between the checksum and the file name and the dash at the end of the command).
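
For example (the checksum value here is a hypothetical placeholder; paste in the published one):

echo "5d41402abc4b2a76b9719d911017c592  filename.txt" | md5sum -c -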


If the checksums are equal you will get "filename.txt: OK".  If not, you will get the error "WARNING: 1 of 1 computed checksum did NOT match".  Hope this is helpful for checking file integrity.  Please add suggestions you have in the comments section below.

Color Features in UCSC Genome Browser Custom Track with RGB BED File Field

Custom tracks are incredibly useful in visualizing your data in the UCSC genome browser.  All that is needed is for your data to be in a compatible format.  I typically use .bed files.  The first three columns of .bed files are required fields for chromosome, start, and stop position:

(1) chrom - name of the chromosome or scaffold.  Chromosome names can be given with or without the 'chr' prefix.
(2) chromStart - Start position of the feature in standard chromosomal coordinates (i.e. first base is 0).
(3) chromEnd - End position of the feature in standard chromosomal coordinates

The next six fields are optional. Note that columns cannot be empty (ie: lower-numbered fields must always be populated if higher-numbered ones are used).

(4) name - Label to be displayed under the feature.
(5) score - A score between 0 and 1000.
(6) strand - defined as + (forward), - (reverse), or . (not applicable).
(7) thickStart - field used by UCSC drawing code, typically same as chromStart.
(8) thickEnd - field used by UCSC drawing code, typically same as chromEnd.
(9) itemRgb - an RGB color value (e.g. 0,0,255).

Coloring elements is fairly straightforward.  First define RGB values for each track feature in the itemRgb field (9th column) and ensure fields 1-8 have appropriate values.  Then, upload your track as a custom track (see above link) to UCSC.  Once uploaded, click on the track name (usually "User Track") and edit the configuration to include: itemRgb="On".  Other configuration values can be changed as well:

name - unique name to identify this track in the custom tracks list.
description - label to be displayed above the track in Genome Browser.
priority - integer defining the order in which to display tracks, if multiple tracks are defined.
useScore - a value from 1 to 4, which determines how scored data will be displayed. Additional parameters may be needed.
itemRgb - if set to 'on' (case-insensitive), the individual RGB values defined in tracks will be used.

Here's an example configuration:
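
Something along these lines; the track name, coordinates, and color are hypothetical, and the .bed fields are tab-separated:

track name="myFeatures" description="Features colored individually" itemRgb="On"
chr8	128362121	128362621	feature1	0	.	128362121	128362621	255,0,0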

Encode Features with no Strand Information in UCSC BED Format

The UCSC Genome Browser is a powerful tool that allows you to visualize your own data in custom tracks alongside scores of other publicly available data tracks.  To use some of the more advanced features for plotting .bed files, strand information is a required field in the .bed files.  Usually, I would just arbitrarily assign a "+" for the strand information to indicate the feature was on the plus strand, when in actuality there is no relevant strand for the feature.  This results in arrows (>>>>>>) being drawn over the feature in Genome Browser; not the end of the world, but still not the polished look I was going for.  To circumvent this, I tried to leave the field blank when uploading a custom track, however, I would get the error: "Error line 1 of custom track: Expecting + or - in strand".  Then I stumbled on a simple fix that doesn't seem to be documented well.  All Genome Browser needs is a period (.) in the strand field of the .bed file for each feature and all seems to work fine.  The result: a sharp looking feature track in Genome Browser.

How to Get RGB Values for a Color in R

For continuity in creating presentations and figures it is often necessary to have color schemes that are consistent for certain features, especially when different plots and figures of these features are made using different software.  R includes a nice palette of pre-defined colors for easily implementing rich colors in graphics.  In order to extract the RGB values (ie: red, green, blue) to use these colors in other programs, R has a nice built-in command that extracts the RGB values: col2rgb.  Simply put your color name in quotes and parentheses and R will return the RGB values.  Here is an example.
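
col2rgb("steelblue")
#       [,1]
# red     70
# green  130
# blue   180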

Friday, November 15, 2013

Install R Package in User Account on UNIX Cluster

R is a wonderful open-source statistical program with new and exciting packages constantly being released.  Often UNIX cluster administrators don't have the time (or interest) to update or add new packages as they are released.  Here's where this blog post comes in.  By installing R packages in a directory on your UNIX account and loading them from there you can circumvent the need to have cluster administrators install these packages.  Briefly, here is how to do this:


(1) Install the R package either from the UNIX console or R console to a directory within your user account.
From UNIX console:
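
# assuming a downloaded source tarball pkg.tar.gz
R CMD INSTALL --library=/R_packages/ pkg.tar.gz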
or
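
# set the default library location first, then install
export R_LIBS=/R_packages/
R CMD INSTALL pkg.tar.gz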

From R console:
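
install.packages("pkg", lib="/R_packages/")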

In the above example, pkg is the R package you want to install and /R_packages/ is the directory you would like to install them in.


(2) To load the package for use in R, simply put the following command in your script.
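
library(pkg, lib.loc="/R_packages/")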

How to Convert Number Values that Include Commas From Character to Numeric in R

Sometimes when importing data into R, especially from Excel spreadsheets, there are comma separators between every three digits to more clearly delineate the size of the number.  While this is convenient visually, when importing such numbers into R it can be a bit of a headache trying to convert these numbers from character format into an analysis friendly numeric format.  Here is a quick one liner of code to convert such an imported variable from character to numeric:
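
# x is the imported character variable
x <- as.numeric(gsub(",", "", x))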


Essentially, this code first removes all commas by replacing them with nothing, then converts the numbers from character format to numeric format, and finally replaces the original variable with the new one.

How to Clear the Console in R

Sometimes the console gets a bit messy in R and it's nice to clear things out to ensure a program is running correctly and without any errors or warnings.  I have found two ways to clear the console.

Keyboard shortcut:
Ctrl+L in Windows
Command+L in Mac

Programmatically:
cat(rep("\n",64)) essentially adds a lot of character returns

There are probably a host of other ways to do this programmatically, so please include suggestions in the comments.  For example, I saw cat("\014") suggested when Googling this, but the code does not seem to work on my PC.

Tuesday, November 12, 2013

Relationship between Odds and Probability

It's a simple mathematical relationship, but sometimes I forget.

Odds=p/(1-p)=proportion of success/proportion of failure
p=O/(1+O)=proportion of successes

If there is a probability p of something happening, then the odds (O) is, on average, the number of successes you expect per failure.  High odds correspond to high probabilities and low odds correspond to low probabilities.  As the probability approaches zero, the odds are approximately equal to p.  Probabilities are bound to [0,1], while odds are bound to [0,∞).
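
For example, p = 0.75 gives O = 0.75/0.25 = 3 (three successes expected per failure), and converting back, p = 3/(1+3) = 0.75.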

Friday, November 8, 2013

How to Estimate a P-value from a Confidence Interval

Confidence intervals (CIs) are useful statistical calculations to help get a level of certainty around an estimated effect size.  Whenever possible, I advocate to include a CI when reporting an estimated effect size.  Sometimes, however, it is of interest to back calculate a p-value from a confidence interval if the p-value is not reported in the manuscript.  To do so, we need to remember the basic equations for the confidence interval and the calculation of a p-value.  Assuming we are dealing with a 95% CI, we would take the effect size and subtract/add 1.96 times the standard error of the effect size to get our lower and upper bounds of the confidence interval.  For the p-value, we just take the effect estimate and divide it by the standard error of the effect estimate to get a z score from which we can calculate the p-value.  Therefore, if we are given an effect size and confidence interval all we need to do is back calculate the standard error and combine that with the effect size to get the z score used to calculate the p-value.  Below are two examples to illustrate how to do this.

Suppose we have an estimate of a risk difference and a respective 95 percent confidence interval of 3.60 (0.70, 6.50).  Here are the steps to follow:
(1) Subtract the lower limit from the upper limit to get the difference and divide by 2: (6.50-0.70)/2=2.9
(2) Divide the difference by 1.96 (for a 95% CI) to get the standard error estimate: 2.9/1.96=1.48
(3) Divide the risk difference estimate by the standard error estimate to get a z score: 3.60/1.48=2.43
(4) Look up the z score using Python, R (ex: 2*pnorm(-abs(z))), Excel (ex: 2*(1-NORMSDIST(z))), or an online calculator to get the p-value.  Usually the two-sided p-value is reported: p=0.015 (two-sided)


For an odds ratio, things are a bit trickier because we need to first take the natural log of the estimate and 95% confidence interval before we can carry out the back calculation of the standard error for calculating the p-value.  Suppose we have an odds ratio and 95 percent confidence interval of 1.28 (1.05, 1.57).  Here are the steps to follow:
(1) Take the natural log (ln) of each value in the 95% CI: 0.25 (0.05, 0.45)
(2) Subtract the lower limit from the upper limit and divide by 2: (0.45-0.05)/2=0.2
(3) Divide the difference by 1.96 (for a 95% CI) to get the standard error estimate: 0.2/1.96=0.10
(4) Divide the log odds ratio by the standard error estimate to get a z score: 0.25/0.10=2.50
(5) Look up the z score using Python, R (ex: 2*pnorm(-abs(z))), Excel (ex: 2*(1-NORMSDIST(z))), or an online calculator to get the p-value.  Usually the two-sided p-value is reported: p=0.012 (two-sided)
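
As a quick sanity check, here is the odds ratio example worked through in R:

or <- 1.28; lo <- 1.05; hi <- 1.57
se <- (log(hi) - log(lo)) / (2 * 1.96)   # back-calculated standard error
z  <- log(or) / se                       # z score
2 * pnorm(-abs(z))                       # ~0.012, two-sided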

Hopefully these are good examples to get you started.  As you can imagine you can also go from a p-value to a 95% confidence interval by extending these methods in the opposite direction, but in practice it is somewhat unlikely an author would report an effect size and p-value while leaving out the 95% confidence interval.

Python Function to Calculate P-value from Z score

Every time I needed to quickly convert a z score to a p-value, I would have to search for an online calculator or remind myself how to do it in R or Excel.  Getting tired of periodically having to do this, I built a simple Python function to do it for me.  Here's the code if you find it helpful:
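
A minimal sketch of such a function (the name is arbitrary; math.erfc avoids needing scipy):

import math

def z_to_p(z):
    """Return the two-sided p-value for a z score."""
    return math.erfc(abs(z) / math.sqrt(2))

print(z_to_p(2.43))   # ~0.0151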




Thursday, November 7, 2013

How to Fully Utilize All Cores of a UNIX Compute Node

Parallelizing tasks can drastically improve the computation time of a program.  One way to do this is to ensure your code is utilizing all available cores of a processor.  To do this, write your code in such a way that background tasks are carried out simultaneously, by inserting an ampersand (&) at the end of a line of code.  If the number of background tasks running equals the number of cores of the compute node, then you are efficiently and fully utilizing the resource.  The final necessary piece of code is the wait command, which tells the computer to wait until all the background tasks are completed before moving on to the next line of code.  The wait command comes in handy to ensure the number of background tasks submitted does not exceed the number of processor cores; if it does, you will likely overwhelm the processor with too many tasks.  To prevent this, the idea is to simultaneously submit a number of tasks equal to the number of cores and then use the wait command to let the jobs finish before submitting more tasks.  Here is some example code I put together to utilize all 8 cores of a node when running 1,000 permutations of a program:
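
A sketch of the idea (permute_program is a hypothetical stand-in for your own program):

#!/bin/bash
# run 1,000 permutations, 8 at a time (one per core)
for i in $(seq 1 1000); do
    ./permute_program $i > perm_${i}.out &   # launch in the background
    if (( i % 8 == 0 )); then
        wait   # let the current batch of 8 finish before launching more
    fi
done
wait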


This will run a script with the following commands:
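
./permute_program 1 > perm_1.out &
./permute_program 2 > perm_2.out &
# ... and so on through ...
./permute_program 8 > perm_8.out &
wait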

Of course, this could further be parallelized to run permutations simultaneously on different compute nodes as well to further speed up run time. Hope this is a helpful example to help you fully utilize compute nodes and speed up your processing time.

Wednesday, August 14, 2013

UCSC Tools for BigWig Files

Several UCSC data tracks are downloaded in BigWig format (.bw).  This is a compressed file that cannot be read by common text editors.  Rather, you need to use utilities built to handle these files.  Several are available at the ftp website: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/.  Here are a few UNIX programs I have found useful:

bigWigAverageOverBed
bigWigSummary
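
For example, bigWigSummary can report the average signal over a region (the file name and coordinates here are hypothetical; the final argument is the number of summary data points to return):

bigWigSummary track.bw chr8 128362121 129551294 1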

Download Data Track from UCSC Genome Browser

Did you know you can download tracks you visualize in the UCSC Genome Browser for personal use and analysis?  Here's how to do it.

(1) Click on the grey bar on the far left side of the UCSC data track.  This will bring you to the track settings page.
(2) Click on the link called View table schema.  This will bring up a new page with track information and a description of data fields.
(3) Look for the field called Primary Table and copy the name.
(4) Go to the UCSC FTP site (link) and find the correct genome build you are after.  Usually you will want to select hg19 or hg18.
(5) Click on the database link and then search for the name of the field you copied from the Primary Table field in step 3.
(6) There will usually be a .sql and a .txt.gz file for most tracks.  You are interested in the .txt.gz file.  You can click on it to download via your internet web browser or right click on the link to copy the web address and use the wget command to download it.  Here's an example script to download the NHGRI GWAS catalog using the wget command:
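
# assuming the hg19 build; gwasCatalog is the Primary Table name from step (3)
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/gwasCatalog.txt.gz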
(7) Extract the compressed .txt.gz file with the following command, where filename is the name of the file you downloaded:
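
gunzip filename.txt.gz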

This method should work for downloading the majority of the UCSC data tracks.  Sometimes it takes a bit of digging around the UCSC FTP site to find the dataset you are looking for, but in most cases I have been successful in finding it on the UCSC FTP site.

One final note:  If you are interested in downloading only a small portion of the track (for example, just a region on chromosome 8), you can download this region using the UCSC Table browser.  Here's how to do this:

(1) Follow steps (1) and (2) above.
(2) Once on the Table Schema page for the track of interest go to the link bar on the top of the page and select Tools > Table Browser.  This will take you to the UCSC Table Browser where all the fields will already be filled in with the track you are interested in.
(3) To download your region of interest click on the radio button next to position and type in your desired coordinates (ex: chr8:128362121-129551294).
(4) Make sure all the filters are cleared and give your output a filename.  Select get output and your file will be downloaded.  There is no need to unzip unless you chose the gzip compressed option.

Best wishes and good luck analyzing UCSC data tracks!

Wednesday, July 24, 2013

Converting Genotype Files to EIGENSTRAT format

Need to convert your genotyping files into EIGENSTRAT format?  Here is a quick primer to do so.  If you haven't already, download the EIGENSTRAT package and extract the contents.  We will be using the convertf tool to convert your genotypes to EIGENSTRAT format.  The convertf tool can convert to/from ANCESTRYMAP, EIGENSTRAT, PED, PACKEDPED, and PACKEDANCESTRYMAP files.

First, check to see if your convertf program is working out of the box by going to the /bin directory and typing in ./convertf.  Mine was not working (error message: ./convertf: /lib64/libc.so.6: version `GLIBC_2.7' not found (required by ./convertf)).  My quick fix was to try to recompile things based on the included C code, but going to the src directory and typing make convertf did not seem to do the trick.  Skimming through the README file I saw that you can remake the entire package by again going to the src directory and typing make clobber and then make convertf (the recommended make install produced errors on my system).  Typing ./convertf finally returned the "parameter p compulsory" output indicating the program is now working.  (Note: make clobber will remake the entire package and likely require you to remake other packages you need to use, ex: make mergeit, make eigenstrat, etc.)  I am not a UNIX guru, so if you know of a better way to get convertf running, please let me know in the comments.

Now that we have a running version of convertf, we can begin making a parameter file (ie: the par.PED.EIGENSTRAT file) for converting from one file format to another.  The included README file in the CONVERTF directory is instrumental for doing this.  Since .ped and .map files are ubiquitous, I will use this as an example to convert to EIGENSTRAT format.  Below is a parameter file to do so, but again converting to/from other formats is possible as well.
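
A parameter file along these lines should work (the file names are placeholders; the exact layout is described in the CONVERTF README):

genotypename:    data.ped
snpname:         data.map
indivname:       data.ped
outputformat:    EIGENSTRAT
genotypeoutname: data.eigenstratgeno
snpoutname:      data.snp
indivoutname:    data.ind
familynames:     NO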


Finally, save this file as par.PED.EIGENSTRAT and run it by simply going into the directory with your working convertf program (mine was in the /EIG5.0/src directory) and put in the command ./convertf -p dir/par.PED.EIGENSTRAT, where dir is the directory your par.PED.EIGENSTRAT file is saved in.  You can essentially call the par.PED.EIGENSTRAT file any name, but this name meshes nicely with that in the samples that come with EIGENSTRAT.  Hope this is helpful, feel free to comment with questions.

How to Infer Ancestry from SNP Genotypes

Self-reported ancestry is a poor metric to use when attempting to statistically adjust for the effects of ancestry, since some individuals falsely report their ancestry or are simply unaware of their true ancestry.  Worse yet, sometimes you don't even have information collected on an individual's ancestry.  As many of you know, if you have SNP genotyping data you can rather precisely estimate the ancestry of an individual.  Classically, to do this one needed to combine genotypes from their study sample with genotypes from a reference panel (eg: HapMap or 1000 Genomes), find the intersection of SNPs in each dataset, and then run a clustering program to see which samples clustered with the reference ancestral populations.  Not a ton of work, but a minor annoyance at best.  Luckily, a relatively new program was just released that, in essence, does a lot of this groundwork for you.  It is called SNPWEIGHTS and can be downloaded here.  Essentially, the program takes SNP genotypes as input, finds the intersection of the sample genotypes with reference genotypes, weights them based on pre-configured parameters to construct the first couple of principal components (aka eigenvectors), and then calculates an individual's percentage ancestry for each ancestral population.  Here is how to run the program.

First, make sure you have Python installed on your system and that your genotyping data is in EIGENSTRAT format.  A brief tutorial to convert to EIGENSTRAT format using the convertf tool is here.

Next, download the SNPWEIGHTS package here and a reference panel. I usually use the European, West African and East Asian ancestral populations, but there are other options on the SNPWEIGHTS webpage as well.

Then, create a parameter file with directories of input files, your input population (designated "AA", "CO", and "EA"), and a output file.  An example of one is below:
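
A sketch of what the parameter file looks like (the exact keywords are listed in the SNPWEIGHTS README; the file names here are placeholders):

geno:   data.eigenstratgeno
snp:    data.snp
ind:    data.ind
snpwt:  snpwt.CO
output: ancestry.txt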


Finally, run the program using the command inferancestry.py --par par.SNPWEIGHTS.  For the program to run correctly, make sure the inferancestry.info and snpwt.co files are in the same directory as your inferancestry.py file.

For more details, see the SNPWEIGHTS paper or the README file included in the SNPWEIGHTS zip folder.  For code on generating eigenvector plots with overlaid ancestry percentages and triangle plots according to percent ancestry, see this post.

Thursday, July 18, 2013

What Are SNP Ambiguity Codes and What Do They Mean?

That's a good question.  In fact one that I had myself.  Here's what I found:

Apparently single nucleotide polymorphism (SNP) ambiguity codes were defined by the International Union of Pure and Applied Chemistry (IUPAC) to denote nucleotide changes in SNPs.  Here is a table of the meaning of each code, taken from the Ensembl SNPView website.

IUPAC Code   Mnemonic     Meaning          Complement
A            Adenine      A                T
C            Cytosine     C                G
G            Guanine      G                C
T/U          Thymidine    T                A
K            Keto         G or T           M
M            Amino        A or C           K
S            Strong       C or G           S
W            Weak         A or T           W
R            Purine       A or G           Y
Y            Pyrimidine   C or T           R
B            not A        C, G, or T       V
D            not C        A, G, or T       H
H            not G        A, C, or T       D
V            not T or U   A, C, or G       B
N            any          G, A, T, or C    N

How to Get a cDNA Sequence for a Gene Transcript

A paper I was reading had coordinates for mutations in a gene based off the gene's complementary DNA (cDNA) sequence.  In order to dig deeper into the finer details of the paper, I wanted to download the cDNA sequence.  Here's how I did it.  First go to Ensembl and search Human (or whatever other organism you are studying) for the gene you want the cDNA sequence for.  On the next results summary page, click on the link to the gene.  Then on the result in detail page, click on the gene ID.  You should notice the next page is essentially a tab labeled with the gene you searched for.  On this page should be a table of transcripts for your gene.  Click on the transcript ID of the transcript you want the cDNA sequence for.  If you aren't sure which transcript you want, the one with the CCDS ID is a good start.  A new page will be loaded where you will be on a new tab for the transcript you selected.  On the left side of the tab is a menu of different links.  Find cDNA in this menu under Sequence and click on it.  This will lead you to a page with the cDNA sequence.  Congrats, you now are able to get the cDNA sequence for a gene transcript.

Wednesday, July 17, 2013

2K

Today Genome Toolbox hit a cumulative total of 2,000 page views over the life of the blog!  It's exciting to see so much interest in genomic analysis tools.  Thanks for stopping by and keep coming back to check out what's new.

How to LD Thin a List of SNPs

Okay, you have a list of SNPs, many of them in high linkage disequilibrium (highly correlated) with other SNPs in the list.  Your goal is to generate a new list of more or less independent SNPs.  To do this, you need to LD prune the original starting list, keeping only one SNP from each region of high LD.  Essentially you set a threshold, say R2 > 0.50, and filter out SNPs correlated with other SNPs beyond this threshold, as long as at least one SNP is kept to tag each region of LD.

I feel like there should be an online resource that does this based on SNP correlations in a reference population (say HapMap or 1000 Genomes), but I cannot find such a resource.  Please leave a comment below if you are aware of one.  So instead, you need an actual dataset (for example: .ped and .map files) to feed as input to an LD thinning program.  Current programs use the LD structure within the input population and return a list of LD thinned SNPs.

PLINK is the first place I usually go to do this.  The program has two options for LD thinning: based on variance inflation factor (by regressing a SNP on all other SNPs in the window simultaneously) and based on pairwise correlation (R2).  These are the --indep and --indep-pairwise options, respectively.  Here is code I use:
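
A sketch of both variants; the window size of 50 SNPs, step of 5, and thresholds below are common values I am assuming, so adjust to taste:

# pairwise r2 pruning
plink --file data_in --indep-pairwise 50 5 0.5 --out data_out
# or the variance inflation factor version (VIF threshold of 2)
plink --file data_in --indep 50 5 2 --out data_out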


For details on the options click here.  Essentially this takes a data_in.ped and a data_in.map file and outputs two files: a data_out.prune.in file (which contains the LD pruned SNPs to keep) and a data_out.prune.out file (which is a list of SNPs to remove).  PLINK is easy and relatively quick, taking only a few minutes on a list of SNPs that spanned 4 megabases.

Alternatively, GLU also has a LD thinning feature built in.  This requires one to convert their .ped and .map files into a .ldat file (for details click here).  It also requires a snplist file with a header containing a "Locus" and "score p-value".  It was a little more work up front and seems to be optimized for analyzing GWAS association results, but I created such a file with the score p-value equal to one for all the SNPs.  I am assuming if I used real p-values in the list the program may prioritize the pruning process to keep SNPs with the lowest p-values.  There also appears to be an option to force in certain loci (--includeloci).  To run the thinning procedure in GLU, I used the following code:


For more details and options, click here.  While this takes a bit longer to run and the documentation is rather sparse on how the thinning is performed (it seems to be pairwise windows of R2), it does provide a nice output file that clearly details which SNPs are kept and why each SNP was removed.

Recently, I became aware of PriorityPruner (thanks Christopher).  This is a java program that is compatible with any OS (Windows, UNIX, Mac) as long as you have Java installed.  It has a lot of the same options and functionality as GLU, but doesn't require the .ldat format and has much better documentation.  The program can be downloaded here and is run with the command:
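
The basic invocation looks like this; the input files and r2 threshold are specified with options described in its documentation (running it without arguments prints the available options):

java -jar PriorityPruner.jar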


I have yet to test PriorityPruner, but so far I am impressed with what I see.

I hope this post is useful for others looking to thin a set of SNPs.  If you are aware of other LD thinning approaches, please describe it in the comments below.  Happy pruning!

Tuesday, July 16, 2013

Convert .PED and .MAP to and from .LDAT File Format

To run GLU, your data needs to be in the .ldat format.  Luckily, GLU makes it easy to convert from .ped and .map files to .ldat format with the transform command.  Here is how to carry out the transformation.

From .ped/.map → .ldat:

And likewise from .ldat → .ped/.map:

More details on the transform command can be found here.

Convert .GEN and .SAMPLE to and from .PED and .MAP File Format

A program called GTOOL can easily convert .gen and .sample files into .ped and .map files (as well as .ped and .map file format to .gen and .sample format).  The usage is quite simple.

For .gen/.sample → .ped/.map:
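
Something like the following (file names are placeholders; check the GTOOL documentation for your version):

gtool -G --g data.gen --s data.sample --ped data.ped --map data.map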

For .ped/.map → .gen/.sample:
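
gtool -P --ped data.ped --map data.map --og data.gen --os data.sample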

Create FASTA sequences for Phased Haplotypes

Here is some Python code I put together to convert a .haps file (and associated .sample file) into a .fasta file with an entry for each haplotype sequence.  Haplotypes are designated >ID_A and >ID_B for each ID in the .sample file.  The program can easily be modified to accept a list of SNPs or IDs that you would like to extract from the .haps file.  Also, this program removes indels that may be present in the .haps file to avoid alignment issues.  This program was useful to feed haplotype input into phylogenetic tree programs, such as MEGA.  Just run the program by typing python make_fasta.py data.haps at the command prompt and you will get a data.fasta file as output.  Hope it is useful.
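
A minimal sketch of such a script, assuming the standard SHAPEIT .haps layout (chromosome, SNP ID, position, the two alleles, then one 0/1 column per haplotype) and a .sample file with two header lines:

import sys

haps_file = sys.argv[1]
sample_file = haps_file.replace('.haps', '.sample')

# read sample IDs, skipping the two .sample header lines
with open(sample_file) as f:
    ids = [line.split()[1] for line in f.readlines()[2:]]

# build one sequence per haplotype (two per sample)
seqs = None
with open(haps_file) as f:
    for line in f:
        fields = line.split()
        a0, a1 = fields[3], fields[4]
        if len(a0) != 1 or len(a1) != 1:
            continue  # skip indels to avoid alignment issues
        haps = fields[5:]
        if seqs is None:
            seqs = [[] for _ in haps]
        for i, g in enumerate(haps):
            seqs[i].append(a0 if g == '0' else a1)

# write .fasta with >ID_A and >ID_B entries
with open(haps_file.replace('.haps', '.fasta'), 'w') as out:
    for i, sample in enumerate(ids):
        for j, tag in enumerate(('A', 'B')):
            out.write('>%s_%s\n%s\n' % (sample, tag, ''.join(seqs[2 * i + j])))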

How to get Phased Imputed Genotypes for Haplotype Analysis

I wanted to compare haplotypes from a set of cases genotyped on one platform to a set of reference controls genotyped on another platform for a small chromosomal region.  I looked into a few methods to do this and found that phasing the genotypes using SHAPEIT and then imputing off the 1000 Genomes reference panel using IMPUTE2 was the optimal approach to have a set of overlapping genotypes for a haplotype analysis.  Phasing from the .ped and .map files was easy using SHAPEIT.  The command I used with my .ped and .map files was:
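
A sketch of that command; the file names and the genetic map are placeholders, so use the map for your chromosome and build:

shapeit --input-ped data.ped data.map \
        --input-map genetic_map_chr8_combined_b37.txt \
        --output-max data.phased.haps data.phased.sample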


Other file types can also be used as direct input to SHAPEIT (ex: .bed/.bim/.fam and .gen/.sample).  Genetic recombination maps can be downloaded here for build36 and build37.  Also the backslashes (\) are not necessary.  They just help organize the code by telling the computer to keep reading the input on the next line.

Once you have your .haps and .sample file generated, you are then ready to impute using IMPUTE2.  I used the command below:
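
A sketch along these lines; the reference panel files, map, and the region given to -int are placeholders:

impute2 -use_prephased_g \
        -known_haps_g data.phased.haps \
        -h 1000GP.haplotypes.gz \
        -l 1000GP.legend.gz \
        -m genetic_map_chr8_combined_b37.txt \
        -int 128362121 129551294 \
        -phase \
        -o data.imputed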


Again the backslashes are just for a cleaner visualization of the code.  It is important when using the 1000 Genomes as your reference panel to have the SNP coordinates in hg19 coordinates.  LiftOver can help you convert from one build to another.  If you need the 1000 Genomes reference panel, it can be downloaded here.  The genetic map file is from the above SHAPEIT download.  Finally, if you want phased imputed results, the -phase command is essential to include.

You should now have your phased imputed genotypes to do whatever you wish with.  I chose to convert the phased imputed genotypes into a .fasta file with entries for each haplotype (see Python script).  Then I created parsimony trees with MEGA and visualized them in HapView.

Tuesday, July 9, 2013

Calculate Minor Allele Frequencies from VCF File Variants

Today I needed to calculate minor allele frequencies (MAFs) for sequence variants called in a .vcf file.  I couldn't find any programs that would do this for me, so I wrote a quick script to do it in Python.
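
A minimal sketch of such a script (I'll call it maf.py; it reads the GT field of each sample, skips indels, and tallies allele counts):

import sys

vcf = sys.argv[1]
out_name = vcf.rsplit('.', 1)[0] + '.txt'

with open(vcf) as f, open(out_name, 'w') as out:
    out.write('CHROM\tPOS\tID\tREF\tALT\tMAF\n')
    for line in f:
        if line.startswith('#'):
            continue
        fields = line.rstrip().split('\t')
        ref, alt = fields[3], fields[4]
        if len(ref) != 1 or len(alt) != 1:
            continue  # SNPs only; skip indels and multi-allelic sites
        alleles = []
        for sample in fields[9:]:
            gt = sample.split(':')[0].replace('|', '/')
            alleles.extend(a for a in gt.split('/') if a in ('0', '1'))
        if not alleles:
            continue
        alt_freq = alleles.count('1') / float(len(alleles))
        maf = min(alt_freq, 1.0 - alt_freq)
        out.write('\t'.join(fields[:5]) + '\t%.4f\n' % maf)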


This can be run in python from the command prompt by typing:
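
python maf.py project.vcf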


where project.vcf is the vcf file you want to calculate MAFs for.  It will return a project.txt file that contains the calculated MAF values.  This script will only work for SNPs and does not work on insertions and deletions.

Alternatively, if Python scares you, there is a bit of a roundabout way that will do this for you too.  First use vcftools to convert your .vcf file into a Plink compatible .ped and .map file.
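
vcftools --vcf project.vcf --plink --out project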


Then, open Plink and run the --freq option on the newly created .ped file.
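
plink --file project --freq --out project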

**UPDATE**
Today I found an updated way to use Vcftools to directly calculate the MAF values for you.  It just takes the simple command --freq.  Here is some example code:
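
vcftools --vcf project.vcf --freq --out project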

Monday, July 8, 2013

Useful UNIX Aliases

Here are some useful UNIX aliases that help save the fingers from too much typing.  Feel free to borrow as many as you like.
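
A few examples of the kind of aliases I mean (add your own favorites to ~/.bashrc):

alias ll='ls -lh'               # detailed, human-readable listings
alias la='ls -lah'              # include hidden files
alias ..='cd ..'                # climb one directory
alias grep='grep --color=auto'  # highlight matches
alias rm='rm -i'                # prompt before deleting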

Setting Up Automatic Publication Notifications

Interested in setting up an automatic notification letting you know when new articles have been published in your area of interest?  This is easier to do than you may think, and a huge time saver.  Instead of doing all the searching yourself, notifications will come right to your email inbox.  Sounds great, right?  Well, here are a couple of ways to get this up and running.

PubMed offers an automatic search update through their My NCBI feature.  Just follow the link to PubMed and click on the top right to sign in to NCBI.  Next, either enter your NCBI account credentials or register for an NCBI account.  After successfully signing in, go ahead and conduct a search using keywords, MeSH terms, or the advanced search features.  Once the search is optimized to bring up the key articles you are interested in, follow the save search link immediately below the search box.  This will bring you to a new page where you can name the search.  Click save.  On the next page there will be options to further refine your search.  Make sure to check the radio button to get email updates of new search results and confirm your correct email address is showing.  There will also be options about the frequency of emailing search results to you.  Once you are happy with everything click save.  You should be all set up.  If you should ever decide not to receive these email updates, just go to the My NCBI link at the top right of PubMed and manage your saved searches.

Google Scholar also offers an easy way to set up automatic email updates for a literature search.  Simply navigate to Google Scholar and sign in using your Google account (or sign up for a free one).  Once logged in, type in the search you are interested in and press search.  If you are happy with the search, click on the create alert link on the bottom left.  From there make sure all the details are correct (including your email address) and click create alert when finished.  To cancel alerts, just go to the Alerts link on the top left of Google Scholar.

I am sure there are other ways of setting up automated literature searches, but these are the easiest free ones I am aware of.  I personally prefer the PubMed alerts since I find them a bit easier to customize and they are somewhat restricted to biomedical journals.  From my experience, Google Scholar sometimes brings up some pretty obscure articles that don't seem to match the search parameters.  Hope this is helpful in getting you set up with automated searches.  If you know of additional ways to set up automatic publication notifications, please let me know in the comments section below.

Friday, July 5, 2013

Getting Set Up on a UNIX Cluster

Okay, you have been granted access to a UNIX cluster to work on a project, but have no idea how to get started with things.  No problem.  This post is designed to clue you in on the essentials so you can get up and running in no time.


SSH Client
This is the first thing you will need.  SSH clients are programs that enable your computer to connect remotely to the UNIX cluster.  SSH stands for Secure SHell, which is a network protocol for communicating data from one computer to another.  Unquestionably, the most widely used SSH client for Windows operating system users is PuTTY.  This is a free program and easy to set up.  Download putty.exe and move it to a spot where you can easily access it.  Double clicking on it will bring up a security warning.  Select run and PuTTY will open.  For the Host Name, insert the host name or IP address for the cluster (ex: computer.university.edu).  In general, this is all you really need to do before pressing the Open button.  You can choose to save this as a session for easy access in the future, but just accept all the defaults for now.  PuTTY will then connect to the computer at the address you specified.  If you are a Mac user, this is much easier to do.  Just go to the pre-loaded Mac terminal and type "ssh username@computer.university.edu", where computer.university.edu is the address of the computer you want to connect to.  The first time you connect you will get a warning about a certificate.  Just choose the option to proceed.  Next, you will usually be asked by the remote computer for a username and password.  Input what was given to you by the cluster administrator.  You will not see the password as you type it in.  After your credentials have been accepted, you will have access to the remote computer.  Congratulations, you are now connected!  For Windows users, there are also other flavors of PuTTY that can be used.  I like PuTTYTray for the added options built into it as well as MTPuTTY which allows for multiple tabs to be opened simultaneously.

File Transfer Application
Next, you will need a way to transfer files and scripts from your computer to the remote computer (cluster) and vice versa.  WinSCP is an excellent free program to use from a Windows operating system.  Unfortunately, Macs really don't have an equivalent (if you have any suggestions, let me know).  Once WinSCP is downloaded and setup, open the program and select New.  Fill in the Host name and User name.  I usually leave the password blank (in which case I will be asked to manually provide it later), but this is up to your discretion.  From there you can log in.  After your session has been authenticated, you will be greeted with a window having two major panes.  The left side is your computer and the right side is the remote computer.  Just drag files from one window to the other to transfer from one computer to another.  You now have a way to transfer files to and from the remote computer!

X Server
Some programs require an X server to process output from X11 sessions.  This essentially allows you to open an interactive window on your computer in which you can interact with the remote computer.  Some programs such as R, Python, and Java have modules that can interact in this fashion.  Should you find out you need an X server, I would highly recommend Xming for Windows users.  It is free and just needs to be open in the background while you have the terminal open.  Macs have this functionality built in.  First though, you need to set this up before logging into the cluster.  On Macs, simply add the -X option when connecting through the terminal (ex: ssh -X username@computer.university.edu).  For Windows, open PuTTY and in the left menu select Connection > SSH > X11 and click on the box enabling X11 forwarding.  Then go back to the top of the menu to Session and log in as before.  To make sure this is working, type the command "xeyes".  Do you see two eyeballs looking at you?  If so, it works!

Text Editor
If you like Vi or Emacs, this is not for you.  For everyone else, there are a wide variety of text editors that are helpful in writing your code in a variety of programming languages.  I prefer Notepad++ for Windows.  For Macs, however, I am still searching for a worthy Notepad++ alternative.


Well, I hope this was helpful in getting you up and running on a UNIX computing environment.  From personal experience, I know the learning curve can be steep.  Best wishes as you learn to navigate your way around on a remote cluster.  If you have any helpful suggestions, I highly encourage you to post a comment below.

Best Notepad++ Alternatives for Mac OS

I ❤ Notepad++.  It's a powerful, fully-loaded, and free text editing application that has been an invaluable tool for writing code in a variety of programming languages.  The only caveat: it's only available for Windows operating systems.  With the acquisition of my shiny, new Macbook Pro, I was incredibly disappointed to find out Notepad++ could not be installed on Macs; so much so I almost returned the Macbook.  Since I couldn't find a better laptop to meet my needs (and aesthetic desires), the quest has begun to find a comparable and preferably free text editor that runs on a Mac operating system.  I was surprised to find the list of candidates quite long.  Here are the options I found; unfortunately not all of them are free:

BBedit ($50)
Coda ($75)
Crossover (Windows emulator, $60) + Notepad++
Espresso ($75)
jEdit
Komodo Edit
TextEdit (the basic text editor pre-loaded on your Mac)
TextMate (€39 or about $53)
TextWrangler (free lite version of BBedit)
Smultron ($5)
SubEthaEdit (€29, or about $43)
Sublime ($70)
Tincta (free, Pro version for $16)
WINE (Windows emulator) + Notepad++

Apparently the market is saturated with Notepad++ "replacement" text editors for Macs.  The predominant text editors most recommended online are highlighted in bold.  While looking into the options, it became apparent there really is no one best Notepad++ replacement text editor.  It all really depends on what the user is using Notepad++ for and the options they need it to do (plus a bit of personal preference in user interface).  I have tried a few of the above options and am still not completely satisfied.  I am secretly hoping the folks responsible for Notepad++ are cooking up a way to install it on Macs.  The emulator approach to installing Notepad++ on a Mac also seems interesting.  I will have to try it when I have some free time.  In the meantime, I am curious what has been working best for you other Notepad++ lovers who have made the switch to a Mac.  Also, if you are aware of other text editors not mentioned here, please share!

Sunday, June 30, 2013

1,000 Page Views

Today, Genome Toolbox is on track to reach 1,000 page views!  What an awesome milestone.  I am thrilled so many people have found the blog and are using it as a resource.  Thank you all for your interest and keep the feedback coming.

Wednesday, June 26, 2013

Remove a List of Reads from a BAM File

Sometimes it is necessary to remove a subset of reads from a .bam file.  In my case, I wanted to remove a few chimeric reads where it appeared reads from different amplicons had fused together before entering the sequencer.  Here is a line of code where I use Samtools and grep to remove a list of read IDs from the original .bam file and create a new filtered .bam file.  Hope it is useful for other applications as well.
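
A sketch of the pipeline (assuming read_ids.txt holds one read ID per line):

samtools view -h original.bam | grep -v -f read_ids.txt | samtools view -bS -o filtered.bam -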


Note the trailing hyphen at the end.

Tuesday, June 25, 2013

Install SRA Toolkit on UNIX OS

SRA Toolkit is the set of programmatic tools the Sequence Read Archive (SRA) provides for accessing information in their .sra files.  The toolkit can be downloaded for a variety of platforms here, but in this post I will focus on installing the toolkit on a UNIX platform.

The first step is to download to the directory you want the program saved in and extract the program:
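
For example (the URL and archive name are placeholders; copy the current link from the SRA Toolkit download page):

wget http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/current/sratoolkit.current-centos_linux64.tar.gz
tar -xzf sratoolkit.current-centos_linux64.tar.gz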

Next, you need to configure the toolkit by going into the bin directory and opening the java configuration script (make sure you have X11 forwarding enabled):
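
Something along these lines, assuming the vdb-config utility shipped with your version (the exact configuration script name may differ between releases):

cd sratoolkit*/bin
./vdb-config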

For the most part you can accept defaults when you walk through the setup, just make note of the directory you specified during configuration.  You should be all setup!  Check out other blog entries for SRA here.

Monday, June 24, 2013

Only Download Regional BAM File from Sequence Read Archive

Sequence Read Archive (SRA) stores raw next-generation sequencing data from a wide variety of sequencing platforms and projects.  It is the NIH's primary archive for high-throughput sequencing data containing well over 741,690,391,503,250 bases of publicly available sequence reads (that's a really big number).  Data is stored in one basic archived format, the SRA format, and can be downloaded for public use.  A toolkit is provided that supports conversion from .sra files to several popular data formats (see post for installation).  SRA recommends downloading .sra files using Aspera.

Not having sufficient space to download the entire .sra files I wanted regional .bam files from, I found two ways of downloading much smaller regional .bam files.

Web Based Interface
A somewhat cumbersome but practical solution is through one of SRA's web interfaces.  SRA offers a Run Browser which can be used to visualize data in the .sra files.  After searching for the accession number and then choosing the reference chromosome and range, the Run Browser has the option of creating a .bam (as well as .fasta, .fastq, .sam, and .pileup) file for that region.  Just click to output the format to file and the download will begin in your web browser.

SRA Toolkit
This is a very useful set of programs to access data in .sra files and can be used to access data remotely (without having to download .sra files locally!) through the NCBI site.  It is easy to install and provides clear documentation.  An essential included module for extracting sequence read data is sam-dump.  A genomic region can be selected and then output to .bam format using Samtools.  Here is an example script:
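A sketch with a placeholder accession and region (sam-dump's --aligned-region option restricts the output to the requested interval):

sam-dump --aligned-region chr20:1000000-2000000 SRR000001 | samtools view -bS - > SRR000001.region.bam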

I used these approaches to get .bam files I could use to investigate depth of coverage by Complete Genomics in a few regions of interest.  These methods allowed me to get complete (rather than evidenceOnly or evidenceSupport) .bam files for a few genomic regions for a subset of individuals that had been whole-genome sequenced by Complete Genomics in the 1000 Genomes Project.  To get the accession numbers needed for the Run Browser and SRA Toolkit, extract the SRR number out of the index file.  Hope this is helpful for those wanting to use SRA but not having loads of space to store the .sra files locally.

Thursday, June 20, 2013

Clean BAM Files Generated by CGATools

Tired of seeing the error "samtools: bam_pileup.c:112: resolve_cigar2: Assertion `s->k < c->n_cigar' failed." in the .bam files you generated using CGATools?  Here's a way to remove those pesky lines that Samtools does not like.  Basically, we are going to remove any line where the CIGAR string fulfills any of the following:

starts with \d+N\d+D
starts with \d+P
starts with \d+I
ends with \d+P

where \d+ is the Perl regular expression for an integer with one or more digits.  Here is a pipeline that uses Samtools and AWK to do this for us.

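A sketch of that pipeline (file names are placeholders; note that AWK's POSIX regular expressions use [0-9]+ where Perl uses \d+, and the CIGAR string is column 6 of a SAM record):

samtools view -h dirty.bam \
  | awk '$0 ~ /^@/ || ($6 !~ /^[0-9]+N[0-9]+D/ && $6 !~ /^[0-9]+P/ && $6 !~ /^[0-9]+I/ && $6 !~ /[0-9]+P$/)' \
  | samtools view -bS - > clean.bam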

Enjoy nice, clean (and hopefully problem free) .bam files.

Does a BAM File Use hg18 or hg19 Coordinates?

How do you tell which coordinates are used in a .bam file?  Well, it's pretty easy.  Just pull up the .bam file header using Samtools:

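samtools view -H file.bam    # -H prints only the header; file.bam is a placeholder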

Then check out the rows that begin with @SQ, followed by SN:chr## and LN:########.  Compare the lengths of a few of the chromosomes to the lengths below.  Whichever build the lengths match indicates which coordinates are being used in the .bam file.

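Checking a couple of chromosomes is usually enough; for example, the first few lengths in each build are:

Chromosome   hg18 (NCBI36)   hg19 (GRCh37)
chr1         247249719       249250621
chr2         242951149       243199373
chr3         199501827       198022430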

Hope this is helpful in determining which coordinates are used in your .bam files.

Monday, June 17, 2013

Create 2bit Database for Faster BLAT Searching

Included with the BLAT source code is a useful utility that converts FASTA references to 2bit format.  The utility, faToTwoBit, works simply by providing the original reference .fa file to be converted to a .2bit file.  Here is the syntax to run the program:

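faToTwoBit reference.fa reference.2bit    # reference.fa is your FASTA; reference.2bit is the output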

A few simple options are also included:
-noMask
-stripVersion
-ignoreDups

For more details on these options, run faToTwoBit with no arguments at the command line to print its usage.  Hope this 2bit reference database speeds up your BLAT searches!

Index FASTA File

Some programs require indexed FASTA files (*.fa.fai files).  Indexing a FASTA file is easy to do with Samtools.  Just run the below script on your FASTA file.
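samtools faidx reference.fa    # writes the index alongside as reference.fa.fai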

Friday, June 14, 2013

Calculate an Author's H-Index

The h-index is an indicator of the number of highly cited papers an author has among their total published work.  Simply put, an h-index of h means an author has h papers that have each been cited at least h times, so an author with an h-index of 10 has at least 10 papers with at least 10 citations each.  An intuitive graphical representation is shown here.  The h-index is designed so that a few very highly cited articles, or many articles that have not yet been cited, do not disproportionately influence the overall score.

To calculate the score, go to the ISI Web of Knowledge webpage, click on the Web of Science tab, and enter the author's name in the correct field.  A good way to enter an author's name is last name, then first initial followed by an asterisk (*).  Once the search has been conducted, click on the Create Citation Report link at the top of the search results.  This will take you to a report page that has the h-index calculated for you.  How high an h-index can you find?  The highest I could find was 212.

Google Scholar also has a way to calculate an h-index; however, a Google account is required and this approach is more tailored to calculating your own h-index.  At the main Google Scholar page, there is a link at the top right called My Citations.  Click the link and log in.  This leads you to a wizard where you fill in your details and add papers to your My Citations list.  After all your papers are added, an h-index is calculated.  Best wishes for a strong h-index!

Find out how many times a paper has been cited

Today I discovered that not only did someone else read one of the papers I published, but they also cited the paper!  This made me curious: how many others have cited this paper or other papers I have published?  A quick search led me to the ISI Web of Knowledge.  I may be mistaken, but I believe your institution needs a subscription to be able to use this service.  Once there, go to the ISI Web of Science tab and then click on Cited Reference Search.  Fill in the appropriate fields to customize your search and press the Search button.  This should bring up your paper or papers of interest and the number of citations each paper has received.  One option I was particularly impressed by was the citation maps, which show forward and reverse maps of citations up to two generations.  Alternatively, if you don't have a subscription to ISI Web of Knowledge, Google Scholar is another resource that provides citation information.  Just search for your article of interest; the search result should have a small cited by link on the lower left of the entry indicating the number of articles that have cited it.  Follow the link to get a list of all the articles that have cited the original paper.  Hope you are as excited as I was to learn how many citations your papers have received.

Friday, May 31, 2013

How To Import Command Line Arguments into a Python Script

It is often handy to be able to feed an argument from the UNIX command line to a Python script.  This is very simple to do with Python's built-in sys module.  Below is an example of feeding three arguments (chr, start, end) from the command line into Python to be used as variables in a script.

In the script.py file:
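# script.py -- a minimal sketch; sys.argv[0] is the script name, so user arguments start at index 1
import sys

chrom = sys.argv[1]        # e.g. "chr12" (the post's "chr", renamed to avoid shadowing Python's built-in chr)
start = int(sys.argv[2])   # arguments arrive as strings, so convert numeric ones
end = int(sys.argv[3])
print(chrom, start, end)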

At the command line:
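python script.py chr12 1000000 2000000

(The region here is just an example; inside the script, chrom, start, and end then hold "chr12", 1000000, and 2000000.)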

Thursday, May 30, 2013

Transpose List of Lists in Python

Today in Python I wanted to transpose a list of lists that I created.  So, for example, I wanted to turn the list

l=[[1,2,3],[4,5,6],[7,8,9]]

to this

t=[[1,4,7],[2,5,8],[3,6,9]]

I found an easy command to do this, where l is the list above and t is the transposed list.
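One common idiom combines zip() with argument unpacking; in Python 3, zip() returns tuples, so wrap each row in list() to match the original structure:

l = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
t = [list(row) for row in zip(*l)]   # zip(*l) groups the i-th element of every sublist
print(t)   # [[1, 4, 7], [2, 5, 8], [3, 6, 9]]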

How to Create Indicator (Dummy) Variables in R

Countless times when performing an analysis in R, I need to create indicator variables to run a regression.  I usually do this manually, but today I came across a way to have R do this for me.  Basically, all you need to do is use the factor() function.  Here is a quick example where I create indicator variables from ethnicity coded as 0, 1, 2.
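A minimal sketch with made-up data; wrapping the variable in factor() inside the model formula makes R build the indicator columns for you:

# made-up data: ethnicity coded 0/1/2 and a continuous outcome
ethnicity <- c(0, 1, 2, 2, 1, 0)
outcome   <- c(2.3, 4.1, 5.6, 5.9, 3.8, 2.1)

# factor() treats the codes as categories; lm() then includes one
# indicator variable for each non-reference level (here, levels 1 and 2)
fit <- lm(outcome ~ factor(ethnicity))
summary(fit)

# to see the indicator (dummy) columns themselves:
model.matrix(~ factor(ethnicity))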

Wednesday, May 29, 2013

Find Palindromic Sub-sequences in DNA Nucleotide Sequence

A quick and easy online tool I found to search a DNA sequence for palindromic sub-sequences is the DNA Palindrome Finder.  Just paste your nucleotide sequence into the text box and press Send.  Results will show up on the next page with palindromic sequences displayed as capital letters.  It's pretty basic, but does what it needs to do.

Unix Sort by Column Number

I always forget the flags to use when using UNIX sort.  Here is a list of common useful flags.

Flags
-k: selects the column (field) number to sort on
-n: sort numerically rather than lexicographically (so 9 sorts before 10, instead of 10 sorting before 9)
-r: sort this column in reverse (descending) order

An example line of code that sorts by numeric values in column 2 in descending order would be as follows.
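sort -k2,2 -n -r input.txt    # input.txt is a placeholder; -k2,2 restricts the sort key to column 2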

Tuesday, May 28, 2013

Custom Tracks in UCSC Genome Browser

One of my colleagues recently shared a web link to a custom track they had uploaded and visualized on the standard UCSC Genome Browser.  A bit impressed, I looked into how to do this myself.  UCSC has a pretty informative webpage detailing how to add custom tracks to Genome Browser.  Here's a quick summary of what you need to do.

First, get your data into one of the formats supported by Genome Browser (.gtf, .gff, .bed, .wig, or .psl); I usually prefer .bed when appropriate.  Next, log in to your account on Genome Browser.  If you don't have an account, sign up for a free one here.  Then go to the UCSC Add Custom Tracks link.  Select the appropriate genome and assembly, and upload your data under the "Paste URLs or data" section.  After pressing Submit, you will be taken to a new page showing that your track was generated, with a link to view it in the Genome Browser.  You can now explore how other tracks match up to your custom track.  You can also save your session and share it with others by creating a URL link to it.  Here is an example of a custom track I made from FASTA files where I wanted to compare gaps in my FASTA files to the Gaps track at UCSC; a minimal sketch of the .bed format is shown below.  Have fun with custom tracks on UCSC Genome Browser.
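As a rough sketch, a minimal .bed custom track (names and coordinates here are made up) is just an optional track line followed by tab-separated chrom, start, and end columns (BED start positions are 0-based):

track name="myGaps" description="Gaps in my FASTA files" visibility=2
chr1	10000	20000
chr1	55000	56000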