Genome Toolbox: genomic

Showing posts with label genomic. Show all posts

Tuesday, May 12, 2015

Convert Protein Codon to Genome Coordinates

I needed a way to check if a few codons from different proteins were covered in a next-generation sequencing panel. This sounds relatively easy to do, but proves to be a bit difficult. Here are steps to do this.

(1) Use Biomart ID converter to find the Ensembl protein ID for your protein of interest.
(2) Use Ensembl GET map/translation/:id/:region to find the genomic coordinates of the codon of interest using the following script:

ENSP00000288602 is the Ensembl protein ID for your protein of interest (example: BRAF gene)
100..100 are the start and stop codons (example: just codon 100)

The result is a JSON formatted string like this:
{"mappings":[{"assembly_name":"GRCh38","end":140834815,"seq_region_name":"7","gap":0,"strand":-1,"coord_system":"chromosome","rank":0,"start":140834813}]}

This indicates that codon 100 of the BRAF gene (for this protein transcript) is located at chr7:140834813-140834815. Ensembl uses GRCh38. If you need other builds of the genome, use liftOver for converting.

I am sure there are probably more automated ways out there to do this, but this worked for the small subset of codons I needed to check in the design panel. If you have a better way to do this, please share in the comments section.

Tuesday, August 5, 2014

How to Calculate Genomic Inflation Factor and λgc for GWAS

You have conducted your genome-wide association study (GWAS) and have tested each genetic variant for an association with your trait of interest. Now it is time to investigate if there are any systematic biases that may be present in your association results. A common way to do this is to calculate the genomic inflation factor, also known as lambda gc (λgc). By definition, λgc is defined as the median of the resulting chi-squared test statistics divided by the expected median of the chi-squared distribution. The median of a chi-squared distribution with one degree of freedom is 0.4549364. A λgc value can be calculated from z-scores, chi-square statistics, or p-values, depending on the output you have from the association analysis. Follow these simple steps to calculate lambda GC using R programming language.

(1) Convert your output to chi-squared values
(2) Calculate lambda gc (λgc)
If analysis results your data follows the normal chi-squared distribution, the expected λgc value is 1. If the λgc value is greater than 1, then this may be evidence for some systematic bias that needs to be corrected in your analysis.

Sunday, June 1, 2014

Generate Random Genomic Positions

Generating random genomic positions or coordinates can be useful in comparing characteristics of a set of genomic loci to that what would be expected from permutations of the underlying genomic distribution. Below is a Python script to aid in selecting random genomic positions. The script chooses a chromosome based on probabilities assigned by chromosome length and then chooses a chromosomal position from a uniform distribution of the chromosome's length. An added gap checking statement is included to ensure the chosen position lies within the accessible genome. You can choose the number of positions you want, the number of permutations to conduct, the size of the genomic positions, and the genomic build of interest. A UNIX shell script is included as a wrapper to automatically download needed chromosomal gap and cytoband files as well as run the Python script. Useage for the UNIX script can be seen by typing ./make_random.sh from the command line after giving the script executable privileges. An example command would be ./make_random 100 10 1000 hg19. This command would make 10 .bed files each with 100 random 1Kb genomic regions from the hg19 genome build. Below are the make_random.sh and make_random.py scripts.

Friday, May 24, 2013

Download Nucleotide Sequence for Genomic Region

Sometimes I need the nucleotide sequence for a specific region of the genome to investigate sequence similarity, simple repeats present, or recurring motifs. I know entire chromosomal .fasta files can be downloaded from the UCSC ftp site, but then I would have to go through the entire file and hopefully extract out the correct sequence I needed. Today I came across a very easy way to download a nucleotide sequence for a genomic region using the UCSC DAS server. Simply modify the below web link to include the appropriate genome build and genomic coordinates and you will get a customized XML page generated with the nucleotide sequence for your query. One word of caution: the DAS server uses an index of +1 for the first base. Pretty cool and very simple to do.

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr1:100000,200000

Sunday, May 5, 2013

Samtools Download BAM Region Only

Often times publicly available sequencing data can serve as a useful reference for a sequencing project. The 1000 Genomes project is a great source, especially with their newly released high coverage Complete Genomics data. Here is an example UNIX script that shows how BAM files with genomic regions of interest can be created from a whole-genome BAM file that is hosted on an FTP server, without having to download the entire BAM file first. The bai_file_list.txt file is a file that contains unique identifiers for each BAM file extracted from a previously selected list of BAI files of interest. Here I am just extracting the BRCA1 and BRCA2 regions of the genome. The extracted reads are sorted and then saved as a BAM file and an associated BAI index file is also created. The final step removes excess BAI files that are downloaded and used by Samtools to extract the region of interest from the BAM files on the FTP server.

Pages