Tuesday, May 12, 2015

Convert Protein Codon to Genome Coordinates

I needed a way to check whether a few codons from different proteins were covered by a next-generation sequencing panel. This sounds easy, but it turned out to be a bit tricky. Here are the steps I used.

(1) Use the BioMart ID converter to find the Ensembl protein ID for your protein of interest.
(2) Use the Ensembl REST endpoint GET map/translation/:id/:region to find the genomic coordinates of the codon of interest. In the request path map/translation/ENSP00000288602/100..100:

ENSP00000288602 is the Ensembl protein ID for your protein of interest (example: BRAF)
100..100 is the codon range to map, given as first..last position in the protein (example: just codon 100)
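
A minimal Python sketch of such a request, using only the standard library. It assumes the public Ensembl REST server at https://rest.ensembl.org; the function names build_map_url and fetch_mappings are made up for illustration and are not part of any Ensembl client.

```python
import json
import urllib.request

# Assumption: the public Ensembl REST server, which reports GRCh38 coordinates.
ENSEMBL_REST = "https://rest.ensembl.org"


def build_map_url(protein_id, first_codon, last_codon):
    """Build the map/translation/:id/:region URL for a codon range."""
    region = f"{first_codon}..{last_codon}"
    return f"{ENSEMBL_REST}/map/translation/{protein_id}/{region}"


def fetch_mappings(protein_id, first_codon, last_codon):
    """Request the mapping as JSON and return the parsed response."""
    request = urllib.request.Request(
        build_map_url(protein_id, first_codon, last_codon),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Example (needs network access):
# print(json.dumps(fetch_mappings("ENSP00000288602", 100, 100), indent=2))
```

To check several codons, call fetch_mappings once per protein ID and codon, with a short pause between requests to stay polite to the server.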

The result is a JSON-formatted string whose mappings list gives the chromosome (seq_region_name), start, end, and strand of the codon.

For this example, the response indicates that codon 100 of BRAF (for this protein isoform) is located at chr7:140834813-140834815. Ensembl reports coordinates on GRCh38. If you need another genome build, convert the coordinates with liftOver.
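
Once the coordinates are in hand, the coverage check itself can be sketched in a few lines of Python. This is only an illustration: it assumes the response is a dictionary with a mappings list carrying seq_region_name, start, and end, that the panel design is available as BED-style intervals (0-based start, exclusive end), and the panel interval below is invented. A codon that spans an exon boundary would come back as more than one mapping, so each segment is checked separately.

```python
def mapping_to_region(mapping):
    """Format one mapping as a chr:start-end string (1-based, inclusive)."""
    return f"chr{mapping['seq_region_name']}:{mapping['start']}-{mapping['end']}"


def is_covered(mapping, panel_intervals):
    """Return True if one panel interval contains the whole mapped segment.

    panel_intervals holds (chrom, start, end) tuples in BED convention:
    0-based start, exclusive end.
    """
    chrom = f"chr{mapping['seq_region_name']}"
    start0 = mapping["start"] - 1  # 1-based inclusive -> 0-based start
    end = mapping["end"]           # inclusive 1-based end equals exclusive 0-based end
    return any(
        c == chrom and s <= start0 and end <= e
        for c, s, e in panel_intervals
    )


# Assumed response shape; the coordinates are the BRAF codon 100 values above.
response = {
    "mappings": [
        {"seq_region_name": "7", "start": 140834813, "end": 140834815}
    ]
}

# Hypothetical panel target, for illustration only.
panel = [("chr7", 140834600, 140834900)]

for m in response["mappings"]:
    print(mapping_to_region(m), is_covered(m, panel))
```

The panel intervals must be on the same genome build as the Ensembl coordinates, so run liftOver first if the design file uses an older build.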

There are probably more automated ways to do this, but this worked for the small set of codons I needed to check in the panel design. If you have a better way, please share it in the comments section.

