Genome Toolbox: transcript

Wednesday, July 16, 2014

RNA-seq: RPKM, FPKM, Formulas, and Scripts

Reads per kilobase per million mapped reads (RPKM) is a common metric for investigating the expression of a gene transcript in sequencing data from RNA-seq experiments. The RPKM measure of read density reflects the molar concentration of a transcript in the starting sample by normalizing for RNA length and the total read number. By doing so, RPKM values facilitate transparent comparison of transcript levels both within and between samples. The formula for RPKM is as follows:

$$\mathrm{RPKM} = \frac{10^9 \times ER}{EL \times MR}$$

where ER is the number of mapped reads in the gene's exons, EL is the sum of the exon lengths in base pairs, and MR is the total number of mapped reads.
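As a minimal sketch of the calculation, the standard definition RPKM = 10^9 × ER / (EL × MR) can be written as a small Python function. The function name and the example counts below are made up for illustration and do not come from any particular dataset or tool.

```python
def rpkm(exon_reads, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    exon_reads          -- ER, reads mapped to the gene's exons
    exon_length_bp      -- EL, summed exon length in base pairs
    total_mapped_reads  -- MR, total mapped reads in the sample
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 10^9 = 10^3 (bases per kilobase) * 10^6 (reads per million)
    return 1e9 * exon_reads / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2,000 bp of exon, 20 million mapped reads
example_rpkm = rpkm(500, 2000, 20_000_000)
print(example_rpkm)
```

With these illustrative numbers the gene has 250 reads per kilobase, and dividing by 20 (million mapped reads) gives 12.5.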

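The number of transcript copies (TC) can be derived from RPKM as well. Essentially,

$$TC = \frac{ER}{EL} \times \frac{TL}{MR}$$

where TL is the length of the transcriptome in base pairs. This can then be rearranged, and RPKM substituted in, as follows:

$$TC = \frac{ER \times TL}{EL \times MR} = \frac{\mathrm{RPKM} \times TL}{10^9}$$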
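A rough sketch of this conversion in Python, assuming the relation TC = RPKM × TL / 10^9. That relation follows from the definitions if reads are sampled uniformly along the transcriptome. The function name and the transcriptome length used here are purely illustrative, not measured values.

```python
def transcript_copies(rpkm_value, transcriptome_length_bp):
    """Estimate transcript copies (TC) from RPKM and transcriptome length (TL).

    Assumes uniform read sampling, so that TC = RPKM * TL / 10^9.
    """
    if transcriptome_length_bp <= 0:
        raise ValueError("transcriptome length must be positive")
    return rpkm_value * transcriptome_length_bp / 1e9


# Hypothetical inputs: RPKM of 12.5 and an assumed TL of 4e8 bp
example_copies = transcript_copies(12.5, 4e8)
print(example_copies)
```

The result is only as good as the TL estimate, so in practice the uncertainty in TL carries straight through to TC.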

The difficult part is getting an estimate of TL. TL can be estimated from spike-in data, or derived from the starting amount of mRNA if you are willing to assume 100% efficiency of the cDNA synthesis. A great paper on RPKM is Mortazavi et al. (Nature Methods, 2008).

Fragments per kilobase per million mapped fragments (FPKM) is essentially analogous to RPKM. The only difference is that, rather than using read counts, you estimate the abundance of gene transcripts from the number of fragments observed. In paired-end RNA-seq experiments (e.g., Illumina sequencing), each fragment is sequenced from both ends, providing two reads per fragment. RPKM therefore counts individual reads (single-end data), while FPKM counts fragments, each of which is represented by two reads (paired-end data). A common misconception is that RPKM values are twice FPKM values. That is untrue, since FPKM is fragments per kilobase per million mapped fragments, not fragments per kilobase per million mapped reads: the numerator and the denominator are both counted in fragments, so the factor of two cancels. RPKM is approximately equal to FPKM.
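To make the cancellation concrete, here is a small sketch that applies the same per-kilobase-per-million formula once to fragment counts and once to read counts. It assumes the idealized case where both mates of every fragment map, and all counts are invented for illustration.

```python
def per_kilobase_per_million(count, length_bp, total_count):
    """Shared formula: 10^9 * count / (length * total).

    Gives RPKM when counts are reads and FPKM when counts are fragments.
    """
    return 1e9 * count / (length_bp * total_count)


# Hypothetical paired-end sample
gene_fragments = 500            # fragments mapped to the gene's exons
total_fragments = 20_000_000    # fragments mapped in the whole sample
exon_length_bp = 2000

fpkm_value = per_kilobase_per_million(gene_fragments, exon_length_bp, total_fragments)

# Idealized assumption: every fragment contributes exactly two mapped reads,
# so both the gene count and the sample total double.
rpkm_value = per_kilobase_per_million(
    2 * gene_fragments, exon_length_bp, 2 * total_fragments
)

print(fpkm_value, rpkm_value)
```

In real data some mates fail to map, so the two values differ slightly, which is why the equality is only approximate.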

Here are links to some programs and scripts that are useful:

cufflinks

rpkmforgenes.py

Thursday, July 18, 2013

How to Get a cDNA Sequence for a Gene Transcript

A paper I was reading gave coordinates for mutations in a gene based on the gene's complementary DNA (cDNA) sequence. In order to dig deeper into the finer details of the paper, I wanted to download the cDNA sequence. Here's how I did it. First, go to Ensembl and search Human (or whatever other organism you are studying) for the gene whose cDNA sequence you want. On the results summary page, click on the link to the gene. Then, on the detailed results page, click on the gene ID. The next page is essentially a tab labeled with the gene you searched for, and it should contain a table of transcripts for your gene. Click on the transcript ID of the transcript whose cDNA sequence you want. If you aren't sure which transcript you want, the one with a CCDS ID is a good start. A new page will load, opening a new tab for the transcript you selected. On the left side of the tab is a menu of links. Find cDNA in this menu under Sequence and click on it. This will lead you to a page with the cDNA sequence. Congrats, you are now able to get the cDNA sequence for a gene transcript.
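If you already know the transcript ID, the same sequence can also be fetched programmatically. The sketch below is not part of the steps above. It assumes the public Ensembl REST service at rest.ensembl.org and its `sequence/id` endpoint with `type=cdna`. The helper names are my own, and the transcript ID in the usage note is a placeholder rather than a real transcript.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

ENSEMBL_REST = "https://rest.ensembl.org"


def cdna_url(transcript_id):
    """Build the REST URL for a transcript's cDNA sequence."""
    return f"{ENSEMBL_REST}/sequence/id/{quote(transcript_id)}?type=cdna"


def fetch_cdna(transcript_id, timeout=30):
    """Download the cDNA sequence as plain text (requires network access)."""
    request = Request(
        cdna_url(transcript_id),
        headers={"Content-Type": "text/plain"},  # ask for the bare sequence
    )
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()
```

Calling `fetch_cdna("ENST00000000000")` with a real transcript ID in place of the placeholder should return the sequence as a single string. The service is rate limited, so it is worth caching results if you look up many transcripts.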
