Welcome to the Genome Toolbox! I am glad you navigated to the blog and hope you find the contents useful and insightful for your genomic needs. If you find any of the entries particularly helpful, be sure to click the +1 button on the bottom of the post and share with your colleagues. Your input is encouraged, so if you have comments or are aware of more efficient tools not included in a post, I would love to hear from you. Enjoy your time browsing through the Toolbox.

Wednesday, July 16, 2014

RNA-seq: RPKM, FPKM, Formulas, and Scripts

Reads per kilobase per million mapped reads (RPKM) is a common metric for quantifying the expression of a gene transcript in sequencing data from RNA-seq experiments.  The RPKM measure of read density reflects the molar concentration of a transcript in the starting sample by normalizing for RNA length and the total number of reads. By doing so, RPKM values facilitate transparent comparison of transcript levels both within and between samples. The formula for RPKM is as follows

$$\mathrm{RPKM} = \frac{10^{9} \times ER}{EL \times MR}$$

where ER is the number of mapped reads in the gene's exons, EL is the summed length of the gene's exons in base pairs, and MR is the total number of mapped reads.
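As a quick illustration, here is a minimal Python sketch of the calculation, assuming the standard form RPKM = 10^9 × ER / (EL × MR). The function name `rpkm` and the example counts are made up for illustration and do not come from any particular tool.

```python
def rpkm(exon_reads, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    exon_reads         -- ER, reads mapped to the gene's exons
    exon_length_bp     -- EL, summed exon length in base pairs
    total_mapped_reads -- MR, total mapped reads in the sample
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 10^9 = 10^3 (bases per kilobase) * 10^6 (reads per million)
    return 1e9 * exon_reads / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads over 2,000 bp of exon, 20 million mapped reads
print(rpkm(500, 2000, 20_000_000))  # 12.5
```

Doubling the sequencing depth (and with it the reads on the gene) leaves the value unchanged, which is the point of the normalization.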

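The number of transcript copies (TC) can be derived from RPKM as well.  Essentially

$$\frac{ER}{MR} = \frac{TC \times EL}{TL}$$

where TL is the length of the transcriptome in base pairs.  This can then be rearranged, and RPKM substituted in, as follows

$$TC = \frac{ER \times TL}{EL \times MR} = \frac{\mathrm{RPKM} \times TL}{10^{9}}$$
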
The difficult part is estimating TL.  TL can be estimated from spike-in data or derived from the starting amount of mRNA if you are willing to assume 100% efficiency of cDNA synthesis.  A great paper on RPKM is by Mortazavi et al. (Nature Methods 2008).
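To make the conversion concrete, here is a small Python sketch. It assumes the relation TC = RPKM × TL / 10^9, which follows if a gene's share of mapped reads equals its share of the transcriptome's nucleotides. The function name and the transcriptome length used below are hypothetical placeholders; a real TL has to come from spike-ins or an mRNA mass estimate.

```python
def transcript_copies(rpkm_value, transcriptome_length_bp):
    """Estimate transcript copies (TC) from an RPKM value.

    Assumes TC = RPKM * TL / 10^9, where TL is the total transcriptome
    length in base pairs (an estimate you must supply yourself).
    """
    if transcriptome_length_bp <= 0:
        raise ValueError("transcriptome length must be positive")
    return rpkm_value * transcriptome_length_bp / 1e9


# Hypothetical values: RPKM of 12.5 and an assumed TL of 4 x 10^8 bp
print(transcript_copies(12.5, 400_000_000))  # 5.0
```

Because TC scales linearly with TL, any error in the TL estimate carries straight through to the copy number.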

Fragments per kilobase per million mapped fragments (FPKM) is essentially analogous to RPKM.  The only difference is that, rather than counting reads, you estimate the abundance of gene transcripts from the number of fragments observed.  In paired-end RNA-seq experiments (e.g., Illumina sequencing), fragments are sequenced from both ends, providing two reads for each fragment.  RPKM therefore counts individual reads (single end), while FPKM counts fragments, each represented by two reads (paired end).  A common misconception is that RPKM values are twice the corresponding FPKM values.  That is untrue, since FPKM is fragments per kilobase per million mapped fragments, not fragments per kilobase per million mapped reads: the gene's count and the sample total are both expressed in fragments, so the factor of two cancels.  RPKM is approximately equal to FPKM.
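The cancellation is easy to check numerically. The sketch below uses a made-up helper, `per_kb_per_million`, and made-up counts, and it assumes the idealized case where both mates of every fragment map. In real data some mates fail to map, which is why the two values are only approximately equal.

```python
import math


def per_kb_per_million(count_in_gene, exon_length_bp, total_mapped):
    """Generic 'per kilobase per million' normalization.

    Gives RPKM when both counts are reads and FPKM when both are fragments.
    """
    return 1e9 * count_in_gene / (exon_length_bp * total_mapped)


# Hypothetical paired-end sample, counted as fragments
gene_fragments = 500
total_fragments = 20_000_000
exon_length_bp = 2000

fpkm = per_kb_per_million(gene_fragments, exon_length_bp, total_fragments)

# The same sample counted as reads: every fragment contributes two reads,
# in the gene and in the sample total alike (idealized assumption)
rpkm_from_reads = per_kb_per_million(
    2 * gene_fragments, exon_length_bp, 2 * total_fragments
)

print(fpkm, rpkm_from_reads)
print(math.isclose(fpkm, rpkm_from_reads))
```

The factor of two only appears if read counts for the gene are divided by a fragment total, which is a units mistake rather than a property of the metrics.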

Here are links to some programs and scripts that are useful:
rpkmforgenes.py


