Reads per kilobase per million mapped reads (RPKM) is a common metric for investigating the expression of a gene transcript in sequencing data from RNA-seq experiments. As a measure of read density, RPKM reflects the molar concentration of a transcript in the starting sample by normalizing for RNA length and for the total number of reads. By doing so, RPKM values facilitate transparent comparison of transcript levels both within and between samples. The formula for RPKM is as follows:
$$\mathrm{RPKM} = \frac{10^9 \times ER}{EL \times MR}$$
where ER is the number of mapped reads in the gene's exons, EL is the sum of the exon lengths in base pairs, and MR is the total number of mapped reads.
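To make the arithmetic concrete, here is a minimal Python sketch of the RPKM calculation, using the standard definition RPKM = 10^9 × ER / (EL × MR). The function name `rpkm` and the example counts are made up for illustration and do not come from any particular tool.

```python
def rpkm(exon_reads, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads.

    exon_reads         -- ER, reads mapped to the gene's exons
    exon_length_bp     -- EL, summed exon length in base pairs
    total_mapped_reads -- MR, total mapped reads in the sample
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # The factor 1e9 combines "per kilobase" (1e3) and "per million reads" (1e6).
    return 1e9 * exon_reads / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads over 2,000 bp of exon, 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
print(example_rpkm)
```

Doubling the exon length or the sequencing depth while holding the read count fixed halves the value, which is exactly the normalization described above.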
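The number of transcript copies (TC) can be derived from RPKM as well. Essentially,
$$TC = \frac{ER / EL}{MR / TL}$$
where TL is the length of the transcriptome in base pairs. This can then be rearranged, and RPKM substituted in, as follows:
$$TC = \frac{ER \times TL}{EL \times MR} = \frac{\mathrm{RPKM} \times TL}{10^9}$$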
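As a rough sketch of that conversion, the snippet below assumes the relation TC = RPKM × TL / 10^9, which is what you get when you compare the per-base read density of the gene (ER / EL) with the per-base read density of the whole transcriptome (MR / TL). The function name and the TL value are hypothetical; a real TL has to come from spike-ins or a similar estimate.

```python
def transcript_copies(rpkm_value, transcriptome_length_bp):
    """Estimate transcript copies (TC) from RPKM and transcriptome length (TL).

    Assumes TC = RPKM * TL / 1e9, i.e. the gene's read density divided by
    the read density contributed by one copy's worth of transcriptome.
    """
    if transcriptome_length_bp <= 0:
        raise ValueError("transcriptome length must be positive")
    return rpkm_value * transcriptome_length_bp / 1e9


# Hypothetical numbers: RPKM of 25 and an assumed TL of 4e8 bp.
example_tc = transcript_copies(25.0, 4e8)

# The same answer computed directly from raw counts, (ER * TL) / (EL * MR),
# using made-up counts that correspond to an RPKM of 25.
direct_tc = (500 * 4e8) / (2000 * 10_000_000)
print(example_tc, direct_tc)
```

The estimate scales linearly with TL, so any error in the assumed transcriptome length carries straight through to the copy number.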
The difficult part is estimating TL. It can be estimated from spike-in data, or derived from the starting amount of mRNA if you are willing to assume 100% efficiency of the cDNA synthesis. A great paper on RPKM is Mortazavi et al. (Nature Methods 2008).
Fragments per kilobase per million mapped fragments (FPKM) is essentially analogous to RPKM. The only difference is that, rather than using read counts, you estimate the abundance of gene transcripts in terms of the fragments observed. In paired-end RNA-seq experiments (e.g., Illumina sequencing), fragments are sequenced from both ends, providing two reads for each fragment. RPKM therefore counts individual reads (the natural unit for single-end data), while FPKM counts fragments, each represented by two reads in paired-end data. A common misconception is that RPKM values are twice FPKM values. That is untrue, since FPKM is fragments per kilobase per million mapped fragments, not fragments per kilobase per million mapped reads: the factor of two appears in both the numerator and the denominator and cancels. RPKM is approximately equal to FPKM.
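A small numerical check shows why the factor of two cancels. This is only a sketch under an idealized assumption: every fragment yields exactly two mapped reads. In real data some mates fail to map, which is why the two values are approximately rather than exactly equal. All names and counts are made up for illustration.

```python
def per_kilobase_per_million(count, length_bp, total_count):
    """Shared form of RPKM and FPKM: 1e9 * count / (length * total)."""
    return 1e9 * count / (length_bp * total_count)


# Hypothetical paired-end sample, counted in fragments.
gene_fragments = 250
total_fragments = 5_000_000
exon_length_bp = 2000

fpkm_value = per_kilobase_per_million(gene_fragments, exon_length_bp, total_fragments)

# Idealized assumption: both mates of every fragment map, so every read
# count is exactly twice the corresponding fragment count.
gene_reads = 2 * gene_fragments
total_reads = 2 * total_fragments

rpkm_value = per_kilobase_per_million(gene_reads, exon_length_bp, total_reads)

print(fpkm_value, rpkm_value)
```

Both the gene-level count and the sample-level total double when you switch from fragments to reads, so the ratio, and therefore the normalized value, stays the same.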
Here are links to some programs and scripts that are useful:
rpkmforgenes.py