Okay, you have a list of SNPs, many of which are in high linkage disequilibrium (highly correlated) with other SNPs in the list. Your goal is to generate a new list of more or less independent SNPs. To do this, you need to LD prune the original list: within each region of high LD, remove all SNPs except one. Essentially, you set a threshold, say R2 > 0.50, and filter out SNPs correlated with another SNP above this threshold, making sure at least one SNP is kept to tag each region of LD.
I feel like there should be an online resource that would do this based on SNP correlations in a reference population (say, HapMap or 1000 Genomes), but I cannot find one. Please leave a comment below if you are aware of such a resource. Instead, you need an actual dataset (for example, .ped and .map files) to feed as input to an LD thinning program. Current programs use the LD structure within the input population and return a list of LD thinned SNPs.
PLINK is the first place I usually go to do this. The program has two options for LD thinning: based on variance inflation factor (by regressing a SNP on all other SNPs in the window simultaneously) and based on pairwise correlation (R2). These are the --indep and --indep-pairwise options, respectively. Here is code I use:
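A minimal sketch of the command, using placeholder file prefixes (data_in, data_out) to match the files described below; the 50-SNP window and 5-SNP step are illustrative choices you should tune for your data, while the 0.5 threshold matches the R2 cutoff mentioned above:

```shell
# Pairwise-correlation pruning: 50-SNP sliding window, step of 5 SNPs, R^2 threshold 0.5
plink --file data_in --indep-pairwise 50 5 0.5 --out data_out

# Or the variance inflation factor alternative: window, step, VIF threshold
plink --file data_in --indep 50 5 2 --out data_out
```

Either form writes the data_out.prune.in and data_out.prune.out files discussed next.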
For details on the options click here. Essentially this takes a data_in.ped and a data_in.map file and outputs two files: data_out.prune.in (the LD pruned SNPs to keep) and data_out.prune.out (the SNPs to remove). PLINK is easy and relatively quick, taking only a few minutes on a list of SNPs spanning 4 megabases.
Alternatively, GLU also has a built-in LD thinning feature. This requires converting your .ped and .map files into a .ldat file (for details click here). It also requires a snplist file with a header containing "Locus" and "score p-value" columns. It was a little more work up front and seems to be optimized for analyzing GWAS association results, but I created such a file with the score p-value set to one for all SNPs. I assume that if I used real p-values in the list, the program would prioritize the pruning to keep SNPs with the lowest p-values. There also appears to be an option to force in certain loci (--includeloci). To run the thinning procedure in GLU, I used the following code:
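A rough sketch of the sort of command involved. Note the module name below is my recollection of GLU's tagging module and the file names are placeholders, so verify the exact spelling against the GLU documentation; only the --includeloci option is confirmed above:

```shell
# Sketch only: module and output option names are assumptions -- check the GLU docs.
# data_in.ldat is the converted genotype file; snplist.txt holds the
# "Locus" / "score p-value" columns; force_in.txt lists loci to force in.
glu ld.tagzilla data_in.ldat \
  --includeloci=force_in.txt \
  > data_out.txt
```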
For more details and options, click here. While this takes a bit longer to run and the documentation is rather sparse on how the thinning is performed (it appears to use pairwise R2 within windows), it does provide a nice output file that clearly details which SNPs are kept and why each removed SNP was dropped.
Recently, I became aware of PriorityPruner (thanks Christopher). This is a Java program that runs on any OS (Windows, UNIX, Mac) as long as you have Java installed. It has much of the same functionality and many of the same options as GLU, but it doesn't require the .ldat format and has much better documentation. The program can be downloaded here and is run with the command:
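Something along these lines, with placeholder file names; the flag spellings below are how I recall them from the PriorityPruner documentation, so double-check them there before running:

```shell
# PriorityPruner reads transposed PLINK files (.tped/.tfam) plus a SNP input
# table (SNP names, positions, p-values/design scores, force-select flags).
java -jar PriorityPruner.jar \
  --tped data.tped \
  --tfam data.tfam \
  --snp_table snp_table.txt \
  --r2 0.5 \
  --out data_pruned
```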
I have yet to test PriorityPruner, but so far I am impressed with what I see.
I hope this post is useful for others looking to thin a set of SNPs. If you are aware of other LD thinning approaches, please describe them in the comments below. Happy pruning!