Genome Toolbox: index


Monday, September 22, 2014

1000 Genome Phase 3 Variant Set Mirrors

The 1000 Genomes Project consortium has published the final release of its Phase 3 variant set, a set of phased variant calls for the individuals sequenced by the project. The data are available as compressed .vcf.gz files along with their tabix index files. Looking for a way to download the data? See these FTP sites:

US Mirror:
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502/

UK Mirror:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/

The US 1000 Genomes mirror is substantially faster than the UK mirror when downloading from within the US.
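Because each .vcf.gz in the release ships with a tabix index, a single region can usually be pulled from a mirror without downloading the whole file. Here is a minimal sketch of that idea. It assumes tabix is installed and was built with support for remote files; fetch_region is a hypothetical helper name, and the chromosome file name and region in the usage lines are placeholders.

```shell
# fetch_region is a hypothetical helper; tabix must be on your PATH and
# built with support for remote (FTP/HTTP) files.
fetch_region() {
    local url="$1"
    local region="$2"
    # -h keeps the VCF header lines in the output.
    # tabix reads the .tbi index first, then only the blocks covering the region.
    tabix -h "$url" "$region"
}

# Usage (file name and region are placeholders):
# mirror="ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/release/20130502"
# fetch_region "$mirror/<chromosome file>.vcf.gz" 2:1000000-1100000 > region.vcf
```

Swapping the mirror variable for the UK address is all it takes to switch mirrors.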

Wednesday, September 10, 2014

How to Index a VCF File

There are two simple ways to create an index for a VCF file of sequence variants. The first is a command-line approach using Tabix. For directions on installing Tabix, see this post. Tabix can only index a compressed file, so first make sure the VCF file is compressed with bgzip as a .vcf.gz file (plain gzip output will not work). Next, run tabix to create a new .tbi index file in the same directory as your .vcf.gz file. The -f option overwrites an old index file that may be outdated or corrupted, and the -p vcf option tells tabix to use the "vcf" file format.
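The two steps above can be sketched as a small shell function. This is a minimal sketch, assuming bgzip and tabix are both on your PATH; index_vcf is a hypothetical helper name and my_variants.vcf is a placeholder file name.

```shell
# index_vcf is a hypothetical helper; bgzip and tabix must be on your PATH.
index_vcf() {
    local vcf="$1"
    case "$vcf" in
        *.gz)
            # Already compressed (assumed to be bgzip output, not plain gzip).
            ;;
        *)
            # bgzip writes "$vcf.gz" and removes the uncompressed original.
            bgzip "$vcf" || return 1
            vcf="$vcf.gz"
            ;;
    esac
    # -f overwrites an existing index; -p vcf selects the VCF preset.
    # The index is written next to the data file as "$vcf.tbi".
    tabix -f -p vcf "$vcf"
}

# Usage (placeholder file name):
# index_vcf my_variants.vcf
```

If a .vcf.gz was produced with plain gzip, decompress it and recompress with bgzip before indexing.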

The second way to index a VCF file is a point-and-click approach using the Broad Institute's Integrative Genomics Viewer (IGV), a Java-based program that runs on a variety of operating systems. To index a VCF file, open IGV, click on the Tools menu, and select Run igvtools... A dialog box will pop up. In the Command drop-down menu select Index, then click Browse to select your desired .vcf file. Click Run and a new index file will be created in the same folder (for an uncompressed .vcf, igvtools writes an .idx index rather than a .tbi).

There are probably other ways to index a VCF file, but these are the ones I am aware of. If you are aware of another method, please share in the comments.

Friday, May 16, 2014

Create Index for BAM File

Sometimes databases or contributors share .bam files but fail to provide the associated BAM index file (the .bai file). The BAM index file, usually named filename.bam.bai, is needed to visualize the reads in IGV as well as in several other applications. Samtools can easily generate one with its index subcommand: for a BAM file called filename.bam, running samtools index on it writes filename.bam.bai in the same directory.
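As a minimal sketch, the command can be wrapped so that it only runs when no index is present yet. This assumes samtools is on your PATH and that the BAM file is coordinate-sorted; ensure_bam_index is a hypothetical helper name.

```shell
# ensure_bam_index is a hypothetical helper; samtools must be on your PATH.
ensure_bam_index() {
    local bam="$1"
    # Skip the work when an index already sits next to the BAM file.
    if [ -e "$bam.bai" ]; then
        echo "index already present: $bam.bai"
        return 0
    fi
    # samtools index expects a coordinate-sorted BAM and writes "$bam.bai".
    samtools index "$bam"
}

# Usage:
# ensure_bam_index filename.bam
```

If you suspect an existing index is stale or corrupted, delete the .bai file first, or simply run samtools index filename.bam directly.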

Monday, November 25, 2013

Installing Tabix on UNIX

Tabix is an incredibly useful tool that indexes large .vcf files and makes it very efficient to find a particular genomic region in the file. It is particularly useful for downloading small portions of files from publicly available ftp data repositories. Here is a brief tutorial on how to install Tabix on a UNIX operating system.

(1) Go here to download the newest release.
(2) Extract the downloaded archive, for example with tar.
(3) Compile the program by typing make inside the extracted directory on the UNIX command line.
(4) Add tabix to your PATH by adding a line such as export PATH=$PATH:path_to_tabix to your .bashrc file, saving the file, and typing source ~/.bashrc on the UNIX command line. Note: path_to_tabix is the directory where tabix is installed.
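Steps (2) through (4) can be sketched roughly as follows. This is only a sketch under a few assumptions: the download is a .tar.bz2 source archive that unpacks into a directory of the same name, and build_tabix is a hypothetical helper name. Substitute the name of the archive you actually downloaded.

```shell
# build_tabix is a hypothetical helper. It assumes a .tar.bz2 source archive
# that unpacks into a directory named after the archive.
build_tabix() {
    local archive="$1"
    local srcdir
    srcdir=$(basename "$archive" .tar.bz2)
    # Step (2): extract the archive into the current directory.
    tar -xjf "$archive" || return 1
    # Step (3): compile inside the extracted directory (the subshell keeps
    # your current working directory unchanged).
    ( cd "$srcdir" && make )
}

# Usage (placeholder archive name):
# build_tabix <downloaded archive>.tar.bz2
#
# Step (4): add this line to ~/.bashrc, then run: source ~/.bashrc
# export PATH="$PATH:path_to_tabix"
```

After reloading your .bashrc, typing tabix with no arguments should print its usage message if the PATH entry is correct.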

Monday, June 17, 2013

Index FASTA File

Some programs require indexed FASTA files (*.fa.fai files). Indexing a FASTA file is easy to do with Samtools: run samtools faidx on your FASTA file, and the index is written next to it with .fai appended to the file name.
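Here is a minimal sketch that builds the index and then prints each sequence name and length from it (the first two tab-separated columns of the .fai file). It assumes samtools is on your PATH; fasta_index_summary is a hypothetical helper name and reference.fa is a placeholder.

```shell
# fasta_index_summary is a hypothetical helper; samtools must be on your PATH.
fasta_index_summary() {
    local fasta="$1"
    # samtools faidx writes the index to "$fasta.fai".
    samtools faidx "$fasta" || return 1
    # Columns 1 and 2 of the .fai file are the sequence name and its length.
    cut -f1,2 "$fasta.fai"
}

# Usage (placeholder file name):
# fasta_index_summary reference.fa
```

If you only need the index itself, samtools faidx reference.fa on its own is enough.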

Thursday, May 9, 2013

Thousand Genomes Complete Genomics Information

Recently I have been using the Complete Genomics high coverage sequencing data from the 1000 Genomes project. It's a bit difficult to find information on the populations, samples, and available sequencing data, since they are all stored in different places on the project's FTP server. I decided to write a post that combines all the useful information in one spot. Here are links to files that may be of interest.

population file: gives a key of abbreviations used for the 1000 Genomes populations

pedigree file: provides relationship information on 1000 Genomes individuals that are related as well as gender and 1000 Genomes population

sample file: detailed spreadsheet that offers information on sample id, accession number, population, family, gender, relationship, sequencing center, and coverage for each 1000G sample.

Thursday, May 2, 2013

Thousand Genomes Complete Genomics Index

As mentioned previously, 1000 Genomes has now made Complete Genomics whole genome sequencing data publicly available for download. They provide an index file that lists some of the individual high coverage .bam files available for download, but it seems to be missing a lot of the newer data released this April. I wanted a way to efficiently search the 1000 Genomes FTP site to build a better index of the available Complete Genomics data. I am primarily interested in CEU samples and wanted the high coverage evidence support files, so I wrote some code to search through the FTP site. I chose to search for and download only the .bai files, since they are quick to download and make a useful index for downloading the .bam files.
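One way to do this kind of search is a recursive wget that keeps only .bai files. To be clear, this is a sketch of the general approach and not the original script. It assumes wget is installed; mirror_bai_files is a hypothetical helper name, bai_index is an arbitrary output directory, and the starting directory on the FTP site is left as a placeholder.

```shell
# mirror_bai_files is a hypothetical helper, not the original script.
mirror_bai_files() {
    local base_url="$1"
    local dest="${2:-bai_index}"
    # -r   recurse through the FTP directory tree
    # -np  never ascend above the starting directory
    # -A   keep only files whose names match the pattern
    # -P   save everything under the destination directory
    wget -r -np -A '*.bai' -P "$dest" "$base_url"
}

# Usage (starting directory is a placeholder):
# mirror_bai_files "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/<data directory>/"
#
# Each downloaded .bai path, minus its .bai suffix, points at a .bam file:
# find bai_index -name '*.bai' | sed 's/\.bai$//'
```

Because wget recreates the remote directory layout under the destination directory, the resulting file list doubles as an index of which samples have .bam files available.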
