Genome Toolbox: file


Friday, August 22, 2014

Search for File Type in UNIX Directory and All Subdirectories

Here's a simple way to find and list all files of a given type in a UNIX directory and all of its subdirectories. This example script shows how to search for all files with the extension ".R":
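For use inside a script, the same search can be sketched in Python with the standard pathlib module. The function name find_files and its arguments are illustrative assumptions, not part of any existing tool.

```python
from pathlib import Path


def find_files(root, extension=".R"):
    """Return the files under root (searched recursively) whose names end in extension."""
    # rglob walks root and every subdirectory; is_file() skips directories.
    return sorted(p for p in Path(root).rglob("*" + extension) if p.is_file())
```

Calling find_files(".") returns the matching paths below the current directory. On most UNIX file systems the match is case sensitive, so ".R" and ".r" are treated as different extensions.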

Wednesday, July 23, 2014

Sort BAM File in Samtools

Samtools does a host of useful operations for .bam files. One such operation is sorting. Below is a simple example script to show how to use samtools to sort an unsorted BAM file.

This script will sort unsorted.bam using 8 threads, allocating up to 12G of memory to each thread (so up to 96G in total). The resulting sorted .bam file will be called sorted.bam.

To confirm a .bam file is sorted, check the header (samtools view -H sorted.bam) for the tag SO:coordinate on the @HD line.
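If you want to automate that check, here is a minimal Python sketch. It assumes you have already captured the header text (for example, the output of samtools view -H) as a string; the function name is_coordinate_sorted is a made-up helper.

```python
def is_coordinate_sorted(header_text):
    """Return True if the @HD header line carries the SO:coordinate tag."""
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "@HD":
            # Header tags are tab-separated TAG:VALUE pairs after the record type.
            return "SO:coordinate" in fields[1:]
    # No @HD line at all: the sort order is not declared.
    return False
```

Note that this only reports what the header claims. It does not verify that the reads themselves are actually in coordinate order.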

Additionally, here is the usage information for Samtools sort:

Usage:   samtools sort [options] <in.bam> <out.prefix>

Options: -n        sort by read name
         -f        use <out.prefix> as full file name instead of prefix
         -o        final output to stdout
         -l INT    compression level, from 0 to 9 [-1]
         -@ INT    number of sorting and compression threads [1]
         -m INT    max memory per thread; suffix K/M/G recognized [768M]

Tuesday, June 10, 2014

Easily Parse XML Files in Python

While not the most popular means of storing data, there are instances where data is stored in the XML file format. The TCGA project is one such group that keeps a record of samples and analysis information in this format. It can be difficult to extract data from these files without the appropriate tools. In some cases a simple grep command can be used, but even then the output usually needs to be cleaned. Luckily, Python has some packages that aid in parsing XML files so that the desired data can be extracted. The library I found most useful was ElementTree. This package, combined with the urllib2 package, lets you download an XML file from the internet, parse it, and extract the desired information from it. Below is an example Python script that downloads an XML description file from TCGA and then places each extracted element into a Python dictionary object. Of course this could easily be modified for your particular application, but at least it provides a simple backbone on which to build other scripts.
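As a self-contained illustration of the parsing step, here is a minimal sketch that uses ElementTree on a small made-up XML string instead of a downloaded TCGA file. The tag names and values in EXAMPLE_XML are invented for the example and are not taken from any real TCGA record.

```python
import xml.etree.ElementTree as ET

# Made-up XML standing in for a downloaded description file (hypothetical tags).
EXAMPLE_XML = """<Result>
  <analysis_id>abc-123</analysis_id>
  <sample_id>S1</sample_id>
  <disease_abbr>BRCA</disease_abbr>
</Result>"""


def xml_to_dict(xml_text):
    """Map each direct child tag of the root element to its text content."""
    root = ET.fromstring(xml_text)
    # child.text is None for empty elements, so fall back to an empty string.
    return {child.tag: (child.text or "").strip() for child in root}


record = xml_to_dict(EXAMPLE_XML)
print(record["analysis_id"])
```

To work on a remote file, the XML text would first be fetched and then passed to xml_to_dict. In Python 3 the role of urllib2 is filled by urllib.request.urlopen. Nested elements would need a recursive walk, which this flat sketch leaves out.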

Below is the expected output:

Friday, May 16, 2014

Create Index for BAM File

Sometimes databases or contributors share .BAM files but fail to provide the associated BAM index file (the .BAI file). The BAM index file, usually named filename.bam.bai, is needed to visualize the reads in IGV as well as in several other applications. Samtools can easily generate an index file for your sequence BAM file. The code below is an example of how to do so for a BAM file called filename.bam.
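When working through many files, it can help to check for a missing index before building one. The Python sketch below is one way to do that. The helper names are illustrative assumptions, and it simply shells out to the samtools index command, which must be installed and on your PATH.

```python
import os
import subprocess


def has_index(bam_path):
    """True if filename.bam.bai or filename.bai sits next to the BAM file."""
    stem = os.path.splitext(bam_path)[0]
    return os.path.exists(bam_path + ".bai") or os.path.exists(stem + ".bai")


def index_command(bam_path):
    """Build the command line that asks samtools to index the BAM file."""
    return ["samtools", "index", bam_path]


def ensure_index(bam_path):
    """Run samtools index only when no index file is found."""
    if not has_index(bam_path):
        # check=True raises CalledProcessError if samtools exits with an error.
        subprocess.run(index_command(bam_path), check=True)
```

Calling ensure_index("filename.bam") leaves an existing index untouched and otherwise creates one. Indexing requires the BAM file to be coordinate sorted.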

Friday, April 4, 2014

Merge Changes from Multiple Word Files into One Document

Collaborations get you access to lots of data. However, collaborations lead to long author lists; long author lists lead to many comments from co-authors; and many comments from co-authors can lead to great headaches trying to track changes and get a final clean manuscript together. Fortunately, Microsoft Word has a built-in feature that lets users merge changes from many different contributors into one master document (.doc or .docx file). This is done iteratively, two documents at a time, until all the comments from reviewers are in one merged MS Word document. To do this, follow these steps:

1) Open a blank document in Microsoft Word
2) Go to the Review tab and click the Compare icon and then select Combine....
3) In the dialogue box that pops up, input your original file name in the Original document field and one of the changed document file names into the Revised document field.
4) Click on the More button and ensure the radio button next to Original document is selected under the Show changes in... heading.
5) Click OK and a document will be generated that merges changes from your original and revised document.
6) Repeat steps 2-5 for each additional revised document, using the merged document as the original each time.

It is a bit repetitive, but eventually all the changes from each file will be combined and tracked into one master document. Ideally, the developers at Microsoft will improve the functionality of this so that many changes from many documents can be merged into one document in a single step. A final note is that Word can only store one set of formatting changes at a time, so if formatting does change from draft to draft a dialogue box will appear asking you which formatting you want to use. Hope this saves you a lot of time and frustration.

Tuesday, March 11, 2014

Notepad++ Replace a Tab with a Newline

With the extended search mode in Notepad++, it is easy to find and replace non-printing characters such as tabs and newlines. First open the Replace dialogue box and click the radio button next to Extended (\n, \r, \t, \0, \x...) in the Search Mode box on the bottom left. Now you are ready to search for these escape sequences in your file. Below are some examples. Note that \r\n is the Windows line ending; for files with UNIX line endings, use \n alone.

To replace all tabs with newlines:
Find what: \t
Replace with: \r\n

To replace all newlines with tabs:
Find what: \r\n
Replace with: \t
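For files too large to open comfortably in an editor, the same two replacements can be sketched in Python. The function names are illustrative, and the newline argument defaults to the Windows line ending used in the examples above.

```python
def tabs_to_newlines(text, newline="\r\n"):
    """Replace every tab with a line ending."""
    return text.replace("\t", newline)


def newlines_to_tabs(text, newline="\r\n"):
    """Replace every line ending with a tab."""
    return text.replace(newline, "\t")
```

When reading and writing the file, open it with newline="" so that Python does not translate the line endings behind your back.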

Thursday, February 20, 2014

Zero Fill Coverage Gaps in Samtools Depth Output

Merging depth output from multiple .bam files can be difficult since, by default, samtools depth only outputs counts for coordinates with non-zero coverage. If you want to merge depth output from these .bam files, you first need to fill in the base pair positions with no coverage with zero values so that the depth output for every .bam file has the same length. Then, with a simple UNIX command such as paste (to line the files up side by side) or cat, you can merge the depth output of multiple .bam files into one file for comparison and analysis.

Here is a simple Python script to zero fill Samtools depth output:
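A minimal sketch of such a script follows. It is not necessarily the original: it assumes the usual three tab-separated columns of samtools depth output (chromosome, 1-based position, depth) and a single region whose chromosome, start, and end you supply yourself. The name zero_fill is a made-up helper.

```python
def zero_fill(depth_lines, chrom, start, end):
    """Yield one tab-separated depth line per position in start..end, using 0 where missing."""
    depth = {}
    for line in depth_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and lines from other chromosomes.
        if len(fields) < 3 or fields[0] != chrom:
            continue
        depth[int(fields[1])] = fields[2]
    # Emit every position in the inclusive range, covered or not.
    for pos in range(start, end + 1):
        yield "{}\t{}\t{}".format(chrom, pos, depth.get(pos, "0"))
```

For example, with the depth output saved to a file, something like `with open("sample.depth") as fh: print("\n".join(zero_fill(fh, "chr1", 1, 1000)))` would print a line for every position from 1 to 1000 (the file name and region here are placeholders). The covered positions are held in a dictionary, so very large regions may be better processed in chunks.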

And below is an example of how to use it:

Wednesday, December 18, 2013

Copy and Transfer Files To or From SFTP Site

Transferring files to and from an SFTP site is relatively simple once you have generated a public/private key pair and have your account set up.

To log in to the SFTP server, type sftp username@server at the UNIX command prompt. Once logged in, you can change directories on the SFTP site the same way you would in UNIX: with the cd command. You can also change directories on your local machine using the lcd command. Before transferring files, first move to the local directory and the SFTP directory that you want to transfer the files between.

To copy a file from the SFTP server to your local host, use the get command. For example, if you wanted to get the file ids.txt, you would type get ids.txt at the command prompt. Conversely, to transfer a file to the SFTP site from your local host, use the put command. The sftp program also accepts wildcards (*), so if, for example, you wanted to transfer all .jpg files to the SFTP server, you would type put *.jpg at the command prompt. Below are a few other useful commands.

SFTP Command	Description
cd dir	Change directory on the SFTP server to dir.
lcd dir	Change directory on your machine to dir.
ls	List files in the current directory on the SFTP server.
lls	List files in the current directory on your machine.
pwd	Print the current directory on the SFTP server.
lpwd	Print the current directory on your machine.
get file	Download the file from the SFTP server to the current directory on your machine.
put file	Upload the file from your machine to the SFTP server.
exit	Exit from the sftp program.

Also, remember it is always a good idea to compare the checksums of the files you have downloaded against those provided on the SFTP site to ensure you downloaded the files in their entirety.
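One way to compute a checksum for the comparison is sketched below with Python's standard hashlib module. The function name file_checksum is illustrative, and the algorithm defaults to MD5 only as an assumption; use whichever algorithm the site publishes.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string can then be compared directly with the published value, for example `file_checksum("ids.txt") == expected`, where expected is the checksum copied from the site.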

Tuesday, December 3, 2013

List Contents of a .tar.gz File

Here is a quick one liner to see the contents of a .tar.gz file:

This will list all the files in the .tar.gz tarball. The v option can be left out if you only want the file names without the detailed file information.
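The same listing can be produced from a script with Python's standard tarfile module. This is a minimal sketch, and the function name list_tarball is an assumption for illustration.

```python
import tarfile


def list_tarball(path):
    """Return the names of all members in a gzip-compressed tarball."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()
```

For a detailed listing with permissions, sizes, and dates, the TarFile object also offers archive.list(verbose=True), which prints the information instead of returning it.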

Extract Only Desired Files from Compressed File

Lately I have been working with some large tarballs (tar.gz files). To access the compressed contents, I would usually extract every file in the archive. Decompressing all the files was time consuming, filtering through them for the files I wanted was a pain, and deleting everything I didn't want took even more time. I knew there had to be a better way to get just the files I wanted. Sure enough, the basic tar command has options that allow you to extract a single file of interest or even a set of files using wildcards. The command looks like this:

where,
big_file.tar.gz is the tarball you are extracting from

path is the path to your file in the tarball

your_file.txt is the file you want to extract

This can also be done using wildcards. For example, if I wanted all text files in the above path I could substitute *.txt for your_file.txt.
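For use in a script, here is a minimal Python sketch of selective extraction with the standard tarfile and fnmatch modules. The function name extract_matching and its arguments are illustrative assumptions.

```python
import fnmatch
import tarfile


def extract_matching(tar_path, pattern, dest="."):
    """Extract only the regular files whose archive path matches a shell-style pattern."""
    with tarfile.open(tar_path, "r:gz") as archive:
        wanted = [m for m in archive.getmembers()
                  if m.isfile() and fnmatch.fnmatch(m.name, pattern)]
        # Passing members restricts extraction to the selected entries.
        archive.extractall(path=dest, members=wanted)
    return [m.name for m in wanted]
```

For example, extract_matching("big_file.tar.gz", "path/*.txt") would pull out the text files under path. Unlike the shell, fnmatch lets * match across slashes, so files in deeper subdirectories match as well. Only extract archives from sources you trust, since member paths in a tarball can point outside the destination directory.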

Friday, November 22, 2013

Make BED File from Illumina Manifest

To compare coverage across SNP arrays, it is handy to convert Illumina manifest files into .bed files for easy manipulation. To do this, the first step is to download the desired manifest file from the Illumina website. Here is a site that lists the downloads available for each array they support. Once you find the array, select the manifest file (I usually choose the CSV file) and either download it from the browser or copy the link and use wget. Here is an example for the Methylation450 array:

Next, look at the manifest file and determine which columns hold the chromosome and the base pair coordinate (usually named something like Chr and Coordinate). Put these column numbers into the line of UNIX code below and you will have a .bed file made from the manifest file. Here I am making a file only for the X chromosome, but this can be modified to fit your needs as well.
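The same conversion can be sketched in Python with the standard csv module, selecting columns by name instead of by number. The column names used as defaults here (Chr, Coordinate, Name) are assumptions, so check them against the header of your own manifest. The sketch also assumes any preamble lines above the column header have already been removed.

```python
import csv


def manifest_to_bed(rows, chrom_col="Chr", pos_col="Coordinate",
                    name_col="Name", keep_chrom="X"):
    """Yield BED lines (0-based start, 1-based end) for probes on one chromosome."""
    for row in csv.DictReader(rows):
        chrom = (row.get(chrom_col) or "").strip()
        pos = (row.get(pos_col) or "").strip()
        # Skip other chromosomes and rows without a usable coordinate.
        if chrom != keep_chrom or not pos.isdigit() or int(pos) < 1:
            continue
        end = int(pos)
        name = (row.get(name_col) or "").strip()
        yield "chr{}\t{}\t{}\t{}".format(chrom, end - 1, end, name)
```

Because BED intervals are 0-based and half-open, a probe at 1-based coordinate 100 becomes start 99 and end 100. An open file handle can be passed directly as rows.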

You should be all set with a .bed file from your manifest file. Just ensure that when you are making comparisons you are using the correct genome build.

Tuesday, July 16, 2013

Convert .PED and .MAP to and from .LDAT File Format

To run GLU, your data needs to be in the .ldat format. Luckily, GLU makes it easy to convert from .ped and .map files to .ldat format with the transform command. Here is how to carry out the transformation.

From .ped/.map → .ldat:

And likewise from .ldat → .ped/.map:

More details on the transform command can be found here.

Convert .GEN and .SAMPLE to and from .PED and .MAP File Format

A program called GTOOL can easily convert .gen and .sample files into .ped and .map files (as well as .ped and .map file format to .gen and .sample format). The usage is quite simple.

For .gen/.sample → .ped/.map:

For .ped/.map → .gen/.sample:

Friday, July 5, 2013

Getting Set Up on a UNIX Cluster

Okay, you have been granted access to a UNIX cluster to work on a project, but have no idea how to get started with things. No problem. This post is designed to clue you in on the essentials so you can get up and running in no time.

SSH Client
This is the first thing you will need. SSH clients are programs that let your computer connect remotely to the UNIX cluster. SSH stands for Secure SHell, a network protocol for securely communicating data from one computer to another.

Probably the most widely used SSH client for Windows users is PuTTY. It is free and easy to set up. Download putty.exe and move it somewhere you can easily access it. Double-clicking it will bring up a security warning; select Run and PuTTY will open. For the Host Name, enter the host name or IP address of the cluster (ex: computer.university.edu). In general, this is all you need to do before pressing the Open button. You can choose to save this as a session for easy access in the future, but just accept all the defaults for now. PuTTY will then connect to the computer you specified. If you are a Mac user, this is much easier to do. Just open the pre-installed Terminal application and type "ssh username@computer.university.edu", where computer.university.edu is the host name of the computer you want to connect to.

The first time you connect you will get a warning that the server's host key is not yet known. Just choose the option to proceed. Next, the remote computer will usually ask for a username and password. Enter what was given to you by the cluster administrator. You will not see the password as you type it in. After your credentials have been accepted, you will have access to the remote computer. Congratulations, you are now connected! For Windows users, there are also other flavors of PuTTY that can be used. I like PuTTYTray for its added options, as well as MTPuTTY, which allows multiple tabs to be open simultaneously.

File Transfer Application
Next, you will need a way to transfer files and scripts from your computer to the remote computer (cluster) and vice versa. WinSCP is an excellent free program to use on a Windows operating system. Unfortunately, Macs really don't have an equivalent (if you have any suggestions, let me know). Once WinSCP is downloaded and set up, open the program and select New. Fill in the Host name and User name. I usually leave the password blank (in which case I will be asked to provide it manually later), but this is at your discretion. From there you can log in. After your session has been authenticated, you will be greeted with a window that has two major panes. The left side is your computer and the right side is the remote computer. Just drag files from one pane to the other to transfer them between the two computers. You now have a way to transfer files to and from the remote computer!

X Server
Some programs require an X server to display output from X11 sessions. This essentially allows you to open an interactive window on your computer in which you can interact with the remote computer. Some programs such as R, Python, and Java have modules that can interact in this fashion. Should you find out you need an X server, I would highly recommend Xming for Windows users. It is free and just needs to be open in the background while you have the terminal open. Mac will have this functionality built in. Either way, X11 forwarding needs to be enabled before you log in to the cluster. On Macs, simply add the -X option when connecting through the terminal (ex: ssh -X username@computer.university.edu). For Windows, open PuTTY and in the left menu select Connection > SSH > X11 and check the box that enables X11 forwarding. Then go back to the top of the menu to Session and log in as before. To make sure this is working, type the command "xeyes". Do you see two eyeballs looking at you? If so, it works!

Text Editor
If you like Vi or Emacs, this is not for you. For everyone else, there are a wide variety of text editors that are helpful in writing your code in a variety of programming languages. I prefer Notepad++ for Windows. For Macs, however, I am still searching for a worthy Notepad++ alternative.

Well, I hope this was helpful in getting you up and running in a UNIX computing environment. From personal experience, I know the learning curve can be steep. Best wishes as you learn to navigate your way around a remote cluster. If you have any helpful suggestions, I highly encourage you to post a comment below.
