Genome Toolbox: shell


Monday, June 9, 2014

UNIX For Loop through Files in Directory

Sometimes I want to perform a particular task on each file in a directory using a UNIX shell script, but for some reason I can never remember the correct syntax. Here is a quick example bash script that loops through all files with a particular extension in a directory and does something with each filename.

The example loops through all .tar.gz files in the directory and extracts any text files contained in those archives.
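A minimal sketch of that kind of loop is given here for reference. It is an illustration under assumptions rather than the original script: the function name extract_text_files is hypothetical, and "text file" is taken to mean an archive member whose name ends in .txt. Members are listed first and then extracted by name, so no GNU-specific tar options are needed.

```shell
#!/bin/bash
# Hypothetical helper: extract every .txt member from each .tar.gz in a directory.
# Usage: extract_text_files [DIRECTORY]   (defaults to the current directory)
extract_text_files() {
    local dir="${1:-.}"
    local archive member
    for archive in "$dir"/*.tar.gz; do
        # If nothing matches, the glob is left as a literal string, so skip it.
        [ -e "$archive" ] || continue
        echo "Processing $archive"
        # List the archive members, keep the names ending in .txt,
        # and extract each one next to the archive.
        tar -tzf "$archive" | grep '\.txt$' | while IFS= read -r member; do
            tar -xzf "$archive" -C "$dir" "$member"
        done
    done
}
```

Calling extract_text_files /path/to/archives would process every .tar.gz in that directory. Swapping the body of the loop for another command is all it takes to reuse the pattern for a different extension or action.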

Sunday, June 1, 2014

Generate Random Genomic Positions

Generating random genomic positions or coordinates can be useful for comparing characteristics of a set of genomic loci to what would be expected from permutations of the underlying genomic distribution. Below is a Python script to aid in selecting random genomic positions. The script chooses a chromosome with probability proportional to chromosome length and then chooses a position from a uniform distribution over that chromosome's length. A gap check is included to ensure the chosen position lies within the accessible genome. You can choose the number of positions you want, the number of permutations to conduct, the size of the genomic positions, and the genomic build of interest.

A UNIX shell script is included as a wrapper to automatically download the needed chromosomal gap and cytoband files and to run the Python script. Usage for the UNIX script can be seen by typing ./make_random.sh from the command line after giving the script executable privileges. An example command would be ./make_random.sh 100 10 1000 hg19. This command would make 10 .bed files, each with 100 random 1 kb genomic regions from the hg19 genome build. Below are the make_random.sh and make_random.py scripts.
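The selection logic described above can be sketched in a few lines of shell and awk. This is not make_random.py or make_random.sh, only an illustration of the same idea under assumptions: the function name random_regions is hypothetical, the chromosome sizes are read from a two-column tab-separated file (name, length), and the gaps from a three-column tab-separated file (chromosome, start, end; 0-based, half-open) rather than from the raw downloaded gap table.

```shell
#!/bin/bash
# Hypothetical helper: print COUNT random regions of SIZE bp in BED format.
# Usage: random_regions COUNT SIZE SIZES_FILE GAPS_FILE [SEED]
random_regions() {
    awk -v n="$1" -v size="$2" -v seed="${5:-}" '
        BEGIN {
            FS = OFS = "\t"
            if (seed != "") srand(seed); else srand()
        }
        # First file: chromosome name and length.
        FNR == NR { name[++nchr] = $1; len[$1] = $2; total += $2; next }
        # Second file: gap intervals per chromosome.
        { g = ++ngap[$1]; gstart[$1, g] = $2; gstop[$1, g] = $3 }
        END {
            made = 0; attempts = 0; limit = 1000 * n
            # The attempt limit stops the loop if almost nothing is accessible.
            while (made < n && attempts++ < limit) {
                # Pick a chromosome with probability proportional to its length.
                r = rand() * total; acc = 0
                for (i = 1; i <= nchr; i++) {
                    acc += len[name[i]]
                    if (r < acc) break
                }
                if (i > nchr) i = nchr
                chr = name[i]
                if (len[chr] < size) continue
                # Pick a start uniformly so the whole region fits on the chromosome.
                start = int(rand() * (len[chr] - size + 1))
                stop = start + size
                # Gap check: reject the region if it overlaps any gap.
                ok = 1
                for (g = 1; g <= ngap[chr]; g++)
                    if (start < gstop[chr, g] && stop > gstart[chr, g]) { ok = 0; break }
                if (ok) { print chr, start, stop; made++ }
            }
        }
    ' "$3" "$4"
}
```

With files named sizes.txt and gaps.bed (both hypothetical), running random_regions 100 1000 sizes.txt gaps.bed > perm_1.bed would write one set of 100 regions of 1 kb each; calling it in a loop with different output names would give one .bed file per permutation.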

Thursday, November 7, 2013

How to Fully Utilize All Cores of a UNIX Compute Node

Parallelizing tasks can drastically reduce a program's run time. One way to do this is to make sure your code uses all available cores of a processor, which means writing it so that several tasks run simultaneously in the background. In a shell script this is done by adding an ampersand (&) to the end of a command. If the number of background tasks running equals the number of cores on the compute node, you are fully utilizing the resource. The final necessary piece is the wait command, which tells the shell to wait until all background tasks have completed before moving on to the next line of code. The wait command is also how you keep the number of submitted background tasks from exceeding the number of processor cores; submitting more tasks than cores will likely overwhelm the processor. The idea is to submit a number of tasks equal to the number of cores, use wait to let those jobs finish, and then submit the next batch. Here is some example code I put together to utilize all 8 cores of a node when running 1,000 permutations of a program:
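A minimal sketch of this batch-and-wait pattern follows. The function name run_in_batches and the script name permutation.sh are hypothetical placeholders, not the names used in the original code; the command passed in is assumed to accept the permutation number as its last argument.

```shell
#!/bin/bash
# Hypothetical helper: run COMMAND once per task number, CORES tasks at a time.
# Usage: run_in_batches TOTAL CORES COMMAND [ARGS...]
run_in_batches() {
    local total="$1" cores="$2" i=1
    shift 2
    while [ "$i" -le "$total" ]; do
        # The trailing & sends this task to the background.
        "$@" "$i" &
        # Once a full batch has been launched, wait for all of it to finish.
        if [ $((i % cores)) -eq 0 ]; then
            wait
        fi
        i=$((i + 1))
    done
    # Wait for the final batch, which may be smaller than CORES.
    wait
}

# Example call (permutation.sh is a placeholder for the real program):
#   run_in_batches 1000 8 ./permutation.sh
```

Because each batch waits for its slowest task, some cores sit idle near the end of a batch when run times vary. That is the trade-off for the simplicity of using nothing more than & and wait.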

This will run a script with the following commands:

Of course, this could be parallelized further by running permutations simultaneously on different compute nodes to cut run time even more. Hope this example helps you fully utilize your compute nodes and speed up your processing.
