Tuesday, June 10, 2014

Easily Parse XML Files in Python

XML is not the most popular format for storing data, but some projects do use it. The TCGA project, for example, keeps its sample and analysis records in XML. Extracting data from these files can be difficult without the right tools. In some cases a simple grep command will do, but even then the output usually needs to be cleaned. Luckily, Python has packages that aid in parsing XML files so that the desired data can be extracted. The library I found most useful was ElementTree. Combined with the urllib2 package (Python 2), it lets you download an XML file from the internet, parse it, and extract the information stored within it. Below is an example Python script that downloads an XML description file from TCGA and places the text of each element into a Python dictionary keyed by tag name. This could easily be modified for your particular application, but it at least provides a simple backbone for building other scripts.
# Import modules (Python 2)
import urllib2
import xml.etree.ElementTree as ET

# Get the online XML file
url = "https://cghub.ucsc.edu/cghub/metadata/analysisDetail/a8f16339-4802-440c-81b6-d7a6635e604b"
request = urllib2.Request(url, headers={"Accept": "application/xml"})
u = urllib2.urlopen(request)

# Parse the XML and get the root element
tree = ET.parse(u)
root = tree.getroot()

# Store each element's text by tag (a repeated tag keeps only its last value)
fields = {}
for i in root.iter():
    if i.text is not None:
        fields[i.tag] = i.text.strip()
    else:
        fields[i.tag] = ""

# Print the entries sorted case-insensitively by tag
for key in sorted(fields.keys(), key=lambda v: v.upper()):
    print key + ":" + fields[key]
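
For readers on Python 3, where urllib2 was replaced by urllib.request and print is a function, the same idea might look roughly like the sketch below. The names `fetch_root`, `flatten`, and `SAMPLE_XML` are my own and purely illustrative, and the inline XML is a small hypothetical stand-in (not the real CGHub response) so the example runs without network access.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a downloaded document; NOT the real CGHub response.
SAMPLE_XML = """<ResultSet>
  <Hits>1</Hits>
  <Result>
    <analysis_id>a8f16339-4802-440c-81b6-d7a6635e604b</analysis_id>
    <state>live</state>
    <reason/>
  </Result>
</ResultSet>"""


def fetch_root(url):
    """Download an XML document and return its root element (not called below)."""
    request = urllib.request.Request(url, headers={"Accept": "application/xml"})
    with urllib.request.urlopen(request) as response:
        return ET.parse(response).getroot()


def flatten(root):
    """Map each element's tag to its stripped text ("" when there is none)."""
    fields = {}
    for element in root.iter():
        fields[element.tag] = (element.text or "").strip()
    return fields


# Parse the inline sample instead of hitting the network.
fields = flatten(ET.fromstring(SAMPLE_XML))

# Print the entries sorted case-insensitively by tag.
for key in sorted(fields, key=str.upper):
    print(key + ":" + fields[key])
```

To use it against a live server, pass the result of `fetch_root(url)` to `flatten` instead of parsing the sample string.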

Below is the expected output:
aliquot_id:429b50eb-316f-459c-bc3a-0aca6e6dba46
analysis_data_uri:https://cghub.ucsc.edu/cghub/data/analysis/download/a8f16339-4802-440c-81b6-d7a6635e604b
analysis_full_uri:https://cghub.ucsc.edu/cghub/metadata/analysisFull/a8f16339-4802-440c-81b6-d7a6635e604b
analysis_id:a8f16339-4802-440c-81b6-d7a6635e604b
analysis_submission_uri:https://cghub.ucsc.edu/cghub/metadata/analysisSubmission/a8f16339-4802-440c-81b6-d7a6635e604b
analyte_code:R
center_name:UNC-LCCC
checksum:9542ae33ae43c286cc158fd946227e2d
disease_abbr:BRCA
downloadable_file_count:2
downloadable_file_size:6.73
file:
filename:UNCID_1126837.429b50eb-316f-459c-bc3a-0aca6e6dba46.sorted_genome_alignments.bam.bai
files:
filesize:6094992
Hits:1
last_modified:2013-05-16T20:53:36Z
legacy_sample_id:TCGA-BH-A18V-01A-11R-A12D-07
library_strategy:RNA-Seq
live:1
participant_id:6b960b58-28e1-41c6-bd6e-7e669c6aa4ef
platform:ILLUMINA
published_date:2012-05-29T23:55:56Z
Query:analysis_id:a8f16339-4802-440c-81b6-d7a6635e604b
reason:
refassem_short_name:HG19
Result:
ResultSet:
ResultSummary:
sample_accession:
sample_id:2923c735-34a1-4db6-a248-1c5de2241dd4
sample_type:01
state:live
state_count:
study:phs000178
tss_id:BH
upload_date:2012-05-12T07:15:18Z
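
Note that the output reports a downloadable_file_count of 2 but lists only one filename. Because the dictionary is keyed by tag name, a tag that appears more than once keeps only its last value. If you need every value, one possible variation is to collect the text of each tag into a list. The sketch below assumes Python 3; the `collect` helper and the sample XML (including its file names and sizes) are made up for illustration and are not taken from the TCGA file.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Hypothetical fragment with a repeated <file> block; values are invented.
SAMPLE_XML = """<files>
  <file><filename>sample.bam</filename><filesize>100</filesize></file>
  <file><filename>sample.bam.bai</filename><filesize>10</filesize></file>
</files>"""


def collect(root):
    """Map each tag to a list of the stripped text of every element with that tag."""
    values = defaultdict(list)
    for element in root.iter():
        values[element.tag].append((element.text or "").strip())
    return dict(values)


values = collect(ET.fromstring(SAMPLE_XML))
print(values["filename"])
print(values["filesize"])
```

The trade-off is that every value becomes a list, even for tags that occur once, so downstream code has to index into it.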