Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin

Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk http://www.bioinf.org.uk/

Aims and objectives  Refresh the structure of a XML (or XHTML) document  Know problems in reading and writing XML  Understand the requirements of XML parsers and the two main types  Know how to write code using the DOM parser  PRACTICAL: write a script to read XML

An XML refresher! x-ray 1.8 0.20 x-ray 1.8 0.20 Tags: paired opening and closing tags May contain data and/or other (nested) tags un-paired tags use special syntax Attributes: optional – contained within the opening tag

Writing XML Writing XML is straightforward  Generate XML from a Perl script using print() statements.  However:  tags correctly nested  quote marks correctly paired  international character sets

Reading XML As simple or complex as you wish!  Full control over XML:  simple pattern may suffice  Otherwise, may be dangerous  may rely on un-guaranteed formatting x-ray 1.8 0.20 x-ray 1.8 0.20 <mutation res='L24’ native='ALA’ subs='ARG'/>

XML Parsers  Clear rules for data boundaries and hierarchy  Predictable; unambiguous  Parser translates XML into  stream of events  complex data object

XML Parsers  Different data sources of data  files  character strings  remote references  different character encodings  standard Latin  Japanese  checking for well-formedness errors Good parser will handle:

XML Parsers  Read stream of characters  differentiate markup and data  Optionally replace entity references  (e.g. < with <)  Assemble complete document  disparate (perhaps remote) sources  Report syntax and validation errors  Pass data to client program

XML Parsers  If XML has no syntax errors it is 'well formed'  With a DTD, a validating parser will check it matches: 'valid'

XML Parsers  Writing a good parser is a lot of work!  A lot of testing needed  Fortunately, many parsers available

Getting data to your program  Parser can generate 'events'  Tags are converted into events  Events triggered in your program as the document is read  Parser acts as a pipeline converting XML into processed chunks of data sent to your program:  an 'event stream'

Getting data to your program OR…  XML converted into a tree structure  Reflects organization of the XML  Whole document read into memory before your program gets access

Pros and cons  In the parser, everything is likely to be event-driven  tree-based parsers create a data structure from the event stream Data structure More convenient Can access data in any order Code usually simpler  May be impossible to handle very large files  Need more processor time  Need more memory Event stream Faster to access limited data Use less memory  Parser loses data at the next event  More complex code

SAX and DOM de facto standard APIs for XML parsing  SAX (Simple API for XML)  event-stream API  originally for Java, but now for several programming languages (including Perl)  development promoted by Peter Murray Rust, 1997-8  DOM (Document Object Model)  W3C standard tree-based parser  platform- and language-neutral  allows update of document content and structure as well as reading

Perl XML parsers Many parsers available  Differ in three major ways:  parsing style (event driven or data structure)  'standards-completeness’  speed (implementation in C or pure Perl)

Perl XML parsers XML::Simple  Very easy to use  Designed for simple applications  Can’t handle 'mixed content'  tags containing both data and other tags This is mixed content

Perl XML parsers XML::Parser  Oldest Perl XML parser  Reasonably fast and flexibile  Not very standards-compliant.  Is a wrapper to 'expat’  probably the first C XML parser written by James Clark

XML::Parser use XML::Parser; my $xmlfile = shift @ARGV; # the file to parse # initialize parser object and parse the string my $parser = XML::Parser->new( ErrorContext => 2 ); eval { $parser->parsefile( $xmlfile ); }; # report error or success if( $@ ) { $@ =~ s/at \/.*?$//s; # remove module line number print STDERR "\nERROR in '$xmlfile':\n$@\n"; } else { print STDERR "'$xmlfile' is well-formed\n"; } Simple example - check well-formedness Eval used so parser doesn’t cause program to exit Remember 2 lines before an error $@ stores the error

Perl XML parsers XML::DOM  Implements W3C DOM Level 1  Built on top of XML::Parser  Very good fast, stable and complete  Limited extended functionality XML::SAX  Implements SAX2 wrapper to Expat  Fast, stable and complete

Perl XML parsers XML::LibXML  Wrapper around GNOME libxml2  Very fast, complete and stable  Validating/non-validating  DOM and SAX support

Perl XML parsers XML::Twig  DOM-like parser, BUT  Allows you to define elements which can be parsed as discrete units  'twigs' (small branches of a tree)

Perl XML parsers Several others  More specialized, adding…  XPath (to select data from XML document)  re-formatting (XSLT or other methods) ...

XML::DOM  DOM is a standard API  once learned moving to a different language is straightforward  moving between implementations also easy  Suppose we want to extract some data from an XML file...

XML::DOM cat fruit fly We want: cat (Felix domesticus) not endangered fruit fly (Drosophila melanogaster) not endangered

#!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Import XML::DOM module obtain filename initialize a parser parse the file Nested loops correspond to nested tags Returns each element matching ‘species’ Look at each species element in turn $species contains content and descendant elements Can now treat $species in the same way as $doc

#!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); $common_name contains content and any descendant elements Here there is only one element per element, but parser can't know that. Have to specify that we want the first (and only) element The element contains only one child which is data. Obtain first (and only) child element using ->getFirstChild Extract the text from the element object using ->getNodeValue Returns each element matching ‘common-name’

->getElementsByTagName returns an array Here we work through the array: foreach $species ($doc->getElementsByTagName('species')) { } $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; Here we obtain an object and use the ->item() method to obtain the first item: @common_names = $species->getElementsByTagName('common-name'); $cname = $common_names[0]->getFirstChild->getNodeValue; Alternative is to access an individual array element:

$common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; cat There could have been more than one element within this element Obtain the actual text rather than an element object

#!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Attributes much simpler! Can only contain text (no nested elements) Simply specify the attribute Attributes

cat $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); Extract the elements from this element DOM parser can't know there is only one element per species. Extract the first (and only) one. This is an empty element, there are no child elements so we don’t need ->getFirstChild Contains only a ‘ status’ attribute, extract its value with ->getAttribute()

#!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Print the extracted information Loop back to next element Clean up and free memory

XML::DOM Note  Not necessary to use variable names that match the tags, but it is a very good idea!  There are many many more functions, but this set covers most needs

Writing XML with XML::DOM

#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Import XML::DOM Initialize data Create an XML document object Utility method to print XML header

#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a ‘root’ element for our data Loop through each of the species Create a element Set its ‘ name ’ attribute Join the element to the parent element

#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a element Create a text node containing the specified text Join this text node as a child of the element Join the element as a child of

#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a element Set the ‘status’ attribute Join the element as a child of

#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Loop back to handle the next species Finally print the resulting data structure

cat fruit fly

Summary - reading XML Create a parser $parser = XML::DOM::Parser->new(); Parse a file $doc = $parser->parsefile( ' filename ' ); Extract all elements matching tag-name $element_set = $doc->getElementsByTagName( ' tag-name') Extract first element of a set $element = $element_set->item(0); Extract first child of an element $child_element = $element->getFirstChild; Extract text from an element $text = $element->getNodeValue; Get the value of a tag’s attribute $text = $element->getAttribute( ' attribute-name');

Summary - writing XML Create an XML document structure $doc = XML::DOM::Document->new; Utility to create an XML header $header = $doc->createXMLDecl('1.0'); Create a tagged element $element = $doc->createElement('tag-name'); Set an attribute for an element $element->setAttribute('attrib-name', 'value'); Append a child element to an element $parent_element->appendChild($child_element); Create a text node element $element = $doc->createTextNode('text'); Print a document structure as a string print $root_element->toString;

Summary  Two types of parser  Event-driven  Data structure  Writing a good parser is difficult!  Many parsers available  XML::DOM for reading and writing data

Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin

Similar presentations

Presentation on theme: "Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin

Similar presentations

Presentation on theme: "Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin"— Presentation transcript:

Similar presentations

About project

Feedback