Download presentation
Presentation is loading. Please wait.
Published byStephen Joseph Modified over 9 years ago
1
Perl/XML::DOM - reading and writing XML from Perl Dr. Andrew C.R. Martin martin@biochem.ucl.ac.uk http://www.bioinf.org.uk/
2
Aims and objectives Refresh the structure of a XML (or XHTML) document Know problems in reading and writing XML Understand the requirements of XML parsers and the two main types Know how to write code using the DOM parser PRACTICAL: write a script to read XML
3
An XML refresher! x-ray 1.8 0.20 x-ray 1.8 0.20 Tags: paired opening and closing tags May contain data and/or other (nested) tags un-paired tags use special syntax Attributes: optional – contained within the opening tag
4
Writing XML Writing XML is straightforward Generate XML from a Perl script using print() statements. However: tags correctly nested quote marks correctly paired international character sets
5
Reading XML As simple or complex as you wish! Full control over XML: simple pattern may suffice Otherwise, may be dangerous may rely on un-guaranteed formatting x-ray 1.8 0.20 x-ray 1.8 0.20 <mutation res='L24’ native='ALA’ subs='ARG'/>
6
XML Parsers Clear rules for data boundaries and hierarchy Predictable; unambiguous Parser translates XML into stream of events complex data object
7
XML Parsers Different data sources of data files character strings remote references different character encodings standard Latin Japanese checking for well-formedness errors Good parser will handle:
8
XML Parsers Read stream of characters differentiate markup and data Optionally replace entity references (e.g. < with <) Assemble complete document disparate (perhaps remote) sources Report syntax and validation errors Pass data to client program
9
XML Parsers If XML has no syntax errors it is 'well formed' With a DTD, a validating parser will check it matches: 'valid'
10
XML Parsers Writing a good parser is a lot of work! A lot of testing needed Fortunately, many parsers available
11
Getting data to your program Parser can generate 'events' Tags are converted into events Events triggered in your program as the document is read Parser acts as a pipeline converting XML into processed chunks of data sent to your program: an 'event stream'
12
Getting data to your program OR… XML converted into a tree structure Reflects organization of the XML Whole document read into memory before your program gets access
13
Pros and cons In the parser, everything is likely to be event-driven tree-based parsers create a data structure from the event stream Data structure More convenient Can access data in any order Code usually simpler May be impossible to handle very large files Need more processor time Need more memory Event stream Faster to access limited data Use less memory Parser loses data at the next event More complex code
14
SAX and DOM de facto standard APIs for XML parsing SAX (Simple API for XML) event-stream API originally for Java, but now for several programming languages (including Perl) development promoted by Peter Murray Rust, 1997-8 DOM (Document Object Model) W3C standard tree-based parser platform- and language-neutral allows update of document content and structure as well as reading
15
Perl XML parsers Many parsers available Differ in three major ways: parsing style (event driven or data structure) 'standards-completeness’ speed (implementation in C or pure Perl)
16
Perl XML parsers XML::Simple Very easy to use Designed for simple applications Can’t handle 'mixed content' tags containing both data and other tags This is mixed content
17
Perl XML parsers XML::Parser Oldest Perl XML parser Reasonably fast and flexibile Not very standards-compliant. Is a wrapper to 'expat’ probably the first C XML parser written by James Clark
18
XML::Parser use XML::Parser; my $xmlfile = shift @ARGV; # the file to parse # initialize parser object and parse the string my $parser = XML::Parser->new( ErrorContext => 2 ); eval { $parser->parsefile( $xmlfile ); }; # report error or success if( $@ ) { $@ =~ s/at \/.*?$//s; # remove module line number print STDERR "\nERROR in '$xmlfile':\n$@\n"; } else { print STDERR "'$xmlfile' is well-formed\n"; } Simple example - check well-formedness Eval used so parser doesn’t cause program to exit Remember 2 lines before an error $@ stores the error
19
Perl XML parsers XML::DOM Implements W3C DOM Level 1 Built on top of XML::Parser Very good fast, stable and complete Limited extended functionality XML::SAX Implements SAX2 wrapper to Expat Fast, stable and complete
20
Perl XML parsers XML::LibXML Wrapper around GNOME libxml2 Very fast, complete and stable Validating/non-validating DOM and SAX support
21
Perl XML parsers XML::Twig DOM-like parser, BUT Allows you to define elements which can be parsed as discrete units 'twigs' (small branches of a tree)
22
Perl XML parsers Several others More specialized, adding… XPath (to select data from XML document) re-formatting (XSLT or other methods) ...
23
XML::DOM DOM is a standard API once learned moving to a different language is straightforward moving between implementations also easy Suppose we want to extract some data from an XML file...
24
XML::DOM cat fruit fly We want: cat (Felix domesticus) not endangered fruit fly (Drosophila melanogaster) not endangered
25
#!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Import XML::DOM module obtain filename initialize a parser parse the file Nested loops correspond to nested tags Returns each element matching ‘species’ Look at each species element in turn $species contains content and descendant elements Can now treat $species in the same way as $doc
26
#!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); $common_name contains content and any descendant elements Here there is only one element per element, but parser can't know that. Have to specify that we want the first (and only) element The element contains only one child which is data. Obtain first (and only) child element using ->getFirstChild Extract the text from the element object using ->getNodeValue Returns each element matching ‘common-name’
27
->getElementsByTagName returns an array Here we work through the array: foreach $species ($doc->getElementsByTagName('species')) { } $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; Here we obtain an object and use the ->item() method to obtain the first item: @common_names = $species->getElementsByTagName('common-name'); $cname = $common_names[0]->getFirstChild->getNodeValue; Alternative is to access an individual array element:
28
$common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; cat There could have been more than one element within this element Obtain the actual text rather than an element object
29
#!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Attributes much simpler! Can only contain text (no nested elements) Simply specify the attribute Attributes
30
cat $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); Extract the elements from this element DOM parser can't know there is only one element per species. Extract the first (and only) one. This is an empty element, there are no child elements so we don’t need ->getFirstChild Contains only a ‘ status’ attribute, extract its value with ->getAttribute()
31
#!/usr/bin/perl use XML::DOM; $file = shift @ARGV; $parser = XML::DOM::Parser->new(); $doc = $parser->parsefile($file); foreach $species ($doc->getElementsByTagName('species')) { $common_name = $species->getElementsByTagName('common-name'); $cname = $common_name->item(0)->getFirstChild->getNodeValue; $name = $species->getAttribute('name'); $conservation = $species->getElementsByTagName('conservation'); $status = $conservation->item(0)->getAttribute('status'); print "$cname ($name) $status\n"; } $doc->dispose(); Print the extracted information Loop back to next element Clean up and free memory
32
XML::DOM Note Not necessary to use variable names that match the tags, but it is a very good idea! There are many many more functions, but this set covers most needs
33
Writing XML with XML::DOM
34
#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Import XML::DOM Initialize data Create an XML document object Utility method to print XML header
35
#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a ‘root’ element for our data Loop through each of the species Create a element Set its ‘ name ’ attribute Join the element to the parent element
37
#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a element Create a text node containing the specified text Join this text node as a child of the element Join the element as a child of
38
cat
39
#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Create a element Set the ‘status’ attribute Join the element as a child of
40
cat
41
#!/usr/bin/perl use XML::DOM; $nspecies = 2; @names = ('Felix domesticus', 'Drosophila melanogaster'); @commonNames = ('cat', 'fruit fly'); @consStatus = ('not endangered', 'not endangered'); $doc = XML::DOM::Document->new; $xml_pi = $doc->createXMLDecl ('1.0'); print $xml_pi->toString; $root = $doc->createElement('data'); for($i=0; $i<$nspecies; $i++) { $species = $doc->createElement('species'); $species->setAttribute('name', $names[$i]); $root->appendChild($species); $cname = $doc->createElement('common-name'); $text = $doc->createTextNode($commonNames[$i]); $cname->appendChild($text); $species->appendChild($cname); $cons = $doc->createElement('conservation'); $cons->setAttribute('status', $consStatus[$i]); $species->appendChild($cons); } print $root->toString; Loop back to handle the next species Finally print the resulting data structure
42
cat fruit fly
43
Summary - reading XML Create a parser $parser = XML::DOM::Parser->new(); Parse a file $doc = $parser->parsefile( ' filename ' ); Extract all elements matching tag-name $element_set = $doc->getElementsByTagName( ' tag-name') Extract first element of a set $element = $element_set->item(0); Extract first child of an element $child_element = $element->getFirstChild; Extract text from an element $text = $element->getNodeValue; Get the value of a tag’s attribute $text = $element->getAttribute( ' attribute-name');
44
Summary - writing XML Create an XML document structure $doc = XML::DOM::Document->new; Utility to create an XML header $header = $doc->createXMLDecl('1.0'); Create a tagged element $element = $doc->createElement('tag-name'); Set an attribute for an element $element->setAttribute('attrib-name', 'value'); Append a child element to an element $parent_element->appendChild($child_element); Create a text node element $element = $doc->createTextNode('text'); Print a document structure as a string print $root_element->toString;
45
Summary Two types of parser Event-driven Data structure Writing a good parser is difficult! Many parsers available XML::DOM for reading and writing data
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.