The Simple API for XML (SAX) Part I


2 1 The Simple API for XML (SAX) Part I
©Copyright 2003-2004. These slides are based on material from the upcoming book, “XML and Bioinformatics” (Springer-Verlag) by Ethan Cerami. Please email cerami@cs.nyu.edu for permission to copy.

3 2 Road Map
SAX Overview
–What is SAX?
–Advantages/Disadvantages
Basic SAX Examples
–About Xerces 2 Parser
–XMLReader Interface
–ContentHandler Interface
–Extending the SAX Default Handler
Checking for Well-Formedness

4 3 SAX Overview

5 4 Introduction to SAX
The Simple API for XML (SAX) is a standard, event-based interface for parsing XML documents.
Versions:
–SAX 1.0: original standard
–SAX 2.0: current standard
SAX is a de facto standard, supported by most XML parsers today. Unlike DOM, it is not an official W3C standard.
SAX was originally built explicitly for Java, but implementations now exist for other languages, including Perl and Python.

6 5 SAX Interface
At its core, SAX is simply a series of interfaces that are implemented by an XML parser.
Because different parsers implement the same SAX interfaces, you can easily swap one parser for another.

7 6 SAX Interface
[Diagram: Java App, SAX Interface, Xerces Parser / Crimson Parser / Ælfred Parser, XML Document]
Implementation details are hidden behind the SAX interface. You can therefore swap parsers in and out. This is the same idea as JDBC.

8 7 Advantages/Disadvantages
Advantages
–Implemented by just about every XML parser
–Fast performance
–Low memory overhead
Disadvantages
–Does not provide an easy-to-navigate XML tree like DOM or JDOM.
–Does not provide an easy mechanism for creating or modifying XML documents.
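Because SAX hands you events rather than a tree, any context you need (for example, which elements are currently open) has to be tracked by your own code. The following is a minimal sketch of that idea, not code from the slides: PathTracker and pathsOf are hypothetical names, and the parser is obtained through the JDK's JAXP SAXParserFactory (an assumption, so the example runs without Xerces on the classpath).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: records the path of every element as it opens. */
class PathTracker extends DefaultHandler {
    // Currently open elements, outermost first.
    private final Deque<String> open = new ArrayDeque<>();
    final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        open.addLast(qName);
        paths.add("/" + String.join("/", open));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        open.removeLast();
    }

    /** Parses the given XML text and returns one path per start-element event. */
    static List<String> pathsOf(String xml) throws Exception {
        // Assumption: JAXP default parser instead of an explicit Xerces class name.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        PathTracker tracker = new PathTracker();
        reader.setContentHandler(tracker);
        reader.parse(new InputSource(new StringReader(xml)));
        return tracker.paths;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DASDNA><SEQUENCE><DNA>taat</DNA></SEQUENCE></DASDNA>";
        pathsOf(xml).forEach(System.out::println);
    }
}
```

A DOM tree would give you parent and child links for free; here the stack costs only as much memory as the current nesting depth.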

9 8 Basic SAX Example

10 9 Xerces 2 Parser
All of our examples will use the Xerces 2 Parser.
Xerces 2 is the latest open source XML parser from the Apache XML Group.
The distribution is available at: http://xml.apache.org/xerces2-j/
The distribution includes two JAR files:
–xmlParserAPIs.jar: includes the relevant XML APIs, including DOM Level 2, SAX 2.0, and JAXP 1.2.
–xercesImpl.jar: includes the Xerces implementation of the XML APIs.

11 10 SAXBasic.java
The first example illustrates the simplest SAX functionality:
–Creates an XML parser object
–Parses a document specified on the command line
–Receives SAX events and prints them to the console.
First, let’s examine a sample XML document. Then we will view the output when this document is parsed.

12 11 Sample XML Document
<!DOCTYPE DASDNA SYSTEM 'http://servlet.sanger.ac.uk:8080/das/dasdna.dtd' >
<DASDNA>
  <SEQUENCE>
    <DNA>
    taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
    </DNA>
  </SEQUENCE>
  <SEQUENCE>
    <DNA>
    taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
    </DNA>
  </SEQUENCE>
</DASDNA>
Document contains two sequences of DNA.

13 12 Sample Output
Start Document
Start Element: DASDNA
Start Element: SEQUENCE
Start Element: DNA
Characters: taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
Characters:
End Element: DNA
End Element: SEQUENCE
Start Element: SEQUENCE
Start Element: DNA
Characters: taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
Characters:
End Element: DNA
End Element: SEQUENCE
End Element: DASDNA
End Document

14 13
package com.oreilly.bioxml.sax;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
import java.io.IOException;

/**
 * Basic SAX Example.
 * Illustrates basic implementation of the SAX Content Handler.
 */
public class SAXBasic implements ContentHandler {

    public void startDocument() throws SAXException {
        System.out.println("Start Document");
    }

15 14
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String str = new String(ch, start, length);
        System.out.println("Characters: " + str);
    }

    public void endDocument() throws SAXException {
        System.out.println("End Document");
    }

    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        System.out.println("End Element: " + localName);
    }

    public void endPrefixMapping(String prefix) throws SAXException {
        // No-op
    }

16 15
    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException {
        // No-op
    }

    public void processingInstruction(String target, String data)
            throws SAXException {
        // No-op
    }

    public void setDocumentLocator(Locator locator) {
        // No-op
    }

    public void skippedEntity(String name) throws SAXException {
        // No-op
    }

    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        System.out.println("Start Element: " + localName);
    }

17 16
    public void startPrefixMapping(String prefix, String uri)
            throws SAXException {
        // No-op
    }

    /**
     * Prints Command Line Usage
     */
    private static void printUsage() {
        System.out.println("usage: SAXBasic xml-file");
        System.exit(0);
    }

    /**
     * Main Method
     * Options for instantiating XMLReader Implementation:
     * 1) XMLReader parser = XMLReaderFactory.createXMLReader();
     * 2) XMLReader parser = XMLReaderFactory.createXMLReader
     *        ("org.apache.xerces.parsers.SAXParser");
     * 3) XMLReader parser = new org.apache.xerces.parsers.SAXParser();
     */

18 17
    public static void main(String[] args) {
        if (args.length != 1) {
            printUsage();
        }
        try {
            SAXBasic saxHandler = new SAXBasic();
            XMLReader parser = XMLReaderFactory.createXMLReader
                    ("org.apache.xerces.parsers.SAXParser");
            parser.setContentHandler(saxHandler);
            parser.parse(args[0]);
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

19 18 Main SAX Interfaces
SAX provides two main interfaces:
–XMLReader: implemented by the XML parser.
–ContentHandler: implemented by your application in order to receive SAX events.
Each time an event occurs (e.g. start element, end element), the XML parser calls the ContentHandler and informs it of the specific event.

20 19 XMLReader Interface
You have three main options for obtaining an XMLReader instance.
Option 1: Use the SAX XMLReaderFactory class with no arguments:
    XMLReader parser = XMLReaderFactory.createXMLReader();
The factory will attempt to instantiate an XMLReader based on system defaults.

21 20 Option 1: Continued
You can specify a system property from the java command line via the -D option.
For example, the following line invokes the SAXBasic class and specifies the Xerces2 XML Parser:
    javaw.exe -Dorg.xml.sax.driver=org.apache.xerces.parsers.SAXParser com.oreilly.bioxml.sax.SAXBasic sample.xml
The advantage of using system properties is that you can change parsers at any time without recompiling any code.
If the factory is unable to determine any valid system defaults, it will throw a SAXException with a specific message: "System property org.xml.sax.driver not specified."

22 21 Using Different Parsers
The specific class that implements the XMLReader interface varies from parser to parser. For example:
–For the Xerces XML Parser, it's org.apache.xerces.parsers.SAXParser.
–For the Crimson XML Parser, it's org.apache.crimson.parser.XMLReaderImpl.

23 22 Option 2
Call the XMLReaderFactory with a String argument indicating the class name that implements the XMLReader interface. For example:
    XMLReader parser = XMLReaderFactory.createXMLReader
            ("org.apache.xerces.parsers.SAXParser");

24 23 Option 3
Instantiate the XMLReader implementation directly. For example:
    XMLReader parser = new org.apache.xerces.parsers.SAXParser();
This option works fine. However, note that if you switch parsers, you will need to change and recompile your code.
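One way to combine the options above is to try a list of implementation class names (Option 2) and fall back to a default if none can be loaded. This is only a sketch under stated assumptions: ReaderChooser and choose are hypothetical names, and the fallback uses the JDK's JAXP SAXParserFactory, which is not one of the three options on the slides. XMLReaderFactory is deprecated in newer JDKs, hence the suppressed warning.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

/** Hypothetical helper: picks the first XMLReader implementation that can be loaded. */
class ReaderChooser {

    @SuppressWarnings("deprecation") // XMLReaderFactory is deprecated in newer JDKs
    static XMLReader choose(String... classNames)
            throws SAXException, ParserConfigurationException {
        for (String name : classNames) {
            try {
                // Option 2: instantiate by class name; throws SAXException if unavailable.
                return XMLReaderFactory.createXMLReader(name);
            } catch (SAXException e) {
                System.err.println("Could not load " + name + ": " + e.getMessage());
            }
        }
        // Assumption: fall back to whatever parser JAXP provides by default.
        return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    }

    public static void main(String[] args) throws Exception {
        XMLReader parser = choose("org.apache.xerces.parsers.SAXParser",
                                  "org.apache.crimson.parser.XMLReaderImpl");
        System.out.println("Using " + parser.getClass().getName());
    }
}
```

Because the class names are plain strings, they could just as well come from a configuration file or a system property, which keeps the no-recompile benefit of Option 1.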

25 24 Using an XMLReader
Once you have an XMLReader instance, you can call the parse() method to start parsing:
    XMLReader parser = XMLReaderFactory.createXMLReader
            ("org.apache.xerces.parsers.SAXParser");
    parser.parse("simple.xml");
You can pass a local file name or an absolute URL to the parse() method.
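Besides parse(String), the XMLReader interface also defines parse(InputSource), which is handy when the XML is already in memory. A minimal sketch, assuming the JDK's default JAXP parser rather than Xerces; InMemoryParse and elementNames are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: parses XML held in a String. */
class InMemoryParse {

    /** Returns the names of all elements in document order. */
    static List<String> elementNames(String xml) throws Exception {
        // Assumption: JAXP default parser instead of an explicit Xerces class name.
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        List<String> names = new ArrayList<>();
        parser.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        });
        // An InputSource can wrap a character stream, so no file or URL is needed.
        parser.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<DASDNA><SEQUENCE><DNA/></SEQUENCE></DASDNA>"));
    }
}
```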

26 25 ContentHandler Interface
The ContentHandler receives all SAX events.
In total, there are 11 defined events. The most important events/methods are defined below:
–characters: Receive notification of character data.
–endDocument: Receive notification of the end of a document.
–endElement: Receive notification of the end of an element.
Continued…

27 26 Content Handler API (cont)
–ignorableWhitespace: Receive notification of ignorable whitespace in element content.
–setDocumentLocator: Receive an object for locating the origin of SAX document events.
–startDocument: Receive notification of the beginning of a document.
–startElement: Receive notification of the beginning of an element.
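As an illustration of setDocumentLocator, a handler can keep the Locator it receives and ask it for the current line number whenever an event arrives. The sketch below is an assumption-laden example, not code from the slides: LineReporter and report are hypothetical names, the JDK's default JAXP parser stands in for Xerces, and line numbers reported by a parser are approximate by specification.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: notes the line on which each element starts. */
class LineReporter extends DefaultHandler {
    private Locator locator;                 // supplied by the parser, may stay null
    final List<String> events = new ArrayList<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Parsers are not required to supply a Locator, so guard against null.
        int line = (locator != null) ? locator.getLineNumber() : -1;
        events.add(qName + " at line " + line);
    }

    static List<String> report(String xml) throws Exception {
        // Assumption: JAXP default parser instead of an explicit Xerces class name.
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LineReporter handler = new LineReporter();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        report("<DASDNA>\n<SEQUENCE/>\n</DASDNA>").forEach(System.out::println);
    }
}
```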

28 27 Character “Chunking”
Suppose you have the following piece of XML:
    <DNA>taatgcaactaaatccaggcgaagcatttcagcttaaccccg</DNA>
You will receive a start element event, followed by one or more character events.
Parsers are free to split the text across calls to characters() any way they want.
For example, one parser might do the following:
–characters ("t");
–characters ("a");
Another parser might do this:
–characters ("taatgcaactaaatccagg");
–characters ("cgaagcatttcagcttaaccccg");

29 28 Character Chunking
Your application needs to be able to handle either of these strategies.
To do this, it is best to store character data in some kind of buffer, like StringBuffer. For example:
    /**
     * Processes Character Events via Buffer
     */
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String str = new String(ch, start, length);
        currentText.append(str);
    }
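The buffering approach can be sketched as a complete handler: clear the buffer when a DNA element starts, append every chunk, and read the full text only when the element ends. This is a hedged illustration rather than the author's code. DnaCollector and collect are hypothetical names, StringBuilder is used in place of the StringBuffer mentioned above, and the parser comes from the JDK's JAXP SAXParserFactory (an assumption, so no Xerces JAR is needed).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: gathers the complete text of each DNA element. */
class DnaCollector extends DefaultHandler {
    private final StringBuilder currentText = new StringBuilder();
    final List<String> sequences = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("DNA".equals(qName)) {
            currentText.setLength(0);        // start with an empty buffer
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called once or many times per element; just accumulate.
        currentText.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("DNA".equals(qName)) {
            // Only now is the text known to be complete.
            sequences.add(currentText.toString().trim());
        }
    }

    static List<String> collect(String xml) throws Exception {
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DnaCollector handler = new DnaCollector();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.sequences;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DASDNA><SEQUENCE><DNA>taat</DNA></SEQUENCE>"
                   + "<SEQUENCE><DNA>gcaa</DNA></SEQUENCE></DASDNA>";
        System.out.println(collect(xml));
    }
}
```

The result is the same no matter how the parser chooses to chunk the text, which is the point of buffering.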

30 29 Using ContentHandlers
To receive events, you must:
–Implement the ContentHandler interface
–Register your content handler with the XML parser:
    XMLReader parser = XMLReaderFactory.createXMLReader
            ("org.apache.xerces.parsers.SAXParser");
    parser.setContentHandler(saxHandler);
    parser.parse(args[0]);

31 30 ContentHandler Implementation
Here’s a sample implementation that just outputs information about each event:
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String str = new String(ch, start, length);
        System.out.println("Characters: " + str);
    }

    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        System.out.println("End Element: " + localName);
    }

32 31 Using the SAX Default Handler

33 32 SAX Default Handler
In total, an implementation of ContentHandler must implement 11 methods.
You usually don’t need to intercept all 11 of these events.
It is therefore much easier to extend the SAX DefaultHandler.
The DefaultHandler provides no-op implementations of all methods. You can therefore simply override those that you want.
The next few slides provide an example.

34 33
package com.oreilly.bioxml.sax;

import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.SAXException;
import org.xml.sax.Attributes;
import org.xml.sax.XMLReader;
import java.io.IOException;

/**
 * Basic SAX Example.
 * Illustrates extending of DefaultHandler
 */
public class SAXDefaultHandler extends DefaultHandler {

    public void startDocument() throws SAXException {
        System.out.println("Start Document");
    }

35 34
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        String str = new String(ch, start, length);
        System.out.println("Characters: " + str);
    }

    public void endDocument() throws SAXException {
        System.out.println("End Document");
    }

    public void endElement(String namespaceURI, String localName,
            String qName) throws SAXException {
        System.out.println("End Element: " + localName);
    }

    public void startElement(String namespaceURI, String localName,
            String qName, Attributes atts) throws SAXException {
        System.out.println("Start Element: " + localName);
    }
Only override those methods that you need.

36 35
    /**
     * Prints Command Line Usage
     */
    private static void printUsage() {
        System.out.println("usage: SAXDefaultHandler xml-file");
        System.exit(0);
    }

    /**
     * Main Method
     */
    public static void main(String[] args) {
        if (args.length != 1) {
            printUsage();
        }
        try {
            SAXDefaultHandler saxHandler = new SAXDefaultHandler();
            XMLReader parser = XMLReaderFactory.createXMLReader
                    ("org.apache.xerces.parsers.SAXParser");

37 36
            parser.setContentHandler(saxHandler);
            parser.parse(args[0]);
        } catch (SAXException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
By extending DefaultHandler, your code is much more compact and concise.
The output of this program is identical to the first example.
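Overriding a single method is often enough. The sketch below extends DefaultHandler and overrides only startElement, this time using the Attributes argument that the earlier examples ignore. It is an illustration under assumptions: AttributeLister and list are hypothetical names, the id and version attributes in main are made-up sample data, and the parser is the JDK's JAXP default rather than Xerces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: lists every attribute as element@name=value. */
class AttributeLister extends DefaultHandler {
    final List<String> found = new ArrayList<>();

    // The only override; every other event falls through to DefaultHandler's no-ops.
    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        for (int i = 0; i < atts.getLength(); i++) {
            found.add(qName + "@" + atts.getQName(i) + "=" + atts.getValue(i));
        }
    }

    static List<String> list(String xml) throws Exception {
        // Assumption: JAXP default parser instead of an explicit Xerces class name.
        XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        AttributeLister handler = new AttributeLister();
        parser.setContentHandler(handler);
        parser.parse(new InputSource(new StringReader(xml)));
        return handler.found;
    }

    public static void main(String[] args) throws Exception {
        // Attribute names and values here are illustrative only.
        list("<SEQUENCE id=\"1\" version=\"2\"><DNA/></SEQUENCE>")
                .forEach(System.out::println);
    }
}
```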

38 37 Checking for Well-Formedness

39 38 Defaults
By default, the Xerces XML parser (and most other parsers) will check for well-formedness, but will not automatically check for validity.
Suppose we have the document shown on the next slide.

40 39 Sample Document: Not Well-formed
<!DOCTYPE DASDNA SYSTEM 'http://servlet.sanger.ac.uk:8080/das/dasdna.dtd' >
<DASDNA>
  <SEQUENCE>
    <DNA>
    taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
  </SEQUENCE>
  <SEQUENCE>
    <DNA>
    taatgcaactaaatccaggcgaagcatttcagcttaaccccgagacttttg
    </DNA>
  </SEQUENCE>
</DASDNA>
This document is not well-formed, because I deleted one of the end tags.

41 40 Sample Output
Start Document
Start Element: DASDNA
Start Element: SEQUENCE
Start Element: DNA
Characters: taatttctcccattttgtaggttatcacttcactctgttgactttcttttg
Characters:
[Fatal Error] ensemble_dna_error.xml:8:5: The element type "DNA" must be terminated by the matching end-tag "</DNA>".
org.xml.sax.SAXParseException: The element type "DNA" must be terminated by the matching end-tag "</DNA>".
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at com.oreilly.bioxml.sax.SAXBasic.main(SAXBasic.java:101)
This is a fatal error. The parser therefore throws a SAXParseException.

42 41 Try / Catch Clause
    try {
        SAXDefaultHandler saxHandler = new SAXDefaultHandler();
        XMLReader parser = XMLReaderFactory.createXMLReader
                ("org.apache.xerces.parsers.SAXParser");
        parser.setContentHandler(saxHandler);
        parser.parse(args[0]);
    } catch (SAXException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
SAXException: Indicates a fatal parsing error, such as an error in well-formedness.
IOException: Indicates an I/O error, such as a failed network connection.
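Instead of printing a stack trace, you can catch the more specific SAXParseException and report where the problem occurred, using its getLineNumber() and getColumnNumber() methods. A minimal sketch, assuming the JDK's default JAXP parser; WellFormedCheck and check are hypothetical names. The handler is also registered as the ErrorHandler, since DefaultHandler's fatalError() simply rethrows the exception.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: reports the first well-formedness error, if any. */
class WellFormedCheck {

    /** Returns null if the document parsed cleanly, else "line:column: message". */
    static String check(String xml) throws IOException, ParserConfigurationException {
        try {
            // Assumption: JAXP default parser instead of an explicit Xerces class name.
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            DefaultHandler handler = new DefaultHandler();
            parser.setContentHandler(handler);
            parser.setErrorHandler(handler);   // fatalError() rethrows the exception
            parser.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            // Carries the position at which the parser gave up.
            return e.getLineNumber() + ":" + e.getColumnNumber() + ": " + e.getMessage();
        } catch (SAXException e) {
            // Any other SAX failure without position information.
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(check("<DASDNA><DNA>taat</DASDNA>"));
    }
}
```

The SAXParseException catch must come before the SAXException catch, because it is the more specific type.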

43 42 Summary
SAX is a standard, event-based interface for parsing XML documents.
It is a de facto standard, not an official W3C standard.
XML parsers must implement the XMLReader interface.
Applications must implement the ContentHandler interface.
For more concise programs, extend the SAX DefaultHandler.
Make sure to surround calls to parse() with a try/catch clause.

