Presentation on theme: "Performing impossible feats of XML processing with pipelining"— Presentation transcript:
1 Performing impossible feats of XML processing with pipelining. XML Open 2004. Sean McGrath, Propylon.
2 Contents: The pipelining philosophy; Major functional elements of pipelines; Some examples; Pipelining and Grids; Pipelining and Web Services/SOAs; Some anticipated objections (and answers); Some musings; Some technology pointers.
3 What is XML pipelining? It is an architectural framework for developing robust, scalable, manageable XML processing systems, based on proven mechanical manufacturing patterns. Specifically: assembly lines (divide and conquer); component assembly and component re-use.
4 What is XML pipelining and why is it useful? A way of thinking about systems that focuses on XML dataflows rather than object APIs. (This is a critical and non-trivial focus shift for many programmers!) Why? Because pipelining provides a mechanical, inspiration-free, genius-free way of handling the mind-boggling complexity of large XML transformation projects.
5 Pipelining Philosophy: XML is all about complex hierarchical data structures…
6 Pipelining Philosophy: Cars are complex, hierarchical structures. Henry Ford's Model T Ford assembly line, 1914.
7 Pipelining Philosophy: Lunch is a complex, hierarchical structure. Lunch assembly line, NY, 2004.
8 Pipelining Philosophy: We are complex, hierarchical structures.
9 Pipelining philosophy: What have these scenes got in common? The complex construction of cars, tuna melts and tendons is made possible and efficient through: the assembly-line manufacturing pattern of divide and conquer; re-usable component processes and component materials. Why not apply this approach to XML "manufacturing"?
10 Pipeline philosophy: Why does the assembly line approach work? Transformation task decomposition; re-usable transformation components. Transformation decomposition is the key to complexity management. Just ask: Henry Ford; Herbert Simon (the two watchmakers, "The Architecture of Complexity"); George Miller (7±2); Adam Smith (An Inquiry into the Nature and Causes of the Wealth of Nations, 1776); any electrical or chemical engineer.
11 Pipeline philosophy: Component re-use is the key to productivity. Ask any kind of engineer (electrical, chemical, etc.) apart from software engineers… Component re-use remains a holy grail in software engineering. Pipelining is yet another attempt, based on data transformation and data flow rather than algorithms.
12 Pipeline philosophy: A lot of data processing for the foreseeable future will consist of XML-to-XML transformation. A lot of non-XML data processing can consist of XML-to-XML transformations with the addition of "top" and "tail" transformations to and from non-XML formats. An XML pipeliner's mantra: get data into XML as quickly as possible; keep it in XML until the last possible minute; bring all your XML tools to bear on solving the data processing problem.
13 Pipeline philosophy: [Diagram: Non-XML Input → Top Transformation → Input XML → pipeline → Output XML → Tail Transformation → Non-XML Output]
14 Pipeline philosophy: The philosophy hinges on the fact that every complex XML transformation can be broken down into a series of smaller ones that can be chained together.
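The chaining idea above can be sketched in a few lines of Python. The stage names and the two toy transformations are purely illustrative; the point is that each stage is XML in, XML out, and the pipeline is just their composition.

```python
import xml.etree.ElementTree as ET

def rename_root(xml_text):
    """Toy stage 1: rename the document element."""
    root = ET.fromstring(xml_text)
    root.tag = "doc"
    return ET.tostring(root, encoding="unicode")

def strip_attrs(xml_text):
    """Toy stage 2: drop all attributes."""
    root = ET.fromstring(xml_text)
    for el in root.iter():
        el.attrib.clear()
    return ET.tostring(root, encoding="unicode")

def pipeline(stages, xml_text):
    """Chain the stages: the output of each is the input of the next."""
    for stage in stages:
        xml_text = stage(xml_text)
    return xml_text

print(pipeline([rename_root, strip_attrs],
               '<book id="1"><title>T</title></book>'))
# → <doc><title>T</title></doc>
```

Because every stage has the same two-pin interface, stages can be reordered, removed or reused in other pipelines without touching each other's code.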
15 Pipeline philosophy: There are only so many ways to re-arrange an XML tree structure: a finite number of fundamental transformations, from which all transformations can be derived.
16 Pipeline philosophy: Starting point: data at time T1 conforming to "spec" A; data at time T2 conforming to "spec" B. Transformation analysis/decomposition: decompose the problem of getting from A to B into independent XML-in, XML-out stages. Decide which transformation components you already have. Implement the ones you don't, and make them re-usable for the next transformation project.
17 Pipeline philosophy: Transformation analysis and decomposition leads to a series of small, manageable, "stand-alone" problems, each with an XML input "spec" and an XML output "spec" ("spec" = schemas + structure rules + narrative). You can build, test, use and then re-use these transformation components. Very team-development friendly: parallel development of loosely coupled components. Very debugging friendly: log2(n) "chops" to find any given problem.
18 Pipeline debugging: [Diagram: Non-XML Input → Top Transformation → Input XML (Schema A) → XML Delta 1 (Schema Delta 1) → … → XML Delta N (Schema Delta N) → Output XML (Schema B) → Tail Transformation → Non-XML Output]
19 Pipeline philosophy: The answer to the SAX/DOM question is "mu". (More on this later.) There is no such thing as "the" correct abstraction for processing XML. The pipeline approach means you can mix 'n' match black-box components that internally use whatever paradigm best suits the problem: lexical, SAX, StAX, DOM, XOM, COmega, XSLT, XQuery, XDuce, Pyxie, Java, C#, Groovy, Ruby, Haskell, WebIt!, etc.
21 Pipeline philosophy: Many XML transformations end up monolithic. Assertion: developers would use a more component-based approach to XML processing if they did not have to write the plumbing (orchestration, exception handling) themselves. "Gee, this problem is complex. Maybe I'll do it in multiple stages! Gee, now I have to orchestrate the stages somehow. Batch files, shell scripts, driver programs: all ugly and error-prone. Maybe I'll just write a single program after all. Besides, it will run faster..."
22 Pipeline philosophy: "Professional developers spend 50 percent of their time writing plumbing" (Adam Bosworth). Pipelining promotes the creation of a reusable plumbing "layer", letting developers concentrate on the application at hand.
23 Philosophy summary: Think flow: data processing == data transformation with respect to time (Michael Jackson). XML is the current runaway winner in the self-descriptive data stakes, and a very good IDDL (Intermediate Data Description Language) for all types of data that are not natively XML-based.
24 Philosophy summary: Inside every complex XML transformation is a sequence of simpler XML transformations trying to get out: a pipeline. Decomposed transformation = new transformations + already-componentized transformations → component reuse nirvana.
30 Simple pipeline transformation component examples: Reading a file is an XML-to-XML transformation. Input: <file>lewiscarroll.xml</file>. Output: <poem><line>Twas brillig, and the slithy toves did gyre and gimble in the wabe</line>…</poem>
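A minimal Python sketch of such a file-reading component. The filename and the in-memory "filesystem" dictionary are stand-ins so the example stays self-contained; a real component would hit the disk.

```python
import xml.etree.ElementTree as ET

# Stand-in for the filesystem so the sketch is self-contained.
FILES = {"jabberwocky.xml": "<poem><line>Twas brillig</line></poem>"}

def read_file_stage(xml_text):
    """XML in (<file>name</file>), XML out (the named XML document)."""
    name = ET.fromstring(xml_text).text
    return FILES[name]

print(read_file_stage("<file>jabberwocky.xml</file>"))
# → <poem><line>Twas brillig</line></poem>
```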
31 Simple pipeline transformation component examples: Arithmetic is an XML-to-XML transformation. Input: <expr>1 + 2</expr>. Output: <res>3</res>
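One possible implementation of the arithmetic component, sketched here with Python's `ast` module so that only literal arithmetic (not arbitrary code) gets evaluated. The element names `expr` and `res` come from the slide; everything else is an assumption.

```python
import ast
import operator
import xml.etree.ElementTree as ET

# Map AST operator nodes to safe arithmetic functions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def _eval(node):
    """Evaluate a small, safe subset of Python arithmetic expressions."""
    if isinstance(node, ast.BinOp):
        return OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant):
        return node.value
    raise ValueError("unsupported expression")

def arithmetic_stage(xml_text):
    """XML in (<expr>…</expr>), XML out (<res>…</res>)."""
    expr = ET.fromstring(xml_text).text
    result = _eval(ast.parse(expr, mode="eval").body)
    return "<res>%s</res>" % result

print(arithmetic_stage("<expr>1 + 2</expr>"))  # → <res>3</res>
```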
32 Simple pipeline transformation component examples: Unix pipe utilities, e.g. tr. Input: "hello world". Output: "HELLO WORLD".
33 A little orchestration in a transformation component: Conditionals are XML-to-XML transformation "tee junctions" triggered by XPaths. [Diagram: In → if XPath → TRUE branch / FALSE branch]
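A sketch of such a tee junction in Python. ElementTree's limited XPath support stands in for a full XPath engine, and the two branch stages (`upper`, `ident`) are toy examples invented for the illustration.

```python
import xml.etree.ElementTree as ET

def tee(xpath, true_stage, false_stage):
    """Route a document down one of two branches based on an XPath match."""
    def stage(xml_text):
        root = ET.fromstring(xml_text)
        branch = true_stage if root.find(xpath) is not None else false_stage
        return branch(xml_text)
    return stage

upper = lambda x: x.upper()   # toy TRUE branch
ident = lambda x: x           # toy FALSE branch (pass-through)

route = tee(".//error", upper, ident)
print(route("<log><error/></log>"))  # XPath matched → <LOG><ERROR/></LOG>
print(route("<log><info/></log>"))   # no match → document unchanged
```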
34 Validation as a transformation component: [Diagram: XML A → validation component (RelaxNG, Schematron, Jython/Java/JACL XComponent) → XML A′, with Input and Output ports plus a Validation Log and an Error output]
35 Sample transformation component examples: Once you start thinking in terms of pipes, components appear everywhere: regular fragmentations; doctype changer; namespace normalizer; character set transcoder; hash generator; architectural form processing; RelaxNG/Schematron validation; etc.
36 First objection: "It will be dog slow", or (stronger form) "Re-usable tree-transforming components won't work in my shop; my XML files are too big to schlep around in strings, never mind DOMs!"
37 Document fulcra and the scatter/gather pattern: For any given transformation t to be performed on documents conforming to schema s, there is a fragment expression that can be used to chop each document into n pieces, on which t can be performed. I call these points fulcra; they are a function of (t, s).
38 Identifying fulcra: For data-oriented XML, the fulcra often coincide with the "record" iteration in the XML schema and may be independent of t. For document-oriented XML, the fulcra are much more dependent on t.
39 Document fulcra and the scatter/gather pattern: Having identified the fulcra: chop the input document into fragments (scatter phase); perform t on each fragment; join all the processed fragments together to constitute the output document (gather phase). A three-stage pipeline: scatter and gather either side of the core component.
40 Document fulcra: [Diagram: Input Doc → Scatter (n fragments) → invoke t on each fragment, in parallel, over time → Gather → Output Doc]
41 Document fulcra: Note the data-domain decomposition meeting XML markup. Trivially parallelizable.
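A serial Python sketch of the scatter/gather pattern. The fulcrum here is simply a record-level tag name, and the per-fragment loop is exactly the part that could be farmed out in parallel; the `orders`/`order` vocabulary is invented for the example.

```python
import xml.etree.ElementTree as ET

def scatter_gather(xml_text, fulcrum_tag, t):
    """Three-stage pipeline: chop at the fulcrum, apply t to each
    fragment, then reassemble the output document."""
    root = ET.fromstring(xml_text)
    fragments = list(root.findall(fulcrum_tag))   # scatter phase
    processed = [t(frag) for frag in fragments]   # invoke t (parallelisable)
    out = ET.Element(root.tag)
    out.extend(processed)                         # gather phase
    return ET.tostring(out, encoding="unicode")

def mark(record):
    """Toy per-fragment transformation t."""
    record.set("processed", "yes")
    return record

print(scatter_gather("<orders><order/><order/></orders>", "order", mark))
```

Swapping the list comprehension for a `multiprocessing.Pool.map` (or distributing fragments across a grid) changes throughput, not correctness, because the fragments are independent.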
42 Document fulcra: A good fulcrum-based scatter/gather will make performance head north faster, more cheaply and with a higher upper limit than any amount of hand-crafted, genius-level XML coding of your transformations in horrid SAX or lexical-parse mode. Massive parallelism will kill all von Neumann throughput arguments. Documents per second, not seconds per document: throughput is the true measure of XML processing speed. Document fulcra: locality of reference (Denning) applies to XML processing (more on this later).
43 More objections (with more answers): "It will be slow." No it won't. Premature optimization is the root of all evil! Speed is a three-headed monster; I'm old enough to have left the X axis and am currently heading for Y through Z (the 3 axes of speed).
44 Some objections (with some answers): "Component-based software? Harumph! We have heard that one before…" Pipelines are data-flow based, not API based (COM, VBX, CORBA). Two-pin interfaces and minimal "verbs". The XML "payload" is what is important, not the API (RESTian).
45 Revisiting the XSLT/DOM -> SAX non sequitur: XSLT and DOM are memory-bound; in the trade-off between ease of use and resource usage, ease of use is favoured. SAX is not memory-bound; in the same trade-off, low resource usage is favoured. On xml-dev, users are often advised to rewrite their apps using SAX! Ugh!
46 XSLT/DOM -> pipeline: Pipelines and scatter/gather allow you to keep the ease of use of XSLT/DOM with the finite resource utilization of SAX, as long as you can identify a good fulcrum function. They exist more often than not, and when they exist they are very easily found and "drop out" of document analysis, e.g. from the XPath expressions in XSLT stylesheet templates.
47 Pipelining and Grids: Grid technologies: computational power "on tap" (http://www.gridforum.org). A match made in heaven (bandwidth permitting).
49 Grids – caveats: For large data volumes it is simply not feasible to shunt the data over the wire (Jim Gray). Organizations are sensitive about their data going beyond firewalls. Pay-per-use "racks" in your back office are a better bet: rent a grid the way you would rent a chainsaw.
50 A Service Oriented Architecture: "service" = XML transformation with optional side effects.
51 Pipelines and Service Oriented Architecture: Can usefully blur the distinction between a message queue and a transformation pipe. Services have the same XML-in, XML-out interface. All components can be services; all pipes can be services; all SOAs can be services…
53 Musings #1 – Debugging: Pipelines are very debugging friendly: log2(N) time required for fault diagnosis. "Probes" in the form of loggers and RelaxNG validators are easily pluggable (as transformation components) into a pipe to watch what is going on. A pre/post-condition on/off switch is a useful "design by contract" debugger. XML-aware browsers as "breakpoints".
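A sketch of such a probe as a pass-through stage. The checks shown are trivial stand-ins for a real validator such as RelaxNG, and the surrounding toy pipeline is invented for the illustration.

```python
def probe(label, check=None):
    """Pass-through stage: logs the document flowing past and
    optionally validates it, without changing it."""
    def stage(xml_text):
        print("[%s] %d chars" % (label, len(xml_text)))
        if check is not None and not check(xml_text):
            raise ValueError("probe %r: check failed" % label)
        return xml_text
    return stage

# Plug probes either side of a (toy) transformation stage.
doc = "<a/>"
for stage in [probe("in"), str.upper,
              probe("out", check=lambda x: x.startswith("<"))]:
    doc = stage(doc)
print(doc)  # → <A/>
```

Because probes return their input unchanged, they can be inserted or removed anywhere in a pipe without affecting the transformation, which is what makes the log2(N) bisection of a faulty pipeline cheap.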
54 Musings #2 – Validation: grammars versus rules versus FYIs. Pipelines make it natural to segregate "business rules" from "grammar rules", and can dramatically simplify both. Some of the most useful business "rules" are non-dyadic. "FYIs" are really, really useful monitoring/QA tools.
55 Musings #3 – In-between-ing and component development: Transformation analysts spec the transformation; you only need to code the new components. Spec == documentation of what the transform needs to do, with pre/post conditions etc., but no code. Provides built-in JIT-style acceptance tests via the pre/post conditions. Outsourcing friendly, parallelisability friendly and third-party-market friendly.
56 Musing #4 – Web services: The first generation will be a total blind alley (RPC). Document-oriented messaging, not object-oriented messaging -> SOAs. The next stage in encapsulation and loose coupling: something like pipelining will be a prerequisite in a doc/literal world.
57 Musing #5 – Naming and parametric typing: Naming components is a really hard problem. Programmers don't do metadata. Finding components to re-use is a real problem (the Google lesson). Numerous components do the same thing but are optimized along different axes: space; time; Infoset considerations.
58 Musing #6 – Pre-validation transformation: We are killing ourselves seeking one-shot expressivity in schema validation languages. Many complex validations become a lot simpler if you do some transformation(s) first: co-occurrence constraints; contextual constraints. There is a clear analogue with formatting (pre-flow transformation(s) + flow = DSSSL/XSL).
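A small illustration of the point: a co-occurrence constraint that is awkward to express in a grammar-based schema becomes a trivial check after a flagging transformation. The vocabulary (a `payment` with `method="card"` must carry a `card-no` child) is invented for the example.

```python
import xml.etree.ElementTree as ET

def flag_violations(xml_text):
    """Pre-validation transformation: mark records that break the
    co-occurrence rule, so a later stage only has to look for the flag."""
    root = ET.fromstring(xml_text)
    for p in root.findall("payment"):
        if p.get("method") == "card" and p.find("card-no") is None:
            p.set("invalid", "missing card-no")
    return ET.tostring(root, encoding="unicode")

print(flag_violations(
    '<payments><payment method="card"/><payment method="cash"/></payments>'))
```

After this stage, "valid" simply means "no element carries an `invalid` attribute", which any downstream grammar or XPath check can express directly.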
59 Musing #7 – Grids, scheduling and compilers: Scheduling transformations on a pipeline grid is hard; manufacturing lore needs to be brought to bear (e.g. flow-shop scheduling). Pipe -> component via a compiler is a powerful idea, both for grids (I/O optimisation) and for general program distribution. Pipe compilation can beat the I/O problems while retaining the simple, componentised development approach. Back to the future with Jackson's program inversion.
60 Musing #8 – Higher-order transformations: What if, instead of transforming an instance, you transformed a grammar? Auto-generation of instance transformation primitives. Limited to non-PCDATA transforms and side-effect-free transforms, but useful nonetheless.
61 Some pipeline-related open source technologies: Unix pipes (|); SAX filters; XBeans; Cocoon; XPipe (sadly under-resourced); AxKit; xvif; DSDL; Ant; the W3C XML Pipeline Note.