Sean McGrath 1http://www.propylon.com Performing impossible feats of XML processing with pipelining XML Open 2004 Sean McGrath Propylon
Contents: The pipelining philosophy; Major functional elements of pipelines; Some examples; Pipelining and Grids; Pipelining and Web Services/SOAs; Some anticipated objections (and answers); Some musings; Some technology pointers
What is XML pipelining? It is an architectural framework for developing robust, scalable, manageable XML processing systems, based on proven mechanical manufacturing patterns. Specifically: – Assembly lines (divide and conquer) – Component assembly and component re-use
What is XML pipelining and why is it useful? A way of thinking about systems that focuses on XML dataflows rather than object APIs. (This is a critical and non-trivial focus shift for many programmers!) Why? Because pipelining provides a mechanical, inspiration-free, genius-free way of handling the mind-boggling complexity of large XML transformation projects.
Pipelining Philosophy: XML is all about complex hierarchical data structures…
Pipelining Philosophy: Henry Ford's Model T Ford assembly line – 1914. Cars are complex, hierarchical structures.
Pipelining Philosophy: Lunch assembly line, NY, 2004. Lunch is a complex, hierarchical structure.
Pipelining Philosophy: We are complex, hierarchical structures.
Pipelining philosophy: What have these scenes got in common? – The complex construction of cars, tuna melts and tendons is made possible and efficient through the assembly-line manufacturing pattern of divide and conquer, re-usable component processes and component materials. Why not apply this approach to XML manufacturing?
Pipeline philosophy: Why does the assembly line approach work? – Transformation task decomposition – Re-usable transformation components. Transformation decomposition is the key to complexity management. Just ask: – Henry Ford – Herbert Simon (the two watchmakers parable in The Architecture of Complexity) – George Miller (7 +/- 2) – Adam Smith (An Inquiry into the Nature and Causes of the Wealth of Nations, 1776) – Any electrical or chemical engineer.
Pipeline philosophy: Component re-use is the key to productivity. – Ask any kind of engineer (electrical, chemical etc.) apart from software engineers… – Component re-use remains a holy grail in software engineering – Pipelining is yet another attempt, based on data transformation and data flow rather than algorithms
Pipeline philosophy: A lot of data processing for the foreseeable future will consist of XML to XML transformation. A lot of non-XML data processing can consist of XML to XML transformations with the addition of top and tail transformations from and to non-XML formats. An XML pipeliner's mantra: 1. Get data into XML as quickly as possible. 2. Keep it in XML until the last possible minute. 3. Bring all your XML tools to bear on solving the data processing problem.
Pipeline philosophy [diagram]: Non-XML Input -> Top Transformation -> Input XML -> … -> Output XML -> Tail Transformation -> Non-XML Output
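As an illustration of top and tail transformations, here is a minimal Python sketch. The CSV input, the element names (rows, row, name, qty) and all three function names are assumptions made up for this example; they do not come from the talk.

```python
import csv
import io
import xml.etree.ElementTree as ET


def top_csv_to_xml(csv_text):
    """Top transformation: non-XML (CSV) in, XML out."""
    root = ET.Element("rows")
    for record in csv.DictReader(io.StringIO(csv_text)):
        row = ET.SubElement(root, "row")
        for field, value in record.items():
            ET.SubElement(row, field).text = value
    return root


def upper_case_names(root):
    """Core stage: XML in, XML out. Upper-cases every <name> element."""
    for name in root.iter("name"):
        name.text = (name.text or "").upper()
    return root


def tail_xml_to_text(root):
    """Tail transformation: XML in, non-XML (plain text) out."""
    lines = []
    for row in root.findall("row"):
        lines.append(f"{row.findtext('name')}: {row.findtext('qty')}")
    return "\n".join(lines)


csv_input = "name,qty\nwidget,3\ngadget,5\n"
report = tail_xml_to_text(upper_case_names(top_csv_to_xml(csv_input)))
print(report)
# WIDGET: 3
# GADGET: 5
```

Only the first and last functions know anything about non-XML formats; everything in between can be an ordinary XML-in, XML-out component.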
Pipeline philosophy: The philosophy hinges on the fact that every complex XML transformation can be broken down into a series of smaller ones that can be chained together.
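Chaining can be sketched in a few lines of Python, assuming each stage is a black box that takes an XML string and returns an XML string. The stage names and the run_pipeline helper are hypothetical, chosen only for this illustration.

```python
from functools import reduce
import xml.etree.ElementTree as ET


def normalise_whitespace(xml_text):
    """Stage 1: collapse runs of whitespace inside element text."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text:
            elem.text = " ".join(elem.text.split())
    return ET.tostring(root, encoding="unicode")


def add_item_count(xml_text):
    """Stage 2: record the number of <item> children on the root."""
    root = ET.fromstring(xml_text)
    root.set("count", str(len(root.findall("item"))))
    return ET.tostring(root, encoding="unicode")


def run_pipeline(stages, xml_text):
    """Feed the output of each stage into the next one."""
    return reduce(lambda doc, stage: stage(doc), stages, xml_text)


result = run_pipeline(
    [normalise_whitespace, add_item_count],
    "<list><item>  a   b </item><item>c</item></list>",
)
print(result)
# <list count="2"><item>a b</item><item>c</item></list>
```

Because every stage shares the same two-pin interface, stages can be reordered, removed or replaced without touching the others.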
Pipeline philosophy: There are only so many ways to re-arrange an XML tree structure: a finite number of fundamental transformations, from which all transformations can be derived.
Pipeline philosophy: 1. Starting point: data at time T conforming to spec A; data at time T2 conforming to spec B. 2. Transformation analysis/decomposition: decompose the problem of getting from A to B into independent XML-in, XML-out stages. 3. Decide which transformation components you already have. 4. Implement the ones you don't, and make them re-usable for the next transformation project.
Pipeline philosophy: Transformation analysis and decomposition lead to a series of small, manageable, stand-alone problems, each with an XML input spec and an XML output spec. Spec = schemas + structure rules + narrative. You can build, test, use and then re-use these transformation components. Very team-development friendly: parallel development of loosely coupled components. Very debugging friendly: log2(n) chops to find any given problem in an n-stage pipeline.
Pipeline debugging [diagram]: Non-XML Input, Top Transformation, Input XML (Schema A), XML Delta 1 … XML Delta N (Schema Delta 1 … Schema Delta N), Output XML (Schema B), Tail Transformation, Non-XML Output
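The log2(n) debugging claim can be illustrated with a binary search over the intermediate documents of a pipeline. This is a sketch under two assumptions that are not from the talk: the final output is known to be bad, and a document that has gone bad stays bad in every later stage. All names here are hypothetical.

```python
import xml.etree.ElementTree as ET


def run_with_snapshots(stages, doc):
    """Run each stage in turn, keeping every intermediate document."""
    snapshots = []
    for stage in stages:
        doc = stage(doc)
        snapshots.append(doc)
    return snapshots


def first_bad_stage(snapshots, is_ok):
    """Binary-search for the first stage whose output fails is_ok.

    Assumes the last snapshot is bad and that a broken document
    stays broken in every later snapshot.
    """
    low, high = 0, len(snapshots) - 1
    chops = 0
    while low < high:
        middle = (low + high) // 2
        chops += 1
        if is_ok(snapshots[middle]):
            low = middle + 1   # fault lies downstream of middle
        else:
            high = middle      # fault is at middle or upstream
    return low, chops


def identity(doc):
    return doc


def faulty(doc):
    # Hypothetical bug: renames a required element.
    return doc.replace("id>", "ident>")


def has_id(doc):
    return ET.fromstring(doc).find("id") is not None


stages = [identity] * 5 + [faulty] + [identity] * 2   # eight stages
snapshots = run_with_snapshots(stages, "<rec><id>7</id></rec>")
culprit, chops = first_bad_stage(snapshots, has_id)
print(culprit, chops)  # 5 3
```

With eight stages the faulty one (index 5) is found after three checks, which matches log2(8).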
Pipeline philosophy: The answer to the SAX/DOM question is mu. (More on this later.) There is no such thing as the correct abstraction for processing XML. The pipeline approach means you can mix 'n' match black-box components that internally use whatever paradigm best suits the problem: lexical, SAX, StAX, DOM, XOM, COmega, XSLT, XQuery, XDuce, Pyxie, Java, C#, Groovy, Ruby, Haskell, WebIt! etc.
Sample Pipeline [diagram labels]: DB/CMS, Character Set Mods, Add Doctype + validate + strip doctype, Re-arrange Elements, Stats + FTP, XHTML Generate Validation SQL Replace; Lexical, Schematron/RelaxNG/Rhino, Jython, Java, XSLT, Lexical, DOM
Pipeline philosophy: Many XML transformations end up monolithic. Assertion: developers would use a more component-based approach to XML processing if they did not have to write the plumbing (orchestration, exception handling) themselves. – "Gee, this problem is complex. Maybe I'll do it in multiple stages! Gee, now I have to orchestrate the stages somehow. Batch files/shell scripts/driver program: all ugly and error prone. Maybe I'll just write a single program after all. Besides, it will run faster..."
Pipeline philosophy: Professional developers spend 50 percent of their time writing plumbing – Adam Bosworth. Pipelining promotes the creation of a reusable plumbing layer, letting developers concentrate on the application in hand.
Philosophy Summary: Think flow: data processing == data transformation w.r.t. time – Michael Jackson. XML is the current runaway winner in the self-descriptive data stakes and a very good IDDL (Intermediate Data Description Language) for all types of data that are not natively XML based.
Philosophy Summary: Inside every complex XML transformation is a sequence of simpler XML transformations trying to get out: a pipeline. Decomposed transformation: new transformations + already componentized transformations -> Component Reuse Nirvana
Pipeline Philosophy [diagram]: Level 0 – transformation component (In -> Out); Level 1 – pipeline (In -> Out); Level 2 – rudimentary orchestration (In -> Out)
Simple pipeline transformation component examples: Fundamental Operation – Rename Element. Input: baz Output: baz foo baz bar baz
Simple pipeline transformation component examples: Fundamental Operation – Peel. Input: baz Output: baz foo baz bar foo baz
Simple pipeline transformation component examples: KlingonCloak – Input: baz – Output: baz foo baz bar tag type=foo baz tag type=bar
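The XML markup of the three component examples above is not legible in this transcript, so the following Python sketch is only a guess at what such components might look like. It assumes that Rename changes an element's name, that Peel removes an element's tags while keeping its content, and that KlingonCloak hides an element's name inside a generic tag element with a type attribute. The function names and test documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def rename(root, old, new):
    """Rename every <old> element to <new>."""
    for elem in root.iter(old):
        elem.tag = new
    return root


def _append_text(parent, index, text):
    """Attach text immediately before child position index of parent."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text


def _unwrap(parent, child):
    """Replace child with its own content inside parent."""
    index = list(parent).index(child)
    parent.remove(child)
    _append_text(parent, index, child.text)
    for offset, grandchild in enumerate(list(child)):
        parent.insert(index + offset, grandchild)
    _append_text(parent, index + len(child), child.tail)


def peel(root, tag):
    """Remove every <tag> wrapper below root, keeping what it contains."""
    # Deepest parents first, so nested wrappers are peeled inside out.
    for parent in reversed(list(root.iter())):
        for child in list(parent):
            if child.tag == tag:
                _unwrap(parent, child)
    return root


def cloak(root, names, wrapper="tag"):
    """Turn <foo> into <tag type="foo"> for every name in names.

    Any existing type attribute on a cloaked element is overwritten.
    """
    for elem in root.iter():
        if elem.tag in names:
            elem.set("type", elem.tag)
            elem.tag = wrapper
    return root


def show(root):
    return ET.tostring(root, encoding="unicode")


renamed = show(rename(ET.fromstring("<doc><foo>baz</foo></doc>"), "foo", "bar"))
peeled = show(peel(ET.fromstring("<doc><foo><bar>baz</bar></foo></doc>"), "foo"))
cloaked = show(cloak(ET.fromstring("<doc><foo>baz</foo><bar>baz</bar></doc>"),
                     {"foo", "bar"}))
print(renamed)  # <doc><bar>baz</bar></doc>
print(peeled)   # <doc><bar>baz</bar></doc>
print(cloaked)  # <doc><tag type="foo">baz</tag><tag type="bar">baz</tag></doc>
```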
Simple pipeline transformation component examples: Reading a file is an XML to XML transformation – lewisscarrol.xml – Twas brillig, and the slithy tomes, did gyre and gimbal in the wave …
Simple pipeline transformation component examples: Arithmetic is an XML to XML transformation – – 3
Simple pipeline transformation component examples: Unix pipe utilities, e.g. tr – hello world – HELLO WORLD
A little orchestration in a transformation component: Conditionals are XML to XML transformation tee junctions triggered by XPaths. [Diagram: In -> if XPath -> TRUE branch / FALSE branch]
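A tee junction of this kind might be sketched as follows. ElementTree only supports a subset of XPath, which is enough for the illustration; the tee and mark helpers and the order/item vocabulary are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def tee(xpath, if_true, if_false):
    """Build a component that routes a document down one of two branches."""
    def component(root):
        branch = if_true if root.find(xpath) is not None else if_false
        return branch(root)
    return component


def mark(route):
    """Hypothetical branch component: records which branch was taken."""
    def component(root):
        root.set("route", route)
        return root
    return component


router = tee(".//item[@priority='high']", mark("express"), mark("standard"))
urgent = router(ET.fromstring("<order><item priority='high'/></order>"))
normal = router(ET.fromstring("<order><item priority='low'/></order>"))
print(urgent.get("route"), normal.get("route"))  # express standard
```

The junction itself is still XML in, XML out, so it can sit in a pipeline like any other component.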
Validation as a transformation component [diagram]: Input XML A -> XComponent (RelaxNG, Schematron, Jython/Java/JACL) -> Output XML A, with Validation Log and Error outputs
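A validating component passes its input through unchanged and reports on a side channel. The sketch below stands in plain Python predicates for a RelaxNG or Schematron engine, since the standard library has neither; the rule list, the log format and the ValidationError class are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


class ValidationError(Exception):
    """Raised on the component's error output."""


def make_validator(rules, log):
    """Build a pass-through component that checks rules and writes to log."""
    def component(root):
        failures = [message for message, check in rules if not check(root)]
        if failures:
            log.extend("FAIL: " + message for message in failures)
            raise ValidationError("; ".join(failures))
        log.append("OK")
        return root   # the document passes through unchanged
    return component


rules = [
    ("root element must be <invoice>", lambda root: root.tag == "invoice"),
    ("at least one <line> is required", lambda root: root.find("line") is not None),
]
log = []
validate = make_validator(rules, log)

good = ET.fromstring("<invoice><line/></invoice>")
passed = validate(good)

try:
    validate(ET.fromstring("<invoice/>"))
    error_message = None
except ValidationError as error:
    error_message = str(error)

print(error_message)  # at least one <line> is required
print(log)            # ['OK', 'FAIL: at least one <line> is required']
```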
Sample Transformation Component Examples: Once you start thinking in terms of pipes, components appear everywhere: – Regular fragmentations – Doctype changer – Namespace normalizer – Character set transcoder – Hash generator – Architectural form processing – RelaxNG/Schematron etc.
First objection: "It will be dog slow" or (stronger form): – "Re-usable tree-transforming components won't work in my shop – my XML files are too big to schlep around in strings, never mind DOMs!"
Document fulcra and the scatter/gather pattern: For any given transformation t to be performed on documents conforming to schema s, there is a fragment expression that can be used to chop each document into n pieces, on each of which t can be performed. I call these chopping points fulcra; they are a function of (t, s).
Identifying Fulcra: For data-oriented XML, the fulcra often coincide with the record iteration in the XML schema and may be independent of t. For document-oriented XML, the fulcra are much more dependent on t.
Document fulcra and scatter/gather pattern: Having identified the fulcra: – Chop the input document into fragments (scatter phase) – Perform t on each fragment – Join all the processed fragments together to constitute the output document (gather phase). This is a three-stage pipeline: scatter and gather on either side of the core component.
Document Fulcra [diagram]: Input Doc -> Scatter (n fragments) -> Invoke t on each fragment -> Gather (n fragments) -> Output Doc, laid out along a TIME axis
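The scatter, invoke t, gather sequence might look like this in Python. The sketch assumes the fulcrum is simply a child element name and that the root contains nothing but fulcrum elements; the orders vocabulary and every function name are made up for the example.

```python
import xml.etree.ElementTree as ET


def scatter(root, fulcrum):
    """Chop the document into independent, serialized fragments."""
    return [ET.tostring(piece, encoding="unicode")
            for piece in root.findall(fulcrum)]


def gather(root_tag, fragments):
    """Join processed fragments under a fresh root element."""
    root = ET.Element(root_tag)
    for fragment in fragments:
        root.append(ET.fromstring(fragment))
    return root


def add_total(fragment):
    """The transformation t: sees one small fragment, never the whole document."""
    order = ET.fromstring(fragment)
    total = int(order.get("qty")) * int(order.get("price"))
    order.set("total", str(total))
    return ET.tostring(order, encoding="unicode")


def scatter_gather(root, fulcrum, t):
    """Three-stage pipeline: scatter, invoke t on each fragment, gather."""
    return gather(root.tag, [t(fragment) for fragment in scatter(root, fulcrum)])


source = ET.fromstring(
    '<orders><order qty="2" price="5"/><order qty="3" price="7"/></orders>'
)
result = scatter_gather(source, "order", add_total)
print([order.get("total") for order in result])  # ['10', '21']
```

Since t only ever holds one fragment, it can use a tree API internally regardless of how large the whole document is.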
Document Fulcra: Note: data domain decomposition meets XML markup. Trivially parallelizable.
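Because fragments are independent, the "invoke t" step can be handed to a pool of workers. In this sketch a thread pool stands in for whatever parallel resource is actually available (processes, machines, a grid); the fragments and the shout function are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def shout(fragment):
    """The transformation t, applied to one fragment at a time."""
    para = ET.fromstring(fragment)
    para.text = (para.text or "").upper()
    return ET.tostring(para, encoding="unicode")


fragments = ["<p>one</p>", "<p>two</p>", "<p>three</p>"]

# Executor.map returns results in input order, so the gather phase
# can simply concatenate them.
with ThreadPoolExecutor(max_workers=3) as pool:
    processed = list(pool.map(shout, fragments))

print(processed)  # ['<p>ONE</p>', '<p>TWO</p>', '<p>THREE</p>']
```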
Document Fulcra: A good fulcra-based scatter/gather will make performance head north faster, more cheaply and with a higher upper limit than any amount of hand-crafted, genius-level XML coding of your transformations in horrid SAX or lexical parse mode. – Massive parallelism will kill all von Neumann throughput arguments. Documents per second, not seconds per document: throughput is the true measure of XML processing speed. Document fulcra: locality of reference (Denning) applies to XML processing (more on this later).
More objections (with more answers): "It will be slow" – No it won't. Premature optimization is the root of all evil! – Speed is a three-headed monster. I'm old enough to have left the X axis and am currently heading for Y through Z. [Diagram: The 3 Axes to Speed]
Some objections (with some answers): "Component based software? Harumph! We have heard that one before…" – Pipelines are data-flow based, not API based (COM, VBX, CORBA) – Two-pin interfaces and minimal verbs – The XML payload is what is important, not the API: RESTian
Revisiting the XSLT/DOM -> SAX non sequitur: XSLT and DOM are memory bound: in the trade-off between ease of use and resource usage, ease of use is favoured. SAX is not memory bound: in the same trade-off, low resource usage is favoured. On xml-dev, users are often advised to rewrite their apps using SAX! Ugh!
XSLT/DOM -> pipeline: Pipelines and scatter/gather allow you to keep the ease of use of XSLT/DOM with the finite resource utilization of SAX, as long as you can identify a good fulcrum function. – They exist more often than not – If they exist, they are very easily found and drop out of document analysis, e.g. XPath expressions in XSLT stylesheet templates
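One way to get tree-style convenience per fragment with bounded memory is incremental parsing. The sketch below uses ElementTree's iterparse as a stand-in for a scatter stage; it assumes the fulcrum elements are direct children of the root, and the ledger vocabulary and the stream_fragments name are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def stream_fragments(source, fulcrum_tag):
    """Yield one small tree per fulcrum element, discarding each after use."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)           # first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == fulcrum_tag:
            yield elem                # caller may use find/findall on it
            root.clear()              # drop processed fragments from memory


source = io.BytesIO(
    b"<ledger>"
    b"<entry><amount>10</amount></entry>"
    b"<entry><amount>32</amount></entry>"
    b"</ledger>"
)
total = sum(int(entry.findtext("amount"))
            for entry in stream_fragments(source, "entry"))
print(total)  # 42
```

Memory use is governed by the size of the largest fragment rather than the size of the document.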
Pipelining and Grids: Grid technologies – computational power on tap (http://www.gridforum.org). A match made in heaven (bandwidth permitting).
An XML Processing Grid – on demand [diagram labels: In, Out, DMZ]
Grids – caveats: For large data volumes it is simply not feasible to shunt the data over the wire – Jim Gray. Organizations are sensitive about their data going beyond firewalls. Pay-per-use racks in your back office are a better bet. – Rent a grid the way you would rent a chainsaw.
A Service Oriented Architecture: service = XML transformation with optional side effects
Pipelines and Service Oriented Architecture: Can usefully blur the distinction between a message queue and a transformation pipe. Services have the same XML-in, XML-out interface: – All components can be services – All pipes can be services – All SOAs can be services…
Federated SOAs [diagram]: Pipeline transformation
Musings #1 – Debugging: Pipelines are very debugging friendly. – log2(N) time required for fault diagnosis – Probes in the form of loggers and RelaxNG validators are easily plugged (as transformation components) into a pipe to watch what is going on – A pre/post condition on/off switch is a useful design-by-contract debugger – XML-aware browsers as breakpoints
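Probes and a pre/post condition switch can both be written as ordinary pass-through components. This is a minimal sketch; make_probe, with_contract, the enabled flag and the add_status stage are hypothetical names, not part of any tool mentioned in the talk.

```python
import xml.etree.ElementTree as ET


def make_probe(name, sink):
    """Pass-through component that records what flows past it."""
    def probe(root):
        sink.append((name, ET.tostring(root, encoding="unicode")))
        return root
    return probe


def with_contract(stage, pre, post, enabled=True):
    """Wrap a stage with pre/post condition checks that can be switched off."""
    def wrapped(root):
        if enabled and not pre(root):
            raise AssertionError("precondition failed: " + stage.__name__)
        result = stage(root)
        if enabled and not post(result):
            raise AssertionError("postcondition failed: " + stage.__name__)
        return result
    return wrapped


def add_status(root):
    """Hypothetical stage under test."""
    root.set("status", "checked")
    return root


sink = []
pipeline = [
    make_probe("before", sink),
    with_contract(add_status,
                  pre=lambda root: root.tag == "doc",
                  post=lambda root: root.get("status") == "checked"),
    make_probe("after", sink),
]

doc = ET.fromstring("<doc/>")
for stage in pipeline:
    doc = stage(doc)

print([name for name, _ in sink])  # ['before', 'after']
```

Setting enabled=False leaves the pipeline shape untouched while removing the cost of the checks.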
Musings #2 – Validation – grammars versus rules versus FYIs: Pipelines make it natural to segregate business rules from grammar rules and can dramatically simplify both. Some of the most useful business rules are non-dyadic. FYIs are really, really useful monitoring/QA tools.
Musings #3 – Inbetween-ing and component development: Transformation analysts spec the transformation. Only new components need to be coded. Spec == documentation of what the transform needs to do, with pre/post conditions etc. but no code. Provides a built-in JIT-style acceptance test via the pre/post conditions. Outsource friendly, parallelisability friendly and third-party market friendly.
Musing #4 – Web Services: The first generation will be a total blind alley – RPC. Document Oriented Messaging, not Object Oriented Messaging -> SOAs. The next stage in encapsulation and loose coupling: something like pipelining will be a pre-requisite in a doc/literal world.
Musing #5 – naming and parametric typing: Naming components is a really hard problem. Programmers don't do metadata. Finding components to re-use is a real problem – the Google lesson. Numerous components that do the same thing but are optimized on different axes: – Space – Time – Infoset considerations
Musing #6 – Pre-validation Transformation: We are killing ourselves seeking one-shot expressivity in schema validation languages. Many complex validations become a lot simpler if you do some transformation(s) first: – Co-occurrence constraints – Contextual constraints. Clear analogue with formatting (pre-flow transformation(s) + flow = DSSSL/XSL).
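As a small illustration of transforming before validating, consider a co-occurrence constraint: a payment with method="card" must contain a cardnumber, while one with method="cash" must not. Lifting the attribute value into the element name first turns this into a plain per-element content check. The payment vocabulary, the toy grammar dictionary and both function names are assumptions made up for this sketch.

```python
import xml.etree.ElementTree as ET


def lift_attribute_into_name(root, tag, attr):
    """Pre-validation transformation: <payment method="card"> -> <payment-card>."""
    for elem in root.iter(tag):
        elem.tag = f"{tag}-{elem.get(attr)}"
    return root


def violations(root, grammar):
    """Toy grammar check: each listed element must have exactly these child names."""
    problems = []
    for elem in root.iter():
        if elem.tag in grammar:
            found = {child.tag for child in elem}
            if found != grammar[elem.tag]:
                problems.append(elem.tag)
    return problems


# After the transformation the constraint is an ordinary content model.
GRAMMAR = {
    "payment-card": {"cardnumber"},
    "payment-cash": set(),
}

document = ET.fromstring(
    "<payments>"
    '<payment method="card"><cardnumber>1234</cardnumber></payment>'
    '<payment method="cash"><cardnumber>9</cardnumber></payment>'
    "</payments>"
)
problems = violations(lift_attribute_into_name(document, "payment", "method"), GRAMMAR)
print(problems)  # ['payment-cash']
```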
Musing #7 – grids, scheduling and compilers: Scheduling transformations on a pipeline grid is hard; manufacturing lore needs to be brought to bear (e.g. Flow Shop Scheduling). Pipe -> Component via a compiler is a powerful idea: – Both for grids (IO optimisation) and for general program distribution – Pipe compilation can beat the IO problems while retaining the simple, componentised development approach – Back to the future with Jackson's Program Inversion
Musing #8 – Higher order transformations: What if, instead of transforming an instance, you transformed a grammar? Auto-generation of instance transformation primitives. Limited to non-PCDATA transforms and side-effect-free transforms, but useful nonetheless.
Some pipeline-related open source technologies: | (Unix pipes), SAX filters, XBeans, Cocoon, XPipe (sadly under-resourced), AxKit, xvif, DSDL, Ant, W3C Pipeline Note
Thank you (question,answer?)*