
1 XML Transformations and Content-based Crawling Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems August 7, 2015

2 Reminders
- Homework 2 "release" version is now on the Web site
  - Simple web crawling
  - XPath
  - XSLT
  - Storage (Berkeley DB)
- Milestone 1 due March 1
- Milestone 2 due March 8

3 More than XPath
- XPath identifies or extracts subtrees from an XML document
- … But there are lots of cases where we want to convert from XML → XML, or something else:
  - XML → text (document extraction)
  - XML → HTML
  - XML → SVG
  - etc.
- Here we need something more – often XSLT

4 A Functional Language for XML
- XSLT is based on a series of templates that match different parts of an XML document
  - There's a policy for what rule or template is applied if more than one matches (it's not what you'd think!)
- XSLT templates can invoke other templates
- XSLT templates can be nonterminating (beware!)
- XSLT templates are based on XPath "match"es, and we can also apply other templates (potentially to "select"ed XPaths)
- Within each template, directly describe what should be output

5 An XSLT Template
- An XML document itself
  - XML tags create output OR are XSL operations
  - All XSL tags are prefixed with the "xsl" namespace
  - All non-XSL tags are part of the XML output
- Common XSL operations:
  - a template with a "match" XPath
  - a recursive call to apply-templates, which may also "select" where it should be applied
- Attach to an XML document with a processing instruction, e.g.:
    <?xml-stylesheet type="text/xsl" href="http://www.com/my.xsl"?>
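
A minimal skeleton of such a stylesheet, as a sketch (the xsl:* elements are standard XSLT 1.0; the html/body output tags are purely illustrative):

  <?xml version="1.0"?>
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <!-- a template with a "match" XPath -->
    <xsl:template match="/">
      <!-- non-XSL tags go straight to the output -->
      <html><body>
        <!-- recursively apply templates, here to "select"ed child elements -->
        <xsl:apply-templates select="*"/>
      </body></html>
    </xsl:template>
  </xsl:stylesheet>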

6 An Example XSLT Stylesheet
(stylesheet listing; its output begins with the literal text "This is DBLP …" — a sketch follows)
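
A plausible sketch of such a stylesheet (the element names come from the DBLP data on the next slide; the HTML output markup is assumed, not taken from the original):

  <?xml version="1.0"?>
  <xsl:stylesheet version="1.0"
                  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:template match="/dblp">
      <h1>This is DBLP …</h1>
      <ul><xsl:apply-templates select="mastersthesis | article"/></ul>
    </xsl:template>

    <xsl:template match="mastersthesis | article">
      <li><xsl:value-of select="title"/> (<xsl:value-of select="year"/>)</li>
    </xsl:template>

  </xsl:stylesheet>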

7 XML Data
(figure: tree view of a DBLP fragment) The document root has an ?xml processing instruction and a dblp element; dblp contains a mastersthesis (attributes mdate and key; children author, title, year, school) and an article (attributes mdate and key; children editor, title, year, journal, volume, ee). The legend distinguishes root, processing-instruction, element, attribute, and text nodes.
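
Written out as a document, the tree corresponds roughly to the fragment below (values that were truncated in the figure are left truncated here):

  <?xml version="1.0"?>
  <dblp>
    <mastersthesis mdate="2002…" key="ms/Brown92">
      <author>Kurt P.…</author>
      <title>PRPL…</title>
      <year>1992</year>
      <school>Univ.…</school>
    </mastersthesis>
    <article mdate="2002…" key="tr/dec/…">
      <editor>Paul R.</editor>
      <title>The…</title>
      <journal>Digital…</journal>
      <volume>SRC…</volume>
      <year>1997</year>
      <ee>db/labs/dec</ee>
      <ee>http://www.…</ee>
    </article>
  </dblp>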

8 XSLT Processing Model
- List of source nodes → result tree fragment(s)
- Start with the root
  - Find all template rules with matching patterns from the root
  - Find the "best" match according to some heuristics
  - Set the current node list to be the set of things it matches
- Iterate over each node in the current node list
  - Apply the operations of the template
  - "Append" the results of the matching template rule to the result tree structure
- Repeat recursively if specified by apply-templates

9 What If There's More than One Match?
- Eliminate rules of lower precedence due to importing
- Break a rule into any | branches and consider them separately
- Choose the rule with the highest computed or specified priority
- Simple rules for computing priority based on "precision":
  - QName preceded by an XPath child/axis specifier: priority 0
  - NCName preceded by a child/axis specifier: priority -0.25
  - NodeTest preceded by a child/axis specifier: priority -0.5
  - else priority 0.5
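
For instance (a sketch; the element names are just taken from the running DBLP example), if an article element is matched by all three templates below, the first wins because its pattern has the highest default priority:

  <!-- multi-step pattern: falls under "else", priority 0.5 -->
  <xsl:template match="dblp/article">an article directly under dblp</xsl:template>

  <!-- plain QName: priority 0 -->
  <xsl:template match="article">some article</xsl:template>

  <!-- bare node test: priority -0.5 -->
  <xsl:template match="*">any element at all</xsl:template>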

10 Other Common Operations
- Iteration
- Conditionals
- Copying the current node and children to the result set
(sketches of each follow)
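
The snippets on the original slide did not survive extraction; the standard XSLT 1.0 forms look roughly like this (the select/test paths refer to the DBLP example and are only illustrative):

  <!-- Iteration -->
  <xsl:for-each select="dblp/article">
    <xsl:value-of select="title"/>
  </xsl:for-each>

  <!-- Conditionals -->
  <xsl:if test="year &gt; 1995">recent</xsl:if>
  <xsl:choose>
    <xsl:when test="@key">has a key</xsl:when>
    <xsl:otherwise>no key</xsl:otherwise>
  </xsl:choose>

  <!-- Copy the current node and its children to the result -->
  <xsl:copy-of select="."/>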

11 Creating Output Nodes
- Return text/attribute data (this is a default rule)
- Create an element from text (an attribute is similar)
- Copy nodes matching a path
(sketches of each follow)
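
Again the slide's snippets were stripped; the usual XSLT 1.0 idioms are roughly as follows (the paths and the thesis-title name are illustrative):

  <!-- Return text/attribute data (the built-in default rule does this for text nodes) -->
  <xsl:value-of select="title"/>

  <!-- Create an element from text (xsl:attribute works analogously) -->
  <xsl:element name="thesis-title">
    <xsl:value-of select="title"/>
  </xsl:element>

  <!-- Copy nodes matching a path -->
  <xsl:copy-of select="dblp/mastersthesis"/>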

12 Embedding Stylesheets
- You can "import" or "include" one stylesheet from another:
    <xsl:include href="http://www.com/my.xsl"/>
    <xsl:import href="http://www.com/my.xsl"/>
- "Include": the rules get the same precedence as those in the including stylesheet
- "Import": the rules are given lower precedence

13 XSLT Summary
- A very powerful, template-based transformation language for XML document → other structured document
- Commonly used to convert XML → PDF, SVG, GraphViz DOT format, HTML, WML, …
- Primarily useful for presentation of XML or for very simple conversions
- What if we want to:
  - Manage and combine collections of XML documents?
  - Make Web service requests for XML?
  - "Glue together" different Web service requests?
  - Query for keywords within documents, with ranked answers?
- This is where XQuery plays a role – see CIS 330 / 550 for details

14 Now… How Do We Crawl the Web and Get Data?
- A few remarks on basic crawlers…
- … Then an XML-specific crawler

15 Crawling the Web: The Basic Process
- Start with some initial page P0
- Collect all URLs from P0 and add them to the crawler queue
  - Consider anchor link tags, and optionally image links, CSS, DTDs, scripts
- Considerations:
  - What order to traverse (polite to do BFS – why?)
  - How deep to traverse
  - What to ignore (coverage)
  - How to escape "spider traps" and avoid cycles
  - How often to crawl

16 Essential Crawler Etiquette
- Robot exclusion protocols
- First, ignore pages that carry a robots META tag, e.g. <meta name="robots" content="noindex">
- Second, look for robots.txt at the root of the web server
  - See http://www.robotstxt.org/wc/robots.html
- To exclude all robots from a server:
    User-agent: *
    Disallow: /
- To exclude one robot from two directories:
    User-agent: BobsCrawler
    Disallow: /news/
    Disallow: /tmp/

17 Suppose We Want to Crawl XML Documents Based on User Interests
- We need several parts:
  - A list of "interests" – expressed in an executable form, perhaps XPath queries
  - A crawler – goes out and fetches XML content
  - A filter / routing engine – matches XML content against users' interests, and sends them the content if it matches

18 XML-Based Information Dissemination
- Basic model (XFilter, YFilter, Xyleme):
  - Users are interested in data relating to a particular topic, and know the schema, e.g.:
      /politics/usa//body
  - A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties
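
For example, a crawled document like the made-up fragment below (only the element names and the topic attribute come from the running subscription; everything else is illustrative) would be routed to anyone who registered /politics/usa//body:

  <politics topic="president">
    <usa>
      <story>
        <headline>…</headline>
        <body>Text of the article goes here.</body>
      </story>
    </usa>
  </politics>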

19 Engine for XFilter [Altinel & Franklin 00]
(architecture figure)

20 How Does It Work?
- Each XPath segment is basically a subset of regular expressions over element tags
  - Convert it into a finite state automaton
- Parse data as it comes in – use the SAX API
  - Match against the finite state machines
- Most of these systems use modified FSMs because they want to match many patterns at the same time

21 Path Nodes and FSMs
- The XPath parser decomposes XPath expressions into a set of path nodes
  - These nodes act as the states of the corresponding FSM
  - A node in the Candidate List denotes the current state
  - The rest of the states are in the corresponding Wait Lists
- Simple FSM for /politics[@topic="president"]/usa//body (figure): path-node states Q1_1, Q1_2, Q1_3, entered on politics, usa, and body respectively

22 Decomposing Into Path Nodes
Each path node records:
- Query ID
- Position in state machine
- Relative Position (RP) in tree:
  - 0 for the root node if it's not preceded by "//"
  - -1 for any node preceded by "//"
  - else 1 + (number of "*" nodes from the predecessor node)
- Level:
  - if the current node has a fixed distance from the root, then 1 + distance
  - else if RP = -1, then -1, else 0
- Finally, NextPathNodeSet points to the next node

Q1 = /politics[@topic="president"]/usa//body
              Q1-1 (politics)   Q1-2 (usa)   Q1-3 (body)
  Position          1                2             3
  RP                0                1            -1
  Level             1                2            -1

Q2 = //usa/*/body/p
              Q2-1 (usa)   Q2-2 (body)   Q2-3 (p)
  Position         1             2            3
  RP              -1             2            1
  Level           -1             0            0

23 Query Index
- A query index entry for each XML tag
- Two lists per entry: Candidate List (CL) and Wait List (WL), divided across the nodes
  - "Live" queries' states are in the CL; "pending" queries' states are in the WL
- Events that cause state transitions are generated by the XML parser

(figure: the query index for Q1 and Q2)
  Element     CL      WL
  politics    Q1-1    –
  usa         Q2-1    Q1-2
  body        –       Q1-3, Q2-2
  p           –       Q2-3

24 Encountering an Element
- Look up the element name in the Query Index and visit all nodes in the associated CL
- Validate that we actually have a match

(figure) startElement: politics — the Query Index entry for politics has Q1-1 on its CL; the path node records Query ID Q1, Position 1, Relative Position 0, Level 1, and a NextPathNodeSet pointing at the next path node.

25 Validating a Match
- We first check that the current XML depth matches the level in the user query:
  - if the level in the CL node is less than 1, then ignore the height
  - else the level in the CL node must equal the height
- This ensures we're matching at the right point in the tree!
- Finally, we validate any predicates against attributes (e.g., [@topic="president"])

26 Processing Further Elements
- Queries that don't meet validation are removed from the Candidate Lists
- For the other queries, we advance to the next state
  - We copy the next node of the query from the WL to the CL, and update the RP and level
- When we reach a final state (e.g., Q1-3), we can output the document to the subscriber
- When we encounter an end element, we must remove that element's entries from the CL
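
A short worked trace, using the illustrative /politics fragment shown under slide 18 and Q1 = /politics[@topic="president"]/usa//body: startElement politics at depth 1 finds Q1-1 on the politics CL; level 1 equals the depth and the [@topic="president"] predicate holds, so Q1-2 is promoted from the usa WL to its CL. startElement usa at depth 2 validates Q1-2 (level 2 = depth 2), promoting Q1-3 to the body CL. startElement body at depth 4 validates Q1-3 (level -1, so depth is ignored); Q1-3 is a final state, so the document is sent to Q1's subscriber.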

27 Publish-Subscribe Model Summarized
- Well-suited to an XML format called RSS (Rich Site Summary or Really Simple Syndication)
  - Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles
- Seems like a perfect fit for publish-subscribe models!

