XML Transformations and Content-based Crawling Zachary G. Ives University of Pennsylvania CIS 455 / 555 – Internet and Web Systems August 7, 2015.



Reminders
- Homework 2 “release” version is now on the Web site
  - Simple web crawling
  - XPath
  - XSLT
  - Storage (Berkeley DB)
- Milestone 1 due March 1
- Milestone 2 due March 8

More than XPath
- XPath identifies or extracts subtrees from an XML document
- … But there are lots of cases where we want to convert from XML → XML, or something else
  - XML → text (document extraction)
  - XML → HTML
  - XML → SVG
  - etc.
- Here we need something more – often XSLT
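To make the first bullet concrete, here is a minimal sketch of XPath-style subtree extraction using the limited XPath subset in Python's standard xml.etree.ElementTree module. The sample document and its values (SAMPLE, the keys m1 and a1, the titles) are made up for illustration; only the element names follow the DBLP example used in these slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the element names mirror the DBLP example.
SAMPLE = (
    '<dblp>'
    '<mastersthesis key="m1"><title>Thesis A</title><year>1992</year></mastersthesis>'
    '<article key="a1"><title>Report B</title><year>1997</year></article>'
    '</dblp>'
)

root = ET.fromstring(SAMPLE)

# Child steps: titles of article elements directly under the root.
article_titles = [e.text for e in root.findall("./article/title")]

# Descendant step: every year element anywhere below the root.
years = [e.text for e in root.findall(".//year")]

# Attribute predicate: every element that carries a key attribute.
keys = [e.get("key") for e in root.findall(".//*[@key]")]
```

Each query returns existing nodes of the input; none of them can restructure the document, which is the gap XSLT fills.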

A Functional Language for XML
- XSLT is based on a series of templates that match different parts of an XML document
- There’s a policy for what rule or template is applied if more than one matches (it’s not what you’d think!)
- XSLT templates can invoke other templates
- XSLT templates can be nonterminating (beware!)
- XSLT templates are based on XPath “match”es, and we can also apply other templates (potentially to “select”ed XPaths)
- Within each template, directly describe what should be output

An XSLT Template
- An XSLT stylesheet is itself an XML document
- XML tags create output OR are XSL operations
  - All XSL tags are prefixed with the “xsl” namespace
  - All non-XSL tags are part of the XML output
- Common XSL operations:
  - template with a match XPath
  - Recursive call to apply-templates, which may also select where it should be applied
- Attach to an XML document with an xml-stylesheet processing instruction

An Example XSLT Stylesheet
This is DBLP …

XML Data
[Figure: tree view of the DBLP document. The root has an ?xml processing instruction and a dblp element. Under dblp, a mastersthesis element (attributes mdate, key; children author, title, year, school) and an article element (attributes mdate, key; children editor, title, year, journal, volume, ee). Legend: root, p-i, element, attribute, text.]

XSLT Processing Model
- List of source nodes → result tree fragment(s)
- Start with root
- Find all template rules with matching patterns from root
  - Find “best” match according to some heuristics
  - Set the current node list to be the set of things it matches
- Iterate over each node in the current node list
  - Apply the operations of the template
  - “Append” the results of the matching template rule to the result tree structure
  - Repeat recursively if specified to by apply-templates
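The recursive shape of this processing model can be sketched in a few lines of Python. This is not an XSLT processor: rules are keyed by element name only (no XPath patterns, no priorities), output is a plain string, and the names apply_templates, default_rule, and RULES are hypothetical. The default rule is a rough stand-in for XSLT's built-in rules, emitting text and descending into children.

```python
import xml.etree.ElementTree as ET

def default_rule(node, recurse):
    # Stand-in for the built-in rules: output the node's text, then descend.
    return (node.text or "").strip() + recurse(node)

def apply_templates(node, rules):
    """Run the rule that matches this node; the rule decides whether to recurse."""
    def recurse(parent):
        # Equivalent in spirit to apply-templates over the child nodes.
        return "".join(apply_templates(child, rules) for child in parent)
    rule = rules.get(node.tag, default_rule)
    return rule(node, recurse)

# Hypothetical rules: wrap the document in a list, render each article's title.
RULES = {
    "dblp": lambda node, recurse: "<ul>" + recurse(node) + "</ul>",
    "article": lambda node, recurse: "<li>" + node.findtext("title", "") + "</li>",
}

doc = ET.fromstring(
    "<dblp><article><title>A</title></article>"
    "<mastersthesis><title>B</title></mastersthesis></dblp>"
)
result = apply_templates(doc, RULES)
```

The article rule never calls recurse, so processing stops there, while mastersthesis has no rule and falls through to the default, which is why its title text still reaches the output.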

What If There’s More than One Match?
- Eliminate rules of lower precedence due to importing
- Break a rule into any | branches and consider separately
- Choose rule with highest computed or specified priority
- Simple rules for computing priority based on “precision”:
  - QName preceded by XPath child/attribute axis specifier: priority 0
  - NCName:* preceded by child/attribute axis specifier: priority -0.25
  - NodeTest preceded by child/attribute axis specifier: priority -0.5
  - else priority 0.5
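As a rough illustration of the default-priority rules, the sketch below classifies a pattern with regular expressions. It is an approximation under stated assumptions: names are matched with a simplified ASCII pattern, the processing-instruction(Literal) case is not handled, and splitting on | is naive (it would break on a | inside a predicate). The function names default_priority and branch_priorities are hypothetical.

```python
import re

NAME = r"[A-Za-z_][\w.-]*"                # simplified stand-in for an NCName
AXIS = r"(?:@|child::|attribute::)?"      # optional child/attribute axis specifier

def default_priority(pattern):
    """Approximate default priority of a single pattern without | branches."""
    p = pattern.strip()
    if re.fullmatch(AXIS + rf"(?:{NAME}:)?{NAME}", p):
        return 0.0       # a plain QName such as article or @key
    if re.fullmatch(AXIS + rf"{NAME}:\*", p):
        return -0.25     # a namespace wildcard such as dbl:*
    if re.fullmatch(AXIS + r"(?:\*|(?:node|text|comment|processing-instruction)\(\))", p):
        return -0.5      # a bare node test such as * or text()
    return 0.5           # anything more specific: multiple steps, predicates

def branch_priorities(pattern):
    # Each | branch is treated as its own rule, as the slide describes.
    return [default_priority(branch) for branch in pattern.split("|")]
```

Under these rules a template matching dblp/article outranks one matching article, which in turn outranks one matching *.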

Other Common Operations
- Iteration: xsl:for-each
- Conditionals: xsl:if, xsl:choose
- Copying current node and children to the result set: xsl:copy-of

Creating Output Nodes
- Return text/attribute data (this is a default rule): xsl:value-of
- Create an element from text (attribute is similar): xsl:element, xsl:attribute
- Copy nodes matching a path: xsl:copy-of

Embedding Stylesheets
- You can “import” or “include” one stylesheet from another (xsl:import, xsl:include):
  - “Include”: the rules get the same precedence as in the including stylesheet
  - “Import”: the rules are given lower precedence

XSLT Summary
- A very powerful, template-based transformation language for XML document → other structured document
- Commonly used to convert XML → PDF, SVG, GraphViz DOT format, HTML, WML, …
- Primarily useful for presentation of XML or for very simple conversions

What if we want to:
- Manage and combine collections of XML documents?
- Make Web service requests for XML?
- “Glue together” different Web service requests?
- Query for keywords within documents, with ranked answers?

This is where XQuery plays a role – see CIS 330 / 550 for details

Now… How Do We Crawl the Web and Get Data?
- A few remarks on basic crawlers…
- … Then an XML-specific crawler

Crawling the Web: The Basic Process
- Start with some initial page P0
- Collect all URLs from P0 and add to the crawler queue
  - Consider tag, anchor links, optionally image links, CSS, DTDs, scripts
- Considerations:
  - What order to traverse (polite to do BFS – why?)
  - How deep to traverse
  - What to ignore (coverage)
  - How to escape “spider traps” and avoid cycles
  - How often to crawl
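The basic process can be sketched as a breadth-first traversal with a visited set (to avoid cycles) and a depth limit (a crude guard against spider traps). This is a sketch, not a real crawler: fetch_links is a hypothetical callback standing in for "download the page and extract its URLs", and the tiny WEB dictionary is made-up data so the example runs without network access.

```python
from collections import deque

def crawl(start, fetch_links, max_depth=2):
    """Return pages in BFS order, visiting each URL at most once."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                      # do not follow links any deeper
        for link in fetch_links(url):
            if link not in seen:          # cycle avoidance
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical link graph: p0 and a link to each other, forming a cycle.
WEB = {"p0": ["a", "b"], "a": ["p0", "c"], "b": ["c"], "c": ["d"], "d": []}
visited = crawl("p0", lambda url: WEB.get(url, []), max_depth=2)
```

With a FIFO queue, consecutive requests tend to spread across many sites rather than drilling into one server, which is one reason BFS is considered the polite order.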

Essential Crawler Etiquette
- Robot exclusion protocols
- First, ignore pages with:
- Second, look for robots.txt at root of web server
  - See
  - To exclude all robots from a server:
      User-agent: *
      Disallow: /
  - To exclude one robot from two directories:
      User-agent: BobsCrawler
      Disallow: /news/
      Disallow: /tmp/
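One way to honor robots.txt rules like the two above is Python's standard urllib.robotparser module. The sketch below parses the rules from a string instead of fetching them, so it runs offline; the helper name allowed, the agent name OtherBot, and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets shown on the slide.
EXCLUDE_ALL = "User-agent: *\nDisallow: /\n"
EXCLUDE_BOB = "User-agent: BobsCrawler\nDisallow: /news/\nDisallow: /tmp/\n"

def allowed(robots_txt, agent, url):
    """Return True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())   # parse text we already have
    return parser.can_fetch(agent, url)

bob_news = allowed(EXCLUDE_BOB, "BobsCrawler", "http://example.com/news/today.html")
bob_home = allowed(EXCLUDE_BOB, "BobsCrawler", "http://example.com/index.html")
other_news = allowed(EXCLUDE_BOB, "OtherBot", "http://example.com/news/today.html")
anyone = allowed(EXCLUDE_ALL, "OtherBot", "http://example.com/index.html")
```

A real crawler would fetch robots.txt once per host (RobotFileParser also offers set_url and read for that) and cache the parsed result rather than re-parsing per URL.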

Suppose We Want to Crawl XML Documents Based on User Interests
- We need several parts:
  - A list of “interests” – expressed in an executable form, perhaps XPath queries
  - A crawler – goes out and fetches XML content
  - A filter / routing engine – matches XML content against users’ interests, sends them the content if it matches

XML-Based Information Dissemination
Basic model (XFilter, YFilter, Xyleme):
- Users are interested in data relating to a particular topic, and know the schema
    /politics/usa//body
- A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties

Engine for XFilter [Altinel & Franklin 00]

How Does It Work?
- Each XPath segment is basically a subset of regular expressions over element tags
  - Convert into finite state automata
- Parse data as it comes in – use SAX API
  - Match against finite state machines
- Most of these systems use modified FSMs because they want to match many patterns at the same time
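The idea of running a path expression as a state machine over SAX events can be sketched as follows. This handles one query at a time and uses a plain nondeterministic automaton, not XFilter's modified FSM; it assumes linear paths built only from /, //, element names, and *, with no predicates. The names parse_path, PathMatcher, and matches are hypothetical.

```python
import re
import xml.sax

def parse_path(xpath):
    """Split '/a//b' into [('/', 'a'), ('//', 'b')]."""
    return re.findall(r"(//|/)([^/]+)", xpath)

class PathMatcher(xml.sax.ContentHandler):
    def __init__(self, xpath):
        super().__init__()
        self.steps = parse_path(xpath)
        self.stack = [{0}]        # active states for each open element depth
        self.matched = False

    def startElement(self, name, attrs):
        active = set()
        for state in self.stack[-1]:
            axis, test = self.steps[state]
            if axis == "//":
                active.add(state)             # the step may still match deeper down
            if test in ("*", name):
                if state + 1 == len(self.steps):
                    self.matched = True       # reached the final state
                else:
                    active.add(state + 1)
        self.stack.append(active)

    def endElement(self, name):
        self.stack.pop()                      # restore the parent's active states

def matches(xpath, xml_text):
    handler = PathMatcher(xpath)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.matched
```

Keeping a stack of state sets means closing an element simply restores the states that were active in its parent, so the document is processed in a single streaming pass.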

Path Nodes and FSMs
- XPath parser decomposes XPath expressions into a set of path nodes
- These nodes act as the states of corresponding FSM
  - A node in the Candidate List denotes the current state
  - The rest of the states are in corresponding Wait Lists
- Simple FSM for /politics/usa//body: states Q1_1 (politics), Q1_2 (usa), Q1_3 (body)

Decomposing Into Path Nodes
Each path node records:
- Query ID
- Position in state machine
- Relative Position (RP) in tree:
  - 0 for root node if it’s not preceded by “//”
  - -1 for any node preceded by “//”
  - Else = 1 + (number of “*” nodes from predecessor node)
- Level:
  - If current node has fixed distance from root, then 1 + distance
  - Else if RP = –1, then –1, else 0
- Finally, NextPathNodeSet points to next node
[Figure: path nodes Q1-1, Q1-2, Q1-3 for Q1 and Q2-1, Q2-2, Q2-3 for Q2 = //usa/*/body/p]
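The RP and Level rules above can be turned into a small decomposition routine. This is a sketch of one reading of those rules, not the XFilter source: it assumes linear paths without predicates, treats * steps as contributing only to the next node's RP, and makes a guess for edge cases the slide does not cover (for example, a * directly after //). PathNode and decompose are hypothetical names.

```python
import re
from collections import namedtuple

PathNode = namedtuple("PathNode", "query_id position element rel_pos level")

def decompose(query_id, xpath):
    nodes = []
    stars = 0         # "*" steps seen since the previous path node
    loose = False     # a "//" was seen since the previous path node
    fixed = True      # True while the distance from the root is still known
    depth = 0         # depth of the current step, meaningful only while fixed
    for axis, name in re.findall(r"(//|/)([^/]+)", xpath):
        depth += 1
        if axis == "//":
            loose, fixed = True, False
        if name == "*":
            stars += 1                    # wildcards do not become path nodes
            continue
        if loose:
            rel_pos = -1
        elif depth == 1:
            rel_pos = 0                   # root node not preceded by "//"
        else:
            rel_pos = 1 + stars
        level = depth if fixed else (-1 if rel_pos == -1 else 0)
        nodes.append(PathNode(query_id, len(nodes) + 1, name, rel_pos, level))
        stars, loose = 0, False
    return nodes

q1 = decompose("Q1", "/politics/usa//body")
q2 = decompose("Q2", "//usa/*/body/p")
```

For Q2 the wildcard produces no node of its own; it only raises the RP of body to 2, which is how the engine later knows body must sit two levels below usa.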

Query Index
- Query index entry for each XML tag
- Two lists: Candidate List (CL) and Wait List (WL) divided across the nodes
- “Live” queries’ states are in CL; “pending” queries + states are in WL
- Events that cause state transition are generated by the XML parser
[Figure: query index with entries for politics, usa, body, and p, each holding a CL and a WL populated with path nodes Q1-1 through Q2-3]

Encountering an Element
- Look up the element name in the Query Index and examine all nodes in the associated CL
- Validate that we actually have a match
[Figure: on startElement: politics, the Query Index entry for politics holds Q1-1 in its CL; each path node carries Query ID, Position, Rel. Position, Level, and NextPathNodeSet]

Validating a Match
- We first check that the current XML depth matches the level in the user query:
  - If level in CL node is less than 1, then ignore height
  - else level in CL node must = height
- This ensures we’re matching at the right point in the tree!
- Finally, we validate any predicates against attributes (e.g.,

Processing Further Elements
- Queries that don’t meet validation are removed from the Candidate Lists
- For other queries, we advance to the next state
  - We copy the next node of the query from the WL to the CL, and update the RP and level
  - When we reach a final state (e.g., Q1-3), we can output the document to the subscriber
- When we encounter an end element, we must remove that element from the CL
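Putting the last few slides together, the sketch below runs several queries at once over SAX events using a Candidate List per element name. It is a simplification under stated assumptions: path nodes are supplied by hand as (element, RP, level) tuples, the Wait Lists are implicit in each query's node list, predicates are ignored, and nodes that fail validation are skipped rather than removed. MiniXFilter and run_filter are hypothetical names, not the XFilter implementation.

```python
import xml.sax
from collections import defaultdict

class MiniXFilter(xml.sax.ContentHandler):
    def __init__(self, queries):
        super().__init__()
        self.queries = queries              # query id -> [(element, RP, level), ...]
        self.cl = defaultdict(list)         # element -> [(query id, position, level)]
        for qid, nodes in queries.items():
            element, _, level = nodes[0]
            self.cl[element].append((qid, 0, level))   # first node starts in the CL
        self.depth = 0
        self.promoted = []                  # CL additions made by each open element
        self.matched = set()

    def startElement(self, name, attrs):
        self.depth += 1
        added = []
        for qid, pos, level in list(self.cl[name]):
            if level >= 1 and level != self.depth:
                continue                    # level check failed at this depth
            nodes = self.queries[qid]
            if pos + 1 == len(nodes):
                self.matched.add(qid)       # final state: deliver the document
                continue
            element, rp, _ = nodes[pos + 1]
            # Promote the next node, fixing its level from the current depth.
            entry = (qid, pos + 1, -1 if rp == -1 else self.depth + rp)
            self.cl[element].append(entry)
            added.append((element, entry))
        self.promoted.append(added)

    def endElement(self, name):
        for element, entry in self.promoted.pop():
            self.cl[element].remove(entry)  # undo this element's promotions
        self.depth -= 1

def run_filter(queries, xml_text):
    engine = MiniXFilter(queries)
    xml.sax.parseString(xml_text.encode("utf-8"), engine)
    return engine.matched

# Path nodes for Q1 = /politics/usa//body and Q2 = //usa/*/body/p.
QUERIES = {
    "Q1": [("politics", 0, 1), ("usa", 1, 2), ("body", -1, -1)],
    "Q2": [("usa", -1, -1), ("body", 2, 0), ("p", 1, 0)],
}
```

Because the index is keyed by element name, each start tag touches only the queries currently waiting on that name, which is what lets one pass over the document serve many subscriptions.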

Publish-Subscribe Model Summarized
- Well-suited to an XML format called RSS (Rich Site Summary or Really Simple Syndication)
  - Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles
- Seems like a perfect fit for publish-subscribe models!