ITCS 6265 Information Retrieval & Web Mining Lecture 18-A Fall 2009.

Slides:



Advertisements
Similar presentations
XML: Extensible Markup Language
Advertisements

XML May 3 rd, XQuery Based on Quilt (which is based on XML-QL) Check out the W3C web site for the latest. XML Query data model –Ordered !
1 XSLT – eXtensible Stylesheet Language Transformations Modified Slides from Dr. Sagiv.
XQuery Or, what about REAL databases?. XQuery - its place in the XML team XLink XSLT XQuery XPath XPointer.
CS276 Information Retrieval and Web Search Lecture 10: XML Retrieval.
Information Retrieval in Practice
Search Engines and Information Retrieval
Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.
XML Prashant Karmarkar Brendan Nolan Alexander Roda.
1 COS 425: Database and Information Management Systems XML and information exchange.
CS276B Text Retrieval and Mining Winter 2005 Lecture 12.
XML(EXtensible Markup Language). XML XML stands for EXtensible Markup Language. XML is a markup language much like HTML. XML was designed to describe.
XML –Query Languages, Extracting from Relational Databases ADVANCED DATABASES Khawaja Mohiuddin Assistant Professor Department of Computer Sciences Bahria.
Introduction to XML This material is based heavily on the tutorial by the same name at
Overview of Search Engines
Overview of XPath Author: Dan McCreary Date: October, 2008 Version: 0.2 with TEI Examples M D.
4/20/2017.
RDF (Resource Description Framework) Why?. XML XML is a metalanguage that allows users to define markup XML separates content and structure from formatting.
XML Anisha K J Jerrin Thomas. Outline  Introduction  Structure of an XML Page  Well-formed & Valid XML Documents  DTD – Elements, Attributes, Entities.
Search Engines and Information Retrieval Chapter 1.
Lecture 21 XML querying. 2 XSL (eXtensible Stylesheet Language) In HTML, default styling is built into browsers as tag set for HTML is predefined and.
Lecture 6 of Advanced Databases XML Schema, Querying & Transformation Instructor: Mr.Ahmed Al Astal.
1 XML at a neighborhood university near you Innovation 2005 September 16, 2005 Kwok-Bun Yue University of Houston-Clear Lake.
TDDD43 XML and RDF Slides based on slides by Lena Strömbäck and Fang Wei-Kleiner 1.
Introduction to XML Eugenia Fernandez IUPUI. What is XML? From the World Wide Web Consortium (W3C) The Extensible Markup Language (XML) is the universal.
An Introduction to XML Presented by Scott Nemec at the UniForum Chicago meeting on 7/25/2006.
XML Retrieval with slides of C. Manning und H.Schutze 04/12/2008.
CISC 3140 (CIS 20.2) Design & Implementation of Software Application II Instructor : M. Meyer Address: Course Page:
Introduction to XML. XML - Connectivity is Key Need for customized page layout – e.g. filter to display only recent data Downloadable product comparisons.
ITCS 6010 SALT. Speech Application Language Tags (SALT) Speech interface markup language Extension of HTML and other markup languages Adds speech and.
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 16 3 Dec 2002.
1 CIS336 Website design, implementation and management (also Semester 2 of CIS219, CIS221 and IT226) Lecture 6 XSLT (Based on Møller and Schwartzbach,
1 XML: an introduction David Nathan. 2 XML  an in-line markup system  single sequence of plain text only (but can be unicode)  equivalent to a tree.
 XML is designed to describe data and to focus on what data is. HTML is designed to display data and to focus on how data looks.  XML is created to structure,
Intro. to XML & XML DB Bun Yue Professor, CS/CIS UHCL.
Chapter 27 The World Wide Web and XML. Copyright © 2004 Pearson Addison-Wesley. All rights reserved.27-2 Topics in this Chapter The Web and the Internet.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
1 Searching XML Documents via XML Fragments D. Camel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer Presented by Hui Fang.
Quality of a search engine Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 8.
ISP 433/533 Week 11 XML Retrieval. Structured Information Traditional IR –Unit of information: terms and documents –No structure Need more granularity.
Database Systems Part VII: XML Querying Software School of Hunan University
WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 11.
Chapter 27 The World Wide Web and XML. Copyright © 2004 Pearson Addison-Wesley. All rights reserved.27-2 Topics in this Chapter The Web and the Internet.
XML query. introduction An XML document can represent almost anything, and users of an XML query language expect it to perform useful queries on whatever.
What it is and how it works
1 XML eXtensible Markup Language. 2 XML vs. HTML HTML is a HyperText Markup language HTML is a HyperText Markup language Designed for a specific application,
XML Engr. Faisal ur Rehman CE-105T Spring Definition XML-EXTENSIBLE MARKUP LANGUAGE: provides a format for describing data. Facilitates the Precise.
The Semistructured-Data Model Programming Languages for XML Spring 2011 Instructor: Hassan Khosravi.
XML and Database.
CS 157B: Database Management Systems II February 11 Class Meeting Department of Computer Science San Jose State University Spring 2013 Instructor: Ron.
1 Indexing The syntax for creating a index is: CREATE [UNIQUE] INDEX index_name ON table_name (column1, column2,... column_n) [ COMPUTE STATISTICS ]; Why.
CS276 Information Retrieval and Web Search Lecture 10: XML Retrieval.
Sept. 27, 2002 ISDB’02 Transforming XPath Queries for Bottom-Up Query Processing Yoshiharu Ishikawa Takaaki Nagai Hiroyuki Kitagawa University of Tsukuba.
COMP9321 Web Application Engineering Semester 2, 2015 Dr. Amin Beheshti Service Oriented Computing Group, CSE, UNSW Australia Week 4 1COMP9321, 15s2, Week.
+ 1 XML eXtensible Markup Language. + 2 XML Lecture Adapted from the work of Dr. Praveen Madiraju of Marquette University.
©Silberschatz, Korth and Sudarshan10.1Database System Concepts W3C - The World Wide Web Consortium W3C - The World Wide Web Consortium.
Martin Kruliš by Martin Kruliš (v1.1)1.
Working with XML. Markup Languages Text-based languages based on SGML Text-based languages based on SGML SGML = Standard Generalized Markup Language SGML.
CS276A Text Information Retrieval, Mining, and Exploitation Lecture Nov 2002.
Chapter 5 The Semantic Web 1. The Semantic Web  Initiated by Tim Berners-Lee, the inventor of the World Wide Web.  A common framework that allows data.
XML – Basic Concepts (modified version from Dr. Praveen Madiraju) 2015, Fall Pusan National University Ki-Joune Li.
XML Extensible Markup Language
XML Notes taken from w3schools. What is XML? XML stands for EXtensible Markup Language. XML was designed to store and transport data. XML was designed.
XML and Distributed Applications By Quddus Chong Presentation for CS551 – Fall 2001.
1 XML eXtensible Markup Language. 2 Introduction and Motivation Dr. Praveen Madiraju Modified from Dr.Sagiv’s slides.
XML: Extensible Markup Language
XML Indexing and Search
XML in Web Technologies
Lecture 9: XML Monday, October 17, 2005.
CS276B Text Retrieval and Mining Winter 2005
Presentation transcript:

ITCS 6265 Information Retrieval & Web Mining Lecture 18-A Fall 2009

What is XML? eXtensible Markup Language A framework for defining markup languages No fixed collection of markup tags Each XML language targeted for application All XML languages share features Enables building of generic tools

Basic Structure An XML document is an ordered, labeled tree character data leaf nodes (text nodes) contain the actual data (text strings) element nodes, are each labeled with a name (often called the element type), and a set of attributes (attribute nodes), each consisting of a name and a value can have child (element) nodes

XML Example

FileCab This chapter describes the commands that manage the FileCab inet application.

Elements Elements are denoted by markup tags thetext Start/opening tag: foo Attribute: attr1 The character data: thetext Matching end/closing tag:

XML vs HTML HTML is a markup language for a specific purpose (display in browsers) XML is a framework for defining markup languages HTML can be formalized as an XML language (XHTML) XML defines logical structure only HTML: same intention, but has evolved into a presentation language

XML: Design Goals Separate syntax from semantics to provide a common framework for structuring information Allow tailor-made markup for any imaginable application domain Support internationalization (Unicode) and platform independence Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML? Represent semi-structured data (data that are structured, but don’t fit relational model) XML is more flexible than DBs XML is more structured than simple IR You get a massive infrastructure for free

Applications of XML XHTML CML – chemical markup language WML – wireless markup language ThML – theological markup language Having a Humble Opinion of Self EVERY man naturally desires knowledge Aristotle, Metaphysics, i. 1. ; but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars. Augustine, Confessions V. 4.

XML Schemas Schema = syntax definition of XML language Schema language = formal language for expressing XML schemas Examples Document Type Definition XML Schema (W3C) Relevance for XML IR Our job is much easier if we have a (one) schema

XML Tutorial (Anders Møller and Michael Schwartzbach) Previous (and some following) slides are based on their tutorial

XML Indexing and Search

Native XML Database Uses XML document as logical unit Should support Elements Attributes PCDATA (parsed character data) Document order Contrast with DB modified for XML Generic IR system modified for XML

XML Indexing and Search Most native XML databases have taken a DB approach Exact match Evaluate path expressions No IR type relevance ranking Only a few that focus on relevance ranking

Data vs. Text-centric XML Data-centric XML: used for messaging between enterprise applications Mainly a recasting of relational data Content-centric XML: used for annotating content Rich in text Demands good integration of text retrieval functionality E.g., find me the ISBN #s of Book s with at least three Chapter s discussing cocoa production, ranked by Price

IR XML Challenge 1: Term Statistics There is no document unit in XML How do we compute tf and idf? Global tf/idf over all text context is useless Indexing granularity

IR XML Challenge 2: Fragments IR systems don’t store content (only index) Need to go to document for retrieving/displaying fragment E.g., give me the Abstract s of Paper s on existentialism Where do you retrieve the Abstract from? Easier in DB framework

IR XML Challenges 3: Schemas Ideally: There is one schema User understands schema In practice: rare Many schemas Schemas not known in advance Schemas change Users don’t understand schemas Need to identify similar elements in different schemas Example: employee

IR XML Challenges 4: UI Help user find relevant nodes in schema Author, editor, contributor, “from:”/sender What is the query language you expose to the user? Specific XML query language? No. Forms? Parametric search? A textbox? In general: design layer between XML and user

IR XML Challenges 5: using a DB Why you don’t want to use a DB Spelling correction Contains vs “is about” DB has no notion of ordering Relevance ranking

Querying XML This lecture: XQuery Next lecture: Vector space approaches

XQuery SQL for XML Usage scenarios Human-readable documents Data-oriented documents Mixed documents (e.g., patient records) Relies on XPath XML Schema data types Turing complete

XQuery The principal forms of XQuery expressions are: path expressions FLWR ("flower") expressions element constructors list/sequence expressions conditional expressions quantified expressions Evaluated with respect to a context

FLWR FOR $p IN document("bib.xml")//publisher LET $b := document("bib.xml”)//book[publisher = $p] WHERE count($b) > 100 RETURN $p FOR generates an ordered list of bindings of publisher names to $p LET associates to each binding a further binding of the list of book elements with that publisher to $b at this stage, we have an ordered list of tuples of bindings: ($p,$b) WHERE filters that list to retain only the desired tuples RETURN constructs for each tuple a resulting value

How XQuery makes ranking difficult All documents in set A must be ranked above all documents in set B. Fragments must be ordered in depth-first, left-to-right order.

XQuery: Order By Clause for $d in document("depts.xml")//deptno let $e := document("emps.xml")//emp[deptno = $d] where count($e) >= 10 order by avg($e/salary) descending return { $d, {count($e)}, {avg($e/salary)} }

XQuery Order By Clause Order by clause only allows ordering by “overt” criterion Say by an attribute value Relevance ranking Is often proprietary Can’t be expressed easily as function of set to be ranked Is better abstracted out of query formulation (cf. www)

Resources Jan-Marco Bremer’s publications on xml and ir: XML resources at W3C Ronald Bourret on native XML databases: htm htm Norbert Fuhr and Kai Grossjohann. XIRQL: A query language for information retrieval in XML documents. In Proceedings of the 24th International ACM SIGIR Conference, New Orleans, Louisiana, September 2001.