CS276A Text Information Retrieval, Mining, and Exploitation Lecture 15 26 Nov 2002.


Recap: Web Anatomy www.ibm.com…/~newbie//…/…/leaf.htm

Recap: Size of the Web
Capture-recapture technique
- Assumes engines get independent random subsets of the Web
- E2 contains x% of E1. Assume E2 contains x% of the Web as well
- Knowing the size of E2, compute the size of the Web: Size of the Web = 100*E2/x
[Figure: engines E1 and E2 as subsets of the Web]
Bharat & Broder: 200 M (Nov 97), 275 M (Mar 98)
Lawrence & Giles: 320 M (Dec 97)
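The estimate above can be worked through in a few lines. This is a minimal sketch: the function names, the sample URLs and the engine size are hypothetical, chosen only to illustrate the arithmetic, and are not measurements from the lecture.

```python
def overlap_fraction(sample_from_e1, e2_index):
    """Fraction of a sample of E1's pages that E2 also indexes (x / 100 on the slide)."""
    if not sample_from_e1:
        raise ValueError("sample must be non-empty")
    hits = sum(1 for url in sample_from_e1 if url in e2_index)
    return hits / len(sample_from_e1)


def estimate_web_size(size_e2, fraction):
    """Size of the Web = E2 / fraction, i.e. 100 * E2 / x with x in percent."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return size_e2 / fraction


# Hypothetical numbers, for illustration only.
sample = ["u1", "u2", "u3", "u4"]          # pages sampled from engine E1
e2 = {"u1", "u2", "u9"}                    # pages indexed by engine E2
x = overlap_fraction(sample, e2)           # 2 of the 4 sampled pages are in E2
web_size = estimate_web_size(100_000_000, x)
```

The estimate is only as good as the independence assumption: if both engines favor the same popular pages, the overlap is inflated and the Web size is underestimated.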

Recent Measurements Source: http://www.searchengineshowdown.com/stats/change.shtml

Today’s Topics Web IR infrastructure Search deployment XML intro XML indexing and search

Web IR Infrastructure Connectivity Server Fast access to links to support link analysis Term Vector Database Fast access to document vectors to augment link analysis

Connectivity Server [CS1: Bhar98b, CS2 & 3: Rand01]
Fast web graph access to support connectivity analysis
Stores mappings in memory from URL to outlinks, URL to inlinks
Applications
- HITS, PageRank computations
- Crawl simulation
- Graph algorithms: web connectivity, diameter etc.
- Visualizations

Usage
- Input: graph algorithm + URLs + values; URLs to FPs to IDs
- Execution: graph algorithm runs in memory; IDs to URLs
- Output: URLs + values
Translation tables on disk
- URL text: 9 bytes/URL (compressed from ~80 bytes)
- FP(64b) -> ID(32b): 5 bytes
- ID(32b) -> FP(64b): 8 bytes
- ID(32b) -> URLs: 0.5 bytes

ID assignment
- Partition URLs into 3 sets, sorted lexicographically
  - High: Max degree > 254
  - Medium: 254 > Max degree > 24
  - Low: remaining (75%)
- IDs assigned in sequence (densely)
E.g., HIGH IDs: Max(indegree, outdegree) > 254
ID        URL
…
9891      www.amazon.com/
9912      www.amazon.com/jobs/
…
9821878   www.geocities.com/
…
40930030  www.google.com/
…
85903590  www.yahoo.com/
Adjacency lists
- In-memory tables for outlinks and inlinks
- List index maps from an ID to the start of its adjacency list

Adjacency List Compression - I
Sequence of adjacency lists (list index entries … 104 105 106 …):
  … 98 132 153 | 98 147 153 …
Delta-encoded adjacency lists (same list index):
  … -6 34 21 | -8 49 6 …
The numbers are consistent with storing the first entry of a list relative to the source ID (98 - 104 = -6, 98 - 106 = -8) and each later entry relative to the previous one (132 - 98 = 34, 153 - 132 = 21).
Adjacency list:
- Smaller delta values are exponentially more frequent (80% to same host)
- Compress deltas with variable length encoding (e.g., Huffman)
List index pointers: 32b for high, Base+16b for med, Base+8b for low
- Avg = 12b per pointer
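The delta encoding on this slide can be sketched as follows. The slide suggests Huffman coding for the deltas; the sketch below swaps in a simpler zigzag plus variable-length byte code as a stand-in, which is an assumption and not the Connectivity Server's actual format. All function names are hypothetical.

```python
def delta_encode(source_id, adjacency):
    """First delta is relative to the source ID, later ones to the previous target."""
    deltas, prev = [], source_id
    for target in sorted(adjacency):
        deltas.append(target - prev)
        prev = target
    return deltas


def delta_decode(source_id, deltas):
    """Inverse of delta_encode: running sum starting from the source ID."""
    targets, prev = [], source_id
    for d in deltas:
        prev += d
        targets.append(prev)
    return targets


def varint_encode(deltas):
    """Stand-in for Huffman: zigzag-map signed deltas, then emit 7 bits per byte."""
    out = bytearray()
    for d in deltas:
        z = (d << 1) if d >= 0 else ((-d) << 1) - 1   # small |d| -> small z
        while z >= 0x80:
            out.append((z & 0x7F) | 0x80)             # high bit: more bytes follow
            z >>= 7
        out.append(z)
    return bytes(out)


def varint_decode(data):
    """Inverse of varint_encode."""
    values, cur, shift = [], 0, 0
    for byte in data:
        cur |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(cur >> 1 if cur % 2 == 0 else -((cur + 1) >> 1))
            cur, shift = 0, 0
    return values


deltas_104 = delta_encode(104, [98, 132, 153])
deltas_106 = delta_encode(106, [98, 147, 153])
packed = varint_encode(deltas_104)
```

Because most links stay on the same host and IDs are assigned in lexicographic URL order, most deltas are small and fit in a single byte under this kind of code.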

List Index Pointers URL Info LC:TID … FRQ:RL … Base (4 bytes) Offsets For 16 IDs offset ID to adjacency list lookup ID Adjacency lists

Adjacency List Compression - II
Inter-list compression
- Basis: similar URLs may share links; close in ID space => adjacency lists may overlap
- Approach: define a representative adjacency list for a block of IDs
  - Adjacency list of a reference ID
  - Union of adjacency lists in the block
- Represent an adjacency list in terms of deletions and additions when it is cheaper to do so
Measurements
- Intra List + Starts: 8-11 bits per link (580M pages/16GB RAM)
- Inter List: 5.4-5.7 bits per link (870M pages/16GB RAM)
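The "deletions and additions" idea can be illustrated with a small sketch. The cost model here (count the number of stored entries) and the function names are assumptions made for illustration; the measured system compares encoded bit costs.

```python
def encode_against_reference(adjacency, reference):
    """Store a list as edits to a reference list when that needs fewer entries."""
    adj, ref = set(adjacency), set(reference)
    deletions = sorted(ref - adj)      # in the reference but not in this list
    additions = sorted(adj - ref)      # in this list but not in the reference
    if len(deletions) + len(additions) < len(adj):
        return ("diff", deletions, additions)
    return ("plain", sorted(adj))      # fall back to storing the list itself


def decode_against_reference(encoded, reference):
    """Rebuild the adjacency list from either representation."""
    if encoded[0] == "plain":
        return list(encoded[1])
    _, deletions, additions = encoded
    return sorted((set(reference) - set(deletions)) | set(additions))


reference = [98, 132, 153]                       # list of a reference ID
similar = encode_against_reference([98, 147, 153], reference)
unrelated = encode_against_reference([1, 2], reference)
```

A list that overlaps heavily with the reference shrinks to a couple of edits, while an unrelated list is stored as is, which is the "when it is cheaper to do so" condition on the slide.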

Term Vector Database [Stat00]
Fast access to 50-word term vectors for web pages
- Term selection: restricted to the middle third of the lexicon by document frequency; top 50 words in document by TF.IDF
- Term weighting: deferred till run-time (can be based on term freq, doc freq, doc length)
Applications
- Content + connectivity analysis (e.g., Topic Distillation)
- Topic specific crawls
- Document classification
Performance
- Storage: 33GB for 272M term vectors
- Speed: 17 ms/vector on AlphaServer 4100 (latency to read a disk block)
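The term selection rule can be sketched as below. The exact TF.IDF formula (raw term frequency times log of inverse document frequency) and all names are assumptions; the slide only states the middle-third restriction and the top-50 cutoff. The vector stores raw frequencies, matching the slide's point that weighting is deferred to run-time.

```python
import math
from collections import Counter


def middle_third_vocabulary(doc_freq):
    """Terms in the middle third of the lexicon when ranked by document frequency."""
    ranked = sorted(doc_freq, key=lambda t: (doc_freq[t], t))
    n = len(ranked)
    return set(ranked[n // 3 : n - n // 3])


def term_vector(tokens, doc_freq, num_docs, k=50):
    """Top-k allowed terms by TF.IDF, stored with raw frequencies only."""
    allowed = middle_third_vocabulary(doc_freq)
    tf = Counter(t for t in tokens if t in allowed)
    ranked = sorted(tf, key=lambda t: (-tf[t] * math.log(num_docs / doc_freq[t]), t))
    return [(t, tf[t]) for t in ranked[:k]]


# Hypothetical collection statistics and document.
doc_freq = {"the": 100, "web": 40, "graph": 30, "hits": 20, "qqq": 2, "zyx": 1}
tokens = ["the", "web", "graph", "graph", "hits", "zyx"]
vector = term_vector(tokens, doc_freq, num_docs=100)
```

Dropping the most frequent third removes stopword-like terms and dropping the rarest third removes typos and one-off strings, so the remaining terms are the ones most useful for topical comparison.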

Architecture URL Info LC:TID … FRQ:RL … 128 Byte TV Record Terms Freq Base (4 bytes) Bit vector For 480 URLids offset URLid to Term Vector Lookup URLid * 64 /480

Search Deployment Web IR is just one (very specific) type of IR Commercially most important IR application: Enterprise search (large corporations) Problem different from Web IR Peer-2-Peer (P2P) search Another search deployment strategy

Enterprise Search Deployment Database Corporate Network Company Web Site E-Commerce Web Portals Enterprises Proprietary content Public content World Wide Web Sources Markets Search Boxes Content Location Content Management Groupware

Evolution of Enterprise Search
1st Generation (1985 - 1993): Classic Information Retrieval
- User: trained specialist
- Scope: small, closed collections
- Technology: pattern/string matching
2nd Generation (1994 - 1999): Driven by WWW
- User: everyone
- Scope: Intranet/Extranet
- Technology: pattern/string matching and external factors for relevance ranking + categorization
3rd Generation (2000+): Discovery (Text Mining)
- User: everyone and software agents
- Scope: structured, semistructured and unstructured information
- Technology: introduction of linguistic and semantic processing

Enterprise IR is a lot more than search … Security Cannot search what you should not read Content organization & creation Automatic classification Taxonomy generation Support for multiple languages, multiple formats Conduits into databases and other content management -- homes for “valuable” content Information processing tools Annotation Range searches Custom ranking criteria Cross lingual tools,… Individual preferences Personalization Notification, …

Peer-To-Peer (P2P) Search No central index Each node in a network builds and maintains own index Each node has “servent” software On booting, servent pings ~4 other hosts Connects to those that respond Initiates, propagates and serves requests

Which hosts to connect to? The ones you connected to last time Random hosts you know of Request suggestions from central (or hierarchical) nameservers All govern system’s shape and efficiency

Serving P2P search requests
- Send your request to your neighbors
- They send it to their neighbors
  - decrement “time to live” for query
  - query dies when ttl = 0
- Send search matches back along requesting path
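The flooding scheme above can be simulated on a small in-memory graph. This is a sketch under stated assumptions: the graph, the per-node indexes and the function name are hypothetical, and the `seen` set stands in for the duplicate-message suppression that a real servent would need.

```python
from collections import deque


def flood_query(graph, index, origin, term, ttl):
    """Flood a query from origin; return {matching node: path the query took}."""
    matches = {}
    seen = {origin}                                # nodes that already got the query
    queue = deque([(origin, ttl, [origin])])
    while queue:
        node, remaining, path = queue.popleft()
        if node != origin and term in index.get(node, set()):
            matches[node] = path                   # results travel back along reversed path
        if remaining == 0:
            continue                               # query dies when ttl = 0
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, remaining - 1, path + [neighbor]))
    return matches


# Hypothetical chain network a - b - c - d with per-node local indexes.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
index = {"c": {"jazz"}, "d": {"jazz"}}
near = flood_query(graph, index, "a", "jazz", ttl=2)
far = flood_query(graph, index, "a", "jazz", ttl=3)
```

With ttl=2 the query reaches c but not d, which shows how the time-to-live bounds both the network load and the recall of a search.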

Some P2P Networks Gnutella Kazaa Bearshare Aimster Grokster Morpheus

P2P: Information Retrieval Issues Why is this more difficult than centralized IR?

P2P: Information Retrieval Issues Selection of nodes to query Merging of results Spam

What is XML? eXtensible Markup Language A framework for defining markup languages No fixed collection of markup tags Each XML language targeted for application All XML languages share features Enables building of generic tools

Basic Structure
An XML document is an ordered, labeled tree
- Character data leaf nodes contain the actual data (text strings); data nodes must be non-empty and non-adjacent to other character data nodes
- Element nodes are each labeled with a name (often called the element type) and a set of attributes, each consisting of a name and a value; element nodes can have child nodes
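The tree view of a document can be made concrete with Python's standard `xml.etree.ElementTree` module. The element and attribute names in the sample document are hypothetical. Note that ElementTree exposes character data through an element's `text` attribute rather than as separate leaf nodes, so the sketch prints it one level below the element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; tag and attribute names are made up for illustration.
DOC = (
    '<chapter id="cmds">'
    "<title>FileCab</title>"
    "<para>This chapter describes the commands.</para>"
    "</chapter>"
)


def outline(elem, depth=0):
    """List element nodes (name + attributes) and their character data, in document order."""
    lines = ["  " * depth + f"{elem.tag} {elem.attrib}"]
    text = (elem.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + repr(text))   # character data leaf
    for child in elem:                                   # children are ordered
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(DOC)
tree_lines = outline(root)
print("\n".join(tree_lines))
```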

XML Example

FileCab This chapter describes the commands that manage the FileCab inet application.

Elements
Elements are denoted by markup tags: <foo attr1="…">thetext</foo>
- Element start tag: foo
- Attribute: attr1
- The character data: thetext
- Matching element end tag: </foo>

XML vs HTML Relationship?

XML vs HTML HTML is a markup language for a specific purpose (display in browsers) XML is a framework for defining markup languages HTML can be formalized as an XML language (XHTML) XML defines logical structure only HTML: same intention, but has evolved into a presentation language

XML: Design Goals Separate syntax from semantics to provide a common framework for structuring information Allow tailor-made markup for any imaginable application domain Support internationalization (Unicode) and platform independence Be the future of (semi)structured information (do some of the work now done by databases)

Why Use XML? Represent semi-structured data (data that are structured, but don’t fit relational model) XML is more flexible than DBs XML is more structured than simple IR You get a massive infrastructure for free

Applications of XML
- XHTML
- CML: chemical markup language
- WML: wireless markup language
- ThML: theological markup language
ThML example text: Having a Humble Opinion of Self. EVERY man naturally desires knowledge (Aristotle, Metaphysics, i. 1.); but what good is knowledge without fear of God? Indeed a humble rustic who serves God is better than a proud intellectual who neglects his soul to study the course of the stars. (Augustine, Confessions V. 4.)

XML Schemas Schema = syntax definition of XML language Schema language = formal language for expressing XML schemas Examples DTD XML Schema (W3C) Relevance for XML IR Our job is much easier if we have a (one) schema

XML Tutorial
http://www.brics.dk/~amoeller/XML/index.html (Anders Møller and Michael Schwartzbach)
Previous (and some following) slides are based on their tutorial

XML Indexing and Search

Native XML Database Uses XML document as logical unit Should support Elements Attributes PCDATA (parsed character data) Document order Contrast with DB modified for XML Generic IR system modified for XML

XML Indexing and Search
Most native XML databases have taken a DB approach
- Exact match
- Evaluate path expressions
- No IR type relevance ranking
Only a few focus on relevance ranking

Timber: XML as DB extension DB: search tuples Timber: search trees Main focus Complex and variable structure of trees (vs. tuples) Ordering XML query optimization vs relational optimization

ToXin Native XML database Exploits overall path structure Supports any general path query Query evaluation in three stages Preselection stage Selection stage Postselection stage

ToXin: Motivation Strawman: Index all paths occurring in database Does not allow backward navigation Example query: find all the titles of articles from 1990

Query Evaluation Stages
- Pre-selection: first navigation down the tree
- Selection: value selection according to filter
- Post-selection: navigation up and down again
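The three stages can be traced on the example query "find all the titles of articles from 1990". The sketch below walks an in-memory tree with Python's `xml.etree.ElementTree`; it illustrates the navigation pattern only and is not ToXin's index structure. The sample document and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography; element names are made up for illustration.
DOC = """
<bib>
  <article><title>Paths</title><year>1990</year></article>
  <article><title>Trees</title><year>1995</year></article>
  <book><title>Graphs</title><year>1990</year></book>
</bib>
"""


def titles_of_articles_from(root, year):
    # ElementTree has no parent pointers, so build a child -> parent map
    # to support the backward navigation step.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Pre-selection: navigate down the tree to the candidate value nodes.
    year_nodes = root.findall("./article/year")

    # Selection: keep the nodes whose value passes the filter.
    selected = [n for n in year_nodes if (n.text or "").strip() == year]

    # Post-selection: navigate up to the article, then down to its title.
    titles = []
    for node in selected:
        article = parent_of[node]
        titles.extend(t.text for t in article.findall("title"))
    return titles


root = ET.fromstring(DOC.strip())
result = titles_of_articles_from(root, "1990")
```

The step from `year` back up to `article` is the backward navigation that a forward-only path index cannot answer.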

Factors Impacting Performance Data source specific Document size Number of XML nodes and values Path complexity (degree of nesting) Average value size Query specific Selectiveness of path constraint Size of query answer Number of elements selected by filter

Benchmark Parameters

Query Classification

Evaluation

ToXin: Summary Efficient native XML database All paths are indexed (not just from root) Path index linear in corpus size Shortcomings Order of nodes ignored Semantics of IDRefs ignored What is missing?

IR/Relevance Ranking for XML Why is this difficult?

IR XML Challenge 1: Term Statistics There is no document unit in XML How do we compute tf and idf? Global tf/idf over all text context is useless Indexing granularity

IR XML Challenge 2: Fragments IR systems don’t store content (only index) Need to go to document for displaying fragment Easier in DB framework

Relevance Ranking for XML Will revisit next week

Querying XML Semistructured queries XPath XQuery

Types of (Semi)Structured Queries
- Location/position (“chapter no.3”)
- Simple attribute/value: /play/title contains “hamlet”
- Path queries: title contains “hamlet”; /play//title contains “hamlet”
- Complex graphs: employees with two managers
- All of the above: mixed structure/content
- Subsumes: hyperlinks

XPath
Declarative language for
- Addressing (used in XLink/XPointer and in XSLT)
- Pattern matching (used in XSLT and in XQuery)
Location path: a sequence of location steps separated by /
Example: child::section[position()<6] / descendant::cite / attribute::href

Axes in XPath
ancestor, ancestor-or-self, attribute, child, descendant, descendant-or-self, following, following-sibling, namespace, parent, preceding, preceding-sibling, self

Location steps
A single location step has the form: axis :: node-test [ predicate ]
- The axis selects a rough set of candidate nodes (e.g. the child nodes of the context node).
- The node-test performs an initial filtration of the candidates based on their types (chardata node, processing instruction, etc.), or names (e.g. element name).
- The predicates (zero or more) cause a further, potentially more complex, filtration.
Example: child::section[position()<6]
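To show what each location step contributes, the path child::section[position()<6] / descendant::cite / attribute::href is evaluated by hand below. Python's `xml.etree.ElementTree` supports only a subset of XPath and does not accept this unabbreviated axis syntax, so each step is written as plain Python instead. The sample document and function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: seven sections, each citing one reference.
DOC = "<doc>" + "".join(
    f'<section><p><cite href="r{i}"/></p></section>' for i in range(1, 8)
) + "</doc>"


def hrefs_cited_in_first_sections(context, limit=6):
    hrefs = []
    # Step 1, axis child + node-test section: child elements named "section".
    sections = [child for child in context if child.tag == "section"]
    # Predicate [position() < limit]: positions are 1-based, so keep limit - 1 nodes.
    for section in sections[: limit - 1]:
        # Step 2, descendant::cite: cite elements anywhere below the section.
        for cite in section.iter("cite"):
            if cite is section:
                continue            # iter() includes the start node; descendant does not
            # Step 3, attribute::href: the href attribute, if present.
            if "href" in cite.attrib:
                hrefs.append(cite.attrib["href"])
    return hrefs


root = ET.fromstring(DOC)
result = hrefs_cited_in_first_sections(root)
```

Each step takes the node set produced by the previous step as its context, which is why the loops nest in the same order as the steps in the path.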

XQuery
- SQL for XML
- Usage scenarios: human-readable documents; data-oriented documents; mixed documents (e.g., patient records)
- Relies on XPath and XML Schema datatypes
- Turing complete
XQuery is still a working draft. More than a hundred open issues as of 2002.11.10

XQuery The principal forms of XQuery expressions are: path expressions element constructors FLWR ("flower") expressions list expressions conditional expressions quantified expressions datatype expressions Evaluated with respect to a context

FLWR
  FOR $p IN document("bib.xml")//publisher
  LET $b := document("bib.xml")//book[publisher = $p]
  WHERE count($b) > 100
  RETURN $p
- FOR generates an ordered list of bindings of publisher names to $p
- LET associates to each binding a further binding of the list of book elements with that publisher to $b
- At this stage, we have an ordered list of tuples of bindings: ($p,$b)
- WHERE filters that list to retain only the desired tuples
- RETURN constructs for each tuple a resulting value
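For readers more familiar with imperative code, the four clauses can be mirrored in Python. This is a rough analogue, not XQuery semantics: it binds each distinct publisher name once (an assumption; the query as written iterates over every publisher element), and the sample data, threshold and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for bib.xml.
BIB = """
<bib>
  <book><title>T1</title><publisher>A</publisher></book>
  <book><title>T2</title><publisher>A</publisher></book>
  <book><title>T3</title><publisher>B</publisher></book>
</bib>
"""


def text_of(elem):
    return (elem.text or "").strip()


def publishers_with_more_than(root, threshold):
    # FOR: ordered list of bindings of publisher names (deduplicated, document order).
    publishers = []
    for p in root.iter("publisher"):
        if text_of(p) not in publishers:
            publishers.append(text_of(p))

    result = []
    for name in publishers:
        # LET: bind the list of book elements that have this publisher.
        books = [
            b for b in root.iter("book")
            if any(text_of(q) == name for q in b.findall("publisher"))
        ]
        # WHERE: keep only the tuples (name, books) that pass the filter.
        if len(books) > threshold:
            # RETURN: construct the result value for this tuple.
            result.append(name)
    return result


root = ET.fromstring(BIB.strip())
big = publishers_with_more_than(root, 1)
```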

XQuery vs SQL
Order matters!
document("zoo.xml")//chapter[2]//figure[caption = "Tree Frogs"]
XQuery is Turing complete, SQL is not.

XQuery Example Møller and Schwartzbach

XQuery Standard on Ranking (2.3.1)
Document order defines a total ordering among all the nodes seen by the language processor. Within a given document, the document node is the first node, followed by element nodes, text nodes, comment nodes, and processing instruction nodes in the order of their representation in the XML form of the document (after expansion of entities). Element nodes occur before their children. The namespace nodes of an element immediately follow the element node, in implementation-defined order. The attribute nodes of an element immediately follow its namespace nodes, and are also in implementation-defined order.
The relative order of nodes in distinct documents is implementation-defined but stable within a given query or transformation. In other words, given two distinct documents A and B, if a node in document A is before a node in document B, then every node in document A is before every node in document B. The relative order among free-floating nodes (those not in a document) is implementation-defined.

Next Week (12/3) XML indexing and search II Metadata indexing and search Dublin Core, RDF, DAML+OIL
