Next generation search engines. Paolo Ferragina, Dipartimento di Informatica, Pisa
Our journey today! Web search engines; XML search engines; basic research on data compression, indexing and mining.
Much interest... More than 85% of users arrive at a site from a search engine (SE). Web searches: 45% Google, 29% Yahoo, 13% MSN, 5% ASK,... Toolbar searches: 49.6% Google, 46.1% Yahoo,... SE impact on: Web structure, knowledge and understanding, social behavior... and marketing!! 33% of users believe that the results of a query are the best place to buy things!! Ads: 4B $ in USA, 2B in Europe, 180M in Italy. Paid search: 65% Google, 25% Yahoo, 8% MSN,... Portal search: 15% Yahoo, 10% MSN, 7% AOL-Google,...
Goal of a Search Engine: retrieve the docs that are relevant for the user query. Doc: Word or PDF file, web page, email, blog, e-book,... Query: bag-of-words paradigm. Relevant ?!? ...We face many difficulties, especially on the Web!!!
Web is huge: 8 billion pages [Google]. We need to rank the results!!
Web is heterogeneous. Languages/Encodings: hundreds of languages: 55 (Jul01). Home pages: in 1997, English 82%, the next 15 take 13%; in 2001, English 53%, the next 9 take 30%. Distributed authorship: millions of people creating pages with their own style… Not all have the purest motives in providing high-quality information: commercial motives drive spamming. Extracting significant data is difficult!!
Web is highly dynamic [154 sites, 2004]. [Plot: values normalized with respect to the first week.] A good coverage of the indexed Web is difficult!!
User queries are difficult. Query composition: short (2001: 2.54 terms on average, 80% have fewer than 3 terms); imprecise terms; 78% of the queries are not modified. Query results: users are lazy, 85% look at just one page of results.
User needs are varied. Informational: want to learn about something (~40%), e.g. Asthma. Navigational: want to go to a page (~25%), e.g. Alitalia. Transactional: want to do something (~35%): access a service (e.g. NY weather), downloads (e.g. Mars surface images), shop (e.g. Nikon CoolPix).
Evolution of Search Engines. First generation (1995-1997: AltaVista, Excite, Lycos, etc.): use only on-page, web-text data; word frequency and language. Second generation (1998: Google, now everyone): use off-page, web-graph data; link (or connectivity) analysis; anchor text (how people refer to a page). Third generation (no winner yet!! Various players: Google, Yahoo, MSN, Ask,…): answer the need behind the query; focus on user need, rather than on query; integrate multiple data sources; click-through data; query mining. Fourth generation: Information Supply [Andrei Broder, VP emerging search tech, Yahoo! Research].
Yahoo! World. Search: Yahoo! Image, Yahoo! Video, Yahoo! Local, Yahoo! News, Yahoo! Shopping Search; Communication: Yahoo! Mail, Yahoo! Messenger, My Web, Yahoo! Personals, Yahoo! 360º, Yahoo! Photos, Flickr, delicious,..., Yahoo! Answers; Content: Yahoo! Sports, Yahoo! Finance, Yahoo! Music, Yahoo! Movies, Yahoo! News, Yahoo! Games, My Yahoo!; Mobile: Yahoo! Mobile, Yahoo! Go; Commerce: Yahoo! Shopping, Yahoo! Autos, Yahoo! Auctions, Yahoo! Travel; Small Business: Yahoo! Small Business, Yahoo! Domains, Yahoo! Web Hosting, Yahoo! Merchant Solutions, Yahoo! Business Email, HotJobs; Advertising: Yahoo! Search Marketing, Yahoo! Publisher Network. [source: R. Baeza-Yates]
Yahoo! numbers [April, 06]: 15 languages, 20 countries, 6B users. Each day: 1 million new accounts; 3.4 billion page views; 10 Tb of data processed (total, 20Pb); 2 billion Mail+Messenger messages sent. [source: R. Baeza-Yates]
Yahoo! Research Barcelona. Starting date: May 2006, Barcelona. Director: Ricardo Baeza-Yates. Areas: Web Mining and Web Search. People: more than 10 and… fast growing!! Why me? First academic grant in Europe: a three-year project on data compression and indexing on hierarchical memories. [source: R. Baeza-Yates]
Data to be mined or searched. Crawled data (large, heterogeneous, …): Web pages & links; blogs; items for sale (shopping, travel, etc.); RSS feeds. Produced data (high quality, sparse,…): Yahoo's Web (YCars, YHealth, YTravel,…); edited news, purchased news,… Direct interaction (quality??): social links; tagged content. [source: R. Baeza-Yates]
The wisdom of the crowd can be used to improve the search and extraction process
Observed data. Query logs: spelling, synonyms, phrases (named entities), substitutions,… Clicks: relevance, intent, … "There is a new type of economics that has emerged and that the world doesn't understand. Web usage data is an amazing leading indicator because it tells you where intent is heading." U. Fayyad, Yahoo Chief Data Officer
Our future goals… Deploy user actions, e.g. queries + clicks + …: implicit semantic information; it's free and unbiased; large volume. … the Semantic Web: hypothesis, explicit semantic information; obstacle, us. Possible uses: query suggestion, query disambiguation, adv suggestions, Web-site design,...
An XML excerpt Donald E. Knuth The TeXbook Addison-Wesley 1986 Donald E. Knuth Ronald W. Moore An Analysis of Alpha-Beta Pruning 293-326 1975 6 Artificial Intelligence...
The literature on XML indexing... Various tools are available: TreSy [Cribecu, 1997], eXist [TU Darmstadt, 2002], GalaTex [AT&T, 2004]. Some of their limitations: they run on a single machine; use a lot of computational resources (time, space,…); limit the indexable XML document structure. XML document types: data centric [relational data: DB exports]; text centric [literary texts, reports, emails, news, …].
Our proposal: Tauro. Application level; Query interface (XML based); Query solver (analysis + optimization); Result retriever (indexing data structure); Data collection manager (data compression, snippet extraction).
The first scenario: Client-Server. Context of use: Biblio search,...
The second scenario: Peer-to-Peer. Context of use: Collaborative search
Our goal... Exploit the power of the crowd: the largest library of XML tagged text collections. …and the power of search engines: a suite of search + text mining tools (syntactic text comparison; motifs extraction for text pattern identification; concept identification via LSI).
You will find, already loaded, rare texts in editions and translations from the 1400s and 1500s
You can visually compose sophisticated structural queries
Stay in touch... http://signum.sns.it Everything at the fingertips of humanists: Nokia 770, Origami (Microsoft), SmartPhones, …
Basic research. Recurrent themes of this talk: large volume of data; efficient search; hierarchical memory systems (L1-L2 caches, RAM, (Multi-)Disks, (Web) Network, …). Basic algorithmic tools: indexing data structures; data compression. Do we face a paradoxical situation?
Six years ago... [now, J. ACM 05] "Opportunistic Data Structures with Applications", P. Ferragina, G. Manzini. Survey by Navarro-Makinen cites more than 50 papers on the subject!!
Joint effort with Navarro's group at Univ. Chile. Some figures over hundreds of MBs of data: Count(P) takes a few milliseconds; Locate(P) takes a few milliseconds for each occurrence of P. Space is about 22% of the original size when supporting just Count ops, and 35% when supporting Count and Locate ops [bzip ~ 20%].
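To make Count(P) and Locate(P) concrete, here is a toy sketch of backward search over the Burrows-Wheeler transform, the mechanism underlying this family of compressed indexes. It is an illustration under loud assumptions, not the implementation behind the figures above: the suffix array is kept in full and uncompressed, construction is naive, rank is computed by scanning, and all names (build_index, sa_range, count, locate, SENTINEL) are hypothetical.

```python
from collections import Counter

SENTINEL = "\0"  # assumed end marker, smaller than every symbol of the text


def build_index(text):
    """Return (bwt, suffix_array, C) for text; naive construction, toy sizes only."""
    s = text + SENTINEL
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # symbol preceding each sorted suffix
    C, total = {}, 0
    for ch, n in sorted(Counter(s).items()):
        C[ch] = total  # number of symbols in s that are smaller than ch
        total += n
    return bwt, sa, C


def sa_range(pattern, bwt, C):
    """Backward search: half-open range of sorted suffixes prefixed by pattern."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in C:
            return 0, 0
        lo = C[ch] + bwt.count(ch, 0, lo)  # rank of ch in bwt[:lo], by scanning
        hi = C[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return 0, 0
    return lo, hi


def count(pattern, index):
    """Number of occurrences of pattern in the indexed text."""
    bwt, _, C = index
    lo, hi = sa_range(pattern, bwt, C)
    return hi - lo


def locate(pattern, index):
    """Sorted 0-based starting positions of pattern in the indexed text."""
    bwt, sa, C = index
    lo, hi = sa_range(pattern, bwt, C)
    return sorted(sa[lo:hi])


index = build_index("abracadabra")
print(count("abra", index), locate("abra", index))
```

Each pattern symbol costs two rank queries on the transformed text, which is why a compressed structure answering rank quickly is enough to support Count; Locate additionally needs (sampled) suffix-array positions, consistent with the larger space reported for it.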
Compressed index for XML [Ferragina et al, WWW 06]. Query (counting) time: 8 ms. Navigation time: 3 ms. UniPi is patenting it!!
Next generation search engines Paolo Ferragina University of Pisa Thanks !!
An XML excerpt Donald E. Knuth The TeXbook Addison-Wesley 1986 Donald E. Knuth Ronald W. Moore An Analysis of Alpha-Beta Pruning 293-326 1975 6 Artificial Intelligence... It is verbose !
A tree interpretation... XML document exploration: tree navigation. XML document search: labeled subpath searches (subset of XPath [W3C]).
The Problem. We wish to devise a compressed representation for a labeled tree T that efficiently supports some operations: navigational operations; subpath and content searches; visualization operation. Summary indexes (like Dataguide, 1-index or 2-index) take large space and do not support content searches. XML-aware compressors (like XMill, XmlPpm, ScmPpm,...) need the whole decompression. XML-queriable compressors (like XPress, XGrind, XQzip,...) offer poor compression and scan the whole (compressed) file. XML-native search engines might exploit this tool as a core block for query optimization and (compressed) storage.
A transform for labeled trees [Ferragina et al, IEEE FOCS 05]. We propose the XBW-transform that linearizes a labeled tree T in 2 arrays such that: the compression of T reduces to the compression of these two arrays (via e.g. gzip, bzip2, ppm,...); the indexing of T reduces to implementing simple rank/select query operations over these two arrays. Example: A = a b a a a c b c d a b e c d... Rank(a, 7) = #a in A[1,7] = 4; Select(a, 2) = position of the 2nd a = 3.
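The Rank/Select example on this slide can be checked with a small sketch. It assumes 1-based positions, as on the slide, and stores plain position lists per symbol; the class name RankSelect is hypothetical, and a real compressed index would use succinct structures instead of explicit lists.

```python
from bisect import bisect_right


class RankSelect:
    """Toy rank/select over a sequence, with 1-based positions as on the slide."""

    def __init__(self, seq):
        self.positions = {}  # symbol -> increasing list of its 1-based positions
        for i, sym in enumerate(seq, start=1):
            self.positions.setdefault(sym, []).append(i)

    def rank(self, sym, i):
        """Number of occurrences of sym in seq[1..i]."""
        return bisect_right(self.positions.get(sym, []), i)

    def select(self, sym, k):
        """Position of the k-th occurrence of sym, or None if there is none."""
        occ = self.positions.get(sym, [])
        return occ[k - 1] if 1 <= k <= len(occ) else None


rs = RankSelect("abaaacbcdabecd")
print(rs.rank("a", 7), rs.select("a", 2))
```

Rank and Select are inverse in the sense that rank(sym, select(sym, k)) returns k whenever the k-th occurrence exists.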
The XBW-Transform. Step 1. Visit the tree in pre-order. For each node, write down its label and the labels on its upward path. [Figure: a labeled tree with 15 nodes rooted at C; pre-order labels C B D c a c A b a D c B D b a. Column S_α: node labels (a permutation of the tree nodes); column S_π: upward labeled paths.]
The XBW-Transform. Step 2. Stably sort according to S_π. [Figure: after sorting, S_α = C b a D D c D a B A B c c a b.]
The XBW-Transform. Step 3. Add a binary array S_last marking the rows corresponding to last children. [Figure: S_last = 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1; the XBW consists of S_last and S_α.] XBW takes optimal space. XBW can be built and inverted in optimal time.
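The three steps can be sketched in a few lines. This is a minimal illustration, not the optimal-time construction mentioned on the slide: it materializes every upward path as a string and sorts them, so it only suits small trees with single-character labels. The tree literal is an assumption, reconstructed from the figure residue of these slides, and the names xbw, s_alpha and s_last are this sketch's own; the root is flagged as a last child, which is also an assumption made to match the bit string shown on the slide.

```python
# A node is (label, children); leaves have an empty list of children.
# Assumed tree, reconstructed from the slide figure:
# C( B( D(c, a), c ), A( b, a, D(c) ), B( D(b), a ) )
TREE = ("C", [
    ("B", [("D", [("c", []), ("a", [])]), ("c", [])]),
    ("A", [("b", []), ("a", []), ("D", [("c", [])])]),
    ("B", [("D", [("b", [])]), ("a", [])]),
])


def xbw(tree):
    """Return (s_last, s_alpha) as strings for a tree with 1-character labels."""
    rows = []  # (upward path, label, is_last_child), collected in pre-order

    def visit(node, path, is_last):
        label, children = node
        rows.append((path, label, is_last))  # Step 1: label + upward path
        for i, child in enumerate(children):
            # The child's upward path starts at its parent: label + parent's path.
            visit(child, label + path, i == len(children) - 1)

    visit(tree, "", True)  # assumption: the root counts as a last child
    rows.sort(key=lambda row: row[0])  # Step 2: stable sort by upward path
    s_alpha = "".join(label for _, label, _ in rows)
    s_last = "".join("1" if last else "0" for _, _, last in rows)  # Step 3
    return s_last, s_alpha


s_last, s_alpha = xbw(TREE)
print(s_last)
print(s_alpha)
```

Because Python's sort is stable, nodes sharing the same upward path keep their pre-order position, so siblings stay contiguous and in their original order, which is what lets the last-child bits delimit each group of children.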
An illustrative example. [Figure: XBW of an XML document; one array holds tags, attributes and the symbol =, the other holds the Pcdata.] XBW is compressible: S_α and S_pcdata are locally homogeneous; S_last has some structure.
A general algorithmic paradigm. Basic approach (…now only for text and labelled trees): transform the input data into a few arrays; index (+compress) them to support Rank/Select. A lot of interest around it. Theory: Soda 06 (2), Cpm 06 (2), Icalp 06 (2), DCC 06 (1). Experimental: Wea 06 (2). You can test it: http://pizzachili.di.unipi.it or http://pizzachili.dcc.uchile.cl
A general algorithmic paradigm. Basic (magic ?!?) approach: transform the input data into a few arrays; index (+compress) them to support Rank/Select ops. Example: A = a b a a a c b c d a b e c d... Rank(a, 7) = #a in A[1,7] = 4; Select(a, 2) = position of the 2nd a = 3. A lot of interest around it. Theory: Soda 06 (2), Cpm 06 (2), Icalp 06 (2), DCC 06 (1). Experimental: Wea 06 (2).