 Copyright 2005 Digital Enterprise Research Institute. All rights reserved. www.deri.org 1 The Architecture of a Large-Scale Web Search and Query Engine.

Presentation transcript:

1 The Architecture of a Large-Scale Web Search and Query Engine
Andreas Harth
Joint work with Aidan Hogan, Juergen Umbrich, Stefan Decker

2 Current Search Climate
Major search engines (Google, Yahoo, Microsoft) offer keyword searches over hypertext documents
Search engines handle general searches well, but are poor at answering complex queries:
– e.g. podcasts about gardening
– e.g. pictures of your home town
– e.g. people that Rudi Studer knows
– e.g. pictures of friends of Norman Walsh
– e.g. weather-related WSDL services
Smaller sites, such as online social networks, scientific databases, digital libraries, collaborative data repositories, etc., provide semantically rich data and offer specialized search interfaces, mostly backed by relational databases

3 Semantic Web Search Engine
Data integration on Web scale, to leverage structured data available under open licenses
Allow people to pose queries over the integrated corpus
Allow for programmatic access to the corpus (via SPARQL)

4 Hypertext Web vs. Semantic Web

5

6 Topical Subgraphs
First, match nodes in a large graph satisfying a query
Then, select the surrounding nodes and arcs
The topical subgraph contains all information required to further process results
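The two steps above can be sketched as follows. This is a minimal illustration, not the SWSE implementation: the toy triples, the substring-based keyword match, and the `hops` parameter (standing in for the "n" used on later slides) are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy graph: (subject, predicate, object) arcs.
TRIPLES = [
    ("paper1", "title", "ReConRank"),
    ("paper1", "creator", "alice"),
    ("alice", "knows", "bob"),
    ("bob", "name", "Bob"),
    ("paper2", "title", "Other work"),
]

def topical_subgraph(triples, keyword, hops=1):
    """Return (matched nodes, all selected nodes, selected arcs)."""
    # Index every arc under both of its end nodes.
    adjacency = defaultdict(list)
    for triple in triples:
        subject, _, obj = triple
        adjacency[subject].append(triple)
        adjacency[obj].append(triple)

    # Step 1: match nodes satisfying the query (assumed: substring match).
    needle = keyword.lower()
    matched = {node for node in adjacency if needle in node.lower()}

    # Step 2: expand outwards, collecting surrounding nodes and arcs.
    nodes, frontier, arcs = set(matched), set(matched), set()
    for _ in range(hops):
        next_frontier = set()
        for node in frontier:
            for subject, predicate, obj in adjacency[node]:
                arcs.add((subject, predicate, obj))
                for neighbour in (subject, obj):
                    if neighbour not in nodes:
                        nodes.add(neighbour)
                        next_frontier.add(neighbour)
        frontier = next_frontier
    return matched, nodes, arcs
```

Later stages such as ranking can then work on the returned arcs alone, without going back to the full graph.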

7 Semantic Web Search Engine Architecture
[Diagram labels: Index, Crawler, Extraction, Consolidation, Indexing, Query Proc, Ranking, UI]

8 Obtaining Information
Data from the HTML Web
– DMOZ sites
Data from the XML Web
– CiteSeer
– DBLP
– RSS, Podcasts
Data from the RDF Web
– DMOZ categories
– SwissProt
– Wikipedia
– FOAF, SIOC, DC, …

9 Optimized Index on Quadruples
Data model: subject/predicate/object/context
16 different lookup patterns for quads (each node may be substituted by a variable) – e.g. (s, ?, ?, ?), (?, p, o, ?), …
Naive solution: put a separate index on each of s, p, o, and c, and compute joins for combinations
But: joins are costly
Solution: 16 indexes to cover all quadruple patterns
But: very costly to maintain 16 indexes
An index with concatenated keys allows access patterns to be re-used – saves 10 indexes
Huffman coding to save space on disk and in memory
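The idea of re-using access patterns through concatenated keys can be illustrated with a small sketch. Saving 10 of the 16 indexes leaves 6, and a sorted index over a concatenated key answers any pattern whose bound positions form a prefix of that key. The six orderings below are one assumed choice that covers all 16 patterns; the slides do not say which orderings SWSE uses. The in-memory sorted lists and the `QuadIndex` class are hypothetical stand-ins for on-disk structures.

```python
from bisect import bisect_left
from itertools import combinations

POSITIONS = "spoc"  # subject, predicate, object, context

# Assumed key orderings; other sets of six would work as well.
ORDERINGS = ["spoc", "pocs", "ocsp", "cspo", "cpso", "ospc"]

def all_patterns():
    """Yield every combination of bound positions (16 in total)."""
    for size in range(len(POSITIONS) + 1):
        yield from combinations(POSITIONS, size)

def ordering_for(bound):
    """Pick an ordering whose leading positions are exactly the bound ones."""
    for order in ORDERINGS:
        if set(order[:len(bound)]) == set(bound):
            return order
    raise LookupError(f"no ordering covers {bound!r}")

class QuadIndex:
    def __init__(self, quads):
        quads = list(quads)
        # One sorted list of re-ordered (concatenated) keys per ordering.
        self.indexes = {
            order: sorted(
                tuple(quad[POSITIONS.index(ch)] for ch in order)
                for quad in quads
            )
            for order in ORDERINGS
        }

    def lookup(self, s=None, p=None, o=None, c=None):
        """Return quads matching the pattern; None means 'variable'."""
        pattern = dict(zip(POSITIONS, (s, p, o, c)))
        bound = [ch for ch in POSITIONS if pattern[ch] is not None]
        order = ordering_for(bound)
        prefix = tuple(pattern[ch] for ch in order[:len(bound)])
        index = self.indexes[order]
        # A prefix sorts before every key that extends it, so bisect_left
        # lands on the first candidate; scan until the prefix changes.
        results = []
        for key in index[bisect_left(index, prefix):]:
            if key[:len(prefix)] != prefix:
                break
            by_position = dict(zip(order, key))
            results.append(tuple(by_position[ch] for ch in POSITIONS))
        return results

# Hypothetical sample data.
QUADS = [
    ("alice", "knows", "bob", "ctx1"),
    ("alice", "name", "Alice", "ctx1"),
    ("bob", "knows", "carol", "ctx2"),
]
index = QuadIndex(QUADS)
alice_quads = index.lookup(s="alice")
knows_carol = index.lookup(p="knows", o="carol")
```

Each lookup is a single range scan over one sorted list, so no join between per-position indexes is needed.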

10 Providing Information to the Casual User
Ranking required in case of large result sets
Link-based ranking algorithms (such as PageRank, HITS) not applicable to directed labeled graphs
ReConRank:
– link-based ranking on structured data
– can exploit labeled links
– takes into account provenance of data
– operates on topical subgraph – local ranking yields higher quality
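To make "link-based ranking that can exploit labeled links" concrete, here is a generic PageRank-style power iteration in which each arc label carries a weight. This is not the ReConRank algorithm, whose details the slides do not give; the `label_weight` mapping, the damping factor, and the example arcs are assumptions for illustration only.

```python
def rank(arcs, label_weight=None, damping=0.85, iterations=50):
    """Score the nodes of a small labeled graph by weighted link analysis."""
    label_weight = label_weight or {}
    nodes = sorted({node for s, _, o in arcs for node in (s, o)})
    count = len(nodes)

    # Outgoing links per node, weighted by predicate label (default 1.0).
    outgoing = {node: [] for node in nodes}
    for s, p, o in arcs:
        outgoing[s].append((o, label_weight.get(p, 1.0)))

    score = {node: 1.0 / count for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / count for node in nodes}
        for node in nodes:
            total = sum(weight for _, weight in outgoing[node])
            if total == 0:
                # Dangling node: spread its score evenly over all nodes.
                for other in nodes:
                    updated[other] += damping * score[node] / count
            else:
                for target, weight in outgoing[node]:
                    updated[target] += damping * score[node] * weight / total
        score = updated
    return score

# Hypothetical topical subgraph.
ARCS = [("a", "cites", "c"), ("b", "cites", "c"), ("c", "cites", "a")]
scores = rank(ARCS)
```

Because the input is a topical subgraph rather than the whole corpus, the iteration stays small enough to run at query time, which matches the "local ranking" point on the slide.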

11 Example Input Dataset
Example graph returned by keyword search for “ReConRank”, n = 1
4 keyword hits (red outline)
4 rankable resources (yellow outline)

12 Example Input Dataset (with Context)
Example graph returned by keyword search for “ReConRank”, n = 1
4 keyword hits (red outline)
4 rankable resources (yellow outline)

13 Solution: Combined Resource Context Graph
Shown is the result of combining the resource graph with the context graph, including the implied links (depicted with hollow green arrowheads)
Graph is well connected!
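A rough sketch of why combining the two graphs improves connectivity is given below. It assumes that implied links connect each context to the resources it mentions, in both directions; the slide does not define the implied links precisely, so the `contains` and `containedIn` labels and the sample quads are hypothetical.

```python
def resource_graph(quads):
    """Arcs between resources only, ignoring where each statement came from."""
    return {(s, p, o) for s, p, o, _ in quads}

def combined_graph(quads):
    """Resource arcs plus assumed implied links between contexts and resources."""
    arcs = set(resource_graph(quads))
    for s, _, o, context in quads:
        for resource in (s, o):
            arcs.add((context, "contains", resource))     # assumed implied link
            arcs.add((resource, "containedIn", context))  # assumed implied link
    return arcs

def is_weakly_connected(arcs):
    """True if every node is reachable when arc direction is ignored."""
    neighbours = {}
    for s, _, o in arcs:
        neighbours.setdefault(s, set()).add(o)
        neighbours.setdefault(o, set()).add(s)
    if not neighbours:
        return True
    start = next(iter(neighbours))
    seen, stack = {start}, [start]
    while stack:
        for node in neighbours[stack.pop()]:
            if node not in seen:
                seen.add(node)
                stack.append(node)
    return len(seen) == len(neighbours)

# Hypothetical quads: two unrelated statements published in the same context.
QUADS = [
    ("a", "knows", "b", "ctx1"),
    ("c", "knows", "d", "ctx1"),
]
```

In this toy input the resource graph falls into two pieces, while the combined graph is joined through the shared context node.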

14 Performance Evaluation

15 Conclusion
SWSE is a distributed system for processing large amounts of Web content
Crawler does syntax integration
Storage component features keyword index and complete index on quads for fast lookups
Ranking is scalable and fast, applicable to arbitrary RDF, but needs more quality evaluation
Design philosophy: keep the system simple, to be able to optimize and distribute easily
Algorithms designed for distributed setting: partition the data and task at hand and distribute to many machines
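One simple way to partition quad data across machines is to hash one position of each quad. The slides do not say how SWSE partitions its data, so hashing on the subject, the `machine_for` and `partition` helpers, and the machine count are all assumptions in this sketch.

```python
import zlib

def machine_for(key, machines):
    """Map a key to a machine number with a hash that is stable across runs."""
    # zlib.crc32 is used because Python's built-in hash() of strings is
    # randomized per process, which would break routing between machines.
    return zlib.crc32(key.encode("utf-8")) % machines

def partition(quads, machines):
    """Split quads into one bucket per machine, keyed on the subject."""
    buckets = [[] for _ in range(machines)]
    for quad in quads:
        buckets[machine_for(quad[0], machines)].append(quad)
    return buckets

# Hypothetical sample data.
QUADS = [
    ("alice", "knows", "bob", "ctx1"),
    ("alice", "name", "Alice", "ctx1"),
    ("bob", "knows", "carol", "ctx2"),
    ("carol", "name", "Carol", "ctx3"),
]
buckets = partition(QUADS, machines=3)
```

Under this assumed scheme all statements about one subject land on the same machine, so a subject lookup touches only one bucket, while patterns that leave the subject unbound have to be sent to every machine.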

16 Prototype online with dataset crawled starting from ISWC 2006 web site plus DBLP in RDF plus Wikipedia in WikiOnt
Acknowledgements: DERI Lion (SFI/02/CE1/l131)