1 CS 8803 AIAD (Spring 2008) Project Group#22 Ajay Choudhari, Avik Sinharoy, Min Zhang, Mohit Jain Smart Seek.

2 Outline
- Problem background and motivation
- Project goals
- System architecture
- The social network tool
  - Facebook app
  - Gathering user data
- The dictionary
  - Gathering documents and data
  - Building the lexicon
  - LSI
- The indexer and search
  - Building the index
  - Servicing search requests

3 Background and Motivation
Traditional search
- Keyword based: not optimal
- Low recall, high precision
- The burden is on the user to formulate the query effectively
Enhancements
- Automate query reformulation using:
  - relevance feedback from previous searches
  - semantic meaning extraction to aid search

4 Goals
- Demonstrate the usability of semantic search concepts
- Use social networking data to develop a prototype implementation
- Make the search framework generic
- Determine what makes a good lexicon / dictionary for focused search requirements

5 Architecture

6 Detailed Architecture
[Diagram. Components: Facebook Server; Facebook Application (PHP); Parser (XML); Dictionary (WordNet + LSI) lookup; Lucene Search & Index (Java). Labeled flows: query terms, query results.]

7 Search Front end Screenshot here

8 Gathering Facebook user data
- Users who add the application allow storage of their profile information
- Use ‘profile_update_time’ to limit updates
- For search over friends’ profiles, the data is cached temporarily if the friend is not already a registered user
- Facebook places privacy restrictions on the storage of private data
  - Workaround: server-side cron jobs periodically update the database if a profile has been updated
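The update-limiting idea on this slide can be sketched as follows. This is a minimal illustration, not the project's actual PHP code: `StoredProfile`, `needs_refresh`, `sync_profiles`, and the `fetch_profile` callback are hypothetical names, and the only thing taken from the slide is the comparison against `profile_update_time`.

```python
from dataclasses import dataclass


@dataclass
class StoredProfile:
    """Hypothetical local record of one user's profile."""
    user_id: str
    profile_update_time: int  # update timestamp seen at the last fetch
    data: dict


def needs_refresh(stored, remote_update_time):
    """True when there is no local copy or the remote profile is newer."""
    return stored is None or remote_update_time > stored.profile_update_time


def sync_profiles(store, remote_times, fetch_profile):
    """One cron pass: re-fetch only profiles whose update time advanced.

    store         -- dict mapping user id to StoredProfile
    remote_times  -- dict mapping user id to the current remote update time
    fetch_profile -- callable returning the full profile data for a user id
    """
    refreshed = []
    for user_id, remote_time in remote_times.items():
        if needs_refresh(store.get(user_id), remote_time):
            store[user_id] = StoredProfile(
                user_id, remote_time, fetch_profile(user_id)
            )
            refreshed.append(user_id)
    return refreshed
```

Comparing timestamps first means the expensive full-profile fetch runs only for users whose profile actually changed since the last pass.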

9 Dictionary
Rationale
- Goal: find the web pages that are semantically related to a given query.
- Solution: add semantically related keywords to our queries. The dictionary serves as the pool of words from which we extract the semantically related words.
Approach
- To determine the semantic relation between pairs of terms, we need to analyze a very large number of documents.
- Once we have a collection of documents at hand, we preprocess each document by removing noise.
- Parse the documents and extract those keywords whose occurrence count is greater than some threshold.
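The approach above (clean the documents, count terms, keep those above a threshold) might look roughly like this. It is a sketch under assumptions: the slide does not define "noise", so tag stripping, dropping non-ASCII characters, and the tiny `STOP_WORDS` list are illustrative choices, and `build_dictionary` is a hypothetical name.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def clean(text):
    """Remove markup tags and non-ASCII characters, then lower-case."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    return text.lower()


def build_dictionary(documents, min_count):
    """Return the terms that occur more than min_count times overall."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", clean(doc))
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return {term for term, n in counts.items() if n > min_count}
```

For example, `build_dictionary(["<p>Music and sports</p>", "Music fans"], 1)` keeps only `music`, since it is the only term seen more than once.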

10 LSI
Latent Semantic Indexing (LSI) is a method with which we can calculate a relatedness score for each pair of terms.
- Each document can be parsed into a vector
- LSI can determine an orthonormal basis for the document space
- Assume the orthonormal basis is U; the relatedness score can then be calculated as U*U’
- The semantic relatedness is in effect calculated from the co-occurrence of pairs of terms
- The size of our dictionary is 10,775 terms, and we have crawled 7,142 documents
- The whole term-document matrix is computed using sparse matrix operations
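The U*U’ computation on this slide can be written out as below. This assumes, as is usual for LSI but not stated on the slide, that U comes from a truncated singular value decomposition of the term-document matrix A, with m = 10,775 terms and n = 7,142 documents; the rank k is a hypothetical truncation parameter.

```latex
\begin{align*}
A &\approx U_k \Sigma_k V_k^{\top},
  & A &\in \mathbb{R}^{m \times n},\quad
    U_k \in \mathbb{R}^{m \times k},\quad
    U_k^{\top} U_k = I_k \\
R &= U_k U_k^{\top},
  & R_{ij} &= \sum_{l=1}^{k} (U_k)_{il}\,(U_k)_{jl}
\end{align*}
```

Row i of U_k holds the coordinates of term i in the k-dimensional latent space, so R_ij is the inner product of the coordinate vectors of terms i and j. A closely related variant weights each latent dimension by its squared singular value, since A_k A_k^T = U_k Σ_k^2 U_k^T for the rank-k approximation A_k; the slide does not say whether such a weighting was applied.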

11 Gathering data
- Crawling was done using WebSphinx, an open source crawler
- We crawled around 10,000 pages from blogs and other social media to build the dictionary
- Pages were crawled mainly from these sites
- Crawled data was filtered to remove noise such as Unicode characters, tags and other non-text material

12 WordNet
- WordNet is a lexical database of the English language developed at Princeton University
- WordNet provides “synsets”, which are sets of conceptually related words
- We used WordNet to derive conceptually related words
- We aggregate the related words obtained from WordNet and from the dictionary
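The aggregation step in the last bullet could be sketched as a simple query expansion. Everything here is illustrative: `expand_query` is a hypothetical function, the two lookup tables stand in for WordNet and the LSI-based dictionary as plain Python dicts, and the `limit` cap per source is an assumption.

```python
def expand_query(terms, wordnet_related, dictionary_related, limit=3):
    """Append related words from both sources to the original query terms.

    wordnet_related / dictionary_related map a term to a list of related
    words, best first. Duplicates are dropped and order is preserved.
    """
    expanded = []
    seen = set()
    for term in terms:
        candidates = [term]
        candidates += wordnet_related.get(term, [])[:limit]
        candidates += dictionary_related.get(term, [])[:limit]
        for word in candidates:
            if word not in seen:
                seen.add(word)
                expanded.append(word)
    return expanded
```

Keeping the original term first and capping the number of additions per source helps stop the expanded query from drifting away from what the user typed.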

13 Indexing & Search
Lucene API:
- Lucene is a software library concerned with text indexing and searching.
- It is “NOT” a ready-to-use application like a file-search program, a web crawler, or a web site search engine.

14 Indexing & Search
Indexing breaks down into three main operations:
- Converting the data to text
- Analyzing/stemming
- Saving it to the index (an inverted index)
Searching:
- Parsing the query
- Analyzing the query
- Searching the inverted index
Updating the indexes on a regular basis:
- A Document must first be deleted from an index and then re-added to it.
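The operations listed on this slide can be illustrated with a toy inverted index. This is not Lucene's API or implementation: `InvertedIndex` and its methods are hypothetical, and the analysis step only lower-cases and tokenizes (stemming is left out to keep the sketch short).

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.doc_terms = {}  # document id -> terms indexed for it

    @staticmethod
    def analyze(text):
        """Lower-case and split into word tokens (no stemming here)."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        terms = set(self.analyze(text))
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        """Update = delete the old document, then re-add the new text."""
        self.delete(doc_id)
        self.add(doc_id, text)

    def search(self, query):
        """Return ids of documents that contain every query term."""
        terms = self.analyze(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result
```

The query goes through the same `analyze` step as the documents, which is what makes a query token comparable to an indexed term.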

15 Indexing & Search
Some properties of Lucene utilized for semantic search:
- Analyzer: eliminates stop words and stores words in their base form
- Keyword: not analyzed (stemmed), but indexed
- Updating the indexes on a regular basis (a Document is deleted from an index and then re-added to it)
- Search facility extended:
  * Keyword1 AND/OR Keyword2
  * +Keyword1 -Keyword2 (extended)
- Ranking formula for results:
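The `+Keyword1 -Keyword2` form follows the convention that `+` marks a required term and `-` a prohibited one. A small sketch of evaluating such a query over a postings table follows; `parse_query` and `evaluate` are hypothetical helpers, not Lucene's query parser, and ranking is deliberately left out.

```python
def parse_query(query):
    """Split a query into required (+), prohibited (-) and optional terms."""
    required, prohibited, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            prohibited.append(token[1:])
        else:
            optional.append(token)
    return required, prohibited, optional


def evaluate(query, postings, all_ids):
    """Return matching document ids.

    postings -- dict mapping a term to the set of document ids containing it
    all_ids  -- set of every document id in the index
    """
    required, prohibited, optional = parse_query(query)
    if required:
        # Every required term must be present; optional terms do not filter.
        hits = set(all_ids)
        for term in required:
            hits &= postings.get(term, set())
    else:
        # Without required terms, at least one optional term must match.
        hits = set()
        for term in optional:
            hits |= postings.get(term, set())
    for term in prohibited:
        hits -= postings.get(term, set())
    return hits
```

In this sketch a query made only of prohibited terms matches nothing, because there is no positive term to select candidates from.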

16 Indexing & Searching: Implementation
[Diagram labels: Picks Document; Add Fields; Adds to Index; Indexed Files; Query/Search word; Analyzer/parser; Index Search; Document Ids; Corpus]

17 Search & Tradeoffs
- Search by field name (space tradeoff)
- Referencing the words before and after the search keyword (speed tradeoff)

18 [Diagram. Components: Facebook Server; Facebook Application (PHP); XML Parser; Lucene Search & Index (Java). XML File: Abc xyz 24 Sports, Music Atlanta Moderate Hello 27. Text File: Abc xyz 24 Sports, Music Atlanta Moderate.]
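The XML-to-text step shown in this diagram might be sketched as follows. The element names (`profiles`, `profile`, `name`, `age`, `interests`, `city`, `views`) are invented for illustration, since the slide shows only sample values and not the real schema; the "field: value" output layout is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile export; the tag names are assumptions.
SAMPLE_XML = """
<profiles>
  <profile>
    <name>Abc xyz</name>
    <age>24</age>
    <interests>Sports, Music</interests>
    <city>Atlanta</city>
    <views>Moderate</views>
  </profile>
</profiles>
"""


def profiles_to_text(xml_string):
    """Return one 'field: value' text record per <profile> element."""
    root = ET.fromstring(xml_string.strip())
    records = []
    for profile in root.iter("profile"):
        lines = [
            f"{child.tag}: {child.text.strip()}"
            for child in profile
            if child.text and child.text.strip()
        ]
        records.append("\n".join(lines))
    return records
```

Each returned record could then be handed to the indexer as the text of one document, with the field names available for field-specific search.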

19 Extensions and future work
- Coupling semantic search with traditional search techniques to achieve a ‘whole’ solution
  - Relevance feedback from previous searches, for instance
- Performance testing of search results
- Relevance sorting of results (partially)

20 Questions (?) Thank you