1 CS 8803 AIAD (Spring 2008) Project Group#22 Ajay Choudhari, Avik Sinharoy, Min Zhang, Mohit Jain Smart Seek.

1 Smart Seek
CS 8803 AIAD (Spring 2008), Project Group #22
Ajay Choudhari, Avik Sinharoy, Min Zhang, Mohit Jain

2 Outline
• Problem background and motivation
• Project goals
• System architecture
• The social network tool
  - Facebook app
  - Gathering user data
• The dictionary
  - Gathering documents and data
  - Building the lexicon
  - LSI
• The indexer and search
  - Building the index
  - Servicing search requests

3 Background and Motivation
Traditional search
• Keyword based: not optimal
• Low recall, high precision
• The burden is on the user to formulate the query effectively
Enhancements
• Automate query reformulation using:
  - relevance feedback from previous searches
  - semantic meaning extraction to aid search

4 Goals
• Demonstrate the usability of semantic search concepts
• Use social networking data to develop a prototype implementation
• Make the search framework generic
• Investigate what makes a good lexicon / dictionary for focused search requirements

5 Architecture

6 Detailed Architecture
[Diagram. Components: Facebook Server; Facebook Application (PHP); XML; Parser; Lucene Search & Index (Java); Dictionary lookup (WordNet + LSI). Labeled flows: query terms, query results.]

7 Search Front End
[Screenshot]

8 Gathering Facebook user data
• Users who add the application allow storage of their profile information
• Use ‘profile_update_time’ to limit updates
• For search over friends’ profiles, the data is cached temporarily if the friend is not already a registered user
• Facebook privacy restrictions apply to the storage of private data
  - Workaround: server-side cron jobs periodically update the database if a profile has been updated
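A minimal sketch of how ‘profile_update_time’ could limit updates in the periodic job: a profile is re-fetched only when its update time is newer than the stored copy. The `StoredProfile` schema, the dict standing in for the database, and the `fetch_profile` callable are hypothetical; the slides do not show the actual PHP code or the Facebook API calls used.

```python
from dataclasses import dataclass


@dataclass
class StoredProfile:
    """Locally stored copy of a user's profile (hypothetical schema)."""
    user_id: int
    profile_update_time: int  # Unix timestamp of the last profile change
    data: dict


def needs_refresh(stored, remote_update_time):
    """Return True when there is no local copy or the remote profile is newer."""
    return stored is None or remote_update_time > stored.profile_update_time


def refresh_profiles(db, remote_times, fetch_profile):
    """Re-fetch only the profiles whose update time has advanced.

    db            -- dict mapping user_id to StoredProfile (stand-in for the database)
    remote_times  -- dict mapping user_id to the latest profile_update_time
    fetch_profile -- callable returning the full profile data for a user_id
    Returns the list of user ids that were refreshed.
    """
    refreshed = []
    for user_id, remote_time in remote_times.items():
        if needs_refresh(db.get(user_id), remote_time):
            db[user_id] = StoredProfile(user_id, remote_time, fetch_profile(user_id))
            refreshed.append(user_id)
    return refreshed
```

Comparing timestamps first keeps the expensive full-profile fetch off the common path where nothing has changed.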

9 Dictionary Rationale
Goal: Find the semantically related web pages for a given query.
Solution: Add semantically related keywords to the query. The dictionary serves as the pool of words from which the semantically related words are drawn.
Approach
• To determine the semantic relation between pairs of terms, we need to analyze a very large number of documents.
• Once a collection of documents is at hand, we preprocess each document by removing noise.
• Parse the documents and extract those keywords whose occurrence count is greater than some threshold.
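The threshold step in the approach above can be sketched as follows. This is an illustration only: the tokenizer, the tiny stop-word list, and the function names are assumptions, and the slides do not state the threshold value the project used.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_lexicon(documents, min_count):
    """Return the sorted terms that occur more than min_count times overall."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    return sorted(term for term, n in counts.items() if n > min_count)
```

Raising `min_count` shrinks the lexicon to more frequent terms, which also shrinks the term-document matrix that any later relatedness computation has to handle.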

10 LSI
Latent Semantic Indexing (LSI) is a method with which we can calculate a relatedness score for each pair of terms.
• Each document can be parsed into a vector
• LSI can determine an orthonormal basis for the document space
• Assume the orthonormal basis is U; the relatedness scores can then be calculated as U*U'
• The semantic relatedness is in effect calculated from the co-occurrence of pairs of terms
• The size of our dictionary is 10,775 terms, and we crawled 7,142 documents
• The entire term-document matrix is computed using sparse matrix operations
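The U*U' score can be written out under the usual LSI setup. The symbols A (term-document matrix with m terms as rows and n documents as columns), Σ, V, and the number of retained dimensions k are assumptions introduced here; the slides name only U and do not say how many dimensions were kept.

```latex
\[
A \in \mathbb{R}^{m \times n}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k^{\top} U_k = I_k
\]
\[
R = U_k U_k^{\top} \in \mathbb{R}^{m \times m}, \qquad
R_{ij} = \sum_{\ell=1}^{k} (U_k)_{i\ell}\,(U_k)_{j\ell}
\]
```

Here R_{ij} is the relatedness score of terms i and j, and with the figures on the slide m = 10,775 and n = 7,142. A commonly used variant weights each dimension by its singular value, R' = U_k Σ_k² U_k' = A_k A_k', which makes the link to term co-occurrence explicit, since entry (i, j) of A_k A_k' is the inner product of the two terms' document rows in the rank-k approximation.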

11 Gathering data
• Crawling was done using WebSphinx, an open source crawler (http://www.cs.cmu.edu/~rcm/websphinx)
• We crawled around 10,000 pages from blogs and other social media to build the dictionary
• Pages were crawled mainly from these sites:
  - http://en.wikipedia.org
  - http://directory.yahoo.com
  - http://www.blogspot.com
• Crawled data was filtered to remove noise such as Unicode characters, tags, and other non-text material
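A rough sketch of the noise filter, assuming the crawled pages are HTML strings. The function name and the exact rules (drop script and style blocks, strip tags, discard non-ASCII characters) are an interpretation of the slide, not the project's actual filter.

```python
import html
import re


def clean_page(raw_html):
    """Reduce a crawled HTML page to plain ASCII text."""
    # Drop script and style blocks together with their contents.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", raw_html)
    # Remove the remaining tags.
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Decode entities such as &amp; into the characters they stand for.
    text = html.unescape(text)
    # Discard non-ASCII characters (the slide's "Unicode characters").
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace.
    return re.sub(r"\s+", " ", text).strip()
```

Regular expressions are adequate for a sketch; a production filter would more likely use a real HTML parser, since malformed markup can defeat pattern-based tag stripping.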

12 WordNet
• WordNet is a lexical database of the English language developed at Princeton University
• WordNet provides “synsets”, which are sets of synonymous words that express a common concept
• We used WordNet to derive conceptually related words
• We aggregate the related words obtained from WordNet and from the dictionary
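The aggregation of the two sources can be sketched as a query-expansion step. Both lookups are plain dicts standing in for WordNet and for the LSI-scored dictionary; the function name, the per-term `limit`, and the rule of taking WordNet words first are assumptions made for illustration.

```python
def expand_query(terms, wordnet_synonyms, dictionary_related, limit=3):
    """Append up to `limit` related words per term, without duplicates.

    wordnet_synonyms   -- dict: term -> list of synonyms (stand-in for WordNet)
    dictionary_related -- dict: term -> list of (word, score) pairs
                          (stand-in for the LSI-scored dictionary)
    """
    expanded = list(terms)
    seen = set(terms)
    for term in terms:
        # Dictionary words ranked by descending relatedness score.
        ranked = sorted(dictionary_related.get(term, []), key=lambda p: -p[1])
        candidates = wordnet_synonyms.get(term, []) + [w for w, _ in ranked]
        added = 0
        for word in candidates:
            if added == limit:
                break
            if word not in seen:
                seen.add(word)
                expanded.append(word)
                added += 1
    return expanded
```

Capping the number of added words per term keeps the expanded query from drifting too far from what the user typed.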

13 Indexing & Search
Lucene API:
• Lucene is a software library for text indexing and searching.
• It is NOT a ready-to-use application such as a file-search program, a web crawler, or a web site search engine.

14 Indexing & Search
Indexing breaks down into three main operations:
• Converting data to text
• Analyzing/stemming
• Saving it to the index (an inverted index)
Searching:
• Parsing the query
• Analyzing the query
• Searching the inverted index
Updating the indexes on a regular basis:
• A Document must first be deleted from an index and then re-added to it.
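The operations above can be illustrated with a toy in-memory inverted index. This is not Lucene and does not use its API; the class and method names are hypothetical, and the analyzer only lower-cases and tokenizes, with no stemming.

```python
from collections import defaultdict
import re


class TinyIndex:
    """Toy in-memory inverted index; not Lucene, only the same basic steps."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.doc_terms = {}               # document id -> set of its terms

    @staticmethod
    def analyze(text):
        """Lower-case and tokenize; a real analyzer would also stem."""
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, doc_id, text):
        """Analyze the text and record the document under each of its terms."""
        terms = self.analyze(text)
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        """Remove the document from every postings list it appears in."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def update(self, doc_id, text):
        """Update is delete followed by re-add, as on the slide."""
        self.delete(doc_id)
        self.add(doc_id, text)

    def search(self, query):
        """Return ids of documents containing every analyzed query term."""
        terms = self.analyze(query)
        if not terms:
            return set()
        # .get avoids creating empty postings entries for unknown terms.
        return set.intersection(*(self.postings.get(t, set()) for t in terms))
```

Running the query through the same `analyze` step as the documents is what lets "Atlanta" in a query match "atlanta" in the index.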

15 Indexing & Search
Some properties of Lucene utilised for semantic search:
• Analyzer: eliminates stop words and stores words in their base form
• Keyword: a field that is indexed but not analyzed (stemmed)
• Updating the indexes on a regular basis (a Document is deleted from an index and then re-added to it)
• Search facility extended:
  * Keyword1 AND/OR Keyword2
  * +Keyword1 -Keyword2 (extended)
• Ranking formula for results:
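A simplified sketch of how a `+Keyword1 -Keyword2` query can be evaluated against an inverted index. It does not call Lucene; `parse_query`, `evaluate`, and the postings dict are hypothetical, and scoring is left out entirely. The assumed semantics are that `+` terms are all required, `-` terms exclude documents, and unmarked terms are alternatives that only select documents when no required term is present.

```python
def parse_query(query):
    """Split a query into required (+), prohibited (-), and optional terms."""
    required, prohibited, optional = [], [], []
    for token in query.lower().split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            prohibited.append(token[1:])
        else:
            optional.append(token)
    return required, prohibited, optional


def evaluate(query, postings):
    """Evaluate the query against postings: dict of term -> set of doc ids."""
    required, prohibited, optional = parse_query(query)
    if required:
        # Every required term must be present in the document.
        result = set.intersection(*(postings.get(t, set()) for t in required))
    else:
        # Without required terms, any optional term is enough.
        result = set().union(*(postings.get(t, set()) for t in optional))
    for term in prohibited:
        result -= postings.get(term, set())
    return result
```

A query made only of prohibited terms returns nothing here, because there is no positive term to select candidate documents from.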

16 Indexing & Searching: Implementation
[Diagram. Indexing path: Corpus, picks Document, add Fields, adds to Index, Indexed Files. Search path: Query/search word, Analyzer/parser, Index Search, Document IDs.]

17 Search & Tradeoffs
• Search by field name (space tradeoff)
• Referencing the words before and after the search keyword (speed tradeoff)

18 [Diagram. Components: Facebook Server; Facebook Application (PHP); XML File; XML Parser; Text File; Lucene Search & Index (Java).]
XML File: Abc xyz 24 Sports, Music Atlanta Moderate ... Hello 27
Text File: Abc xyz 24 Sports, Music Atlanta Moderate
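The XML-to-text step on this slide might look roughly like the following. The tag names in `SAMPLE` (`profiles`, `profile`, `name`, `age`, `interests`, `city`, `views`) are hypothetical, since the transcript preserves only the field values; the project's parser was part of the Java side, and Python is used here only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; only the values appear in the slide transcript.
SAMPLE = """<profiles>
  <profile>
    <name>Abc xyz</name><age>24</age><interests>Sports, Music</interests>
    <city>Atlanta</city><views>Moderate</views>
  </profile>
  <profile><name>Hello</name><age>27</age></profile>
</profiles>"""


def profiles_to_text(xml_string):
    """Flatten each <profile> element into one line of its field values."""
    root = ET.fromstring(xml_string)
    lines = []
    for profile in root.iter("profile"):
        values = [(child.text or "").strip() for child in profile]
        lines.append(" ".join(v for v in values if v))
    return "\n".join(lines)
```

Calling `profiles_to_text(SAMPLE)` yields one text line per profile, which is the form the indexer would then read.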

19 Extensions and future work
• Coupling semantic search with traditional search techniques to achieve a ‘whole’ solution
  - Relevance feedback from previous searches, for instance
• Performance testing of search results
• Relevance sorting of results (partially)

20 Questions (?) Thank you

