
1 The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis. Alexandros Ntoulas (1,2), Gerald Chao (1), Junghoo Cho (2). (1) Infocious Inc. {ntoulas, (2) University of California Los Angeles {ntoulas,

2 2 February 2014 WWW 2005 Chiba Japan Motivation: Current Web search engines identify relevant pages based on keyword matching. Example query: jaguar. Example result: Jaguar Cars, "Official worldwide web site of Jaguar Cars."

3 Motivation: Is keyword matching enough? Natural languages are inherently ambiguous. Example: jaguar. The car brand? Apple Mac OS X 10.2? The animal? Chemical software? …

4 The Infocious Web Search Engine uses Language Analysis techniques to: resolve ambiguities inside Web pages; rank the Web pages based on the coherence (quality) of the text; help users organize the results in intuitive ways through categorization; provide suggestions for query refinement.

5 What is different about Infocious? Search engines today do not apply Language Analysis to the level Infocious does. It is not simply a matter of applying existing algorithms: optimizations are needed for Web scale. Some features are made possible only through language analysis. Infocious makes Language Analysis features intuitive (yet powerful) for the user.

6 Architecture

7 Architecture: Crawler. Follows links to discover Web pages. Refreshes changed pages using sampling [VLDB02]. Can download pages from the Hidden Web [JCDL05].
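
The link-following step of the crawler can be pictured as a breadth-first traversal. The sketch below is only an illustration under assumptions: the `LINKS` dictionary is a hypothetical stand-in for fetching and parsing real pages, and `discover` is an invented name. It does not attempt the sampling-based refresh or the Hidden Web crawling cited on the slide.

```python
from collections import deque

# Hypothetical in-memory link graph standing in for real page fetches.
LINKS = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}

def discover(seed, links, limit=100):
    """Follow links breadth-first and return pages in discovery order."""
    seen = {seed}
    order = []
    frontier = deque([seed])
    while frontier and len(order) < limit:
        url = frontier.popleft()
        order.append(url)
        for target in links.get(url, []):
            if target not in seen:  # never enqueue a page twice
                seen.add(target)
                frontier.append(target)
    return order
```

Calling `discover("a.example/", LINKS)` visits each reachable page once, starting from the seed.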

8 Architecture: Linguistic Processing. Resolves language ambiguities [COLING02]. Annotates Web pages. Extracts concepts. Extracts named entities. Operates at crawl speed.

9 Linguistic Processing: Disambiguation. Part-of-speech (POS) tagging. Example: house plants. Done probabilistically: given a sentence S and a set of tags, find the tag sequence $T_{\text{best}}(S) = \arg\max_{T} P(T \mid S)$.
... most/Adj house/Noun plants/Noun are/Verb hybrids/Noun of/Prep plant/Noun species/Noun ...
... garden/Noun built/VerbD to/Inf house/Verb our/PronP most/Adv valuable/Adj plants/Noun ...
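
To make the arg max concrete, here is a minimal sketch that scores every candidate tag sequence with a toy bigram model and keeps the best one. Everything numeric is an assumption made for illustration: the `EMIT` and `TRANS` tables are invented, and exhaustive search stands in for whatever decoding Infocious actually uses. The score is the joint probability of tags and words, which is proportional to P(T | S) for a fixed sentence.

```python
from itertools import product

# Toy P(word | tag) values, invented for illustration.
EMIT = {
    "most": {"Adj": 0.5},
    "to": {"Inf": 1.0},
    "house": {"Noun": 0.3, "Verb": 0.2},
    "plants": {"Noun": 0.3, "Verb": 0.1},
}

# Toy P(tag | previous tag) values; "<s>" marks the sentence start.
TRANS = {
    ("<s>", "Adj"): 0.5, ("<s>", "Inf"): 0.5,
    ("Adj", "Noun"): 0.8, ("Adj", "Verb"): 0.1,
    ("Inf", "Verb"): 0.9, ("Inf", "Noun"): 0.05,
    ("Noun", "Noun"): 0.4, ("Noun", "Verb"): 0.4,
    ("Verb", "Noun"): 0.6, ("Verb", "Verb"): 0.05,
}

def score(words, tags):
    """Joint probability of a tag sequence and the words under the toy model."""
    p = 1.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        p *= TRANS.get((prev, tag), 0.0) * EMIT[word].get(tag, 0.0)
        prev = tag
    return p

def best_tags(words):
    """Return the highest-scoring tag sequence by exhaustive search."""
    candidates = product(*(EMIT[w].keys() for w in words))
    return max(candidates, key=lambda tags: score(words, tags))
```

With these toy numbers, "house" comes out as a noun after "most" and as a verb after "to", mirroring the two example sentences on the slide.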

10 Linguistic Processing: Disambiguation. POS information is stored inside the index. The user can manually specify POS at query time (or click on examples). Query: N:house N:plants. Results:
GreenPatio.Com – Tips for buying house plants. Why keep natural indoor plants.... Tips for buying house plants. Care for indoor plants....
Low Light Plants for the House. Is a common name for plants in the species Dieffenbachia.... As with most house plants … plantfacts.htm

11 Linguistic Processing: Disambiguation. POS information is stored inside the index. The user can manually specify POS at query time (or click on examples). Query: V:house N:plants. Results:
Over Wintering Bonsai … One method is to build a cold frame to house your plants in the winter....
Keeping Your Sunroom Cozy … And if you want to house a hot tub or plants, think about enclosing the … doityourself.com/sunroom/sunroomcozy.htm
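
One way to picture POS information stored inside the index is an inverted index keyed by (word, tag) pairs instead of bare words. The following is a sketch under assumptions, not the Infocious index format: `build_index`, `search`, the `PREFIX_TO_POS` mapping, and the two toy documents are all hypothetical, and every query term is assumed to carry an `N:` or `V:` prefix.

```python
from collections import defaultdict

# Hypothetical mapping from query prefixes to the tag names used at indexing time.
PREFIX_TO_POS = {"N": "Noun", "V": "Verb"}

def build_index(tagged_docs):
    """Map each (word, POS) pair to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, pos in tokens:
            index[(word.lower(), pos)].add(doc_id)
    return index

def search(index, query):
    """Intersect postings for terms such as 'N:house' or 'V:house'."""
    result = None
    for term in query.split():
        prefix, _, word = term.partition(":")
        postings = index.get((word.lower(), PREFIX_TO_POS[prefix]), set())
        result = postings if result is None else result & postings
    return result or set()

# Toy documents that are already POS-tagged (tags invented for illustration).
DOCS = {
    "greenpatio": [("buying", "Verb"), ("house", "Noun"), ("plants", "Noun")],
    "bonsai": [("to", "Inf"), ("house", "Verb"), ("your", "PronP"), ("plants", "Noun")],
}
```

Because the tag is part of the index key, `N:house N:plants` and `V:house N:plants` retrieve different documents even though the keywords are identical.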

12 Linguistic Processing: Disambiguation. Word-sense disambiguation. Previous example: jaguar. Approached through Web page categorization, using the categories of DMOZ (~600,000). Given a set of categories C and a page d, find $\max_{c \in C} P(c \mid d)$. In Infocious a page may belong to multiple categories.
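
A small sketch of choosing the most probable category for a page, and of letting a page fall into several categories, is given below. It uses a plain multinomial Naive Bayes model with add-one smoothing purely as a stand-in; the training pages, the `Science/Biology` label, and the `threshold` parameter are assumptions made for the example and are not taken from the talk.

```python
import math
from collections import Counter

# Tiny invented training set: (category, words of a page).
TRAIN = [
    ("Recreation/Autos", ["jaguar", "cars", "dealer", "engine"]),
    ("Computers", ["jaguar", "apple", "mac", "os"]),
    ("Science/Biology", ["jaguar", "cat", "animal", "habitat"]),
]

def train(labeled_docs):
    """Collect per-category document and word counts plus the vocabulary."""
    doc_counts, word_counts, vocab = Counter(), {}, set()
    for cat, words in labeled_docs:
        doc_counts[cat] += 1
        word_counts.setdefault(cat, Counter()).update(words)
        vocab.update(words)
    return doc_counts, word_counts, vocab

def posteriors(model, words):
    """Return P(c | d) for every category c, with add-one smoothing."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for cat in doc_counts:
        total_words = sum(word_counts[cat].values())
        s = math.log(doc_counts[cat] / total_docs)
        for w in words:
            s += math.log((word_counts[cat][w] + 1) / (total_words + len(vocab)))
        log_scores[cat] = s
    top = max(log_scores.values())  # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

def categories(model, words, threshold=0.3):
    """Return the best category and every category at or above the threshold."""
    post = posteriors(model, words)
    best = max(post, key=post.get)
    return best, sorted(c for c, p in post.items() if p >= threshold)
```

Lowering `threshold` is one simple way to let a page belong to multiple categories; how Infocious actually decides this is not stated on the slide.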

13 Categorization. The category of a result is highlighted onMouseOver(). Users can restrict a search to a category: jaguar cat:Computers. This can also be done by clicking on a category. Example results: Jaguar Cars, "Official worldwide site of jaguar cars" (Recreation/Autos); Apple Mac OS X, "The Apple Mac OS Product page" (Computers).
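
The `cat:` restriction can be sketched as a query parser plus a filter over categorized results. This is an illustrative sketch only: `parse_query`, `restrict`, and the record layout in `RESULTS` are hypothetical, and the page text is abbreviated from the slide examples.

```python
def parse_query(query):
    """Split a query into lowercase keywords and an optional cat: restriction."""
    keywords, category = [], None
    for term in query.split():
        if term.startswith("cat:"):
            category = term[len("cat:"):]
        else:
            keywords.append(term.lower())
    return keywords, category

def restrict(results, query):
    """Keep results that contain every keyword and match the category, if given."""
    keywords, category = parse_query(query)
    hits = []
    for r in results:
        words = set(r["text"].lower().split())
        if not all(k in words for k in keywords):
            continue
        if category is not None and category not in r["categories"]:
            continue
        hits.append(r["title"])
    return hits

# Toy categorized results (layout and text are assumptions for the example).
RESULTS = [
    {"title": "Jaguar Cars",
     "text": "official worldwide site of jaguar cars",
     "categories": ["Recreation/Autos"]},
    {"title": "Apple Mac OS X",
     "text": "apple mac os x 10.2 jaguar product page",
     "categories": ["Computers"]},
]
```

With this data, the query `jaguar` matches both results, while `jaguar cat:Computers` keeps only the Apple page.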

14 Linguistic Processing: Concept Extraction. More accurate phrase identification: identify concepts through a set of rules (pre-specified or automatically learned). Example rule: VerbPhrase-PrepPhrase-NounPhrase. The phrases "lightly tossed with salad dressing", "tossed with oil and vinegar dressing", and "tossed immediately with blue-cheese dressing" are all reduced to the concept "tossed with dressing". Example sentence where adjacent words do not form a phrase: "In the profession of cooking oil is an important ingredient" (here "cooking oil" is not a concept).
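
The reduction in the example above can be approximated by keeping only the head word of each phrase in a Verb-Prep-Noun pattern. The sketch below is a simplified assumption about how such a rule might work, operating on tokens that are already POS-tagged; `reduce_to_concept` and the tag names are hypothetical, and real rules (pre-specified or learned) would be richer.

```python
def reduce_to_concept(tagged):
    """Keep the first verb, the first following preposition, and the last noun after it."""
    verb = prep = noun = None
    for word, tag in tagged:
        if tag == "Verb" and verb is None:
            verb = word
        elif tag == "Prep" and verb is not None and prep is None:
            prep = word
        elif tag == "Noun" and prep is not None:
            noun = word  # keep overwriting: the last noun is treated as the head
    if None in (verb, prep, noun):
        return None  # the pattern did not match
    return f"{verb} {prep} {noun}"

# The three phrases from the slide, with tags assigned by hand for illustration.
PHRASES = [
    [("lightly", "Adv"), ("tossed", "Verb"), ("with", "Prep"),
     ("salad", "Noun"), ("dressing", "Noun")],
    [("tossed", "Verb"), ("with", "Prep"), ("oil", "Noun"),
     ("and", "Conj"), ("vinegar", "Noun"), ("dressing", "Noun")],
    [("tossed", "Verb"), ("immediately", "Adv"), ("with", "Prep"),
     ("blue-cheese", "Adj"), ("dressing", "Noun")],
]
```

All three tagged phrases collapse to the same string, which is what lets pages using different wordings be matched under one concept.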

15 Answering a query. Default is AND-semantics. Query disambiguation (e.g. in the query "train a pet" Infocious knows "train" has to be a verb). Ranking takes into account a variety of factors: presence of keywords and proximity; title, URL, formatting, font size, coloring, etc.; popularity of a page measured by in/out links; TextQuality.
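
A rough sketch of AND-semantics followed by multi-factor ranking appears below. The slide does not say how the factors are combined, so the weighted sum, the `WEIGHTS` values, and the per-page factor scores are all assumptions chosen for illustration.

```python
# Hypothetical weights; the real combination used by Infocious is not given.
WEIGHTS = {"keywords": 0.4, "proximity": 0.2, "popularity": 0.2, "text_quality": 0.2}

def matches_all(query_terms, page_terms):
    """AND-semantics: every query term must occur in the page."""
    return set(query_terms) <= set(page_terms)

def rank(query_terms, pages):
    """Filter pages by AND-semantics, then order them by a weighted factor sum."""
    scored = []
    for page in pages:
        if not matches_all(query_terms, page["terms"]):
            continue
        total = sum(WEIGHTS[name] * page["factors"][name] for name in WEIGHTS)
        scored.append((total, page["url"]))
    return [url for _, url in sorted(scored, reverse=True)]

# Toy pages with invented factor scores in [0, 1].
PAGES = [
    {"url": "a.example", "terms": ["train", "a", "pet"],
     "factors": {"keywords": 1.0, "proximity": 1.0, "popularity": 0.2, "text_quality": 0.9}},
    {"url": "b.example", "terms": ["pet", "train", "tips"],
     "factors": {"keywords": 1.0, "proximity": 0.5, "popularity": 0.9, "text_quality": 0.1}},
    {"url": "c.example", "terms": ["train", "station"],
     "factors": {"keywords": 0.5, "proximity": 0.0, "popularity": 1.0, "text_quality": 0.8}},
]
```

In this toy data the page about train stations is dropped because it lacks "pet", and the better-written page outranks the more popular one.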

16 Architecture: TextQuality. Summarizes the probabilities from Linguistic Processing into one metric. Promotes coherent text. Demotes incoherent text.
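
One plausible way to summarize many probabilities into a single metric is a geometric mean, sketched below. This is an assumption for illustration only: the slide does not define TextQuality, and `text_quality` along with the sample probabilities are hypothetical.

```python
import math

def text_quality(sentence_probs):
    """Geometric mean of per-sentence analysis probabilities, each in (0, 1]."""
    if not sentence_probs:
        return 0.0
    log_sum = sum(math.log(p) for p in sentence_probs)
    return math.exp(log_sum / len(sentence_probs))

# Invented probabilities: coherent prose parses confidently, keyword lists do not.
COHERENT = [0.8, 0.9, 0.85]
KEYWORD_STUFFED = [0.1, 0.05, 0.2]
```

The geometric mean is length-normalized, so a long page is not penalized merely for having more sentences, and a few very low-probability sentences pull the score down sharply.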

17 TextQuality (disabled). Promotes well-written pages (preferable from the user perspective). Results with TextQuality DISABLED:
Britney Spears Pictures – britney spears pictures … picture of britney spears, hot pictures of britney spears … britney-spears-pictures.hotyoungstars.com/nude/
Hot Britney Spears Pics - hot britney spears pics,... britney spears, new hot pics of britney spears,... hot-britney-spears-pics.hotyoungstars.com/nude/
Britney Spears Photos – britney spears photos … spears, britney spears nude photos, nude photos of … britney-spears-photos.hotyoungstars.com/nude/

18 TextQuality (enabled). Promotes well-written pages (preferable from the user perspective). Results with TextQuality ENABLED:
Is Britney Spears over the edge? Is Britney Spears over the edge? … Britney Spears is a singer … azwestern.edu/modern_lang/esl/cjones/mag/spring2004/britney.htm
IMPERSONATORS – BRITNEY SPEARS Is Proud to Present! Contact: Gary Shortall Back…
Britney Spears Coke Habit Britney Spears Coke Habit Destroys Her…

19 Other Language Analysis-Enhanced Features. Key phrases: present a list of the salient concepts within the results. Related topics: concepts related to the present query. Hone your search: suggestions of more specific queries. Spell checking. Personalization: "I like Sports but not Politics".

20 Evaluation of Categorization. Naïve Bayes (NB) classifiers are used for illustration: Language Analysis improves accuracy. Infocious actually employs an improved classification technique (76% accuracy). We used four different flavors of NB on 100,000 Web pages: C1: words; C2: words + POS tags; C3: words + extracted concepts; C4: words + POS + extracted concepts.
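
The four classifier flavors differ only in the features handed to Naive Bayes. The sketch below shows one way those feature sets might be assembled; encoding POS information as `word/tag` strings and prefixing concepts with `concept:` are assumptions for the example, not details from the talk.

```python
def features(tagged_words, concepts, variant):
    """Build the feature list for one of the variants C1, C2, C3, C4."""
    if variant not in ("C1", "C2", "C3", "C4"):
        raise ValueError(f"unknown variant: {variant}")
    feats = [word for word, _ in tagged_words]                      # C1: words
    if variant in ("C2", "C4"):
        feats += [f"{word}/{tag}" for word, tag in tagged_words]    # + POS tags
    if variant in ("C3", "C4"):
        feats += [f"concept:{c}" for c in concepts]                 # + concepts
    return feats

# Hand-tagged toy input for illustration.
TAGGED = [("house", "Verb"), ("plants", "Noun")]
CONCEPTS = ["house plants"]
```

Every variant keeps the plain words, so C2 to C4 can only add evidence on top of the C1 baseline.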

21 Evaluation of Categorization. C1: words only; C2: words + POS tags; C3: words + extracted concepts; C4: words + POS + extracted concepts. Result: 3% accuracy increase, 8% error reduction.
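
The two reported figures can be related by simple arithmetic if the 3% is read as absolute accuracy points and the 8% as a relative reduction of the error rate. That reading is an assumption, the chart values are not in this transcript, and both figures are probably rounded, so the helper functions below (hypothetical names) give only a rough consistency check.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original error rate removed by the accuracy gain."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

def implied_baseline_accuracy(abs_gain, rel_reduction):
    """Baseline accuracy at which abs_gain equals rel_reduction of the error."""
    return 1.0 - abs_gain / rel_reduction
```

Under that reading, `implied_baseline_accuracy(0.03, 0.08)` comes out at roughly 0.625, meaning a baseline error near 37.5% of which three points is about 8%.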

22 User Interface

23 Conclusion. Infocious uses language analysis to improve Web search: resolves language ambiguities; incorporates text coherence in the ranking; provides query suggestions and refinements; organizes information intuitively through categorization.

24 Related Work. Web search engines: Google, Yahoo!, MSNSearch, Ask/Teoma, Altavista, Looksmart, Vivisimo, … Enterprise search: Autonomy, Inquira, Inxight, iPhrase, … Answer engines: BrainBoost, …

25 Ongoing work. Increase index size (currently ~1 billion pages) through surface and hidden Web crawls. Apply our Language Analysis algorithms to additional languages. Leverage our language-annotated repository for additional features (e.g. summarization, machine translation, …). Investigate how to use Language Analysis to improve relevance in advertisements.

26 Thank you! You can check out our Search Engine at:

