The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis Alexandros Ntoulas 1,2 Gerald Chao 1 Junghoo Cho 2 1 Infocious Inc.

Presentation transcript:

The Infocious Web Search Engine: Improving Web Searching Through Linguistic Analysis Alexandros Ntoulas 1,2 Gerald Chao 1 Junghoo Cho 2 1 Infocious Inc. {ntoulas, 2 University of California Los Angeles {ntoulas,

2 February 2014 WWW 2005 Chiba Japan

Motivation
- Current Web search engines identify relevant pages based on keyword matching
- Example: jaguar
  Jaguar Cars Official worldwide web site of Jaguar Cars.

Motivation
- Is keyword matching enough?
- Natural languages are inherently ambiguous
- Example: jaguar
  - The car brand?
  - Apple Mac OS X 10.2?
  - The animal?
  - Chemical software …

The Infocious Web Search Engine
Uses Language Analysis techniques to:
- Resolve ambiguities inside Web pages
- Rank the Web pages based on the coherence (quality) of the text
- Help users organize the results in intuitive ways through categorization
- Provide suggestions for query refinement

What is different about Infocious?
- Search engines today do not apply Language Analysis to the level Infocious does
- It is not simply a matter of applying existing algorithms: need optimizations for Web scale
- Features made possible only through language analysis
- Makes Language Analysis features intuitive (yet powerful) for the user

Architecture

Architecture: Crawler
- Follows links to discover Web pages
- Refreshes changed pages using sampling [VLDB02]
- Can download pages from the Hidden Web [JCDL05]
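
The link-following step can be illustrated with a breadth-first traversal. This is only a minimal sketch, not the Infocious crawler: the `crawl` function, the `fetch_links` callback, and the toy link graph are all hypothetical, and the sampling-based refresh and Hidden Web download from the slide are not modeled.

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first discovery: visit each URL once and follow its out-links."""
    seen = set(seed_urls)
    queue = deque(seed_urls)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:  # never enqueue the same page twice
                seen.add(link)
                queue.append(link)
    return order

# Toy link graph standing in for real HTTP fetches (hypothetical URLs).
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

discovered = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
print(discovered)
```

Passing the link extractor as a callback keeps the traversal logic separate from fetching, so the same loop works on a toy graph or on real pages.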

Architecture: Linguistic Processing
- Resolves language ambiguities [COLING02]
- Annotates Web pages
- Extracts concepts
- Extracts named entities
- Operates at crawl speed

Linguistic Processing: Disambiguation
- Part-of-speech (POS) tagging
- Example: house plants
  ... most/Adj house/Noun plants/Noun are/Verb hybrids/Noun of/Prep plant/Noun species/Noun ...
  ... garden/Noun built/VerbD to/Inf house/Verb our/PronP most/Adv valuable/Adj plants/Noun ...
- Done probabilistically: given a sentence S and a set of tags T, find $T_{best}(S) = \arg\max_T P(T \mid S)$
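
One common way to search for the most probable tag sequence is the Viterbi algorithm over a hidden Markov model. The slides do not say which model Infocious uses, so the sketch below is an assumption made for illustration: the tag set is cut down to four tags and every probability in `START`, `TRANS`, and `EMIT` is invented. It maximizes the joint probability P(T, S), which has the same arg max over T as P(T | S).

```python
import math

TAGS = ["Adj", "Noun", "Verb", "Inf"]
FLOOR = 1e-6  # emission probability assumed for word/tag pairs not listed below

# All numbers are made up for illustration.
START = {"Adj": 0.4, "Noun": 0.3, "Inf": 0.2, "Verb": 0.1}
TRANS = {
    "Adj":  {"Noun": 0.9, "Adj": 0.05, "Verb": 0.03, "Inf": 0.02},
    "Noun": {"Noun": 0.4, "Verb": 0.4, "Inf": 0.1, "Adj": 0.1},
    "Inf":  {"Verb": 0.9, "Noun": 0.05, "Adj": 0.03, "Inf": 0.02},
    "Verb": {"Noun": 0.6, "Adj": 0.2, "Inf": 0.1, "Verb": 0.1},
}
EMIT = {
    "Adj":  {"most": 1.0},
    "Noun": {"house": 0.5, "plants": 0.5},
    "Verb": {"house": 0.6, "plants": 0.4},
    "Inf":  {"to": 1.0},
}

def viterbi(words):
    """Return the tag sequence with the highest joint probability for `words`."""
    # Each column maps tag -> (log-prob of the best path ending in tag, previous tag).
    columns = [{
        t: (math.log(START[t]) + math.log(EMIT[t].get(words[0], FLOOR)), None)
        for t in TAGS
    }]
    for word in words[1:]:
        column = {}
        for t in TAGS:
            prev, score = max(
                ((p, columns[-1][p][0] + math.log(TRANS[p][t])) for p in TAGS),
                key=lambda pair: pair[1],
            )
            column[t] = (score + math.log(EMIT[t].get(word, FLOOR)), prev)
        columns.append(column)
    # Backtrack from the best final tag to the start of the sentence.
    tag = max(TAGS, key=lambda t: columns[-1][t][0])
    path = [tag]
    for column in reversed(columns[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["most", "house", "plants"]))
print(viterbi(["to", "house", "plants"]))
```

With these toy numbers, "house" comes out as a noun after "most" and as a verb after "to", mirroring the two readings in the slide's example.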

Linguistic Processing: Disambiguation
- POS information stored inside the index
- User can manually specify POS at query time (or click on examples)
- Query: N:house N:plants
  GreenPatio.Com – Tips for buying house plants. Why keep natural indoor plants.... Tips for buying house plants. Care for indoor plants
  Low Light Plants for the House Is a common name for plants in the species Dieffenbachia.... As with most house plants … plantfacts.htm

Linguistic Processing: Disambiguation
- POS information stored inside the index
- User can manually specify POS at query time (or click on examples)
- Query: V:house N:plants
  Over Wintering Bonsai … One method is to build a cold frame to house your plants in the winter
  Keeping Your Sunroom Cozy … And if you want to house a hot tub or plants, think about enclosing the … doityourself.com/sunroom/sunroomcozy.htm
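
A rough sketch of how POS tags stored in an index could serve queries such as N:house or V:house. The index layout, the `build_index` and `search` helpers, the document ids, and every tag label other than N and V are assumptions made for this example; the slides do not describe the actual index format.

```python
from collections import defaultdict

def build_index(tagged_docs):
    """Map (pos, word) -> doc ids, plus (None, word) -> doc ids for untagged terms."""
    index = defaultdict(set)
    for doc_id, tokens in tagged_docs.items():
        for word, pos in tokens:
            index[(pos, word.lower())].add(doc_id)
            index[(None, word.lower())].add(doc_id)
    return index

def search(index, query):
    """AND-semantics over query terms; 'N:house' restricts 'house' to tag N."""
    result = None
    for term in query.split():
        pos, sep, word = term.partition(":")
        key = (pos, word.lower()) if sep else (None, term.lower())
        postings = index.get(key, set())
        result = postings if result is None else result & postings
    return sorted(result or [])

# Hand-tagged snippets loosely based on the result examples in the slides.
DOCS = {
    "greenpatio": [("tips", "N"), ("for", "P"), ("buying", "V"),
                   ("house", "N"), ("plants", "N")],
    "bonsai": [("build", "V"), ("a", "D"), ("cold", "A"), ("frame", "N"),
               ("to", "I"), ("house", "V"), ("your", "D"), ("plants", "N")],
}
INDEX = build_index(DOCS)

print(search(INDEX, "N:house N:plants"))
print(search(INDEX, "V:house N:plants"))
print(search(INDEX, "house plants"))
```

Indexing each token twice, once with its tag and once without, lets untagged queries keep working while tagged queries narrow the match.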

Linguistic Processing: Disambiguation
- Word-sense disambiguation
- Previous example: jaguar
- Approach through Web page categorization
- Use the categories of DMOZ (~600,000)
- Given a set of categories C and a page d, find $\max_{c \in C} P(c \mid d)$
- In Infocious a page may belong to multiple categories
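
To make the P(c|d) step concrete, here is a small multinomial Naive Bayes sketch with add-one smoothing. It is an illustration only: the slides state that Infocious uses an improved classifier, and the three training snippets, the category "Science/Biology", the function names, and the 0.25 threshold are all invented for this example.

```python
import math
from collections import Counter

def train(labeled_pages):
    """Collect per-category word counts, page counts, and the vocabulary."""
    word_counts, page_counts, vocab = {}, Counter(), set()
    for category, text in labeled_pages:
        words = text.lower().split()
        word_counts.setdefault(category, Counter()).update(words)
        page_counts[category] += 1
        vocab.update(words)
    return word_counts, page_counts, vocab

def posteriors(model, text):
    """Return P(c | d) for every category c, using add-one smoothing."""
    word_counts, page_counts, vocab = model
    total_pages = sum(page_counts.values())
    log_scores = {}
    for c in page_counts:
        total_words = sum(word_counts[c].values())
        score = math.log(page_counts[c] / total_pages)
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(weights.values())
    return {c: w / norm for c, w in weights.items()}

def likely_categories(model, text, threshold=0.25):
    """A page may get several categories: keep all whose posterior passes the threshold."""
    return [c for c, p in posteriors(model, text).items() if p >= threshold]

MODEL = train([
    ("Recreation/Autos", "jaguar cars official site luxury cars dealer"),
    ("Computers", "apple mac os jaguar software release"),
    ("Science/Biology", "jaguar animal habitat rainforest cat"),
])

print(likely_categories(MODEL, "jaguar luxury cars"))
print(likely_categories(MODEL, "jaguar"))
```

On this toy data the bare word "jaguar" stays ambiguous across all three categories, while added context narrows it to one.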

Categorization
- The category of a result is highlighted onMouseOver()
- Allow users to restrict search within a category: jaguar cat:Computers
- Can also be done by clicking on a category
- Example results:
  Jaguar Cars: Official worldwide site of jaguar cars
  Apple Mac OS X: The Apple Mac OS Product page
- Categories shown: Recreation/Autos, Computers
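
A sketch of how a `cat:` restriction might be parsed and applied. Only the query syntax `jaguar cat:Computers` comes from the slide; the `parse_query` and `restrict` helpers, the result records, and the pairing of each result with a category are assumptions for illustration.

```python
def parse_query(query):
    """Split a query into plain terms and an optional 'cat:' restriction."""
    terms, category = [], None
    for token in query.split():
        if token.startswith("cat:"):
            category = token[len("cat:"):]
        else:
            terms.append(token.lower())
    return terms, category

def restrict(results, category):
    """Keep results filed under the category or any of its subcategories."""
    if category is None:
        return list(results)
    return [
        r for r in results
        if any(c == category or c.startswith(category + "/") for c in r["categories"])
    ]

# Hypothetical result records; the category assignments are assumed.
RESULTS = [
    {"title": "Jaguar Cars", "categories": ["Recreation/Autos"]},
    {"title": "Apple Mac OS X", "categories": ["Computers"]},
]

terms, category = parse_query("jaguar cat:Computers")
print(terms, category)
print([r["title"] for r in restrict(RESULTS, category)])
```

Matching on the category path prefix means a restriction such as `cat:Recreation` also covers `Recreation/Autos`.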

Linguistic Processing: Concept Extraction
- More accurate phrase identification: identify concepts through a set of rules (pre-specified or automatically learned)
- Example: VerbPhrase-PrepPhrase-NounPhrase
  lightly tossed with salad dressing
  tossed with oil and vinegar dressing
  tossed immediately with blue-cheese dressing
  Reduced to concept: tossed with dressing
- In the profession of cooking oil is an important ingredient
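
The reduction in the example can be imitated with one hand-written rule over POS-tagged tokens: keep the first verb, the preposition that follows it, and the last noun after that preposition. This is a simplified stand-in, not the Infocious rule set; the `reduce_to_concept` function and the tag labels are assumptions.

```python
def reduce_to_concept(tagged_phrase):
    """Apply a single Verb-Prep-Noun rule; return None when the rule does not match."""
    verb = prep = noun = None
    for word, tag in tagged_phrase:
        if tag == "Verb" and verb is None:
            verb = word
        elif tag == "Prep" and verb is not None and prep is None:
            prep = word
        elif tag == "Noun" and prep is not None:
            noun = word  # overwritten on purpose: the last noun is treated as the head
    if verb is None or prep is None or noun is None:
        return None
    return f"{verb} {prep} {noun}"

# The three phrases from the slide, tagged by hand for this example.
PHRASES = [
    [("lightly", "Adv"), ("tossed", "Verb"), ("with", "Prep"),
     ("salad", "Noun"), ("dressing", "Noun")],
    [("tossed", "Verb"), ("with", "Prep"), ("oil", "Noun"),
     ("and", "Conj"), ("vinegar", "Noun"), ("dressing", "Noun")],
    [("tossed", "Verb"), ("immediately", "Adv"), ("with", "Prep"),
     ("blue-cheese", "Adj"), ("dressing", "Noun")],
]

for phrase in PHRASES:
    print(reduce_to_concept(phrase))
```

Dropping adverbs and noun modifiers is what lets three differently worded phrases collapse to the same concept.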

Answering a query
- Default is AND-semantics
- Query disambiguation (e.g. in the query "train a pet", Infocious knows "train" has to be a verb)
- Ranking takes into account a variety of factors:
  - Presence of keywords, proximity
  - Title, URL, formatting, font size, coloring etc.
  - Popularity of a page measured by in/out links
  - TextQuality
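
The slides list the ranking factors but not how they are combined. One simple possibility, shown purely as an assumption, is a weighted sum over a few of them. The weights, the page fields (`inlinks`, `text_quality`), and the individual signal formulas below are invented; pages that miss any query term are dropped to respect AND-semantics.

```python
import math

# Hypothetical weights; the real combination is not given in the slides.
WEIGHTS = {"proximity": 1.0, "title": 2.0, "popularity": 0.5, "text_quality": 1.5}

def term_positions(text, terms):
    words = text.lower().split()
    return {t: [i for i, w in enumerate(words) if w == t] for t in terms}

def proximity(positions, terms):
    """1.0 when consecutive query terms appear adjacent, lower as they drift apart."""
    if len(terms) < 2:
        return 1.0
    gaps = [
        min(abs(a - b) for a in positions[x] for b in positions[y])
        for x, y in zip(terms, terms[1:])
    ]
    return 1.0 / max(1.0, sum(gaps) / len(gaps))

def score(page, terms):
    positions = term_positions(page["text"], terms)
    if not all(positions[t] for t in terms):
        return None  # AND-semantics: every term must be present
    title_words = set(page["title"].lower().split())
    signals = {
        "proximity": proximity(positions, terms),
        "title": sum(t in title_words for t in terms) / len(terms),
        "popularity": math.log1p(page["inlinks"]),
        "text_quality": page["text_quality"],
    }
    return sum(WEIGHTS[name] * value for name, value in signals.items())

def rank(pages, query):
    terms = query.lower().split()
    if not terms:
        return []
    scored = [(score(p, terms), p["id"]) for p in pages]
    return [pid for s, pid in sorted((sp for sp in scored if sp[0] is not None), reverse=True)]

PAGES = [
    {"id": "A", "title": "House plants care", "inlinks": 10, "text_quality": 0.9,
     "text": "tips for buying house plants and caring for indoor plants"},
    {"id": "B", "title": "cheap deals", "inlinks": 50, "text_quality": 0.2,
     "text": "house deals deals deals plants plants house"},
    {"id": "C", "title": "garden", "inlinks": 5, "text_quality": 0.8,
     "text": "garden furniture sale"},
]

print(rank(PAGES, "house plants"))
```

In this toy setup page B has more in-links than page A, but A wins on title match and text quality, and C is excluded because it contains neither term.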

Architecture: TextQuality
- Summarize probabilities from Linguistic Processing into one metric
- Promote coherent text
- Demote incoherent text
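
The slides do not define how the probabilities are summarized. As an assumption, the sketch below uses the geometric mean of per-token probabilities (for example, the confidence of each tagging decision), so that text the linguistic models find predictable scores near 1 and keyword-stuffed text scores low. The `text_quality` function and the sample numbers are hypothetical.

```python
import math

def text_quality(token_probs):
    """Geometric mean of per-token probabilities; 0.0 for empty input."""
    if not token_probs:
        return 0.0
    # Clamp to avoid log(0) when a model assigns zero probability.
    total = sum(math.log(max(p, 1e-12)) for p in token_probs)
    return math.exp(total / len(token_probs))

# Made-up per-token probabilities for a fluent sentence and a keyword-stuffed one.
coherent = [0.9, 0.8, 0.95, 0.85]
stuffed = [0.3, 0.2, 0.25, 0.3]

print(text_quality(coherent) > text_quality(stuffed))
```

A length-normalized score is used so that long pages are not penalized simply for having more tokens.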

TextQuality (disabled)
- Promotes well-written pages (preferable from the user perspective)
- Results with TextQuality DISABLED:
  Britney Spears Pictures – britney spears pictures … picture of britney spears, hot pictures of britney spears … britney-spears-pictures.hotyoungstars.com/nude/
  Hot Britney Spears Pics - hot britney spears pics,... britney spears, new hot pics of britney spears,... hot-britney-spears-pics.hotyoungstars.com/nude/
  Britney Spears Photos – britney spears photos … spears, britney spears nude photos, nude photos of … britney-spears-photos.hotyoungstars.com/nude/

TextQuality (enabled)
- Promotes well-written pages (preferable from the user perspective)
- Results with TextQuality ENABLED:
  Is Britney Spears over the edge? Is Britney Spears over the edge? … Britney Spears is a singer … azwestern.edu/modern_lang/esl/cjones/mag/spring2004/britney.htm
  IMPERSONATORS – BRITNEY SPEARS Is Proud to Present! Contact: Gary Shortall Back…
  Britney Spears Coke Habit Britney Spears Coke Habit Destroys Her…

Other Language Analysis-Enhanced Features
- Key phrases: Present a list of the salient concepts within the results
- Related topics: Concepts related to the present query
- Hone your search: Suggestion of more specific queries
- Spell checking
- Personalization: "I like Sports but not Politics"

Evaluation of Categorization
- Using Naïve Bayes classifiers for illustration: Language Analysis improves accuracy
- Infocious actually employs an improved classification technique (76% accuracy)
- We used four different flavors of NB on 100,000 Web pages:
  C1: Words
  C2: Words + POS tags
  C3: Words + extracted concepts
  C4: Words + POS + extracted concepts

Evaluation of Categorization
- C1: Words only
- C2: Words + POS tags
- C3: Words + extracted concepts
- C4: Words + POS + extracted concepts
- Result: 3% accuracy increase – 8% error reduction
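
The two figures on this slide are related by simple arithmetic. The transcript does not give the baseline accuracy, so the following is only a consistency check under two assumptions: that the 3% is an absolute gain in accuracy $\Delta$ and that the 8% is a reduction relative to the baseline error $1 - a$. Both figures are presumably rounded, so the implied baseline is approximate.

```latex
\text{relative error reduction} = \frac{\Delta}{1 - a}
\quad\Longrightarrow\quad
1 - a \approx \frac{0.03}{0.08} = 0.375,
\qquad a \approx 0.625
```

Under those assumptions the Naïve Bayes baseline would sit in the low-to-mid 60% range, which is consistent with the slide's remark that the improved classifier used in Infocious reaches 76%.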

User Interface

Conclusion
Infocious uses language analysis to improve Web search:
- Resolves language ambiguities
- Incorporates text coherence in the ranking
- Provides query suggestions and refinements
- Organizes information intuitively through categorization

Related Work
- Web Search Engines: Google, Yahoo!, MSNSearch, Ask/Teoma, Altavista, Looksmart, Vivisimo, …
- Enterprise Search: Autonomy, Inquira, Inxight, iPhrase, …
- Answer Engines: BrainBoost, …

Ongoing work
- Increase index size (currently ~1 billion pages) through surface & hidden Web-crawls
- Apply our Language Analysis algorithms to additional languages
- Leverage our Language-annotated repository for additional features (e.g. summarization, machine translation, …)
- Investigate how to use Language Analysis to improve relevance in advertisements

Thank you!
You can check out our Search Engine at: