Automatic Discovery and Classification of search interface to the Hidden Web
Dean Lee and Richard Sia
Dec 2nd, 2003


Goals and Motivation
- Hidden Webs are informative
- No current search engine can index them (not even Google)

[Figure: Search Interface - search terms]

[Figure: Search Results - results]

Goals and Motivation (continued)
Next-generation search engine:
- Automatic discovery of search interfaces
- Classification/categorization of hidden websites
- Generating queries to search interfaces
- Crawling and indexing of these web pages

Tasks
- Crawling
- Search interface detection
- Domain classification

Crawling
- 2.2M URLs from dmoz
- 1.7M eventually
- Crawled in November
- G/4G - before/after compression
- Root-level web pages only, e.g.

Why root-level only?
- 80% of search interfaces are found on root-level pages (from UIUC)
- Efficient, cost-effective
- 3B web pages compared to 8M web sites

Search Interface Classification
- Most search interfaces are inside <form> tags
- Identify specific features (e.g. keywords, special tags, etc.) that are common to all search interfaces

Search Interface Classification Potential attributes we’ve considered

Action count

Select count

Password field
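The attributes named above could be extracted from a crawled page roughly as follows. This is a minimal sketch using Python's standard html.parser, not the authors' implementation. The class and function names are hypothetical, and the reading of "action count" as the number of <form> tags carrying an action attribute is an assumption, since the slides do not define it.

```python
from html.parser import HTMLParser


class FormFeatureParser(HTMLParser):
    """Collects a few form-related features from one HTML page."""

    def __init__(self):
        super().__init__()
        self.action_count = 0      # ASSUMPTION: <form> tags with an action attribute
        self.select_count = 0      # <select> drop-down menus
        self.has_password = False  # any <input type="password">

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag and attribute names.
        attrs = dict(attrs)
        if tag == "form" and attrs.get("action") is not None:
            self.action_count += 1
        elif tag == "select":
            self.select_count += 1
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.has_password = True


def extract_features(html):
    """Return the feature dictionary for one page of HTML text."""
    parser = FormFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "action_count": parser.action_count,
        "select_count": parser.select_count,
        "has_password": parser.has_password,
    }


search_page = (
    '<form action="/search"><input type="text" name="q">'
    '<select name="cat"><option>All</option></select></form>'
)
print(extract_features(search_page))
```

A password field is a plausible negative signal, since it usually marks a login form rather than a search form. Feature dictionaries like this one would then be fed to the classifier.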

Training sets for C4.5
- Initially only a positive training set
- Several classification iterations using real web data
- For each iteration, add the correctly classified pages to the positive and negative training sets
- For misclassified web pages, do the same

Training set
- 3 iterations seem sufficient
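The iterative procedure on the training-set slides can be sketched as a loop that classifies a fresh batch of pages, checks each page by hand, adds every checked page to the training set under its true label, and retrains. This is only an illustration under stated assumptions. The slides use C4.5, while the trainer below is a 1-nearest-neighbour stand-in chosen so the sketch needs no external library. All function names are hypothetical, and the feature tuples follow an assumed (action count, select count, password flag) layout.

```python
def train_nearest_neighbour(examples):
    """STAND-IN for C4.5: returns a 1-nearest-neighbour predictor."""
    snapshot = list(examples)  # freeze the training data seen by this model

    def predict(features):
        nearest = min(
            snapshot,
            key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], features)),
        )
        return nearest[1]

    return predict


def refine_training_set(seed_examples, batches, train, manual_label):
    """Grow the training set over several classification rounds.

    seed_examples: list of (features, label) pairs (initially positives only)
    batches:       one list of feature tuples per round, taken from real pages
    train:         function mapping examples to a predictor
    manual_label:  the human check, mapping features to the true label
    """
    training = list(seed_examples)
    model = train(training)
    errors_per_round = []
    for batch in batches:
        errors = 0
        for features in batch:
            predicted = model(features)
            actual = manual_label(features)
            if predicted != actual:
                errors += 1
            # Correct and misclassified pages alike join the training set.
            training.append((features, actual))
        errors_per_round.append(errors)
        model = train(training)
    return model, training, errors_per_round


# Hypothetical toy data: (action_count, select_count, has_password)
seeds = [((1, 0, 0), True)]
rounds = [[(1, 0, 1), (1, 1, 0)], [(1, 0, 1)]]
model, training, errors = refine_training_set(
    seeds, rounds, train_nearest_neighbour, lambda f: f[2] == 0
)
print(errors)
```

Tracking the error count per round gives a simple way to judge when further iterations stop helping.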

Results
- Checked via random sampling: select 100 random web pages and manually check the correctness of the classification
- 91.5% accuracy: correctly identifies search interfaces (precision)
- 87.5% accuracy: correctly identifies non-search interfaces

Results
- Random sampling estimation: search interfaces currently exist on our data set
- OCLC estimated about 8.7M unique websites in 2003
- Total # of search interfaces on the web (upper bound)
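One way such an extrapolation could be computed is to scale the corrected hit rate in the crawl up to the OCLC site count. The sketch below assumes that root pages stand in for sites and that the dmoz crawl is representative of the web. The slides do not confirm either assumption, and the transcript does not preserve the number of flagged pages, so the input used in the example is a clearly marked placeholder. The function name is hypothetical.

```python
def estimate_search_interfaces(flagged_pages, crawled_pages, precision, total_sites):
    """Extrapolate the number of search interfaces from a crawl to the whole web.

    flagged_pages: pages the classifier marked as search interfaces
    crawled_pages: pages in the crawl (1.7M in the slides)
    precision:     fraction of flagged pages that are real interfaces (0.915)
    total_sites:   estimated number of web sites (8.7M from OCLC, 2003)
    """
    true_rate = flagged_pages * precision / crawled_pages
    return true_rate * total_sites


# HYPOTHETICAL placeholder: the flagged-page count is missing from the transcript.
hypothetical_flagged = 100_000
estimate = estimate_search_interfaces(
    hypothetical_flagged, 1_700_000, 0.915, 8_700_000
)
print(f"{estimate:,.0f}")
```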

Domain Classification
- Manually extract domain-specific keywords
  - Cars: odometer, mileage, airbag, acura, …
  - Books: ISBN, author, title, publication, …
- 240 keywords used
- 4 target categories {Books, Cars, Entertainment, Travel} + "Others"

Domain Classification
- Naive Bayes classifier
- Bad result
  - Keywords used are not specific enough to distinguish between domains
  - Websites span different topics
- Probabilistic
- Trap of analysis based on content only
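A keyword-based Naive Bayes classifier of the kind named above might look like the following. This is a minimal multinomial sketch with add-one smoothing, restricted to a fixed keyword vocabulary. The function names and the two toy training documents are hypothetical. Only the example keywords come from the slides, and the authors' actual feature encoding is not stated.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs, vocabulary):
    """labelled_docs: list of (list_of_lowercased_words, label) pairs."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    for words, label in labelled_docs:
        class_docs[label] += 1
        for word in words:
            if word in vocabulary:
                word_counts[label][word] += 1

    total_docs = sum(class_docs.values())
    model = {}
    for label, n_docs in class_docs.items():
        # Add-one (Laplace) smoothing over the keyword vocabulary.
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        log_likelihoods = {
            word: math.log((word_counts[label][word] + 1) / denominator)
            for word in vocabulary
        }
        model[label] = (math.log(n_docs / total_docs), log_likelihoods)
    return model


def classify(model, words):
    """Return the label with the highest posterior log-probability."""

    def score(label):
        log_prior, log_likelihoods = model[label]
        return log_prior + sum(
            log_likelihoods[w] for w in words if w in log_likelihoods
        )

    return max(model, key=score)


# Keywords taken from the slides; the training documents are made up.
vocabulary = {"odometer", "mileage", "airbag", "acura",
              "isbn", "author", "title", "publication"}
docs = [
    (["odometer", "mileage", "airbag", "acura"], "Cars"),
    (["isbn", "author", "title", "publication"], "Books"),
]
model = train_naive_bayes(docs, vocabulary)
print(classify(model, ["mileage", "airbag"]))
```

Generic keywords such as "title" occur on pages in every domain, so they pull the class likelihoods together. That is consistent with the slides' remark that the keywords were not specific enough.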

Domain Classification
- C4.5 classification tree
- "Better" result
  - More are classified as "Others"
- Deterministic
- Improvement needed
  - More keywords
  - Link structure
  - Analysis of search results

Conclusion
- A tool for automatic search interface detection
- Rough estimate of the total number of search interfaces -> size of the Hidden Web
- Domain classification
  - Still needs improvement

Some statistics
Precision:
- Books: 34%
- Cars: 41%
- Entertainment: 48%
- Travel: 58%
Some examples – Books – Entertainment – Travel – Others – Cars – Travel Others