Focused Crawling for both Topical Relevance and Quality of Medical Information By Tim Tang, David Hawking, Nick Craswell, Kathy Griffiths CIKM ’05 November, 2005

2 Outline  Problems and motivation  The experiment –Focused crawling –Relevance and quality prediction –The three crawlers –Measures for relevance and quality –Results and findings  Future work

3 Why Health Information on the Web?  The Internet is a free medium  Health information of varying quality  Incorrect health advice may be dangerous  High user demand for Web health information

4 Problems  Relevance (in IR): –Topical relevance based on text –Navigational and distillation relevance based on links  None of these techniques guarantees quality  Our previous study (Tang et al., JIR '05) showed Google returns many low-quality health results -> PageRank does not guarantee quality

5 Problems: Quality of Health Info  Quality of health information is often measured against evidence-based medicine: interventions supported by a systematic review of the evidence as effective.  Low-quality health information often originates from untrusted sources: personal home pages, commercial sites, chat sites, web forums, and even some published materials.

6 Wrong Advice from an Article

7 Dangerous Information from Personal Web Pages

8 Commercial Promotion

9 Why Domain-specific Search?  Impose domain restriction  Results from previous work (Tang et al., JIR '05)  Quality: domain-specific engines (GoogleD, BPS, 4sites) performed much better than Google  Relevance: GoogleD was best  Coverage analysis: BPS & 4sites are poor

10 The Problems of Domain-specific Engines  The current method of building domain-specific engines is very expensive: manual and rule-based.  Example: BluePages Search, a depression portal at the ANU –Manual judgments of health sites by domain experts for two weeks to decide what to include in the index –Low coverage: only 207 Web sites in the index –Tedious maintenance process: Web pages change, cease to exist, new pages appear, etc. -> A quality focused crawler may be a cheaper approach, maintaining high quality while improving coverage

11 The FC Process  Designed to selectively fetch content relevant to a specified topic of interest using the Web's hyperlink structure.  The crawl loop: dequeue {URLs, link info} from the URL frontier -> download -> link extractor -> classifier -> enqueue {URLs, scores} back onto the frontier.  Link info = anchor text, URL, source page's content, and so on.
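The crawl loop above can be sketched as a best-first traversal over a scored frontier. This is a minimal illustration, not the paper's implementation: `fetch`, `extract_links`, and `score_link` are hypothetical stand-ins for the downloader, link extractor, and classifier.

```python
import heapq

def focused_crawl(seed_urls, fetch, extract_links, score_link, max_pages=1000):
    # The URL frontier is a priority queue ordered by predicted score;
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = {url for _, url in frontier}
    crawled = []

    while frontier and len(crawled) < max_pages:
        _, url = heapq.heappop(frontier)      # dequeue best-scoring URL
        page = fetch(url)                     # download
        if page is None:
            continue
        crawled.append(url)
        # Each extracted link carries its link info (anchor text, URL,
        # source page content, ...) for the classifier to score.
        for link_url, link_info in extract_links(url, page):
            if link_url not in seen:
                seen.add(link_url)
                heapq.heappush(frontier, (-score_link(link_info), link_url))
    return crawled
```

Higher-scoring links are fetched before lower-scoring ones, which is what distinguishes a focused crawl from the breadth-first baseline.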

12 Relevance Prediction  Anchor text: the text appearing in a hyperlink  Text around the link: 50 bytes before and after the link  URL words: words formed by parsing the URL address

13 Relevance Indicators  URL: …herapy.html => URL words: depression, com, psychotherapy  Anchor text: psychotherapy  Text around the link: –50 bytes before: section, learn –50 bytes after: talk, therapy, standard, treatment
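Extracting these three indicators is straightforward string processing. A hedged sketch, with illustrative helper names not taken from the paper (the stop-token list and exact tokenisation are assumptions):

```python
import re

def url_words(url):
    # Split the URL on non-alphanumeric characters; drop scheme/markup noise.
    tokens = re.split(r"[^a-zA-Z0-9]+", url.lower())
    return [t for t in tokens if t and t not in ("http", "https", "www", "html", "htm")]

def link_context(page_text, anchor_text, window=50):
    # Take `window` characters of text on either side of the anchor occurrence.
    i = page_text.find(anchor_text)
    if i == -1:
        return "", ""
    before = page_text[max(0, i - window):i]
    after = page_text[i + len(anchor_text):i + len(anchor_text) + window]
    return before, after
```

For instance, `url_words` applied to a URL like `.../psychotherapy.html` yields the URL words shown on the slide.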

14 Methods  Machine learning approach: train and test on relevant and irrelevant Web pages using the discussed indicators.  Evaluated different learning algorithms: k-nearest neighbour, Naïve Bayes, C4.5, Perceptron.  Result: the C4.5 decision tree was the best at predicting relevance.  A Laplace correction formula (Margineantu et al., LNS '02) was used to produce a confidence score ( confidence_level ) at each leaf node of the tree.  The same method was applied to predict quality but was not successful -> link anchor context cannot predict quality
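The Laplace correction at a leaf smooths the raw majority fraction so that small leaves do not yield overconfident scores of exactly 0 or 1. A minimal sketch of the standard form of this correction (the paper's exact variant may differ):

```python
def laplace_confidence(n_majority, n_total, n_classes=2):
    # Laplace-corrected confidence at a decision-tree leaf:
    # (majority count + 1) / (total count + number of classes).
    # A pure leaf of 5 examples gives 6/7 ≈ 0.86 rather than 1.0,
    # and an empty leaf falls back to the uniform prior 1/n_classes.
    return (n_majority + 1) / (n_total + n_classes)
```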

15 Quality Prediction  Using evidence-based medicine, and  Using Relevance Feedback (RF) technique

16 Evidence-based Medicine  Evidence-based treatments were divided into single-word and 2-word terms.  Example: –Cognitive behavioral therapy -> cognitive, behavioral, therapy, cognitive behavioral, behavioral therapy
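The splitting shown in the example is just unigram and bigram generation over the treatment name. A small sketch (the function name is illustrative):

```python
def treatment_terms(phrase):
    # Split a treatment name into its single words plus all
    # adjacent 2-word phrases.
    words = phrase.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

# treatment_terms("Cognitive behavioral therapy")
# -> ['cognitive', 'behavioral', 'therapy',
#     'cognitive behavioral', 'behavioral therapy']
```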

17 Relevance Feedback  Well-known IR approach of query by example.  Basic idea: run an initial query, get feedback from users about which documents are relevant, then add words from the relevant documents to the query.  Goal: add terms to the query in order to get more relevant results.  Usually, 20 terms are added to the query in total

18 Our RF Approach  Used not for relevance, but for quality  Not only single terms, but also phrases  Generate a list of single terms and 2-word phrases with their associated weights  Select the top-weighted terms and phrases  Cut-off point at the lowest-ranked term that appears in the evidence-based treatment list  20 phrases and 29 single words form a 'quality query'
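The selection-with-cutoff step can be sketched as follows. This is an assumed reading of the slide (rank by weight, then truncate at the lowest-ranked term that still appears in the evidence-based list), with illustrative names throughout:

```python
def build_quality_query(weighted_terms, evidence_terms):
    # weighted_terms: list of (term, weight) pairs from relevance feedback.
    # evidence_terms: set of evidence-based treatment terms/phrases.
    ranked = sorted(weighted_terms, key=lambda tw: tw[1], reverse=True)
    # Find the lowest-ranked position holding an evidence-based term;
    # everything above (and including) that position is kept.
    cutoff = max(
        (i for i, (term, _) in enumerate(ranked) if term in evidence_terms),
        default=-1,
    )
    return [term for term, _ in ranked[: cutoff + 1]]
```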

19 Predicting Quality  For downloaded pages, a quality score (QScore) is computed using a modification of the BM25 formula, taking into account term weights.  Quality of a new page is then predicted based on the quality of all the downloaded pages linking to it. (Assumption: there is quality locality; pages with similar content are inter-connected (Davison, SIGIR '00))  Predicted quality score of a page with n downloaded source pages: PScore = (Σ QScore) / n  (Diagram: downloaded source pages P1 … Pn linking to the target page)

20 Combining Relevance and Quality  We need to balance relevance and quality  Combining quality and relevance scores is new  Our method uses the product of the two scores: URLScore = confidence_level * PScore  Other ways to combine these scores will be explored in future work  A quality focused crawler relies on this combined score to order its crawl queue
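The two scoring formulas above reduce to a couple of lines. A minimal sketch, following the slides' definitions (QScore values themselves come from the modified BM25 computation, which is not reproduced here):

```python
def predicted_quality(source_qscores):
    # PScore: mean QScore of the n downloaded pages linking to the target,
    # under the quality-locality assumption.
    if not source_qscores:
        return 0.0
    return sum(source_qscores) / len(source_qscores)

def url_score(confidence_level, pscore):
    # Combined crawl-ordering score: product of the C4.5 relevance
    # confidence and the predicted quality score.
    return confidence_level * pscore
```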

21 The Three Crawlers  The Breadth-first crawler: Traverses the link graph in a FIFO fashion (serves as baseline for comparison)  The Relevance crawler: For topical relevance, ordering the crawl queue using the C4.5 decision tree  The Quality crawler: For both relevance and quality, ordering the crawl queue using the combination of the C4.5 decision tree and RF techniques.

22 Measures  Relevance: the relevance performance of the three crawlers was evaluated using a relevance classifier.  Quality: pages were judged by domain experts using the evidence-based guidelines from the Centre for Evidence Based Mental Health (CEBMH). –Overall quality: taking into account all pages –High- and low-quality categories: the top 25% and the bottom 25% of results in each crawl were compared.

23 Results

24 Relevance

25 Quality

26 High Quality Pages AAQ = Above Average Quality: top 25%

27 Low Quality Pages BAQ = Below Average Quality: bottom 25%

28 Findings  Topical relevance can be predicted using link anchor context.  The relevance feedback technique proved useful for quality prediction.  Domain-specific search portals can be successfully built using focused crawling techniques.

29 Future Work  We experimented with only one health topic. Our plan is to repeat the same experiments with another topic and generalise the technique to other domains.  Other ways of combining relevance and quality should be explored.  Experiments comparing our quality crawl with other health portals are necessary.