Managing Distributed Collections: Evaluating Web Page Change, Movement, and Replacement Richard Furuta and Frank Shipman Center for the Study of Digital.

Presentation transcript:

Managing Distributed Collections: Evaluating Web Page Change, Movement, and Replacement
Richard Furuta and Frank Shipman
Center for the Study of Digital Libraries and the Department of Computer Science
Texas A&M University

Distributed Collections
The Web is continuously changing
  - .gov and .edu pages change less frequently than .com pages (1999)
Collection managers cannot control changes
  - Bookmark lists
  - Yahoo! directories
  - Web portals (NSDL)
  - Walden's Paths

Changes to Items in Collections
Items in collections
  - Play specific roles
  - Are semantically related
      To each other
      To the collection
A change to an item may
  - Change its relationship to the collection
      Less coherent with other items (default assumption)
      More coherent, or no change in relationship
  - Affect the role it plays in the collection
      Less suitable (default assumption)
      More suitable, or no effect on the role

Research Contributions
Develop techniques to help collection managers cope with changes
  - Change, migration, disappearance
Path Manager: a tool that helps collection managers cope with changes
  - Quantity of change
  - Nature of change
  - Relevance to the collection
Dealing with missing pages
  - Find exact matches
  - Suggest similar pages

Management of Distributed Collections
Detection of change is easy
Determination of
  - Quantity of change is relatively easy
  - Relevance of change is less easy
  - Meaning of change is difficult
Approaches
  - Human validation (Yahoo! surfers)
  - Automatic detection of change (Path Manager)

Path Manager: The Tool
Views
  - Collection-level overview
  - Page-level overview
  - Page details
Types of change
  - Content changes (what)
  - Presentation changes (how)
  - Structural changes (linking)
  - Behavioral changes (scripting; not addressed)

Collection-level Overview

Page-level Overview
Page status labels shown in the overview:
  - No change
  - Little change
  - Drastic change
  - Server unreachable
  - 404 error

Page Details
  - Page information
  - Modification details

Context-based Change Detection
Context consists of
  - Content from other pages in the path
  - Annotations created by the author
  - Additional metadata provided by the author
Distinguishes between edited and replaced pages

Evaluation
20 paths, with pages selected from Yahoo! Directories
Each path consisted of 10 to 12 pages
Pages were randomly selected
  - No Flash presentations or images
A page in each path was randomly selected for replacement
Each selected page was replaced by 3 pages
  - CNN Financials (large change)
  - Elephants (large change)
  - A page from the same Yahoo! Directory (small change)

Results: Distribution of Context-based Changes
Replacements resulting in moving towards and away from the context vector

Replacement                            More than -4   -4 to 2      More than 2
By a member of the Yahoo! Directory    1 (5%)         10 (50%)     9 (45%)
By a non-member                        25 (62.5%)     15 (37.5%)   0 (0%)

Experimental thresholds
Distinction between similar and different pages
Managers can now focus on divergent pages

For more information on Walden's Paths
Principal Investigators:
  - Richard Furuta
  - Frank Shipman

Approach
Context Generation Phase
  - Create weighted page term vectors
      W = log(tf) + constant scaling factor
      Known nouns are allocated higher weights
  - Create weighted context term vectors
      Exclude the page for which the context vector is being generated
Change Detection Phase
  - Calculate the page term vector for the changed page
  - Calculate the angle between the new page term vector and the context term vector
  - The difference between the initial and current angle is a measure of the change

Content-based Metrics
Angle between original and replacing pages (in degrees)

Replaced with        Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average
Range                30.8 to                to                    to 84.5
Standard deviation

High angle of change for all cases

Context-based Metrics
Difference in angle to Yahoo directory between original and replacing pages (in degrees)

Replaced with        Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average
Range                -23.2 to               to                    to 14.3
Standard deviation

Results agree with the intuitive expectation

Dealing with Missing Pages
Pages may disappear for a variety of reasons
  - Reconfiguration of Web sites
  - Change or expiration of domain names
Disappearance may be temporary or permanent
Missing pages threaten the integrity of paths
  - Continuity of narrative structure
  - Completeness of collection
Strategies
  - Find new locations for pages that have moved (exact replacements)
  - Find acceptable replacements for pages that have vanished (similar pages)

Approach: Information Extraction Phase
Extract keyphrases from page
  - Extends the "Robust Hyperlinks" approach
  - Tag text in pages with part-of-speech tagger
  - Extract 1, 2 or 3 word keyphrases
  - n, n-n, a-n, n-n-n, a-n-n
  - Use HTML formatting for additional guidance
      Only or tags may separate terms in a phrase
Store separate lists of keyphrases to use for locating exact replacements and similar pages
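The pattern matching step can be illustrated with a short Python sketch. It assumes the text has already been run through a part-of-speech tagger and arrives as (word, tag) pairs, and it reads "n" as noun and "a" as adjective, which the slide does not spell out. The tagger, the tag names, and the function name are hypothetical, and the HTML formatting guidance is not modeled.

```python
# Tag sequences accepted as keyphrases: n, n-n, a-n, n-n-n, a-n-n.
PATTERNS = {
    ("n",),
    ("n", "n"),
    ("a", "n"),
    ("n", "n", "n"),
    ("a", "n", "n"),
}


def extract_keyphrases(tagged):
    """Return every 1, 2 or 3 word phrase whose tags match a pattern.

    `tagged` is a list of (word, tag) pairs from some part-of-speech
    tagger; tags other than "n" and "a" never match.
    """
    phrases = []
    for start in range(len(tagged)):
        for length in (1, 2, 3):
            window = tagged[start:start + length]
            if len(window) < length:
                break  # ran off the end of the text
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                phrases.append(" ".join(word.lower() for word, _ in window))
    return phrases
```

For the tagged input `[("digital", "a"), ("library", "n"), ("collections", "n"), ("change", "v")]` the sketch yields "digital library", "digital library collections", "library", "library collections", and "collections".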

Approach: Locating Exact Replacements
Keyphrases help discriminate this page from others on the Web
  - TF-IDF-based measure
  - Spelling mistakes and unusually uncommon words are most valuable
Order keyphrase list by decreasing value of TF-IDF measure
While locating pages
  - Begin with a (user-specified) keyphrase set
  - Search for pages that match these terms
  - Add a term and retry until the result set is as desired
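The "add a term and retry" loop might look like the following Python sketch. The `search` callable stands in for whatever search engine is queried and is hypothetical, as are the function names and the `start_size` and `max_results` parameters. The TF-IDF-based scores are taken as given, since the slides do not state the exact formula.

```python
def rank_by_score(scores):
    """Order keyphrases by decreasing score (for example TF-IDF)."""
    return sorted(scores, key=lambda phrase: scores[phrase], reverse=True)


def locate_exact(ranked, search, start_size=2, max_results=1):
    """Grow the query one keyphrase at a time until few pages match.

    `ranked` is the keyphrase list, best first. `search` is a
    caller-supplied function mapping a list of terms to matching URLs.
    Returns the terms used and the results of the final query.
    """
    size = min(start_size, len(ranked))
    while True:
        terms = ranked[:size]
        results = search(terms)
        # Stop when the result set is small enough or no terms remain.
        if len(results) <= max_results or size >= len(ranked):
            return terms, results
        size += 1
```

Starting from the highest-ranked keyphrases keeps the query short; each added term narrows the result set further.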

Approach: Finding Similar Pages
Rare phrases hinder search for similar pages
  - Weed out phrases that have occurred less frequently than a certain threshold value
  - Remaining phrases are then ordered by decreasing value of their TF-IDF measure
While locating pages
  - Start with the most restrictive set of phrases
  - Reduce one phrase at a time until the desired result size is achieved
Similarity is contextual
  - Varies from person to person
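A rough Python sketch of this relaxation loop follows. Several details are assumptions rather than statements from the slides: `phrase_freq` is a hypothetical mapping from phrase to how often it has occurred, the lowest-scored phrase is the one dropped at each step, `search` is a caller-supplied stand-in for a search engine, and `min_freq` and `min_results` are illustrative parameters.

```python
def find_similar(phrase_freq, scores, search, min_freq=2, min_results=3):
    """Relax the query one phrase at a time until enough pages match.

    Phrases seen fewer than `min_freq` times are weeded out first. The
    rest are ordered by decreasing score, and the query starts with all
    of them (the most restrictive set).
    """
    kept = [p for p, freq in phrase_freq.items() if freq >= min_freq]
    ranked = sorted(kept, key=lambda p: scores[p], reverse=True)
    while ranked:
        results = search(ranked)
        if len(results) >= min_results:
            return ranked, results
        ranked = ranked[:-1]  # assumed: drop the lowest-scored phrase
    return [], []
```

This runs in the opposite direction from the exact-replacement search: the query begins with every remaining phrase and loosens until the result set is large enough to offer candidates.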