Managing Distributed Collections: Evaluating Web Page Changes, Movement, and Replacement Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla,

Presentation transcript:

Managing Distributed Collections: Evaluating Web Page Changes, Movement, and Replacement
Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla, Richard Furuta, Unmil Karadkar, Frank Shipman
Center for the Study of Digital Libraries and the Department of Computer Science
Texas A&M University

Distributed Collections
The Web is eternally changing
  – .gov and .edu pages change less frequently than .com pages (1999)
Collection managers cannot control changes
  – Bookmark lists
  – Yahoo! directories
  – Web portals (NSDL)
  – Walden’s Paths

Walden’s Paths
The Walden’s Paths Project is developing tools to help you organize, annotate, and maintain collections of different locations chosen from the World-Wide Web.
[Figure labels: Controls; Annotation; Original Web page]

[Figure labels for the path interface:]
  – Information about the path
  – Paths in the system and in the subcollection holding the current path
  – Currently on stop 3 of 6
  – Scroll stop list
  – Current path’s name
  – Source for content
  – Switch to presentation mode

Management of Distributed Collections
Detection of change is easy
Determination of
  – Quantity of change is relatively easy
  – Relevance of change is less easy
  – Quality of change is difficult
Approaches
  – Human validation (Yahoo! surfers)
  – Automatic detection of change (Path Manager)

Path Manager
Features
  – Java-based
  – HTML pages
  – Web page state
      Original
      Last validated
      Most recent
Kinds of change
  – Content changes (what)
  – Presentation changes (how)
  – Structural changes (linking)
  – Behavioral changes
  – Paths or bookmark lists
  – Signatures
      Paragraphs
      Headings
      Images
      Links
      Checksum
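The slide lists the parts of a page that signatures cover but not how they are computed. The following is only a sketch of one way such per-part signatures could work, not the Path Manager's actual code (which the slide says is Java-based). The SHA-256 digests, the `SignatureParser` class, and the `page_signature` and `changed_parts` helpers are all assumptions made for illustration.

```python
import hashlib
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SignatureParser(HTMLParser):
    """Collects paragraph text, heading text, image sources, and link targets."""

    def __init__(self):
        super().__init__()
        self.parts = {"paragraphs": [], "headings": [], "images": [], "links": []}
        self._open = None  # which text bucket is currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._open = "paragraphs"
        elif tag in HEADINGS:
            self._open = "headings"
        elif tag == "img":
            self.parts["images"].append(attrs.get("src") or "")
        elif tag == "a" and attrs.get("href"):
            self.parts["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p" or tag in HEADINGS:
            self._open = None

    def handle_data(self, data):
        if self._open and data.strip():
            self.parts[self._open].append(data.strip())


def digest(items):
    """Hash a list of strings (assumed scheme: SHA-256 over joined items)."""
    return hashlib.sha256("\n".join(items).encode("utf-8")).hexdigest()


def page_signature(html):
    """Return one digest per page part plus a whole-page checksum."""
    parser = SignatureParser()
    parser.feed(html)
    parser.close()
    signature = {kind: digest(items) for kind, items in parser.parts.items()}
    signature["checksum"] = digest([html])
    return signature


def changed_parts(old, new):
    """Names of the signature parts that differ between two page states."""
    return sorted(kind for kind in old if old[kind] != new[kind])
```

Comparing the signature of the last validated state with that of the most recent state then shows which kinds of elements changed, for example only the paragraphs.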

Path Manager
[Figure labels:]
  – Path Status Overview
  – Page Status Overview
  – Little change
  – Server unreachable
  – 404 error
  – No change
  – Drastic change
  – Page Information
  – Modification details
  – Quantitative information

Is ALL Change Bad?
Quite unrelated to the philosophical question
Items in a collection
  – Play specific roles
  – Are semantically related
      To each other
      To the collection
Change to an item may
  – Change its relationship to the collection
      Less coherent with other items (default assumption)
      More coherent or no change in relationship
  – Affect the role it plays in the collection
      Less suitable (default assumption)
      More suitable or no effect on the role

Context-based Change Detection
Augments our content-based approach
Context consists of
  – Content from other pages in the path
  – Annotations created by the author
  – Additional metadata provided by the author
Distinguishes between edited and replaced pages

Approach
Context Generation Phase
  – Create weighted page term vectors
      W = log(tf) + constant scaling factor
      Known nouns are allocated higher weights
  – Create weighted context term vectors
      Exclude the page for which the context vector is being generated
Change Detection Phase
  – Calculate page term vector for changed page
  – Calculate the angle between new page term vector and context term vector
  – Difference between initial and current angle is a measure of the change
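The two phases above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. It assumes whitespace tokenization, a scaling constant of 1.0, and a context vector built by summing the term vectors of the other pages. It omits the extra weight for known nouns and ignores annotations and metadata. All function names are hypothetical.

```python
import math
from collections import Counter

SCALE = 1.0  # assumed value for the slide's "constant scaling factor"


def term_vector(text, scale=SCALE):
    """Weight each term as log(tf) + scale (noun boosting is omitted here)."""
    counts = Counter(text.lower().split())
    return {term: math.log(tf) + scale for term, tf in counts.items()}


def context_vector(pages, exclude_index):
    """Sum the term vectors of every page in the path except the excluded one."""
    context = Counter()
    for i, page in enumerate(pages):
        if i != exclude_index:
            context.update(term_vector(page))
    return dict(context)


def angle_degrees(u, v):
    """Angle between two sparse vectors, in degrees."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 90.0  # treat an empty vector as orthogonal to everything
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def context_change(pages, index, new_text):
    """Initial angle minus current angle for the page at `index`.

    Positive values mean the new page moved toward the context,
    negative values mean it moved away.
    """
    context = context_vector(pages, index)
    initial = angle_degrees(term_vector(pages[index]), context)
    current = angle_degrees(term_vector(new_text), context)
    return initial - current
```

With this sign convention, replacing a page with unrelated text yields a negative difference, while an unchanged page yields a difference of zero.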

Evaluation
20 paths, pages selected from Yahoo! Directories
Each path consisted of 10 to 12 pages
Pages were randomly selected
  – no flash presentations or images
A page in each path was randomly selected for replacement
Each selected page was replaced by 3 pages
  – CNN Financials (large change)
  – Elephants (large change)
  – A page from the same Yahoo! Directory (small change)
Intuitive expectation

Results – Content-based Metrics
Angle between original and replacing pages (in degrees)

Replaced with         Page about elephants    CNN Financials page    Page in same Yahoo! directory
Average
Range                 30.8 to                 to                     to 84.5
Standard deviation

High angle of change for all cases (even for similar pages)

Results – Context-based Metrics
Difference in angle to Yahoo directory between original and replacing pages (in degrees)

Replaced with         Page about elephants    CNN Financials page    Page in same Yahoo! directory
Average
Range                 -23.2 to                to                     to 14.3
Standard deviation

In line with intuitive expectation

Results – Distribution of Context-based changes
Replacements resulting in moving towards and away from the context vector

                                                   Below -4      Between -4 and 2    More than 2
Replacement by a member of the Yahoo! Directory    1 (5%)        10 (50%)            9 (45%)
Replacement by non-member                          25 (62.5%)    15 (37.5%)          0 (0%)

Preliminary experimental thresholds
  – A change of 2 degrees or more converges the page to the path
  – A change between -4 and 2 degrees is treated as no change
  – A change below -4 degrees (more negative than -4) diverges the page from the path
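The preliminary thresholds amount to a three-way classification of the angle difference. A minimal sketch follows. The function name and the handling of a value of exactly -4 (treated here as no change) are assumptions, since the slide does not specify the boundary.

```python
CONVERGE_THRESHOLD = 2.0   # degrees, from the slide
DIVERGE_THRESHOLD = -4.0   # degrees, from the slide


def classify_change(angle_difference):
    """Map an angle difference in degrees to one of three outcomes."""
    if angle_difference >= CONVERGE_THRESHOLD:
        return "converges"
    if angle_difference < DIVERGE_THRESHOLD:
        return "diverges"
    # Assumed boundary handling: exactly -4 counts as no change.
    return "no change"
```

The two reported extremes of the context-based range fall on opposite sides: `classify_change(-23.2)` gives "diverges" and `classify_change(14.3)` gives "converges".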

Dealing with Missing Pages
Pages may disappear due to a variety of reasons
  – Reconfiguration of Web sites
  – Change or expiration of domain names
Temporary or permanent
Threaten integrity of paths
  – Continuity of narrative structure
  – Completeness of collection
Strategies
  – Find new locations for pages that have moved (exact replacements)
  – Find acceptable replacements for pages that have vanished (similar pages)

Approach – Information Extraction Phase
Extract keyphrases from page
  – Extends the “Robust Hyperlinks” approach
  – Tag text in pages with part-of-speech tagger
  – Extract 1, 2 or 3 word keyphrases
  – n, n-n, a-n, n-n-n, a-n-n
  – Use HTML formatting for additional guidance
      Only or tags may separate terms in a phrase
Store separate lists of keyphrases to use for locating exact replacements and similar pages
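The five tag patterns can be matched with a sliding window over tagged text. The sketch below assumes the text has already been tagged by some part-of-speech tagger and reduced to simplified tags, with "n" for noun and "a" for adjective. It does not use the HTML-formatting guidance. The function name and the input format are hypothetical.

```python
# Tag patterns listed on the slide: n, n-n, a-n, n-n-n, a-n-n
PATTERNS = {
    ("n",),
    ("n", "n"),
    ("a", "n"),
    ("n", "n", "n"),
    ("a", "n", "n"),
}


def extract_keyphrases(tagged_tokens):
    """Return every 1-, 2-, or 3-word phrase whose tags match a pattern.

    `tagged_tokens` is a list of (word, tag) pairs with simplified tags.
    """
    phrases = []
    for start in range(len(tagged_tokens)):
        for length in (1, 2, 3):
            window = tagged_tokens[start:start + length]
            if len(window) < length:
                break  # ran off the end of the text
            tags = tuple(tag for _, tag in window)
            if tags in PATTERNS:
                phrases.append(" ".join(word for word, _ in window))
    return phrases
```

For the tagged input `[("digital", "a"), ("library", "n"), ("collection", "n"), ("changes", "v")]` this yields "digital library", "digital library collection", "library", "library collection", and "collection".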

Approach – Locating Exact Replacements
Keyphrases help discriminate this page from others on the Web
TF-IDF-based measure
Spelling mistakes and unusually uncommon words are most valuable
Order keyphrase list by decreasing value of TF-IDF measure
While locating pages
  – Begin with a (user-specified) keyphrase set
  – Search for pages that match these terms
  – Add a term and retry until the result set is as desired
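The narrowing loop can be sketched as follows. This is a sketch under stated assumptions, not the authors' code: a small in-memory index stands in for a Web search engine, the smoothed TF-IDF formula is one common variant rather than the measure the authors used, and `rank_keyphrases`, `locate_exact`, and `demo_search` are hypothetical names.

```python
import math


def tf_idf(phrase, page_phrases, corpus):
    """Smoothed TF-IDF of a phrase (an assumed variant of the measure)."""
    tf = page_phrases.count(phrase)
    df = sum(1 for doc in corpus if phrase in doc)
    return tf * math.log((1 + len(corpus)) / (1 + df))


def rank_keyphrases(page_phrases, corpus):
    """Order the page's distinct phrases by decreasing TF-IDF."""
    unique = sorted(set(page_phrases))
    return sorted(unique, key=lambda p: tf_idf(p, page_phrases, corpus), reverse=True)


def locate_exact(ranked, search, initial_size=1, max_results=1):
    """Add one phrase at a time until the result set is small enough."""
    size = initial_size
    results = search(ranked[:size])
    while len(results) > max_results and size < len(ranked):
        size += 1
        results = search(ranked[:size])
    return results


# Toy stand-in for a search engine: URL -> phrases found on that page.
DEMO_INDEX = {
    "u1": {"walden", "paths", "shipmann"},  # "shipmann" is a deliberate misspelling
    "u2": {"walden", "paths"},
    "u3": {"walden"},
}


def demo_search(terms):
    """Return the URLs whose pages contain every query term."""
    return sorted(url for url, phrases in DEMO_INDEX.items()
                  if all(term in phrases for term in terms))


DEMO_RANKED = rank_keyphrases(["walden", "paths", "shipmann", "walden"],
                              list(DEMO_INDEX.values()))
```

In the toy index the misspelled phrase ranks first because it occurs on only one page, which matches the slide's point that spelling mistakes and uncommon words discriminate best.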

Approach – Finding Similar Pages
Rare phrases hinder search for similar pages
Weed out phrases that have occurred less frequently than a certain threshold value
Remaining phrases are then ordered by decreasing value of their TF-IDF measure
While locating pages
  – Start with the most restrictive set of phrases
  – Reduce one phrase at a time until the desired result size is achieved
Similarity is contextual
Varies from person to person
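The search for similar pages runs the loop in the opposite direction: it starts with all phrases and relaxes the query. In the sketch below the phrase scores and frequencies are supplied by the caller, and a toy in-memory index replaces the search engine. Dropping the lowest-ranked phrase first is an assumption, since the slide does not say which phrase is removed at each step. All names are hypothetical.

```python
def rank_for_similarity(scores, frequency, min_frequency):
    """Drop phrases below the frequency threshold, then sort by decreasing score.

    `scores` maps phrase -> TF-IDF value; `frequency` maps phrase -> occurrences.
    """
    kept = [p for p in scores if frequency.get(p, 0) >= min_frequency]
    return sorted(kept, key=lambda p: scores[p], reverse=True)


def find_similar(ranked, search, min_results):
    """Start with every phrase and remove one at a time until enough pages match."""
    terms = list(ranked)
    results = search(terms)
    while len(results) < min_results and len(terms) > 1:
        terms.pop()  # assumed policy: drop the lowest-ranked remaining phrase
        results = search(terms)
    return results


# Toy stand-in for a search engine: URL -> phrases found on that page.
SIMILAR_INDEX = {
    "u1": {"digital library", "hypertext", "path"},
    "u2": {"digital library", "hypertext"},
    "u3": {"digital library"},
}


def similar_search(terms):
    """Return the URLs whose pages contain every query phrase."""
    return sorted(url for url, phrases in SIMILAR_INDEX.items()
                  if all(term in phrases for term in terms))


SIMILAR_RANKED = rank_for_similarity(
    scores={"digital library": 0.9, "hypertext": 0.6, "path": 0.4, "waldn": 2.0},
    frequency={"digital library": 5, "hypertext": 3, "path": 2, "waldn": 1},
    min_frequency=2,
)
```

The rare phrase "waldn" has the highest score yet is removed by the frequency threshold, illustrating why rare phrases that help locate an exact replacement hinder the search for similar pages.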

Future Work
Integrate context-based algorithms with Path Manager
Integrate the algorithms for locating page replacements
Improve algorithms for keyphrase extraction
Etc. etc.

For more information on Walden’s Paths
Principal Investigators: Richard Furuta, Frank Shipman