Managing Change on the Web
Luis Francisco-Revilla, Frank M. Shipman, Richard Furuta, Unmil Karadkar, Avital Arora
Center for the Study of Digital Libraries
Texas A&M University
What is this talk about?
- A system approach to help in managing digital libraries whose collections consist of fluid resources with distributed location and ownership
Modern paradigms of digital libraries
- Pointers rather than the resources
  - Web-based collections
  - NSDL
  - Meta-documents
- High fluidity
  - Changes vary in relevance
  - Little system aid for assessing relevance of changes
This is a problem everybody has:
- Bookmark lists
- Yahoo! catalogues
- Search engine indices
Related work
- David Johnson, PhD dissertation, University of Washington
  - Document distance
  - Weighted, asymmetric
- Change monitoring systems
  - AIDE, URL Minder, WatzNew
  - Fine-grained yes/no detection
  - WebWatcher (evolving)
- “Interesting” identification
  - Syskill & Webert, Do-I-Care-Agent, Letizia
  - Personal, reader-specific, profile-based
Motivation
- Managing the Walden’s Paths collection
- Paths are meta-documents
  - Sequential arrangement of Web pages
  - Rhetorically coherent
  - Contextualized
- Distributed ownership
- Distributed authorship
- Continuous revision of the collection
Mechanisms for addressing the issue
- Caching the pages
  - Caching strategies
  - Some changes are desirable
- Fluid paths
- Ephemeral paths
- Rhetorical coherence
The real issue
- These mechanisms allowed only a limited reaction to changes
- Detecting changes is easy, but determining their relevance is difficult
- Humans are still required to determine the significance of changes
- Reacting to changes requires first assessing their relevance
The perception of change (overview)
- Observe how humans perceive changes to Web pages
- Inform and evaluate the approach and design
- Questions
  1. Do people view the same changes differently when given different amounts of time?
  2. What kinds of changes are easily perceived?
  3. Of what kinds of changes do users want to be notified?
Kinds of change
- Content changes (what)
- Presentation changes (how)
- Structural changes (linking)
- Behavioral changes
Results and implications
- Presentation changes were usually perceived as irrelevant
- The desire for notification and the perception of overall change increased with the degree of content change
- Time played a larger role in the perception of structural changes than of content changes
- As the degree of structural change increased, so did the desire for notification
  - Links are useful metrics
Path Manager: the system
- Java-based
- Paths or bookmark lists
- HTML pages
- Functional state of the document
  - Original
  - Valid
  - Last-time
Algorithms
- Variation of Johnson
  - Weighted sum of additions, deletions and modifications for each metric
  - Added metric for structure changes
  - Flexible
  - Asymmetric
  - Lacks normalization
- Proportional
  - Determines the proportion of modification for each metric
  - Simple
  - Symmetrical
  - Normalized
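The contrast between the two measures can be illustrated with a minimal Python sketch. It is not the Path Manager's implementation (the system itself is Java-based). The `Weights` values, the function names, and the choice of symmetric difference over union for the proportional measure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Hypothetical per-metric weights; the values are illustrative only."""
    added: float = 1.0
    deleted: float = 2.0
    modified: float = 0.5


def weighted_change(added: int, deleted: int, modified: int,
                    w: Weights = Weights()) -> float:
    """Weighted sum of additions, deletions and modifications for one metric.

    The result is unbounded (not normalized). With unequal weights it is
    asymmetric: comparing the versions in the opposite order swaps the
    added and deleted counts and gives a different score.
    """
    return w.added * added + w.deleted * deleted + w.modified * modified


def proportional_change(old_items, new_items) -> float:
    """Share of items that differ between two versions, in [0, 1].

    Assumed definition: size of the symmetric difference divided by the
    size of the union. Swapping the arguments gives the same value.
    """
    old, new = set(old_items), set(new_items)
    union = old | new
    if not union:
        return 0.0  # two empty versions: nothing changed
    return len(old ^ new) / len(union)
```

For example, with links as the metric, `proportional_change({"a", "b", "c"}, {"b", "c", "d"})` counts two differing links out of four distinct ones, whereas `weighted_change(added=1, deleted=1, modified=0)` depends entirely on the chosen weights.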
Initial interface
Overall change relevance assessment
Document signatures
- Paragraph checksums
- Headlines
- Links
- Keywords
- Global checksum
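A signature of this kind might be computed roughly as in the Python sketch below. It uses only the standard library and assumes reasonably well-formed HTML with explicitly closed `<p>` and heading tags. The `Signature` class, the helper names, and the use of SHA-256 are assumptions; keywords are left out because the slides do not say how they are selected.

```python
import hashlib
from dataclasses import dataclass
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class _Collector(HTMLParser):
    """Collects paragraph text, heading text and link targets."""

    def __init__(self):
        super().__init__()
        self.paragraphs, self.headlines, self.links = [], [], []
        self._open_tag = None  # the <p> or heading tag being collected
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag == "p" or tag in HEADING_TAGS:
            self._open_tag, self._buffer = tag, []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Collapse whitespace so layout-only edits do not alter the text.
            text = " ".join("".join(self._buffer).split())
            if text:
                target = self.paragraphs if tag == "p" else self.headlines
                target.append(text)
            self._open_tag = None


def _checksum(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Signature:
    paragraph_checksums: tuple
    headlines: tuple
    links: tuple
    global_checksum: str


def make_signature(html: str) -> Signature:
    """Build a signature for one page (keywords intentionally omitted)."""
    collector = _Collector()
    collector.feed(html)
    collector.close()
    return Signature(
        paragraph_checksums=tuple(_checksum(p) for p in collector.paragraphs),
        headlines=tuple(collector.headlines),
        links=tuple(collector.links),
        global_checksum=_checksum(html),
    )
```

Comparing two signatures field by field shows where a page changed: a differing global checksum says only that something changed, while the per-paragraph checksums, headlines and links indicate which part did.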
View of change metrics
Detailed view of page metrics
Path information
Web page retrieval and connectivity
- Potentially slow and unpredictable
- Parallel retrieval
  - Multi-threaded
  - Multiple attempts and retries
- Different states
  - Connection state
  - Retrieval state
  - Analysis state
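Parallel retrieval with retries can be sketched in Python as follows. This is an illustration under assumptions, not the Path Manager's Java code: `PageResult`, `http_fetch`, `fetch_with_retries`, `retrieve_all`, and the default attempt and worker counts are hypothetical, and the connection, retrieval and analysis states are not modelled.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Optional
from urllib.request import urlopen


@dataclass
class PageResult:
    url: str
    content: Optional[bytes]  # None if every attempt failed
    attempts: int             # number of attempts actually made
    error: Optional[str] = None


def http_fetch(url: str, timeout: float = 10.0) -> bytes:
    """Default fetcher: a plain HTTP GET with a timeout."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_retries(url: str,
                       fetch: Callable[[str], bytes] = http_fetch,
                       max_attempts: int = 3) -> PageResult:
    """Try one URL up to max_attempts times, keeping the last error."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return PageResult(url, fetch(url), attempt)
        except Exception as exc:  # broad on purpose: any failure triggers a retry
            last_error = repr(exc)
    return PageResult(url, None, max_attempts, last_error)


def retrieve_all(urls: Iterable[str],
                 fetch: Callable[[str], bytes] = http_fetch,
                 max_attempts: int = 3,
                 workers: int = 8) -> Dict[str, PageResult]:
    """Fetch all URLs on a thread pool so one slow page does not block the rest."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda u: fetch_with_retries(u, fetch, max_attempts), urls)
        return {result.url: result for result in results}
```

Passing the fetcher as a parameter keeps the retry logic testable without network access. A fuller version would likely wait between attempts and record which stage a page reached.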
Challenges and limitations
- Heuristic identification of document structure (i.e., headings)
- Indirection
- Behavior
- Dynamic pages
Conclusions
- Managing distributed collections of documents remains challenging and time-consuming, and still requires human assistance
- The Path Manager supports the maintenance of collections of Web pages by
  - recognizing, evaluating and informing the user of relevant changes
  - keeping track of the original, valid and last-time states of Web pages
- The study indicated a desire for structural changes to be included in the determination of overall change
Contact information
- Luis Francisco-Revilla
- Frank M. Shipman, III
- Richard Furuta
- Unmil Karadkar
- Avital Arora