Managing Distributed Collections: Evaluating Web Page Changes, Movement, and Replacement Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla,


1 Managing Distributed Collections: Evaluating Web Page Changes, Movement, and Replacement Zubin Dalal, Suvendu Dash, Pratik Dave, Luis Francisco-Revilla, Richard Furuta, Unmil Karadkar, Frank Shipman Center for the Study of Digital Libraries and the Department of Computer Science Texas A&M University

2 Distributed Collections The Web is eternally changing –.gov and .edu pages change less frequently than .com pages (1999) Collection managers cannot control changes –Bookmark lists –Yahoo! directories –Web portals (NSDL) –Walden’s Paths

3 Walden’s Paths The Walden’s Paths Project is developing tools to help you organize, annotate, and maintain collections of different locations chosen from the World-Wide Web. (Screenshot labels: Controls; Annotation; Original Web page)

4 (Screenshot labels) Information about the path; Paths in the system and in the subcollection holding the current path; Currently on stop 3 of 6; Scroll stop list; Current path’s name; Source for content; Switch to presentation mode

5 Management of Distributed Collections Detection of change is easy Determination of –Quantity of change is relatively easy –Relevance of change is less easy –Quality of change is difficult Approaches –Human validation (Yahoo! surfers) –Automatic detection of change (Path Manager)

6 Path Manager Features –Java-based –HTML pages –Web page state (Original, Last validated, Most recent) –Paths or bookmark lists –Signatures (Paragraphs, Headings, Images, Links, Checksum) Kinds of change –Content changes (what) –Presentation changes (how) –Structural changes (linking) –Behavioral changes
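The signature idea on this slide can be illustrated with a minimal sketch. It is not Path Manager's implementation (which the slide describes as Java-based); it is a Python illustration that assumes a signature is one hash per element type plus a whole-page checksum. The names `SignatureParser`, `page_signature`, and `changed_parts` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class SignatureParser(HTMLParser):
    """Collects the element types listed on the slide (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = {"paragraphs": [], "headings": [], "images": [], "links": []}
        self._current = None  # text category we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._current = "paragraphs"
            self.parts["paragraphs"].append("")
        elif tag in HEADINGS:
            self._current = "headings"
            self.parts["headings"].append("")
        elif tag == "img" and attrs.get("src"):
            self.parts["images"].append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.parts["links"].append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "p" or tag in HEADINGS:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.parts[self._current][-1] += data


def digest(items):
    """Hash a list of strings after collapsing runs of whitespace."""
    joined = "\n".join(" ".join(item.split()) for item in items)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def page_signature(html):
    """Assumed signature layout: one hash per element type plus a page checksum."""
    parser = SignatureParser()
    parser.feed(html)
    signature = {name: digest(items) for name, items in parser.parts.items()}
    signature["checksum"] = hashlib.sha256(html.encode("utf-8")).hexdigest()
    return signature


def changed_parts(old_signature, new_signature):
    """Names of the signature components that differ between two page states."""
    return sorted(k for k in old_signature if old_signature[k] != new_signature[k])
```

Comparing the signature of the last validated state against the most recent one then shows which kinds of elements changed, which is cheaper than keeping full copies of every page state.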

7 Path Manager (Screenshot labels) Path Status Overview; Page Status Overview; Little change; Server unreachable; 404 error; No change; Drastic change; Page Information; Modification details; Quantitative information

8 Is ALL Change Bad? Quite unrelated to the philosophical question Items in a collection –Play specific roles –Are semantically related To each other To the collection Change to an item may –Change its relationship to the collection Less coherent with other items (default assumption) More coherent or no change in relationship –Affect the role it plays in the collection Less suitable (default assumption) More suitable or no effect on the role

9 Context-based Change Detection Augments our content-based approach Context consists of –Content from other pages in the path –Annotations created by the author –Additional metadata provided by the author Distinguishes between edited and replaced pages

10 Approach Context Generation Phase –Create weighted page term vectors W = log(tf) + constant scaling factor Known nouns are allocated higher weights –Create weighted context term vectors Exclude the page for which context vector is being generated Change Detection Phase –Calculate page term vector for changed page –Calculate the angle between new page term vector and context term vector –Difference between initial and current angle is a measure of the change
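The two phases can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the constant `SCALE`, the multiplier `NOUN_BOOST`, the regex tokenizer, and building the context vector by summing the other pages' vectors are all hypothetical choices. The sign convention (initial angle minus current angle, so that moving away from the context gives a negative value) is inferred from the negative averages reported for unrelated replacements in the results slides.

```python
import math
import re
from collections import Counter

SCALE = 1.0       # hypothetical constant added to log(tf)
NOUN_BOOST = 2.0  # hypothetical extra weight for known nouns


def term_vector(text, known_nouns=frozenset()):
    """Weighted term vector with W = log(tf) + SCALE, boosted for known nouns."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vector = {}
    for term, tf in counts.items():
        weight = math.log(tf) + SCALE
        if term in known_nouns:
            weight *= NOUN_BOOST
        vector[term] = weight
    return vector


def context_vector(pages, skip_index, known_nouns=frozenset()):
    """Sum the vectors of every page in the path except pages[skip_index]."""
    context = Counter()
    for i, text in enumerate(pages):
        if i != skip_index:
            context.update(term_vector(text, known_nouns))
    return dict(context)


def angle_degrees(a, b):
    """Angle between two sparse vectors; 90 degrees if either vector is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    if norm == 0:
        return 90.0
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def context_change(original, current, context):
    """Initial angle minus current angle: negative means the page moved away."""
    return angle_degrees(original, context) - angle_degrees(current, context)
```

The context vector is computed once per page when the path is validated; only the changed page's vector has to be recomputed during change detection.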

11 Evaluation 20 paths, pages selected from Yahoo! Directories Each path consisted of 10 to 12 pages Pages were randomly selected –no flash presentations or images A page in each path was randomly selected for replacement Each selected page was replaced by 3 pages –CNN Financials (large change) –Elephants (large change) –A page from the same Yahoo! Directory (small change) Intuitive expectation

12 Results – Content-based Metrics
Angle between original and replacing pages (in degrees)
Replaced with         Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average               78.1                   81.9                  75.1
Range                 30.8 to 88.1           77.0 to 87.7          35.1 to 84.5
Standard deviation    15.65                  2.89                  10.76
High angle of change for all cases (even for similar pages)

13 Results – Context-based Metrics
Difference in angle to Yahoo! directory between original and replacing pages (in degrees)
Replaced with         Page about elephants   CNN Financials page   Page in same Yahoo! directory
Average               -7.8                   -9.1                  1.9
Range                 -23.2 to 1.6           -45.0 to 0.9          -15.2 to 14.3
Standard deviation    6.95                   10.57                 6.80
In line with intuitive expectation

14 Results – Distribution of Context-based Changes
Replacements resulting in moving towards and away from the context vector
Change in angle (degrees)                          Below -4      Between -4 and 2   More than 2
Replacement by a member of the Yahoo! Directory    1 (5%)        10 (50%)           9 (45%)
Replacement by non-member                          25 (62.5%)    15 (37.5%)         0 (0%)
Preliminary experimental thresholds –A change of 2 degrees or more converges the page to the path –A change between 2 and negative 4 degrees is treated as no change –A change more negative than negative 4 degrees diverges the page from the path
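The preliminary thresholds translate directly into a small classifier. The sketch below assumes the boundary values behave as the slide words them (2 degrees counts as converging, exactly negative 4 counts as no change); the function name and the labels are hypothetical.

```python
CONVERGE_AT = 2.0     # degrees; from the slide's preliminary thresholds
DIVERGE_BELOW = -4.0  # degrees; treating exactly -4 as "no change" is an assumption


def classify_change(delta_degrees):
    """Map a context-based angle change to the slide's three categories."""
    if delta_degrees >= CONVERGE_AT:
        return "converges"
    if delta_degrees < DIVERGE_BELOW:
        return "diverges"
    return "no change"
```

Applied to the averages reported for the context-based metrics, -7.8 and -9.1 fall in the diverging band, while 1.9 falls in the no-change band.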

15 Dealing with Missing Pages Pages may disappear due to a variety of reasons –Reconfiguration of Web sites –Change or expiration of domain names Temporary or permanent Threaten integrity of paths –Continuity of narrative structure –Completeness of collection Strategies –Find new locations for pages that have moved (exact replacements) –Find acceptable replacements for pages that have vanished (similar pages)

16 Approach – Information Extraction Phase Extract keyphrases from page –Extends the “Robust Hyperlinks” approach –Tag text in pages with part-of-speech tagger –Extract 1, 2 or 3 word keyphrases –n, n-n, a-n, n-n-n, a-n-n –Use HTML formatting for additional guidance Only or tags may separate terms in a phrase Store separate lists of keyphrases to use for locating exact replacements and similar pages
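The part-of-speech patterns listed above (n, n-n, a-n, n-n-n, a-n-n) can be matched with a sliding window. The sketch below assumes the text has already been tagged by some external tagger and that tags are reduced to "n" (noun), "a" (adjective), or anything else; the HTML-formatting guidance from the slide is not modeled. `extract_keyphrases` is a hypothetical name.

```python
# Tag patterns from the slide: n, n-n, a-n, n-n-n, a-n-n
PATTERNS = {
    ("n",),
    ("n", "n"),
    ("a", "n"),
    ("n", "n", "n"),
    ("a", "n", "n"),
}


def extract_keyphrases(tagged_tokens):
    """Return 1-, 2-, and 3-word phrases whose tag sequence matches a pattern.

    tagged_tokens is a list of (word, tag) pairs produced by an external
    part-of-speech tagger (assumed, not shown here).
    """
    phrases = []
    for start in range(len(tagged_tokens)):
        for length in (1, 2, 3):
            window = tagged_tokens[start:start + length]
            if len(window) < length:
                break  # ran off the end of the token list
            if tuple(tag for _, tag in window) in PATTERNS:
                phrases.append(" ".join(word.lower() for word, _ in window))
    return phrases
```

For example, the tagged sequence digital/a library/n collections/n change/v yields "digital library", "digital library collections", "library", "library collections", and "collections".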

17 Approach – Locating Exact Replacements Keyphrases help discriminate this page from others on the Web TF-IDF-based measure Spelling mistakes and unusually uncommon words are most valuable Order keyphrase list by decreasing value of TF-IDF measure While locating pages –Begin with a (user-specified) keyphrase set –Search for pages that match these terms –Add a term and retry until the result set is as desired
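A minimal sketch of this search loop follows. The slide only says "TF-IDF-based measure", so the smoothed formula here is one common variant chosen for illustration. The `search` argument is a hypothetical callable that returns the pages matching all phrases in a query; no real search engine API is implied.

```python
import math


def rank_keyphrases(term_freq, doc_freq, n_docs):
    """Order phrases by decreasing TF-IDF (smoothed variant, an assumption)."""
    scores = {
        phrase: tf * math.log((1 + n_docs) / (1 + doc_freq.get(phrase, 0)))
        for phrase, tf in term_freq.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


def locate_exact(ranked, search, start_size=2, max_results=1):
    """Grow the query one phrase at a time until the result set is small enough.

    search(query) is a hypothetical conjunctive search returning matching URLs.
    """
    query = list(ranked[:start_size])
    results = search(query)
    for phrase in ranked[start_size:]:
        if len(results) <= max_results:
            break
        query.append(phrase)  # narrow the search with the next-best phrase
        results = search(query)
    return query, results
```

Because rare phrases get the highest scores, a phrase that appears on almost no other page (such as a misspelling) lands at the front of the ranked list and narrows the result set fastest.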

18 Approach – Finding Similar Pages Rare phrases hinder search for similar pages Weed out phrases that have occurred less frequently than a certain threshold value Remaining phrases are then ordered by decreasing value of their TF-IDF measure While locating pages –Start with the most restrictive set of phrases –Reduce one phrase at a time until the desired result size is achieved Similarity is contextual Varies from person to person
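The relaxation loop for similar pages runs in the opposite direction: it starts from the full phrase set and loosens the query. In the sketch below, the frequency threshold, the choice to drop the lowest-ranked phrase first, and the `search` callable are all assumptions made for illustration; the slide does not specify them.

```python
def find_similar(ranked, frequency, search, min_frequency=5, min_results=3):
    """Relax a query until it returns at least min_results pages.

    ranked     phrases ordered by decreasing TF-IDF
    frequency  how often each phrase has occurred (which corpus is an assumption)
    search     hypothetical conjunctive search returning matching URLs
    """
    # Weed out phrases that are too rare to appear on similar pages.
    query = [p for p in ranked if frequency.get(p, 0) >= min_frequency]
    results = search(query) if query else []
    while len(results) < min_results and len(query) > 1:
        query = query[:-1]  # drop the lowest-ranked remaining phrase
        results = search(query)
    return query, results
```

Since similarity is contextual and varies from person to person, `min_results` is best read as a way to produce a candidate list for a human to review rather than a single automatic replacement.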

19 Future Work Integrate context-based algorithms with Path Manager Integrate the algorithms for locating page replacements Improve algorithms for keyphrase extraction Etc. etc.

20 For more information on Walden’s Paths http://www.csdl.tamu.edu/walden/ walden@csdl.tamu.edu Principal Investigators: Richard Furuta (furuta@csdl.tamu.edu) Frank Shipman (shipman@csdl.tamu.edu)

