Search and Access Strategies for Web Archives
Sangchul Song and Joseph JaJa


1. Background
o The Web has become the main publication medium worldwide, covering almost every facet of human activity. However, the Web is an ephemeral medium.
o Web archives offer unique opportunities for knowledge discovery due to the richness of their contents extending over significant periods of time.
o We need effective and scalable access strategies for web archives covering significant temporal spans.

2. Our Goals: Development of
o An effective technology to explore and search for information in a web archive. The technology must enable effective information exploration and discovery.
o A framework that produces ranking and evaluation of archived web contents within a temporal context as required by the user.
o Methods to determine the relevance of a group of web objects within the temporal context. A group can consist of a series of temporally contiguous versions of a single URL, or of web objects archived within some time span.
o A framework that allows effective search using keywords and time spans for large scale web archives.

3. Existing Access Methods
[Figure panels: Chronological Listing; Directory; Hybrid; Text-Search]

4. Problems With Existing Methods
o Inefficient handling of time-constrained search.
o Ineffective delivery of search results.
  - Inadequate relevancy scoring. Scoring is performed over the entire history.
  - Ungrouped search results. A URL is not unique in a web archive; it is time dependent. Because different versions of the same URL tend to have similar contents, the first result page is highly likely to be “polluted” by multiple versions of the same URL. Users may also want to focus on a specific time period within the results.
  - Lack of a group-scoring methodology. Without one, it is not clear which group to show at the top.
[Figure: example query “Find web pages that contain ‘September 11th’ before 2001”, labeled "Search all, and then Filter → Very inefficient!!". One result list shows pages about the attacks (Wikipedia "September 11 attacks", September 11 Digital Archive, 9/11 tributes, National Commission on Terrorist Attacks Upon the United States) "… and 4 million other pages pertaining to the September 11th Attack". The other shows "Ethiopian calendar" (Wikipedia) and "APOD: September 11, 1997, Mars Global Surveyor: Aerobraking" "… and only 560 other pages that are irrelevant to the September 11th Attack".]

5. Overview of our Approach:
o Efficient time-constrained search by maintaining separate inverted lists for a given time window (see Block 6).
o Scoring within a temporal context by computing term weights as a function of time (see Block 7).
o Grouping similar search results, while scoring search results as a group (see Blocks 9 and 10).
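As a rough illustration of the idea of keeping separate inverted lists per time window, the sketch below answers a query such as “pages that contain ‘September 11th’ before 2001” by reading only the windows that overlap the requested span, rather than searching everything and filtering afterwards. It is a minimal sketch, not the poster's implementation: the class `WindowedIndex`, its methods, the use of years as timestamps, and the sample objects are all assumptions made for the example. The `idf` method also hints at how a term weight could depend on the time window, under the assumption that a plain logarithmic inverse document frequency is computed per window.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict


class WindowedIndex:
    """Hypothetical sketch: one set of inverted lists per time window."""

    def __init__(self, boundaries):
        # boundaries = [t_0, ..., t_K]; window k covers [t_k, t_{k+1}).
        self.boundaries = boundaries
        n_windows = len(boundaries) - 1
        self.lists = [defaultdict(list) for _ in range(n_windows)]
        self.objects = [set() for _ in range(n_windows)]

    def windows(self, start, end):
        """Indices of the windows that overlap the span [start, end)."""
        lo = max(bisect_right(self.boundaries, start) - 1, 0)
        hi = min(bisect_left(self.boundaries, end), len(self.lists))
        return range(lo, hi)

    def add(self, obj_id, terms, valid_from, valid_to):
        """Index an archived object in every window it is valid in."""
        for k in self.windows(valid_from, valid_to):
            self.objects[k].add(obj_id)
            for term in set(terms):
                self.lists[k][term].append((obj_id, valid_from, valid_to))

    def search(self, term, start, end):
        """Objects containing `term` that are valid at some point in [start, end)."""
        hits = set()
        for k in self.windows(start, end):  # only overlapping windows are read
            for obj_id, valid_from, valid_to in self.lists[k].get(term, []):
                # Boundary windows may hold objects outside the span, so re-check.
                if valid_from < end and valid_to > start:
                    hits.add(obj_id)
        return sorted(hits)

    def idf(self, term, k):
        """Assumed weighting: log inverse document frequency inside window k."""
        containing = len(self.lists[k].get(term, []))
        if containing == 0:
            return 0.0
        return math.log(len(self.objects[k]) / containing)


# Made-up sample data; years stand in for archive timestamps.
index = WindowedIndex([1996, 2000, 2004, 2008])
index.add("apod@1997", ["september", "11"], 1997, 1998)
index.add("calendar@1999", ["september"], 1999, 2006)
index.add("wiki@2005", ["september", "11"], 2005, 2008)
index.add("news@2006", ["september", "11"], 2006, 2008)

print(index.search("september", 1996, 2001))
print(index.idf("11", 0), index.idf("11", 2))
```

In this toy data the query span touches only the first two windows, so the postings of the 2004 to 2008 window are never read, and the term "11" receives a higher weight in the earliest window than in the latest one because it is rarer there.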

6. Basic Techniques
o Determine a snapshot of web contents covering a time window: SC_k = { all web objects valid within the time interval [t_k, t_{k+1}) }
o Given the SCs, we maintain a set of inverted lists for each SC, and build a time hierarchy or a combined multi-version tree.
[Figure: index layouts. Panel labels: B-Tree; Multi-version Tree. Elements: snapshots SC_1, SC_2, …, SC_K; words w_1, w_2, …, w_N; posting lists PL SC1-w1, PL SC1-w2, …, PL SC1-wN, PL SC2-w1, PL SC2-w2, PL SC4-w1, PL SCK-w1.]

7. Scoring within a Temporal Context
o Relevancy scoring is based on the time that a web page was archived.
o The same contents will have different relevancy scores when the temporal contexts are different (e.g., one was archived several months before the other).
[Figure caption: Same contents, different archive dates → different scores.]

8. Search User Interface

9. Grouping Search Results
[Figure panels: First page polluted by the same URL; Grouped by URL (collapsed); Grouped by URL (expanded); Grouped by Time.]

10. Group-wide Scoring
o Grouping is good, but which group should be placed first on the result page?
o Simple method: use the average or the highest score among the members.
o More effective method: compute a relevancy score for the group as a whole. Instead of tf(t), we use df(t), the document frequency of t in the group. Instead of idf(t), we use igf(t), the inverse group frequency.
o We extend some of the best-known IR technologies for group ranking.
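The group-wide scoring idea in Block 10 can be sketched as follows. The poster only states that df(t) replaces tf(t) and igf(t) replaces idf(t); the concrete formula here (a sum of df × igf over the query terms, with igf(t) = log(G / gf(t)) for G groups, gf(t) of which contain t) is an assumption chosen by analogy with plain tf-idf, and the function names and sample groups are hypothetical.

```python
import math


def group_scores(groups, query_terms):
    """Score each group as the sum over query terms of df(t) * igf(t).

    groups: dict mapping a group id to a list of documents (sets of terms),
    e.g. all archived versions of one URL. The weighting is an assumption.
    """
    n_groups = len(groups)
    # gf(t): number of groups with at least one member containing t.
    gf = {
        term: sum(
            1 for members in groups.values()
            if any(term in doc for doc in members)
        )
        for term in query_terms
    }
    scores = {}
    for gid, docs in groups.items():
        score = 0.0
        for term in query_terms:
            if gf[term] == 0:
                continue  # term appears in no group, contributes nothing
            igf = math.log(n_groups / gf[term])  # inverse group frequency
            df = sum(1 for doc in docs if term in doc)  # df within this group
            score += df * igf
        scores[gid] = score
    return scores


def rank_groups(groups, query_terms):
    """Group ids ordered from highest to lowest group score."""
    scores = group_scores(groups, query_terms)
    return sorted(scores, key=lambda gid: scores[gid], reverse=True)


# Made-up groups: each list holds the versions of one URL as term sets.
groups = {
    "url_a": [{"september", "attacks"}, {"september", "attacks", "memorial"}],
    "url_b": [{"september", "calendar"}],
    "url_c": [{"mars", "surveyor"}],
}
print(rank_groups(groups, ["september", "attacks"]))
```

A group whose members repeatedly contain a term that few other groups contain rises to the top, which is the behaviour the simple average or maximum of member scores does not capture directly.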