Crisis, Tragedy, and Recovery Network Digital Library (CTRnet) + Web Archiving in Qatar and VT
Edward A. Fox, Seungwon Yang, & CTRnet Team


Crisis, Tragedy, and Recovery Network Digital Library (CTRnet) + Web Archiving in Qatar and VT
Edward A. Fox, Seungwon Yang, & CTRnet Team
Department of Computer Science, Virginia Tech
Workshop at WADL’13, July 25-26, 2013

Outline
- Introduction
  - Project goal
  - Members & collaborators
- Main Archiving Tasks
- Sub-Projects
- Dissemination Efforts
- IDEAL Project
- Qatar
- VT
- Acknowledgments
- Collaboration

CTRnet Project Goal
- Developing integrative approaches:
  - Collect, analyze, and visualize disaster information with a digital library (DL)

Members & Collaborators
- Project members from multi-disciplinary areas
  - Computer Science (HCI, Information Retrieval)
  - Accounting and Information Systems
  - Sociology
- Collaboration with the Internet Archive (IA)
  - Developed web archives
    - Heritrix crawler
    - Crawled data hosted by Wayback Machine in IA
    - Raw data downloaded and locally analyzed
  - Attended Archive-It Partners Meeting
    - Introduced the CTRnet team’s crawling approach using tweets

Outline
- Introduction
- Main Archiving Tasks
  - Disaster webpage archives
  - Disaster tweet archives
- Sub-Projects
- Dissemination Efforts
- IDEAL Project
- Qatar
- VT
- Acknowledgment
- Collaboration

Disaster Webpage Archives
- Webpages, PDFs, and multimedia content crawled from the Web
- 45 archives and growing (8.8 TB+)
- Active archives:
  - Boston marathon blast 2013
  - Global Emergency Overview 2013
  - Boko Haram Attack 2013
  - Hurricane Sandy 2012
  - Center for Research on the Epidemiology of Disasters (CRED) 2012
  - Japan Earthquake 2011
  - CTRnet: Emergency Preparedness Information 2011
  - Texas fertilizer plant explosion 2013

Disaster Tweet Archives
- More than 120 tweet archives and growing
- Use Twitter Streaming API
- Hashtag- and keyword-based archiving
  - Natural: floods, earthquakes, wildfires, tsunami, hurricanes
  - Man-made: shooting, transportation accidents, plane crash
  - Political: Middle East protests, Iran elections
  - Health: diabetes, obesity, cancer, mental illness
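The hashtag- and keyword-based archiving on this slide can be sketched as a simple filter over tweet text. This is only an illustration under stated assumptions, not the CTRnet implementation: the `ARCHIVE_TERMS` mapping, the archive names, and the `route_tweet` function are hypothetical, terms are matched exactly (so "flood" does not match "floods"), and the connection to the Twitter Streaming API is left out.

```python
import re

# Hypothetical archive definitions: archive name -> tracked hashtags/keywords.
ARCHIVE_TERMS = {
    "natural": ["#earthquake", "flood", "wildfire", "tsunami", "hurricane"],
    "man-made": ["shooting", "plane crash"],
}


def tokenize(text):
    """Lowercase the text and split it into words and #hashtags."""
    return re.findall(r"#?\w+", text.lower())


def route_tweet(text, archive_terms=ARCHIVE_TERMS):
    """Return the sorted names of the archives whose terms occur in the tweet."""
    tokens = tokenize(text)
    joined = " ".join(tokens)
    matches = set()
    for archive, terms in archive_terms.items():
        for term in terms:
            term = term.lower()
            if " " in term:
                # Multi-word phrase: match it as a whole-word token sequence.
                if f" {term} " in f" {joined} ":
                    matches.add(archive)
            elif term in tokens:
                # Single word or hashtag: exact token match.
                matches.add(archive)
    return sorted(matches)
```

A tweet that matches no tracked term returns an empty list, so a collector could discard it or keep it in a catch-all archive.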

Outline
- Introduction
- Main Archiving Tasks
- Sub-Projects
  - Social media use during political crisis
  - Topic tagging of webpages
  - Visualizing emergency phases in tweets
  - Water main break visualization
  - Focused crawling
  - LucidWorks tool for big data processing
- Dissemination Efforts
- IDEAL Project
- Qatar
- VT
- Acknowledgment
- Collaboration

Social Media Use in Political Crisis (1/2) (2/7 - 2/14, 2011)
- Total 514,782 tweets
[Chart; axis label: No. Tweets]

Social Media Use in Political Crisis (2/2)
- Opinion Leadership in Egypt Uprising 2011
  - 514,782 tweets (one week around Mubarak’s resignation)
  - Total 79,000 unique users
  - Presumably posting from Egypt: 4,710
  - Individuals excluding organizations: 3,675
  - Opinion leaders: ,000 followers in top 10% (365) individuals
  - Bios: blogger/activist, writer/reporter, lawyer/executive director, social media consultant,…
  - ‘elite’ type actors

Topic Tagging of Webpages: Xpantrac

Visualizing Emergency Phases in Tweets (ISCRAM 2013) (1/2)
- Four phases of emergency management model

Visualizing Emergency Phases in Tweets (2/2)

Water Main Break Visualization
- Tweets collected with keywords
- Selected tweets with location information (lat/long, geonames)
- Event locations displayed with details
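The three steps on this slide (collect by keyword, keep tweets with location information, hand locations to a map) might look roughly like the sketch below. Everything named here is an assumption for illustration: the tweet dictionaries with `text` and `latlon` fields, the tiny in-memory `GAZETTEER` standing in for a GeoNames lookup, and the approximate coordinates in it.

```python
# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
# A real system would query a gazetteer service such as GeoNames instead.
GAZETTEER = {
    "blacksburg": (37.2296, -80.4139),
    "boston": (42.3601, -71.0589),
}


def locate(tweet, gazetteer=GAZETTEER):
    """Return (lat, lon) from explicit coordinates or a place-name match, else None."""
    latlon = tweet.get("latlon")  # hypothetical field holding (lat, lon)
    if latlon is not None:
        return tuple(latlon)
    text = tweet.get("text", "").lower()
    for name, position in gazetteer.items():
        if name in text:
            return position
    return None


def select_located(tweets, keywords):
    """Keep keyword-matching tweets that can be placed on a map."""
    located = []
    for tweet in tweets:
        text = tweet.get("text", "").lower()
        if not any(keyword.lower() in text for keyword in keywords):
            continue  # not about the event
        position = locate(tweet)
        if position is not None:
            located.append(
                {"text": tweet["text"], "lat": position[0], "lon": position[1]}
            )
    return located
```

The returned records carry the tweet text alongside the coordinates, which is the minimum a map layer needs to show an event location with details.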

Focused Crawling
- IA collections
  - Identify a CTR event, list keywords
  - Query online news sources, identify URLs in tweets
  - Use URLs as initial seeds for crawling; IA provides access
- Modified version of the LibSVM classifier
  - Reduced noise
  - 3000 documents about school shootings
- Next-generation focused crawler
  - Combines evidence signals for relevance estimation (using Bayesian networks)
  - Solves tunneling problem using AI approaches (reinforcement learning)
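The general idea of focused crawling from seed URLs can be sketched as a best-first traversal. This is a simplified illustration, not the CTRnet crawler: the keyword-overlap `relevance` score is a stand-in for the trained LibSVM classifier mentioned on the slide, `fetch` is a hypothetical callable that returns `(text, links)` or `None` for a URL, and `keywords` is assumed to be a non-empty list of lowercase words.

```python
import heapq


def relevance(text, keywords):
    """Fraction of event keywords found in the text (stand-in for a classifier)."""
    words = set(text.lower().split())
    return sum(1 for keyword in keywords if keyword in words) / len(keywords)


def focused_crawl(seeds, fetch, keywords, threshold=0.5, max_pages=100):
    """Best-first crawl: visit the most promising URL first, expand only relevant pages."""
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue  # unreachable URL
        fetched += 1
        text, links = page
        score = relevance(text, keywords)
        if score < threshold:
            continue  # off-topic page: do not follow its links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant
```

Because links on low-scoring pages are dropped, this sketch cannot reach a relevant page that is only linked from an off-topic one. That limitation is the tunneling problem the slide's next-generation crawler is meant to address.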

LucidWorks Big Data Tool
- Powerful tool with components:
  - Hadoop: distributed computing
  - Lucene & Solr: indexing, searching
  - HBase: distributed database for Hadoop
  - Mahout: distributed machine learning
  - Oozie: workflow
  - Kafka: high-throughput distributed messaging
  - ZooKeeper: maintaining distributed coordination
  - Pig: high-level platform for creating MapReduce programs
- Packaged as a virtual appliance in Ubuntu for easy installation
- Processing of WARC files downloaded from IA
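To make the last bullet concrete, here is a minimal sketch of reading records from a WARC file with only the Python standard library. It is not how the LucidWorks tool does it. It assumes an already decompressed WARC stream opened in binary mode, and it ignores folded (multi-line) headers; `iter_warc_records` and `response_uris` are hypothetical helper names.

```python
def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Content-Length gives the size of the record block in bytes.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def response_uris(stream):
    """List the target URIs of all 'response' records in the stream."""
    return [
        headers.get("WARC-Target-URI")
        for headers, _ in iter_warc_records(stream)
        if headers.get("WARC-Type") == "response"
    ]
```

For gzip-compressed files, wrapping the file with `gzip.open(path, "rb")` before passing it in should work, since the standard `gzip` module reads concatenated members as one stream.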

Outline
- Introduction
- Main Archiving Tasks
- Sub-Projects
- Dissemination Efforts
  - Conferences
  - Journal papers
  - Meetings attended
- IDEAL Project
- Qatar
- VT
- Acknowledgment
- Collaboration

Dissemination Efforts
- Conferences, Workshops
  - JCDL, ISCRAM, Digital Government, CHI, WADL
- Meetings Attended
  - NSF workshop: Crisis Informatics 2012, 2011
  - Archive-It Partners Meeting
    - 2012 (Annapolis, MD), 2011 (Lexington, KY)
- Publications
  - Please see

Outline
- Introduction
- Main Archiving Tasks
- Sub-Projects
- Dissemination Efforts
- IDEAL Project
  - Extension of CTRnet
  - Scope broadened beyond crisis events (e.g., community)
  - NSF funding pending
- Qatar
- VT
- Acknowledgment
- Collaboration

Integrated Digital Event Archive and Library (IDEAL) Project
- Extension of CTRnet with broadened scope:
  - Continue disaster archiving
  - Large social media data processing
    - Multimedia (images, videos) shared in social media
  - Digital government research
    - Public opinion mining, mood perception from tweets
    - Community issue detection from tweets
- Technologies: event detection, focused crawling, analysis/visualization, integration of archive + DL capabilities

Qatar Project NPRP Project Objectives/Aims
A. Research and prototype digital library systems and infrastructure for Qatar, focusing initially on Qatari information related to government and scholarly activities. Leverage the crawling engine from Penn State’s SeerSuite software infrastructure, and extend it beyond its current focus on English to support Arabic-English collections, to cover a broad range of scholarly disciplines and all types of government information. … (with collaboration of National Library)

Qatar Project NPRP Project Objectives/Aims (cont’d)
B. Research and build the digital library community in Qatar, supporting digital library use, services, collection development, tailored systems, and advancing toward a Knowledge Society. Study scholarly activities, and engage in community building in Qatar, so DLs can be tailored to specific domains and to the unique needs of Qatar. Through workshops, a consulting center at the proposed Institute, and collaborative efforts with libraries and museums in Qatar, we will identify particular needs and uses, and tailor collections, systems, and services, to lead toward the Qatari Knowledge Society.

VT
- Half of campus web servers use the central CMS
- Many other web servers cover varied content
- Coverage by Internet Archive is OK, but for parts of the overall campus Web, crawling is infrequent
- Discussions with IT, Library, and University Relations about:
  - Heritrix
  - Memento support
  - SiteStory

Acknowledgment
- NSF for funding:
  - Grant: CTRnet IIS
  - Proposal: IDEAL IIS, Integrated Digital Event Archive and Library
- The Internet Archive:
  - Heritrix crawler
  - Hosting the crawls and resulting archives

Collaboration
- We invite anyone to collaborate with us!
- Contact:
  - Edward A. Fox

Thank you! Questions/Comments?