Web Archive Content Analysis: Disaster Events Case Study IIPC 2015 General Assembly Stanford University and Internet Archive Mohamed Farag Dr. Edward A.

1 Web Archive Content Analysis: Disaster Events Case Study
IIPC 2015 General Assembly, Stanford University and Internet Archive
Mohamed Farag, Dr. Edward A. Fox
mmagdy@vt.edu, fox@vt.edu
DLRL, CS @ Virginia Tech
April 27 – May 1, 2015

2 Acknowledgments
Related Funding:
– 2007-2008: NSF IIS-0736055, DL-VT416: A Digital Library Testbed for Research Related to 4/16/2007 at Virginia Tech
– 2009-2013: NSF IIS-0916733, Crisis, Tragedy, and Recovery network (CTRnet)
– 2013-2016: NSF IIS-1319578, Integrated Digital Event Archive & Library (IDEAL)
The Internet Archive (Kristine Hanna, co-PI):
– Heritrix crawler and other tools and support
– Hosting the crawls and resulting archives
IDEAL team also includes Drs. Kavanaugh, Sheetz, and Shoemaker; and GRA Sunshin Lee

3 Outline Building archives for events Event modeling and representation Assessing archive quality using event model Quality tool and results Future Work

4 Building archives for events – 1
Manual Curation
We have created ~60 collections (https://archive-it.org/organizations/156)
These collections are about disaster events: bombings, earthquakes, hurricanes, plane crashes, shootings, floods, fires
Manual preparation of URLs and archiving using Archive-it service

5 Sample Web Collections
Collection Name              No. of Seeds
Alabama University Shooting  116
April 16 Archive             88
Chile Earthquake             19
Nevada air race crash        64
China Floods                 60
Encephalitis (India)         59
Hurricane Irene              70

6 Building archives for events - 2 Seeds from social media (Twitter) We created more than 600 tweet collections with ~ 1 billion tweets For each collection we extract URLs in the tweets, fetch webpages, and archive just those webpages Webpage collections are of two types: – Disaster events: shootings, earthquakes, plane crashes, hurricanes, bombings, terrorism, floods, fire – Community and political events
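The URL-extraction step described on this slide could look roughly like the following. This is a minimal sketch, not the project's actual code: the function name `extract_urls`, the regular expression, and the trailing-punctuation cleanup are all assumptions made for illustration.

```python
import re

# Assumption: any http/https token up to the next whitespace counts as a URL.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(tweets):
    """Return unique URLs, in first-seen order, from an iterable of tweet texts."""
    seen = set()
    urls = []
    for text in tweets:
        for match in URL_PATTERN.findall(text):
            # Drop punctuation that commonly trails a URL at the end of a sentence.
            url = match.rstrip(".,;:!?)\"'")
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Deduplicating before fetching matters here because retweets repeat the same link many times; the fetched pages would then be handed to the archiving step.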

7 Sample Tweet Collections
Collection              Keywords/Hashtags  No. of Tweets  Start date
Hurricane Sandy         hurricane sandy    3,219,383      2012-10-26
Ebola                   #ebola             1,855,680      2014-07-30
Ferguson shooting       #Ferguson          1,580,479      2014-08-11
Thanksgiving            #Thanksgiving      214,888        2014-11-20
AirAsia Plane Crash     #QZ8501            174,353        2014-12-30
Charlie Hebdo shooting  #CharlieHebdo      451,009        2015-01-07
Iran Talks              #IranTalks         117,966        2015-04-02
For full list check: http://hadoop.dlib.vt.edu:81/twitter/

8 Building archives for events - 2
Seeds from social media [workflow diagram]
Phases: Collect, Archive/Organize/Analyze, Access
Flow: Event → Keyword/Hashtag → Collect Tweets → Tweet Collection → Extract URLs → Shortened URLs → Expand → Original Webpages → Archive → WARC
Access: WARC → Index → SOLR → Search; WARC → Wayback → Browse

9 Building archives for events - 3 Focused Crawling Curator selects high quality seed URLs Use Event Focused Crawler (EFC) to retrieve webpages that are highly similar to those with the seed URLs Curator can configure EFC to adjust the number of webpages retrieved and the quality of retrieved webpages (similarity threshold)
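To illustrate how a similarity threshold can gate which pages a focused crawler keeps, here is a small sketch. It is not the EFC implementation: it assumes cosine similarity over raw term-frequency vectors, and the names `term_vector`, `cosine_similarity`, and `is_relevant`, as well as the default threshold of 0.5, are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Lowercased bag-of-words term frequencies."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page_text, seed_texts, threshold=0.5):
    """Keep a page if it is similar enough to the combined seed pages."""
    seed_vector = Counter()
    for seed in seed_texts:
        seed_vector.update(term_vector(seed))
    return cosine_similarity(term_vector(page_text), seed_vector) >= threshold
```

Raising the threshold trades volume for quality: fewer pages pass, but those that do are closer to the curator's seeds.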

10 Building archives for events - 3 Focused Crawling

11 Outline Building archives for events Event modeling and representation Assessing archive quality using event model Quality tool and results Future Work

12 Event Model and Representation Modeling events – What happened, where, and when Information retrieval – Helps find What part (Vector Space/Probabilistic) Natural language processing – Helps find Where and When parts (Named Entity Recognition)

13 Event Model and Representation Educational activities – CS4984 Computational Linguistics (Fall 2014) – CS5604 Information Retrieval (Spring 2015) Equipment – Hadoop cluster with 20 data nodes – 612 RAM, 76 Cores, and 60 TB Disk Processing methods – Stanford Named Entity Recognition – Mahout routine for topic identification – Python programming for text analysis (Hadoop streaming)
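As an illustration of Python text analysis in the Hadoop streaming style, the word-count sketch below separates the map and reduce steps. It is only an assumed example of the pattern, not code from the courses or the project; in a real streaming job each function would read lines from standard input and print its results, and Hadoop would sort the mapper output by key before the reducer sees it.

```python
import re
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per token, as a streaming mapper would print."""
    for line in lines:
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"
```

Locally, the shuffle step can be imitated with `sorted(mapper(lines))` before calling `reducer`.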

14 Outline Building archives for events Event modeling and representation Assessing archive quality using event model Quality tool and results Future Work

15 Assessing archive quality using event model Approaches to textual and linguistic analysis of an archive – Frequent and important words in whole collection – Important sentences, sentences that have one or more frequent words – Frequent entities (location and dates) extracted from important sentences
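The first two steps above (frequent words, then sentences containing them) can be sketched as follows. The tool's backend is described as Python with NLTK; this sketch deliberately uses only the standard library so it runs without downloads, so the regex tokenizer, the naive sentence splitter, and the tiny stopword list are simplifying assumptions, and all function names are hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list, not the one the tool uses.
STOPWORDS = {"the", "of", "in", "a", "an", "and", "to", "is", "for", "on", "at", "by", "with"}


def tokenize(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Naive split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def frequent_words(documents, top_n=10):
    """Most frequent non-stopword tokens across the whole collection."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return [word for word, _ in counts.most_common(top_n)]


def important_sentences(documents, frequent, min_matches=1):
    """Sentences containing at least min_matches distinct frequent words, best first."""
    frequent = set(frequent)
    scored = []
    for doc in documents:
        for sentence in split_sentences(doc):
            matches = len(frequent & set(tokenize(sentence)))
            if matches >= min_matches:
                scored.append((matches, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]
```

The selected sentences would then be passed to a named entity recognizer to pull out locations and dates.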

16 Assessing archive quality using event model [pipeline diagram]
Webpages → Text Extraction → Text Content → Frequency Analysis → Frequent Words
Text Content → Sentence Tokenization → Sentences
Frequent Words + Sentences → Keyword Matching → Selected Sentences
Selected Sentences → Named Entity Recognition → Event Entities → Aggregation → Event Model
Event Model: Topic: (t_1, t_2, ..., t_n), Location: (l_1, l_2, ..., l_n), Date: (d_1, d_2, ..., d_n)

17 Example: Ebola Outbreak (22 documents)
Top 10 frequent words and top 2 sentences that include 2 or more frequent words
Frequent Words: Ebola, Virus, Disease, Health, 2014, Africa, West, Ago, University, Outbreak
Important Sentences and Extracted Entities:
- Outbreak of Ebola virus disease in West Africa: third update, 1 August 2014. (7)
  DATE: ['August 2014'], LOCATION: ['West Africa']
- ECDC (2014) Outbreak of Ebola virus disease in West Africa. (7)
  LOCATION: [u'West Africa']
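The final aggregation into an event model (topic, location, date) might be sketched as below. This assumes the entities have already been extracted by a named entity recognizer and arrive as dictionaries shaped like the DATE/LOCATION output in the Ebola example; the `EventModel` class, the `aggregate` function, and ranking by raw frequency are illustrative assumptions rather than the tool's actual design.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class EventModel:
    """What happened (topic), where (location), and when (date)."""
    topic: list = field(default_factory=list)
    location: list = field(default_factory=list)
    date: list = field(default_factory=list)


def aggregate(frequent_words, sentence_entities, top_n=3):
    """Combine frequent words and per-sentence entities into one EventModel.

    sentence_entities: list of dicts such as
    {"DATE": ["August 2014"], "LOCATION": ["West Africa"]}.
    """
    locations = Counter()
    dates = Counter()
    for entities in sentence_entities:
        locations.update(entities.get("LOCATION", []))
        dates.update(entities.get("DATE", []))
    return EventModel(
        topic=list(frequent_words),
        location=[name for name, _ in locations.most_common(top_n)],
        date=[value for value, _ in dates.most_common(top_n)],
    )
```

A collection whose aggregated locations and dates disagree with the event being archived would be a candidate for closer curator review.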

18 Outline Building archives for events Event modeling and representation Assessing archive quality using event model Quality tool and results Future work

19 Archive Quality Assessment http://nick.dlib.vt.edu/EventModel/ Input: – Existing collections, WARC file, Text file with list of URLs Frontend: HTML, Javascript/Dojo Backend: Python, NLTK

20 Sample Results

21

22

23 Future Work Use event model to: – Summarize event collection (generate most informative sentence) – Extract relevant parts from webpage

24 Thank You
Questions?
Mohamed Farag, Dr. Edward A. Fox
mmagdy@vt.edu, fox@vt.edu

25 IDEAL Interface
http://nick.dlib.vt.edu/ideal/collections/index.php
Collections
– 11 event categories, 2 events each (small and big size)
– Total 1.6 M documents
Services:
– Search (keywords, web collections text)
– Browse (event categories and events metadata, web and tweet collections)

26 Technologies
Search engine
– Solr 4.9 (http://lucene.apache.org/solr/)
Web Interface
– Apache server
– JavaScript - Solr library (https://github.com/evolvingweb/ajax-solr/wiki)
Tweets archiving
– yourTwapperKeeper (https://github.com/540co/yourtwapperkeeper)
Webpages archiving
– Archive-it service from Internet Archive (https://archive-it.org/)
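A keyword search with a category facet against Solr can be expressed as a plain HTTP query. The sketch below only builds the request URL using standard Solr parameters (`q`, `fq`, `facet`, `facet.field`, `rows`, `wt`); the core URL and the `category` field name are assumptions, since the IDEAL schema is not given in the slides.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, keywords, category=None, rows=10):
    """Build a Solr /select URL for a keyword search with category facets.

    base_url and the 'category' field are hypothetical placeholders.
    """
    params = [
        ("q", keywords),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", "category"),
    ]
    if category:
        # A filter query narrows results to one facet value.
        params.append(("fq", f'category:"{category}"'))
    return f"{base_url}/select?{urlencode(params)}"
```

For example, `build_solr_query("http://localhost:8983/solr/ideal", "night club", category="Fire")` would restrict the search to documents filed under the Fire category.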

27 Collections
Category/Collection  Big                         Small
Accident             Train derailment in Quebec  Texas factory explosion
Bombing              Boston bombing              Somalia Blast
Community            Blacksburg events           Labor day and world cup 2014
Disease Outbreak     Ebola                       encephalitis
Earthquake           Turkey earthquake           Virginia earthquake and others
Fire                 Brazil night club fire      Texas wild fire
Flood                Pakistan flood              China flood and Islip 13 inch rain
Hurricane            Hurricane Sandy             Typhoon Haiyan
Plane Crash          Russia Plane Crash          Nevada air race crash
Shooting             April 16 shooting           Norway shooting and others

28 Search Interface

29 Searching Sandy

30 Faceted Search Search all events under Fire

31 Faceted Search Search Brazil Night Club Fire

32 Browse Interface

33 Select Event Type

34 Select Event

35 Hurricane Events

