2015 Hany SalahEldeen Dissertation Defense1 Detecting, Modeling, & Predicting User Temporal Intention in Social Media Hany M. SalahEldeen Doctor of Philosophy.

Presentation transcript:

2015 Hany SalahEldeen Dissertation Defense1 Detecting, Modeling, & Predicting User Temporal Intention in Social Media. Hany M. SalahEldeen, Doctor of Philosophy Dissertation Defense, Old Dominion University, Department of Computer Science. Advisor: Dr. Michael L. Nelson. Committee: Dr. Michele C. Weigle, Dr. Hussein M. Abdel-Wahab, Dr. M’Hammed Abdous. May 5th, 2015

2015 Hany SalahEldeen Dissertation Defense2 All tweets are equal… …but some are more equal than the others

2015 Hany SalahEldeen Dissertation Defense3 It is imperative to know… 1. How long would these last? 2. And if lost, is there a backup somewhere? 3. Is this what the author intended?

2015 Hany SalahEldeen Dissertation Defense4 To maintain historical integrity Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.

2015 Hany SalahEldeen Dissertation Defense5 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

2015 Hany SalahEldeen Dissertation Defense6 People rely on social media for the most up-to-date information

2015 Hany SalahEldeen Dissertation Defense7 Social media is more than kitty photos: Marie Colvin (January 12, 1956 – February 22, 2012), Rémi Ochlik (16 October 1983 – 22 February 2012), Ahmed Assem (1987 – July 8, 2013)

2015 Hany SalahEldeen Dissertation Defense8 For the web is dark, and full of missing content… Accessed in July out 8 external links on Remi’s Wikipedia page return 404

2015 Hany SalahEldeen Dissertation Defense9 even for content shared in social media Accessed in July 2014

2015 Hany SalahEldeen Dissertation Defense10 News sites are also prone to change Accessed in July 2014

2015 Hany SalahEldeen Dissertation Defense11 So are specialized sites Accessed in July 2014

2015 Hany SalahEldeen Dissertation Defense12 Research Problem: Author’s Intention ≠ Reader’s Experience

2015 Hany SalahEldeen Dissertation Defense13 Research Implication Author’s Intention ≠ Reader’s Experience Broken Inconsistent Web and Historical Records

2015 Hany SalahEldeen Dissertation Defense14 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

2015 Hany SalahEldeen Dissertation Defense15 Social Post

2015 Hany SalahEldeen Dissertation Defense16 The anatomy of a tweet: author’s username, other user mention, tweet body, hashtag, shortened URL to resource, publishing timestamp, interaction options. Labeled parts: Social Post, Shared Resource

2015 Hany SalahEldeen Dissertation Defense17 3 URIs = 3 Chances to fail

2015 Hany SalahEldeen Dissertation Defense18 URL shortening and aliasing
curl -L -I
HTTP/ Moved Permanently
Server: nginx
Date: Mon, 07 Jul :19:48 GMT
Cache-Control: private; max-age=90
Location: losing-my-revolution-year.html
Mime-Version: 1.0
Set-Cookie: _bit=53bae4c f10- cb1cf10a;domain=.bit.ly;expires=Sat Jan 3 18:19: ;path=/; HttpOnly
Content-Type: text/html;charset=utf-8
Content-Length: 167
HTTP/ OK
Expires: Mon, 07 Jul :19:52 GMT
Date: Mon, 07 Jul :19:52 GMT
Cache-Control: private, max-age=0
Last-Modified: Mon, 07 Jul :19:07 GMT
ETag: "e b103-4daa-a3f2- d0509ebab51f"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Server: GSE
Alternate-Protocol: 80:quic
Content-Type: text/html;charset=UTF-8
Content-Length: 0
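The redirect chain that `curl -L -I` follows (a short URL answering "Moved Permanently", then the final "OK") can be sketched in a few lines. This is only an illustrative sketch, not the tooling used in the dissertation: the `Fetch` callable, the function name `resolve_chain`, the hop limit, and the sentinel status `0` are all assumptions introduced here. The fetcher is injected so that the logic can be exercised without network access.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical fetcher type: given a URL, return (status code, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def resolve_chain(url: str, fetch: Fetch, max_hops: int = 10) -> Tuple[List[str], int]:
    """Follow redirects hop by hop and return (visited URLs, final status).

    A redirect loop or an over-long chain is reported with status 0,
    which is our own sentinel and not an HTTP status code.
    """
    visited: List[str] = []
    while len(visited) < max_hops:
        if url in visited:          # we have been here before: redirect loop
            return visited, 0
        visited.append(url)
        status, location = fetch(url)
        if status in REDIRECT_CODES and location:
            url = location          # every extra hop is one more chance to fail
            continue
        return visited, status
    return visited, 0               # chain longer than max_hops


# Usage with a stubbed fetcher (the URLs are placeholders, not real links).
table = {"short": (301, "long"), "long": (200, None)}
chain, status = resolve_chain("short", lambda u: table[u])
```

Each entry in `chain` is a separate URI that must keep resolving for the reader to reach the resource the author shared.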

2015 Hany SalahEldeen Dissertation Defense19 Life cycle of a social post

2015 Hany SalahEldeen Dissertation Defense20 Life cycle of a social post tweets

2015 Hany SalahEldeen Dissertation Defense21 Life cycle of a social post tweets Links to

2015 Hany SalahEldeen Dissertation Defense22 Life cycle of a social post tweets What the reader receives Links to Same state the author intended

2015 Hany SalahEldeen Dissertation Defense23 Life cycle of a social post tweets What the reader receives Links to Same state the author intended Ideally!

2015 Hany SalahEldeen Dissertation Defense24 Life cycle of a social post tweets What the reader receives Links to Same state the author intended After a period of time

2015 Hany SalahEldeen Dissertation Defense25 Life cycle of a social post tweets What the reader receives Links to Same state the author intended The resource has disappeared After a period of time

2015 Hany SalahEldeen Dissertation Defense26 Life cycle of a social post: the author tweets; the tweet links to a resource; ideally, what the reader receives is the same state the author intended. After a period of time, the resource has disappeared or the resource has changed.

2015 Hany SalahEldeen Dissertation Defense27 Memento framework *

2015 Hany SalahEldeen Dissertation Defense28 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

2015 Hany SalahEldeen Dissertation Defense29 Related Work
Social media analysis:
Understanding Microblogging: Zhao 2009, Yang 2010, Newman 2003, Kwak 2010, Java 2007, Cha 2009
History Narration: Vieweg 2010, Starbird, Qu 2011, Neubig 2011, Lehman and Lalmas
User’s Web Search Intention: Ashkan 2009, Lee 2005, Loser 2008, Azzopardi 2009, Baeza-Yates 2006, Dai 2011
Commercial Intention: Guo 2010, Benczur 2007
Sentiment Analysis: Mishne 2006, Bollen 2011
Access to Archives: Van de Sompel 2009
Persistence of shared resources: Nelson 2002, Sanderson 2011, McCown 2007
URL Shortening: Antoniades 2011
Tweeting, Micro-blogging and Popularity: Wu 2011, Java 2007, Kwak 2010
Social Networks Growth and Evolution: Meeder 2011
Further details: refer to chapter 3

2015 Hany SalahEldeen Dissertation Defense30 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

2015 Hany SalahEldeen Dissertation Defense31 Research Question: Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?

2015 Hany SalahEldeen Dissertation Defense32 Research Goals
Detect the temporal intention of: 1. the author at sharing time; 2. the reader at dereferencing time.
Model this intention as a function of time, the nature of the resource, and its context.
Predict how resources change with time and the intention behind sharing them, to minimize inconsistency.
Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss, and provide smooth temporal navigation of the social web.
Further details: refer to chapters 6, 7, 8, and 9

2015 Hany SalahEldeen Dissertation Defense33 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

2015 Hany SalahEldeen Dissertation Defense34 Our analysis covers three angles: Shared Resource, Time, User

2015 Hany SalahEldeen Dissertation Defense35 Shared Resource, Time, User. Loss and Persistence of Shared Resources

2015 Hany SalahEldeen Dissertation Defense36 Shared Resource, Time, User: Alive. First: Estimate social media content loss

2015 Hany SalahEldeen Dissertation Defense37 Six socially significant events
Event | Source | Year
Iranian Election | SNAP Dataset | 2009
H1N1 Virus Outbreak | SNAP Dataset | 2009
Michael Jackson’s Death | SNAP Dataset | 2009
Obama’s Nobel Peace Prize | SNAP Dataset | 2009
The Egyptian Revolution | Twitter, Websites, Books | 2011
The Syrian Uprising | Twitter API | 2012

2015 Hany SalahEldeen Dissertation Defense38 Twitter tag expansion and filtration

2015 Hany SalahEldeen Dissertation Defense39 Twitter tag expansion increases precision

2015 Hany SalahEldeen Dissertation Defense40 What are people sharing?

2015 Hany SalahEldeen Dissertation Defense41 Existence on the live web and in the archives. For each unique URL we resolved the final HTTP response and considered 2 classes. Success: 200 OK. Failure: the 4XX and 50X families, 30X redirect loops, and soft 404s. We then used the Memento aggregator. Archived: the URL has at least one memento in its timemap.
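The two-class rule on this slide can be written down directly. The sketch below is an illustration under stated assumptions: the function names, the boolean flags for loop and soft-404 detection, and the decision to count every non-200 final status as a failure are ours; the slide only names 200 OK as success and lists the 4XX and 50X families, 30X loops, and soft 404s as failures. How loops and soft 404s are detected is outside this sketch.

```python
def classify_live(final_status: int, redirect_loop: bool = False, soft_404: bool = False) -> str:
    """Return "success" or "failure" for a URL's final HTTP response.

    redirect_loop and soft_404 are assumed to be computed elsewhere.
    Assumption: any final status other than 200 counts as a failure.
    """
    if redirect_loop or soft_404:
        return "failure"
    return "success" if final_status == 200 else "failure"


def is_archived(memento_count: int) -> bool:
    """A resource counts as archived if its timemap lists at least one memento."""
    return memento_count >= 1
```

A page that answers 200 OK but shows a "not found" message is the soft-404 case, which is why the status code alone is not enough.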

2015 Hany SalahEldeen Dissertation Defense42 Resources Missing and Archived
Collection | Percentage Missing | Percentage Archived
H1N1 Outbreak | 23.49% | 41.65%
Michael Jackson | 36.24% | 39.45%
Iran | 26.98% | 43.08%
Obama | 24.59% | 47.87%
Egypt | 10.48% | 20.18%
Syria | 7.04% | 5.35%
Additional value pairs without a collection label: 31.62% | 30.78%, 24.47% | 36.26%, 25.64% | 43.87%, 26.15% | 46.15%

2015 Hany SalahEldeen Dissertation Defense43 Shared Resource, Time, User: Alive, Missing. Second: Can we measure existence and disappearance as a function of time?

2015 Hany SalahEldeen Dissertation Defense44 Resources Missing and Archived
Collection | Percentage Missing | Percentage Archived
H1N1 Outbreak | 23.49% | 41.65%
Michael Jackson | 36.24% | 39.45%
Iran | 26.98% | 43.08%
Obama | 24.59% | 47.87%
Egypt | 10.48% | 20.18%
Syria | 7.04% | 5.35%
Additional value pairs without a collection label: 31.62% | 30.78%, 24.47% | 36.26%, 25.64% | 43.87%, 26.15% | 46.15%

2015 Hany SalahEldeen Dissertation Defense45 Timeline of Events

2015 Hany SalahEldeen Dissertation Defense46 Timeline of Events

2015 Hany SalahEldeen Dissertation Defense47 Social Events Having a Bimodal Time Distribution

2015 Hany SalahEldeen Dissertation Defense48 Timeline of Events

2015 Hany SalahEldeen Dissertation Defense49 Social Events Having a Bimodal Time Distribution

2015 Hany SalahEldeen Dissertation Defense50 Existence as a function of time

2015 Hany SalahEldeen Dissertation Defense51 Existence as a function of time

2015 Hany SalahEldeen Dissertation Defense52 Conclusion: Existence could be estimated as a function of time.
Results: Measured 21,625 resources from 6 data sets in the archives and on the live web. A year after publishing, about 11% of content shared on social media will be gone. After that, we lose roughly 0.02% daily.
Publications and Articles:
1. H. M. SalahEldeen. Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone. losing-my-revolution-year.html
2. H. M. SalahEldeen and M. L. Nelson. Losing my revolution: how many resources shared on social media have been lost? In Proceedings of the Second international conference on Theory and Practice of Digital Libraries, TPDL'12.
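The two figures in the results (about 11% gone after the first year, then roughly 0.02% per day) suggest a simple linear estimate. The sketch below only restates those two numbers; the function name is hypothetical, and because the slide says nothing about how loss accumulates within the first year, the function refuses ages below 365 days rather than guessing.

```python
def estimated_missing_percent(age_days: float) -> float:
    """Rough percentage of shared resources expected to be missing.

    Uses only the two figures quoted on the slide: about 11% after one
    year, then roughly 0.02% more per day. Behaviour within the first
    year is not stated, so it is deliberately left undefined here.
    """
    if age_days < 365:
        raise ValueError("the slide gives no estimate for resources younger than one year")
    return 11.0 + 0.02 * (age_days - 365)


# Two years after publishing: 11% + 365 days * 0.02% per day.
two_year_estimate = estimated_missing_percent(730)
```

Under these assumptions the two-year estimate comes to about 18.3%.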

2015 Hany SalahEldeen Dissertation Defense53 Revisiting Existence after a year
Collections: MJ, Iran, H1N1, Obama, Egypt, Syria (each row lists ten values)
Missing:
Measured | 37.10% | 37.50% | 28.17% | 30.56% | 26.29% | 31.62% | 32.47% | 24.64% | 7.55% | 12.68%
Predicted | 31.72% | 31.42% | 31.96% | 30.98% | 30.16% | 29.68% | 29.60% | 28.36% | 19.80% | 11.54%
Error | 5.38% | 6.08% | 3.79% | 0.42% | 3.87% | 1.94% | 2.87% | 3.72% | 12.25% | 1.14%
Average Prediction Error = 4.15%; in all cases, our missing predictions were acceptable.
Archived:
Measured | 48.61% | 40.32% | 60.80% | 55.04% | 47.97% | 52.14% | 48.38% | 40.58% | 23.73% | 0.56%
Predicted | 61.78% | 61.18% | 62.26% | 60.30% | 58.66% | 57.70% | 57.54% | 55.06% | 37.94% | 21.42%
Error | 13.17% | 20.86% | 1.46% | 5.26% | 10.69% | 5.56% | 9.16% | 14.48% | 14.21% | 20.86%
Average Prediction Error = 11.57%; in all cases, our archival predictions were too optimistic.
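The two average prediction errors can be re-derived from the measured and predicted rows as a mean absolute error. The values below are copied from the slide; the helper name `mean_abs_error` is ours.

```python
from typing import Sequence


def mean_abs_error(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |measured - predicted| over paired values."""
    if len(measured) != len(predicted) or not measured:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


# Percentages from the slide, in the order they appear.
missing_measured = [37.10, 37.50, 28.17, 30.56, 26.29, 31.62, 32.47, 24.64, 7.55, 12.68]
missing_predicted = [31.72, 31.42, 31.96, 30.98, 30.16, 29.68, 29.60, 28.36, 19.80, 11.54]
archived_measured = [48.61, 40.32, 60.80, 55.04, 47.97, 52.14, 48.38, 40.58, 23.73, 0.56]
archived_predicted = [61.78, 61.18, 62.26, 60.30, 58.66, 57.70, 57.54, 55.06, 37.94, 21.42]

missing_error = mean_abs_error(missing_measured, missing_predicted)
archived_error = mean_abs_error(archived_measured, archived_predicted)
```

The results agree with the slide's 4.15% and 11.57% to two decimal places. Every archived prediction exceeds its measured value, which matches the remark that the archival predictions were too optimistic.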

2015 Hany SalahEldeen Dissertation Defense54 Shared Resource, Time, User: Alive, Missing, Replaced. Third: Can we use social context to find replacements of missing resources?

2015 Hany SalahEldeen Dissertation Defense55 Context discovery and shared resource replacement. Problem: the 140-character limit restricts how much a tweet can say about the linked resource. If the resource goes missing, can we get the next best thing? Solution: shared links typically have several tweets, responses, and retweets. We can mine these traces for context and viable replacements.

2015 Hany SalahEldeen Dissertation Defense56 Context Discovery Linking to:

2015 Hany SalahEldeen Dissertation Defense57 What if the resource disappeared? Linking to:

2015 Hany SalahEldeen Dissertation Defense58 Use Topsy to discover tweets sharing the same link

2015 Hany SalahEldeen Dissertation Defense59 Social Context Extraction { "URI": " "Related Tweet Count": 500, "Related Hashtags": "#tran #citizensx #arabspring #visualstorytelling #collaborativerevolution #feb11http://t.co/qxusp70...", "Users who talked @webdocumentario:...", "All associated unique links:": " "All other links associated:": " ", "Most frequent link appearing:": " "Number of times the Most frequent link appearing:": 49, "Most frequent tweet posted and reposted:": "Check out 18DaysInEgypt - A crowd sourced documentary project ================= "Number of times the Most frequent tweet appearing:": 46, "The longest common phrase appearing:": "RT 2ke0rEjP is an interactive documentary website that YOU can help create Get your Jan25 stories ready! Pl RT", "Number of times the Most common phrase appearing:": 18 }

2015 Hany SalahEldeen Dissertation Defense60 Build a Tweet Document. A tweet document represents the concatenation of all extracted tweets: “do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”

2015 Hany SalahEldeen Dissertation Defense61 Tweet Signature. Tweet Document: “do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …” Tweet Signature = top 5 most frequent terms from the Tweet Document: documentary, project, daysinegypt, check, sourced
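Building a tweet document and taking its five most frequent terms can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the tokenizer, the small stop-word list, and the function names are assumptions made here, and the slides do not say which terms were filtered out.

```python
import re
from collections import Counter
from typing import Iterable, List

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "on", "in", "it", "is",
              "this", "that", "out", "your", "you", "for", "with", "rt"}


def build_tweet_document(tweets: Iterable[str]) -> str:
    """Concatenate all extracted tweets into one lower-cased document."""
    return " ".join(t.lower() for t in tweets)


def tweet_signature(document: str, size: int = 5) -> List[str]:
    """Return the `size` most frequent non-stop-word terms of the document."""
    terms = [w for w in re.findall(r"[a-z0-9']+", document) if w not in STOP_WORDS]
    return [term for term, _ in Counter(terms).most_common(size)]


# Usage on made-up tweets (not taken from the dataset).
doc = build_tweet_document([
    "Check out this documentary project",
    "Documentary project on the revolution",
    "The documentary",
])
signature = tweet_signature(doc, size=2)
```

The resulting terms are what would be sent to the search engine as a query.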

2015 Hany SalahEldeen Dissertation Defense62 Query Google with the Tweet Signature

2015 Hany SalahEldeen Dissertation Defense63 Search Engine Results The original resource

2015 Hany SalahEldeen Dissertation Defense64 Search Engine Results The original resource The others are good replacement candidates

2015 Hany SalahEldeen Dissertation Defense65 Recommendation Evaluation. We extract a dataset of resources that are currently available and pretend these resources no longer exist (for a baseline). Each resource is text-based and has at least 30 retrievable tweets. → Extracted 731 unique resources. We use a boilerplate removal library to remove the template from the linked resources and from the top 10 retrieved results from Google. → We use cosine similarity to compare the documents.
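Cosine similarity between two cleaned documents can be computed over term-frequency vectors. The sketch below assumes plain term counts and a simple regular-expression tokenizer; the slide does not say which term weighting was used, so treat both as illustrative choices.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(re.findall(r"[a-z0-9']+", text_a.lower()))
    b = Counter(re.findall(r"[a-z0-9']+", text_b.lower()))
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


def is_viable_replacement(original: str, candidate: str, threshold: float = 0.70) -> bool:
    """Apply the 70% similarity cut-off used in the evaluation."""
    return cosine_similarity(original, candidate) >= threshold
```

In the evaluation setting, `original` would be the boilerplate-free text of the resource that is pretended missing and `candidate` one of the top 10 search results.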

2015 Hany SalahEldeen Dissertation Defense66 Similarity measures in resource replacement (figure: % similarity; share of the cases in which we found a replacement with >=70% similarity)

2015 Hany SalahEldeen Dissertation Defense67 Conclusion: We can find viable replacements for missing shared resources.
Results: In 41% of the test cases we can find a replacement page with at least 70% similarity to the original missing resource. The search results provide a mean reciprocal rank of 0.43.
Publications:
1. H. SalahEldeen and M. L. Nelson. Resurrecting my revolution: Using social link neighborhood in bringing context to the disappearing web. In Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
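Mean reciprocal rank averages 1/rank of the first correct result over all queries. A short sketch follows, with the assumption (ours, not stated on the slide) that a query whose original resource never appears in the results contributes 0.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over queries; rank is 1-based, None means not found."""
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Made-up ranks: found first, found second, not found, found fourth.
example = mean_reciprocal_rank([1, 2, None, 4])
```

For these four hypothetical queries the value is (1 + 0.5 + 0 + 0.25) / 4 = 0.4375.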

2015 Hany SalahEldeen Dissertation Defense68 Now that we have finished analyzing the shared resource… what’s next?

2015 Hany SalahEldeen Dissertation Defense69 Shared Resource, Time, User: Alive, Missing, Replaced. Footprints on the web

2015 Hany SalahEldeen Dissertation Defense70 The tweet, the resource… and time. Timeline: posted a tweet; read the tweet; another tweet posted; and another… The relevancy of the resource to the tweet changes through time → we need to measure tweet relevance through time.

2015 Hany SalahEldeen Dissertation Defense71 Shared Resource, Time, User: Alive, Missing, Replaced, Rate of Change. Longitudinal Study: Rate of change of shared content

2015 Hany SalahEldeen Dissertation Defense72 Pilot 1: Resource change in the first 80 hours after tweeting

2015 Hany SalahEldeen Dissertation Defense73 Pilot 2: Delta days from Bitly creation for just tweeted content Dataset size = 4,000

2015 Hany SalahEldeen Dissertation Defense74 Pilot 3: Dataset of 1,000 freshly created Bitlys: depth = 0, depth = 1, depth = 6

2015 Hany SalahEldeen Dissertation Defense75 What domains do users link to?

2015 Hany SalahEldeen Dissertation Defense76 What categories* do users link to? * Extracted from Alexa.com

2015 Hany SalahEldeen Dissertation Defense77 Summation of Intention in Social Content Through Time. Longitudinal study: we record the change over an extended period of time.
Content: we download a snapshot of the resource every 45 minutes.
Metadata: we collect metadata about the resource: Facebook likes and posts, tweets in the last hour, Bitly clicklogs and shares.
Average data size: ~1 TB per month

2015 Hany SalahEldeen Dissertation Defense78 Hourly analysis over an extended period of time

2015 Hany SalahEldeen Dissertation Defense79 There is a difference between t_tweet and t_click. After just one hour, 4% of the resources have changed by 30%. After six hours, the share doubles: 8% of the resources have changed by 40%. After a day, the rate slows: 12% of the resources have changed by 40%. After that, it almost stabilizes at 17% of the resources changed by 40%.

2015 Hany SalahEldeen Dissertation Defense80 Shared Resource, Time, User: Alive, Missing, Replaced, Rate of Change, Archive & Creation. First: Resource – Time – Public Archives

2015 Hany SalahEldeen Dissertation Defense81 Revisited: Resources Missing and Archived
Collection | Percentage Missing | Percentage Archived
H1N1 Outbreak | 23.49% | 41.65%
Michael Jackson | 36.24% | 39.45%
Iran | 26.98% | 43.08%
Obama | 24.59% | 47.87%
Egypt | 10.48% | 20.18%
Syria | 7.04% | 5.35%
Additional value pairs without a collection label: 31.62% | 30.78%, 24.47% | 36.26%, 25.64% | 43.87%, 26.15% | 46.15%

2015 Hany SalahEldeen Dissertation Defense82 But on a more general notion we want to know…

2015 Hany SalahEldeen Dissertation Defense83 How much of the web is archived? Goal: estimate how much of the public web is present in the public archives and how many copies are available. Action: get 4 different datasets from 4 different sources: search engine indices, Bit.ly, DMOZ, and Delicious.

2015 Hany SalahEldeen Dissertation Defense84 Conclusion: It depends on the source Results: Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages , New York, NY, USA, ACM.

2015 Hany SalahEldeen Dissertation Defense85 Conclusion: It depends on the source Results: Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages , New York, NY, USA, ACM. Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives % 92% 23% 26%

2015 Hany SalahEldeen Dissertation Defense86 Side Experiment: Analyzing the quality of the archives and the archived content Goal: Assessing the quality of the web archives Better discussed in Justin Brunelle’s work Publications: 1.J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources. In Proceedings of the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2014 (Best student paper award)

2015 Hany SalahEldeen Dissertation Defense87 A question emerged: When did a certain resource first appear on the web?

2015 Hany SalahEldeen Dissertation Defense88 Shared Resource, Time, User: Alive, Missing, Replaced, Rate of Change, Archive & Creation. Second: When was the resource created?

2015 Hany SalahEldeen Dissertation Defense89 Idea Web pages leave trails as well since the day they were created…

2015 Hany SalahEldeen Dissertation Defense90 Web trails. A web resource could leave a trail of any of the following, denoting its existence: references, links (anchors), social media likes and interactions, URL shortening, backlinks. The creation date of any of the associated events or trails could be an estimate of the resource’s creation date.

2015 Hany SalahEldeen Dissertation Defense91 Resource’s timeline

2015 Hany SalahEldeen Dissertation Defense92 Observations Recorded: 1. Last-modified date from the response header. 2. First appearance of a backlink. 3. First tweet published. 4. First Bitly shortened URL created. 5. Timestamp of the first memento in the archives. 6. Date of the last crawl by the search engine.

2015 Hany SalahEldeen Dissertation Defense93 Carbon Date service

2015 Hany SalahEldeen Dissertation Defense94 Carbon Dating API { "self": " "URI": " "Estimated Creation Date": " T04:02:33", "Last Modified": "", "Bitly.com": " T12:00:00", "Topsy.com": " T23:31:42", "Backlinks": " T05:35:44", "Google.com": " T00:00:00", "Archives": [ [ "Earliest", " T04:02:33" ], [ "By_Archive", { " " T05:28:26", " " T05:28:26", " " T04:02:33", " " T18:05:09" } ] }

2015 Hany SalahEldeen Dissertation Defense95 Evaluation Dataset → From each we randomly selected 100 unique URLs to create our gold standard dataset

2015 Hany SalahEldeen Dissertation Defense96 Evaluation: Applied our 6 methods on 1200 resources. Get leftmost estimate.
Outcome | Number of Resources | Percentage
An estimate found | 910 | 76%
Exact matching estimate | 393 | 33%
No estimate found | 290 | 24%
Total Resources | 1200 | 100%
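Taking the leftmost estimate means choosing the earliest timestamp among the sources that returned one. The sketch below assumes ISO 8601 strings and treats an empty string as "no estimate"; the function name and the sample dates are made up for illustration and are not taken from the Carbon Date service.

```python
from datetime import datetime
from typing import Dict, Optional


def leftmost_estimate(observations: Dict[str, str]) -> Optional[str]:
    """Return the earliest non-empty ISO 8601 timestamp, or None if none exist."""
    dates = [datetime.fromisoformat(ts) for ts in observations.values() if ts]
    return min(dates).isoformat() if dates else None


# Hypothetical observations for one resource (dates invented for the example).
estimate = leftmost_estimate({
    "Last Modified": "",
    "Bitly.com": "2012-01-02T12:00:00",
    "Topsy.com": "2012-01-01T23:31:42",
    "Archives": "2011-12-31T04:02:33",
})
```

A resource for which every source comes back empty yields `None`, which corresponds to the "No estimate found" row.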

2015 Hany SalahEldeen Dissertation Defense97 Actual Vs. Estimated Dates

2015 Hany SalahEldeen Dissertation Defense98 Conclusion: We can estimate the creation date of resources correctly Results: Succeeded in estimating the creation date accurately in 75.90% of the resources. Publications: 1.H. M. SalahEldeen and M. L. Nelson. Carbon dating the web: Estimating the age of web resources. In Proceedings of the 22nd International Conference on World Wide Web Companion, TempWeb03, WWW '13, 2013

2015 Hany SalahEldeen Dissertation Defense99 Alexander Nwala did an awesome job releasing the second version of Carbon Date which is more reliable, multithreaded, faster, can handle multiple requests, has caching capabilities.

2015 Hany SalahEldeen Dissertation Defense100 Alexander Nwala did an awesome job releasing the second version of Carbon Date which is more reliable, multithreaded, faster, can handle multiple requests, has caching capabilities. Yes, it’s better than mine… I admit it

2015 Hany SalahEldeen Dissertation Defense101 Shared Resource, Time, User: Alive, Missing, Replaced, Rate of Change, Archive & Creation. User’s Temporal Intention

2015 Hany SalahEldeen Dissertation Defense102 Problem: There is an inconsistency between what the tweet’s author intended to share at time t_tweet and what the reader might actually read upon clicking on the link at time t_click.

2015 Hany SalahEldeen Dissertation Defense103 Shared Resource, Time, User: Alive, Missing, Replaced, Rate of Change, Archive & Creation. Detecting: What is intention and how do we detect it?

2015 Hany SalahEldeen Dissertation Defense104 Amazon’s Mechanical Turk: a crowdsourcing Internet marketplace that coordinates the use of human intelligence to perform tasks that computers are currently unable to do.

2015 Hany SalahEldeen Dissertation Defense105 Goal: Understand and collect user intention data via MT. Tweets dataset → Intention classification tasks → User intention data → Train classifier

2015 Hany SalahEldeen Dissertation Defense106 Goal: Understand and collect user intention data via MT. Tweets dataset → Intention classification tasks → User intention data → Train classifier. Problem: It is not as easy as it seems!

2015 Hany SalahEldeen Dissertation Defense107 How NOT to classify temporal intention 101: The tweet is presented along with the two snapshots: at t_tweet and at t_click

2015 Hany SalahEldeen Dissertation Defense108 And compared MT results with experts. Experts: manually assigning a version to each tweet via a face-to-face meeting with WS-DL members. For 9 MT assignments per tweet: if we allowed 4-5 splits, we had a 58% match with WS-DL; if we allowed 3-6 splits or better, we got a 31% match → which is worse than flipping a coin!

2015 Hany SalahEldeen Dissertation Defense109 Idea: We need to transform the problem from intention to relevance.

2015 Hany SalahEldeen Dissertation Defense110 Relevance tasks are simpler. MT workers are more accustomed to classification tasks, and such tasks require a minimal amount of explanation. Transform a hard problem into an easy one: Is that a cat? - Yes - No

2015 Hany SalahEldeen Dissertation Defense111 Temporal Intention Relevancy Model (TIRM). Between t_tweet and t_click, the linked resource could have: changed, or not changed. The tweet and the linked resource could be: still relevant, or no longer relevant.

2015 Hany SalahEldeen Dissertation Defense112 Resource is changed but relevant. The resource changed, but it is still relevant → Intention: need the current version of the resource at any time

2015 Hany SalahEldeen Dissertation Defense113 Relevancy and Intention mapping Current

2015 Hany SalahEldeen Dissertation Defense114 Resource is changed and not relevant. The resource changed, and it is no longer relevant → Intention: need the past version of the resource at any time

2015 Hany SalahEldeen Dissertation Defense115 Relevancy and Intention mapping Past Current

2015 Hany SalahEldeen Dissertation Defense116 Resource is not changed and relevant. The resource is not changed, and it is relevant → Intention: need the past version of the resource at any time

2015 Hany SalahEldeen Dissertation Defense117 Relevancy and Intention mapping Past Current Past

2015 Hany SalahEldeen Dissertation Defense118 Resource is not changed and not relevant. The resource is not changed, but it is not relevant → Intention: I am not sure which version of the resource I need

2015 Hany SalahEldeen Dissertation Defense119 Relevancy and Intention mapping
Changed, relevant: Current
Changed, not relevant: Past
Not changed, relevant: Past
Not changed, not relevant: Not Sure
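
The four cases walked through on the preceding slides can be written as a small lookup. This is only a minimal sketch and not the implementation used in the dissertation; the function name `tirm_intention` and the string labels are hypothetical.

```python
def tirm_intention(changed: bool, relevant: bool) -> str:
    """Map (did the resource change?, is it still relevant?) to an intention label.

    The labels mirror the four TIRM cases on the slides; the names are illustrative.
    """
    mapping = {
        (True, True): "current",     # changed but still relevant
        (True, False): "past",       # changed and no longer relevant
        (False, True): "past",       # not changed and relevant
        (False, False): "not sure",  # not changed and not relevant
    }
    return mapping[(changed, relevant)]


if __name__ == "__main__":
    for changed in (True, False):
        for relevant in (True, False):
            print(changed, relevant, "->", tirm_intention(changed, relevant))
```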

2015 Hany SalahEldeen Dissertation Defense120 Validation: Update the MT experiment MT workers ≡ judgments of the experts (WS-DL members) ✓ Is the content still relevant to the tweet?

2015 Hany SalahEldeen Dissertation Defense121 Mechanical Turk Workers vs. Experts. For 100 tweets, % of agreement with WS-DL members (Cohen’s K → almost perfect agreement):
Agreement in 3-2 split or more votes: 93%
Agreement in 4-1 split or more votes: 80%
Agreement with 5-0 votes: 60%
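
As a reminder of what the agreement statistic measures, here is a minimal sketch of Cohen's kappa between the majority vote of the MT workers and the expert label. The helper names and the toy labels are hypothetical and are not the study's data; the slide's own kappa value is not reproduced here.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among one tweet's worker votes."""
    return Counter(votes).most_common(1)[0][0]


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((count_a[l] / n) * (count_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters always used the same single label
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy data only: five worker votes per tweet, one expert label per tweet.
    worker_votes = [
        ["rel", "rel", "rel", "rel", "rel"],
        ["rel", "rel", "not", "rel", "not"],
        ["not", "not", "not", "rel", "not"],
        ["rel", "rel", "rel", "not", "not"],
    ]
    experts = ["rel", "rel", "not", "not"]
    workers = [majority_vote(v) for v in worker_votes]
    print(cohens_kappa(experts, workers))
```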

2015 Hany SalahEldeen Dissertation Defense122 Shared Resource, Time, User: Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling. Can we model this temporal intention?

2015 Hany SalahEldeen Dissertation Defense123 Data Collection. From the SNAP dataset we extracted tweets in English. Each has an embedded URI pointing to an external resource. The embedded URI is shortened via Bit.ly. The external resource: still persists; has at least 10 mementos; is unique. → We extracted 5,937 unique instances
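
The selection criteria above amount to a filter over candidate tweets. The sketch below is an assumption-laden illustration, not the pipeline used in the dissertation: the `Candidate` fields are hypothetical, and "still persists" is interpreted here as an HTTP 200 response.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Candidate:
    # All field names are hypothetical, chosen only for this illustration.
    tweet_lang: str      # language code of the tweet, e.g. "en"
    short_uri: str       # the shortened URI embedded in the tweet
    final_uri: str       # the external resource the short URI resolves to
    http_status: int     # status of the resource today
    memento_count: int   # number of archived copies found


def select_instances(candidates, min_mementos=10):
    """Keep English tweets with a bit.ly link to a live, archived, unique resource."""
    seen = set()
    selected = []
    for c in candidates:
        if c.tweet_lang != "en":
            continue
        if urlparse(c.short_uri).netloc.lower() != "bit.ly":
            continue
        if c.http_status != 200:  # assumption: "still persists" means HTTP 200
            continue
        if c.memento_count < min_mementos:
            continue
        if c.final_uri in seen:   # enforce uniqueness of the external resource
            continue
        seen.add(c.final_uri)
        selected.append(c)
    return selected
```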

2015 Hany SalahEldeen Dissertation Defense124 Time delta between the tweet and the closest memento. Randomly selected 1,124 instances. Time delta range: 3.07 minutes to … hours. Average: … hours ~ 1 day. (Figure: closest mementos before and after tweet time.)

2015 Hany SalahEldeen Dissertation Defense125 Training Dataset. R_current: the state of the resource at current time. R_click: the state of the resource at click time. (Table: Relevant Assignments % and Non-Relevant Assignments % for 5 MT workers agreeing (5-0 split), 4 MT workers agreeing (4-1 split), and 3 MT workers agreeing (3-2 close call split).)

2015 Hany SalahEldeen Dissertation Defense127 Intention modeling: Feature extraction For each tweet we perform: Link analysis Social media mining Archival existence Sentiment analysis Content similarity Entity identification

2015 Hany SalahEldeen Dissertation Defense128 Training the classifier From the feature extraction phase we extracted 39 different features to train the classifier. Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32%
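
To make the evaluation protocol concrete, the following is a minimal sketch of how 10-fold cross validation partitions a dataset. It does not reproduce the cost-sensitive random forest classifier itself; `fit_and_score` is a hypothetical callback standing in for training on the training indices and returning the success rate on the held-out fold.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score(train, test) over the k folds."""
    scores = [fit_and_score(train, test)
              for train, test in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # 1,124 is the number of randomly selected instances mentioned on the slides.
    sizes = [len(test) for _, test in k_fold_indices(1124)]
    print(sizes)
```

Averaging over folds means every instance contributes to the reported rate exactly once as unseen data.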

2015 Hany SalahEldeen Dissertation Defense129 Most significant features sorted by information gain (Rank, Feature, Gain Ratio):
1. Existence of celebrities in tweets
2. Number of mementos
3. Tweet similarity with current page
4. Similarity: Current & past page
5. Similarity: Tweet & past page
6. Original URI’s depth (0.032)
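
For readers unfamiliar with the ranking criterion, here is a minimal sketch of information gain and gain ratio for a single discrete feature. It assumes the feature has already been discretized (for example, "celebrity present" as 0 or 1); the function names and toy values are illustrative and do not reproduce the slide's numbers.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the feature divided by the feature's split information."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected label entropy after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    labels = ["relevant", "relevant", "non-relevant", "non-relevant"]
    print(gain_ratio([1, 1, 0, 0], labels))  # feature separates the classes
    print(gain_ratio([1, 0, 1, 0], labels))  # feature carries no information
```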

2015 Hany SalahEldeen Dissertation Defense130 Testing the model. We tested against: the remaining 4,813 of the original 5,937 instances after extracting the 1,124 used in training; and the tweet collections based on historic events (MJ, Obama, Iran, Syria, & H1N1).
Dataset                     Status 200   Status 404 or other   Relevant %   Non-Relevant %
Extended 4,813 instances    96.77%       3.23%                 96.74%       3.26%
MJ’s Death                  57.54%       42.46%                93.24%       6.76%
H1N1 Outbreak               8.96%        91.04%                97.48%       2.52%
Iran Elections              68.21%       31.79%                94.69%       5.31%
Obama’s Nobel Prize         62.86%       37.14%                93.89%       6.11%
Syrian Uprising             80.80%       19.20%                70.26%       29.75%

2015 Hany SalahEldeen Dissertation Defense131 Idea: We need to transform the problem from intention to relevance. Now we need to transform it back! Recap…

2015 Hany SalahEldeen Dissertation Defense132 Recap: Relevancy and Intention mapping Past Reading the wrong history

2015 Hany SalahEldeen Dissertation Defense133 Mapping TIRM: We used 70% similarity as the threshold of relevancy. Reading the wrong history: in up to 25% of the cases
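
A thresholded similarity check of this kind might look like the sketch below. The slides do not say which similarity measure was used, so bag-of-words cosine similarity is an assumption made purely for illustration; only the 0.70 threshold comes from the slide, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors (assumed measure, see above)."""
    a = Counter(re.findall(r"\w+", text_a.lower()))
    b = Counter(re.findall(r"\w+", text_b.lower()))
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def is_relevant(reference_text, candidate_text, threshold=0.70):
    """Treat the pair as relevant when similarity reaches the 70% threshold."""
    return cosine_similarity(reference_text, candidate_text) >= threshold


if __name__ == "__main__":
    past = "obama wins the nobel peace prize"
    current = "site under maintenance please come back later"
    print(is_relevant(past, past), is_relevant(past, current))
```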

2015 Hany SalahEldeen Dissertation Defense134 Conclusion: We can model users’ temporal intention accurately and efficiently. Results: We successfully transformed the complicated problem of intention into a simpler one of relevance. We successfully collected a gold standard dataset of temporal user intention. We found a temporal inconsistency in the shared resource in up to 25% of the cases, according to the dataset. Publications: 1. H. M. SalahEldeen and M. L. Nelson. Reading the correct history?: Modeling temporal intention in resource sharing. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, 2013.

2015 Hany SalahEldeen Dissertation Defense135 So we modeled intention… can we make it better?

2015 Hany SalahEldeen Dissertation Defense136 Most significant features sorted by information gain (Rank, Feature, Gain Ratio):
1. Existence of celebrities in tweets
2. Number of mementos
3. Tweet similarity with current page
4. Similarity: Current & past page
5. Similarity: Tweet & past page
6. Original URI’s depth (0.0324)

2015 Hany SalahEldeen Dissertation Defense138 Enhancing TIRM Extending and tuning the features: Linguistic feature analysis Semantic similarity analysis using latent topic modeling Dataset balancing Feature selection and minimization

2015 Hany SalahEldeen Dissertation Defense139 A whole lot of features! 39 → 65 different features in extended TIRM. Further details: refer to chapter 7

2015 Hany SalahEldeen Dissertation Defense140 TIRM enhancement and minimization results

2015 Hany SalahEldeen Dissertation Defense141 From binary to probabilistic strength. Point of Confusion: C. Point of Certainty: S → Strongest Current Intention. Further details: refer to chapter 7

2015 Hany SalahEldeen Dissertation Defense142 Intention strength formulation Intention strength magnitude of the new resource: Generalization in regards of class:

2015 Hany SalahEldeen Dissertation Defense143 Intention strength across instances in dataset

2015 Hany SalahEldeen Dissertation Defense145 Shared Resource, Time, User: Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling, Predicting. Can we find a relation between the modeled intention and time… to predict it?

2015 Hany SalahEldeen Dissertation Defense146 Remember: Data Collection. From the SNAP dataset we extracted tweets in English. Each has an embedded URI pointing to an external resource. The embedded URI is shortened via Bit.ly. The external resource: still persists; has at least 10 mementos; is unique. → We extracted 5,937 unique instances

2015 Hany SalahEldeen Dissertation Defense147 Intention strength across time. (Timeline: from Resource = closest memento to Resource = current version.) We have 10 mementos of the resource, uniformly distributed… We can calculate intention strength at every point

2015 Hany SalahEldeen Dissertation Defense148 Intention strength across time Dataset collection and calculation framework

2015 Hany SalahEldeen Dissertation Defense149 Behavior of instances in different classes time Intention strength Steady Current Intention Steady Past Intention Changing Intention

2015 Hany SalahEldeen Dissertation Defense150 Behavior of instances in different classes

2015 Hany SalahEldeen Dissertation Defense151 Given the features we already collected can we classify tweets according to their behavioral class?

2015 Hany SalahEldeen Dissertation Defense152 Classifying intention behavior across time

2015 Hany SalahEldeen Dissertation Defense153 If we can limit the features to the ones that exist before tweet time can we perform a prediction?

2015 Hany SalahEldeen Dissertation Defense154 Classifying intention behavior across time → We can perform a prediction!

2015 Hany SalahEldeen Dissertation Defense155 Intention behavior prediction classifier

2015 Hany SalahEldeen Dissertation Defense156 Conclusion: We can predict the author’s temporal intention. Results: We can predict for the author, with 77% accuracy, whether the intention conveyed to the readers will stay consistent or change. Publications: 1. H. M. SalahEldeen and M. L. Nelson. Predicting Temporal Intention in Resource Sharing. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '15, 2015.

2015 Hany SalahEldeen Dissertation Defense157 At this point, we have successfully detected, modeled, and predicted users’ temporal intention in shared content

2015 Hany SalahEldeen Dissertation Defense158 Shared Resource, Time, User: Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling, Predicting. User Temporal Intention: Temporal Intention Model

2015 Hany SalahEldeen Dissertation Defense159 So we built an awesome prediction model for Temporal Intention… what next?

2015 Hany SalahEldeen Dissertation Defense160 A Framework of Temporal Intention time Posted a tweet Read the tweet Tools for authors Enrich the archives with current content for posterity

2015 Hany SalahEldeen Dissertation Defense161 Prediction API

2015 Hany SalahEldeen Dissertation Defense162 Tools for Authors

2015 Hany SalahEldeen Dissertation Defense163 Temporal Intention Implementation. Timeline: posted a tweet, read the tweet. Tools for readers: maintain the temporal consistency of content

2015 Hany SalahEldeen Dissertation Defense164 Tools for readers

2015 Hany SalahEldeen Dissertation Defense165 Tools for readers: 1. Temporal preservation of vulnerable content. 2. Version recommendation based on temporal intention estimation. Target Publication: Utilizing Temporal Intention Prediction for Just-in-time Preservation and Recommendation of Vulnerable Social Media Content. WSDM 2016

2015 Hany SalahEldeen Dissertation Defense166 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

2015 Hany SalahEldeen Dissertation Defense167 Accomplished Goals
Detect the temporal intention of: 1. the author upon sharing time; 2. the reader upon dereferencing time. Further details: refer to chapter 6
Model this intention as a function of time, nature of the resource, and its context. Further details: refer to chapter 7
Predict how resources change with time and the intention behind sharing them to minimize inconsistency. Further details: refer to chapter 8
Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web. Further details: refer to chapter 9

2015 Hany SalahEldeen Dissertation Defense168 Also, our work reached fame…

2015 Hany SalahEldeen Dissertation Defense169 The Virginian Pilot

2015 Hany SalahEldeen Dissertation Defense the-decaying-web BBC.com

2015 Hany SalahEldeen Dissertation Defense171 Popular Mechanics February 2014 issue, page 20

2015 Hany SalahEldeen Dissertation Defense172 3 x MIT Technology Review page/ reconstruct-lost-web-pages/ is-vanishing-from-the-web-say-computer-scientists/

2015 Hany SalahEldeen Dissertation Defense173 Mashable

2015 Hany SalahEldeen Dissertation Defense174 Mashable Yes I am Indiana Jones of the internet

2015 Hany SalahEldeen Dissertation Defense175 Publications (Published | Submitted | In preparation | Planned): JCDL 2011, TPDL 2015, WWW 2016, IJDL 2016, TPDL 2012, SIGIR 2016, WSDM 2016, JCDL 2013, TPDL 2013, WWW 2013, DL 2014, AAAI 2015, IJDL 2015, JCDL 2015

2015 Hany SalahEldeen Dissertation Defense176 Remember Rémi Ochlik? Rémi Ochlik 16 October 1983 – 22 February 2012

2015 Hany SalahEldeen Dissertation Defense177 … and the missing content about him? Accessed in July 2014

2015 Hany SalahEldeen Dissertation Defense178 We can maintain the consistency of history Our Temporal Intention Relevancy Model