Detecting, Modeling, & Predicting User Temporal Intention in Social Media. Hany M. SalahEldeen, Doctor of Philosophy Dissertation Defense, 2015.


1 2015 Hany SalahEldeen Dissertation Defense1 Detecting, Modeling, & Predicting User Temporal Intention in Social Media. Hany M. SalahEldeen. Doctor of Philosophy Dissertation Defense, Old Dominion University, Department of Computer Science. Advisor: Dr. Michael L. Nelson. Committee: Dr. Michele C. Weigle, Dr. Hussein M. Abdel-Wahab, Dr. M’Hammed Abdous. May 5th, 2015

2 2015 Hany SalahEldeen Dissertation Defense2 All tweets are equal… …but some are more equal than the others

3 2015 Hany SalahEldeen Dissertation Defense3 It is imperative to know… 1. How long would these last? 2. And if lost, is there a backup somewhere? 3. Is this what the author intended?

4 2015 Hany SalahEldeen Dissertation Defense4 To maintain historical integrity Since tweets are considered the first draft of history… the historical integrity of the tweets could be compromised.

5 2015 Hany SalahEldeen Dissertation Defense5 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

6 2015 Hany SalahEldeen Dissertation Defense6 People rely on social media for most updated information

7 2015 Hany SalahEldeen Dissertation Defense7 Social media is more than kitty photos Marie Colvin January 12, 1956 – February 22, 2012 Rémi Ochlik 16 October 1983 – 22 February 2012 Ahmed Assem 1987 – July 8, 2013

8 2015 Hany SalahEldeen Dissertation Defense8 For the web is dark, and full of missing content… 3 out of 8 external links on Rémi’s Wikipedia page return 404 (accessed in July 2014)

9 2015 Hany SalahEldeen Dissertation Defense9 even for content shared in social media Accessed in July 2014

10 2015 Hany SalahEldeen Dissertation Defense10 News sites are also prone to change Accessed in July 2014

11 2015 Hany SalahEldeen Dissertation Defense11 So are specialized sites Accessed in July 2014

12 2015 Hany SalahEldeen Dissertation Defense12 Research Problem: Author’s Intention ≠ Reader’s Experience

13 2015 Hany SalahEldeen Dissertation Defense13 Research Implication Author’s Intention ≠ Reader’s Experience Broken Inconsistent Web and Historical Records

14 2015 Hany SalahEldeen Dissertation Defense14 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

15 2015 Hany SalahEldeen Dissertation Defense15 Social Post

16 2015 Hany SalahEldeen Dissertation Defense16 The anatomy of a tweet. Labeled parts: Author’s username; Other user mention; Tweet body; Hashtag; Shortened URL to resource; Publishing timestamp; Interaction options; Social Post; Shared Resource

17 2015 Hany SalahEldeen Dissertation Defense17 3 URIs = 3 Chances to fail

18 2015 Hany SalahEldeen Dissertation Defense18 URL shortening and aliasing
curl -L -I http://bit.ly/losing_revolution
HTTP/1.1 301 Moved Permanently
Server: nginx
Date: Mon, 07 Jul 2014 18:19:48 GMT
Cache-Control: private; max-age=90
Location: http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
Mime-Version: 1.0
Set-Cookie: _bit=53bae4c4-00328-04f10-cb1cf10a;domain=.bit.ly;expires=Sat Jan 3 18:19:48 2015;path=/; HttpOnly
Content-Type: text/html;charset=utf-8
Content-Length: 167
HTTP/1.1 200 OK
Expires: Mon, 07 Jul 2014 18:19:52 GMT
Date: Mon, 07 Jul 2014 18:19:52 GMT
Cache-Control: private, max-age=0
Last-Modified: Mon, 07 Jul 2014 18:19:07 GMT
ETag: "e3555826-b103-4daa-a3f2-d0509ebab51f"
X-Content-Type-Options: nosniff
X-XSS-Protection: 1; mode=block
Server: GSE
Alternate-Protocol: 80:quic
Content-Type: text/html;charset=UTF-8
Content-Length: 0
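The header dump on this slide shows two hops: the Bitly alias answers 301 with a Location header, and the target answers 200. A minimal sketch of reading such a dump programmatically follows. The helper name `parse_redirect_chain` and the shortened sample text are illustrative assumptions, not part of the dissertation's tooling.

```python
def parse_redirect_chain(header_dump):
    """Split `curl -L -I` style output into one (status, location) pair per hop."""
    hops = []
    for raw_line in header_dump.splitlines():
        line = raw_line.strip()
        if line.startswith("HTTP/"):
            # A status line such as "HTTP/1.1 301 Moved Permanently" opens a new hop.
            hops.append([int(line.split()[1]), None])
        elif line.lower().startswith("location:") and hops:
            # The Location header names the next URI in the chain.
            hops[-1][1] = line.split(":", 1)[1].strip()
    return [tuple(hop) for hop in hops]


# Shortened sample (assumption: only the headers relevant to the chain are kept).
sample = """HTTP/1.1 301 Moved Permanently
Server: nginx
Location: http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html
HTTP/1.1 200 OK
Server: GSE
"""

chain = parse_redirect_chain(sample)
print(chain)
```

Each hop in the chain is a separate URI that can fail, which is the point the preceding slide makes about three URIs giving three chances to fail.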

19 2015 Hany SalahEldeen Dissertation Defense19 Life cycle of a social post

20 2015 Hany SalahEldeen Dissertation Defense20 Life cycle of a social post tweets

21 2015 Hany SalahEldeen Dissertation Defense21 Life cycle of a social post tweets Links to

22 2015 Hany SalahEldeen Dissertation Defense22 Life cycle of a social post tweets What the reader receives Links to Same state the author intended

23 2015 Hany SalahEldeen Dissertation Defense23 Life cycle of a social post tweets What the reader receives Links to Same state the author intended Ideally!

24 2015 Hany SalahEldeen Dissertation Defense24 Life cycle of a social post tweets What the reader receives Links to Same state the author intended After a period of time

25 2015 Hany SalahEldeen Dissertation Defense25 Life cycle of a social post tweets What the reader receives Links to Same state the author intended The resource has disappeared After a period of time

26 2015 Hany SalahEldeen Dissertation Defense26 Life cycle of a social post tweets What the reader receives Links to Same state the author intended The resource has disappeared The resource has changed After a period of time

27 2015 Hany SalahEldeen Dissertation Defense27 Memento framework * http://mementoweb.org/guide/rfc/http://mementoweb.org/guide/rfc/

28 2015 Hany SalahEldeen Dissertation Defense28 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

29 2015 Hany SalahEldeen Dissertation Defense29 Related Work
Social media analysis:
Understanding Microblogging: Zhao 2009, Yang 2010, Newman 2003, Kwak 2010, Java 2007, Cha 2009
History Narration: Vieweg 2010, Starbird 2010-2012, Qu 2011, Neubig 2011, Lehman and Lalmas 2012-2013
User’s Web Search Intention: Ashkan 2009, Lee 2005, Loser 2008, Azzopardi 2009, Baeza-Yates 2006, Dai 2011
Commercial Intention: Guo 2010, Benczur 2007
Sentiment Analysis: Mishne 2006, Bollen 2011
Access to Archives: Van de Sompel 2009
Persistence of shared resources: Nelson 2002, Sanderson 2011, McCown 2007
URL Shortening: Antoniades 2011
Tweeting, Micro-blogging and Popularity: Wu 2011, Java 2007, Kwak 2010
Social Networks Growth and Evolution: Meeder 2011
Further details: refer to chapter 3

30 2015 Hany SalahEldeen Dissertation Defense30 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

31 2015 Hany SalahEldeen Dissertation Defense31 Research Question: Can we estimate the users’ intention at the time of posting and reading to predict and maintain temporal consistency?

32 2015 Hany SalahEldeen Dissertation Defense32 Research Goals
Detect the temporal intention of: 1. the author at sharing time; 2. the reader at dereferencing time.
Model this intention as a function of time, nature of the resource, and its context.
Predict how resources change with time and the intention behind sharing them to minimize inconsistency.
Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss and provide a smooth temporal navigation of the social web.
Further details: refer to chapters 6, 7, 8, and 9

33 2015 Hany SalahEldeen Dissertation Defense33 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

34 2015 Hany SalahEldeen Dissertation Defense34 Our analysis covers three angles: Shared Resource, Time, User

35 2015 Hany SalahEldeen Dissertation Defense35 Shared Resource, Time, User. Loss and Persistence of Shared Resources

36 2015 Hany SalahEldeen Dissertation Defense36 Shared Resource, Time, User. Alive. First: Estimate social media content loss

37 2015 Hany SalahEldeen Dissertation Defense37 Six socially significant events
Event | Source | Year
Iranian Election | SNAP Dataset | 2009
H1N1 Virus Outbreak | SNAP Dataset | 2009
Michael Jackson’s Death | SNAP Dataset | 2009
Obama’s Nobel Peace Prize | SNAP Dataset | 2009
The Egyptian Revolution | Twitter, Websites, Books | 2011
The Syrian Uprising | Twitter API | 2012

38 2015 Hany SalahEldeen Dissertation Defense38 Twitter tag expansion and filtration

39 2015 Hany SalahEldeen Dissertation Defense39 Twitter tag expansion increases precision

40 2015 Hany SalahEldeen Dissertation Defense40 What are people sharing?

41 2015 Hany SalahEldeen Dissertation Defense41 Existence on the live web and in the archives
For each unique URL we resolved the final HTTP response and considered 2 classes:
Success: 200 OK
Failure: the 4XX and 50X families, 30X redirect loops, and soft 404s
We then used the Memento aggregator. Archived: the URL has at least one memento in the timemap
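The two checks described on this slide can be written as a small decision rule. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and detecting soft 404s or redirect loops is assumed to happen elsewhere and be passed in as flags.

```python
def classify_resource(final_status, memento_count, soft_404=False, redirect_loop=False):
    """Return (live_web_class, is_archived) for one resolved URL.

    final_status   -- HTTP status of the last response after following redirects
    memento_count  -- number of mementos listed in the aggregator's timemap
    soft_404, redirect_loop -- flags assumed to be computed by separate checks
    """
    # Only a genuine 200 OK counts as success; everything else is a failure.
    is_success = final_status == 200 and not soft_404 and not redirect_loop
    live_web_class = "success" if is_success else "failure"
    # A resource is archived if the timemap holds at least one memento.
    is_archived = memento_count >= 1
    return live_web_class, is_archived


print(classify_resource(200, 3))   # live and archived
print(classify_resource(404, 0))   # missing and not archived
```

The two outcomes are independent: a resource can be missing from the live web yet archived, which is the case where a backup exists.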

42 2015 Hany SalahEldeen Dissertation Defense42 Resources Missing and Archived
Collection | Percentage Missing | Percentage Archived
H1N1 Outbreak | 23.49% | 41.65%
Michael Jackson | 36.24% | 39.45%
Iran | 26.98% | 43.08%
Obama | 24.59% | 47.87%
Egypt | 10.48% | 20.18%
Syria | 7.04% | 5.35%
(unlabeled) | 31.62% | 30.78%
(unlabeled) | 24.47% | 36.26%
(unlabeled) | 25.64% | 43.87%
(unlabeled) | 26.15% | 46.15%

43 2015 Hany SalahEldeen Dissertation Defense43 Shared Resource, Time, User. Alive, Missing. Second: Can we measure existence and disappearance as a function of time?

44 2015 Hany SalahEldeen Dissertation Defense44 Resources Missing and Archived
Collection | Percentage Missing | Percentage Archived
H1N1 Outbreak | 23.49% | 41.65%
Michael Jackson | 36.24% | 39.45%
Iran | 26.98% | 43.08%
Obama | 24.59% | 47.87%
Egypt | 10.48% | 20.18%
Syria | 7.04% | 5.35%
(unlabeled) | 31.62% | 30.78%
(unlabeled) | 24.47% | 36.26%
(unlabeled) | 25.64% | 43.87%
(unlabeled) | 26.15% | 46.15%

45 2015 Hany SalahEldeen Dissertation Defense45 Timeline of Events

46 2015 Hany SalahEldeen Dissertation Defense46 Timeline of Events

47 2015 Hany SalahEldeen Dissertation Defense47 Social Events Having a Bimodal Time Distribution

48 2015 Hany SalahEldeen Dissertation Defense48 Timeline of Events

49 2015 Hany SalahEldeen Dissertation Defense49 Social Events Having a Bimodal Time Distribution

50 2015 Hany SalahEldeen Dissertation Defense50 Existence as a function of time

51 2015 Hany SalahEldeen Dissertation Defense51 Existence as a function of time

52 2015 Hany SalahEldeen Dissertation Defense52 Conclusion: Existence could be estimated as a function of time
Results:
Measured 21,625 resources from 6 data sets in the archives and on the live web.
A year after publishing, about 11% of content shared on social media will be gone.
After that we are losing roughly 0.02% daily.
Publications and Articles:
1. H. M. SalahEldeen. Losing My Revolution: A year after the Egyptian Revolution, 10% of the social media documentation is gone. http://ws-dl.blogspot.com/2012/02/2012-02-11-losing-my-revolution-year.html, 2012.
2. H. M. SalahEldeen and M. L. Nelson. Losing my revolution: how many resources shared on social media have been lost? In Proceedings of the Second international conference on Theory and Practice of Digital Libraries, TPDL'12, 2012.
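The two figures on this slide (about 11% gone after the first year, then roughly 0.02% per day) are enough for a back-of-the-envelope estimate. The sketch below assumes the loss is linear after the first year and makes no claim about the first 365 days, since the slide gives no rate for that period. The function and constant names are illustrative.

```python
FIRST_YEAR_LOSS_PCT = 11.0   # from the slide: ~11% gone one year after publishing
DAILY_LOSS_PCT = 0.02        # from the slide: ~0.02% lost per day afterwards


def estimated_loss_pct(age_days):
    """Estimate the percentage of shared resources missing at a given age.

    Assumption: linear loss after day 365, capped at 100%.
    """
    if age_days < 365:
        raise ValueError("the slide only gives figures from one year onward")
    return min(100.0, FIRST_YEAR_LOSS_PCT + DAILY_LOSS_PCT * (age_days - 365))


# Two years after publishing: 11% + 0.02% * 365 days
print(round(estimated_loss_pct(730), 2))
```

Under these assumptions the estimate for two-year-old content is 18.3%.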

53 2015 Hany SalahEldeen Dissertation Defense53 Revisiting Existence after a year
Missing (collections: MJ, Iran, H1N1, Obama, Egypt, Syria; ten measurements in slide order)
Measured | 37.10% | 37.50% | 28.17% | 30.56% | 26.29% | 31.62% | 32.47% | 24.64% | 7.55% | 12.68%
Predicted | 31.72% | 31.42% | 31.96% | 30.98% | 30.16% | 29.68% | 29.60% | 28.36% | 19.80% | 11.54%
Error | 5.38% | 6.08% | 3.79% | 0.42% | 3.87% | 1.94% | 2.87% | 3.72% | 12.25% | 1.14%
Average Prediction Error = 4.15%. In all cases, our missing predictions were acceptable.
Archived (collections: MJ, Iran, H1N1, Obama, Egypt, Syria; ten measurements in slide order)
Measured | 48.61% | 40.32% | 60.80% | 55.04% | 47.97% | 52.14% | 48.38% | 40.58% | 23.73% | 0.56%
Predicted | 61.78% | 61.18% | 62.26% | 60.30% | 58.66% | 57.70% | 57.54% | 55.06% | 37.94% | 21.42%
Error | 13.17% | 20.86% | 1.46% | 5.26% | 10.69% | 5.56% | 9.16% | 14.48% | 14.21% | 20.86%
Average Prediction Error = 11.57%. In all cases, our archival predictions were too optimistic.
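The two average prediction errors on this slide can be reproduced from the measured and predicted rows as the mean of the absolute differences. The sketch below copies the values from the slide; the variable and function names are illustrative.

```python
def mean_abs_error(measured, predicted):
    """Average of |measured - predicted| over paired values (in percentage points)."""
    return sum(abs(m - p) for m, p in zip(measured, predicted)) / len(measured)


# Values copied from the slide, in slide order.
missing_measured = [37.10, 37.50, 28.17, 30.56, 26.29, 31.62, 32.47, 24.64, 7.55, 12.68]
missing_predicted = [31.72, 31.42, 31.96, 30.98, 30.16, 29.68, 29.60, 28.36, 19.80, 11.54]
archived_measured = [48.61, 40.32, 60.80, 55.04, 47.97, 52.14, 48.38, 40.58, 23.73, 0.56]
archived_predicted = [61.78, 61.18, 62.26, 60.30, 58.66, 57.70, 57.54, 55.06, 37.94, 21.42]

missing_error = round(mean_abs_error(missing_measured, missing_predicted), 2)
archived_error = round(mean_abs_error(archived_measured, archived_predicted), 2)
print(missing_error, archived_error)  # 4.15 11.57

# Every archival prediction exceeds its measurement, i.e. is too optimistic.
print(all(p > m for m, p in zip(archived_measured, archived_predicted)))  # True
```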

54 2015 Hany SalahEldeen Dissertation Defense54 Shared Resource, Time, User. Alive, Missing, Replaced. Third: Can we use social context to find replacements of missing resources?

55 2015 Hany SalahEldeen Dissertation Defense55 Context discovery and shared resource replacement
Problem: The 140-character limit restricts the description of the linked resource. If it goes missing, can we get the next best thing?
Solution: Shared links typically have several tweets, responses, and retweets. We can mine these traces for context and viable replacements.

56 2015 Hany SalahEldeen Dissertation Defense56 Context Discovery Linking to: http://beta.18daysinegypt.com/

57 2015 Hany SalahEldeen Dissertation Defense57 What if the resource disappeared? Linking to: http://beta.18daysinegypt.com/

58 2015 Hany SalahEldeen Dissertation Defense58 Use Topsy to discover tweets sharing the same link

59 2015 Hany SalahEldeen Dissertation Defense59 Social Context Extraction
{
  "URI": "http://beta.18daysinegypt.com/",
  "Related Tweet Count": 500,
  "Related Hashtags": "#tran #citizensx #arabspring #visualstorytelling #collaborativerevolution #feb11http://t.co/qxusp70...",
  "Users who talked about this": "@petra_stienen: @waleedrashed: @omarsamra @ungormite: @dcisbusy @webdocumentario:...",
  "All associated unique links:": "http://t.co/63X1f3f1 http://t.co/reBh6c4V http://t.co/B3GuhQN4 http://t.co/X2sjf4Rf http://t.co/P9iR28fH http://t.co/1C4EPh8h...",
  "All other links associated:": "http://vimeo.com/35368376 http://mashable.com/2012/01/21/18daysinegypt-2/ ",
  "Most frequent link appearing:": "http://t.co/2ke0rEjP",
  "Number of times the Most frequent link appearing:": 49,
  "Most frequent tweet posted and reposted:": "Check out 18DaysInEgypt - A crowd sourced documentary project ================= via @18daysinegypt",
  "Number of times the Most frequent tweet appearing:": 46,
  "The longest common phrase appearing:": "RT 2ke0rEjP is an interactive documentary website that YOU can help create Get your Jan25 stories ready! Pl RT",
  "Number of times the Most common phrase appearing:": 18
}

60 2015 Hany SalahEldeen Dissertation Defense60 Build a Tweet Document. A tweet document represents the concatenation of all extracted tweets: “do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”

61 2015 Hany SalahEldeen Dissertation Defense61 Tweet Signature
Tweet Document: “do you have a story to tell about your 18 days of revolution? share it or contact sara 18days brand new interactive storytelling project on egyptian revolution a very creative platform to tell your story daysinegypt marches heading to tahrir square now from all over cairo it's all over again use the website to document your revolutionary stories and share them with the world! check out awesome documentary project crowdsourcing a people's narrative of the egyptian revolution …”
Tweet Signature = top 5 most frequent terms from the Tweet Document: documentary, project, daysinegypt, check, sourced
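The tweet signature defined on this slide is the five most frequent terms of the tweet document. The sketch below shows one way to compute it. The tokenizer, the tiny stop-word list, and the minimum term length are assumptions for illustration; the slide does not say how the dissertation tokenizes or filters terms.

```python
import re
from collections import Counter

# Assumption: a small illustrative stop-word list, not the one used in the study.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "or", "on", "it", "is",
             "in", "out", "your", "you", "do", "have", "via", "all", "from"}


def tweet_signature(tweet_document, k=5):
    """Return the k most frequent non-stop-word terms of a tweet document."""
    terms = re.findall(r"[a-z0-9']+", tweet_document.lower())
    counts = Counter(t for t in terms if t not in STOPWORDS and len(t) > 2)
    # most_common breaks ties by first appearance in the document.
    return [term for term, _ in counts.most_common(k)]


doc = ("check out the documentary project "
       "crowd sourced documentary project daysinegypt "
       "a documentary on the revolution")
print(tweet_signature(doc, k=2))  # ['documentary', 'project']
```

The resulting terms are then joined into a search query, as the next slide describes.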

62 2015 Hany SalahEldeen Dissertation Defense62 Query Google with the Tweet Signature

63 2015 Hany SalahEldeen Dissertation Defense63 Search Engine Results The original resource

64 2015 Hany SalahEldeen Dissertation Defense64 Search Engine Results The original resource The others are good replacement candidates

65 2015 Hany SalahEldeen Dissertation Defense65 Recommendation Evaluation
We extract a dataset of resources that are currently available:
Pretend these resources no longer exist (for a baseline).
Each resource is text-based.
Each resource has at least 30 retrievable tweets.
→ Extracted 731 unique resources
We use a boilerplate removal library to remove the template from the linked resources and from the top 10 retrieved results from Google.
→ We use cosine similarity to compare the documents

66 2015 Hany SalahEldeen Dissertation Defense66 Similarity measures in resource replacement. Threshold line: 70% similarity. In 41% of the cases we found a replacement with >= 70% similarity
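The comparison step uses cosine similarity with a 70% threshold. A minimal sketch follows, assuming raw term-count vectors over whitespace tokens; the slides do not state the term weighting, so this is an illustration rather than the study's exact setup. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine similarity between two documents using raw term counts."""
    vec_a, vec_b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_viable_replacement(original_text, candidate_text, threshold=0.70):
    """Apply the 70% similarity threshold from the slide."""
    return cosine_similarity(original_text, candidate_text) >= threshold


original = "egyptian revolution documentary project"
candidate = "egyptian revolution documentary website"
print(cosine_similarity(original, candidate))      # 0.75
print(is_viable_replacement(original, candidate))  # True
```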

67 2015 Hany SalahEldeen Dissertation Defense67 Conclusion: We can find viable replacements for missing shared resources
Results:
In 41% of the test cases we can find a replacement page with at least 70% similarity to the original missing resource.
The search results provide a mean reciprocal rank of 0.43.
Publications:
1. H. SalahEldeen and M. L. Nelson. Resurrecting my revolution: Using social link neighborhood in bringing context to the disappearing web. In Research and Advanced Technology for Digital Libraries - International Conference on Theory and Practice of Digital Libraries, TPDL 2013, 2013.
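Mean reciprocal rank averages 1/rank of the first correct result over all queries, counting 0 when the correct result is absent. The sketch below is a generic implementation with made-up ranks; it does not reproduce the 0.43 reported on the slide.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the original resource per query, or None if not returned."""
    if not ranks:
        return 0.0
    reciprocal_sum = sum(1.0 / rank for rank in ranks if rank is not None)
    return reciprocal_sum / len(ranks)


# Hypothetical ranks for four queries (illustrative data only).
example_ranks = [1, 2, None, 4]
print(mean_reciprocal_rank(example_ranks))  # 0.4375
```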

68 2015 Hany SalahEldeen Dissertation Defense68 Now we finished analyzing the shared resource…what’s next?

69 2015 Hany SalahEldeen Dissertation Defense69 Shared Resource, Time, User. Alive, Missing, Replaced. Footprints on the web

70 2015 Hany SalahEldeen Dissertation Defense70 The tweet, the resource…and time. Timeline labels: Posted a tweet; Read the tweet; Another tweet posted; And another … The relevancy of the resource to the tweet changed through time → we need to measure tweet relevance through time

71 2015 Hany SalahEldeen Dissertation Defense71 Shared Resource, Time, User. Alive, Missing, Replaced, Rate of Change. Longitudinal Study: Rate of change of shared content

72 2015 Hany SalahEldeen Dissertation Defense72 Pilot 1: Resource change in the first 80 hours after tweeting

73 2015 Hany SalahEldeen Dissertation Defense73 Pilot 2: Delta days from Bitly creation for just tweeted content Dataset size = 4,000

74 2015 Hany SalahEldeen Dissertation Defense74 Pilot 3: Dataset of 1,000 freshly created Bitlys
http://www.cnn.com → depth = 0
http://www.cnn.com/world → depth = 1
http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson → depth = 6
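The depth values in these three examples match the number of non-empty path segments in the URI. A minimal sketch, assuming that is the definition in use; `uri_depth` is an illustrative name.

```python
from urllib.parse import urlparse


def uri_depth(uri):
    """Count the non-empty path segments of a URI."""
    path = urlparse(uri).path
    return len([segment for segment in path.split("/") if segment])


for uri in ("http://www.cnn.com",
            "http://www.cnn.com/world",
            "http://www.cnn.com/2009/SHOWBIZ/Music/06/25/jackson"):
    print(uri, uri_depth(uri))
```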

75 2015 Hany SalahEldeen Dissertation Defense75 What domains do users link to?

76 2015 Hany SalahEldeen Dissertation Defense76 What categories* do users link to? * Extracted from Alexa.com

77 2015 Hany SalahEldeen Dissertation Defense77 Summation of Intention in Social Content Through Time
Longitudinal study: we record the change over an extended period of time.
Content: we download a snapshot of the resource every 45 minutes.
Metadata: we collect metadata about the resource: Facebook likes and posts; tweets in the last hour; Bitly clicklogs and shares.
Average data size: ~1 TB per month

78 2015 Hany SalahEldeen Dissertation Defense78 Hourly analysis over an extended period of time

79 2015 Hany SalahEldeen Dissertation Defense79 There is a difference between t_tweet and t_click. After just one hour, 4% of the resources have changed by 30%. After six hours, the percentage doubled: 8% have changed by 40%. After a day the change rate slowed: 12% of the resources have changed by 40%. After that it almost stabilizes, with 17% of the resources changed by 40%.

80 2015 Hany SalahEldeen Dissertation Defense80 Shared Resource, Time, User. Alive, Missing, Replaced, Rate of Change, Archive & Creation. First: Resource, Time, Public Archives

81 2015 Hany SalahEldeen Dissertation Defense81 Revisited: Resources Missing and Archived
Collection | Percentage Missing | Percentage Archived
H1N1 Outbreak | 23.49% | 41.65%
Michael Jackson | 36.24% | 39.45%
Iran | 26.98% | 43.08%
Obama | 24.59% | 47.87%
Egypt | 10.48% | 20.18%
Syria | 7.04% | 5.35%
(unlabeled) | 31.62% | 30.78%
(unlabeled) | 24.47% | 36.26%
(unlabeled) | 25.64% | 43.87%
(unlabeled) | 26.15% | 46.15%

82 2015 Hany SalahEldeen Dissertation Defense82 But on a more general notion we want to know…

83 2015 Hany SalahEldeen Dissertation Defense83 How much of the web is archived?
Goal: Estimate how much of the public web is present in the public archives and how many copies are available.
Action: Get 4 different datasets from 4 different sources: search engine indices, Bit.ly, DMOZ, and Delicious.

84 2015 Hany SalahEldeen Dissertation Defense84 Conclusion: It depends on the source Results: Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages 133-136, New York, NY, USA, 2011. ACM.

85 2015 Hany SalahEldeen Dissertation Defense85 Conclusion: It depends on the source Results: Publication: S. G. Ainsworth, A. Alsum, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. How much of the web is archived? In Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries, JCDL '11, pages 133-136, New York, NY, USA, 2011. ACM. Changes since 2011: no more free SE APIs; greatly reduced IA quarantine period; 15 public web archives 2013 95% 92% 23% 26%

86 2015 Hany SalahEldeen Dissertation Defense86 Side Experiment: Analyzing the quality of the archives and the archived content Goal: Assessing the quality of the web archives Better discussed in Justin Brunelle’s work Publications: 1.J. F. Brunelle, M. Kelly, H. SalahEldeen, M. C. Weigle, and M. L. Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources. In Proceedings of the 2014 IEEE/ACM Joint Conference on Digital Libraries (JCDL), 2014 (Best student paper award)

87 2015 Hany SalahEldeen Dissertation Defense87 A question emerged: When did a certain resource first appear on the web?

88 2015 Hany SalahEldeen Dissertation Defense88 Shared Resource, Time, User. Alive, Missing, Replaced, Rate of Change, Archive & Creation. Second: When was the resource created?

89 2015 Hany SalahEldeen Dissertation Defense89 Idea Web pages leave trails as well since the day they were created…

90 2015 Hany SalahEldeen Dissertation Defense90 Web Resource: Web trails
A web page could leave a trail of any of the following, denoting its existence:
References
Links (anchors)
Social media likes and interactions
URL shortening
Backlinks
The creation date of any of the associated events/trails could be an estimate of the page's creation date.

91 2015 Hany SalahEldeen Dissertation Defense91 Resource’s timeline

92 2015 Hany SalahEldeen Dissertation Defense92 Observations Recorded
1. Last-modified date from the response header.
2. First appearance of a backlink.
3. First tweet published.
4. First Bitly shortened URL created.
5. Timestamp of the first memento in the archives.
6. Date of the last crawl by the search engine.

93 2015 Hany SalahEldeen Dissertation Defense93 Carbon Date service

94 2015 Hany SalahEldeen Dissertation Defense94 Carbon Dating API
{
  "self": "http://cd.cs.odu.edu/cd?url=http://www.cnn.com",
  "URI": "http://www.cnn.com",
  "Estimated Creation Date": "1998-12-06T04:02:33",
  "Last Modified": "",
  "Bitly.com": "2008-06-08T12:00:00",
  "Topsy.com": "2015-01-25T23:31:42",
  "Backlinks": "2003-03-12T05:35:44",
  "Google.com": "2005-01-11T00:00:00",
  "Archives": [
    [ "Earliest", "1998-12-06T04:02:33" ],
    [ "By_Archive", {
      "http://archive.today/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
      "http://arquivo.pt/wayback/wayback/20000815052826/http://www.cnn.com/": "2000-08-15T05:28:26",
      "http://wayback.vefsafn.is/wayback/20011106102722/http://www.cnn.com/": "1998-12-06T04:02:33",
      "http://web.archive.org/web/20131218180509/http://www.cnn.com/": "2013-12-18T18:05:09"
    } ]
  ]
}
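In the response on this slide, the estimated creation date equals the earliest timestamp among the individual sources. A minimal sketch of that selection follows, assuming each source reports either an ISO-style timestamp or an empty string. The function name and the flattened dictionary are illustrative and are not the Carbon Date service's internals.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S"


def estimate_creation_date(observations):
    """Return the earliest timestamp among the sources, or None if none reported one."""
    stamps = [datetime.strptime(value, TIMESTAMP_FORMAT)
              for value in observations.values() if value]
    if not stamps:
        return None
    return min(stamps).strftime(TIMESTAMP_FORMAT)


# Values copied from the slide; "Archives" holds the earliest memento timestamp.
observations = {
    "Last Modified": "",
    "Bitly.com": "2008-06-08T12:00:00",
    "Topsy.com": "2015-01-25T23:31:42",
    "Backlinks": "2003-03-12T05:35:44",
    "Google.com": "2005-01-11T00:00:00",
    "Archives": "1998-12-06T04:02:33",
}
print(estimate_creation_date(observations))  # 1998-12-06T04:02:33
```

Taking the minimum treats every trail as an upper bound on the creation date: the page must already have existed when its first memento, backlink, or tweet appeared.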

95 2015 Hany SalahEldeen Dissertation Defense95 Evaluation Dataset → From each we randomly selected 100 unique URLs to create our gold standard dataset

96 2015 Hany SalahEldeen Dissertation Defense96 Evaluation
Applied our 6 methods to 1200 resources and took the leftmost (earliest) estimate.
Outcome | Number of Resources | Percentage
An estimate found | 910 | 76%
Exact matching estimate | 393 | 33%
No estimate found | 290 | 24%
Total Resources | 1200 | 100%

97 2015 Hany SalahEldeen Dissertation Defense97 Actual Vs. Estimated Dates

98 2015 Hany SalahEldeen Dissertation Defense98 Conclusion: We can estimate the creation date of resources correctly Results: Succeeded in estimating the creation date accurately in 75.90% of the resources. Publications: 1.H. M. SalahEldeen and M. L. Nelson. Carbon dating the web: Estimating the age of web resources. In Proceedings of the 22nd International Conference on World Wide Web Companion, TempWeb03, WWW '13, 2013

99 2015 Hany SalahEldeen Dissertation Defense99 Alexander Nwala did an awesome job releasing the second version of Carbon Date, which is more reliable, multithreaded, and faster, can handle multiple requests, and has caching capabilities. http://cd.cs.odu.edu/

100 2015 Hany SalahEldeen Dissertation Defense100 Alexander Nwala did an awesome job releasing the second version of Carbon Date, which is more reliable, multithreaded, and faster, can handle multiple requests, and has caching capabilities. Yes, it’s better than mine… I admit it

101 2015 Hany SalahEldeen Dissertation Defense101 Shared Resource, Time, User. Alive, Missing, Replaced, Rate of Change, Archive & Creation. User’s Temporal Intention

102 2015 Hany SalahEldeen Dissertation Defense102 Problem: There is an inconsistency between what the tweet’s author intended to share at time t_tweet and what the reader might actually read upon clicking on the link at time t_click.

103 2015 Hany SalahEldeen Dissertation Defense103 Shared Resource, Time, User. Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting. What is intention and how do we detect it?

104 2015 Hany SalahEldeen Dissertation Defense104 Amazon’s Mechanical Turk: a crowdsourcing Internet marketplace that co-ordinates the use of human intelligence to perform tasks that computers are currently unable to do.* * http://en.wikipedia.org/wiki/Amazon_Mechanical_Turk

105 2015 Hany SalahEldeen Dissertation Defense105 Goal: Understand and collect user intention data via MT. Diagram labels: Tweets dataset; Intention Classification Tasks; User Intention Data; Train; Classifier

106 2015 Hany SalahEldeen Dissertation Defense106 Goal: Understand and collect user intention data via MT. Diagram labels: Tweets dataset; Intention Classification Tasks; User Intention Data; Train; Classifier. Problem: It is not as easy as it seems!

107 2015 Hany SalahEldeen Dissertation Defense107 How NOT to classify temporal intention 101. The tweet is presented along with the two snapshots: one at t_tweet and one at t_click

108 2015 Hany SalahEldeen Dissertation Defense108 And compared MT results with experts. Experts: a version was manually assigned to each tweet in a face-to-face meeting with WS-DL members. For 9 MT assignments per tweet: if we allowed 4-5 splits we had a 58% match with WS-DL; if we allowed 3-6 splits or better we got a 31% match → which is worse than flipping a coin!

109 2015 Hany SalahEldeen Dissertation Defense109 Idea: We need to transform the problem from intention to relevance.

110 2015 Hany SalahEldeen Dissertation Defense110 Relevance tasks are simpler. MT workers are more accustomed to classification tasks, and these require a minimal amount of explanation. Transform a hard problem into an easy one. Is that a cat? - Yes - No

111 2015 Hany SalahEldeen Dissertation Defense111 Temporal Intention Relevancy Model (TIRM). Between t_tweet and t_click, the linked resource could have: changed, or not changed. The tweet and the linked resource could be: still relevant, or no longer relevant

112 2015 Hany SalahEldeen Dissertation Defense112 Resource is changed but relevant. The resource changed, but it is still relevant → Intention: need the current version of the resource at any time

113 2015 Hany SalahEldeen Dissertation Defense113 Relevancy and Intention mapping Current

114 2015 Hany SalahEldeen Dissertation Defense114 Resource is changed and not relevant. The resource changed, and it is no longer relevant → Intention: need the past version of the resource at any time

115 2015 Hany SalahEldeen Dissertation Defense115 Relevancy and Intention mapping Past Current

116 2015 Hany SalahEldeen Dissertation Defense116 Resource is not changed and relevant. The resource is not changed, and it is relevant → Intention: need the past version of the resource at any time

117 2015 Hany SalahEldeen Dissertation Defense117 Relevancy and Intention mapping Past Current Past

118 2015 Hany SalahEldeen Dissertation Defense118 Resource is not changed and not relevant. The resource is not changed, but it is not relevant → Intention: I am not sure which version of the resource I need

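119 2015 Hany SalahEldeen Dissertation Defense119 Relevancy and Intention mapping. Changed and relevant → Current. Changed and not relevant → Past. Not changed and relevant → Past. Not changed and not relevant → Not Sure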
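The four cases from the preceding slides reduce to a lookup on two booleans. A minimal sketch, with an illustrative function name and string labels:

```python
def temporal_intention(changed, relevant):
    """Map the TIRM (changed, relevant) pair to the intended resource version."""
    if changed and relevant:
        return "current"    # changed but still relevant: reader needs the live version
    if not changed and not relevant:
        return "not sure"   # unchanged yet irrelevant: intention cannot be determined
    return "past"           # both remaining cases call for the version at tweet time


for changed in (True, False):
    for relevant in (True, False):
        print(changed, relevant, temporal_intention(changed, relevant))
```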

120 2015 Hany SalahEldeen Dissertation Defense120 Validation: Update the MT experiment MT workers ≡ judgments of the experts (WS-DL members) ✓ Is the content still relevant to the tweet?

121 2015 Hany SalahEldeen Dissertation Defense121 Mechanical Turk Workers vs. Experts
For 100 tweets, WS-DL members' agreement: Cohen’s κ = 0.854 → almost perfect agreement
Agreement in 3-2 split or more votes | 93%
Agreement in 4-1 split or more votes | 80%
Agreement with 5-0 votes | 60%
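Cohen's kappa corrects the observed agreement between two raters for the agreement expected by chance. The sketch below is a generic two-rater implementation run on made-up labels; it does not reproduce the 0.854 reported on the slide, and the slide does not specify which pair of label sets was compared.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportion per label.
    expected = sum(freq_a[label] * freq_b[label]
                   for label in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical relevance judgments (1 = relevant, 0 = not relevant).
experts = [1, 1, 0, 0]
workers = [1, 1, 0, 1]
print(cohens_kappa(experts, workers))  # 0.5
```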

122 2015 Hany SalahEldeen Dissertation Defense122 Shared Resource, Time, User. Alive, Missing, Replaced, Rate of Change, Archive & Creation, Detecting, Modeling. Can we model this temporal intention?

123 2015 Hany SalahEldeen Dissertation Defense123 Data Collection
From the SNAP dataset we extracted tweets that:
Are in English.
Each have an embedded URI pointing to an external resource.
Have the embedded URI shortened via Bit.ly.
The external resource: still persists; has at least 10 mementos; is unique.
→ We extracted 5,937 unique instances

124 2015 Hany SalahEldeen Dissertation Defense124 Time delta between the tweet and the closest memento. Randomly selected 1,124 instances. Time delta range: 3.07 minutes to 56.04 hours. Average: 25.79 hours (~1 day). Chart labels: Before tweet time; Tweet time; After tweet time

125 2015 Hany SalahEldeen Dissertation Defense125 Training Dataset
R_current: The state of the resource at current time. R_click: The state of the resource at click time.
Relevant Assignments | 929 | 82.65%
Non-Relevant Assignments | 195 | 17.35%
5 MT workers agreeing (5-0 split) | 589 | 52.40%
4 MT workers agreeing (4-1 split) | 309 | 27.49%
3 MT workers agreeing (3-2 close call split) | 226 | 20.11%

126 2015 Hany SalahEldeen Dissertation Defense126 Training Dataset
R_current: The state of the resource at current time. R_click: The state of the resource at click time.
Relevant Assignments | 929 | 82.65%
Non-Relevant Assignments | 195 | 17.35%
5 MT workers agreeing (5-0 split) | 589 | 52.40%
4 MT workers agreeing (4-1 split) | 309 | 27.49%
3 MT workers agreeing (3-2 close call split) | 226 | 20.11%
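With five workers per tweet and a binary relevance question, each instance gets a majority label and a split (5-0, 4-1, or 3-2). A minimal sketch of that aggregation follows, together with a check of the percentages in the table against the 1,124 instances. The function name and the sample votes are illustrative.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return (majority_label, split) for an odd number of binary votes."""
    counts = Counter(votes)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, f"{majority_count}-{len(votes) - majority_count}"


print(aggregate_votes(["relevant"] * 5))                              # ('relevant', '5-0')
print(aggregate_votes(["relevant"] * 3 + ["not relevant"] * 2))       # ('relevant', '3-2')

# The split counts on the slide sum to the 1,124 training instances.
split_counts = {"5-0": 589, "4-1": 309, "3-2": 226}
total = sum(split_counts.values())
percentages = {split: round(100 * count / total, 2) for split, count in split_counts.items()}
print(total, percentages)
```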

127 2015 Hany SalahEldeen Dissertation Defense127 Intention modeling: Feature extraction. For each tweet we perform: link analysis, social media mining, archival existence, sentiment analysis, content similarity, and entity identification

128 2015 Hany SalahEldeen Dissertation Defense128 Training the classifier From the feature extraction phase we extracted 39 different features to train the classifier. Using 10-fold cross validation, the Cost Sensitive Classifier Based on Random Forests gave the highest success rate = 90.32%

129 2015 Hany SalahEldeen Dissertation Defense129 Most significant features sorted by information gain
Rank | Feature | Gain Ratio
1 | Existence of celebrities in tweets | 0.149
2 | Number of mementos | 0.090
3 | Tweet similarity with current page | 0.071
4 | Similarity: Current & past page | 0.053
5 | Similarity: Tweet & past page | 0.044
6 | Original URI’s depth | 0.032

130 2015 Hany SalahEldeen Dissertation Defense130 Testing the model
We tested against:
The remaining 4,813 of the original 5,937 instances after extracting the 1,124 used in training.
The tweet collections based on historic events (MJ, Obama, Iran, Syria, & H1N1).
Dataset | Status 200 | Status 404 or other | Relevant % | Non-Relevant %
Extended 4,813 instances | 96.77% | 3.23% | 96.74% | 3.26%
MJ’s Death | 57.54% | 42.46% | 93.24% | 6.76%
H1N1 Outbreak | 8.96% | 91.04% | 97.48% | 2.52%
Iran Elections | 68.21% | 31.79% | 94.69% | 5.31%
Obama’s Nobel Prize | 62.86% | 37.14% | 93.89% | 6.11%
Syrian Uprising | 80.80% | 19.20% | 70.26% | 29.75%

131 2015 Hany SalahEldeen Dissertation Defense131 Idea: We need to transform the problem from intention to relevance. Now we need to transform it back! Recap…

132 2015 Hany SalahEldeen Dissertation Defense132 Recap: Relevancy and Intention mapping Past Reading the wrong history

133 2015 Hany SalahEldeen Dissertation Defense133 Mapping TIRM. We used 70% similarity as a threshold of relevancy. Reading the wrong history: in up to 25% of the cases

134 2015 Hany SalahEldeen Dissertation Defense134 Conclusion: We can model users’ temporal intention accurately and efficiently
Results:
We successfully transformed the complicated problem of intention into a simpler one of relevance.
We successfully collected a gold standard dataset of temporal user intention.
We found a temporal inconsistency in the shared resource in up to 25% of the cases according to the dataset.
Publications:
1. H. M. SalahEldeen and M. L. Nelson. Reading the correct history?: Modeling temporal intention in resource sharing. In Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '13, 2013.

135 2015 Hany SalahEldeen Dissertation Defense135 So we modeled intention… can we make it better?

136 2015 Hany SalahEldeen Dissertation Defense136 Most significant features sorted by information gain
Rank | Feature | Gain Ratio
1 | Existence of celebrities in tweets | 0.149
2 | Number of mementos | 0.090
3 | Tweet similarity with current page | 0.071
4 | Similarity: Current & past page | 0.0527
5 | Similarity: Tweet & past page | 0.04401
6 | Original URI’s depth | 0.0324

138 2015 Hany SalahEldeen Dissertation Defense138 Enhancing TIRM Extending and tuning the features: Linguistic feature analysis Semantic similarity analysis using latent topic modeling Dataset balancing Feature selection and minimization

139 2015 Hany SalahEldeen Dissertation Defense139 A whole lot of features! 39 → 65 different features in extended TIRM. Further details: refer to chapter 7

140 2015 Hany SalahEldeen Dissertation Defense140 TIRM enhancement and minimization results

141 2015 Hany SalahEldeen Dissertation Defense141 From binary to probabilistic strength. Point of Confusion: C. Point of Certainty: S → Strongest Current Intention. Further details: refer to chapter 7

142 2015 Hany SalahEldeen Dissertation Defense142 Intention strength formulation Intention strength magnitude of the new resource: Generalization in regards of class:

143 2015 Hany SalahEldeen Dissertation Defense143 Intention strength across instances in dataset

144 2015 Hany SalahEldeen Dissertation Defense144

145 2015 Hany SalahEldeen Dissertation Defense145 [Diagram labels: Shared Resource, Time, User; Alive, Missing, Replaced; Rate of Change, Archive & Creation; Detecting, Modeling, Predicting] Can we find a relation between the modeled intention and time… to predict it?

146 2015 Hany SalahEldeen Dissertation Defense146 Remember: Data Collection. From the SNAP dataset we extracted tweets in English, each with an embedded URI, shortened via Bit.ly, pointing to an external resource. The external resource: still persists; has at least 10 mementos; is unique. → We extracted 5,937 unique instances
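The selection criteria on this slide can be read as a simple filter. The sketch below is an illustration only: the `Candidate` record, its field names, and `select_instances` are hypothetical. Treating "still persists" as an HTTP 200 response and "is unique" as a not-yet-seen target URI are assumptions, not details confirmed by the slides.

```python
from dataclasses import dataclass

MIN_MEMENTOS = 10  # "Has at least 10 mementos"


@dataclass
class Candidate:
    tweet_id: str
    language: str       # language code of the tweet text
    short_uri: str      # shortened URI embedded in the tweet
    target_uri: str     # external resource the short URI resolves to
    status_code: int    # HTTP status of the external resource today
    memento_count: int  # number of archived copies found


def select_instances(candidates):
    """Keep candidates that satisfy every criterion listed on the slide."""
    seen_targets = set()
    selected = []
    for c in candidates:
        if c.language != "en":
            continue  # tweets in English only
        if "bit.ly/" not in c.short_uri:
            continue  # embedded URI must be shortened via Bit.ly
        if c.status_code != 200:
            continue  # assumption: "still persists" means HTTP 200
        if c.memento_count < MIN_MEMENTOS:
            continue
        if c.target_uri in seen_targets:
            continue  # assumption: uniqueness is per external resource
        seen_targets.add(c.target_uri)
        selected.append(c)
    return selected
```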

147 2015 Hany SalahEldeen Dissertation Defense147 Intention strength across time. [Timeline figure: Resource = closest memento; Resource = current version.] We have 10 mementos of the resource, uniformly distributed… We can calculate intention strength at every point
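One way to obtain 10 evenly spread mementos from a longer list is sketched below. The function `pick_uniform` is hypothetical, and spacing by position in the sorted list (rather than by elapsed time) is an assumption. The slides do not state how the uniform distribution was achieved, and the intention strength calculation itself is not reproduced here.

```python
def pick_uniform(memento_timestamps, count=10):
    """Pick `count` mementos evenly spaced by position in the sorted list."""
    if count < 2:
        raise ValueError("count must be at least 2")
    ordered = sorted(memento_timestamps)
    if len(ordered) <= count:
        return ordered  # nothing to thin out
    step = (len(ordered) - 1) / (count - 1)
    # Always includes the earliest and the latest memento.
    return [ordered[round(i * step)] for i in range(count)]
```

Each selected memento would then be one point at which the intention strength is evaluated.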

148 2015 Hany SalahEldeen Dissertation Defense148 Intention strength across time Dataset collection and calculation framework

149 2015 Hany SalahEldeen Dissertation Defense149 Behavior of instances in different classes. [Plot: intention strength over time.] Classes: Steady Current Intention, Steady Past Intention, Changing Intention

150 2015 Hany SalahEldeen Dissertation Defense150 Behavior of instances in different classes

151 2015 Hany SalahEldeen Dissertation Defense151 Given the features we already collected can we classify tweets according to their behavioral class?

152 2015 Hany SalahEldeen Dissertation Defense152 Classifying intention behavior across time

153 2015 Hany SalahEldeen Dissertation Defense153 If we can limit the features to the ones that exist before tweet time can we perform a prediction?

154 2015 Hany SalahEldeen Dissertation Defense154 Classifying intention behavior across time → We can perform a prediction!

155 2015 Hany SalahEldeen Dissertation Defense155 Intention behavior prediction classifier

156 2015 Hany SalahEldeen Dissertation Defense156 Conclusion: We can predict the author’s temporal intention. Results: We can predict for the author, with 77% accuracy, whether the intention conveyed to the readers will stay consistent or change. Publications: 1. H. M. SalahEldeen and M. L. Nelson. Predicting Temporal Intention in Resource Sharing. In Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL '15, 2015.

157 2015 Hany SalahEldeen Dissertation Defense157 At this time, we successfully detected, modeled and predicted User’s Temporal Intention in Shared Content

158 2015 Hany SalahEldeen Dissertation Defense158 [Diagram labels: Shared Resource, Time, User; Alive, Missing, Replaced; Rate of Change, Archive & Creation; Detecting, Modeling, Predicting; User Temporal Intention; Temporal Intention Model]

159 2015 Hany SalahEldeen Dissertation Defense159 So we built an awesome prediction model for Temporal Intention… what next?

160 2015 Hany SalahEldeen Dissertation Defense160 A Framework of Temporal Intention time Posted a tweet Read the tweet Tools for authors Enrich the archives with current content for posterity

161 2015 Hany SalahEldeen Dissertation Defense161 Prediction API

162 2015 Hany SalahEldeen Dissertation Defense162 Tools for Authors

163 2015 Hany SalahEldeen Dissertation Defense163 Temporal Intention Implementation. [Timeline: Posted a tweet; Read the tweet.] Tools for readers: Maintain the temporal consistency of content

164 2015 Hany SalahEldeen Dissertation Defense164 Tools for readers

165 2015 Hany SalahEldeen Dissertation Defense165 Tools for readers 1.Temporal preservation of vulnerable content 2.Version recommendation based on temporal intention estimation Target Publication: Utilizing Temporal Intention Prediction for Just-in-time Preservation and Recommendation of Vulnerable Social Media Content. WSDM 2016

166 2015 Hany SalahEldeen Dissertation Defense166 Motivation, Background, Related Research, Research Question, User-Time-Shared Resource, Conclusions

167 2015 Hany SalahEldeen Dissertation Defense167 Accomplished Goals
Detect the temporal intention of: 1. the author at sharing time; 2. the reader at dereferencing time.
Model this intention as a function of time, the nature of the resource, and its context.
Predict how resources change with time, and the intention behind sharing them, to minimize inconsistency.
Implement the prediction model to automatically preserve vulnerable social content that is prone to change or loss, and to provide smooth temporal navigation of the social web.
Further details: refer to chapters 6, 7, 8, and 9.

168 2015 Hany SalahEldeen Dissertation Defense168 Also, our work reached fame…

169 2015 Hany SalahEldeen Dissertation Defense169 The Virginian Pilot

170 2015 Hany SalahEldeen Dissertation Defense170 BBC.com http://www.bbc.com/future/story/20120927-the-decaying-web

171 2015 Hany SalahEldeen Dissertation Defense171 Popular Mechanics February 2014 issue, page 20

172 2015 Hany SalahEldeen Dissertation Defense172 3 x MIT Technology Review
http://www.technologyreview.com/view/513996/how-to-carbon-date-a-web-page/
http://www.technologyreview.com/view/519391/internet-archaeologists-reconstruct-lost-web-pages/
http://www.technologyreview.com/view/429274/history-as-recorded-on-twitter-is-vanishing-from-the-web-say-computer-scientists/

173 2015 Hany SalahEldeen Dissertation Defense173 Mashable

174 2015 Hany SalahEldeen Dissertation Defense174 Mashable Yes I am Indiana Jones of the internet

175 2015 Hany SalahEldeen Dissertation Defense175 Publications. Columns: Published | Submitted | In preparation | Planned. Entries in slide order: JCDL 2011, TPDL 2015, WWW 2016, IJDL 2016, TPDL 2012, SIGIR 2016, WSDM 2016, JCDL 2013, TPDL 2013, WWW 2013, DL 2014, AAAI 2015, IJDL 2015, JCDL 2015

176 2015 Hany SalahEldeen Dissertation Defense176 Remember Rémi Ochlik? Rémi Ochlik 16 October 1983 – 22 February 2012

177 2015 Hany SalahEldeen Dissertation Defense177 … and the missing content about him? Accessed in July 2014

178 2015 Hany SalahEldeen Dissertation Defense178 We can maintain the consistency of history Our Temporal Intention Relevancy Model

