Tracking Information Epidemics in Blogspace A paper synopsis Alistair Wright, Ken Tan, Kisan Kansagra, Jenn Houston.


1 Tracking Information Epidemics in Blogspace A paper synopsis Alistair Wright, Ken Tan, Kisan Kansagra, Jenn Houston

2 Contents Introduction Terminology Spread of URLs Inferring Infection Routes Visualisation Discussion Conclusion

3 Introduction What is a blog? –First appeared in 1994 –Term "blog" coined by Peter Merholz in early 1999 –60 million blogs as of November 2006 Information is often republished by other blog users

4 Introduction Blogs form a complex social structure Propagation of information can be visualised as infection Paper aims to track infection through blogspace and determine the original source Most closely related work is on the spread of foot-and-mouth disease

5 Terminology Meme Infected Patient zero Infection inference Infection tree

6 Spread of URLs Infection: www.giantmicrobes.com Data source: www.blogpulse.com

7 Spread of URLs Do not expect all blogs which mention a given URL to have seen it at the source Aim is to determine the infection source for any given blog Most URLs appearing on blogs are free-floating –From external channels, different URLs for same page Timelines and infection inference cannot guarantee a link, but they can rule out some possibilities and find the most plausible
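The "rule out some possibilities" step can be illustrated with a minimal sketch. It assumes each blog has a known first-mention date for the URL and that a source must have mentioned the URL strictly earlier; the blog names, dates, and the `plausible_sources` helper are hypothetical and not taken from the paper.

```python
from datetime import date

def plausible_sources(blog, first_mention, links_to):
    """Return blogs that `blog` links to and that mentioned the URL strictly earlier.

    first_mention: dict mapping blog name -> date it first cited the URL
    links_to:      dict mapping blog name -> set of blogs it links to
    """
    infected_on = first_mention[blog]
    return sorted(
        other
        for other in links_to.get(blog, set())
        if other in first_mention and first_mention[other] < infected_on
    )

# Hypothetical timeline: B cited the URL before A, C cited it after A.
first_mention = {
    "A": date(2003, 5, 3),
    "B": date(2003, 5, 1),
    "C": date(2003, 5, 5),
}
links_to = {"A": {"B", "C"}}

# C is ruled out as a source for A because it was infected later.
print(plausible_sources("A", first_mention, links_to))
```

Timing alone only removes impossible routes; choosing among the remaining candidates is what the classifiers on the following slides are for.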

8 Spread of URLs Blogrolls –Two-way links to other blogs (e.g. trackbacks) –One user links to another's blog and that automatically links back to the original Frequently find no explicit links to explain infection –Via links very rare

9 Inferring Infection Routes Where explicit links are not present, use 5 classifiers to infer likely routes –Number of blog-blog links in common –Number of blog-non-blog links in common –Text similarity –Order and frequency of repeated infections –In- and out-link counts for both blogs

10 Inferring Infection Routes Classify blogs' likelihood of being linked based on similarity –Blog-blog and blog-non-blog links: –Textual similarity: Term Frequency-Inverse Document Frequency weighted vector Features obtained from full text and differential text crawls
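A minimal sketch of the two kinds of similarity feature named on this slide. The slide's own formula for link similarity did not survive in the transcript, so cosine similarity is assumed here for both the link sets and the TF-IDF vectors; the function names and the toy posts are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def tfidf_vectors(docs):
    """docs: dict name -> list of tokens. Returns dict name -> TF-IDF vector."""
    n = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    return {
        name: {t: count * math.log(n / doc_freq[t])
               for t, count in Counter(tokens).items()}
        for name, tokens in docs.items()
    }

def link_similarity(links_a, links_b):
    """Cosine similarity of binary link vectors: |A & B| / sqrt(|A| * |B|)."""
    if not links_a or not links_b:
        return 0.0
    return len(links_a & links_b) / math.sqrt(len(links_a) * len(links_b))

# Toy posts (made up for illustration).
docs = {
    "A": "giant microbes plush toys".split(),
    "B": "plush microbes gift idea".split(),
    "C": "election news today".split(),
}
vecs = tfidf_vectors(docs)
print(cosine(vecs["A"], vecs["B"]) > cosine(vecs["A"], vecs["C"]))
print(link_similarity({"x.com", "y.com"}, {"y.com", "z.com"}))
```

The same `link_similarity` function can be applied separately to blog-blog links and blog-non-blog links to give two distinct features.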

11 Inferring Infection Routes Similarity features often useful in predicting the existence of a link

12 Inferring Infection Routes Classify explicit links' likelihood of participating in infection Infection six times more likely to happen again where it has happened previously
% Blog Pairs Citing 1 Common URL
Link type: Same, A > B, A < B, Either
A B: 17.4, 24.5, 45
A B: 10.9, 22.9, 17.0, 36
None: 0.6, 1.5, 1.3, 3

13 Inferring Infection Routes Likelihood of a link participating in infection is not generally related to the similarity of the blogs

14 Inferring Infection Routes First link classifier used with a three-class SVM performed with only 57% accuracy –Difficult to distinguish reciprocated and unreciprocated links Second link classifier performed better –SVM: 91.2% accuracy –Logistic regression: 91.9% accuracy but based on fewer factors
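To make the link classifier concrete, here is a minimal logistic regression sketch in plain Python. It is not the authors' model: the two features (link similarity and text similarity), the training pairs, and the hyperparameters are all made up for illustration, and the accuracies quoted on the slide cannot be reproduced from it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability that a blog pair is linked, given its feature vector."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical training pairs: (link similarity, text similarity).
rows = [(0.8, 0.7), (0.6, 0.9), (0.7, 0.5),   # pairs with a link
        (0.1, 0.2), (0.0, 0.1), (0.2, 0.0)]   # pairs without a link
labels = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(rows, labels)
print(predict_proba(weights, bias, (0.75, 0.8)) > 0.5)   # similar pair
print(predict_proba(weights, bias, (0.05, 0.1)) > 0.5)   # dissimilar pair
```

The predicted probability can double as the inference score attached to each candidate link when the infection tree is built.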

15 Inferring Infection Routes Additional classifiers were created for plausible infection routes from links –Logistic regression: up to 77% accuracy –SVM: up to 71.5% accuracy Accuracy depended on which subset of classifiers was selected

16 Visualisation From inferred routes, can construct infection trees Directed Acyclic Graph (DAG) created for each URL Thinned out to make it more manageable Label each link with an inference score and dynamically control the display

17 Visualisation Sparse Tree Algorithm: For blog A and URL x, collect sets of blogs, B –indicated by A as explicit sources of URL x –explicitly linked to A and also infected by a common URL x –with an unreciprocated link to A that were infected by URL x prior to A –inferred by the classifier with timing restrictions

18 Visualisation For each blog A infected by URL x and for the first non-empty set, draw a link to each blog B in that set If more than one link exists between A and a previously infected blog, use the classifier score to remove all but the highest scoring link Note: doesn't guarantee an upward link for each blog
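The sparse tree steps on the last two slides can be sketched as follows. This is one reading of the slides, not the authors' code: it assumes the four candidate sets are supplied in priority order for each blog and that ties between candidates are broken by name; `sparse_tree`, the blogs, and the scores are hypothetical.

```python
def sparse_tree(infected, candidates, score):
    """Pick at most one parent per infected blog.

    infected:   list of blogs infected by the URL
    candidates: dict blog -> list of candidate-source sets, in priority order
    score:      dict (blog, source) -> classifier score for that link
    Returns a dict blog -> chosen source, or None when every set is empty.
    """
    tree = {}
    for blog in infected:
        # Use only the first non-empty set of candidate sources.
        first = next((s for s in candidates.get(blog, ()) if s), None)
        if first is None:
            tree[blog] = None  # no upward link can be drawn for this blog
            continue
        # Keep only the highest-scoring link from that set.
        tree[blog] = max(sorted(first), key=lambda src: score.get((blog, src), 0.0))
    return tree

# Hypothetical example: A's second set is its first non-empty one.
infected = ["A", "B", "C"]
candidates = {
    "A": [set(), {"B", "C"}, set(), set()],
    "B": [set(), set(), set(), {"C"}],
    "C": [set(), set(), set(), set()],
}
score = {("A", "B"): 0.4, ("A", "C"): 0.9, ("B", "C"): 0.6}

print(sparse_tree(infected, candidates, score))
```

Blog C ends up without a parent, which matches the slide's note that an upward link is not guaranteed for every blog.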

19 Visualisation Further refinement uses via data to incorporate hidden blogs Both types of graph are available as a web service for any user

20 Visualisation Giant Microbes Infection Tree: CNN News Story Infection Tree:

21 Discussion Incompleteness of crawl Small dataset Unknown robustness of classifiers Meme residing at multiple URLs

22 Discussion Novel application of infection model to blogspace Useful visualisation tool developed Further research into influence of graph structure on spread of infection Could be useful for blog search engines

23 Conclusion Difficult objectives achieved to a limited extent Problems with dataset affect significance of work Further work required to fully determine usefulness of technique

24 Summary Introduction Terminology Spread of URLs Inferring Infection Routes Visualisation Discussion

25 Any questions?

