Tracking Information Epidemics in Blogspace A paper synopsis Alistair Wright, Ken Tan, Kisan Kansagra, Jenn Houston.

Contents Introduction Terminology Spread of URLs Inferring Infection Routes Visualisation Discussion Conclusion

Introduction What is a blog? –First appeared in 1994 –Term "blog" coined by Peter Merholz in early 1999 –60 million blogs as of November 2006 Information is often republished by other blog users

Introduction Blogs form a complex social structure Propagation of information can be visualised as infection Paper aims to track infection through blogspace and determine the original source Most closely related work is on the spread of foot-and-mouth disease

Terminology Meme Infected Patient zero Infection inference Infection tree

Spread of URLs Infection: www.giantmicrobes.com Data source: www.blogpulse.com

Spread of URLs Do not expect all blogs which mention a given URL to have seen it at the source Aim is to determine the infection source for any given blog Most URLs appearing on blogs are free-floating –Arrive from external channels; different URLs may point to the same page Cannot guarantee links using timelines and infection inference, but can rule out some possibilities and find the most plausible
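
The idea of ruling out possibilities from timelines can be sketched as follows. This is a minimal illustration, not the paper's method: the blog names and dates are hypothetical, and excluding same-day mentions is a simplifying assumption.

```python
from datetime import date


def plausible_sources(first_mention, blog):
    """Return the blogs that mentioned the URL strictly before `blog` did.

    first_mention maps blog name -> date of its first mention of the URL.
    Blogs infected later (or on the same day, by assumption) are ruled out.
    """
    infected_on = first_mention[blog]
    return sorted(
        other
        for other, when in first_mention.items()
        if other != blog and when < infected_on
    )


# Hypothetical first-mention dates for a single URL (illustrative only).
first_mention = {
    "blog_a": date(2003, 5, 3),
    "blog_b": date(2003, 5, 1),
    "blog_c": date(2003, 5, 5),
}

print(plausible_sources(first_mention, "blog_a"))  # ['blog_b']
```

The timeline only narrows the candidate set; choosing the most plausible source among the survivors still needs the inference step.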

Spread of URLs Blogrolls –Two-way links to other blogs (e.g. trackbacks) –One user links to another's blog, which automatically links back to the original Frequently find no explicit links to explain infection –"Via" links are very rare

Inferring Infection Routes Where explicit links are not present, use 5 classifiers to infer likely routes –Number of blog-blog links in common –Number of blog-non-blog links in common –Text similarity –Order and frequency of repeated infections –In- and out-link counts for both blogs

Inferring Infection Routes Classify blogs' likelihood of being linked based on similarity –Blog-blog and blog-non-blog links: –Textual similarity: Term Frequency-Inverse Document Frequency weighted vector Features obtained from full text and differential text crawls
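
The similarity features named on this slide might be computed along these lines. The slide's own formulas did not survive, so the shared-link count, the `tf * log(N / df)` weighting, and the cosine comparison below are assumptions, and the documents are toy data.

```python
import math
from collections import Counter


def shared_link_count(links_a, links_b):
    """Number of link targets that two blogs have in common."""
    return len(set(links_a) & set(links_b))


def tfidf_vectors(docs):
    """Build TF-IDF weighted vectors; docs maps name -> list of tokens."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy blog texts (hypothetical).
docs = {
    "a": "giant microbes plush toys".split(),
    "b": "giant microbes are cute".split(),
    "c": "election news today".split(),
}
vecs = tfidf_vectors(docs)
print(cosine(vecs["a"], vecs["b"]) > cosine(vecs["a"], vecs["c"]))  # True
```

Blogs "a" and "b" share weighted terms, so they score higher than "a" and "c", which share none.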

Inferring Infection Routes Similarity features often useful in predicting the existence of a link

Inferring Infection Routes Classify explicit links' likelihood of participating in infection Infection is six times more likely to happen again where it has happened previously
% Blog Pairs Citing 1 Common URL, by link type (columns: Same, A > B, A < B, Either)
A B: 17.4, 24.5, 45
A B: 10.9, 22.9, 17.0, 36
None: 0.6, 1.5, 1.3, 3

Inferring Infection Routes Likelihood of links participating in infection is not generally related to the similarity of the blogs

Inferring Infection Routes First link classifier used with a three-class SVM performed with only 57% accuracy –Difficult to distinguish reciprocated and unreciprocated links Second link classifier performed better –SVM: 91.2% accuracy –Logistic regression: 91.9% accuracy but based on fewer factors
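
A logistic-regression link classifier of the kind mentioned above could look roughly like this. It is a sketch under stated assumptions: the two features (text similarity and shared-link fraction), the training pairs, and the hyperparameters are all hypothetical, and plain stochastic gradient descent stands in for whatever fitting procedure the authors used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi)))
            err = p - yi  # gradient of the log loss w.r.t. the linear score
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that a link exists between the blog pair."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical blog pairs: [text similarity, shared-link fraction].
X = [[0.9, 0.8], [0.8, 0.6], [0.7, 0.9], [0.1, 0.2], [0.2, 0.1], [0.3, 0.0]]
y = [1, 1, 1, 0, 0, 0]  # 1 = pair is linked, 0 = no link

w, b = train_logistic(X, y)
print(predict_proba(w, b, [0.85, 0.7]) > 0.5)  # True
```

The reported accuracies came from the real crawl; this toy set is separable by construction and says nothing about them.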

Inferring Infection Routes Additional classifiers were created for plausible infection routes from links –Logistic regression: up to 77% accuracy –SVM: up to 71.5% accuracy Accuracy depended on which subset of classifiers was selected

Visualisation From inferred routes, can construct infection trees Directed Acyclic Graph (DAG) created for each URL Thinned out to make it more manageable Label each link with an inference score and dynamically control the display

Visualisation Sparse Tree Algorithm: For blog A and URL x, collect sets of blogs, B –indicated by A as explicit sources of URL x –explicitly linked to A and also infected by a common URL x –with an unreciprocated link to A that were infected by URL x prior to A –inferred by the classifier with timing restrictions

Visualisation For each blog A infected by URL x and for the first non-empty set, draw a link to each blog B in that set If more than one link exists between A and a previously infected blog, use the classifier score to remove all but the highest scoring link Note: does not guarantee an upward link for each blog
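
The sparse tree steps above can be sketched as below. This is one reading of the slides, not the authors' implementation: the four candidate sets are assumed to be precomputed in priority order, `score` is a stand-in for the classifier score, and keeping a single highest-scoring parent per blog is an interpretation of the pruning rule.

```python
def sparse_tree(infected, candidate_sets, score):
    """Pick at most one parent per infected blog.

    infected: iterable of blogs infected by the URL.
    candidate_sets: blog -> list of candidate-source sets in priority order.
    score: function (source, target) -> float (hypothetical classifier score).
    Blogs whose sets are all empty get no parent, so no upward link.
    """
    parents = {}
    for blog in infected:
        for candidates in candidate_sets.get(blog, []):
            if candidates:
                # First non-empty set wins; keep only the best-scoring link.
                parents[blog] = max(
                    sorted(candidates), key=lambda src: score(src, blog)
                )
                break
    return parents


# Hypothetical example: three blogs infected by one URL.
candidate_sets = {
    "a": [set(), set(), set(), set()],
    "b": [set(), {"a"}, set(), set()],
    "c": [set(), set(), set(), {"a", "b"}],
}
scores = {("a", "c"): 0.4, ("b", "c"): 0.7}


def score(src, dst):
    return scores.get((src, dst), 0.0)


print(sparse_tree(["a", "b", "c"], candidate_sets, score))
# {'b': 'a', 'c': 'b'}
```

Blog "a" ends up without a parent, which matches the note that an upward link is not guaranteed for every blog.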

Visualisation Further refinement uses "via" data to incorporate hidden blogs Both types of graph are available to any user as a web service

Visualisation [Figures: Giant Microbes infection tree; CNN news story infection tree]

Discussion Incompleteness of crawl Small dataset Unknown robustness of classifiers Meme residing at multiple URLs

Discussion Novel application of infection model to blogspace Useful visualisation tool developed Further research into influence of graph structure on spread of infection Could be useful for blog search engines

Conclusion Difficult objectives achieved to a limited extent Problems with dataset affect significance of work Further work required to fully determine usefulness of technique

Summary Introduction Terminology Spread of URLs Inferring Infection Routes Visualisation Discussion

Any questions?
