Tracking Information Epidemics in Blogspace A paper synopsis Alistair Wright, Ken Tan, Kisan Kansagra, Jenn Houston.

Contents
Introduction
Terminology
Spread of URLs
Inferring Infection Routes
Visualisation
Discussion
Conclusion

Introduction
What is a blog?
– First appeared in 1994
– Term "blog" coined by Peter Merholz in early 1999
– 60 million blogs as of November 2006
Information is often republished by other blog users

Introduction
Blogs form a complex social structure
Propagation of information can be visualised as infection
The paper aims to track infection through blogspace and determine the original source
Most closely related work: the spread of foot-and-mouth disease

Terminology
Meme
Infected
Patient zero
Infection inference
Infection tree

Spread of URLs Infection: Data source:

Spread of URLs
Not every blog that mentions a given URL is expected to have seen it at the source
The aim is to determine the infection source for any given blog
Most URLs appearing on blogs are free-floating
– They arrive from external channels, and the same page may have different URLs
Timelines and infection inference cannot guarantee that a given link was the route, but they can rule out some possibilities and find the most plausible one
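
The timeline argument above can be sketched as a simple filter. This is a minimal illustration, assuming the date on which each blog first mentioned the URL is known; the function name, the sample dates and the strict "earlier than" rule (which ignores same-day mentions) are assumptions, not details from the paper.

```python
from datetime import date

def plausible_sources(first_seen, blog):
    """Return the blogs that mentioned the URL strictly before `blog` did.

    A blog that mentioned the URL later cannot have been the source,
    so timing alone rules it out.
    """
    t = first_seen[blog]
    return sorted(b for b, when in first_seen.items() if b != blog and when < t)

# Hypothetical first-mention dates for one URL.
first_seen = {
    "A": date(2003, 5, 3),
    "B": date(2003, 5, 1),
    "C": date(2003, 5, 5),
}

print(plausible_sources(first_seen, "A"))
```

Timing only narrows the candidate set; choosing among the remaining blogs is the job of the inference step.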

Spread of URLs
Blogrolls
– Two-way links to other blogs (e.g. trackbacks)
– One user links to another's blog, which automatically links back to the original
Explicit links that would explain an infection are frequently missing
– Via links are very rare

Inferring Infection Routes
Where explicit links are not present, use a classifier over 5 features to infer likely routes
– Number of blog-blog links in common
– Number of blog-non-blog links in common
– Text similarity
– Order and frequency of repeated infections
– In- and out-link counts for both blogs

Inferring Infection Routes
Classify blogs' likelihood of being linked, based on similarity
– Blog-blog and blog-non-blog links:
– Textual similarity: Term Frequency-Inverse Document Frequency (TF-IDF) weighted vector
Features obtained from full-text and differential-text crawls
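
The two kinds of similarity feature can be illustrated as follows. This is a minimal sketch, assuming posts are already tokenised and that TF-IDF vectors are compared by cosine similarity; the slides do not state the comparison measure, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter

def link_overlap(links_a, links_b):
    """Number of link targets the two blogs have in common."""
    return len(set(links_a) & set(links_b))

def tfidf_vectors(docs):
    """Map each tokenised document to a {term: tf * idf} dictionary."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical tokenised text from three blogs.
docs = [
    ["giant", "microbes", "plush"],
    ["giant", "microbes", "toys"],
    ["election", "news", "poll"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
print(link_overlap(["a.com", "b.com", "c.com"], ["b.com", "c.com", "d.com"]))
```

Blogs that share vocabulary or link targets score above zero, while unrelated blogs score zero, which is the signal the link classifier draws on.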

Inferring Infection Routes Similarity features often useful in predicting the existence of a link

Inferring Infection Routes
Classify explicit links' likelihood of participating in infection
Infection is six times more likely to happen again where it has happened previously
[Table, values not recovered: % Blog Pairs Citing 1 Common URL. Columns: Link type, Same, A > B, A < B, Either. Rows: two link types between A and B (symbols lost), and None.]

Inferring Infection Routes
The likelihood that a link participates in infection is not generally related to the similarity of the blogs

Inferring Infection Routes
First link classifier, a three-class SVM, achieved only 57% accuracy
– Difficult to distinguish reciprocated and unreciprocated links
Second link classifier performed better
– SVM: 91.2% accuracy
– Logistic regression: 91.9% accuracy, but based on fewer factors

Inferring Infection Routes
Additional classifiers were created to identify plausible infection routes from links
– Logistic regression: up to 77% accuracy
– SVM: up to 71.5% accuracy
Accuracy depended on which subset of features was selected
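
For readers unfamiliar with logistic regression, the sketch below shows the general technique on a toy problem. It is not the authors' classifier: the two features (text similarity, number of shared links), the four training rows, the learning rate and the epoch count are all hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Return 1 (link or route predicted) when the estimated probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical blog pairs: [text similarity, shared links]; label 1 means linked.
X = [[0.9, 5], [0.8, 4], [0.1, 0], [0.2, 1]]
y = [1, 1, 0, 0]

w, b = train_logistic(X, y)
print([predict(w, b, x) for x in X])
```

Dropping or adding feature columns in `X` is the kind of feature-subset choice on which the reported accuracy depended.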

Visualisation
From inferred routes, can construct infection trees
Directed Acyclic Graph (DAG) created for each URL
Thinned out to make it more manageable
Label each link with an inference score and dynamically control the display
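
Dynamic control of the display could be as simple as a score threshold. This is a minimal sketch, assuming each edge carries its inference score as a number; the tuple layout, function name and scores are hypothetical.

```python
def visible_edges(edges, min_score):
    """Keep only edges whose inference score reaches the display threshold."""
    return [(src, dst, score) for src, dst, score in edges if score >= min_score]

# Hypothetical scored edges (source blog, infected blog, inference score).
edges = [("B", "A", 0.4), ("C", "A", 0.7), ("B", "C", 0.9)]

print(visible_edges(edges, 0.5))
```

Raising the threshold hides weakly supported links, which thins the graph without recomputing the inference.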

Visualisation
Sparse Tree Algorithm:
For blog A and URL x, collect sets of blogs, B
– indicated by A as explicit sources of URL x
– explicitly linked to A and also infected by a common URL x
– with an unreciprocated link to A that were infected by URL x prior to A
– inferred by the classifier with timing restrictions

Visualisation
For each blog A infected by URL x and for the first non-empty set, draw a link to each blog B in that set
If more than one link exists between A and a previously infected blog, use the classifier score to remove all but the highest scoring link
Note: this doesn't guarantee an upward link for each blog
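
The sparse tree steps can be sketched as below. This is one reading of the slides, assuming the four candidate sets are supplied in priority order and that "remove all but the highest scoring link" leaves each blog with a single parent; the function signature, sample sets and scores are hypothetical.

```python
def sparse_tree(infected, candidate_sets, score):
    """Pick at most one parent for each infected blog.

    infected: blogs infected by URL x.
    candidate_sets: maps a blog to its candidate source sets, in priority order.
    score: maps a (source, target) pair to a classifier score.
    """
    parent = {}
    for a in infected:
        # Use the first non-empty candidate set; later sets are ignored.
        chosen = next((s for s in candidate_sets[a] if s), set())
        if chosen:
            # Keep only the highest-scoring link (ties broken alphabetically).
            parent[a] = max(sorted(chosen), key=lambda b: score[(b, a)])
    return parent

# Hypothetical candidate sets for three blogs infected by one URL.
candidate_sets = {
    "A": [set(), {"B", "C"}, set(), set()],
    "B": [set(), set(), set(), set()],
    "C": [{"B"}, set(), set(), set()],
}
score = {("B", "A"): 0.4, ("C", "A"): 0.7, ("B", "C"): 0.9}

tree = sparse_tree(["A", "B", "C"], candidate_sets, score)
print(tree)
```

Blog B ends up with no parent because all of its candidate sets are empty, which matches the note that an upward link is not guaranteed for every blog.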

Visualisation
A further refinement uses via data to include hidden blogs
Both types of graph are available to any user as a web service

Visualisation
Giant Microbes Infection Tree: [figure not recovered]
CNN News Story Infection Tree: [figure not recovered]

Discussion
Incompleteness of crawl
Small dataset
Unknown robustness of classifiers
Meme residing at multiple URLs

Discussion
Novel application of infection model to blogspace
Useful visualisation tool developed
Further research into influence of graph structure on spread of infection
Could be useful for blog search engines

Conclusion
Difficult objectives achieved to a limited extent
Problems with dataset affect significance of work
Further work required to fully determine usefulness of technique

Summary
Introduction
Terminology
Spread of URLs
Inferring Infection Routes
Visualisation
Discussion

Any questions?