Implicit Structure and Dynamics of BlogSpace Eytan Adar, Li Zhang, Lada Adamic, & Rajan Lukose HP Labs, Palo Alto, CA.



Blogs (web logs) contain online stamped entries
– date and time stamps
– list of read blogs
– URL that is being commented on
– via link

Blogs: structure and transmission
Blog use:
– Record real-world and virtual experiences
– Note and discuss things “seen” on the net
Blog structure: blog-to-blog linking
Use + Structure
– Great to track “memes” (catchy ideas)
Patterns of information flow
– How does the popularity of a topic evolve over time?
– Who is getting information from whom?
Ranking algorithms that take advantage of transmission patterns

Related Work
Link prediction in social networks:
– Butts, C., Network Inference, Error, and Information (In)Accuracy: A Bayesian Approach, Social Networks, 25(2):
– Dombroski, M., P. Fischbeck, and K. Carley, An Empirically-Based Model for Network Estimation and Prediction, NAACSOS conference proceedings, Pittsburgh, PA,
– O’Madadhain J., Smyth P., Adamic L., Learning Predictive Models for Link Formation, Sunbelt 2005 (hope you were there!)
– Getoor, L., N. Friedman, D. Koller, and B. Taskar, Learning Probabilistic Models of Link Structure, Journal of Machine Learning Research, vol. 3 (2002), pp
– Adamic L., Adar E., Friends and neighbors on the Web, Social Networks,
– Kleinberg, J., and D. Liben-Nowell, The Link Prediction Problem for Social Networks, in Proceedings of CIKM ’03 (New Orleans, LA, November 2003), ACM Press.
Blog ranking: Technorati, BlogPulse, Daypop…
Blog epidemic tracking:
– Blogdex at MIT Media Lab, Cameron Marlow, Sunbelt 2003
– BlogPulse

Intelliseek’s BlogPulse Service for tracking trends in the blogosphere: popular URLs, phrases, people

BlogPulse Data analyzed
– 37,153 blogs
– Differential daily crawls (to find new posts) for May 2003
– Full page crawl for May 18, 2003 to capture blogrolls
– 175,712 URLs occurring on > 2 blogs

Tracking popularity over time
Blogdex, BlogPulse, etc. track the most popular links/phrases of the day
[Figure: popularity vs. time, annotated with the Slashdot Effect and the BoingBoing Effect]

Election Map Cartograms Michael Gastner, Cosma Shalizi, and Mark Newman University of Michigan

Tracking popularity over time
[Figure: popularity vs. time]

Clustering information popularity profiles
May 2003
– Total # of mentions substantial (40)
– URL mentioned for the first time in May

K-means clustering
– 259 URLs in the sample satisfy the criteria
– Take normalized cumulative profiles
– K-means minimizes the sum of squared distances between each profile and the center of its cluster
– 4 clusters captured most of the differences
[Figure: normalized cumulative profile; axes: all mentions vs. day]
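The clustering step on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names, the toy mention counts, the random initialization, and the fixed iteration count are all assumptions made for the example.

```python
import random

def normalized_cumulative(daily_counts):
    """Turn daily mention counts into a cumulative profile that ends at 1.0."""
    total = sum(daily_counts)
    running, profile = 0, []
    for count in daily_counts:
        running += count
        profile.append(running / total)
    return profile

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random profiles
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each profile to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical daily mention counts: two spiky URLs, two sustained ones.
toy_counts = [[10, 1, 0, 0], [9, 2, 0, 0], [2, 3, 2, 3], [3, 2, 3, 2]]
profiles = [normalized_cumulative(c) for c in toy_counts]
labels, centroids = kmeans(profiles, k=2)
```

Working on cumulative rather than raw daily profiles means a sharp early peak shows up as a curve that saturates almost immediately, while sustained interest climbs steadily, so the two shapes sit far apart in Euclidean distance.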

Different kinds of information have different popularity profiles
[Figure: four popularity profiles, numbered 1 to 4; labels: Slashdot postings; Front-page news; Major-news site (editorial content) – back of the paper; Products, etc.]

Popularity profiles

Cluster  Profile                                      # URLs  Examples
1        Sharp peak on day 1 followed by fast decay   38      Slashdot postings
2        Day 1 peak followed by decay                 46      Front page news
3        Day 2 peak followed by gradual decay         51      Editorial content, Sun Java release
4        Sustained interest                           124     iPod, iTunes, Quizilla

[Figure: popularity profiles for cluster 1, cluster 2, cluster 3, cluster 4]

Micro example: Giant Microbes

Microscale Dynamics
What do we need to track specific info ‘epidemics’?
– Timings
– Underlying network
[Figure: blogs b1, b2, b3 placed along a “time of infection” axis from t0 to t1]

Microscale Dynamics
Challenges
– Root may be unknown
– Multiple possible paths
– Uncrawled space, alternate media ( , voice)
– No links
[Figure: blogs b1, b2, b3, bn placed along a “time of infection” axis from t0 to t1, with unknown (?) paths]

Microscale Dynamics: who is getting info from whom
– Via links (< 2 % of links, 50% within sample): unambiguous
– Multiple explicit links: which link is more likely
– No explicit links (70%): which implicit path is more likely

Link Inference
Use machine learning algorithms:
A) Support Vector Machine (SVM)
B) Logistic Regression
What we can use
– Full text
– Blogs in common
– Links in common
– History of infection
[Figure: example blogs BoingBoing and WIRED]
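As a rough illustration of the “blogs in common” and “links in common” inputs feeding a logistic regression, here is a small self-contained sketch. The slide does not say how similarity is measured, so the Jaccard overlap, the field names `blogroll` and `urls`, the toy training pairs, and the hand-rolled gradient-descent trainer are all assumptions for this example only.

```python
import math

def jaccard(a, b):
    """Overlap of two collections: |A ∩ B| / |A ∪ B| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pair_features(blog_a, blog_b):
    """Two similarity features for a blog pair (hypothetical field names)."""
    return [
        jaccard(blog_a["blogroll"], blog_b["blogroll"]),  # blogs in common
        jaccard(blog_a["urls"], blog_b["urls"]),          # links in common
    ]

def predict_proba(weights, bias, x):
    """Logistic model: probability that the pair is linked."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Stochastic gradient descent on the logistic log-loss."""
    weights, bias = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(weights, bias, xi) - yi
            weights = [w - lr * err * v for w, v in zip(weights, xi)]
            bias -= lr * err
    return weights, bias

# Hypothetical training pairs: linked blogs (1) share more than unlinked ones (0).
X = [[0.6, 0.5], [0.4, 0.7], [0.0, 0.1], [0.1, 0.0]]
y = [1, 1, 0, 0]
weights, bias = train_logistic(X, y)
```

A full-text feature or the history-of-infection feature would be added to `pair_features` in the same way, as further numeric columns.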

Percentage of blog pairs sharing at least one link

link type              same day   A after B   A before B
A, B reciprocated      17.4%      24.5%
A, B unreciprocated    10.9%      22.9%       17.0%
A, B unlinked          0.6%       1.5%        1.3%

Similarity in links between reciprocated, unreciprocated, and non-linked blog pairs

Training on positive and negative examples of ‘infection’
Positive example (+): Blog A, Blog B, with T_infection(Blog B) > T_infection(Blog A)
Negative example (−): Blog A, Blog B
[Figure legend: infected, uninfected]

Prediction results
Link inference:
– SVM: 91% accuracy
– Regression: 92% accuracy (blog-blog links most predictive)
Infection inference:
– SVM: 71.5% accuracy, using blog and non-blog link similarity + timing features: (A before B)/n_A, (B before A)/n_A, (A same day B)/n_A, …
– Regression: 75% accuracy using only timing features
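The timing features named above can be computed from per-blog mention days. The sketch below is only an illustration: it assumes n_A is the number of URLs blog A mentions (the slide does not define it), and the function name and input layout are hypothetical.

```python
def timing_features(a_days, b_days):
    """Timing features for a blog pair.

    a_days, b_days: dicts mapping URL -> day on which the blog first mentioned it.
    Assumption: n_A is the number of URLs mentioned by blog A.
    """
    n_a = len(a_days)
    if n_a == 0:
        return {"a_before_b": 0.0, "b_before_a": 0.0, "same_day": 0.0}
    shared = a_days.keys() & b_days.keys()
    a_first = sum(1 for url in shared if a_days[url] < b_days[url])
    b_first = sum(1 for url in shared if a_days[url] > b_days[url])
    same = len(shared) - a_first - b_first
    return {
        "a_before_b": a_first / n_a,  # (A before B)/n_A
        "b_before_a": b_first / n_a,  # (B before A)/n_A
        "same_day": same / n_a,       # (A same day B)/n_A
    }

# Hypothetical mention days for two blogs.
blog_a = {"u1": 1, "u2": 3, "u3": 5, "u4": 2}
blog_b = {"u1": 2, "u2": 3, "u3": 4}
features = timing_features(blog_a, blog_b)
```

With one-day crawl resolution, many shared URLs fall into the same-day bucket, which is one reason the ordering features carry limited signal.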

Sources of error
– Coarseness and sparseness of timing data (1 day resolution)
– Mirror URLs (actually helps)
– Incomplete crawls
[Figure: blogs A, B, C along a time axis; legend: inferred, actual, uncrawled blog or media source]

Visualization by Eytan Adar
GUESS tool (build your own, see 5:30!)
– Using GraphViz (by AT&T) layouts
Simple algorithm
– If single, explicit link exists, draw it (add node if needed)
– Otherwise use ML algorithm
  – Pick the most likely explicit link
  – Pick the most likely possible link
Tool lets you zoom around space, control threshold, link types, etc.
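The “simple algorithm” above amounts to a short decision rule per infected blog. The following sketch is one possible reading, not the tool's actual code: `choose_source`, the `score` callback (standing in for the ML classifier's confidence), and applying the threshold only to inferred links are all assumptions.

```python
def choose_source(blog, explicit_sources, possible_sources, score, threshold=0.5):
    """Pick the edge to draw into `blog`; returns (source, kind) or None.

    explicit_sources: blogs that `blog` links to explicitly for this URL.
    possible_sources: earlier-infected blogs with no explicit link.
    score(source, blog): hypothetical classifier confidence in that path.
    """
    if len(explicit_sources) == 1:
        # A single explicit link is unambiguous: draw it.
        return explicit_sources[0], "explicit"
    if explicit_sources:
        # Several explicit links: keep the most likely one.
        return max(explicit_sources, key=lambda s: score(s, blog)), "explicit"
    if possible_sources:
        # No explicit link: fall back to the most likely inferred path.
        best = max(possible_sources, key=lambda s: score(s, blog))
        if score(best, blog) >= threshold:  # assumption: threshold gates inferred links
            return best, "inferred"
    return None

# Hypothetical confidences for paths into blog "c".
confidence = {("a", "c"): 0.8, ("b", "c"): 0.3}
pick = choose_source("c", [], ["a", "b"], lambda s, t: confidence[(s, t)])
```

Returning the edge kind alongside the source lets a viewer style explicit and inferred links differently.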

Giant Microbes epidemic visualization
[Figure legend: blog, via link, explicit link, inferred link]

iRank
Find early sources of good information using inferred information paths or timing
[Figure: blogs b1, b2, b3, b4, b5, …, bn; labels: True source, Popular site]

iRank Algorithm
– Draw a weighted edge for all pairs of blogs that cite the same URL
  – higher weight for mentions closer together
– Run PageRank
– Control for ‘spam’
[Figure: “time of infection” axis from t0 to t1]
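The steps above can be sketched as a weighted graph plus a standard PageRank power iteration. Several details here are assumptions, since the slide does not give them: edges point from the later blog to the earlier one, the weight is 1/|gap in days|, same-day pairs are skipped because their direction is ambiguous, the damping factor is 0.85, and the ‘spam’ control step is left out.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(mentions):
    """mentions: {url: {blog: day}} -> {later_blog: {earlier_blog: weight}}."""
    edges = defaultdict(lambda: defaultdict(float))
    for by_blog in mentions.values():
        for a, b in combinations(by_blog, 2):
            gap = by_blog[a] - by_blog[b]
            if gap == 0:
                continue  # assumption: same-day pairs carry no direction
            later, earlier = (a, b) if gap > 0 else (b, a)
            edges[later][earlier] += 1.0 / abs(gap)  # closer mentions weigh more
    return edges

def pagerank(nodes, edges, damping=0.85, iters=100):
    """Weighted PageRank by power iteration; dangling nodes spread rank evenly."""
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            out = edges.get(v, {})
            total = sum(out.values())
            if total == 0:
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u, w in out.items():
                    new[u] += damping * rank[v] * w / total
        rank = new
    return rank

# Hypothetical data: "src" posts first, "pop" and "late" follow.
mentions = {
    "u1": {"src": 1, "pop": 2, "late": 3},
    "u2": {"src": 1, "late": 2},
}
nodes = sorted({blog for by_blog in mentions.values() for blog in by_blog})
ranks = pagerank(nodes, build_graph(mentions))
```

Because rank flows from later mentions to earlier ones, blogs that consistently post a URL before others accumulate score even if few blogs link to them explicitly.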

Do Bloggers Kill Kittens?
02:00 AM Friday Mar. 05, 2004 PST: Wired publishes "Warning: Blogs Can Be Infectious."
7:25 AM Friday Mar. 05, 2004 PST: Slashdot posts "Bloggers' Plagiarism Scientifically Proven"
9:55 AM Friday Mar. 05, 2004 PST: Metafilter announces "A good amount of bloggers are outright thieves."

For more info Information Dynamics HP Blog Epidemic Analyzer Eytan, Li, Lada & Rajan

CNN: Wal-Mart banishes bawdy mags