1 Exploring Blog Networks Patterns and a Model for Information Propagation Mary McGlohon In collaboration with Jure Leskovec, Christos Faloutsos Natalie.

Presentation transcript:

1 Exploring Blog Networks Patterns and a Model for Information Propagation Mary McGlohon In collaboration with Jure Leskovec, Christos Faloutsos Natalie Glance, Matthew Hurst Sandia National Labs- July 6, 2007 (As seen at SIAM- Data Mining 2007)

2 Long-term Goals ● How does information on the Web propagate? ● With what pattern do ideas catch on, diffuse, and decrease in popularity? ● Can we build a model for this propagation?

3 Why blogs? ● Blogs are a widely used medium of information for many topics and have become an important mode of communication. ● Blogs cite one another, creating a record of how information and ideas spread through a social network. ● This record is publicly available.

4 Why do we care? ● Understanding how the blog network works is important for: – Social issues: Political mapping, social trends and change, reactions to mass media. – Economic issues: Marketing, predicting commercial success, discovering links between companies. Example: blogs in the 2004 election. [Adamic, Glance 2005]

5 Immediate Goals ● Temporal questions: Does popularity have a half-life? Is there periodicity? ● Topological questions: What topological patterns do posts and blogs follow? What shapes do cascades take on? Stars? Chains? Something else? ● Generative model: Can we build a generative model that mimics properties of cascades?

6 Outline ● Motivation ● Preliminaries – Concepts and terminology – Data ● Temporal Observations ● Topological Observations ● Cascade Generation Model ● Discussion & Conclusions

7 What is a blog? ● A blog is a frequently-updated webpage. ● A blog’s author updates the blog using posts. ● Each post has a permanent hyperlink, and may contain links to other blog posts. slashdot boingboing

8 What is a blog? ● A blog is a frequently-updated webpage. ● A blog’s author updates the blog using posts. ● Each post has a permanent hyperlink, and may contain links to other blog posts. slashdot boingboing The iPhone is here, hooray!

9 What is a blog? ● A blog is a frequently-updated webpage. ● A blog’s author updates the blog using posts. ● Each post has a permanent hyperlink, and may contain links to other blog posts. slashdot boingboing The iPhone is here, hooray! At this link, Slashdot says the iPhone has arrived. But I’m not buying one, because …

10 What is a blog? ● A blog is a frequently-updated webpage. ● A blog’s author updates the blog using posts. ● Each post has a permanent hyperlink, and may contain links to other blog posts. slashdot boingboing The iPhone is here, hooray! At this link, Slashdot says the iPhone has arrived. But I’m not buying one, because … Here Boingboing says they’re not buying an iPhone. They’re just jealous.

11 From blogs to networks. [Figure: the blogosphere network shown as a blog network and a post network over the example blogs slashdot, boingboing, Dlisted, and MichelleMalkin, together with the cascades extracted from it.] Non-trivial vs. trivial cascades; stars vs. chains. Nodes a, b, c, d are cascade initiators; e is a connector.

12 From networks to cascades. Non-trivial vs. trivial cascades. [Figure: blogosphere network over slashdot, boingboing, Dlisted, and MichelleMalkin, and its cascades.]

13 From networks to cascades. Non-trivial vs. trivial cascades. Cascade initiators are the first sources of information. We also have stars and chains. [Figure: blogosphere network over slashdot, boingboing, Dlisted, and MichelleMalkin, and its cascades.]

14 Dataset (Nielsen Buzzmetrics) ● Gathered from August-September 2005* ● Used set of 44,362 blogs, traced cascades ● 2.4 million posts, ~5 million out-links, 245,404 blog-to-blog links [Figure: number of posts over time, in 1-day bins.]

15 Outline ● Motivation ● Preliminaries – Concepts and terminology – Data ● Temporal Observations – Does blog traffic behave periodically? – How does popularity change over time? ● Topological Observations ● Cascade Generation Model ● Discussion & Conclusions ● Future Work

16 Temporal Observations Does blog traffic behave periodically? Posts have “weekend effect”, less traffic on Saturday/Sunday.

17 Temporal Observations Does blog traffic behave periodically? Monday appears to compensate for the weekend drop, but this is not actually the case. We normalize the data: $\text{count}_{\text{norm}} = \text{count} / p_d$, where $p_d$ is the percentage of links on that day. [Figure: same data, normalized; number of in-links (log scale) vs. days after post, for posts made on Monday.]
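As a rough illustration of the normalization on this slide, the sketch below divides each raw link count by the share of all links made on that weekday. The function name and the numbers are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def normalize_by_weekday(link_counts, weekday_share):
    """Divide each raw count by p_d, the fraction of all links made on that weekday."""
    return {day: count / weekday_share[day] for day, count in link_counts.items()}


# Hypothetical numbers for illustration only (not from the dataset).
raw = {"Mon": 120, "Sat": 40}
share = {"Mon": 0.20, "Sat": 0.08}  # p_d for each day

normalized = normalize_by_weekday(raw, share)
```

After normalization the two days are much closer together than the raw counts suggest, which is the point of dividing out the weekday effect.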

18 Temporal Observations How does post popularity change over time? Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006]. Observation 1: The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is $p(t_p + \Delta) \propto \Delta^{-1.5}$. [Figure: number of in-links vs. days after post.]
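One simple way to estimate a power-law exponent such as the one in Observation 1 is a least-squares line fit on log-log axes. The sketch below assumes that approach purely for illustration; the slides do not say which fitting method was used, and the data here is synthetic.

```python
import math


def fit_power_law_exponent(xs, ys):
    """Return the least-squares slope of log(y) against log(x)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    return sxy / sxx


# Synthetic in-link counts that decay exactly as (days after post)^-1.5.
days = list(range(1, 31))
links = [1000.0 * d ** -1.5 for d in days]
slope = fit_power_law_exponent(days, links)  # close to -1.5 for this synthetic input
```

On real, noisy counts a log-log fit is only a first approximation, so the recovered slope should be read as an estimate.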

19 Outline ● Motivation ● Preliminaries ● Temporal Observations – Does blog traffic behave periodically? – How does post popularity change over time? ● Topological Observations – What are graph properties for blog networks? – What shapes do cascades take on? Stars, chains, or something else? ● Cascade Generation Model ● Discussion & Conclusions ● Future Work

20 Topological Observations What graph properties does the blog network exhibit?

21 Topological Observations What graph properties does the blog network exhibit? How connected? ● 44,356 nodes, 122,153 edges ● Half of blogs belong to largest connected component.

22 Topological Observations What power laws does the blog network exhibit? Both in- and out-degree follow power-law distributions: the in-degree exponent is -1.7 and the out-degree exponent is near -3. This suggests a strong rich-get-richer phenomenon. [Figure: count vs. number of blog in-links and count vs. number of blog out-links, both on log scales.]

23 Topological Observations How are blog in- and out-degree related? In-links and out-links are not correlated. (correlation coefficient 0.16) Number of blog in-links (log scale) Number of blog out-links (log scale)

24 Topological Observations What graph properties does the post network exhibit?

25 Topological Observations What graph properties does the post network exhibit? Very sparsely connected: 98% of posts are isolated.

26 Topological Observations Both in-and out-degree follow power laws: In-degree has PL exponent -2.15, out-degree has PL exponent What power laws does the post network exhibit? Post in-degree Count Post out-degree Count

27 Topological Observations How do we measure how information flows through the network? We gather cascades using the following procedure: – Find all initiators (out-degree 0). a b c d e

28 Topological Observations How do we measure how information flows through the network? We gather cascades using the following procedure: – Find all initiators (out-degree 0). – Follow in-links. a b c d e a b c d e

29 Topological Observations How do we measure how information flows through the network? We gather cascades using the following procedure: – Find all initiators (out-degree 0). – Follow in-links. – Produces directed acyclic graph. a b c d e a b c d e d e b c e a
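The three-step procedure above (find initiators, follow in-links, obtain a directed acyclic graph) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name is hypothetical, and the example edges are invented because the figure's edges are not recoverable from the transcript.

```python
from collections import defaultdict, deque


def extract_cascades(links):
    """links: iterable of (citing_post, cited_post) pairs.

    Returns {initiator: set of posts in its cascade}, where an initiator is a
    post with out-degree 0 and the cascade is everything that reaches it by
    following in-links.
    """
    in_links = defaultdict(set)  # cited post -> posts that cite it
    out_degree = defaultdict(int)
    posts = set()
    for citing, cited in links:
        in_links[cited].add(citing)
        out_degree[citing] += 1
        posts.update((citing, cited))

    cascades = {}
    for post in posts:
        if out_degree.get(post, 0) != 0:
            continue  # not an initiator: it links to some earlier post
        members = {post}
        queue = deque([post])
        while queue:
            current = queue.popleft()
            for citer in in_links.get(current, ()):
                if citer not in members:
                    members.add(citer)
                    queue.append(citer)
        cascades[post] = members
    return cascades


# Hypothetical post network: b and c cite a; e cites both b and d.
example_links = [("b", "a"), ("c", "a"), ("e", "b"), ("e", "d")]
cascades = extract_cascades(example_links)
```

In this invented example, a and d are initiators, and e belongs to both cascades.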

30 Topological Observations How do we measure how information flows through the network? Common cascade shapes are extracted using algorithms in [Leskovec2006].

31 Topological Observations How do we measure how information flows through the network? Number of edges increases linearly with cascade size, while effective diameter increases logarithmically, suggesting tree-like structures. [Figure: number of edges vs. cascade size (# nodes), and effective diameter vs. cascade size.]

32 Topological Observations How do we measure how information flows through the network? We work with a bag of cascades– each cascade is a disconnected subgraph. We now explore some graph properties of cascades.

33 Topological Observations As before, in- and out-degree in bag of cascades follow power laws. What graph properties do cascades exhibit? Cascade node in-degree Cascade node out-degree Count

34 Topological Observations Cascade size distributions also follow power law. What graph properties do cascades exhibit?

35 Topological Observations What graph properties do cascades exhibit? Cascade size distributions also follow a power law. Observation 2: The probability of observing a cascade on $n$ nodes follows a Zipf distribution: $p(n) \propto n^{-2}$. [Figure: count vs. cascade size (# of nodes).]

36 Topological Observations What graph properties do cascades exhibit? Stars and chains also follow a power law, with different exponents (star -3.1, chain -8.5).

37 Topological Observations What graph properties do cascades exhibit? Stars and chains also follow a power law, with different exponents (star -3.1, chain -8.5). Size of chain (# nodes) Count Size of star (# nodes) Count

38 Outline ● Motivation ● Preliminaries ● Temporal Observations ● Topological Observations – What are graph properties for blog networks? – What shapes and patterns do cascades take on? ● Cascade Generation Model – Epidemiological Background – Proposed Model – Experimental Validation ● Discussion & Conclusions ● Future Work

39 Epidemiological models ● We consider modeling cascade generation as an epidemic, with ideas as viruses. ● We use the SIS model: – At any time, an entity is in one of two states: susceptible or infected. – One parameter, β, determines how easily conversations spread. – [Hethcote2000]

47 Cascade Generation Model 0. Begin with Blog Net. [Figure: blogs B1, B2, B3, B4 connected by weighted edges.]

48 Cascade Generation Model 0. Begin with Blog Net, but ignore edge weights. Example: B1 links to B2, B2 links to B1, B4 links to B2 and B1 as well as itself, and B3 is isolated, linking only to itself.

49 Cascade Generation Model 1. Randomly pick a blog to infect and add a node to the cascade. [Figure: B1 is picked; the cascade contains B1.]

50 Cascade Generation Model 2. Infect each in-linked neighbor with probability β.

51 Cascade Generation Model 2. Infect each in-linked neighbor with probability β. [Figure: one neighbor is marked INFECT, the other DO NOT INFECT.]

52 Cascade Generation Model 3. Add infected neighbors to the cascade. [Figure: the cascade now contains B1 and B4.]

53 Cascade Generation Model 4. Set “old” infected nodes to uninfected.

54 Cascade Generation Model 4. Set “old” infected nodes to uninfected. Repeat steps 2-4 until no nodes are infected.

55 Cascade Generation Model 4. Set “old” infected nodes to uninfected. Repeat steps 2-4 until no nodes are infected. [Figure: the remaining neighbor is marked DO NOT INFECT.]

56 Cascade Generation Model 4. Set “old” infected nodes to uninfected. Repeat steps 2-4 until no nodes are infected. [Figure: completed cascade containing B1 and B4.]
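Steps 0-4 of the cascade generation model can be sketched as a short simulation. This is an illustrative reading of the slides, not the authors' implementation: the function name, the `beta` argument (the infection probability), the `max_rounds` safety cap, and the choice to record one cascade node per infection event are all assumptions. The example graph follows the links described on slide 48.

```python
import random


def generate_cascade(in_neighbors, beta, rng, max_rounds=1000):
    """Simulate one cascade.

    in_neighbors maps each blog to the blogs that link to it.
    Returns (nodes, edges): nodes[i] is the blog behind cascade node i, and
    each edge (child, parent) says node `child` was infected by node `parent`.
    """
    start = rng.choice(sorted(in_neighbors))  # step 1: pick a random blog
    nodes = [start]
    edges = []
    infected = [0]  # indices of currently infected cascade nodes
    for _ in range(max_rounds):  # assumed cap so the loop always terminates
        if not infected:
            break
        newly_infected = []
        for parent in infected:
            for neighbor in in_neighbors[nodes[parent]]:
                if rng.random() < beta:  # step 2: infect with probability beta
                    nodes.append(neighbor)  # step 3: add to the cascade
                    child = len(nodes) - 1
                    edges.append((child, parent))
                    newly_infected.append(child)
        infected = newly_infected  # step 4: old infected nodes recover
    return nodes, edges


# Blog net from slide 48, edge weights ignored.
blog_net = {"B1": ["B2", "B4"], "B2": ["B1", "B4"], "B3": ["B3"], "B4": ["B4"]}
nodes, edges = generate_cascade(blog_net, beta=0.025, rng=random.Random(0))
```

Because every new node is attached to exactly one infecting node, each generated cascade is a tree in this sketch.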

57 CGM matches observations ● After trying several values, we decide on β = 0.025. ● 10 simulations, 2 million cascades each ● Most frequent cascades: 7 of 10 matched exactly. [Figure: most frequent cascade shapes, model vs. data.]

58 CGM matches observations Cascade size in this model also follows a power law-- the model distribution is shown with the real data points. Cascade size (number of nodes) Count

59 CGM matches observations ● Stars and chains both follow power laws, close to those observed in real data. Count Star size Count Chain size

60 Results in brief ● Analyzed one of largest available collections of blog information. ● Two networks: “Post network” and “blog network”. ● Discovered several properties of the networks. ● Also analyzed properties of “cascades”. ● Presented generative model for cascades.

61 Immediate questions: answered ● Temporal questions: Does popularity have a half-life? Is there periodicity? – Popularity dropoff follows a power-law distribution, exactly as found for response times in other work. We also find that posts follow a weekly periodicity. [Figure: number of in-links vs. days after post.]

62 Immediate questions: answered ● Topology: What topological patterns do posts and blogs follow? What shapes do cascades take on? Stars? Chains? Something else? – We find power-law distributions in almost every topological property. In cascade shapes, stars are more common than chains, and the size of cascades follows a power law. Cascades are tree-like. [Figure: count vs. size of chain (# nodes), and count vs. size of star (# nodes).]

63 Immediate questions: answered ● Can a simple model replicate this behavior? – Yes. We developed a model based on the SIS model in epidemiology. It is a simple model with only one parameter, and it produces behavior remarkably similar to that found in the dataset. Count Star size Count Chain size

64 Future work and applications ● This work suggested that ideas may behave like viruses under an SIS model. ● This may be useful for mapping social/political trends. ● Further investigation into these properties may also allow us early detection of changes in social or economic structure.

65 Related work
● For explanation of the SIS model: – [Hethcote2000] H. W. Hethcote. The mathematics of infectious diseases. SIAM Rev., 42(4):599–653, 2000.
● For algorithms for extracting cascade shapes: – [Leskovec2006] J. Leskovec, A. Singh, and J. Kleinberg. Patterns of influence in a recommendation network. PAKDD, 2006.
● For some modeling of power laws: – [Vazquez2006] A. Vazquez, J. G. Oliveira, Z. Dezso, K. I. Goh, I. Kondor, and A. L. Barabasi. Modeling bursts and heavy tails in human dynamics. Physical Review E, 73:036127, 2006.

66 Additional Info Mary McGlohon

67 Acknowledgments ● Mary McGlohon was partially supported by an NSF Graduate Fellowship. ● Jure Leskovec was partially supported by a Microsoft Fellowship.

68 Questions?

69 ● EXTRA SLIDES BEGIN HERE!

70 Preliminaries- PCA ● We will work with very high-dimensional data (~9,000 dimensions). ● Principal Component Analysis is a method of dimensionality reduction. Depth upwards Conversation mass upwards Hypothetically, for each blog...

73 Preliminaries- PCA We can represent any real $N \times M$ matrix $X$ as $X = U \Sigma V^T$, where $r$ is the rank of $X$, $U$ is $N \times r$, $\Sigma$ is a diagonal $r \times r$ matrix, and $V$ is $M \times r$. [Figure: block diagram of $X = U \Sigma V^T$ with the first component $v_1$ highlighted.]

74 Preliminaries- PCA ● Reduce dimensionality by setting all other components of $\Sigma$ to zero. [Figure: block diagram of the reduced decomposition.]

75 Preliminaries- PCA Reference: Fukunaga, K. (1990). Introduction to Statistical Pattern Recognition, Academic Press. [Figure: block diagram of the approximation $X \approx U \Sigma V^T$.]

76 Preliminaries- Regularizing data ● Not everything in life is normally distributed. Total In-links Total Conversation Mass Downwards Blog properties, linear-linear scale

77 Preliminaries- Regularizing data ● Not everything in life is normally distributed. Total In-links Total Conversation Mass Downwards Blog properties, linear-linear scale 99.4% of points!

78 Preliminaries: Regularizing data ● Not everything in life is normally distributed. Total In-links Total Conversation Mass Downwards Blog properties, linear-linear scale Try to fit a line...

79 Preliminaries: Regularizing data ● Not everything in life is normally distributed. Total In-links Total Conversation Mass Downwards Blog properties, linear-linear scale Try to fit a line... Outliers dramatically affect fit.

80 Preliminaries: Regularizing data ● Not everything in life is normally distributed. ● Therefore, we propose to take log(count+1). Total In-links Total Conversation Mass Downwards Blog properties, log-log scale

81 Preliminaries: Regularizing data ● Not everything in life is normally distributed. ● Therefore, we propose to take log(count+1). Total In-links Total Conversation Mass Downwards Blog properties, log-log scale Outliers’ effects are minimized.

82 ● Suppose we want to cluster blogs based on content. What features do we use per blog?

83 CascadeType Perform PCA on the sparse matrix of ~44,000 blogs by ~9,000 cascade types. Use log(count+1). Project onto 2 PCs. [Figure: matrix with one row per blog (boingboing, slashdot, …) and one column per cascade type.]
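A minimal sketch of this pipeline, assuming NumPy and a small dense matrix in place of the real sparse one: apply log(count+1), take the SVD, and keep the first two components. Centering the columns before the SVD is a common PCA convention and an assumption here; the slides do not say whether it was done. The function name and the toy counts are hypothetical.

```python
import numpy as np


def project_onto_top_components(counts, k=2):
    """Return (scores, singular_values) for a blogs x cascade-types count matrix."""
    x = np.log1p(np.asarray(counts, dtype=float))  # log(count + 1)
    x = x - x.mean(axis=0)  # assumed: center each column before PCA
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :k] * s[:k]  # coordinates of each blog on the top-k PCs
    return scores, s


# Toy counts: 4 blogs x 3 cascade types, invented for illustration.
toy_counts = [[0, 3, 1], [10, 0, 0], [2, 2, 2], [0, 0, 50]]
scores, singular_values = project_onto_top_components(toy_counts, k=2)
```

Each row of `scores` gives one blog's position in the two-dimensional scatter plot that a clustering step could then use.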

84 CascadeType: Results ● Observation: Content of blogs and cascade behavior are often related. Distinct clusters for “conservative” and “humorous” blogs (hand-labeling).

86 ● Suppose we want to cluster blog posts. What features do we use?

87 Preliminaries- Blogs ● There are several terms we use to describe cascades: ● In-link, out-link – Green node has one out-link – Yellow node has one in-link. ● Depth downwards/upwards – Pink node has an upward depth of 1, – downward depth of 2. ● Conversation mass upwards/downwards – Pink node has upward CM 1, – downward CM 3
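The slide defines depth and conversation mass only through the pink-node example, so the sketch below rests on two assumptions: depth in a direction is the length of the longest path in that direction, and conversation mass is the number of distinct posts reachable in that direction. The helper names and the small graph are hypothetical, built so that the node `p` reproduces the numbers quoted for the pink node.

```python
def reachable(start, neighbors):
    """Set of nodes reachable from start by following `neighbors` (start excluded)."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def depth(start, neighbors):
    """Length of the longest path from start; assumes the graph is acyclic."""
    return max((1 + depth(nxt, neighbors) for nxt in neighbors.get(start, ())), default=0)


# Hypothetical cascade around a node "p": one post above it, three below it.
down = {"p": ["x", "y"], "x": ["z"], "u": ["p"]}
up = {"p": ["u"], "x": ["p"], "y": ["p"], "z": ["x"]}

depth_up, depth_down = depth("p", up), depth("p", down)
cm_up, cm_down = len(reachable("p", up)), len(reachable("p", down))
```

Passing the adjacency map for the chosen direction keeps the code neutral about which link direction counts as "up".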

88 PostFeatures Build a matrix with one row per post (~2,400,000 posts, e.g. boingboing-p…, slashdot-p001) and six columns: # in-links, # out-links, CM up, CM down, depth up, depth down. Run PCA…

89 PostFeatures: Results Observation: Posts within a blog tend to retain similar network characteristics.

90 PostFeatures: Results Observation: Posts within a blog tend to retain similar network characteristics. MichelleMalkin Dlisted –PC1 ~ CM upward –PC2 ~ CM downward –We show this scatter plot instead.

91 Ranking blogs by PostFeatures ● Conversation mass up/down gives a better understanding of the blog posts than in-links and out-links. ● Therefore, we may choose to rank blogs based on these attributes.

92 Blogs ranked by CM vs in-links
Top blogs by conversation mass: 1. michellemalkin.com 2. boingboing.net 3. imao.us (75) 4. captainsquartersblog.com/mt 5. instapundit.com 6. radioequalizer.blogspot.com (53) 7. powerlineblog.com 8. waxy.org/links 9. washingtonmonthly.com 10. kottke.org/reminder
Top blogs by in-links: 1. boingboing.net 2. michellemalkin.com 3. instapundit.com 4. waxy.org/links 5. kottke.com/reminder 6. patriotdaily.com (11) 7. captainsquartersblog.com/mt 8. powerlineblog.com 9. washingtonmonthly.com 10. petashon.com (30)

93 Blogs ranked by CM vs in-links
Top blogs by conversation mass: 1. michellemalkin.com 2. boingboing.net 3. imao.us (75) 4. captainsquartersblog.com/mt
Top blogs by in-links: 1. boingboing.net 2. michellemalkin.com 3. instapundit.com 4. waxy.org/links … petashon.com (30)
[Figure: two example cascades, one with in-links: 2, CM: 6 and one with in-links: 5, CM: 5.]
– Perhaps IMAO has longer cascades, just fewer in-links. – While petashun has “stars”.

94 BlogTimeFractal: some time series ● Problem: time series data is nonuniform and difficult to analyze. ● Any patterns? ● Any measures? in-links over time

95 BlogTimeFractal: Definitions ● Any patterns? ● Self-similarity! ● The 80-20 law describes self-similarity. ● For any sequence, we divide it into two equal-length subsequences. 80% of traffic is in one, 20% in the other. – Repeat recursively.

96 Self-similarity ● The bias factor for the 80-20 law is b = 0.8.

97 Self-similarity ● The bias factor for the 80-20 law is b = 0.8. Q: How do we estimate b?

98 Self-similarity ● The bias factor for the 80-20 law is b = 0.8. Q: How do we estimate b? A: Entropy plots!

99 BlogTimeFractal ● An entropy plot plots entropy vs. resolution. ● From time series data, begin with resolution $R = T/2$. ● Record entropy $H_R$.

100 BlogTimeFractal ● An entropy plot plots entropy vs. resolution. ● From time series data, begin with resolution $R = T/2$. ● Record entropy $H_R$. ● Recursively take finer resolutions.

101 BlogTimeFractal ● An entropy plot plots entropy vs. resolution. ● From time series data, begin with resolution $r = T/2$. ● Record entropy $H_r$. ● Recursively take finer resolutions.

102 BlogTimeFractal: Definitions ● Entropy measures the non-uniformity of the histogram at a given resolution. ● We define the entropy of our sequence at a given resolution $R$ as $H_R = -\sum_t p(t) \log_2 p(t)$, where $p(t)$ is the fraction of posts from a blog in interval $t$, $R$ is the resolution, and $2^R$ is the number of intervals.

103 BlogTimeFractal ● For a b-model (and self-similar cases), the entropy plot is linear. The slope $s$ tells us the bias factor. ● Lemma: For traffic generated by a b-model, the bias factor $b$ obeys the equation $s = -b \log_2 b - (1-b) \log_2 (1-b)$.

104 Entropy Plots ● Linear plot → self-similarity. [Figure: entropy vs. resolution.]

105 Entropy Plots ● Linear plot → self-similarity ● Uniform: slope s = 1, bias = .5 ● Point mass: s = 0, bias = 1. [Figure: entropy vs. resolution.]

106 Entropy Plots ● Linear plot → self-similarity ● Uniform: slope s = 1, bias = .5 ● Point mass: s = 0, bias = 1. [Figure: entropy vs. resolution for Michelle Malkin in-links, s = 0.85; by Lemma 1, b = 0.72.]
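The entropy-plot idea and the Lemma can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it computes the entropy of a series aggregated into 2^r equal intervals, builds a deterministic b-model series, and inverts the Lemma's equation for b by bisection on [0.5, 1]. All function names are hypothetical.

```python
import math


def binary_entropy(b):
    """-b log2 b - (1-b) log2 (1-b), with the usual convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in (b, 1.0 - b) if p > 0)


def entropy_at_resolution(series, r):
    """Entropy in bits of the series aggregated into 2**r equal-width intervals."""
    bins = 2 ** r
    width = len(series) // bins
    total = sum(series)
    h = 0.0
    for i in range(bins):
        p = sum(series[i * width:(i + 1) * width]) / total
        if p > 0:
            h -= p * math.log2(p)
    return h


def b_model(levels, b):
    """Deterministic b-model: split each interval's mass into b and 1-b, recursively."""
    series = [1.0]
    for _ in range(levels):
        series = [part for mass in series for part in (mass * b, mass * (1.0 - b))]
    return series


def bias_from_slope(s, tol=1e-9):
    """Solve s = binary_entropy(b) for b in [0.5, 1] by bisection."""
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if binary_entropy(mid) > s:  # entropy falls as b moves from 0.5 to 1
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


traffic = b_model(6, 0.8)  # 64 points following the 80-20 law
slope = entropy_at_resolution(traffic, 3) / 3  # entropy grows linearly with r
estimated_b = bias_from_slope(0.85)  # roughly 0.72, as on the slide
```

For a b-model the entropy at resolution r is r times the binary entropy of b, which is why the plot is a straight line and its slope determines b.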

107 BlogTimeFractal: Results ● Observation: Most time series of interest are self-similar. ● Observation: Bias factor is approximately 0.7, that is, more bursty than uniform (70/30 law). [Figure: entropy plots for MichelleMalkin: in-links, b = .72; conversation mass, b = .76; number of posts, b = .70.]

108 ● Other related work

109 [Ali-Hasan, Adamic 2007] Expressing Social Relationships on the Blog through Links and Comments. Analyzed three blog communities:
● Dallas-Fort Worth – Most links are external to community (91%) – Low centralization – Low reciprocity
● UAE – Fewer links external to community – More centralization – Obvious “hub” structure
● Kuwait – Fewest links external to community (53%) – Highly centralized – Much reciprocity

110 [Duarte et al. 2007] ● Classified blogs into parlor, register, and broadcast. [Figure: fraction of sessions with comments vs. total sessions, for parlor, register, and broadcast blogs.]

111 [Adar et al. 2004] ● Implicit Structure and the Dynamics of Blogspace. Suggested that ideas behaved like epidemics. Presented iRank, based on how “infectious” a blog was. (giant microbes, a site infectious in more ways than one)