Power laws, Pareto distribution and Zipf's law M. E. J. Newman Presented by: Abdulkareem Alali.



Intro: Measurements distribution
One common observation about measured quantities is that they are scaled, or centered, around a typical value. As examples:
– The heights of human beings: most adult humans are about 180 cm tall. The tallest and shortest adult men on record had heights of 272 cm and 57 cm respectively, a ratio of only 4.8.
– Another quantity with a typical scale: the speeds, in miles per hour, of cars on a motorway. Speeds are strongly peaked around 75 mph.

Intro: Measurements distribution

Another observation: not all quantities we measure are peaked around a typical value. Some vary over an enormous dynamic range, sometimes many orders of magnitude. As an example: the largest population of any city in the US is 8.00 million, for New York City (2000 census). America's smallest town is Duffield, Virginia, with a population of 52. The ratio of largest to smallest population is therefore at least 150,000.

Intro: Measurements distribution

With America's total population of about 300 million people, you could have at most about 40 cities the size of New York. And the remaining roughly 2700 cities cannot have a mean population of more than about 110,000. A histogram of city sizes plotted with logarithmic horizontal and vertical axes follows a straight line quite closely.

Intro: Measurements distribution

Intro: Measurements distribution
Such a histogram can be represented as ln(y) = A ln(x) + c. Let p(x) dx be the fraction of cities with population between x and x + dx. If the histogram is a straight line on log-log scales, then ln p(x) = -α ln(x) + c, which gives p(x) = C x^(-α), with C = e^c.

Intro: power-law distribution
A distribution of the form p(x) = C x^(-α) is called a power-law distribution. A power law implies that small occurrences are extremely common, whereas large instances are extremely rare.

Next:
I. Ways of detecting power-law behavior.
II. Empirical evidence for power laws in a variety of systems.

Example on an artificially generated data set
Take 1 million random numbers from a distribution with α = 2.5. A normal histogram of the numbers is produced by binning them into bins of equal width 0.1: the first bin goes from 1 to 1.1, the second from 1.1 to 1.2, and so forth. On the linear scales used this produces a nice smooth curve.
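A data set like this can be produced with inverse-transform sampling (a minimal sketch; the function name and parameters are mine, assuming a continuous power law with x_min = 1):

```python
import random

def sample_power_law(n, alpha=2.5, x_min=1.0, seed=0):
    """Draw n samples from p(x) = C x^(-alpha), x >= x_min, by inverse
    transform: if r is uniform on [0, 1), then
    x = x_min * (1 - r)^(-1 / (alpha - 1)) follows the power law."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0))
            for _ in range(n)]

samples = sample_power_law(100_000)
# Small values dominate: most samples lie near x_min, a few are huge.
```

With α = 2.5 the fraction of samples above x is x^(-1.5), so about 35% exceed 2 while only about 0.1% exceed 100, which is exactly the "common small, rare large" shape described above.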

Problem with a linear-scale plot of the equally binned data: how many times did the number 1, or 3843, or some other particular value occur? The power-law relationship is not apparent; on linear scales it only makes sense to look at the smallest bins, i.e. the first few bins of the whole range.

I. Measuring power laws
The author presents three ways of identifying power-law behavior:
1. Log-log plot
2. Logarithmic binning
3. Cumulative distribution function

1. Log-log plot
Logarithmic axes: powers of a number are uniformly spaced. 2^0 = 1, 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16, 2^5 = 32, 2^6 = 64, ...

1. Log-log plot
To fit power-law distributions, the most common (though not very accurate) method is to bin the different values of x, create a frequency histogram, and plot ln(# of times x occurred) against ln(x).
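A minimal sketch of this histogram-on-log-log-axes step (function name is mine; it uses the fixed-width 0.1 bins from the example above and returns the coordinates one would plot):

```python
import math

def loglog_histogram(data, bin_width=0.1):
    """Bin data into equal-width bins and return (ln x, ln count) pairs
    for the non-empty bins -- the points of the log-log frequency plot."""
    counts = {}
    for x in data:
        i = int(x / bin_width)
        counts[i] = counts.get(i, 0) + 1
    return [(math.log((i + 0.5) * bin_width), math.log(c))
            for i, c in sorted(counts.items())]

points = loglog_histogram([1.05, 1.05, 1.08, 1.15])
# Two non-empty bins: [1.0, 1.1) holds 3 samples, [1.1, 1.2) holds 1.
```

For power-law data these points fall roughly on a straight line of slope -α, which is what the slope-fitting method exploits.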

Problem with the log-log plot of the equally binned data: the right-hand end of the distribution is noisy. Each bin there has only a few samples in it, if any, so the fractional fluctuations in the bin counts are large, and this appears as a noisy curve on the plot. By contrast, we have tens of thousands of observations where x < 10. In short: noise in the tail, because there is less data in those bins.

Solution 1: (2) Logarithmic binning
The idea is to vary the width of the bins in the histogram, normalizing each sample count by the width of the bin it falls in: the number of samples in a bin of width Δx is divided by Δx to get a count per unit interval of x. The normalized sample count then becomes independent of bin width on average. The most common choice is to make each bin a fixed multiple wider than the one before it.

Logarithmic binning
Example: choose a multiplier of 2 and create bins that span the intervals 1 to 1.1, 1.1 to 1.3, 1.3 to 1.7 and so forth (i.e., the widths of the bins are 0.1, 0.2, 0.4 and so forth). This means the bins in the tail of the distribution get more samples than they would if the bin sizes were fixed, and on log-log axes the bins appear more equally spaced. Logarithmic binning still leaves some noise in the tail, however.
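A sketch of logarithmic binning under exactly these assumptions (bins 1–1.1, 1.1–1.3, 1.3–1.7, ..., each twice as wide as the last; helper names are mine):

```python
def log_bin_density(data, x_min=1.0, first_width=0.1, ratio=2.0):
    """Histogram with bin widths growing by a fixed multiple (0.1, 0.2,
    0.4, ...); each count is divided by its bin width to give a density
    (count per unit interval of x), as the normalization step requires."""
    edges, w = [x_min], first_width
    while edges[-1] <= max(data):
        edges.append(edges[-1] + w)
        w *= ratio
    counts = [0] * (len(edges) - 1)
    for x in data:
        for i in range(len(counts)):
            if edges[i] <= x < edges[i + 1]:
                counts[i] += 1
                break
    return [((edges[i] + edges[i + 1]) / 2,
             counts[i] / (edges[i + 1] - edges[i]))
            for i in range(len(counts))]

hist = log_bin_density([1.05, 1.05, 1.2, 1.5])
# Densities: 2/0.1 = 20 in [1, 1.1), 1/0.2 = 5 in [1.1, 1.3), 1/0.4 = 2.5 in [1.3, 1.7)
```

Because each bar is a density rather than a raw count, bars from bins of different widths are directly comparable on the same plot.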

Solution 2: (3) Cumulative distribution function
No loss of information: there is no need to bin, since the cumulative distribution has a value at each observed value of x. The cumulative distribution P(x) records how many of the values are at least x. The cumulative distribution of a power-law probability distribution is also a power law, but with exponent α − 1.
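The exponent shift follows directly by integrating the density (a standard calculation, assuming α > 1 so the integral converges):

```latex
P(x) \;=\; \int_x^{\infty} p(t)\,dt
      \;=\; C \int_x^{\infty} t^{-\alpha}\,dt
      \;=\; \frac{C}{\alpha - 1}\, x^{-(\alpha - 1)}, \qquad \alpha > 1,
```

so the cumulative distribution is itself a power law, with exponent α − 1.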

Cumulative distribution function

Power laws, Pareto distribution and Zipf's law
Cumulative distributions are sometimes also called rank/frequency plots. Cumulative distributions with a power-law form are sometimes said to follow Zipf's law or a Pareto distribution, after two early researchers; "Zipf's law" and "Pareto distribution" are effectively synonymous with "power-law distribution". Zipf's law and the Pareto distribution differ from one another only in the way the cumulative distribution is plotted: Zipf made his plots with x on the horizontal axis and P(x) on the vertical one; Pareto did it the other way around. This causes much confusion in the literature, but the data depicted in the plots are of course identical.

Cumulative distributions vs. rank/frequency
Sorting and ranking the measurements and then plotting rank against those measurements is usually the quickest way to construct a plot of the cumulative distribution of a quantity. This is the way the author plotted all of the cumulative distributions in his paper.

Cumulative distributions vs. rank/frequency
Plotting the cumulative distribution function P(x) of the frequency with which words appear in a body of text: we start by making a list of all the words along with their frequencies of occurrence. The cumulative distribution of the frequency is then defined such that P(x) is the fraction of words with frequency greater than or equal to x, i.e. P(X ≥ x). Alternatively one could simply plot the number of words with frequency greater than or equal to x.

Cumulative distributions vs. rank/frequency
For example: the most frequent word in most written English texts is "the". If x is the frequency with which this word occurs, then clearly there is exactly one word with frequency greater than or equal to x, since no other word is more frequent. Similarly, for the frequency of the second most common word, usually "of", there are two words with that frequency or greater, namely "of" and "the". And so forth: if we rank the words in order, then by definition there are n words with frequency greater than or equal to that of the nth most common word. Thus the cumulative distribution P(x) is simply proportional to the rank n of a word. This means that to make a plot of P(x) all we need do is sort the words in decreasing order of frequency, number them starting from 1, and then plot their ranks as a function of their frequencies. Such a plot of rank against frequency was called by Zipf a rank/frequency plot.
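The rank/frequency construction just described can be sketched in a few lines (a toy illustration, not Newman's code):

```python
from collections import Counter

def rank_frequency(words):
    """Sort word frequencies in decreasing order and number them from 1.
    The rank n of a word equals the number of words with frequency >= its
    own, so the (rank, frequency) pairs trace the unnormalized cumulative
    distribution."""
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))

pairs = rank_frequency("the of the and the of".split())
# pairs == [(1, 3), (2, 2), (3, 1)]: "the" occurs 3 times, "of" 2, "and" 1
```

Plotting the second element against the first on log-log axes gives exactly Zipf's rank/frequency plot.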

Estimating α from observed data
One way is to fit the slope of the line in the plots, and this is the most commonly used method. For example, fitting the plot generated by logarithmic binning gives α = 2.26 ± 0.02, which is incompatible with the known value α = 2.5 from which the data were generated. An alternative, simple and reliable method for extracting the exponent is to employ the maximum-likelihood formula α = 1 + n [Σ_i ln(x_i / x_min)]^(-1), which gives α = 2.500 ± 0.002 on the generated data.
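The maximum-likelihood formula is easy to apply directly (a sketch with names of my choosing; the check regenerates synthetic α = 2.5 data by inverse-transform sampling):

```python
import math
import random

def mle_exponent(data, x_min=1.0):
    """Maximum-likelihood estimate of the power-law exponent:
    alpha = 1 + n / sum(ln(x_i / x_min)), over the x_i >= x_min."""
    xs = [x for x in data if x >= x_min]
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)

rng = random.Random(1)
data = [(1.0 - rng.random()) ** (-1.0 / 1.5) for _ in range(50_000)]
alpha_hat = mle_exponent(data)
# alpha_hat lands close to the true exponent 2.5, unlike a naive slope fit
```

No binning or fitting is involved, which is why this estimator avoids the bias that plagues slope fits to binned histograms.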

Examples of power laws
a. Word frequency: Estoup.
b. Citations of scientific papers: Price.
c. Web hits: Adamic and Huberman.
d. Copies of books sold.
e. Diameter of moon craters: Neukum & Ivanov.
f. Intensity of solar flares: Lu and Hamilton.
g. Intensity of wars: Small and Singer.
h. Wealth of the richest people.
i. Frequencies of family names: e.g. US & Japan, but not Korea.
j. Populations of cities.

The following graphs are plotted as cumulative distributions.

Real-world values of x_min and the exponent α:

quantity                        x_min    α
frequency of use of words       1        2.20
number of citations to papers   —        —
number of hits on web sites     1        2.40
copies of books sold in the US  —        —
telephone calls received        —        —
magnitude of earthquakes        —        —
diameter of moon craters        —        —
intensity of solar flares       —        —
intensity of wars               3        1.80
net worth of Americans          $600m    2.09
frequency of family names       —        —
population of US cities         —        —

Not everything is a power law
a. The abundance of North American bird species.
b. The number of entries in people's address books.
c. The distribution of the sizes of forest fires.

Not everything is a power law

Conclusion
Power-law statistical distributions are seen in a wide variety of natural and man-made phenomena, from earthquakes and solar flares to populations of cities and sales of books. We have seen examples of power-law distributions in real data, and three ways that have been used to measure power laws.

References
M. E. J. Newman, "Power laws, Pareto distributions and Zipf's law". Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI, U.S.A.

End