Presentation is loading. Please wait.

Presentation is loading. Please wait.

(with an application of Web Spam detection) CS315-Web Search and Mining Power Laws and Rich-Get-Richer Phenomena.

Similar presentations


Presentation on theme: "(with an application of Web Spam detection) CS315-Web Search and Mining Power Laws and Rich-Get-Richer Phenomena."— Presentation transcript:

1 (with an application of Web Spam detection) CS315-Web Search and Mining Power Laws and Rich-Get-Richer Phenomena

2 What do these have in common? The grades of students in a class. The weights of apples. The high temperatures in Boston on July 4 th. The heights of Dutch men. The speed of cars on I-90. These measurements are well-characterized by the average and the standard deviation. Most instances are typical. Seeing an outlier is very surprising.

3 City populations 1. New York8,310,212 2. Los Angeles 3,834,340 3. Chicago2,836,658 4. Houston 2,208,180 5. Phoenix1,552,259 6. Philadelphia1,449,634 7. San Antonio 1,328,984 8. San Diego1,266,731 9. Dallas1,266,372 10. San Jose 939,899

4 City populations 1. New York8,310,212 2. Los Angeles 3,834,340 3. Chicago2,836,658 21. Boston, MA 625,087 248. Cambridge, MA 106,038 25,375. Lost Springs, WY 1 A few cities with high population Many cities with low population

5 City populations Cities ordered on population range

6 Word Frequencies

7 Power Law: The number of cities with population > k is proportional to k -c.

8 “fraction of items” “popularity = k”

9 Power Law: Fraction f(k) of items with popularity k is proportional to k -c. f(k) k -c log [f(k)] log [k -c ] log [f(k)] -c log [k] y -c x

10 A power law is a straight line on a log-log plot.

11 Number of Web page in-links (Broder+)

12 Examples (some better than others) frequency of words protein-interaction degree distribution Internet (AS) degree distribution severity of inter-state wars severity of terrorist attacks frequency of bird sightings size of blackouts book sales population of US cities size of religions number of citations papers authored popularity of surnames number of web hits number of web links, with cut-off number of phone calls size of email address book number of species per genus

13 What is going on? Nature seems to create bell curves (range around an average) Human activity seems to create power laws (popularity skewing)

14 Network Science: Scale-Free Property 2012 “seems to”

15 How can we use this to… fight spam? The main idea behind “Spam, Damn Spam and Statistics” Spammers manufacture pages and links to fool search engines In this process, they will overdo it Their actions would likely fall outside the normal human activity Let’s look for outliers in the power laws!

16 Web page out-degrees There are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected.

17 Web page in-degrees There are 369,457 pages have the in-degree of 1001, while according to the trend only 2,000 such pages are expected

18 Length of the URL’s host The 100 longest hostnames reveal that 80 of them belong to adult site and 11 refer to the financial and credit related sites

19 Number of host name resolutions to a single IP There are 100,000’s host names mapped to a single IP, The record-breaking IP is referred by 8,967,154 host names

20 Clusters of similar pages (shingling) The blue group is mainly spam. 15 of 20 largest clusters have 2,080,112 spam pages The red group has duplicated content, not spam).

21 Spammers are studious!

22 Why does data exhibit power laws? imitationPower law Can imitation explain the size of the Web parts?

23 Constructing a model of the Web 1. Pages are created in order, named 1, 2, …, N 2. When created, page j links to a page randomly: a) With probability p, picking a page i uniformly at random from pages 1, …, j-1 b) With probability (1-p), pick page i uniformly at random and link to the page that i links too imitation randomness This is the well-studied “preferential attachment” model of Web generation

24 The rich get richer 2 b) With prob. (1-p), pick page i uniformly at random and link to the page that i links too 1/43/4

25 The rich get richer 2 b) With prob. (1-p), pick page i uniformly at random and link to the page that i links too Equivalently, 2 b)With prob. (1-p), pick a page proportional to its in-degree and link to it

26 Information cascades and the rich Information cascade = some people get a little bit richer by chance and then rich-get-richer dynamics = the random rich people get a lot richer very fast

27 Is popularity predictable? Why is Harry Potter popular? If we could re-play history, would we still read Harry Potter en masse, or would it be some other book? (But then, why JK Rowling had troubles publishing it at first?)

28 Is popularity… random? Why “hits” in cultural markets are much more successful than average (and yet so hard to predict)? Can we study it with an experiment? “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market” 14,000 participants randomly assigned to “social influence” and “independent” conditions chose between 48 songs by unknown bands in 8+1 parallel worlds Subject See what others downloaded No information World 1 World 8 World 0

29 Music download site – 8+1 worlds 1.“Let’s go driving,” Barzin 2.“Silence is sexy,” Einstu ̈ rzende Neubauten 3.“Go it alone,” Noonday Underground 10.“Picadilly Lilly,” Tiger Lillies 1.“Let’s go driving,” Barzin 2.“Silence is sexy,” Einstu ̈ rzende Neubauten 3.“Go it alone,” Noonday Underground 10.“Picadilly Lilly,” Tiger Lillies 18 3 47 2 59 7 10 1 The best songs never went to the bottom, the worse never became popular. But their order changed a lot.


Download ppt "(with an application of Web Spam detection) CS315-Web Search and Mining Power Laws and Rich-Get-Richer Phenomena."

Similar presentations


Ads by Google