CS315-Web Search and Mining: Power Laws and Rich-Get-Richer Phenomena (with an application of Web Spam detection)


What do these have in common? The grades of students in a class. The weights of apples. The high temperatures in Boston on July 4th. The heights of Dutch men. The speed of cars on I-90. These measurements are well characterized by the average and the standard deviation. Most instances are typical. Seeing an outlier is very surprising.

City populations: 1. New York 8,310,…; 2. Los Angeles 3,834,…; 3. Chicago 2,836,…; 4. Houston 2,208,…; 5. Phoenix 1,552,…; 6. Philadelphia 1,449,…; 7. San Antonio 1,328,…; 8. San Diego 1,266,…; 9. Dallas 1,266,…; 10. San Jose 939,899

City populations: 1. New York 8,310,…; Los Angeles 3,834,…; Chicago 2,836,…; Boston, MA 625,…; Cambridge, MA 106,038; 25,375. Lost Springs, WY 1. A few cities with high population. Many cities with low population.

City populations Cities ordered on population range

Word Frequencies

Power Law: The number of cities with population > k is proportional to $k^{-c}$.

Plot axes: “fraction of items” against “popularity = k”.

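Power Law: The fraction $f(k)$ of items with popularity $k$ is proportional to $k^{-c}$, that is, $f(k) = a\,k^{-c}$ for some constant $a > 0$. Taking logarithms: $\log f(k) = \log a - c \log k$. With $y = \log f(k)$ and $x = \log k$, this is the line $y = \log a - c\,x$, with slope $-c$.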

A power law is a straight line on a log-log plot.
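Because a power law is a straight line on a log-log plot, its exponent can be read off as the negated slope of a fitted line. The following is a minimal sketch of that idea, assuming the form f(k) = a * k^(-c); the function name `fit_loglog_slope` and the synthetic data are illustrative and not part of the original slides. Plain least squares on log-log data is a simple illustration and is known to be a crude estimator on noisy real data.

```python
import math

def fit_loglog_slope(ks, fs):
    """Least-squares fit of log f = intercept + slope * log k.

    Returns (slope, intercept). For a power law f(k) = a * k^(-c),
    slope is -c and intercept is log a.
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(f) for f in fs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic data that follows f(k) = 5 * k^(-2) exactly (made-up values).
ks = list(range(1, 101))
fs = [5.0 * k ** -2.0 for k in ks]

slope, intercept = fit_loglog_slope(ks, fs)
c_hat = -slope              # estimated exponent c
a_hat = math.exp(intercept) # estimated constant a
print(f"c is about {c_hat:.3f}, a is about {a_hat:.3f}")
```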

Number of Web page in-links (Broder+)

Examples (some better than others): frequency of words; protein-interaction degree distribution; Internet (AS) degree distribution; severity of inter-state wars; severity of terrorist attacks; frequency of bird sightings; size of blackouts; book sales; population of US cities; size of religions; number of citations; papers authored; popularity of surnames; number of web hits; number of web links, with cut-off; number of phone calls; size of address book; number of species per genus.

What is going on? Nature seems to create bell curves (range around an average). Human activity seems to create power laws (popularity skewing).

Network Science: Scale-Free Property 2012 “seems to”

How can we use this to… fight spam? The main idea behind “Spam, Damn Spam and Statistics”: Spammers manufacture pages and links to fool search engines. In this process, they will overdo it. Their actions would likely fall outside normal human activity. Let’s look for outliers in the power laws!
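One way to make "look for outliers in the power laws" concrete is to fit a line to the degree histogram on log-log axes and flag degrees whose observed page count is far above the fitted trend. The sketch below is an assumption-laden illustration, not the method of the cited paper: the function name `flag_outliers`, the threshold `factor`, and the background histogram are all hypothetical. Only the spike of 158,290 pages at out-degree 1301 is taken from the slides.

```python
import math

def flag_outliers(counts, factor=10.0):
    """counts maps degree -> number of pages with that degree (all > 0).

    Fits log(count) = intercept + slope * log(degree) by least squares and
    returns (degree, observed, expected) for every degree whose observed
    count exceeds `factor` times the fitted count.
    """
    ks = sorted(counts)
    xs = [math.log(k) for k in ks]
    ys = [math.log(counts[k]) for k in ks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    flagged = []
    for k in ks:
        expected = math.exp(intercept + slope * math.log(k))
        if counts[k] > factor * expected:
            flagged.append((k, counts[k], expected))
    return flagged

# Hypothetical background histogram following count = 1e9 * degree^(-2.1).
counts = {k: max(1, round(1e9 * k ** -2.1)) for k in range(1, 2001)}
# Spike reported on the slides: 158,290 pages with out-degree 1301.
counts[1301] = 158290

flagged = flag_outliers(counts)
flagged_degrees = [k for k, observed, expected in flagged]
print(flagged_degrees)
```

A single fixed multiplicative threshold is the simplest possible choice; a real detector would likely need a more robust fit and a noise model for the sparse tail.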

Web page out-degrees There are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected.

Web page in-degrees: There are 369,457 pages with in-degree 1001, while according to the trend only 2,000 such pages are expected.

Length of the URL’s host: Of the 100 longest hostnames, 80 belong to adult sites and 11 to financial and credit-related sites.

Number of host name resolutions to a single IP: Hundreds of thousands of host names can map to a single IP. The record-breaking IP is referred to by 8,967,154 host names.

Clusters of similar pages (shingling): The blue group is mainly spam. 15 of the 20 largest clusters are spam, containing 2,080,112 pages. The red group has duplicated content (not spam).
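The slides name shingling without describing it. As a rough sketch of the general idea: each page is represented by its set of overlapping w-word sequences (shingles), and two pages are considered similar when the sets overlap heavily, measured here with Jaccard similarity. The window size `w=4`, the function names, and the sample texts are assumptions made for illustration; large-scale systems compare compact sketches of the shingle sets rather than the full sets.

```python
def shingles(text, w=4):
    """Return the set of overlapping w-word shingles of a text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Made-up page texts: doc_b is a near-duplicate of doc_a, doc_c is unrelated.
doc_a = "buy cheap watches online today best price free shipping worldwide"
doc_b = "buy cheap watches online today best price free delivery worldwide"
doc_c = "the grades of students in a class follow a bell curve"

sim_ab = jaccard(shingles(doc_a), shingles(doc_b))
sim_ac = jaccard(shingles(doc_a), shingles(doc_c))
print(f"near-duplicate: {sim_ab:.2f}, unrelated: {sim_ac:.2f}")
```

Pages whose pairwise similarity exceeds some threshold can then be grouped into clusters, and unusually large clusters inspected by hand.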

Spammers are studious!

Why does data exhibit power laws? Imitation → power law. Can imitation explain the size of the Web parts?

Constructing a model of the Web
1. Pages are created in order, named 1, 2, …, N.
2. When created, page j links to a page chosen randomly:
a) (randomness) With probability p, pick a page i uniformly at random from pages 1, …, j-1 and link to i.
b) (imitation) With probability (1-p), pick a page i uniformly at random from pages 1, …, j-1 and link to the page that i links to.
This is the well-studied “preferential attachment” model of Web generation.
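The model above is short enough to simulate directly. The following is a minimal sketch, assuming each page creates exactly one link; the function name `simulate_web`, the seed, and the parameter values are illustrative. One detail the slides leave open is what happens under rule (b) when the chosen page is page 1, which has no out-link; this sketch assumes the new page then links to page 1 itself.

```python
import random
from collections import Counter

def simulate_web(n_pages, p, seed=0):
    """Simulate the copying model and return in-degree counts per page."""
    rng = random.Random(seed)
    target = {}           # target[j] = the page that page j links to
    in_degree = Counter() # in_degree[page] = number of links into page
    for j in range(2, n_pages + 1):
        i = rng.randint(1, j - 1)  # uniform over earlier pages 1..j-1
        if rng.random() < p or i not in target:
            dest = i               # rule (a), or fallback when i has no link
        else:
            dest = target[i]       # rule (b): imitate page i's link
        target[j] = dest
        in_degree[dest] += 1
    return in_degree

in_degree = simulate_web(20000, p=0.2)
total_links = sum(in_degree.values())
top = in_degree.most_common(3)

# Histogram: how many pages have each in-degree value.
histogram = Counter(in_degree.values())
print("links created:", total_links)
print("three most linked pages:", top)
print("pages with in-degree 1, 2, 4, 8:", [histogram[d] for d in (1, 2, 4, 8)])
```

Plotting `histogram` on log-log axes is a quick way to check by eye whether the simulated in-degrees look roughly like a straight line.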

The rich get richer: 2 b) With prob. (1-p), pick page i uniformly at random and link to the page that i links to. (Figure labels: 1/4, 3/4.)

The rich get richer: 2 b) With prob. (1-p), pick page i uniformly at random and link to the page that i links to. Equivalently, 2 b) With prob. (1-p), pick a page with probability proportional to its in-degree and link to it.
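The equivalence can be checked with a one-line calculation. This is a sketch under the assumption that every earlier page has exactly one out-link (ignoring page 1, which has none); the notation $i \to \ell$ for "page $i$ links to page $\ell$" and $\operatorname{in}(\ell)$ for the current in-degree of $\ell$ is introduced here for illustration.

```latex
\[
\Pr[\,j \text{ links to } \ell \mid \text{rule (b)}\,]
  = \sum_{i=1}^{j-1} \Pr[\text{pick } i] \cdot \mathbf{1}[\,i \to \ell\,]
  = \frac{1}{j-1}\,\bigl|\{\, i < j : i \to \ell \,\}\bigr|
  = \frac{\operatorname{in}(\ell)}{j-1}.
\]
```

So under rule (b) a page is chosen with probability proportional to the number of links already pointing to it, which is exactly the rich-get-richer effect.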

Information cascades and the rich: Information cascade = some people get a little bit richer by chance. Then rich-get-richer dynamics = the randomly rich people get a lot richer very fast.

Is popularity predictable? Why is Harry Potter popular? If we could re-play history, would we still read Harry Potter en masse, or would it be some other book? (But then, why did J.K. Rowling have trouble publishing it at first?)

Is popularity… random? Why are “hits” in cultural markets so much more successful than average (and yet so hard to predict)? Can we study it with an experiment? “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market”: 14,000 participants, randomly assigned to “social influence” and “independent” conditions, chose among 48 songs by unknown bands in 8+1 parallel worlds. Each subject either sees what others downloaded (World 1 through World 8) or gets no information (World 0).

Music download site (8+1 worlds): 1. “Let’s go driving,” Barzin; 2. “Silence is sexy,” Einstürzende Neubauten; 3. “Go it alone,” Noonday Underground; 10. “Picadilly Lilly,” Tiger Lillies. The best songs never went to the bottom, and the worst never became popular. But their order changed a lot.