CS315: Web Search and Mining. Power Laws and Rich-Get-Richer Phenomena (with an application to Web spam detection).



What do these have in common? The grades of students in a class; the weights of apples; the high temperatures in Boston on July 4th; the heights of Dutch men; the speed of cars on I-90. These measurements are well characterized by the average and the standard deviation. Most instances are typical. Seeing an outlier is very surprising.

City populations (ranked list; figures as they appear in the transcript): 1. New York 8,310, Los Angeles 3,834, Chicago 2,836, Houston 2,208, Phoenix 1,552, Boston, MA 625, Cambridge, MA 106,038, and at rank 25,375, Lost Springs, WY: 1. A few cities have a high population; many cities have a low population.

Word Frequencies

Power Law: The number of cities with population > k is proportional to $k^{-c}$.

[Plot: “fraction of items” on the vertical axis versus “popularity = k” on the horizontal axis.]

Power Law: The fraction $f(k)$ of items with popularity $k$ is proportional to $k^{-c}$. Taking logarithms: $f(k) \propto k^{-c}$, so $\log f(k) = -c \log k + \text{const}$, which has the form $y = -c\,x + \text{const}$.

A power law is a straight line on a log-log plot.
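The straight-line property suggests a simple way to estimate the exponent c: fit a line to log f(k) versus log k and negate the slope. The sketch below does this with ordinary least squares on synthetic data; the function name `fit_power_law_exponent`, the constant 3.0, and the exponent 2.1 are illustrative assumptions, not values from the slides. On real, noisy data a plain log-log regression can be biased, so treat this as a first look rather than a definitive estimator.

```python
import math


def fit_power_law_exponent(ks, fs):
    """Least-squares slope of log f versus log k; returns the estimate of c."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(f) for f in fs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope  # f(k) ~ k^(-c) means the slope is -c


# Synthetic, noise-free data (hypothetical): f(k) = 3.0 * k^(-2.1)
ks = list(range(1, 101))
fs = [3.0 * k ** -2.1 for k in ks]
c_hat = fit_power_law_exponent(ks, fs)
print(f"estimated exponent c = {c_hat:.3f}")
```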

What other things look like power laws? The number of Web page in-links (Broder et al.).

Examples (some better than others): frequency of words; protein-interaction degree distribution; Internet (AS) degree distribution; severity of inter-state wars; severity of terrorist attacks; frequency of bird sightings; size of blackouts; book sales; population of US cities; size of religions; number of citations; papers authored; popularity of surnames; number of web hits; number of web links, with cut-off; number of phone calls; size of address book; number of species per genus.

What is going on? Nature seems to create bell curves (the result of summing many small independent factors). Human activity seems to create power laws (the result of imitation).

[Figure credit: Network Science: Scale-Free Property, 2012. Note the hedge: “seems to”.]

How can we use this to… fight spam? The main idea behind “Spam, Damn Spam and Statistics”: spammers manufacture pages and links to fool search engines. In this process, they will overdo it, and their actions will likely fall outside normal human activity. Let’s look for outliers in the power laws!

Web page out-degrees: there are 158,290 pages with out-degree 1301, while according to the overall trend only 1,700 such pages are expected.

Web page in-degrees: there are 369,457 pages with in-degree 1001, while according to the trend only 2,000 such pages are expected.
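To make "outlier" concrete, one can compare the observed count at a given degree with the count the power-law trend predicts. The sketch below uses only the two pairs of numbers quoted on the degree slides; the helper `excess_ratio` and the cut-off `THRESHOLD = 10.0` are hypothetical choices made for illustration, not the method of the original study.

```python
def excess_ratio(observed, expected):
    """How many times larger the observed count is than the trend predicts."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return observed / expected


# (observed pages, pages expected from the trend), as quoted on the slides
SLIDE_COUNTS = {
    "out-degree 1301": (158_290, 1_700),
    "in-degree 1001": (369_457, 2_000),
}

THRESHOLD = 10.0  # hypothetical cut-off: flag counts 10x above the trend

ratios = {name: excess_ratio(obs, exp) for name, (obs, exp) in SLIDE_COUNTS.items()}
flagged = [name for name, ratio in ratios.items() if ratio > THRESHOLD]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the expected count")
```

Both degrees sit roughly two orders of magnitude above the trend, which is why they stand out as candidates for machine-generated pages.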

Length of the URL’s host: of the 100 longest hostnames, 80 belong to adult sites and 11 refer to financial and credit-related sites.

Number of host name resolutions to a single IP: hundreds of thousands of host names can map to a single IP. The record-breaking IP is referred to by 8,967,154 host names.

Clusters of similar pages (shingling): the blue group is mainly spam; 15 of the 20 largest clusters, with 2,080,112 pages, are spam. The red group has duplicated content, but is not spam.
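The slide names shingling but does not spell it out. As a minimal sketch: a page is broken into overlapping runs of w consecutive words (shingles), and two pages are compared by the overlap of their shingle sets. The shingle width `w=3`, the use of Jaccard similarity, and the two sample sentences are assumptions made here for illustration.

```python
def shingles(text, w=3):
    """Set of all runs of w consecutive words in the text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)}
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(shingles(page_a), shingles(page_b))
print(f"similarity = {similarity:.2f}")
```

Pages whose similarity exceeds some chosen threshold would be placed in the same cluster; unusually large clusters are then candidates for inspection.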

Spammers are studious!

Why does data exhibit power laws? Imitation → power law. Can imitation explain the number of in-links?

Constructing a model of Web growth that simulates imitation:
1. Pages are created in order, named 1, 2, …, N.
2. When created, page j links to a page chosen at random:
a) With probability p, pick a page i uniformly at random from pages 1, …, j-1 and link to it (randomness).
b) With probability (1-p), pick a page i uniformly at random and link to the page that i links to (imitation).
This is the well-studied “preferential attachment” model of Web generation.

Rule 2b creates the “rich get richer” phenomenon. 2b) With probability (1-p), pick page i uniformly at random and link to the page that i links to. Equivalently: 2b) With probability (1-p), pick a page with probability proportional to its in-degree and link to it.
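A short justification of why the two forms of Rule 2b agree, sketched under the assumption (not stated on the slide) that every page eligible to be picked as i has exactly one out-link. The symbols $d_\ell$ (current in-degree of page $\ell$) and $m$ (number of eligible pages) are introduced here for illustration.

```latex
\Pr[\text{copy step links to } \ell]
  = \frac{\#\{\, i : i \text{ links to } \ell \,\}}{m}
  = \frac{d_\ell}{m}
  = \frac{d_\ell}{\sum_{\ell'} d_{\ell'}}
```

The last step uses the fact that each of the $m$ eligible pages contributes exactly one link, so the in-degrees sum to $m$. A page that already has many in-links is therefore proportionally more likely to receive the next one.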

Simulation of Preferential Attachment
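A minimal simulation sketch of the copying model described above. Two details are assumptions added here because the slides leave them open: page 2 is forced to link to page 1, and the copy step picks i only from pages 2, …, j-1, since page 1 has no out-link to copy. The function name, the seed, and the parameter values `n_pages=10_000` and `p=0.2` are likewise illustrative.

```python
import random
from collections import Counter


def simulate_copying_model(n_pages, p, seed=0):
    """Grow a graph in which each new page makes exactly one out-link."""
    rng = random.Random(seed)
    target = {2: 1}  # assumption: page 2 can only link to page 1
    for j in range(3, n_pages + 1):
        if rng.random() < p:
            # Rule 2a: link to a uniformly random earlier page
            target[j] = rng.randint(1, j - 1)
        else:
            # Rule 2b: copy the link of a uniformly random earlier page
            # (assumption: chosen among pages 2..j-1, which have a link)
            i = rng.randint(2, j - 1)
            target[j] = target[i]
    in_degree = Counter(target.values())
    return target, in_degree


target, in_degree = simulate_copying_model(n_pages=10_000, p=0.2)

# Number of pages that have each in-degree (pages with none are omitted)
degree_histogram = Counter(in_degree.values())
for degree in sorted(degree_histogram)[:5]:
    print(f"in-degree {degree}: {degree_histogram[degree]} pages")
print("largest in-degree:", max(in_degree.values()))
```

Plotting `degree_histogram` on log-log axes is one way to check whether the simulated in-degrees fall roughly on a straight line.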

Information cascades and the rich. Information cascade: some people get a little bit richer by chance. Then rich-get-richer dynamics take over: the randomly rich get a lot richer very fast.

Is popularity predictable? Why is Harry Potter popular? If we could replay history, would we still read Harry Potter en masse, or would it be some other book? (But then, why did J.K. Rowling have trouble publishing it at first?)

Is popularity… random? Why are “hits” in cultural markets so much more successful than average (and yet so hard to predict)? Can we study this with an experiment? In “Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market”, 14,000 participants were randomly assigned to “social influence” and “independent” conditions and chose between 48 songs by unknown bands in 8+1 parallel worlds. [Diagram: each subject is assigned either to one of World 1 through World 8, where they see what others downloaded, or to World 0, where they get no information.]

Music download site – 8+1 worlds. Sample of the song list: 1. “Let’s go driving,” Barzin; 2. “Silence is sexy,” Einstürzende Neubauten; 3. “Go it alone,” Noonday Underground; 10. “Picadilly Lilly,” Tiger Lillies. The best songs never went to the bottom, and the worst never became popular, but their order changed a lot.