Social Analytics for BI

Slides:



Advertisements
Similar presentations
Community Detection and Graph-based Clustering
Advertisements

Spread of Influence through a Social Network Adapted from :
Maximizing the Spread of Influence through a Social Network
Introduction to Social Network Analysis Lluís Coromina Departament d’Economia. Universitat de Girona Girona, 18/01/2005.
Maximizing the Spread of Influence through a Social Network
Relationship Mining Network Analysis Week 5 Video 5.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
By: Roma Mohibullah Shahrukh Qureshi
Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
Overview of Web Data Mining and Applications Part I
Semantic Web Technologies Lecture # 2 Faculty of Computer Science, IBA.
Models of Influence in Online Social Networks
Data Mining Techniques
Copyright © 2014 Pearson Education, Inc. 1 It's what you learn after you know it all that counts. John Wooden Key Terms and Review (Chapter 5)
Data Analysis in YouTube. Introduction Social network + a video sharing media – Potential environment to propagate an influence. Friendship network and.
© 2008 IBM Corporation ® Atlas for Lotus Connections Unlock the power of your social network! Customer Overview Presentation An IBM Software Services for.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
Murtaza Abbas Asad Ali. NETWORKOLOGY THE SCIENCE OF NETWORKS.
A Graph-based Friend Recommendation System Using Genetic Algorithm
Date: 2012/4/23 Source: Michael J. Welch. al(WSDM’11) Advisor: Jia-ling, Koh Speaker: Jiun Jia, Chiou Topical semantics of twitter links 1.
Copyright © 2012, SAS Institute Inc. All rights reserved. ANALYTICS IN BIG DATA ERA ANALYTICS TECHNOLOGY AND ARCHITECTURE TO MANAGE VELOCITY AND VARIETY,
Maximizing the Spread of Influence through a Social Network Authors: David Kempe, Jon Kleinberg, É va Tardos KDD 2003.
Most of contents are provided by the website Graph Essentials TJTSD66: Advanced Topics in Social Media.
Slides are modified from Lada Adamic
Social Network Analysis. Outline l Background of social networks –Definition, examples and properties l Data in social networks –Data creation, flow and.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
HCC class lecture 21: Intro to Social Networks John Canny 4/11/05.
How to Analyse Social Network? Social networks can be represented by complex networks.
Informatics tools in network science
Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)
Topical Analysis and Visualization of (Network) Data Using Sci2 Ted Polley Research & Editorial Assistant Cyberinfrastructure for Network Science Center.
1 Discovering Web Communities in the Blogspace Ying Zhou, Joseph Davis (HICSS 2007)
GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.
GUILLOU Frederic. Outline Introduction Motivations The basic recommendation system First phase : semantic similarities Second phase : communities Application.
Topics In Social Computing (67810) Module 1 (Structure) Centrality Measures, Graph Clustering Random Walks on Graphs.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
CRIM6660 Terrorist Networks Lesson 1: Introduction, Terms and Definitions.
1 New metrics for characterizing the significance of nodes in wireless networks via path-based neighborhood analysis Leandros A. Maglaras 1 Dimitrios Katsaros.
Information Retrieval in Practice
Social Networks Some content from Ding-Zhu Du, Lada Adamic, and Eytan Adar.
Cohesive Subgraph Computation over Large Graphs
Empirical Analysis of Implicit Brand Networks on Social Media
Greedy & Heuristic algorithms in Influence Maximization
Groups of vertices and Core-periphery structure
Copyright © Zeph Grunschlag,
Romantic Partnerships and the Dispersion of Social Ties: A Network Analysis of Relationship Status on Facebook By: Lars Backstrom - Facebook Inc, Jon Kleinberg.
New models for power law:
Applications of graph theory in complex systems research
Department of Computer and IT Engineering University of Kurdistan
A Network Science Approach to Fake News Detection on Social Media
Link-Based Ranking Seminar Social Media Mining University UC3M
Comparison of Social Networks by Likhitha Ravi
E-Commerce Theories & Practices
The internet and the value chains of the media industry Session 4
Analyzing and Securing Social Networks
Peer-to-Peer and Social Networks
An Efficient method to recommend research papers and highly influential authors. VIRAJITHA KARNATAPU.
Information Retrieval
Centralities (4) Ralucca Gera,
Peer-to-Peer and Social Networks Fall 2017
CS 594: Empirical Methods in HCC Social Network Analysis in HCI
Mining Social Networks. Contents  What are Social Networks  Why Analyse Them?  Analysis Techniques.
Ying Dai Faculty of software and information science,
Ensemble learning.
Graph and Link Mining.
Practical Applications Using igraph in R Roger Stanton
Important Problem Types and Fundamental Data Structures
Social Media Interview
Presentation transcript:

Social Analytics for BI PART B

Extracting Social Network data for BI: workflow

Data collection: sources and methods Social network media access to comprehensive historic data sets and also real-time access to sources (Twitter, facebook, blogs..) News data access to historic data and real-time news data sets, possibly through data licenses (cf. software license). Examples: Google news Public data- access to scraped and archived important public data; available through RSS feeds, blogs or open government databases. Programmable interfaces- access to simple application programming interfaces (APIs) to scrape and store other available data sources that may not be automatically collected (e.g. the content of web sites) .

Social media providers Freely available databases —can be freely downloaded, e.g., Wikipedia (http://dumps.wikimedia.org) and the Enron e-mail data set (http://www.cs.cmu.edu/*enron/) Data access via tools—sources that provide controlled access to their social media. An example is Google’s Trends. These further subdivided into: Free sources—repositories that are freely accessible, but the tools protect or may limit access to the ‘raw’ data in the repository, such as the range of tools provided by Google. Commercial sources —data resellers that charge for access to their social media data. Gnip and DataSift provide commercial access to Twitter data through a partnership, and Thomson Reuters to news data. Data access via APIs —social media data repositories providing programmable HTTP-based access to the data via APIs (e.g., Twitter, Facebook and Wikipedia).

Extracting Social Network data for BI: workflow Labelling (tagging) social data

Preparing social data for storage Social media data is generated by humans and therefore is unstructured (i.e., it lacks a pre-defined structure or data model), algorithms are required to transform it into structured data before storage (data analytics need structured databases). Social data need then to be preprocessed, tagged and then parsed in order to analyze it. Adding extra information to the data (e.g, tagging the data with sentiment indicators, or with a topic indicator –e.g. music, politics.. )- can be performed manually or via rules engines, which seek patterns or interpret the data using techniques such as data mining and text analytics. Algorithms exploit the linguistic, auditory and visual structure inherent in all of the forms of human communication.

Methods for social data processing Computational statistics—refers to computationally intensive statistical methods including resampling methods, Markov chain Monte Carlo methods, local regression, kernel density estimation and principal components analysis. Machine learning—algorithms for the autonomous acquisition and integration of knowledge learnt from experience, analytical observation, etc. Natural Language Processing — algorithms for part of speech tagging, syntactic analysis, semantic analysis

Extracting Social Network data for BI: workflow Methods for Data Modeling: network-based And text-based

Text-based modeling Will be analyzed later (sentiment analysis, topic extraction..)

Network-based Modeling Graph-based measures: Based on the graph-structure of the network

Graph-based modeling Previously surveyed measures of influence, such as buzz, applause etc. are based on surface metrics (e.g. number of retweets, etc): graph-based measures go more in-depth. Objective here: model the social network as a graph Use graph-based methods/algorithms to identify “relevant players” in the network Relevant players = more influential, according to some criterion Use graph-based methods to identify communities (community detection) Use graph-based methods to analyze the “spread” of information

Graph-based measures of social influence Use graph-based methods/algorithms to identify “relevant players” in the network Relevant players = more influential, according to some criterion Use graph-based methods to identify global network properties and communities (community detection) Use graph-based methods to analyze the “spread” of information

Modeling a Social Network as a graph NODE= “actor, vertices, points” i.e. the social entity who participates in a certain network EDGE= “connection, edges, arcs, lines, ties” is defined by some type of relationship between these actors (e.g. friendship, reply/re-tweet, partnership between connected companies..)

SN = graph A network can then be represented as a graph data structure We can apply a variety of measures and analysis to the graph representing a given SN Edges in a SN can be directed or undirected (e.g. friendship, co-authorship are usually undirected, emails are directed)

What is the meaning of edges?

Facebook in undirected (friendship is mutual)

Twitter is a directed graph (friendship is not necessarily bidirectional)

In general, a relation can be: Binary or Valued Directed or Undirected Social Network as a graph In general, a relation can be: Binary or Valued Directed or Undirected a b c e d a b c e d Directed, binary Undirected, binary b d a b c e d 1 3 1 2 4 a c e Directed, Valued Undirected, Valued

Example of directed, valued: Sentiment relations among parties during a political campaign. Color: positive (green) negative (red). Intensity (thikness of edges): related to number of mutual references

Graph-based measures of social influence: key players Using graph theory, we can identify key players in a social network Key players are nodes (or actors, or vertexes) with some measurable connectivity property Two important concepts in a network are the ideas of centrality and prestige of an actor. Centrality more suited for undirected, prestige for directed Another important notion is that of bridgeness, or brokerage (people connecting other people)

Finding key players: Centrality Centrality refers to (one dimension of) location, identifying where an actor resides in a network. Mostly used for undirected networks (like Facebook, since friendship is always mutual). In general, the aim is to formalize intuitive notions about the distinction between insiders and outsiders.

Centrality Conceptually, centrality is fairly straight forward: we want to identify which nodes are in the ‘center’ of the network. In practice, identifying exactly what we mean by ‘center’ is somewhat complicated. Three standard centrality measures capture a wide range of “importance” in a network: Degree Closeness Betweenness

Centrality 1.Centrality Degree The most intuitive notion of centrality focuses on degree. Degree is the number of ties, and the actor with the most ties is the most important:

Closeness Centrality: A second measure of centrality is closeness centrality. An actor is considered important if he/she is relatively close to all other actors. Closeness is based on the inverse of the distance of each actor to every other actor in the network (distance: how many edges you need to travel starting from one node ni to reach the destination node nj). Closeness Centrality: Normalized Closeness Centrality

Closeness Centrality Closeness Centrality in the examples Distance Closeness normalized 0 1 1 1 1 1 1 1 .143 1.00 1 0 2 2 2 2 2 2 .077 .538 1 2 0 2 2 2 2 2 .077 .538 1 2 2 0 2 2 2 2 .077 .538 1 2 2 2 0 2 2 2 .077 .538 1 2 2 2 2 0 2 2 .077 .538 1 2 2 2 2 2 0 2 .077 .538 1 2 2 2 2 2 2 0 .077 .538 In the tables, cell (i,j) is the distance between node i and node j Distance Closeness normalized 0 1 2 3 4 4 3 2 1 .050 .400 1 0 1 2 3 4 4 3 2 .050 .400 2 1 0 1 2 3 4 4 3 .050 .400 3 2 1 0 1 2 3 4 4 .050 .400 4 3 2 1 0 1 2 3 4 .050 .400 4 4 3 2 1 0 1 2 3 .050 .400 3 4 4 3 2 1 0 1 2 .050 .400 2 3 4 4 3 2 1 0 1 .050 .400 1 2 3 4 4 3 2 1 0 .050 .400

Closeness entrality Closeness Centrality in the examples Distance Closeness normalized 0 1 2 3 4 5 6 .048 .286 1 0 1 2 3 4 5 .063 .375 2 1 0 1 2 3 4 .077 .462 3 2 1 0 1 2 3 .083 .500 4 3 2 1 0 1 2 .077 .462 5 4 3 2 1 0 1 .063 .375 6 5 4 3 2 1 0 .048 .286

Key players: Prestige The term prestige is used for directed networks since for this measure the direction is an important property of the relation. In this case we can define two different types of prestige: one for outgoing arcs (measures of influence), one for incoming arcs (measures of support). Examples: An actor has high influence, if he/she gives hints to several other actors (e.g. in Yahoo! Answers). An actor has high support, if a lot of people vote for him/her (many “likes”)

Measures of prestige Influence and support Influence domain Hubs and authorities Brockers

Measuring prestige: influence and support Influence and support: According to the direction/meaning of a relation, in and outdegree represent support or influence. (e.g., likes, votes for,. . . ).

Measuring prestige: influence domain Influence domain: The influence domain of an actor (node) in a directed network is the number (or proportion) of all other nodes which are connected by a path to this node. All other actors are in influence domain of actor 1: Prest(1)=10/10=1.

Limits of Influence domain Influence domain has an important limitation: all the nodes contribute equally to influence. Choices by actors 2, 3, and 7 are more important to person 1 than indirect choices by 4, 5, 6, and 8. Individuals 9 and 10 contribute even less to the prestige of 1.

Measuring prestige: Hubs and Authorities, Page Rank Hubness is a good measure of influence Authority is a good measure of support Kleinberg’s algorithm (HITS) to compute authority and hubness degree of nodes, same as for link analysis Page Rank is a good measure of support HITS, Page Rank: see previous lessons

Example If Mrs. Green is the boss, employees referring directly to her are more important

Measuring prestige: Page Rank Page Rank is one of the main algorithms used by Google to rank web pages when you make a search (graph-based methods apply to any problem that can be modeled with a graph!) A complex method but basically the idea is that the rank of a node depends on the rank of the other nodes pointing at that node (like for in scientific citations, if you are cited by a Nobel Prize, then your “support” is higher than if you are cited my Mr. John Doe)

The basic principle of Page Rank

How does it work Node nj PR(nj )=100/2 + 9/3=53

How is it calculated? The rank of a node depends on the rank of its pointing nodes.. So it seems a circular problem: how can we compute it for all nodes? Start with a random guess of page rank values, and keep on adjusting values until values don’t change (steady state) In computer science, this is called RECURSION

1 Parto da pesi casuali

1

0,5 0,5

0,25 0,75

After several iterations.. 0,22 0,11 0,22 0,44 Why stops here?

More on finding “key players” : bridgeness Centrality, Hubs and Authorities, PageRank measure how a node (an individual in a social network) is “authoritative” There are other qualities we may want to compute, for example, the “bridgeness” (also called betweenness, brokerage, key separators..) People that link other people, acting as bridges Model based on communication flow: A person who lies on communication paths can control communication flow, and is thus important.

Example of bridge Algorithms to identify brockers are all based on some measure of the graph connectivity.

b a C d e f g h Measures of bridgeness: Betweenness Centrality Betweenness centrality counts the number of geodesic paths between i and k that actor j resides on. Geodesics are defined as the shortest path between points b a C d e f g h

Betweenness Centrality Where gjk = the number of geodesics (shortest) connecting jk, and gjk (ni)= the number of such paths that node i is on (count also in the start-end nodes of the path). Can also compute edge betweenness in the very same way

Example of computation

Many other measures Bridgeness, like centrality, is a local measure (based only on path counts) More sophisticated algorithms (like e.g. Page Rank for centrality) are available Mostly based on the notion of graph connectivity The idea is: what is we remove a node from the network? The highest the damage in term of connectivity, the highest the bridgeness value of the node

Finding Bridgeness /brokerage Good bridges = actors that are indispensable for the flow of communication within the network As for graph representation, good brockers are actors that, if removed from the graph, reduces graph connectivity. For example, it causes the creation of disconnected components (Jenny, Jack and John in the graph) This is why bridges are also called brokers or key separators

Graph-based measures of social influence Use graph-based methods/algorithms to identify “relevant players” in the network Relevant players = more influential, according to some criterion Use graph-based methods to identify global network properties and communities (community detection) Use graph-based methods to analyze the “spread” of information

Community detection Community: It is formed by individuals such that those within a group interact with each other more frequently than with those outside the group a.k.a. group, cluster, cohesive subgroup, module in different contexts Community detection: discovering groups in a network where individuals’ group memberships are not explicitly given The categorization are not necessarily disjoint ; social media is becoming an aspect of the physical world 51 51

Community detection Why communities in social media? Human beings are social Easy-to-use social media allows people to extend their social life in unprecedented ways Difficult to meet friends in the physical world, but much easier to find friend online with similar interests Interactions between nodes can help determine communities

Communities in Social Media Two types of groups in social media Explicit Groups: formed by user subscriptions (e.g. Google groups, Twitter lists) Implicit Groups: implicitly formed by social interactions Some social media sites allow people to join groups, however it is still necessary to extract groups based on network topology Not all sites provide community platform Not all people want to make effort to join groups Groups can change dynamically Network interaction provides rich information about the relationship between users Can complement other kinds of information, e.g. user profile Help network visualization and navigation Provide basic information for other tasks, e.g. recommendation Groups are not active: the group members seldom talk to each other Explicitly vs. implicitly formed groups 53 53

Subjectivity of Community Definition Each component is a community A densely-knit community No objective definition for a community Visualization might help, but only for small networks Real-world networks tend to be noisy Need a proper criteria Definition of a community can be subjective. 54 54

Community detection = clustering The two problems are very similar And share the same complexity

A very simple method K-means Input is the number of communities you wish to identify, K Algorithm: Select K random nodes, each of these node represent the “cluster center” Assign every other node to the cluster it is more close to Compute the new cluster center Iterate until clusters become stable (no more nodes are moved from one cluster to the other)

Example with K=3

Graph-based measures of social influence Use graph-based methods/algorithms to identify “relevant players” in the network Relevant players = more influential, according to some criterion Use graph-based methods to identify global network properties and communities (community detection) Use graph-based methods to analyze the “spread” of information

University of British Columbia Influence Spread We live in communities and interact with our friends, family and even strangers. In the process, we influence each other. We live in communities and interact with our friends, family and even strangers. We share ideas and knowledge with them. In the process, we influence each other. This social influence, if modeled properly, can be leveraged in various applications like Viral Marketing, Recommender Systems, Feed Ranking etc. University of British Columbia

Social Network and Spread of Influence Social network plays a fundamental role as a medium for the spread of INFLUENCE among its members Opinions, ideas, information, innovation… Direct Marketing takes the “word-of-mouth” effects to significantly increase profits (Gmail, Tupperware popularization, Microsoft Origami …)

Social Network and Spread of Influence Examples: Hotmail grew from zero users to 12 million users in 18 months on a small advertising budget. A company selects a small number of customers and ask them to try a new product. The company wants to choose a small group with largest influence. Obesity grows as fat people stay with fat people (homofily relations) Viral Marketing..

Viral Marketing Identify influential customers One of the fundamental problems is identification of influential customers in social networks. If you convince them to adopt a product – say by offering discounts or free samples, then these customers, if they like the product will endorse it among their friends, and it will lead to large number of adoptions, driven by word-of-mouth effect. Convince them to adopt the product – Offer discount/free samples These customers endorse the product among their friends

Problem Setting Given Goal Question a limited budget B for initial advertising (e.g. give away free samples of product) estimates for influence between individuals Goal trigger a large cascade of influence (e.g. further adoptions of a product) Question Which set of individuals should B target at? Application besides product marketing spread an innovation detect stories in blogs (gossips) Epidemiological analysis if we can try to convince a subset of individuals to adopt a new product or innovation But how should we choose the few key individuals to use for seeding this process? Which blogs should one read to be most up to date?

What we need Models of influence in social networks. Obtain data about particular network (to estimate inter-personal influence). Algorithms to maximize spread of influence.

A simple algorithm Linear Threshold Model (only the intuition..) The basic model implies that each actor is influenced by those he/she is linked to The influence depends on the strength of the relation between two actors It also depends on the personal tendency of an actor to be influenced by others

Linear Threshold Model A node v has random threshold θv ~ U[0,1] (this model the tendency to be influenced: the higher the threshold, the lower is the influence of others on an actor opinion) A node v is influenced by each neighbor w according to a weight bvw such that This model the strength of the relation of actor v on actor w A node v becomes active when at least (weighted) θv fraction of its neighbors are active Given a random choice of thresholds, and an initial set of active nodes A0 (with all other nodes inactive), the diffusion process unfolds deterministically in discrete steps: in step t, all nodes that were active in step t-1 remain active, and we activate any node v for which the total weight of its active neighbors is at least Theta(v)

Example (weights on edges are the bu,v) Inactive Node 0.6 Active Node 0.2 0.2 Threshold 0.3 Active neighbors X 0.1 0.4 U 0.3 0.5 Stop! 0.2 0.5 w v

Outline Models of influence Influence maximization problem Linear Threshold Independent Cascade Influence maximization problem

Influence Maximization Problem Given a parameter k (budget), find a k-node set S to maximize f(S) In simpler terms: find the minimum number of influencer to “reward”, given the budget, which maximizes the number of individuals that can be “influenced” (through a cascade process of influence propagation” Several algorithms (you don’t need to learn..) the influence of a set of nodes A: the expected number of active nodes at the end of the process.