Mining Email Social Networks in OSS Christian Bird, Prem Devanbu, Alex Gourley, and Michael Gertz Department of Computer Science Anand Swaminathan Graduate.

Presentation transcript:

Mining Email Social Networks in OSS
Christian Bird, Prem Devanbu, Alex Gourley, and Michael Gertz, Department of Computer Science
Anand Swaminathan, Graduate School of Management
University of California, Davis

2 Motivation
- The social process is an important but hard-to-study aspect of any software engineering effort.
- It can be studied in many stable and mature OSS projects.
- Nearly all communication takes place over the Internet.
- Records of both communication and development activity are freely available.

3 Apache Communication and Development (since 1996)
- 100,000+ messages on the dev mailing list
- 70,000 CVS commits to files

4 It is widely believed that OSS communities form a hierarchy. Can we use social network analysis to examine these OSS communities?
(Image from “Socialization in an Open Source Community”, Nicolas Ducheneaut)

5 Social Networks
A social network consists of actors and their social ties to each other.
(Figure: network of who dated whom in a high school. Courtesy of Mark Newman.)

6 Related Work
- Xu, Gao, Christley, and Madey looked at developers who worked on the same projects.
- Crowston & Howison treated co-occurrence of developers on a bug report as a social link.
- Lopez, Gonzalez-Barahona, & Robles created networks of developers and modules via CVS data.
- We believe that a response to an email indicates a strong social link.

Link types shown on the slide, with Alice and Bob as example actors:
- Project (Python): both contribute; undirected link
- Bug report: submit and resolve; undirected link
- Source file (foo.c): both commit; undirected link
- Mailing list: one posts and the other responds; directed link

7 Issues with Mailing List Analysis
- Extracting conversation threads
- Rationalizing timestamps
- Identifying targets in a broadcast medium
- Resolving aliases
- Extracting content

8 Aliases
- 2,544 different email address aliases have been used on the Apache dev mailing list.
- Many of these addresses belong to the same people.
- The slide lists several addresses that were all used by Joe Orton.

9 Alias Analysis
Email addresses contain a (name, email) tuple. Often the name is empty.
1. Preprocess the name and email address.
   – Remove commas (“orton, joe” -> “joe orton”)
   – Normalize whitespace and remove punctuation and common prefixes/suffixes (Mr., Jr., etc.)
   – Remove common terms (list, admin, root)
2. Use heuristics and fuzzy matching (Levenshtein edit distance) to determine which aliases are similar.
   – name-name: “joe orton” vs. “joe e. orton”
   – email-email: one address vs. another address
   – name-email: “joe orton” vs. an email address
3. Manually post-process the aliases marked as similar to remove the high level of false positives.
4. Use a similar process to map CVS accounts to aliases.
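The preprocessing and fuzzy-matching steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the word lists, the function names, the length-scaled similarity score, and the 0.8 threshold are all assumptions, and only the name-name comparison is shown.

```python
import re

# Hypothetical word lists; the slides give only a few examples of each.
PREFIXES_SUFFIXES = {"mr", "mrs", "ms", "dr", "jr", "sr"}
COMMON_TERMS = {"list", "admin", "root"}


def normalize_name(raw: str) -> str:
    """Step 1: flip 'last, first', drop punctuation, titles, and generic terms."""
    if "," in raw:
        last, first = raw.split(",", 1)
        raw = f"{first} {last}"
    cleaned = re.sub(r"[^\w\s]", " ", raw.lower())
    words = [w for w in cleaned.split()
             if w not in PREFIXES_SUFFIXES and w not in COMMON_TERMS]
    return " ".join(words)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical strings (assumed scoring)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def looks_similar(name_a: str, name_b: str, threshold: float = 0.8) -> bool:
    """Step 2, name-name case only: flag a pair as a candidate for manual review."""
    return similarity(normalize_name(name_a), normalize_name(name_b)) >= threshold
```

A pair flagged by `looks_similar` is only a candidate; as step 3 notes, false positives are common enough that manual review is still needed.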

10 Alias Results
- 2,544 aliases used
- 2,008 unique “identities” used
- Many of the high-volume participants had a large number of aliases

11 Creating the Social Network
- Each message has a message ID.
- A response message contains an “In-Reply-To” header, which includes the message ID of the previous message.
- If Joe posts a message and Bob responds, this indicates information flow, and we create a directed tie from Joe to Bob.
- We have built a tool that creates a directed, valued adjacency matrix of participants from our mailing list database for any time period.
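One way to picture the construction is the sketch below. It is not the authors' tool: the message schema (`id`, `sender`, `in_reply_to` keys), the `identity_of` alias-resolution callback, and the choice to skip self-replies are assumptions made for illustration, and filtering by time period is left out.

```python
from collections import defaultdict


def build_reply_matrix(messages, identity_of=lambda alias: alias):
    """Build a directed, valued adjacency matrix from reply relationships.

    messages: list of dicts with 'id', 'sender', and optional 'in_reply_to'
              keys (hypothetical schema).
    identity_of: maps an alias to a resolved identity (assumed helper).
    Returns (actors, matrix) where matrix[i][j] counts replies that
    actors[j] sent to messages posted by actors[i].
    """
    sender_of = {m["id"]: identity_of(m["sender"]) for m in messages}

    weights = defaultdict(int)
    for m in messages:
        parent = m.get("in_reply_to")
        if parent not in sender_of:
            continue  # thread root, or parent message not in this data set
        poster = sender_of[parent]
        responder = sender_of[m["id"]]
        if poster != responder:  # assumption: ignore replies to oneself
            weights[(poster, responder)] += 1  # tie from poster to responder

    actors = sorted(set(sender_of.values()))
    index = {actor: i for i, actor in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in actors]
    for (poster, responder), weight in weights.items():
        matrix[index[poster]][index[responder]] = weight
    return actors, matrix
```

The tie direction follows the slide: a reply from Bob to Joe's message adds weight to the cell for the tie from Joe to Bob.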

12 Intro to Social Network Metrics
- In-degree – the number of links whose head is connected to a particular actor
- Out-degree – the number of links whose tail is connected to a particular actor
- Geodesic – a shortest path between two actors
- Betweenness – the number of geodesics that a particular actor lies on
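As a small illustration of the first three definitions, the sketch below treats the network as a set of `(tail, head)` pairs, one per directed link. The representation and the function names are assumptions chosen for brevity.

```python
from collections import deque


def in_degree(edges, actor):
    """Number of links whose head is the given actor."""
    return sum(1 for _tail, head in edges if head == actor)


def out_degree(edges, actor):
    """Number of links whose tail is the given actor."""
    return sum(1 for tail, _head in edges if tail == actor)


def geodesic_length(edges, source, target):
    """Length of a shortest directed path, found by breadth-first search.

    Returns None when the target cannot be reached from the source.
    """
    successors = {}
    for tail, head in edges:
        successors.setdefault(tail, []).append(head)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for nxt in successors.get(node, []):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    return None
```

Because the links are directed, a geodesic from one actor to another may exist even when none exists in the opposite direction.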

13 Example
(Figure: example network with actors labeled High Out-Degree, High Betweenness, and High In-Degree)

14 Betweenness more formally
For a given vertex $i$:

$$b(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}$$

where $\sigma_{st}$ is the number of geodesics between $s$ and $t$, and $\sigma_{st}(i)$ is the number of those paths passing through vertex $i$. It is common to normalize the values so that the betweenness scores of all vertices sum to 1.
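The definition can be checked on small graphs with a direct, unoptimized computation: for every ordered pair of distinct vertices, add to each intermediate vertex the fraction of geodesics between the pair that pass through it. The sketch below assumes a directed graph given as a set of `(tail, head)` pairs; the function names are illustrative, and a faster algorithm would be preferable for a network of thousands of participants.

```python
from collections import deque


def shortest_path_counts(successors, source):
    """BFS from source: distance to, and number of geodesics to, each reachable vertex."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                count[nxt] = 0
                queue.append(nxt)
            if dist[nxt] == dist[node] + 1:
                count[nxt] += count[node]  # every geodesic to node extends to nxt
    return dist, count


def betweenness(edges, normalize=False):
    """Sum over ordered pairs (s, t) of the fraction of s-t geodesics through each vertex."""
    nodes = sorted({n for edge in edges for n in edge})
    successors = {}
    for tail, head in edges:
        successors.setdefault(tail, []).append(head)
    info = {s: shortest_path_counts(successors, s) for s in nodes}

    score = {n: 0.0 for n in nodes}
    for s in nodes:
        dist_s, count_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue
            for i in nodes:
                if i in (s, t) or i not in dist_s:
                    continue
                dist_i, count_i = info[i]
                # i lies on an s-t geodesic iff the two legs add up to d(s, t)
                if t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    score[i] += count_s[i] * count_i[t] / count_s[t]

    if normalize:
        total = sum(score.values())
        if total > 0:
            score = {n: value / total for n, value in score.items()}
    return score
```

With `normalize=True` the scores are rescaled so that they sum to 1 across all vertices, matching the normalization mentioned on the slide.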

15 Everybody likes a pretty picture!
This is the social network of some of the most active participants on the Apache developer mailing list. Each link indicates at least 150 messages between participants. Ryan Bloom has high betweenness in this network. Of the participants shown, he has the highest number of source file commits.

16 The distributions of in-degree and out-degree both exhibit power-law character.

17 Status of Developers vs. Non-Developers
(Table: Betweenness, Out-degree, and In-degree, each compared for Developer vs. Non-Developer)
The largest difference is in betweenness.

18 Correlation between communication and development
(Table: correlation matrix over Changes, Src Changes, Doc Changes, Out-degree, In-degree, and Betweenness)
- High correlation between betweenness and source file changes
- Lower correlation between betweenness and document file changes
- Similar relationship for in- and out-degree
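A correlation of this kind can be computed per participant from two parallel lists, one holding a network metric and one holding a change count. The slides do not say which correlation coefficient was used, so the sketch below uses the Pearson coefficient purely as an assumption, and the sample numbers are made up for illustration rather than taken from the Apache data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)


# Made-up values for four hypothetical participants (not real project data).
toy_betweenness = [0.6, 0.2, 0.2, 0.0]
toy_source_changes = [120, 30, 45, 0]
print(pearson(toy_betweenness, toy_source_changes))
```

Since degree and betweenness distributions in such networks are heavily skewed, a rank-based coefficient may be a more suitable choice in practice.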

19 Observations from the network
- The mailing list activity reflects a typical social network.
- Developers are the “key social brokers”.
- More active developers tend to be more important.
- The results are robust: Postgres showed similar results.

20 Topics of future research
- Visualization of software and social data
- Who becomes a developer?
- Relationship between communication and collaboration networks
- Network evolution
- Conway’s Law

21 Average In-Degree
(Figure: average in-degree plotted over months)