
1 Mining Email Social Networks in OSS
Christian Bird, Prem Devanbu, Alex Gourley, and Michael Gertz, Department of Computer Science
Anand Swaminathan, Graduate School of Management
University of California, Davis

2 Motivation
The social process is an important but hard-to-study aspect of any software engineering effort.
It can be studied in many stable and mature OSS projects.
Nearly all communication takes place over the internet.
Records of both communication and development activity are freely available.

3 Apache Communication and Development (since 1996)
100,000+ messages on the dev mailing list
70,000 CVS commits to files

4 It is widely believed that OSS communities form a hierarchy.
Can we use social network analysis to examine these OSS communities?
[Image from "Socialization in an Open Source Community", Nicolas Ducheneaut]

5 Social Networks
A social network consists of actors and their social ties to each other.
[Figure: network of who dated whom in a high school. Courtesy of Mark Newman]

6 Related Work
Xu, Gao, Christley, and Madey looked at developers who worked on the same projects.
Crowston & Howison treated the co-occurrence of developers on a bug report as a social link.
Lopez, Gonzalez-Barahona, & Robles created networks of developers and modules from CVS data.
We believe that a response to an email indicates a strong social link.
[Diagram: how each data source links Alice and Bob]
Project (Python): Alice and Bob both contribute; undirected link
Bug report: one submits and the other resolves; undirected link
File (foo.c): Alice and Bob both commit; undirected link
Mailing list: one posts and the other responds; directed link

7 Issues with Mailing List Analysis
Extracting conversation threads
Rationalizing timestamps
Identifying targets in a broadcast medium
Resolving email aliases
Extracting content

8 Email Aliases
2,544 different email address aliases have been used on the Apache dev mailing list since 1996.
Many of these email addresses belong to the same people.
The following email addresses were all used by Joe Orton:
jeo101@york.ac.uk
joe@orton.demon.co.uk
joe@light.plus.com
jorton@redhat.com
joe@manyfish.co.uk

9 Email Alias Analysis
1. Preprocess the name and address.
– Remove commas (“orton, joe” -> “joe orton”)
– Normalize whitespace and remove punctuation and common prefixes/suffixes (Mr., Jr., etc.)
– Remove common email terms (list, admin, root)
2. Use heuristics and fuzzy matching (Levenshtein edit distance) to determine which email aliases are similar.
– name-name: “joe orton” vs. “joe e. orton”
– email-email: “jorton@foo.com” vs. “jorton@bar.org”
– name-email: “joe orton” vs. “jorton@foo.com”
3. Manually post-process the aliases marked as similar to remove the many false positives.
4. Use a similar process to map CVS accounts to email aliases.
Each email sender is a (name, email address) tuple. Often the name is empty.
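The preprocessing and fuzzy-matching steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the function names, the stop-word lists, and the 0.8 similarity threshold are all assumptions chosen for the example, and the slide does not say how edit distance was turned into a "similar" decision.

```python
import re

# Hypothetical stop-word lists; the slide only names a few examples.
TITLES = {"mr", "mrs", "ms", "dr", "jr", "sr"}
EMAIL_TERMS = {"list", "admin", "root"}
STOP_WORDS = TITLES | EMAIL_TERMS


def normalize_name(name: str) -> str:
    """Lowercase, flip 'last, first', drop punctuation, titles and generic terms."""
    name = name.lower().strip()
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    name = re.sub(r"[^\w\s]", " ", name)  # punctuation becomes whitespace
    words = [w for w in name.split() if w not in STOP_WORDS]
    return " ".join(words)  # split/join also normalizes whitespace


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def names_similar(n1: str, n2: str, threshold: float = 0.8) -> bool:
    """name-name comparison; the threshold is an assumed value."""
    return similarity(normalize_name(n1), normalize_name(n2)) >= threshold


def emails_similar(e1: str, e2: str, threshold: float = 0.8) -> bool:
    """email-email comparison on the part before '@' (an assumed heuristic)."""
    local1 = e1.lower().split("@")[0]
    local2 = e2.lower().split("@")[0]
    return similarity(local1, local2) >= threshold
```

With these definitions, `names_similar("joe orton", "joe e. orton")` and `emails_similar("jorton@foo.com", "jorton@bar.org")` both return `True`. A loose threshold like this will also flag unrelated people, which is consistent with the manual post-processing step the slide calls for.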

10 Alias Results
2,544 email aliases used
2,008 unique “identities” used
Many of the high-volume participants had a large number of aliases.

11 Creating the Email Social Network
Each email message has a message ID.
A response message contains an “In-Reply-To” header, which includes the message ID of the previous message.
If Joe posts a message and Bob responds, this indicates information flow, and we create a directed tie from Joe to Bob.
We have built a tool that creates a directed, valued adjacency matrix of participants from our mailing list database for any time period.
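As a rough sketch of how such a matrix could be built from reply headers: the message representation below (dicts with `id`, `sender`, and `in_reply_to` keys) is hypothetical, senders are assumed to be already resolved to unique identities, and skipping self-replies is an assumption rather than something the slide states.

```python
from collections import defaultdict


def build_reply_matrix(messages):
    """Return (actors, matrix) where matrix[i][j] counts replies by actor j
    to messages posted by actor i (a tie directed from poster to responder)."""
    messages = list(messages)
    sender_of = {m["id"]: m["sender"] for m in messages}

    weights = defaultdict(int)
    for m in messages:
        parent = m.get("in_reply_to")
        if parent is None or parent not in sender_of:
            continue  # thread starter, or parent message not in the archive
        poster, responder = sender_of[parent], m["sender"]
        if poster == responder:
            continue  # assumption: replies to oneself create no tie
        weights[(poster, responder)] += 1

    actors = sorted(set(sender_of.values()))
    index = {a: i for i, a in enumerate(actors)}
    matrix = [[0] * len(actors) for _ in actors]
    for (src, dst), w in weights.items():
        matrix[index[src]][index[dst]] = w
    return actors, matrix
```

Restricting `messages` to a date range before calling the function would give the matrix for a particular time period.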

12 Intro to Social Network Metrics
In-degree – the number of links whose head is connected to a particular actor
Out-degree – the number of links whose tail is connected to a particular actor
Geodesic – a shortest path between two actors
Betweenness – the number of geodesics that a particular actor lies on
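For a valued adjacency matrix in which `matrix[i][j]` is the weight of the tie from actor `i` to actor `j`, the two degree measures could be computed as below. This sketch counts distinct ties and ignores their weights, which is one possible reading of the definitions; a weighted variant would sum the entries instead.

```python
def out_degree(matrix, i):
    """Number of ties leaving actor i (nonzero entries in row i)."""
    return sum(1 for w in matrix[i] if w > 0)


def in_degree(matrix, i):
    """Number of ties arriving at actor i (nonzero entries in column i)."""
    return sum(1 for row in matrix if row[i] > 0)
```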

13 Example
[Figure: example network of 12 numbered nodes, with labels marking nodes of high out-degree, high betweenness, and high in-degree]

14 Betweenness more formally
For a given vertex i, the betweenness is
$$ B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}} $$
where $\sigma_{st}$ is the number of geodesics between s and t,
and $\sigma_{st}(i)$ is the number of those paths passing through vertex i.
It is common to normalize the values so that the betweenness of all vertices sums to 1.
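A brute-force sketch of this computation, for small unweighted directed graphs given as a dict mapping each vertex to a list of successors (a hypothetical representation chosen for the example). For every ordered pair (s, t) it adds, to each intermediate vertex i, the fraction of s-t geodesics that pass through i, then optionally rescales so the scores sum to 1. It runs in cubic time in the number of vertices; Brandes' algorithm would be the usual choice for larger networks.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from source: distance to, and number of geodesics reaching, each vertex."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:              # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:     # u is a predecessor of v on a geodesic
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj, normalize=True):
    """Sum over ordered pairs (s, t) of the share of s-t geodesics through i."""
    nodes = set(adj) | {v for nbrs in adj.values() for v in nbrs}
    info = {s: shortest_path_counts(adj, s) for s in nodes}
    score = {i: 0.0 for i in nodes}

    for s in nodes:
        dist_s, sigma_s = info[s]
        for t in nodes:
            if t == s or t not in dist_s:
                continue                   # t unreachable from s
            for i in nodes:
                if i in (s, t) or i not in dist_s:
                    continue
                dist_i, sigma_i = info[i]
                # i lies on an s-t geodesic iff the distances add up exactly
                if t in dist_i and dist_s[i] + dist_i[t] == dist_s[t]:
                    score[i] += sigma_s[i] * sigma_i[t] / sigma_s[t]

    if normalize:
        total = sum(score.values())
        if total > 0:
            score = {i: b / total for i, b in score.items()}
    return score
```

On the chain `{"a": ["b"], "b": ["c"], "c": []}`, only `b` lies between another pair of vertices, so it receives all of the normalized betweenness.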

15 Everybody likes a pretty picture!
This is the social network of some of the most active participants on the Apache developer mailing list.
Each link indicates at least 150 messages between participants.
Ryan Bloom has high betweenness in this network. Of the participants shown, he has the highest number of source file commits.

16 The distributions of in-degree and out-degree both exhibit a power-law character.

17 Status of Developers vs. Non-Developers

               Developer   Non-Developer
Betweenness    0.0114      0.000140
Out-degree     0.00666     0.000451
In-degree      0.00794     0.000367

The largest difference is in betweenness.

18 Correlation between communication and development

               Changes   Src Changes   Doc Changes   Out-degree   In-degree   Betweenness
Changes        1
Src Changes    0.789     1
Doc Changes    0.932     0.514         1
Out-degree     0.520     0.712         0.308         1
In-degree      0.474     0.679         0.263         0.971        1
Betweenness    0.553     0.757         0.327         0.955        0.917       1

High correlation between betweenness and source file changes
Lower correlation between betweenness and document file changes
Similar relationship for in- and out-degree

19 Observations from the network
The mailing list activity reflects a typical social network.
Developers are the “key social brokers”.
More active developers tend to be more important.
The results are robust: Postgres showed similar results.

20 Topics of future research
Visualization of software and social data
Who becomes a developer?
Relationship between communication and collaboration networks
Network evolution
Conway’s Law

21 Average In-Degree
[Figure: average in-degree (y-axis) plotted against months (x-axis)]

