Developer Identification Methods for Integrated Data from Various Sources Gregorio Robles Jesus M. Gonzalez-Barahona Presented by Brian Chan Cisc 864.


1 Developer Identification Methods for Integrated Data from Various Sources Gregorio Robles Jesus M. Gonzalez-Barahona Presented by Brian Chan Cisc 864

2 Table of Contents
- Background Information
- Problems Addressed
- Motivation
- Data Gathered
- Conclusions
- Personal Thoughts
- Questions and Comments

3 Background Information
- Data mining for a project usually draws on a single source of data
- Results can be applied to libre software
- Sources are looked at separately:
  - Mailing lists
  - Bug repositories

4 Background Information
- Libre software shows a Pareto law for commits: for each major artifact, 20% of the developers are shown to contribute 80% of the activity in it.
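The 20/80 claim can be checked on any commit log with a few lines of code. The sketch below is only an illustration: the function name `top_share`, the developer names, and the commit counts are all made up and do not come from the paper.

```python
def top_share(commits_per_developer, top_fraction=0.2):
    """Return the fraction of all commits made by the most active developers."""
    counts = sorted(commits_per_developer.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    # Always count at least one developer as "top".
    top_n = max(1, round(len(counts) * top_fraction))
    return sum(counts[:top_n]) / total


# Made-up commit counts for ten hypothetical developers.
commits = {
    "dev01": 400, "dev02": 400, "dev03": 50, "dev04": 40, "dev05": 30,
    "dev06": 30, "dev07": 20, "dev08": 15, "dev09": 10, "dev10": 5,
}
print(f"Top 20% of developers made {top_share(commits):.0%} of the commits")
```

The result is only meaningful once all identities of the same person have been merged, which is the problem the rest of the talk addresses.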

5 Problems Addressed
- Are the people who commit so much in one artifact the same people who are active in the other artifacts?
- People use different identities in each artifact
- Current mining techniques focus on one artifact, so they cannot tell who is who

6 Motivation
- To gain insight into the social network and structure of libre software projects
- To find all the identities that correspond to one person
- To focus more on data analysis than on the extraction process

7 Data Gathered
- An actor has access to artifacts
- Alternate rules for each artifact
[Figure 1.0]

8 Data Gathered
- An actor can post on more than one mailing list: bylchan@ca.ibm.com, briancha@ca.ibm.com
- Source files can appear with many identities: Brian Chan, Brian, bchan
- Interaction with the versioning repository occurs through an account on the server machine
- Bug tracking systems (e.g. Bugzilla) require an email address

9 Data Gathered
- Primary: required information
- Secondary: not required for the transaction, e.g. the name in an email
[Figure 2.0]
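For a mailing-list post, the primary/secondary split can be sketched with Python's standard `email.utils.parseaddr`. The function name `split_identities` and the dictionary keys are assumptions made for this example, not names used in the paper.

```python
from email.utils import parseaddr


def split_identities(from_header):
    """Split a From: header into a primary and a secondary identity.

    The address is treated as primary because a post cannot be made
    without it; the display name is secondary because it is optional.
    """
    name, address = parseaddr(from_header)
    return {"primary": address.lower(), "secondary": name or None}


print(split_identities("Brian Chan <bylchan@ca.ibm.com>"))
print(split_identities("briancha@ca.ibm.com"))
```

A header without a display name simply yields no secondary identity.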

10 Data Gathered (cont’d)
- An automated process extracts the data into a data repository
[Figure 3.0]

11 Data Gathered
- Sources table: lists where the identity information was originally extracted from, e.g. file1.C, bugreport230
- Identifications table: Identity; Id (key to the Sources table)

12 Data Gathered
- Persons: gender, nationality, hash
- Identifications: pseudo-identity (e.g. bchan); match number with another identity
- Matches: tells which two identities belong to the same actor

Table 1.0
1 | bchan      | bylchan@ca.ibm.com | Deduction  | 80%
1 | Brian Chan | bylchan@ca.ibm.com | Same Email | 90%
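The tables on the last two slides could be laid out roughly as follows. This is a sketch using an in-memory SQLite database: every table and column name here is an assumption made for illustration, since the slides do not give the actual schema. Only the two example rows come from Table 1.0.

```python
import sqlite3

# Hypothetical schema: table and column names are assumptions.
SCHEMA = """
CREATE TABLE persons (
    person_id INTEGER PRIMARY KEY, gender TEXT, nationality TEXT, hash TEXT);
CREATE TABLE identifications (
    identity_id INTEGER PRIMARY KEY, identity TEXT UNIQUE);
CREATE TABLE sources (
    identity_id INTEGER REFERENCES identifications(identity_id), origin TEXT);
CREATE TABLE matches (
    person_id INTEGER REFERENCES persons(person_id),
    identity_a INTEGER REFERENCES identifications(identity_id),
    identity_b INTEGER REFERENCES identifications(identity_id),
    evidence TEXT, confidence REAL);
"""


def build_example():
    """Create the database and load the two rows shown in Table 1.0."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO persons (person_id) VALUES (1)")
    conn.executemany(
        "INSERT INTO identifications VALUES (?, ?)",
        [(1, "bchan"), (2, "Brian Chan"), (3, "bylchan@ca.ibm.com")])
    conn.executemany(
        "INSERT INTO matches VALUES (?, ?, ?, ?, ?)",
        [(1, 1, 3, "Deduction", 0.80), (1, 2, 3, "Same Email", 0.90)])
    return conn


def identities_of(conn, person_id, min_confidence=0.0):
    """List every identity linked to a person by a sufficiently confident match."""
    rows = conn.execute(
        """SELECT DISTINCT i.identity_id, i.identity
           FROM matches m
           JOIN identifications i
             ON i.identity_id IN (m.identity_a, m.identity_b)
           WHERE m.person_id = ? AND m.confidence >= ?
           ORDER BY i.identity_id""",
        (person_id, min_confidence)).fetchall()
    return [identity for _, identity in rows]


conn = build_example()
print(identities_of(conn, 1))
print(identities_of(conn, 1, min_confidence=0.85))
```

Storing the evidence and a confidence value with each match keeps low-confidence links available for later human verification instead of discarding them.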

13 Data Gathered
- Matching during the automated data gathering process:
  - Inference
  - Automatic heuristics
  - Human verification

14 Data Gathered
- Rule 1: Primary identities may contain part of the real name: Example User <username@example.com>
- Rule 2: Identities can be built from another one: nsurname@example.com, name.surname@example.com, name_surname@example.com
- Rule 3: Some projects or repositories have the foresight to keep lists of information that can be used for matching
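Rule 2 can be sketched as a small heuristic that builds likely logins from a real name and compares them with the local part of an email address. The function names are hypothetical, and the underscore variant is an assumption alongside the two patterns that are unambiguous on the slide. A hit is only a candidate match, not proof that two identities belong to the same person.

```python
def candidate_logins(real_name):
    """Build login names that could plausibly be derived from a real name."""
    parts = real_name.lower().split()
    if len(parts) < 2:
        return set(parts)
    first, last = parts[0], parts[-1]
    return {
        first[0] + last,      # nsurname
        first + "." + last,   # name.surname
        first + "_" + last,   # name_surname (assumed variant)
    }


def may_match(real_name, email):
    """Return True if the email's local part looks derived from the name."""
    local = email.split("@", 1)[0].lower()
    return local in candidate_logins(real_name)


print(may_match("Brian Chan", "bchan@example.com"))
print(may_match("Brian Chan", "brian.chan@example.com"))
print(may_match("Brian Chan", "bylchan@ca.ibm.com"))
```

The last call shows the limit of the heuristic: an address such as bylchan@ca.ibm.com is not derivable from the name alone, so it has to be linked through other evidence, such as the shared email address in Table 1.0.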

15 Data Gathered
- The matching algorithms still produce errors, but in a statistical gathering process an error that is small enough can be ignored
- Cleaning and verification are still used

16 Data Gathered
- Privacy issues:
  - Hash value (1st firewall): used to reference information, so Identifications cannot be referenced directly
  - Person ID (2nd firewall): given in such a way that the real identity cannot be inferred without direct access to the Identifications table; given to a unique person so hackers cannot find a specific id
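The first firewall can be sketched as follows. The slide only says "hash value"; this example swaps in a keyed hash (HMAC-SHA256 from Python's standard library) so that short, guessable identities such as bchan cannot be recovered by hashing a list of candidates without the key. The function name `pseudonym` and the key handling are assumptions, not the paper's design.

```python
import hashlib
import hmac


def pseudonym(identity, secret_key):
    """Return a stable, opaque reference for an identity.

    The same identity always maps to the same value, so records can be
    joined on it, but the value does not reveal the identity itself.
    """
    normalized = identity.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Hypothetical key; in practice it would be stored apart from the data.
key = b"example-secret-key"
print(pseudonym("bylchan@ca.ibm.com", key))
```

Analysis tables would then carry only the pseudonym, while the mapping back to the real identity stays in the restricted Identifications table.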

17 Conclusions
- Actors in libre software may use many different identities for development
- The paper deals with the design of how to account for all the different people and who is actually doing what
- It discusses how privacy can be dealt with

18 Personal Thoughts
- Good points:
  - Effective solution
  - Good examination of all the different identities in business
  - Unique interpretation of data mining

19 Personal Thoughts
- Points for improvement:
  - No actual ‘data’ to view results
  - The authors reference GNOME but never actually give statistical information from it
  - Some interpretation is left to the reader

20 Questions and Comments

