Supported in part by the National Science Foundation – ISS/Digital Science & Technology Analysis of the Open Source Software development community using.


1 Analysis of the Open Source Software development community using ST mining: A Research Plan
Yongqin Gao, Greg Madey
Computer Science & Engineering, University of Notre Dame
NAACSOS Conference, Notre Dame, IN, June 26-28, 2005
Supported in part by the National Science Foundation – ISS/Digital Science & Technology

2 Outline
- Background
- Motivation
- Problem definition
- Research data
- Methodology
- Conclusion

3 Background (OSS)
- What is OSS?
  - Free to use, modify and distribute
  - Source code available and modifiable
- Potential advantages over commercial software
  - Transparent and easy adoption
  - Fast development
  - Low cost
  - Potential high quality
- Why study OSS?
  - Software engineering: new development and coordination methods
  - Open content: model for other forms of open, shared collaboration
  - Complexity: successful example of self-organization/emergence
  - Growing popularity
  - Non-traditional governance and project management practices
  - Virtual --> Data!

4 Open Source Software (OSS)
- Free …
  - to view source
  - to modify
  - to share
  - of cost
- Examples
  - Apache
  - Perl
  - GNU
  - Linux
  - Sendmail
  - Python
  - KDE
  - GNOME
  - Mozilla
  - Thousands more
[Images: Linux, GNU, Savannah]

5 Leaders
- Linus Torvalds: Linux
- Larry Wall: Perl
- Richard Stallman: GNU Manifesto
- Eric Raymond: The Cathedral and the Bazaar

6 Success of Apache
- Almost 70% market share (Netcraft.com)

7 Research Approach
- Opportunity: huge amounts of relatively good data

8 SourceForge.net
- VA Software
- Part of OSDN
- Started 12/1999
- Collaboration tools
- 100 K Projects
- 100 K Developers
- 1 M Registered Users

9 150 GBytes of Data & Growing

10

11 Scale free distribution: developer participation
[Table: # projects : # of developers on that many projects]
1: 21488, 2: 3688, 3: 1086, 4: 413, 5: 177, 6: 76, 7: 35, 8219, 10: 6, 11: 5, 12: 6, 15: 1, 16: 1, 17: 1
[Figure: Scale Free – Power Law (developers); axes Log(# of Projects) and Log(# of Developers); fitted line y = 10.6905 - 3.70892 x, R^2 = 0.979906]
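A fitted line of this kind can be reproduced, at least approximately, with an ordinary least-squares fit in log-log space. The sketch below assumes natural logarithms and assumes a particular reading of the flattened table (the digit run "8219" is left out because its split is unclear), so its coefficients need not match the slide exactly. The function name `fit_log_log` is illustrative only.

```python
import math

# Assumed reading of the slide's table:
# projects per developer -> number of developers on that many projects.
# The ambiguous "8219" entry is deliberately omitted.
participation = {1: 21488, 2: 3688, 3: 1086, 4: 413, 5: 177, 6: 76, 7: 35,
                 10: 6, 11: 5, 12: 6, 15: 1, 16: 1, 17: 1}


def fit_log_log(counts):
    """Least-squares fit of ln(count) on ln(k); returns (intercept, slope, r_squared)."""
    xs = [math.log(k) for k in counts]
    ys = [math.log(v) for v in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot


intercept, slope, r2 = fit_log_log(participation)
print(f"ln(developers) = {intercept:.3f} + ({slope:.3f}) * ln(projects), R^2 = {r2:.3f}")
```

A strongly negative slope with R^2 close to 1 is what the slide's "scale free" label refers to: the number of developers falls off roughly as a power of the number of projects they work on.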

12 Scale free distribution: project sizes
[Figure: Scale Free – Power Law (projects)]

13 Background (DM)
- Characteristics of data set
  - Incomplete, noisy, redundant
  - Complex structures, unstructured
  - Heterogeneous
  - Database not designed for research, but to support project management services of SourceForge.net
  - Temporal data is available, but not everything a researcher would want
  - Inferring/discovering temporal data is a potentially valuable opportunity
- What is DM (data mining)?
  - Nontrivial extraction of implicit, previously unknown and potentially useful information from data.

14 Data Mining Procedure
[Diagram labels: Raw data, Relevant data, Feature selection, Algorithm application, Result Evaluation, Data Integration, Data Pre-processing, Database]

15 Spatial-temporal DM (1)
- Temporal data mining
  - Discover behavior-based knowledge instead of state-based knowledge.
  - Example: many wolves -> fewer rabbits
  - Relationship between timely feedback and quality of software/success of the OSS project

16 Spatio-temporal DM
- New research domain: spatio-temporal data mining
- Growing interest in spatio-temporal data mining
  - Recommender systems
  - Location based services
  - Time based services
  - GIS applications
- Extension of classic data mining techniques to data sets with spatial and temporal properties.
- Challenges: complexity of spatial information and difficulty in reasoning about temporal information, e.g.,
  - Intervals
  - Points
  - Hybrids

17 Motivations
- Limitations of OSS research to date
  - Mostly feature-based data mining to date
  - Neglect of the inherent spatial and temporal information in the OSS community
- SourceForge.net properties
  - Spatial information
    - Collaboration network
  - Temporal information
    - History data and log tables

18 Spatial information in OSS?
- The collaboration network in SF
  - Study of the topology of the collaboration network.
  - The network can be mapped as a graph
  - This graph is a non-metric space
  - Spread of ideas (software engineering tools and practices, new project opportunities)
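One way to picture the collaboration network as a graph: link two developers whenever they share a project, and treat "distance" as the number of hops between them rather than as coordinates. The sketch below is only an illustration under that assumption. The membership pairs and the function names are hypothetical and do not come from the SourceForge data.

```python
from collections import deque
from itertools import combinations

# Hypothetical (developer, project) membership pairs for illustration.
memberships = [
    ("alice", "p1"), ("bob", "p1"),
    ("bob", "p2"), ("carol", "p2"),
    ("dave", "p3"),
]


def build_collaboration_graph(pairs):
    """Return {developer: set of co-developers} from (developer, project) pairs."""
    project_members = {}
    for developer, project in pairs:
        project_members.setdefault(project, set()).add(developer)
    graph = {}
    for members in project_members.values():
        for developer in members:
            graph.setdefault(developer, set())  # keep solo developers as nodes
        for a, b in combinations(sorted(members), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hop_distance(graph, source, target):
    """Breadth-first search; number of hops, or None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


graph = build_collaboration_graph(memberships)
print(hop_distance(graph, "alice", "carol"))  # 2: alice and carol are linked through bob
print(hop_distance(graph, "alice", "dave"))   # None: dave shares no project with the others
```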

19 Temporal information in OSS
- The network is evolving, and the histories of the site and individual entities comprise the temporal information in the network.
- Discrete time points
  - All the statistics are collected periodically.
- Partially ordered events
  - Multiple timelines exist in the system
[Diagram labels: ? a b c d]

20 ST Mining
- Different from classic data mining
- Spatial and temporal relationships are complicated
  - Metric and non-metric spatial relations
  - Temporal relations
- Intrinsic dependency and heterogeneity
- Scale effect in space and time
- Significant modification of many data mining techniques is needed.

21 Problem definition I
- Dependency analysis
  - Extension of associations to ST mining
- Complicated associations
  - Vertical (temporal) and horizontal (spatial) associations
  - Combination of vertical and horizontal associations
  - Examples: lag effects between projects
- Flexible associations
  - Huge volume and scale effect of spatial-temporal data sets introduce noise and error
  - Strict association is difficult to define
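A lag effect between two projects can be illustrated with a lagged correlation: shift one project's monthly series against the other's and see which shift lines them up best. This is a minimal sketch of that general idea, not the association-mining method the slides propose. The two series and the function names are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_lag(leader, follower, max_lag):
    """Correlate leader[t] with follower[t + lag] for lag = 0..max_lag; return the best."""
    scores = {}
    for lag in range(max_lag + 1):
        xs = leader[:len(leader) - lag]
        ys = follower[lag:]
        scores[lag] = pearson(xs, ys)
    winner = max(scores, key=scores.get)
    return winner, scores[winner]


# Hypothetical monthly activity: project_b repeats project_a two months later.
project_a = [1, 3, 2, 5, 4, 7, 6, 9, 8, 10]
project_b = [0, 0] + project_a[:-2]

lag, score = best_lag(project_a, project_b, max_lag=3)
print(f"best lag = {lag} months, correlation = {score:.3f}")
```

On noisy real data the peak would be far less sharp, which is one reading of the slide's point that strict associations are difficult to define.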

22 Problem definition II
- Topic of this study: prediction support
  - Clustering: group the projects with similar evolution.
  - Summarization: summarize the representative characteristics of different project evolution patterns
  - Prediction: predict the project evolution (based on the patterns discovered)
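As a rough picture of the clustering step, the sketch below groups projects whose activity curves have a similar shape. It swaps in plain k-means on peak-normalized series, which the slides do not name as the method. The project names and numbers are hypothetical.

```python
import math


def normalize(series):
    """Scale a series by its peak so that only the shape of the curve matters."""
    peak = max(series)
    return [v / peak for v in series] if peak else list(series)


def distance(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(series_by_project, k, iterations=20):
    """Plain k-means; the first k projects seed the centroids (deterministic, order-dependent)."""
    names = list(series_by_project)
    points = {name: normalize(series_by_project[name]) for name in names}
    centroids = [points[name] for name in names[:k]]
    assignment = {}
    for _ in range(iterations):
        new_assignment = {
            name: min(range(k), key=lambda c: distance(points[name], centroids[c]))
            for name in names
        }
        if new_assignment == assignment:
            break  # converged: no project changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [points[name] for name in names if assignment[name] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical monthly download counts for four projects.
projects = {
    "growing_a": [1, 2, 4, 8, 16, 32],
    "declining_a": [30, 20, 12, 6, 3, 1],
    "growing_b": [2, 5, 9, 20, 41, 80],
    "declining_b": [50, 30, 18, 9, 4, 2],
}

clusters = kmeans(projects, k=2)
print(clusters)
```

The resulting centroids double as a crude summary of each evolution pattern, which is where the summarization and prediction steps would pick up.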

23 Research Data
- SourceForge.net database dump, June 2005
  - 117 tables
  - Up to 30 million records per table
  - 23 Gigabytes
  - PostgreSQL
- Three types of tables
  - Data tables
  - History tables
  - Statistics tables

24 Methodology
- Project development statistics
  - Numerical statistics.
  - Expertise and survey statistics.
- Time series analysis
  - Generate the time series for these statistics
- Classification generation
  - ABN algorithm used
- Classifier evaluation
  - Evaluation by comparing the predicted class with the actual class
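The evaluation step, comparing predicted classes with actual classes, can be sketched independently of the classifier. The example below does not implement ABN. It only shows accuracy and confusion counts for a set of predictions, with hypothetical class labels.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return (accuracy, confusion) where confusion counts (actual, predicted) pairs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(count for (a, p), count in confusion.items() if a == p)
    return correct / len(actual), confusion


# Hypothetical class labels for five projects.
actual = ["growing", "growing", "stable", "declining", "stable"]
predicted = ["growing", "stable", "stable", "declining", "stable"]

accuracy, confusion = evaluate(actual, predicted)
print(f"accuracy = {accuracy:.2f}")  # 4 of 5 predictions match
for (a, p), count in sorted(confusion.items()):
    print(f"actual={a:<10} predicted={p:<10} count={count}")
```

The confusion counts show which classes get mistaken for which, something a single accuracy figure hides.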

25 Numerical statistics
- Statistics tables have the information about project history
  - Stats_project_months
  - Every record stands for a monthly history of a single project
  - Records from November 1999 to June 2005
- There are 24 attributes in every record
  - Descriptive attributes (3)
  - Statistics (numeric) attributes (21)
- We use the statistics attributes

26 Statistics
Attributes         Attributes
Developers         Patches_opened
Downloads          Patches_closed
Subdomain_Views    Artifacts_opened
Page_views         Artifacts_closed
File_releases      Tasks_opened
Msg_posted         Tasks_closed
Bug_opened         Help_requests
Bug_closed         CVS_checkouts
Support_opened     CVS_commits
Site_views         CVS_adds
Support_closed

27 Expertise statistics
- Rating scores
  - Expertise rating
  - User rating
- Importance parameter
  - Domain importance
- Contribution parameter

28 Time Series
- Time series used to describe the history of each attribute.
  - Time series: an ordered sequence of values of a variable at equally spaced time intervals.
  - The available monthly values of each statistic are used to generate the time series.
- Goal is to study the project history patterns.
  - Description
  - Prediction
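Turning monthly records into an equally spaced series might look like the sketch below. The record layout (`project_id`, `year`, `month`, `downloads`) is a hypothetical stand-in, since the actual column names of the statistics table are not given here. Marking months without a record as `None` is likewise an assumption rather than the authors' stated choice.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def build_series(records, project_id, attribute):
    """Equally spaced monthly series for one project; months with no record become None."""
    values = {
        (r["year"], r["month"]): r[attribute]
        for r in records
        if r["project_id"] == project_id
    }
    if not values:
        return []
    return [values.get(m) for m in month_range(min(values), max(values))]


# Hypothetical monthly records (field names are illustrative only).
records = [
    {"project_id": 1, "year": 2004, "month": 11, "downloads": 10},
    {"project_id": 1, "year": 2004, "month": 12, "downloads": 14},
    {"project_id": 1, "year": 2005, "month": 2, "downloads": 9},
    {"project_id": 2, "year": 2005, "month": 1, "downloads": 3},
]

print(build_series(records, 1, "downloads"))  # [10, 14, None, 9]
```

Keeping the gap explicit preserves the equal spacing that the definition of a time series requires. How to fill it (zero, interpolation, or dropping the project) is a separate modelling decision.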

29 Conclusion
- Project prediction using ST mining
- We used statistics to predict the project development
- Calibration using new data is important to keep the prediction valid.

30 Questions

