Supported in part by the National Science Foundation – ISS/Digital Science & Technology Analysis of the Open Source Software development community using.

Slides:



Advertisements
Similar presentations
The role of Domain Knowledge in a large scale Data Mining Project Kopanas I., Avouris N., Daskalaki S. University of Patras.
Advertisements

Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
The Evolution of Spatial Outlier Detection Algorithms - An Analysis of Design CSci 8715 Spatial Databases Ryan Stello Kriti Mehra.
NAACSOS 2005Scott Christley, Temporal Analysis of Social Positions An Algorithm for Temporal Analysis of Social Positions Scott Christley, Greg Madey Dept.
Week 9 Data Mining System (Knowledge Data Discovery)
© Prentice Hall1 DATA MINING TECHNIQUES Introductory and Advanced Topics Eamonn Keogh (some slides adapted from) Margaret Dunham Dr. M.H.Dunham, Data Mining,
Advanced Topics COMP163: Database Management Systems University of the Pacific December 9, 2008.
Towards Understanding: A Study of the SourceForge.net Community using Modeling and Simulation Yongqin Gao Greg Madey Computer Science & Engineering University.
Conceptual Framework for Agent- Based Modeling and Simulation: The Computer Experiment Yongqin GaoVincent Freeh Greg Madey CSE DepartmentCS Department.
Data Mining – Intro.
Text Mining: Finding Nuggets in Mountains of Textual Data Jochen Dijrre, Peter Gerstl, Roland Seiffert Presented by Huimin Ye.
Advanced Database Applications Database Indexing and Data Mining CS591-G1 -- Fall 2001 George Kollios Boston University.
GUHA method in Data Mining Esko Turunen Tampere University of Technology Tampere, Finland.
Data Mining: Concepts & Techniques. Motivation: Necessity is the Mother of Invention Data explosion problem –Automated data collection tools and mature.
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
OLAM and Data Mining: Concepts and Techniques. Introduction Data explosion problem: –Automated data collection tools and mature database technology lead.
Data Mining Techniques
Analysis and Modeling of the Open Source Software Community Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame Vincent Freeh.
«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
Data Mining Chun-Hung Chou
Tang: Introduction to Data Mining (with modification by Ch. Eick) I: Introduction to Data Mining A.Short Preview 1.Initial Definition of Data Mining 2.Motivation.
Cyber-Infrastructure for Agro-Threats Steve Goddard Computer Science & Engineering University of Nebraska-Lincoln.
Data Management Turban, Aronson, and Liang Decision Support Systems and Intelligent Systems, Seventh Edition.
Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.
Computers and Society Examine the extent to which Richard Stallman’s GNU manifesto has succeeded in challenging the dominance of conventionally distributed.
Page 1 WEB MINING by NINI P SURESH PROJECT CO-ORDINATOR Kavitha Murugeshan.
Tennessee Technological University1 The Scientific Importance of Big Data Xia Li Tennessee Technological University.
Dresden, ECCS’07 06/10/07 Science of complex systems for socially intelligent ICT Overview of background document Objective IST FET proactive.
Exploring Metropolitan Dynamics with an Agent- Based Model Calibrated using Social Network Data Nick Malleson & Mark Birkin School of Geography, University.
Chapter 1 Introduction to Data Mining
Introduction to Data Mining Group Members: Karim C. El-Khazen Pascal Suria Lin Gui Philsou Lee Xiaoting Niu.
Data Mining Process A manifestation of best practices A systematic way to conduct DM projects Different groups has different versions Most common standard.
Data Mining By : Tung, Sze Ming ( Leo ) CS 157B. Definition A class of database application that analyze data in a database using tools which look for.
A Graph-based Friend Recommendation System Using Genetic Algorithm
INTERACTIVE ANALYSIS OF COMPUTER CRIMES PRESENTED FOR CS-689 ON 10/12/2000 BY NAGAKALYANA ESKALA.
Data Mining By Dave Maung.
Topology and Evolution of the Open Source Software Community Advisors: Dr. Vincent W. Freeh Dr. Kevin Bowyer Supported in part by the National Science.
Ihr Logo Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization Turban, Aronson, and Liang.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
Last Words DM 1. Mining Data Steams / Incremental Data Mining / Mining sensor data (e.g. modify a decision tree assuming that new examples arrive continuously,
Agent 2004Scott Christley, Public Goods Theory of Open Source Community Public Goods Theory of the Open Source Development Community using Agent-based.
Data Mining BY JEMINI ISLAM. Data Mining Outline: What is data mining? Why use data mining? How does data mining work The process of data mining Tools.
Open Source Examples – Linux; Apache; Firefox Requirements – Distributed w/ source code – License allows for modifications (GPL) – License remains w/ any.
1 Introduction to Data Mining C hapter 1. 2 Chapter 1 Outline Chapter 1 Outline – Background –Information is Power –Knowledge is Power –Data Mining.
January 17, 2016Data Mining: Concepts and Techniques 1 What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting ( non-trivial,
Data Mining and Decision Support
Predicting the Location and Time of Mobile Phone Users by Using Sequential Pattern Mining Techniques Mert Özer, Ilkcan Keles, Ismail Hakki Toroslu, Pinar.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
A RESEARCH SUPPORT SYSTEM FRAMEWORK FOR WEB DATA MINING Jin Xu, Yingping Huang, Gregory Madey Department of Computer Science and Engineering University.
WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.
DataJewel 1 : Tightly Integrating Visualization with Temporal Data Mining Mihael Ankerst, David H. Jones, Anne Kao, Changzhou Wang 1 US patent pending.
WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.
A Research Collaboratory for Open Source Software Research Yongqin Gao, Matt van Antwerp, Scott Christley, Greg Madey Computer Science & Engineering University.
Lake Arrowhead 2005Scott Christley, Understanding Open Source Understanding the Open Source Software Community Presented by Scott Christley Dept. of Computer.
CS570: Data Mining Spring 2010, TT 1 – 2:15pm Li Xiong.
DATA MINING and VISUALIZATION Instructor: Dr. Matthew Iklé, Adams State University Remote Instructor: Dr. Hong Liu, Embry-Riddle Aeronautical University.
Data Mining – Intro.
DATA MINING © Prentice Hall.
Introduction Multimedia initial focus
Introduction C.Eng 714 Spring 2010.
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Mining Time-Series Databases
Mining and Analyzing Data from Open Source Software Repository
Data Warehousing and Data Mining
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
Data Mining: Concepts and Techniques
CSE591: Data Mining by H. Liu
Presentation transcript:

Supported in part by the National Science Foundation – ISS/Digital Science & Technology Analysis of the Open Source Software development community using ST mining: A Research Plan Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame NAACSOS Conference Notre Dame, IN June 26-28, 2005

Outline Background Background Motivation Motivation Problem definition Problem definition Research data Research data Methodology Methodology Conclusion Conclusion

Background (OSS) What is OSS? What is OSS? Free to use, modify and distribute Free to use, modify and distribute Source code available and modifiable Source code available and modifiable Potential advantages over commercial software Potential advantages over commercial software Transparent and easy adoption Transparent and easy adoption Fast development Fast development Low cost Low cost Potential high quality Potential high quality Why study OSS? Why study OSS? Software engineering — new development and coordination methods Software engineering — new development and coordination methods Open content — model for other forms of open, shared collaboration Open content — model for other forms of open, shared collaboration Complexity — successful example of self-organization/emergence Complexity — successful example of self-organization/emergence Growing popularity Growing popularity Non-traditional governance and project management practices Non-traditional governance and project management practices Virtual --> Data! Virtual --> Data!

Open Source Software (OSS) Free … Free … to view source to view source to modify to modify to share to share of cost of cost Examples Examples Apache Apache Perl Perl GNU GNU Linux Linux Sendmail Sendmail Python Python KDE KDE GNOME GNOME Mozilla Mozilla Thousands more Thousands more Linux GNU Savannah

Leaders Linus Tolvalds Linux Larry Wall Perl Richard Stallman GNU Manifesto Eric Raymond Cathedral and Bazaar

Success of Apache Almost 70% Market Share (Netcraft.com) Almost 70% Market Share (Netcraft.com)

Research Approach Opportunity: Huge amounts of relatively good data

SourceForge.net VA Software Part of OSDN Started 12/1999 Collaboration tools 100 K Projects 100 K Developers 1 M Registered Users

150 GBytes of Data & Growing

Scale free distribution: developer participation # projects # of developers on that many projects y = x R 2 = Log( # of Projects) Log(# of Developers) Scale Free – Power Law (developers)

Scale free distribution: project sizes Scale Free – Power Law (projects)

Background (DM) Characteristics of data set Characteristics of data set Incomplete, noisy, redundant Incomplete, noisy, redundant Complex structures, unstructured Complex structures, unstructured Heterogeneous Heterogeneous Database not designed for research, but to support project management services of SourceForge.net Database not designed for research, but to support project management services of SourceForge.net Temporal data is available, but not everything a researcher would want Temporal data is available, but not everything a researcher would want Inferencing/discovery of temporal data potentially valuable opportunity Inferencing/discovery of temporal data potentially valuable opportunity What is DM (Data mining) What is DM (Data mining) Nontrivial extraction of implicit, previously unknown and potentially useful information from data. Nontrivial extraction of implicit, previously unknown and potentially useful information from data.

Data Mining Procedure Raw data Relevant data Feature selection Algorithm application Result Evaluation Data Integration Data Pre-processing Database

Spatial-temporal DM (1) Temporal data mining Temporal data mining Discover the behavior-based knowledge instead of state-based knowledge. Discover the behavior-based knowledge instead of state-based knowledge. Example: many wolves -> fewer rabbits Example: many wolves -> fewer rabbits Relationship between timely feedback and quality of software/success of the OSS project Relationship between timely feedback and quality of software/success of the OSS project

Spatio-temporal DM New research domain: Spatio-temporal data mining New research domain: Spatio-temporal data mining Growing interest in spatio-temporal data mining Growing interest in spatio-temporal data mining Recommender systems Recommender systems Location based services Location based services Time based services Time based services GIS applications GIS applications Extension of classic data mining techniques into data set with spatial and temporal properties. Extension of classic data mining techniques into data set with spatial and temporal properties. Challenges: complexity of spatial information and difficulty in reasoning temporal information, e.g., Challenges: complexity of spatial information and difficulty in reasoning temporal information, e.g., Intervals Intervals Points Points Hybrids Hybrids

Motivations Limitations of OSS research to date Limitations of OSS research to date Mostly feature based data mining to date Mostly feature based data mining to date Neglecting of the inherent spatial and temporal information in the OSS community Neglecting of the inherent spatial and temporal information in the OSS community SourceForge.net properties SourceForge.net properties Spatial information Spatial information Collaboration network Collaboration network Temporal information Temporal information History data and log tables History data and log tables

Spatial information in OSS? The collaboration network in SF The collaboration network in SF Study of the topology of the collaboration network. Study of the topology of the collaboration network. The network can be mapped as a graph The network can be mapped as a graph This graph is a non-Metric space This graph is a non-Metric space Spread of ideas (software engineering tools and practices, new project opportunities) Spread of ideas (software engineering tools and practices, new project opportunities)

Temporal information in OSS The network is evolving and the histories of the site and individual entities comprise the temporal information in the network. The network is evolving and the histories of the site and individual entities comprise the temporal information in the network. Discrete time points Discrete time points All the statistics are collected periodically. All the statistics are collected periodically. Partially ordered events Partially ordered events Multiple timelines existed in the system Multiple timelines existed in the system ? a b c d

ST Mining Different from classic data mining Different from classic data mining Spatial and temporal relationships are complicated Spatial and temporal relationships are complicated Metric and non-metric spatial relations Metric and non-metric spatial relations Temporal relations Temporal relations Intrinsic dependency and heterogeneity Intrinsic dependency and heterogeneity Scale effect in space and time Scale effect in space and time Significant modification of many data mining techniques are needed. Significant modification of many data mining techniques are needed.

Problem definition I Dependency analysis Dependency analysis Extension of associations to ST mining Extension of associations to ST mining Complicated associations Complicated associations Vertical (temporal) and horizontal (spatial) associations Vertical (temporal) and horizontal (spatial) associations Combination of vertical and horizontal associations Combination of vertical and horizontal associations Examples: lag effects between projects Examples: lag effects between projects Flexible associations Flexible associations Huge volume and scale effect of spatial-temporal data set introduce noise and error Huge volume and scale effect of spatial-temporal data set introduce noise and error Strict association is difficult to define Strict association is difficult to define

Problem definition II Topic of this study: prediction support Topic of this study: prediction support Clustering: group the projects with similar evolution. Clustering: group the projects with similar evolution. Summarization: summarize the representative characteristics of different project evolution patterns Summarization: summarize the representative characteristics of different project evolution patterns Prediction: predict the project evolution (based on the pattern discovered) Prediction: predict the project evolution (based on the pattern discovered)

Research Data SourceForge.net database dump June 2005 SourceForge.net database dump June tables 117 tables Records up to 30 million per table Records up to 30 million per table 23 Gigabytes 23 Gigabytes PostgreSQL PostgreSQL Three types of tables Three types of tables Data tables Data tables History tables History tables Statistics tables Statistics tables

Methodology Project development statistics Project development statistics Numerical statistics. Numerical statistics. Expertise and survey statistics. Expertise and survey statistics. Time series analysis Time series analysis Generate the time series for these statistics Generate the time series for these statistics Classification generation Classification generation ABN algorithm used ABN algorithm used Classifier evaluation Classifier evaluation Evaluation by comparing the predicted class with the actual class Evaluation by comparing the predicted class with the actual class

Numerical statistics Statistics tables have the information about project history Statistics tables have the information about project history Stats_project_months Stats_project_months Every record stands for a monthly history of a single project Every record stands for a monthly history of a single project Records from November 1999 to June 2005 Records from November 1999 to June 2005 There are 24 attributes in every record There are 24 attributes in every record Descriptive attributes (3) Descriptive attributes (3) Statistics (numeric) attributes (21) Statistics (numeric) attributes (21) We use the statistics attributes We use the statistics attributes

Statistics Attributes Attributes DevelopersPatches_opened DownloadsPatches_closed Subdomain_ViewsArtifacts_opened Page_viewsArtifacts_closed File_releasesTasks_opened Msg_postedTasks_closed Bug_openedHelp_requests Bug_closedCVS_checkouts Support_openedCVS_commits Site_viewsCVS_adds Support_closed

Expertise statistics Rating scores Rating scores Expertise rating Expertise rating User rating User rating Importance parameter Importance parameter Domain importance Domain importance Contribution parameter Contribution parameter

Time Series Time series used to describe the history of each attribute. Time series used to describe the history of each attribute. Time series: an ordered sequence of values of a variable at equally spaced time intervals. Time series: an ordered sequence of values of a variable at equally spaced time intervals. The available monthly values of each statistic is used to generate the time series. The available monthly values of each statistic is used to generate the time series. Goal is to study the project history patterns. Goal is to study the project history patterns. Description Description Prediction Prediction

Conclusion Project prediction using ST mining Project prediction using ST mining We used statistics to predict the project development We used statistics to predict the project development Calibration using new data is important to keep the prediction valid. Calibration using new data is important to keep the prediction valid.

Questions