1 KDDML: A Middleware Language and System for Knowledge Discovery in Databases Dipartimento di Informatica, Università di Pisa A. Romei, S. Ruggieri, F. Turini Thirteenth Italian Symposium on Sistemi Evoluti per Basi di Dati (SEBD-2005) Brixen, Italy – 19-22 June, 2005

2 SEBD 2005 - Brixen, June 2005 Application Area: KDD Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, understandable patterns in data.

3 The CRISP-DM process. Main focus on the automatic phases: data pre-processing, modeling, post-processing, model evaluation.

4 In this work. KDDML: an XML-based middleware language and system in support of the KDD process. KDDML as a language. KDDML as a system.

5 Requirements.
R1: a data/model repository should be available for storing the input, output and intermediate objects of the KDD process. Several representations of data can be available, with automatic format conversions and automatic meta-data mapping (e.g., ARFF, SQL).
R2: logical meta-data (meta-model) should be specified in addition to the physical data (model).
R3: mining operations should be compositional in the design of the language (closure principle).
R4: the system architecture should be highly extensible.

6 KDDML as XML-based System. XML as data/model representation (R1, R2). Machine-processable language. XML as language definition. Ensures compositionality of operators (R3). Extensibility and modularity (R4).

7 Data/Model Representation

8 Data Format. Separating the logical data from the physical instances. The data schema is expressed in a proprietary XML format. The actual data is stored in CSV (Comma Separated Values). CSV has been chosen as a trade-off between readability (poor in a binary file) and space occupation (high in XML).
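
The split between logical metadata and physical data can be illustrated with a minimal sketch. The slides do not show the schema format, so the element names (`table`, `attribute`) and the attribute names (`name`, `type`) below are hypothetical, as are the sample rows; only the XML-schema-plus-CSV split comes from the slide.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical schema: element and attribute names are illustrative only,
# not the actual KDDML schema format.
SCHEMA_XML = """
<table name="weather">
  <attribute name="outlook" type="nominal"/>
  <attribute name="temperature" type="numeric"/>
  <attribute name="play" type="nominal"/>
</table>
"""

# Physical instances: plain CSV rows with no header (an assumption).
CSV_DATA = "sunny,85,no\novercast,83,yes\nrainy,70,yes\n"


def load_table(schema_xml, csv_text):
    """Combine logical metadata (XML) with physical rows (CSV)."""
    schema = ET.fromstring(schema_xml)
    columns = [(a.get("name"), a.get("type")) for a in schema.findall("attribute")]
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        row = {}
        for (name, kind), value in zip(columns, record):
            # Cast each value according to the type declared in the schema.
            row[name] = float(value) if kind == "numeric" else value
        rows.append(row)
    return columns, rows


columns, rows = load_table(SCHEMA_XML, CSV_DATA)
```

Keeping the types in the schema means the CSV file stays compact and human-readable, while the loader still knows how to interpret each column.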

9 Data Format: Example. [Example omitted in transcript; panel labels: Logical Metadata, Physical Data.]

10 Model Format. PMML (Predictive Model Markup Language): an industry standard for representing actual models as XML documents. Consists of DTDs for a wide spectrum of models, including association rules (RdA), decision trees, clustering, regression, neural networks. It does not cover the process of extracting models, only the exchange of extracted knowledge.

11 Model Format: Example. [Example omitted in transcript; panel labels: Logical Metadata, Physical Model.]

12 Language

13 Closure Principle (1). Arguments of an operator must be of an appropriate type and sequence. We denote the signature of an operator as $op : t_1 \times \dots \times t_n \to t$ and enforce it by defining a DTD for KDDML queries that constrains the sub-elements to be of types $t_1, \dots, t_n$.

14 Closure Principle (2). $f_{\mathrm{TREE\_CLASSIFY}} : \mathit{tree} \times \mathit{table} \to \mathit{table}$
<!ELEMENT TREE_CLASSIFY ((%kdd_query_trees;), (%kdd_query_table;))>
Where: kdd_query_trees: all operators returning a classification tree; kdd_query_table: all operators returning a table; TREE_CLASSIFY belongs to the kdd_query_table entity.
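
In KDDML the check is performed by DTD validation. The sketch below only mimics the same idea in Python, to show how a signature table makes nested queries type-safe. The `TREE_CLASSIFY` signature is taken from the slide; the other operator tags and their signatures are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Operator signatures: tag -> (argument types, result type).
# TREE_CLASSIFY comes from the slide; the other two entries are assumed.
SIGNATURES = {
    "TREE_CLASSIFY": (("tree", "table"), "table"),
    "TREE_MINER": (("table",), "tree"),   # assumed signature
    "TABLE_LOADER": ((), "table"),        # assumed signature
}


def result_type(element):
    """Return the type an operator yields, raising if its arguments do not fit."""
    arg_types, returns = SIGNATURES[element.tag]
    # The type of each sub-element is the result type of the nested operator.
    actual = tuple(result_type(child) for child in element)
    if actual != arg_types:
        raise TypeError(f"{element.tag} expects {arg_types}, got {actual}")
    return returns


well_typed = ET.fromstring(
    "<TREE_CLASSIFY>"
    "<TREE_MINER><TABLE_LOADER/></TREE_MINER>"
    "<TABLE_LOADER/>"
    "</TREE_CLASSIFY>"
)
ill_typed = ET.fromstring(
    "<TREE_CLASSIFY><TABLE_LOADER/><TABLE_LOADER/></TREE_CLASSIFY>"
)
```

Because `TREE_CLASSIFY` itself returns a table, it can appear wherever a table is expected, which is the closure property the slide describes.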

15 KDDML Types. The set of types of KDDML operators consists of: Table, PPtable; Tree, clusters, rda, sequence, hierarchy; Algs, condition, expression.

16 KDDML Query structure. The structure of a KDDML query has a precise format. XML tag elements correspond to operations on data and models; XML attributes correspond to parameters of those operations; XML sub-elements define the arguments passed to the operators (KDDML Types). [Query skeleton omitted in transcript.]
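
The three roles (tag, attributes, sub-elements) can be made concrete by parsing a query into a nested structure. The query below is a hypothetical reconstruction: the operator tags follow the names used elsewhere in the slides, `xml_dest` appears in the discretization example, and the remaining attribute names are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical query; attribute names other than xml_dest are assumed.
QUERY = """
<TREE_CLASSIFY xml_dest="result.xml">
  <TREE_MINER target_attribute="play">
    <ARFF_LOADER arff_file_name="weather.arff"/>
  </TREE_MINER>
  <TABLE_LOADER xml_source="weather_test.xml"/>
</TREE_CLASSIFY>
"""


def describe(element):
    """Map an XML element to its operator, parameters and arguments."""
    return {
        "operator": element.tag,                          # tag -> operation
        "parameters": dict(element.attrib),               # attributes -> parameters
        "arguments": [describe(child) for child in element],  # sub-elements -> arguments
    }


tree = describe(ET.fromstring(QUERY))
```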

17 Example (1): construction and application of a decision tree. Loading of an ARFF source as training set. Simple sampling on the training set. Construction of a decision tree on the sampled training set (target attribute: play; algorithm: C4.5). Loading of a test set from the system repository. Application of the decision tree to the test set.

18 Example (2). [Query and diagram omitted in transcript. Diagram labels: Tree Classify; Tree Miner (Alg: c4.5, Pruning confidence: 40%, Num instances: 6); Sampling (Alg: simple sampling, Percentage: 66%); Arff Loader (Source: weather.arff), reading from the ARFF repository; Table Loader (Source: weather_test.xml), reading from the data repository.]

19 Language Operators. Data/model access. Preprocessing: data cleaning, sampling, normalization, discretization. Model extraction. Model application and evaluation. Model meta-reasoning and filtering.

20 Example one: Discretization. .... <PP_NUMERIC_DISCRETIZATION xml_dest="census_discrete.xml" attribute_name="age" label_type="enumeration" enumerated_label_list="young, middle, old"> .... Discretization of the numeric attribute “age” into three intervals using the natural binning method.
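
The slides do not define "natural binning". The sketch below assumes it means splitting the observed range into equal-width intervals, one per label; if KDDML uses a different rule, the boundaries will differ. The function name and the sample ages are illustrative.

```python
def discretize_equal_width(values, labels):
    """Assign each numeric value the label of its equal-width interval.

    Assumption: intervals partition [min, max] into len(labels) equal parts.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / len(labels)
    result = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in the first bin
        else:
            # Clamp so that the maximum value lands in the last bin.
            index = min(int((v - lo) / width), len(labels) - 1)
        result.append(labels[index])
    return result


ages = [18, 25, 40, 52, 63, 78]
labels = ["young", "middle", "old"]
discrete = discretize_equal_width(ages, labels)
```

With these sample ages the range 18 to 78 is cut into three intervals of width 20.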

21 Example two: association rule (RdA) filtering. [Query omitted in transcript.] Selects the rules that have the item “bread” in the body, do not have the item “milk” in the head, have exactly two items in the head, and have support greater than 30%.
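
The filter condition can be written as a plain predicate over rules. This is a sketch of the selection logic only, not of the KDDML operator: the `Rule` class and the sample rules are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """An association rule body -> head with its support (hypothetical model)."""
    body: frozenset
    head: frozenset
    support: float


def keep(rule):
    """The four conditions stated on the slide, combined with AND."""
    return (
        "bread" in rule.body
        and "milk" not in rule.head
        and len(rule.head) == 2
        and rule.support > 0.30
    )


RULES = [
    Rule(frozenset({"bread", "butter"}), frozenset({"jam", "eggs"}), 0.35),
    Rule(frozenset({"bread"}), frozenset({"milk", "eggs"}), 0.50),  # milk in head
    Rule(frozenset({"bread"}), frozenset({"jam"}), 0.60),           # one head item
    Rule(frozenset({"bread"}), frozenset({"jam", "eggs"}), 0.30),   # support not > 30%
    Rule(frozenset({"beer"}), frozenset({"jam", "eggs"}), 0.90),    # no bread in body
]

selected = [rule for rule in RULES if keep(rule)]
```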

22 System Architecture

23 Design targets. Extensibility: data sources, algorithms, models. Portability. Modularity: architecture structured in 3 layers.

24 Architecture Layers. [Diagram labels: Data, Models, Repository Layer, Operators Layer, Interpreter Layer, To upper layers…]
Repository Layer: manages read/write access to the data and model repository, and to data and models from external sources. Gives programmatic functionality to the higher layers.
Operators Layer: implementation of the language operators. Each operator is implemented as a Java class satisfying an interface. The interface is task-dependent.
Interpreter Layer: accepts a validated KDDML query and returns the result as an XML document. Recursively traverses the DOM tree representation. The interpreter is not affected by data/algorithm/model extensibility.
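
The interplay between the interpreter and operators layers can be sketched as a recursive evaluator over a registry of operators. The real system is written in Java and works on a DOM tree; this Python sketch only illustrates the design idea. The operator tags `CONST_TABLE` and `SAMPLE_FIRST` and their parameters are invented for the example and are not KDDML operators.

```python
import xml.etree.ElementTree as ET

# Operators layer: a registry mapping XML tags to implementations.
OPERATORS = {}


def operator(tag):
    """Register a function as the implementation of the given tag."""
    def register(func):
        OPERATORS[tag] = func
        return func
    return register


@operator("CONST_TABLE")  # hypothetical operator: a literal one-column table
def const_table(params, args):
    return [int(x) for x in params["values"].split(",")]


@operator("SAMPLE_FIRST")  # hypothetical operator: keep the first n rows
def sample_first(params, args):
    (table,) = args
    return table[: int(params["n"])]


def evaluate(element):
    """Interpreter layer: evaluate sub-elements first, then apply the operator."""
    args = [evaluate(child) for child in element]
    return OPERATORS[element.tag](dict(element.attrib), args)


query = ET.fromstring('<SAMPLE_FIRST n="2"><CONST_TABLE values="3,1,4"/></SAMPLE_FIRST>')
result = evaluate(query)
```

Adding an operator only touches the registry, so `evaluate` stays unchanged, which mirrors the claim that the interpreter is not affected by extensions.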

25 KDDML as Middleware System. [Diagram labels: High Level GUI, MQL, Compiler, Query MQL, Query KDDML, Results, Interpreter Layer, Operators Layer, Repository Layer, Data, Models.]

26 Experiences with KDDML

27 ClickWorld. Extract DM models from visits to a city-news portal, with the intent of characterizing the topics of interest of new visitors. Reference: M. Baglioni, U. Ferrara, A. Romei, S. Ruggieri, F. Turini. Preprocessing and mining web log data for web personalization. 8th Italian Conf. on Artificial Intelligence, LNCS vol. 2829, pp. 237-249, September 2003.

28 KDDML-G. A system for KDD on the GRID. Exploits the parallelism offered by the GRID. Handles data immovability by moving the code to the place where the data resides. [Diagram labels: OP, OP1, OP2, OP3.]

29 Download KDDML: http://kdd.di.unipi.it/kddml/ Released under the GNU General Public Licence.