
Studying the Presence of Genetically Modified Variants in Organic Oilseed Rape by using Relational Data Mining

Aneta Ivanovska 1, Celine Vens 2, Sašo Džeroski 1, Nathalie Colbach 3

1 Department of Knowledge Technologies, Jozef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
  Emails: aneta.ivanovska@ijs.si, saso.dzeroski@ijs.si
  Tel: +386 1 477 3144 (Aneta Ivanovska)
2 Department of Computer Science, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium
  Email: celine.vens@cs.kuleuven.be
3 UMR1210, Biologie et Gestion des Adventices, INRA, 21000 Dijon, France
  Email: colbach@dijon.inra.fr

13/09/2007, EnviroInfo 2007 - Warsaw

The GM problem
- Genetically modified (GM) crops
  - First introduced for commercial production in 1996
  - Herbicide tolerant
  - Pest resistant
- Concern: GM crops mixing with conventional or organic crops of the same species

The GM problem (2)
- Computer simulation model GENESYS
  - Estimates the rate of adventitious presence of GM varieties in non-GM crops
  - Ranks cropping systems according to their probability of gene flow between GM and non-GM oilseed rape (OSR)

Motivation
- Predict the contamination of a field with GM material
- The dataset produced by GENESYS was previously analyzed using propositional data mining techniques (Ivanovska et al., 2006)
- Assumption: contamination of a field with GM seeds mostly depends on the cropping techniques and crops grown in the surrounding fields
  - Exploit neighborhood relations in the predictive model
  - Create a relational representation of the problem
- In this study: investigate the use of relational data mining to analyze the dataset produced by GENESYS, using the relational data mining system TILDE

[Figure: GENESYS model diagram (Colbach et al., 2001). Diagram labels: field plan, crop succession, crop management, rape varieties. For each field and year: plants, seedbank, seeds produced (number per m², genotypic proportions).]

Materials and methods: the dataset
- Output from GENESYS
- Large-risk field plan: maximizes the pollen and seed input into the central field
- Focus of the analysis: predict the rate of adventitious presence in the central field of a large-risk field plan

Materials and methods: the dataset (2)
- 100 000 simulations for 25 years
- Attributes:
  - Geometry of the region (field plan)
  - Genetic variables
  - For each field and year: crops and management techniques
- Full details kept only for the last 4 years

Materials and methods: relational data mining, relational classification trees
- Propositional data mining techniques
  - Single table
  - Popular DM techniques: classification and regression decision trees
- Relational data mining techniques
  - Multiple tables
  - Relations between them
  - Relational classification or regression trees

Materials and methods: relational data mining, relational classification trees (2)
- Data scattered over multiple relations (or tables):
  - can be analyzed by conventional data mining techniques after being transformed into a single propositional table (attribute-value representation); this step is called propositionalization
  - a multi-relational approach instead takes into account the structure of the original data
- Data represented in terms of relations: target(Field1, contaminated)
- Background knowledge is also given
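Propositionalization can be illustrated with a minimal Python sketch. The facts below are made up for illustration and do not come from the GENESYS dataset; the "border" neighbour type, the field names, and the aggregate column names are assumptions. The sketch flattens the neighbour facts around one target field into a single attribute-value row by counting GM oilseed rape neighbours per neighbour type.

```python
from collections import defaultdict

# Hypothetical facts (not taken from the GENESYS output).
# neighbour(Field1, Field2, NeighType)
neighbour = [
    ("f1", "f2", "border"),
    ("f1", "f3", "noborder"),
]
# Crop grown in each field in one given year.
crop = {"f1": "wheat", "f2": "gm-OSR", "f3": "gm-OSR"}


def propositionalize(target, neighbour, crop):
    """Flatten the relational facts around `target` into one attribute-value row."""
    counts = defaultdict(int)
    for field1, field2, neigh_type in neighbour:
        # Count only neighbours of the target that grow GM oilseed rape.
        if field1 == target and crop[field2] == "gm-OSR":
            counts[neigh_type] += 1
    return {
        "field": target,
        "crop": crop[target],
        "n_gm_border": counts["border"],
        "n_gm_noborder": counts["noborder"],
    }


row = propositionalize("f1", neighbour, crop)
print(row)
```

The aggregation loses which neighbour contributed what, which is the structural information a multi-relational learner can still use.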

Materials and methods: relational data mining, relational classification trees (3)
- Relational vs. propositional classification trees, similarities:
  - both predict the value of a dependent variable (class) from the values of a set of independent variables (attributes)
  - each inner node contains a test that compares the value of a certain attribute with a constant
  - leaf nodes give a classification that applies to all instances that reach the leaf
- Relational vs. propositional classification trees, differences:
  - Propositional trees: tests in the inner nodes compare the value of a variable (a property of the object) to a value
  - Relational trees: tests can also refer to background knowledge relations or tables
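The difference between the two kinds of node test can be sketched in Python. This is an illustration under assumptions, not TILDE's implementation: the field names, sowing dates, and helper function names are hypothetical. A propositional test looks only at a property of the object itself, while a relational test asks whether some related object with a given property exists.

```python
# Hypothetical data: two fields, one neighbour link, one field growing GM OSR.
fields = {"A": {"sowing_date": 240}, "B": {"sowing_date": 260}}
neighbour = {("A", "B", "noborder")}  # neighbour(Field1, Field2, NeighType)
gm_osr_fields = {"B"}


def propositional_test(field):
    """Compare a property of the field itself with a constant."""
    return fields[field]["sowing_date"] < 252


def relational_test(field):
    """Succeed if SOME 'noborder' neighbour of the field grows GM OSR."""
    return any(
        f1 == field and kind == "noborder" and f2 in gm_osr_fields
        for f1, f2, kind in neighbour
    )


print(propositional_test("A"), relational_test("A"))
```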

An example of a relational classification tree
[Tree diagram. Node tests, in the order they appear on the slide:
  targetField(FieldA) and fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<252
  fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<233
  neighbour(FieldA,FieldB,noborder) and fieldDataYear(FieldB,1,gm-OSR,SowingDate)
Leaves: NEG, NEG, POS. Branches are labelled yes/no.]

Experiments and results
- Representation of the data:
  - Target relation (data label): rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres)
  - Background relations:
    - fieldDataYear(SimulationID, FieldID, Year, CultivationTechniques)
    - lastOSR(SimulationID, FieldID, LastGM, LastNonGM)
    - neighbour(Field1ID, Field2ID, NeighType)
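One way to picture this representation is as typed records, one per relation. The Python sketch below is only an illustration: the attribute types, the example values, and the `neighbours_of` helper are assumptions, and CultivationTechniques is reduced to two hypothetical attributes (crop and sowing date).

```python
from typing import NamedTuple


class RateOfAdvPresenceInField(NamedTuple):  # target relation
    simulation_id: int
    field_id: str
    rate_adv_pres: float


class FieldDataYear(NamedTuple):  # background relation
    simulation_id: int
    field_id: str
    year: int
    crop: str          # assumed part of CultivationTechniques
    sowing_date: int   # assumed part of CultivationTechniques


class LastOSR(NamedTuple):  # background relation
    simulation_id: int
    field_id: str
    last_gm: int
    last_non_gm: int


class Neighbour(NamedTuple):  # background relation, shared by all simulations
    field1_id: str
    field2_id: str
    neigh_type: str


# Made-up example facts for a single simulation.
target = RateOfAdvPresenceInField(1, "centre", 0.4)
neighbours = [
    Neighbour("centre", "north", "border"),
    Neighbour("centre", "east", "noborder"),
]


def neighbours_of(field_id, relation):
    """Return the IDs of all fields linked to `field_id` in the neighbour relation."""
    return [n.field2_id for n in relation if n.field1_id == field_id]


print(neighbours_of(target.field_id, neighbours))
```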

Experiments and results (2)
- Discretized target attribute: 0.9%
- Experimental settings:
  - Propositional: besides the target relation rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres), only (propositional) data for the target field are included, without any relations among the fields. The following predicates are used:
    - fieldDataYear(FieldID, Year, Crop, SowingDate), for the target field
    - lastOSR(FieldID, LastGM, LastNonGM), for the target field
  - Neighbor: the same relations as in the Propositional setting, but other fields are now introduced via the neighbour relation, starting at the target field:
    - neighbour(Field1ID, Field2ID, NeighType)
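The discretization step can be sketched as follows, assuming that 0.9% is the cut point, that rates are expressed in percent, and that rates strictly above the cut point are labelled positive. The slide does not say how a rate of exactly 0.9% is handled, so that boundary choice is an assumption, as are the function and constant names.

```python
THRESHOLD_PERCENT = 0.9  # cut point from the slide, assumed to be in percent


def discretize(rate_percent, threshold=THRESHOLD_PERCENT):
    """Turn a numeric rate of adventitious presence into a class label.

    Assumption: rates strictly above the threshold are 'pos', all others 'neg'.
    """
    return "pos" if rate_percent > threshold else "neg"


labels = [discretize(rate) for rate in (0.0, 0.9, 1.5)]
print(labels)
```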

Experiments and results (3)
- TILDE’s experimental results (3-fold cross-validation):

                Propositional   Neighbor
  Tree size     15              13
  Accuracy      78.35%          79.66%

- Example of a rule from the Propositional experiments:
  contamination([neg]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, lastOSR(T, Gm, NonGm), Gm<20
- Example of a rule from the Neighbor experiments:
  contamination([pos]) :- targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, neighbour(T, FieldA, noborder), fieldDataYear(FieldA, 24, gm-OSR, SowingDate)
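To make the two example rules concrete, the sketch below checks their bodies against a handful of made-up facts in Python. It is not how TILDE evaluates clauses. The facts, field names, and function names are hypothetical, and the neighbour's sowing date is left unconstrained, which is a loose reading of the second rule (the slide reuses the variable SowingDate there).

```python
# Hypothetical facts for one simulation.
# fieldDataYear(Field, Year) -> (Crop, SowingDate)
field_data_year = {("t", 25): ("wheat", 240), ("a", 24): ("gm-OSR", 245)}
# lastOSR(Field) -> (LastGM, LastNonGM)
last_osr = {"t": (5, 2)}
# neighbour(Field1, Field2, NeighType)
neighbour = [("t", "a", "noborder")]


def propositional_rule_fires(t):
    """Body of the 'neg' rule: SowingDate < 252 in year 25 and Gm < 20."""
    fact = field_data_year.get((t, 25))
    osr = last_osr.get(t)
    if fact is None or osr is None:
        return False
    _crop, sowing_date = fact
    gm, _non_gm = osr
    return sowing_date < 252 and gm < 20


def neighbor_rule_fires(t):
    """Body of the 'pos' rule: SowingDate < 252 in year 25 and some
    'noborder' neighbour grew gm-OSR in year 24."""
    fact = field_data_year.get((t, 25))
    if fact is None or fact[1] >= 252:
        return False
    for field1, field2, kind in neighbour:
        if field1 == t and kind == "noborder":
            neighbour_fact = field_data_year.get((field2, 24))
            if neighbour_fact is not None and neighbour_fact[0] == "gm-OSR":
                return True
    return False


print(propositional_rule_fires("t"), neighbor_rule_fires("t"))
```

Both bodies succeed on these facts. That is not a contradiction, because the two rules come from trees learned in different experimental settings.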

Conclusions
- Used relational data mining to analyze the output of the complex simulation model GENESYS
- Predicted the contamination of the central field of a large-risk field plan
- Built relational classification trees with the first-order decision tree learner TILDE

Conclusions (2)
- Propositional and relational trees
- Relational experiments: slightly better
  - Due to using a fixed field plan and a fixed target field
- Further work:
  - Performing more experiments with GENESYS
    - Different field plans
    - Different target fields
  - Analyze the results of other simulation models

Acknowledgement
- SIGMEA (Sustainable Introduction of Genetically Modified organisms into European Agriculture)
