Studying the Presence of Genetically Modified Variants in Organic Oilseed Rape by using Relational Data Mining Aneta Ivanovska 1, Celine Vens 2, Sašo Džeroski.

Presentation transcript:

Studying the Presence of Genetically Modified Variants in Organic Oilseed Rape by using Relational Data Mining
Aneta Ivanovska 1, Celine Vens 2, Sašo Džeroski 1, Nathalie Colbach 3
1 Department of Knowledge Technologies, Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana, Slovenia
2 Department of Computer Science, K.U. Leuven, Celestijnenlaan 200A, B-3001 Leuven, Belgium
3 UMR1210, Biologie et Gestion des Adventices, INRA, Dijon, France

13/09/2007 EnviroInfo Warsaw

The GM problem
- Genetically Modified (GM) crops
  - First introduced for commercial production in 1996
  - Herbicide-tolerant
  - Pest-resistant
- Concern: GM crops mixing with conventional or organic crops of the same species

The GM problem (2)
- Computer simulation model GENESYS
  - Estimates the rate of adventitious presence of GM varieties in non-GM crops
  - Ranks the cropping systems according to their probability of gene flow between GM and non-GM oilseed rape (OSR)

Motivation
- Predict the contamination of a field with GM material
- The dataset produced by GENESYS was previously analyzed using propositional data mining techniques (Ivanovska et al., 2006)
- Assumption: contamination of a field with GM seeds mostly depends on the cropping techniques and crops grown in the surrounding fields
  - Exploit neighborhood relations in the predictive model
  - Create a relational representation of the problem
- In this study: investigate the use of relational data mining to analyze the dataset produced by GENESYS, using the relational data mining system TILDE

Figure (Colbach et al., 2001). Labels: field plan; crop succession; crop management; rape varieties; for each field and year: plants, seedbank, seeds produced (number per m², genotypic proportions).

Materials and methods: the dataset
- Output from GENESYS
- Large-risk field plan: maximizes the pollen and seed input into the central field
- Focus of the analysis: predict the rate of adventitious presence in the central field of a large-risk field plan

Materials and methods: the dataset (2)
- simulations for 25 years
- Attributes:
  - Geometry of the region (field plan)
  - Genetic variables
  - For each field and year: crops and management techniques
- Full details kept only for the last 4 years

Materials and methods: relational data mining, relational classification trees
- Propositional data mining techniques
  - Single table
  - Popular DM techniques: classification and regression decision trees
- Relational data mining techniques
  - Multiple tables
  - Relations between them
  - Relational classification or regression trees

Materials and methods: relational data mining, relational classification trees (2)
- Data scattered over multiple relations (or tables):
  - Propositionalization: the data can be analyzed by conventional data mining techniques after transforming it into a single propositional table (attribute-value representation)
  - Multi-relational approach: takes into account the structure of the original data
- Data represented in terms of relations: target(Field1, contaminated)
- Background knowledge is also given
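
The contrast between the two options can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the toy facts, the helper name `propositionalize` and the aggregate feature names are all hypothetical. The sketch flattens relational facts into one attribute-value row per target field, which shows what propositionalization keeps (aggregates over neighbours) and what it discards (which neighbour grew which crop).

```python
from collections import namedtuple

# Relational representation: two separate relations (tables).
FieldYear = namedtuple("FieldYear", "field year crop sowing_date")
Neighbour = namedtuple("Neighbour", "field1 field2 neigh_type")

# Hypothetical toy facts, invented for illustration only.
field_data = [
    FieldYear("f1", 25, "nonGM-OSR", 240),
    FieldYear("f2", 24, "gm-OSR", 245),
    FieldYear("f3", 24, "wheat", 280),
]
neighbours = [
    Neighbour("f1", "f2", "noborder"),
    Neighbour("f1", "f3", "border"),
]


def propositionalize(target, year):
    """Build one attribute-value row for `target` in `year`.

    Neighbour information is reduced to counts, so the identity of the
    individual neighbouring fields is no longer available to the learner.
    """
    own = next(r for r in field_data if r.field == target and r.year == year)
    nbr_ids = [n.field2 for n in neighbours if n.field1 == target]
    gm_nbrs = sum(
        1
        for r in field_data
        if r.field in nbr_ids and r.year == year - 1 and r.crop == "gm-OSR"
    )
    return {
        "field": target,
        "crop": own.crop,
        "sowing_date": own.sowing_date,
        "n_neighbours": len(nbr_ids),
        "n_gm_osr_neighbours_prev_year": gm_nbrs,
    }


row = propositionalize("f1", 25)
print(row)
```

A multi-relational learner would instead query `field_data` and `neighbours` directly, so no aggregate has to be chosen in advance.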

Materials and methods: relational data mining, relational classification trees (3)
- Relational vs. propositional classification trees, similarities:
  - Predict the value of a dependent variable (class) from the values of a set of independent variables (attributes)
  - Each inner node contains a test that compares the value of a certain attribute with a constant
  - Leaf nodes give a classification that applies to all instances that reach the leaf
- Relational vs. propositional classification trees, differences:
  - Propositional trees: tests in the inner nodes compare the value of a variable (property of the object) to a value
  - Relational trees: tests can also refer to background knowledge relations or tables

An example of a relational classification tree
Figure: tree diagram; the branch layout is not preserved in this transcript.
- Node tests:
  - targetField(FieldA) and fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<252
  - fieldDataYear(FieldA,0,Crop,SowingDate), SowingDate<233
  - neighbour(FieldA,FieldB,noborder) and fieldDataYear(FieldB,1,gm-OSR,SowingDate)
- Leaves: NEG, NEG, POS
- Branch labels: yes, no
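
How such a tree classifies a field can be sketched in Python. The node tests are taken from the slide, but the branch layout below is an assumption (the transcript does not preserve which leaf hangs under which branch), the SowingDate<233 test is left out because its position is unclear, and the facts are invented toy data. The point of the sketch is the relational test: it succeeds if some neighbouring field satisfies the condition, which a single-table test cannot express.

```python
# Hypothetical toy facts; field names, years and dates are illustrative only.
# fieldDataYear: (field, year) -> (crop, sowing_date)
FIELD_DATA = {
    ("a", 0): ("nonGM-OSR", 240),
    ("b", 1): ("gm-OSR", 245),
    ("c", 0): ("nonGM-OSR", 260),
    ("d", 0): ("nonGM-OSR", 230),
}
# neighbour: (field1, field2, neighbour_type)
NEIGHBOUR = {("a", "b", "noborder"), ("c", "b", "border")}


def sown_before(field, year, limit):
    """fieldDataYear(field, year, _, SowingDate), SowingDate < limit"""
    fact = FIELD_DATA.get((field, year))
    return fact is not None and fact[1] < limit


def has_gm_neighbour(field, year, neigh_type):
    """neighbour(field, B, neigh_type), fieldDataYear(B, year, gm-OSR, _)

    Existential test: true if at least one neighbour B satisfies it.
    """
    return any(
        f1 == field
        and t == neigh_type
        and FIELD_DATA.get((f2, year), (None,))[0] == "gm-OSR"
        for (f1, f2, t) in NEIGHBOUR
    )


def classify(field):
    # Assumed layout: a failed root test gives NEG; otherwise the
    # neighbour test decides between POS and NEG.
    if not sown_before(field, 0, 252):
        return "NEG"
    if has_gm_neighbour(field, 1, "noborder"):
        return "POS"
    return "NEG"


for f in ("a", "c", "d"):
    print(f, classify(f))
```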

Experiments and results
- Representation of the data:
  - Target relation (data label): rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres)
  - Background relations:
    - fieldDataYear(SimulationID, FieldID, Year, CultivationTechniques)
    - lastOSR(SimulationID, FieldID, LastGM, LastNonGM)
    - neighbour(Field1ID, Field2ID, NeighType)

Experiments and results (2)
- Discretized target attribute (threshold: 0.9%)
- Experimental settings:
  - Propositional: besides the target relation rateOfAdvPresenceInField(SimulationID, FieldID, RateAdvPres), only (propositional) data for the target field is included, without any relations among the fields. The following predicates are used:
    - fieldDataYear(FieldID,Year,Crop,SowingDate), for the target field
    - lastOSR(FieldID,LastGM,LastNonGM), for the target field
  - Neighbor: the same relations are used as in the Propositional setting, but other fields are now introduced via the neighbour relation, starting at the target field:
    - neighbour(Field1ID, Field2ID, NeighType)
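
The discretization step can be sketched as follows. This is a minimal illustration that assumes 0.9% is the cut-off between the two classes and that rates strictly above it are labelled pos; the slide states neither the direction nor how a rate of exactly 0.9% is handled, and the function name `discretize` is hypothetical.

```python
THRESHOLD_PERCENT = 0.9  # threshold given on the slide


def discretize(rate_percent):
    """Map a rate of adventitious presence (in %) to a class label.

    Assumption: rates strictly above the threshold are "pos",
    everything else is "neg".
    """
    return "pos" if rate_percent > THRESHOLD_PERCENT else "neg"


# Hypothetical rates, for illustration only.
for rate in (0.3, 0.9, 1.2):
    print(rate, discretize(rate))
```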

Experiments and results (3)
- TILDE’s experimental results, 3-fold cross-validation
- Example of a rule from the Propositional experiments:
  contamination([neg]):-targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, lastOSR(T, Gm, NonGm), Gm<20
- Example of a rule from the Neighbor experiments:
  contamination([pos]):-targetField(T), fieldDataYear(T, 25, Crop, SowingDate), SowingDate<252, neighbour(T, FieldA, noborder), fieldDataYear(FieldA, 24, gm-OSR, SowingDate)

             PROPOSITIONAL   NEIGHBOR
TREE SIZE    15              13
ACCURACY     78.35%          79.66%
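
For readers unfamiliar with the evaluation protocol, 3-fold cross-validation can be sketched as below. This is not TILDE's implementation: a majority-class predictor stands in for the tree learner, the fold assignment is a simple unshuffled interleaving, and the labels are invented toy data. The sketch only shows how the data is split into three folds and how the per-fold accuracies are averaged.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k disjoint folds by interleaving (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]


def majority_learner(train_labels):
    """Stand-in learner: always predicts the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda example: most_common


def cross_validate(examples, labels, k=3):
    """Return the mean accuracy over k folds."""
    accuracies = []
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_labels = [
            labels[i] for i in range(len(labels)) if i not in held_out
        ]
        model = majority_learner(train_labels)
        correct = sum(1 for i in test_idx if model(examples[i]) == labels[i])
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


# Hypothetical toy data: nine examples, six "neg" and three "pos".
examples = list(range(9))
labels = ["neg"] * 6 + ["pos"] * 3
mean_accuracy = cross_validate(examples, labels, k=3)
print(round(mean_accuracy, 4))
```

Each example is used for testing exactly once, so the averaged accuracy is an estimate of performance on unseen simulations rather than on the training data.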

Conclusions
- Use of relational data mining for analyzing the output of the complex simulation model GENESYS
- Predict the contamination of the central field of a large-risk field plan
- Built relational classification trees with the first-order decision tree learner TILDE

Conclusions (2)
- Propositional and relational trees
- Relational experiments: slightly better
  - Due to using a fixed field plan and a fixed target field
- Further work:
  - Performing more experiments with GENESYS
  - Different field plans
  - Different target fields
  - Analyze the results of other simulation models

Acknowledgement
- SIGMEA (Sustainable Introduction of Genetically Modified organisms into European Agriculture)