Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant.

Slides:



Advertisements
Similar presentations
Language Technologies Reality and Promise in AKT Yorick Wilks and Fabio Ciravegna Department of Computer Science, University of Sheffield.
Advertisements

Naïve Bayes. Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of.
Anna Atramentov and Vasant Honavar* Artificial Intelligence Laboratory Department of Computer Science Iowa State University Ames, IA 50011, USA
University of Texas at Austin Machine Learning Group Department of Computer Sciences University of Texas at Austin Discriminative Structure and Parameter.
Combining Classification and Model Trees for Handling Ordinal Problems D. Anyfantis, M. Karagiannopoulos S. B. Kotsiantis, P. E. Pintelas Educational Software.
Probabilistic Latent-Factor Database Models Denis Krompaß 1, Xueyan Jiang 1,Maximilian Nickel 2 and Volker Tresp 1,3 1 Department of Computer Science.
Data Mining Classification: Alternative Techniques
Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris T. Lin and Vasant Honavar. BigData2013.
UNCERTML - DESCRIBING AND COMMUNICATING UNCERTAINTY Matthew Williams
Iowa State University Department of Computer Science, Iowa State University Artificial Intelligence Research Laboratory Center for Computational Intelligence,
Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Research supported in part by grants from the National.
Integrating Bayesian Networks and Simpson’s Paradox in Data Mining Alex Freitas University of Kent Ken McGarry University of Sunderland.
Frequent Pattern Mining Toon CaldersBart Goethals ADReM research group.
CSE 574 – Artificial Intelligence II Statistical Relational Learning Instructor: Pedro Domingos.
1 Unsupervised Learning With Non-ignorable Missing Data Machine Learning Group Talk University of Toronto Monday Oct 4, 2004 Ben Marlin Sam Roweis Rich.
Statistical Relational Learning for Link Prediction Alexandrin Popescul and Lyle H. Unger Presented by Ron Bjarnason 11 November 2003.
1 Ranked Queries over sources with Boolean Query Interfaces without Ranking Support Vagelis Hristidis, Florida International University Yuheng Hu, Arizona.
Data Mining with Decision Trees Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island.
1 Optimizing Utility in Cloud Computing through Autonomic Workload Execution Reporter : Lin Kelly Date : 2010/11/24.
Overview of Search Engines
LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.
Walter Hop Web-shop Order Prediction Using Machine Learning Master’s Thesis Computational Economics.
1 © Goharian & Grossman 2003 Introduction to Data Mining (CS 422) Fall 2010.
Data Mining Techniques
Managing Large RDF Graphs (Infinite Graph) Vaibhav Khadilkar Department of Computer Science, The University of Texas at Dallas FEARLESS engineering.
Transfer Learning From Multiple Source Domains via Consensus Regularization Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, Qing He.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence From Data Mining To Knowledge.
1 Yolanda Gil Information Sciences InstituteJanuary 10, 2010 Requirements for caBIG Infrastructure to Support Semantic Workflows Yolanda.
DATA MINING : CLASSIFICATION. Classification : Definition  Classification is a supervised learning.  Uses training sets which has correct answers (class.
Ensemble Solutions for Link-Prediction in Knowledge Graphs
An Integration Framework for Sensor Networks and Data Stream Management Systems.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Bayesian Sets Zoubin Ghahramani and Kathertine A. Heller NIPS 2005 Presented by Qi An Mar. 17 th, 2006.
Bayesian networks Classification, segmentation, time series prediction and more. Website: Twitter:
Topical Crawlers for Building Digital Library Collections Presenter: Qiaozhu Mei.
Samad Paydar Web Technology Lab. Ferdowsi University of Mashhad 10 th August 2011.
Disclosure risk when responding to queries with deterministic guarantees Krish Muralidhar University of Kentucky Rathindra Sarathy Oklahoma State University.
Dimitrios Skoutas Alkis Simitsis
Skewing: An Efficient Alternative to Lookahead for Decision Tree Induction David PageSoumya Ray Department of Biostatistics and Medical Informatics Department.
1 Business System Analysis & Decision Making – Data Mining and Web Mining Zhangxi Lin ISQS 5340 Summer II 2006.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
Computing & Information Sciences Kansas State University Wednesday, 22 Oct 2008CIS 530 / 730: Artificial Intelligence Lecture 22 of 42 Wednesday, 22 October.
Iowa State University Department of Computer Science Artificial Intelligence Research Laboratory Research supported in part by a grant from the National.
RecBench: Benchmarks for Evaluating Performance of Recommender System Architectures Justin Levandoski Michael D. Ekstrand Michael J. Ludwig Ahmed Eldawy.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Statistical Inference (By Michael Jordon) l Bayesian perspective –conditional perspective—inferences.
Xiangnan Kong,Philip S. Yu Multi-Label Feature Selection for Graph Classification Department of Computer Science University of Illinois at Chicago.
IHE Profile – SOA Analysis: In Progress Update Brian McIndoe January 18, 2011.
Center for Computational Intelligence, Learning, and Discovery Artificial Intelligence Research Laboratory Department of Computer Science Supported in.
Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and Discovery Program.
A Model for Learning the Semantics of Pictures V. Lavrenko, R. Manmatha, J. Jeon Center for Intelligent Information Retrieval Computer Science Department,
Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and Discovery Program.
A Data Stream Publish/Subscribe Architecture with Self-adapting Queries Alasdair J G Gray and Werner Nutt School of Mathematical and Computer Sciences,
Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris T. Lin, Sanghack Lee, Ngot Bui.
Feature Extraction Artificial Intelligence Research Laboratory Bioinformatics and Computational Biology Program Computational Intelligence, Learning, and.
CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.
Typically, classifiers are trained based on local features of each site in the training set of protein sequences. Thus no global sequence information is.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
Unsupervised Streaming Feature Selection in Social Media
CS Machine Learning Instance Based Learning (Adapted from various sources)
Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.
CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.
Efficient Opportunistic Sensing using Mobile Collaborative Platform MOSDEN.
Learning Bayesian Networks for Complex Relational Data
Data Analytics 1 - THE HISTORY AND CONCEPTS OF DATA ANALYTICS
Learning Profiles from User Interactions
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
Location Recommendation — for Out-of-Town Users in Location-Based Social Network Yina Meng.
Ontology-Based Information Integration Using INDUS System
Data Warehousing and Data Mining
Presentation transcript:

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Learning Relational Bayesian Classifiers from RDF Data Harris Lin, Neeraj Koul, and Vasant Honavar Artificial Intelligence Research Laboratory Department of Computer Science Iowa State University

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Probability of patient P getting Alzheimer's disease within next 10 years? Patient Health Records Disease Prevention Linked Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Probability of customer C buying product P? Customer Purchase History Product Recommendation Linked Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Crime rate in region R ? Government Data City Planning Linked Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Overview We  Show how to learn relational Bayesian classifiers from RDF data in a setting where the data can be accessed only through statistical queries against a SPARQL endpoint  Demonstrate the advantages of the statistical query based approach over one that requires direct access to RDF data  Establish the conditions under which the predictive model can be updated incrementally in response to changes in the underlying data  Show how to cope with settings where the relevant data attributes to be used for building the predictive model are not known a priori  Provide open source implementation of the resulting algorithms

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Outline Background and Motivation Statistical Query Based Approach to Learning Relational Bayesian Classifiers from RDF data Experiment 1: Communication Complexity Extensions of the basic approach to settings where –The RDF data store is updated over time –The attributes of interest are not known a priori –Experiment 2: Selective Attribute Crawling Conclusion and Future Work

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Motivation: Learning Predictive Models from RDF Data The growth in RDF data on the web offers unprecedented opportunities for –building predictive models e.g., classifiers from RDF data –using the resulting models to guide decisions in a broad range of application domains Machine learning offers one of the most effective approaches to building predictive models from data Statistical relational learning [Getoor, 2007] offers a natural starting point for design of algorithms for learning predictive models from RDF data Predictive Model Learning Algorithm Data

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Kiefer et al. [2008] extend SPARQL with additional constructs to support data mining to obtain SPARQL-ML CREATE MINING MODEL statement for constructing a predictive model from RDF data PREDICT statement for applying the model to make predictions Tresp et al. [2009] use matrix completion methods e.g., Latent Dirichlet Allocation (LDA) to make predictions from a Boolean Matrix representation of RDF data Bicer et al. [2011] combine kernel-based support vector machines with relational learning to obtain Relational Kernel machines defined over RDF data Related Work: Learning Predictive Models from RDF Data

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Limitations of Existing Methods for Learning Predictive Models from RDF Data Key limitation: Current methods assume that the learning algorithm has direct access to RDF Data Direct access Predictive Model Learning Algorithm RDF Data

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS In practice: –RDF stores may not always provide data dumps (e.g. streaming social network data) –Access constraints may prevent access to RDF data (e.g. patient health records, Dr. Pentland’s keynote, “Linked Closed Data” workshop]) –Bandwidth constraints may limit the ability to transfer data from a remote RDF store to the local site where the learning algorithm resides (e.g. sensor data, and possibly “Linked Sensor Data” workshop]) –Learning algorithms that assume in-memory access to data cannot handle RDF datasets that are too large to fit in memory Needed: Approaches for learning from RDF data without direct access to RDF Data in settings where the data can be accessed only through statistical queries (e.g. against a SPARQL endpoint) Limitations of Existing Methods for Learning Predictive Models from RDF Data

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Outline Background and Motivation Statistical Query Based Approach to Learning Relational Bayesian Classifiers from RDF data Experiment 1: Communication Complexity Extensions of the basic approach to settings where –The RDF data store is updated over time –The attributes of interest are not known a priori –Experiment 2: Selective Attribute Crawling Conclusion and Future Work

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Statistical Query Based Approach to Learning Classifiers from RDF Data Uses the statistical query based learning framework of Caragea et al. [2005] Decompose the learner into two components –Statistical query generation: poses statistical queries to a data source to acquire the information needed by the learner –Model construction: uses the answers to statistical queries to update or refine a partial model The learner interacts with the RDF data store only through queries posed against a SPARQL access point Classifier RDF Data Query Interface Learning Algorithm Remote machineLocal machine

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Learning Relational Bayesian Classifiers from RDF Data A Relational Bayesian Classifier (RBC) [Getoor, 2007] is a relational variant of the Simple Bayesian Classifier (SBC) SBC –Each attribute in a data instance assumes a single value from the domain of the corresponding random variable –Assumes that data attributes are conditionally independent of class C RBC –Each data attribute takes a value that is a multi-set of elements chosen from the domains of the corresponding random variable –Assumes that each multi-set valued attribute is independent given the class

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS RBC Example RDF Schema: Goal: predict whether a movie receives more than $2M in its opening week –Target class: Movie –Target attribute: (openingReceipts) –Attribute 1: (hasActor, yearOfBirth) –Attribute 2: (hasActor, gender) Movie Actor Gender xsd:integer openingReceipts hasActor yearOfBirth gender

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS openingReceipts hasActor yearOfBirth genderInception Leonardo DiCaprio Ellen Page Titanic hasActor 1987 F yearOfBirth gender 1974 M 62M openingReceipts 28M hasActor MovieopeningReceiptshasActor, yearOfBirth hasActor, gender Inception62M{1987,1974}{F,M} Titanic28M{1974}{M} Flattening RBC Counts via SPARQL Queries

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Learning Relational Bayesian Classifiers from Statistical Queries against RDF data Our implementation relies on the aggregate queries supported by SPARQL 1.1 to express the statistical queries posed by RBC Learner RBC RDF Data SPARQL 1.1 RBC Learner Remote machineLocal machine

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Outline Background and Motivation Statistical Query Based Approach to Learning Relational Bayesian Classifiers from RDF data Experiment 1: Communication Complexity Extensions of the basic approach to settings where –The RDF data store is updated over time –The attributes of interest are not known a priori –Experiment 2: Selective Attribute Crawling Conclusion and Future Work

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Communication Complexity: Statistical Query Based RBC Learner compared with RBC Learner with Direct Access to Data Data prepared from LinkedMDB and Freebase SPARQL queries and results logged in plain text format Raw data dump in RDF/XML (also compared with compressed dumps) The cost of retrieving the statistics needed for learning < < the cost of retrieving the entire dataset

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Outline Background and Motivation Statistical Query Based Approach to Learning Relational Bayesian Classifiers from RDF data Experiment 1: Communication Complexity Extensions of the basic approach to settings where –The RDF data store is updated over time –The attributes of interest are not known a priori –Experiment 2: Selective Attribute Crawling Conclusion and Future Work

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Learning Relational Bayesian Classifiers from Statistical Queries against RDF data: Further Extensions Extensions of the basic approach to settings where –The RDF data store is updated over time –The attributes of interest are not known a priori RBC RDF Data SPARQL 1.1 RBC Learner Remote machineLocal machine

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Updating Relational Bayesian Classifiers in Response to Updates to the Underlying RDF Store RBC classifiers are updatable (definition adapted from [Koul, 2010]) if the updates to the underlying RDF data store are clean An update is clean if any new tuples that are added as a result of update share objects with only the elements of the multi-sets that make up the attribute values used to construct the RBC before the update (see paper for precise definition) RBC constructed from RDF data prior to update Updated RBC Updated RDF Data SPARQL 1.1 RBC Learner

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Building Relational Bayesian Classifiers when the Attributes of Interest are not known A priori In order to pose statistical queries, the learner needs to know the attributes of RDF data (i.e., the RDF schema) If the attributes are not known a priori, we selectively crawl for attributes that are predictive of the class label using information gain to score candidate attributes

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS US Census RDF schema (by Tauberer) J. Tauberer. The 2000 U.S. census: 1 billion RDF triples.

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Selective Attribute Crawling Experiment: Census Combined with Data.gov violent crime rate dataset Predict: A state's violent crime rate is over 400 per 100,000 population? Cross validation of 52 US states into 13 folds

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Top 6 Attributes Selected by BestFirst Strategy./families./families/type./households./households/1personHousehold/femaleHouseholder./households/2OrMorePersonHousehold/ familyHouseholds/marriedcoupleFamily/ noOwnChildrenUnder18Years./households/2OrMorePersonHousehold/ familyHouseholds/otherFamily/ femaleHouseholder_NoHusbandPresent

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Open Source Implementation Google Code Implemented as part of INDUS [Koul2008a], an open source system that learns predictive models from remote data source using sufficient statistics

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Outline Background and Motivation Statistical Query Based Approach to Learning Relational Bayesian Classifiers from RDF data Experiment 1: Communication Complexity Extensions of the basic approach to settings where –The RDF data store is updated over time –The attributes of interest are not known a priori –Experiment 2: Selective Attribute Crawling Conclusion and Future Work

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Contributions  Statistical Query Based Approach to Learning Relational Bayesian Classifiers from RDF data Applicable to settings where data can be accessed only through statistical queries against a SPARQL endpoint Far more bandwidth efficient than the alternatives that assume direct access to RDF data Generalizable to other statistical relational learning algorithms  Extensions to settings where The RDF data store is updated over time The attributes of interest are not known a priori  Open source implementation of the algorithms

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Future Work  Statistical Query based variants of other relational learning algorithms for building predictive models from RDF data  Extensions of the approach to learn predictive models from multiple distributed RDF data stores (cf. RDF Query – Multiple Sources Maritim)  Open source implementation of other algorithms  Real-world applications

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Thank You! Questions?

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS SPARQL Aggregate Queries

Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris Lin, Neeraj Koul, and Vasant Honavar. ISWC Research supported in part by NSF grant IIS Updatable Definition