Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris T. Lin and Vasant Honavar. BigData2013.

Presentation transcript:

Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores
Harris T. Lin and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University
Center for Computational Intelligence, Learning, and Discovery. BigData2013. Research supported in part by NSF grant IIS
[Figure: Machine Learning; Relational, Distributed; ?]

Resource Description Framework (RDF) Primer
RDF triple = subject-predicate-object triple
RDF graph = set of RDF triples
–Directed labeled graph whose nodes are URIs
RDF Data (Triple representation)
–(Inception, hasActor, Ellen Page)
–(Inception, hasActor, Leonardo DiCaprio)
–(Titanic, hasActor, Leonardo DiCaprio)
–(Ellen Page, yearOfBirth, 1987)
–(Ellen Page, gender, F)
–(Leonardo DiCaprio, yearOfBirth, 1974)
–(Leonardo DiCaprio, gender, M)
RDF Data (Graph representation)
–[Figure: the triples above drawn as a directed labeled graph over Inception, Titanic, Ellen Page, Leonardo DiCaprio, and their yearOfBirth and gender values]
RDF Schema
–[Figure: node types Movie, Actor, Gender, and xsd:integer, connected by hasActor, yearOfBirth, and gender]
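The two representations in the primer can be sketched in a few lines of Python. This is only an illustration: the helper names `build_graph` and `objects` are hypothetical, and plain strings stand in for the URIs a real RDF store would use.

```python
from collections import defaultdict

# The triples from the primer, stored as (subject, predicate, object) tuples.
# Plain strings stand in for URIs here (an assumption of this sketch).
TRIPLES = {
    ("Inception", "hasActor", "Ellen Page"),
    ("Inception", "hasActor", "Leonardo DiCaprio"),
    ("Titanic", "hasActor", "Leonardo DiCaprio"),
    ("Ellen Page", "yearOfBirth", 1987),
    ("Ellen Page", "gender", "F"),
    ("Leonardo DiCaprio", "yearOfBirth", 1974),
    ("Leonardo DiCaprio", "gender", "M"),
}


def build_graph(triples):
    """Index triples as a directed labeled graph: subject -> predicate -> objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for subject, predicate, obj in triples:
        graph[subject][predicate].add(obj)
    return graph


def objects(graph, subject, predicate):
    """Return the objects reached from `subject` along edges labeled `predicate`."""
    return graph.get(subject, {}).get(predicate, set())


graph = build_graph(TRIPLES)
```

Calling `objects(graph, "Inception", "hasActor")` follows the two hasActor edges out of Inception, which is the graph-side view of the first two triples.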

Introduction
Motivating scenario: Facebook + New York Times
–Facebook users share posts about news items published in the New York Times
–Goal: predict the interest of a user in joining a group
Challenges for Machine Learning
–Multiple interlinked data stores
–Physically distributed data stores
–Autonomously maintained data stores

Introduction
Linked Open Data cloud
–300+ interlinked datasets
–30+ billion triples
Multiple interlinked, physically distributed, autonomously maintained data stores
Several constraints prohibit downloading all the data together
–Bandwidth limits
–Access limits
–Storage and memory limits
–Privacy and confidentiality constraints
We need
–Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. SPARQL query interface)
[Figure: Linked Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.]

Summary of Contribution
Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores
Contributions
1.Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data
2.Distributed learning framework for RDF stores that form a chain [Not covered in this talk]
3.Identification of 3 special cases of RDF data fragmentation [Not covered in this talk]
4.Novel application of matrix reconstruction for approximating statistics, which dramatically reduces communication
5.Experimental results demonstrating feasibility

Problem Formulation
Last.fm dataset: [Figure: example users with class labels; User 1: Y, User 2: N, User 3: N]
Dataset (Conceptual)
$((B_{11}, \dots, B_{1K}), c_1)$
$((B_{21}, \dots, B_{2K}), c_2)$
$\dots$
$((B_{n1}, \dots, B_{nK}), c_n)$
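One way to picture the conceptual dataset is as a list of labeled instances, each carrying K bags of values. The sketch below assumes that each B_ik is a multiset of values for the k-th attribute of instance i and that c_i is its class label; the slide does not spell this out, and the tag values and the names `Instance` and `class_counts` are hypothetical.

```python
from collections import Counter
from typing import List, NamedTuple


class Instance(NamedTuple):
    bags: List[Counter]  # (B_i1, ..., B_iK): one multiset of values per attribute
    label: str           # c_i: the class label


# Hypothetical data with K = 1 attribute (tags reached from each user).
dataset = [
    Instance(bags=[Counter({"rock": 3, "indie": 1})], label="Y"),
    Instance(bags=[Counter({"jazz": 2})], label="N"),
    Instance(bags=[Counter({"pop": 1, "rock": 1})], label="N"),
]


def class_counts(data):
    """Count how many instances carry each class label."""
    return Counter(instance.label for instance in data)
```

The labels Y, N, N mirror the three example users on the slide; everything else is made up for illustration.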

Learning with Indirect Access to Data
Single RDF data store
–Lin et al. [10]
Multiple interlinked RDF data stores
–This work
[Figure: the Learner obtains statistics via SPARQL queries from the RDF data stores, which hold the dataset $((B_{11}, \dots, B_{1K}), c_1)$, $((B_{21}, \dots, B_{2K}), c_2)$, $\dots$, $((B_{n1}, \dots, B_{nK}), c_n)$, and produces a Classifier; the Classifier maps a new instance to a predicted class]
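To make "statistics via SPARQL queries" concrete, the sketch below builds a SPARQL 1.1 aggregate query that counts, for each starting node, how many paths along a list of predicates reach each end value. It is a single-store simplification, not the distributed scheme of the talk; the function name `histogram_query`, the variable naming `?x0 … ?xN`, and any predicate URIs passed in are hypothetical.

```python
def histogram_query(predicates):
    """Build a SPARQL 1.1 query counting paths from each start node to each end value.

    `predicates` is a list of predicate URIs forming a path; the query groups
    by the first and last variable and counts matching paths.
    """
    if not predicates:
        raise ValueError("need at least one predicate")
    # One triple pattern per hop: ?x0 <p0> ?x1 . ?x1 <p1> ?x2 . ...
    patterns = [
        f"  ?x{i} <{predicate}> ?x{i + 1} ."
        for i, predicate in enumerate(predicates)
    ]
    last = len(predicates)
    return "\n".join([
        f"SELECT ?x0 ?x{last} (COUNT(*) AS ?n)",
        "WHERE {",
        *patterns,
        "}",
        f"GROUP BY ?x0 ?x{last}",
    ])
```

The learner would send such a query to an endpoint and receive only the grouped counts, never the underlying triples.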

Learning Algorithms
1.Aggregation
–Simple aggregation (max, min, avg, etc.)
–Vector distance aggregation (Perlich and Provost [12])
2.Generative Models
–Naïve Bayes (with 4 different distributions): Bernoulli, Multinomial, Dirichlet, Polya (Dirichlet-Multinomial)
Key sufficient statistic: count for each value, for each instance (= histogram for each instance)
How to obtain this efficiently?
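As a rough illustration of why the per-instance histogram is enough, the sketch below fits a multinomial naive Bayes model using only histograms and labels. It is not the authors' formulation: the Laplace smoothing constant `alpha`, the function names, and the toy tag values are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(histograms, labels, vocabulary, alpha=1.0):
    """Fit log priors and per-class log value probabilities from histograms.

    `alpha` is a Laplace smoothing constant (an assumption of this sketch).
    """
    class_totals = Counter(labels)
    value_counts = defaultdict(Counter)
    for hist, label in zip(histograms, labels):
        value_counts[label].update(hist)  # add this instance's counts to its class
    model = {}
    n = len(labels)
    for label, n_c in class_totals.items():
        total = sum(value_counts[label].values()) + alpha * len(vocabulary)
        log_probs = {
            v: math.log((value_counts[label][v] + alpha) / total)
            for v in vocabulary
        }
        model[label] = (math.log(n_c / n), log_probs)
    return model


def predict(model, hist):
    """Return the class with the highest log posterior for a histogram."""
    def score(label):
        log_prior, log_probs = model[label]
        return log_prior + sum(
            count * log_probs[v] for v, count in hist.items() if v in log_probs
        )
    return max(model, key=score)


# Hypothetical training data: one histogram of tags per user.
histograms = [Counter(rock=3, indie=1), Counter(jazz=2), Counter(jazz=1, pop=1)]
labels = ["Y", "N", "N"]
model = train_multinomial_nb(histograms, labels, ["rock", "indie", "jazz", "pop"])
```

Nothing in training or prediction touches the raw data; both operate on counts alone, which is what makes a query-based formulation possible.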

Obtaining Statistics for Learning
Schema: [Figure: chain User, Track, Artist, Tag]
Data Graph: [Figure: users, tracks, artists, and tags linked along the chain]
Matrix Representation: [Figure: matrices along the chain User, Track, Artist, Tag and the resulting User-Tag matrix]
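The figure itself did not survive in this transcript, so the following is one plausible reading rather than the slide's exact content: if each link in the chain is stored as a 0/1 adjacency matrix, then multiplying the matrices along the chain gives a User-Tag matrix whose entries count the paths from a user to a tag. The toy matrices and the helper `matmul` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical toy chain: 2 users, 3 tracks, 2 artists, 2 tags.
user_track = [[1, 1, 0],
              [0, 0, 1]]
track_artist = [[1, 0],
                [1, 0],
                [0, 1]]
artist_tag = [[1, 1],
              [0, 1]]

# Entry (u, t) counts the User -> Track -> Artist -> Tag paths from user u to tag t.
user_tag = matmul(matmul(user_track, track_artist), artist_tag)
```

Each row of `user_tag` is then the tag histogram of one user, which is the sufficient statistic the learning algorithms need.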

Approximating Statistics
Column Projection: [Figure: projection of the User-Tag matrix, computed along the chain User, Track, Artist, Tag]
Row Projection: [Figure: projection of the User-Tag matrix, computed along the chain User, Track, Artist, Tag]
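Assuming the row and column projections are the row sums and column sums of the User-Tag matrix (the usual meaning in tomography, though the slide's figure is lost), both can be obtained by passing a single vector along the chain, without ever forming the full matrix. The sketch below uses a hypothetical toy chain and hypothetical helper names.

```python
def mat_vec(a, v):
    """Matrix times column vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def vec_mat(v, a):
    """Row vector times matrix."""
    return [sum(v[i] * a[i][j] for i in range(len(a))) for j in range(len(a[0]))]


def row_projection(chain):
    """Row sums of the chain product, passing a vector from the last store to the first."""
    v = [1] * len(chain[-1][0])
    for matrix in reversed(chain):
        v = mat_vec(matrix, v)
    return v


def column_projection(chain):
    """Column sums of the chain product, passing a vector from the first store to the last."""
    v = [1] * len(chain[0])
    for matrix in chain:
        v = vec_mat(v, matrix)
    return v


# Hypothetical toy chain: User-Track, Track-Artist, Artist-Tag adjacency matrices.
chain = [
    [[1, 1, 0], [0, 0, 1]],
    [[1, 0], [1, 0], [0, 1]],
    [[1, 1], [0, 1]],
]
rows = row_projection(chain)
cols = column_projection(chain)
```

Only one vector crosses each link of the chain, so the data passed between stores grows with the number of users or tags rather than with their product.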

Approximating Statistics
Could we approximate the User-Tag matrix from the two projections?
CT scans to the rescue!
–CT: reconstruct a 3D object from its projected slices
–We want: reconstruct a 2D matrix from its projections
[Figure: CT illustration. Source: imaging/imaging-using-x-rays/computed-tomography-ct/]

Approximating Statistics
We adapted one of the simplest reconstruction methods: the Algebraic Reconstruction Technique (ART)
Proposed scheme
1.Use SPARQL queries to accumulate and pass along column and row vectors, ultimately sending them back to the learner
2.The learner uses a CT method to reconstruct the matrix from the projections
3.Use the approximated matrix to compute the statistics necessary for learning
Dramatically reduces communication!
How accurate are the learned classifiers?
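Step 2 of the scheme can be sketched with a textbook additive ART (Kaczmarz-style) update, treating every row and every column as one "ray" whose measured value is the corresponding sum. This is a minimal illustration under that assumption; the adaptation used in the talk may differ, and the function name, the `sweeps` parameter, and the input vectors are hypothetical.

```python
def art_reconstruct(row_sums, col_sums, sweeps=10):
    """Approximate a matrix from its row and column sums with additive ART.

    Each update spreads the residual of one ray (a row or a column) evenly
    over the cells on that ray. Assumes sum(row_sums) == sum(col_sums).
    """
    m, n = len(row_sums), len(col_sums)
    x = [[0.0] * n for _ in range(m)]
    for _ in range(sweeps):
        for i in range(m):  # row rays
            delta = (row_sums[i] - sum(x[i])) / n
            for j in range(n):
                x[i][j] += delta
        for j in range(n):  # column rays
            delta = (col_sums[j] - sum(x[i][j] for i in range(m))) / m
            for i in range(m):
                x[i][j] += delta
    return x


# Hypothetical projections of a 2 x 2 User-Tag count matrix.
approx = art_reconstruct([4, 1], [2, 3])
```

The result matches both projections but is only one of many matrices that do, and it can contain fractional or negative entries, so a practical variant would likely clip or otherwise constrain the values before computing statistics.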

Experimental Results
Two subsets of the Last.fm dataset
2 aggregation and 4 naïve Bayes models
Compared against the centralized counterpart
–Uses the exact matrix for learning
Accuracy Results (10-fold cross validation)
–The effect of the ART approximation depends on the model
–NB(Pol) is competitive, even in the ART-approximated case
–NB(Mul) is competitive too, despite using less information than NB(Pol)
NB (Bernoulli) and NB (Multinomial) only need projections for learning, hence their results are identical (*)
Sensitivity of ART on different models [Not covered in this talk]

Communication Complexity
[Figure: size of query results transferred vs. size of the dataset (# users)]
The projections are several orders of magnitude smaller

Conclusion
Challenges
–Multiple interlinked, physically distributed, autonomously maintained RDF data stores
–The learner may be unable to download all the data because of limits on bandwidth, access, storage and memory, and because of privacy and confidentiality constraints
We need
–Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. SPARQL query interface)
Contributions
–Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data
–Distributed learning framework for RDF stores that form a chain [Not covered in this talk]
–Identification of 3 special cases of RDF data fragmentation [Not covered in this talk]
–Novel application of matrix reconstruction from Computerized Tomography for approximating statistics, which dramatically reduces communication
–Experimental results demonstrating feasibility

Related Work and Future Work
Related Work
–Most existing work on learning from RDF data assumes direct access
–Lin et al. [10] learn relational Bayesian classifiers from a single remote RDF store via SPARQL queries
–This work extends the remote access framework [20] to multiple RDF stores
Future Work
–Consider more recent and complex CT methods
–Explore other ways of taking projections
–Consider more complex RDF data fragmentations
–Consider richer classes of learning models