Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris T. Lin and Vasant Honavar. BigData2013.

1 Iowa State University Department of Computer Science Center for Computational Intelligence, Learning, and Discovery Harris T. Lin and Vasant Honavar. BigData2013. Research supported in part by NSF grant IIS 0711356.
Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores
Harris T. Lin and Vasant Honavar
Artificial Intelligence Research Laboratory, Department of Computer Science, Iowa State University
htlin@iastate.edu
[Figure labels: Machine Learning; Relational, Distributed; ?]

2 Resource Description Framework (RDF) Primer
RDF triple = subject-predicate-object triple
RDF graph = set of RDF triples
Directed labeled graph whose nodes are URIs (or literal values such as 1987)
RDF Data (Triple representation)
(Inception, hasActor, Ellen Page)
(Inception, hasActor, Leonardo DiCaprio)
(Titanic, hasActor, Leonardo DiCaprio)
(Ellen Page, yearOfBirth, 1987)
(Ellen Page, gender, F)
(Leonardo DiCaprio, yearOfBirth, 1974)
(Leonardo DiCaprio, gender, M)
RDF Data (Graph representation)
[Figure: the same triples drawn as a directed labeled graph]
RDF Schema
(Movie, hasActor, Actor)
(Actor, yearOfBirth, xsd:integer)
(Actor, gender, Gender)
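The triple and graph views in the primer can be sketched in a few lines of Python. This is only an illustration: plain strings stand in for URIs, and the helper names `objects` and `subjects` are hypothetical, not part of any RDF library.

```python
# Triples copied from the slide; plain strings stand in for URIs and literals.
triples = {
    ("Inception", "hasActor", "Ellen Page"),
    ("Inception", "hasActor", "Leonardo DiCaprio"),
    ("Titanic", "hasActor", "Leonardo DiCaprio"),
    ("Ellen Page", "yearOfBirth", 1987),
    ("Ellen Page", "gender", "F"),
    ("Leonardo DiCaprio", "yearOfBirth", 1974),
    ("Leonardo DiCaprio", "gender", "M"),
}


def objects(graph, subject, predicate):
    """Follow edges labeled `predicate` out of `subject` (hypothetical helper)."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def subjects(graph, predicate, obj):
    """Follow edges labeled `predicate` backwards into `obj` (hypothetical helper)."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


# Actors of a movie, and movies of an actor, read off the same set of triples.
cast = objects(triples, "Inception", "hasActor")
movies = subjects(triples, "hasActor", "Leonardo DiCaprio")
```

Because an RDF graph is just a set of triples, both directions of traversal are simple filters over that set.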

3 Introduction
Motivating scenario: Facebook + New York Times
–Facebook users share posts about news items published in the New York Times
–Goal: predict the interest of a user in joining a group
Challenges for Machine Learning
–Multiple interlinked data stores
–Physically distributed data stores
–Autonomously maintained data stores

4 Introduction
Linked Open Data cloud
–300+ interlinked datasets
–30+ billion triples
Multiple interlinked, physically distributed, autonomously maintained data stores
Downloading all of the data together is often ruled out by
–Bandwidth limits
–Access limits
–Storage and memory limits
–Privacy and confidentiality constraints
We need
–Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. SPARQL query interface)
[Figure: Linked Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/]

5 Summary of Contribution
Learning Classifiers from Chains of Multiple Interlinked RDF Data Stores
Contributions
1.Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data
2.Distributed learning framework for RDF stores that form a chain [Not covered in this talk]
3.Identification of 3 special cases of RDF data fragmentation [Not covered in this talk]
4.Novel application of matrix reconstruction for approximating statistics, which dramatically reduces communication
5.Experimental results demonstrating feasibility

6 Problem Formulation
Last.fm dataset: [Figure: example with class labels User 1 = Y, User 2 = N, User 3 = N]
Dataset (Conceptual)
((B_11, …, B_1K), c_1)
((B_21, …, B_2K), c_2)
…
((B_n1, …, B_nK), c_n)

7 Learning with Indirect Access to Data
Single RDF data store
–Lin et al. [10]
Multiple Interlinked RDF data stores
–This work
[Figure: the Learner obtains statistics via SPARQL queries from the RDF data stores that hold the dataset ((B_11, …, B_1K), c_1), ((B_21, …, B_2K), c_2), …, ((B_n1, …, B_nK), c_n), and outputs a Classifier that maps a new instance to a predicted class]

8 Learning Algorithms
1.Aggregation
–Simple aggregation (max, min, avg, etc.)
–Vector distance aggregation (Perlich and Provost [12])
2.Generative Models
–Naïve Bayes (with 4 different distributions): Bernoulli, Multinomial, Dirichlet, Polya (Dirichlet-Multinomial)
Key sufficient statistic: count for each value, for each instance (= histogram for each instance)
How to obtain this efficiently?
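To make the "histogram for each instance" statistic concrete, here is a minimal sketch of a multinomial naïve Bayes learner that consumes only per-instance histograms. The toy tag bags, the Laplace smoothing parameter `alpha`, and all function names are assumptions made for illustration; the slides do not give the estimator used in the paper.

```python
import math
from collections import Counter


def histogram(bag):
    """Count of each value in one instance's bag (the sufficient statistic)."""
    return Counter(bag)


def train_multinomial_nb(histograms, labels, vocabulary, alpha=1.0):
    """Fit class priors and smoothed value likelihoods from histograms only."""
    class_counts = Counter(labels)
    value_counts = {c: Counter() for c in class_counts}
    for h, c in zip(histograms, labels):
        value_counts[c].update(h)  # add this instance's counts to its class
    model = {}
    for c, n_c in class_counts.items():
        total = sum(value_counts[c].values()) + alpha * len(vocabulary)
        model[c] = {
            "log_prior": math.log(n_c / len(labels)),
            "log_lik": {v: math.log((value_counts[c][v] + alpha) / total)
                        for v in vocabulary},
        }
    return model


def predict(model, h):
    """Return the class with the highest log posterior for histogram `h`."""
    def score(c):
        lik = model[c]["log_lik"]
        return model[c]["log_prior"] + sum(k * lik[v] for v, k in h.items() if v in lik)
    return max(model, key=score)


# Hypothetical toy data: each user is a bag of tags with a Y/N class label.
bags = [["rock", "rock", "indie"], ["pop", "pop"], ["pop", "dance", "pop"]]
labels = ["Y", "N", "N"]
vocabulary = sorted({tag for bag in bags for tag in bag})
model = train_multinomial_nb([histogram(b) for b in bags], labels, vocabulary)
```

The learner never touches the raw bags after `histogram` is applied, which is why counts obtained remotely (for example, through aggregate queries) would be enough to train it.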

9 Obtaining Statistics for Learning
Schema: [Figure: chain User, Track, Artist, Tag]
Data Graph: [Figure: users linked to tracks, tracks to artists, artists to tags]
Matrix Representation: [Figure: User, Track, Artist, Tag link matrices and the resulting User × Tag matrix]
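The figures for this slide did not survive in the transcript. One plausible reading, offered here only as an assumption, is that each link in the chain (User to Track, Track to Artist, Artist to Tag) is an adjacency matrix, and that multiplying the matrices along the chain gives a User × Tag matrix of path counts, i.e. one tag histogram per user. The toy matrices below are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Hypothetical toy links: 2 users, 3 tracks, 2 artists, 3 tags.
user_track = [[1, 1, 0],    # user 0 listens to tracks 0 and 1
              [0, 0, 1]]    # user 1 listens to track 2
track_artist = [[1, 0],     # tracks 0 and 1 are by artist 0
                [1, 0],
                [0, 1]]     # track 2 is by artist 1
artist_tag = [[1, 1, 0],    # artist 0 carries tags 0 and 1
              [0, 1, 1]]    # artist 1 carries tags 1 and 2

# Entry [u][t] counts the User -> Track -> Artist -> Tag paths from u to t.
user_tag = matmul(matmul(user_track, track_artist), artist_tag)
```

Under this reading, each row of `user_tag` is the per-user histogram that the learning algorithms need, and the cost of shipping the full matrix is what motivates the approximation that follows.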

10 Approximating Statistics
Column Projection: [Figure: projection of the User × Tag matrix accumulated along the chain User, Track, Artist, Tag]
Row Projection: [Figure: projection of the User × Tag matrix accumulated along the chain User, Track, Artist, Tag]
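The slide does not define the two projections, so the sketch below assumes the simplest interpretation: one projection is the vector of row sums (a total per user) and the other is the vector of column sums (a total per tag) of the User × Tag matrix. Both can be accumulated by passing a single vector along the chain, without ever forming the full matrix. The matrices and function names are hypothetical.

```python
def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(x * y for x, y in zip(v, col)) for col in zip(*m)]


def row_sums_via_chain(chain):
    """Per-user totals: start with all ones at the tag end, pass the vector back."""
    v = [1] * len(chain[-1][0])
    for m in reversed(chain):
        v = mat_vec(m, v)
    return v


def column_sums_via_chain(chain):
    """Per-tag totals: start with all ones at the user end, pass the vector forward."""
    v = [1] * len(chain[0])
    for m in chain:
        v = vec_mat(v, m)
    return v


# Hypothetical toy chain: User x Track, Track x Artist, Artist x Tag.
chain = [
    [[1, 1, 0], [0, 0, 1]],
    [[1, 0], [1, 0], [0, 1]],
    [[1, 1, 0], [0, 1, 1]],
]
per_user = row_sums_via_chain(chain)
per_tag = column_sums_via_chain(chain)
```

Each hop transfers one vector whose length is the number of entities at that store, rather than a matrix, which is the kind of saving the approximation aims for.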

11 Approximating Statistics
[Figure: unknown User × Tag matrix, marked "?"]
Could we approximate this matrix from the two projections?
CT scans to the rescue!
CT: Reconstruct a 3D object from its projected slices
We want: Reconstruct a 2D matrix from its projections
Source: http://health-fts.blogspot.com/2012/01/brain-ct-mri.html
Source: https://www.medicalradiation.com/types-of-medical-imaging/imaging-using-x-rays/computed-tomography-ct/

12 Approximating Statistics
[Figure: unknown User × Tag matrix, marked "?"]
We adapted one of the simplest reconstruction methods: the Algebraic Reconstruction Technique (ART)
Proposed scheme
1.Use SPARQL queries to accumulate and pass along column and row vectors, and ultimately send them back to the learner
2.The learner uses a CT method to reconstruct the matrix from the projections
3.Use the approximated matrix to compute the necessary statistics for learning
Dramatically reduces communication!
How accurate are the learned classifiers?
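The reconstruction step can be sketched as a basic ART (Kaczmarz-style) iteration. The slides do not give the update rule, so this is an illustration under assumptions: the projections are taken to be row sums and column sums, the iteration starts from a zero matrix, and the `relaxation` and `sweeps` parameters are hypothetical.

```python
def art_reconstruct(row_sums, col_sums, sweeps=10, relaxation=1.0):
    """Approximate a matrix from its row and column sums.

    Each row or column sum is treated as one linear equation; the residual of
    that equation is spread evenly over the entries it covers.
    """
    m, n = len(row_sums), len(col_sums)
    x = [[0.0] * n for _ in range(m)]
    for _ in range(sweeps):
        for i in range(m):  # enforce each row-sum equation in turn
            step = relaxation * (row_sums[i] - sum(x[i])) / n
            for j in range(n):
                x[i][j] += step
        for j in range(n):  # enforce each column-sum equation in turn
            step = relaxation * (col_sums[j] - sum(x[i][j] for i in range(m))) / m
            for i in range(m):
                x[i][j] += step
    return x


# Hypothetical projections of a 2-user by 3-tag count matrix.
approx = art_reconstruct(row_sums=[4, 2], col_sums=[2, 3, 1])
approx_row_sums = [sum(row) for row in approx]
approx_col_sums = [sum(col) for col in zip(*approx)]
```

Many matrices share the same two projections, so the result is an approximation that matches the projections rather than a recovery of the original counts. This sketch also does not clamp entries to be non-negative, which a practical variant for count data might add.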

13 Experimental Results
Two subsets of the Last.fm dataset
2 aggregation and 4 naïve Bayes models
Compared against the centralized counterpart
–Uses the exact matrix for learning
Accuracy Results (10-fold cross validation)
–The effect of the ART approximation depends on the model
–NB(Pol) is competitive, even in the ART-approximated case
–NB(Mul) is competitive too, despite using less information than NB(Pol)
NB (Bernoulli) and NB (Multinomial) only need the projections for learning, hence their results are identical (*)
Sensitivity of ART on different models [Not covered in this talk]

14 Communication Complexity
Size of query results transferred vs. size of the dataset (# users)
The projections are several orders of magnitude smaller

15 Conclusion
Challenges
–Multiple interlinked, physically distributed, autonomously maintained RDF data stores
–The learner may be prohibited from downloading all data due to limits on bandwidth, access, storage and memory, and due to privacy and confidentiality constraints
We need
–Learning from multiple interlinked RDF stores that support only indirect access to data (e.g. SPARQL query interface)
Contributions
–Statistical query-based formulations of several representative algorithms for learning classifiers from RDF data
–Distributed learning framework for RDF stores that form a chain [Not covered in this talk]
–Identification of 3 special cases of RDF data fragmentation [Not covered in this talk]
–Novel application of matrix reconstruction from Computerized Tomography for approximating statistics, which dramatically reduces communication
–Experimental results demonstrating feasibility

16 Related Work and Future Work
Related Work
–Most existing work on learning from RDF data assumes direct access
–Lin et al. [10] learn relational Bayesian classifiers from a single remote RDF store via SPARQL queries
–This work extends the remote access framework [20] to multiple RDF stores
Future Work
–Consider more recent and complex CT methods
–Explore other ways of taking projections
–Consider more complex RDF data fragmentations
–Consider richer classes of learning models

