Feature Generation and Selection in SRL Alexandrin Popescul & Lyle H. Ungar Presented By Stef Schoenmackers.

Presentation transcript:

Feature Generation and Selection in SRL Alexandrin Popescul & Lyle H. Ungar Presented By Stef Schoenmackers

Overview
- Structural Generalized Linear Regression (SGLR) Overview
- Design Motivations
- Experiments
- Conclusions

SGLR Overview
- Adds statistical methods to ILP
  - SQL as the logical language
  - Generalized Linear Regression as the statistical method
- Uses clustering to generate new relations
- Builds discriminative models
  - Targeted at large problems where generative models are impossible
- Integrates feature generation and problem modeling

SGLR Loop

SGLR Method
- Clusters data and adds clusters as new relations
- Searches the space of SQL query refinements
  - Features are numerical SQL aggregates
  - Tests each feature with a statistical measure (e.g. AIC, BIC)
  - Adds only significantly predictive features
  - Examines each feature only once
  - Uses the current set of features to guide the search
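The slide names AIC and BIC as example measures but does not show how a candidate feature is accepted. A minimal sketch of one plausible acceptance test follows, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The function names and the "keep the feature only if BIC decreases" rule are assumptions for illustration, not the presenters' implementation.

```python
import math


def aic(log_likelihood: float, num_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * num_params - 2 * log_likelihood


def bic(log_likelihood: float, num_params: int, num_examples: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return num_params * math.log(num_examples) - 2 * log_likelihood


def keep_feature(ll_without: float, ll_with: float,
                 num_params_without: int, num_examples: int) -> bool:
    """Hypothetical acceptance rule: keep the candidate feature only if
    the model that includes it has a lower BIC than the model without it.
    The log-likelihoods are assumed to come from refitting the model."""
    before = bic(ll_without, num_params_without, num_examples)
    after = bic(ll_with, num_params_without + 1, num_examples)
    return after < before
```

With 10,000 examples, a feature that raises the log-likelihood from -700 to -650 passes this test, while one that raises it only to -699 does not, because the ln n penalty for the extra parameter outweighs the gain.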

Overview
- Structural Generalized Linear Regression (SGLR) Overview
- Design Motivations
- Experiments
- Conclusions

SQL Motivation
- Most of the world's data is in relational databases
  - Can exploit schema and meta-information
- SQL is a fairly expressive language
  - Non-recursive first-order logic formulas
- Relational DBs have been studied and optimized for decades, so they should be more scalable than the alternatives

Clustering Motivation
- Dimensionality reduction
- Clusters are added as relations (new first-class concepts)
  - Increases expressivity of the language describing patterns in the data
  - Can lead to more rapid discovery of predictive features
- Done as a pre-processing step
  - cost(clustering) << cost(feature search)

Aggregation Motivation
- Summarizes the information in a table into scalar values usable by a statistical model
  - average, max, min, count, empty/exists (0/1)
- Exploits the work databases have put into making aggregates efficient
- Provides a richer space of features to choose from
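As an illustration of turning a relational query into scalar features, here is a small sketch using Python's built-in sqlite3 module. The tiny Citation and Author tables, their contents, and the function names are made up for the example; only the idea of count and exists aggregates comes from the slide.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Citation (doc1 TEXT, doc2 TEXT)")
    conn.execute("CREATE TABLE Author (doc TEXT, person TEXT)")
    conn.executemany("INSERT INTO Citation VALUES (?, ?)",
                     [("d1", "d2"), ("d1", "d3")])
    conn.executemany("INSERT INTO Author VALUES (?, ?)",
                     [("d2", "a1"), ("d3", "a1"), ("d3", "a2")])
    return conn


def aggregate_features(conn: sqlite3.Connection, doc: str) -> dict:
    """Summarize the rows matching a query for `doc` as scalar features."""
    # count: number of documents cited by `doc`
    cites = conn.execute(
        "SELECT COUNT(*) FROM Citation WHERE doc1 = ?", (doc,)
    ).fetchone()[0]
    # count of unique elements in one column of a join:
    # distinct authors of the documents cited by `doc`
    cited_authors = conn.execute(
        "SELECT COUNT(DISTINCT a.person) "
        "FROM Citation c JOIN Author a ON a.doc = c.doc2 "
        "WHERE c.doc1 = ?", (doc,)
    ).fetchone()[0]
    return {
        "cites_count": cites,
        "cites_exists": 1 if cites > 0 else 0,  # empty/exists as 0/1
        "cited_authors_distinct": cited_authors,
    }
```

Each value in the returned dictionary is a single number per document, which is the form a regression model can consume directly.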

Dynamic Feature Generation
- Most features do not provide useful information
- In large domains, feature generation is expensive, and precomputing all possible features is far too time-consuming
- Solution: use a smarter search strategy and generate features dynamically
- Let the features already selected influence which features are added
- Focuses only on the promising areas of the search space

Feature Streams
- Put features into different evaluation queues
- Choose the next feature from the 'best' stream
- If a feature is in multiple streams, evaluate it only once
- Stream design can use prior knowledge/bias

Refinement Graphs (in ILP)
- Start with the most general rule and 'refine' it to produce more specific clauses
  - Single variable substitution
  - Add a predicate involving 1+ existing variables
- Uses top-down breadth-first search to find the most general rule that covers only positive examples
- Performs poorly in noisy domains

Refinement Graphs (in SGLR)
- Adds one relation to a query and expands it into all possible configurations of equality conditions between the new attributes and a new or old attribute
  - Each configuration contains at least one equality condition between a new and an old attribute
  - Any attribute can be set to a constant
  - High-level variable typing/classes are enforced
- Not all refinements are most general, but this simplifies pruning of equivalent subspaces (it accounts only for the type and number of relations joined in a query)
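The refinement step above can be sketched roughly as follows. This is a simplified illustration under stated assumptions: the SCHEMA dictionary, the variable naming, and the refine function are hypothetical, constants are left out, and equalities among the new attributes themselves are not enumerated. It covers only the typed "join each new attribute to an old variable or leave it fresh, with at least one join to an old variable" part.

```python
from itertools import product

# Hypothetical typed schema: relation name -> attribute types.
SCHEMA = {
    "Cites": ("doc", "doc"),
    "Author_of": ("doc", "person"),
}


def refine(query_vars: dict, relation: str, schema: dict = SCHEMA) -> list:
    """Enumerate ways to add `relation` to a query.

    `query_vars` maps each variable already in the query to its type.
    Each attribute of the new relation is either equated with an old
    variable of the same type or bound to a fresh variable; at least
    one attribute must be equated with an old variable.
    Fresh names ("new0", "new1", ...) are assumed not to clash with
    existing variable names.
    """
    choices = []
    for i, attr_type in enumerate(schema[relation]):
        old = [v for v, t in query_vars.items() if t == attr_type]
        choices.append(old + [f"new{i}"])
    refinements = []
    for args in product(*choices):
        # Keep only configurations joined to the existing query.
        if any(a in query_vars for a in args):
            refinements.append((relation, args))
    return refinements
```

Starting from a query over a single document variable d, adding Cites yields Cites(d, d), Cites(d, new1), and Cites(new0, d), while adding Author_of yields only Author_of(d, new1), since the person attribute has no old variable of its type to join with.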

Example Refinement Graph
Query(d)
  Cites(d, d1)
    Cites(d, d1), Cites(d1, d2)
    Cites(d, d1), Author_of(d1, a)
      Cites(d, d1), Author_of(d1, a="Domingos")
  Author_of(d, a)
    Author_of(d, a="Smith")
  Word_count(d, w, int)
(Figure label: DB Tables)

Overview
- Structural Generalized Linear Regression (SGLR) Overview
- Design Motivations
- Experiments
- Conclusions

Experiments
- Used CiteSeer data
  - Citation(doc1, doc2), Author(doc, person), PublishedIn(doc, venue), HasWord(doc, word)
  - 60k docs, 131k authors, 173k citations, 6.8M words
- Two tasks
  - Predict the publication venue
  - Predict the existence of a citation

Experiments
- Cluster all many-to-many relations
  - K-means
  - Added 6 new relations
- Use logistic regression for prediction
- BFS of the search space
- 5k+/5k- examples for venue prediction
- 2.5k+/2.5k- examples for citation prediction

Results
- Venue (87.2%)
- Citation (93.1%)

Dynamic Feature Generation
- Query expressions are generated breadth-first
- Baseline puts all queries into one queue
- Dynamic strategy enqueues queries into separate streams
  - Stream 1: exists and count over table
  - Stream 2: other aggregates (counts of unique elements in individual columns)
  - Chooses the next feature from the stream where (featuresAdded+1)/(featuresTried+1) is max
  - Stops when a stream is empty
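The stream-selection rule can be sketched as below. This is a minimal illustration, not the presenters' code: the Stream class, the dynamic_search function, and the is_significant callback (standing in for the statistical test) are hypothetical names. The scoring ratio, the stopping rule, and the evaluate-each-feature-once rule come from the slides; breaking ties in favor of the earlier stream is an assumption.

```python
from collections import deque


class Stream:
    """A queue of candidate features with success statistics."""

    def __init__(self, name, candidates):
        self.name = name
        self.queue = deque(candidates)
        self.added = 0   # features from this stream accepted so far
        self.tried = 0   # features from this stream evaluated so far

    def score(self) -> float:
        # Ratio from the slide: (featuresAdded + 1) / (featuresTried + 1)
        return (self.added + 1) / (self.tried + 1)


def dynamic_search(streams, is_significant):
    """Repeatedly take the next feature from the best-scoring stream.

    Stops as soon as any stream is empty. A feature that appears in
    several streams is evaluated only once.
    """
    selected = []
    seen = set()
    while all(s.queue for s in streams):
        best = max(streams, key=Stream.score)  # ties: earlier stream wins
        feature = best.queue.popleft()
        if feature in seen:
            continue  # already evaluated via another stream
        seen.add(feature)
        best.tried += 1
        if is_significant(feature):
            best.added += 1
            selected.append(feature)
    return selected
```

A stream whose candidates keep being accepted holds its score near 1 and keeps being chosen, while a stream producing rejected candidates drops toward 0, which is how the features already selected steer the search.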

Results (figure; labels: Venue, Citation, Clusters, No Clusters)

Time Results (figure; labels: Venue, Citation, Clusters, No Clusters)

Domain Independent Learning
- Most citation prediction features are research-area generic
- Can we train a model on one area and test it on another?

Domain Independent Results
- Used KDD-Cup 2003 data (High Energy Physics papers in arXiv)
- Accuracy by train/test data set:
  - Train on CiteSeer, test on arXiv: 92.9%
  - CiteSeer: 92.6%
  - arXiv: 96.0%

Conclusions
- Cluster-based features add expressivity, and apply to any domain or SRL method
- Generating queries dynamically can reduce search time and increase accuracy

Questions?