Applying logic to practice in computer science

Ron Fagin IBM Research—Almaden

First Case Study: Garlic (1996)

Garlic: What Was the Problem?
Laura Haas: “Mr. Database Theoretician, we’ve got a problem with Garlic, our multimedia database system!”
The answers to queries in DB/2 are sets.
The answers to queries in QBIC are sorted lists.
How do you combine the results?
Garlic combines results from underlying databases such as DB/2 and QBIC.

Example
Searching a CD database for Artist = “Beatles” yields a set, via, say, DB/2.
MusicBrainz has 12 million recordings in its database.

Example
AlbumColor = “Red” yields a sorted list, via, say, QBIC.
Redness scores, in sorted order: .697, .683, .670, .659, .629

Example
How do we make sense of (Artist = ‘Beatles’) ∧ (AlbumColor = ‘Red’)?
Here it is probably a list of albums by the Beatles, sorted by how red they are.
What about (Artist = ‘Beatles’) ∨ (AlbumColor = ‘Red’)?
And what about (Color = ‘Red’) ∧ (Shape = ‘Round’)?

What Was My Solution?
These weren’t just sorted lists: they were scored lists.
Can view sets as scored lists (scores 0 or 1).
This reminded me of fuzzy logic.
In fuzzy logic, conjunction (∧) is min, and disjunction (∨) is max.
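The min/max reading of ∧ and ∨ can be sketched in a few lines of Python. This is only an illustration: the album names and scores are made up, and `combine` is a hypothetical helper, not part of Garlic. The set-valued condition is represented as a scored list with scores 0 or 1.

```python
# Hypothetical data: a set is a scored list whose scores are 0 or 1.
artist_is_beatles = {"album_a": 1.0, "album_b": 1.0, "album_c": 0.0}
album_redness = {"album_a": 0.2, "album_b": 0.7, "album_c": 0.9}

def fuzzy_and(x, y):
    """Fuzzy conjunction: the minimum of the two scores."""
    return min(x, y)

def fuzzy_or(x, y):
    """Fuzzy disjunction: the maximum of the two scores."""
    return max(x, y)

def combine(op, left, right):
    """Score every object under op and return (object, score) pairs, best first."""
    scored = {obj: op(left[obj], right[obj]) for obj in left}
    return sorted(scored.items(), key=lambda pair: pair[1], reverse=True)

conjunction = combine(fuzzy_and, artist_is_beatles, album_redness)
disjunction = combine(fuzzy_or, artist_is_beatles, album_redness)
print(conjunction)  # Beatles albums first, ordered by redness
print(disjunction)
```

With these made-up scores, the conjunction ranks the Beatles albums by redness and pushes the non-Beatles album to the bottom with score 0, which matches the intended reading of (Artist = ‘Beatles’) ∧ (AlbumColor = ‘Red’).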

Ron Fagin: Use fuzzy logic.
Laura Haas: I like your solution. But we also need an efficient algorithm that can find the top k results while minimizing database accesses.
⋮
Ron Fagin: I have an algorithm that finds the top k with only √n database accesses.
Laura Haas: Good, that beats linear! But we database people are spoiled, and are used to only log n accesses. Be smarter and get me a log n algorithm.
⋮
Ron Fagin: I proved that you can’t do better than √n.

Time for the Accesses
Say n = 12,000,000 CDs.
Assume 1000 accesses per second.
n accesses (naïve algorithm) would take about 3 hours.
√n accesses would take about 3 seconds.
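The two running times follow from simple arithmetic, which can be checked as below. The variable names are only for illustration; the figures n = 12,000,000 and 1000 accesses per second come from the slide.

```python
import math

n = 12_000_000               # number of CDs
accesses_per_second = 1000   # assumed access rate

naive_seconds = n / accesses_per_second            # n accesses
sqrt_seconds = math.sqrt(n) / accesses_per_second  # sqrt(n) accesses

naive_hours = naive_seconds / 3600
print(f"n accesses:       {naive_hours:.2f} hours")
print(f"sqrt(n) accesses: {sqrt_seconds:.2f} seconds")
```

So n accesses take a little over 3 hours (12,000 seconds), while √n ≈ 3464 accesses take roughly 3.5 seconds.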

Generalizing the Algorithm
The algorithm works for arbitrary monotone scoring functions.
Monotone: increasing the scores of the arguments cannot decrease the overall score.

Influence
Algorithm implemented in Garlic.
Influenced other IBM products, including:
  Watson Bundled Search system
  InfoSphere Federation Server
  WebSphere Commerce
The paper introducing my algorithm (now called “Fagin’s Algorithm”) has around 900 citations (Google Scholar).

The Threshold Algorithm
In 2001, we (Amnon Lotem, Moni Naor, and Ron Fagin) found the Threshold Algorithm.

The Problem
There are m attributes.
Each object in a database has a score xi for attribute i.
The objects are given in m sorted lists, one list per attribute.
Goal: find the top k objects according to a monotone scoring function, while minimizing access to the lists.
Can think of the attributes as voters, and the objects as candidates, where each voter assigns a score to each candidate.

Multimedia Example
REDNESS        ROUNDNESS
177: 0.993     235: 0.999
139: 0.991     666: 0.996
702: 0.982     820: 0.992
. . .          . . .
235: 0.325     177: 0.406

Scoring Functions
Let f be the scoring function.
Popular choices for f:
  min (used in fuzzy logic)
  average
Let x1,…, xm be the scores of object R under the m attributes.
Then f(x1,…, xm) is the overall score of object R.
Sometimes write f(R) to mean f(x1,…, xm).
A scoring function f is monotone if whenever xi ≤ yi for every i, then f(x1,…, xm) ≤ f(y1,…, ym).
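The monotonicity condition can be tested by brute force on a small grid of scores. The sketch below is only a sanity check, not a proof: `is_monotone_on_grid` and the non-monotone `spread` function are hypothetical names introduced for illustration.

```python
import itertools

def f_min(scores):
    return min(scores)

def f_avg(scores):
    return sum(scores) / len(scores)

def spread(scores):
    """A deliberately non-monotone function, for contrast (hypothetical)."""
    return max(scores) - min(scores)

def is_monotone_on_grid(f, m, grid):
    """Check f(x) <= f(y) whenever x_i <= y_i for every i, over all grid points."""
    points = list(itertools.product(grid, repeat=m))
    for x in points:
        for y in points:
            dominated = all(xi <= yi for xi, yi in zip(x, y))
            if dominated and f(x) > f(y):
                return False
    return True

grid = [0.0, 0.5, 1.0]
print(is_monotone_on_grid(f_min, 2, grid))
print(is_monotone_on_grid(f_avg, 2, grid))
print(is_monotone_on_grid(spread, 2, grid))
```

Here min and average pass the check, while `spread` fails: raising the scores from (0, 1) to (1, 1) lowers its value from 1 to 0.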

Modes of Access
Sorted (or sequential) access: can obtain the next object and its score for attribute i.
Random access: can obtain the score of object R for attribute i.
Wish to minimize the total number of accesses.

Algorithms
Want an algorithm for finding the top k objects.
The naïve algorithm retrieves every score of every object: too expensive.

Threshold Algorithm
Do sorted access in parallel to each of the m scored lists.
As each object R is seen under sorted access:
  Do random access to retrieve all of its scores x1,…, xm.
  Compute its overall score f(x1,…, xm).
  If this is one of the top k answers so far, remember it.
For each list i, let ti be the score of the last object seen under sorted access.
Define the threshold value T to be f(t1,…, tm).
When k objects have been seen whose overall score is at least T, stop.
Return the top k answers.
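The steps of the Threshold Algorithm can be sketched in Python as follows. This is a minimal in-memory illustration under stated assumptions, not the Garlic implementation: random access is simulated with dictionaries, every object is assumed to appear in every list, all seen scores are cached rather than only the current top k, and the function name, object names, and scores are hypothetical.

```python
import heapq

def threshold_algorithm(lists, f, k):
    """Return (top_k, rounds); top_k is a list of (overall_score, object) pairs.

    lists: one list per attribute, each a list of (object, score) pairs sorted
           by score, highest first. Every object is assumed to be in every list.
    f:     monotone scoring function taking a list of m scores.
    """
    m = len(lists)
    # Random access is simulated here by one dictionary per attribute.
    lookup = [dict(pairs) for pairs in lists]
    overall = {}  # object -> overall score, for every object seen so far
    top_k = []
    rounds = 0
    for row in zip(*lists):  # one sorted access to each list per round
        rounds += 1
        for obj, _ in row:
            if obj not in overall:
                overall[obj] = f([lookup[i][obj] for i in range(m)])
        # Threshold: f applied to the last score seen in each list.
        threshold = f([score for _, score in row])
        top_k = heapq.nlargest(k, ((s, o) for o, s in overall.items()))
        if len(top_k) == k and top_k[-1][0] >= threshold:
            break  # k objects have overall score at least the threshold
    return top_k, rounds

# Made-up example data, sorted by score in descending order.
redness = [("a", 0.9), ("b", 0.8), ("c", 0.3), ("d", 0.1)]
roundness = [("b", 0.95), ("c", 0.7), ("a", 0.6), ("d", 0.2)]

best, rounds = threshold_algorithm([redness, roundness], min, 2)
print(best, rounds)
```

On this made-up data with f = min and k = 2, the thresholds after successive rounds are 0.9, 0.7, and 0.3. The algorithm halts after the third round with objects b and a, without ever reaching object d under sorted access.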

Threshold Algorithm: Example (using min)
Scoring function is min.

Step 1: sorted access sees 177 (redness 0.993) and 235 (roundness 0.999).
  Random access retrieves the remaining scores: roundness of 177 is 0.406, redness of 235 is 0.325.
  Overall score for 177: min(0.993, 0.406) = .406
  Overall score for 235: min(0.325, 0.999) = .325
  Threshold value: min(.993, .999) = .993

Step 2: sorted access sees 139 (redness 0.991) and 666 (roundness 0.996).
  Threshold value: min(.991, .996) = .991

Step 3: sorted access sees 702 (redness 0.982) and 820 (roundness 0.992).
  Threshold value: min(.982, .992) = .982

Correctness of the Halting Rule
Suppose the current top k objects have scores at least T (the current threshold).
Assume (by way of contradiction): R is unseen, S is in the current top k, and f(R) > f(S).
R has scores x1,…, xm
  ⇒ xi ≤ ti for every i (as R has not been seen)
  ⇒ f(R) = f(x1,…, xm) ≤ f(t1,…, tm) = T ≤ f(S)   (by monotonicity of f)
  ⇒ contradiction!

Instance Optimality
A = class of algorithms, D = class of legal inputs.
For each algorithm A in the class A and each input D in the class D, have cost(A,D) ≥ 0.
An algorithm A in the class A is instance optimal over A and D if there are constants c1 and c2 such that for every algorithm A’ in A and every input D in D:
  cost(A,D) ≤ c1 · cost(A’,D) + c2
c1 is called the optimality ratio.

Instance Optimality of TA
Intuition about why TA is instance optimal: it cannot stop any sooner, since the next object to be explored might have the threshold value.
But life is a bit more delicate...

Wild Guesses
Wild guess: random access for a field i of an object R that has not been sequentially accessed before.
Neither FA nor TA uses wild guesses.
A subsystem might not allow wild guesses.

Instance Optimality of TA
Theorem: For each monotone f, let
  A be the class of algorithms that correctly find the top k answers, with scoring function f, for every database, and that do not make wild guesses;
  D be the class of all databases.
Then TA is instance optimal over A and D.
The optimality ratio is m + m(m−1)·cR/cS, where cR is the cost of a random access and cS is the cost of a sorted access. This is best possible!

Amnon Lotem, Moni Naor, and Ron Fagin: Our “threshold algorithm” is an even better algorithm (optimal in a stronger sense).
Laura Haas: But Ron, you told me that your algorithm is optimal!?
Ron Fagin: Well, Laura, there is optimal, and then there is optimal.

Influence
We submitted the paper to PODS ’01.
I was worried that the Threshold Algorithm was so simple that the paper would be rejected.
So I called it a “remarkably simple algorithm”.
The paper won the PODS Best Paper Award!
The paper was very influential:
  Over 1900 citations (Google Scholar)
  PODS Test of Time Award in 2011
  IEEE Technical Achievement Award in 2011
  Gödel Prize in 2014
  Gems of PODS 2016

Applications of TA
relational databases, multimedia databases, music databases, semistructured databases, text databases, uncertain databases, probabilistic databases, graph databases, spatial databases, spatio-temporal databases, web-accessible databases, XML data, web text data, semantic web, high-dimensional datasets, information retrieval, fuzzy data sets, data streams, search auctions, wireless sensor networks, distributed sensor networks, distributed networks, social-tagging networks, document tagging systems, peer-to-peer systems, recommender systems, personal information management systems, group recommendation systems, document annotation

Morals
How did theory help?
  Resolving Laura Haas’s dilemma
  Knowledge of the literature (fuzzy logic)
  Abstraction (using scoring functions)
  Devising optimal algorithms and proving optimality
Figure out the real problem
  For example, there are scores, not just sorted lists
Don’t stop at the original problem
  Example: doing a weighted version (with Ed Wimmers)
  Led to a successful and influential body of work

Measures of Success
Making our products better
  An ultimate measure of success for practitioners
Creating a new subfield
  An ultimate measure of success for theoreticians
A paper that arose by resolving a practical problem won the Gödel Prize!

Second Case Study: Clio (2003)

Clio
Clio deals with “data exchange,” where we convert data from one format to another.
When Laura Haas started Clio, I followed her.
I attended Clio meetings for a year.
Phokion Kolaitis, Renee Miller, Lucian Popa, and Ron Fagin: “Let’s start from scratch and lay the foundations for data exchange!”

Data Exchange
Translate data from source format to target format.
[Diagram: Source Schema S (instance I), Target Schema T (instance J), connected by Σ]

Data Exchange
Data exchange is an old, but recurrent, database problem.
Phil Bernstein (2003): “Data exchange is the oldest database problem”
EXPRESS (IBM San Jose Research Lab, 1977) transforms data between hierarchical databases.
Data exchange underlies data warehousing, ETL (Extract-Transform-Load), …

Example
Source: EM(EMP, MGR)    Target: ED(EMP, DEPT), DM(DEPT, MGR)
The relationship between source and target (the “schema mapping”) is specified by tuple-generating dependencies (tgds), originally used to help specify “normal forms” for relational databases:
  EM(e,m) → ∃d (ED(e,d) ∧ DM(d,m))

Example
Source
  EMP       MGR
  Fagin     Haas
  Clarkson  Haas
  Haas      Welser
Target
  EMP  DEPT        DEPT  MGR
  (to be filled in)

Example: 3 Possible Solutions
Source
  EMP       MGR
  Fagin     Haas
  Clarkson  Haas
  Haas      Welser
Target, solution 1
  EMP       DEPT         DEPT    MGR
  Fagin     Haas         Haas    Haas
  Clarkson  Haas         Welser  Welser
  Haas      Welser
Target, solution 2
  EMP       DEPT         DEPT    MGR
  Fagin     d1           d1      Haas
  Clarkson  d1           d2      Welser
  Haas      d2
Target, solution 3
  EMP       DEPT         DEPT    MGR
  Fagin     d1           d1      Haas
  Clarkson  d2           d2      Haas
  Haas      d3           d3      Welser

Which Solution Should We Produce?
We define a “universal” solution to be one that is as general as possible.
The third solution is universal.

Target Constraints
Might have target constraints specified by equality-generating dependencies (egds), like
  (DM(d,m) ∧ DM(d',m)) → (d = d')
If this egd is a target constraint, then the second solution is universal.

How Do We Obtain a Universal Solution?
There is a well-known mechanical procedure called the “chase”, originally used as a tool in database design.
We use the chase to generate the target from the source efficiently.
Example: EM(e,m) → ∃d (ED(e,d) ∧ DM(d,m))
  From EM(Fagin, Haas), create ED(Fagin, d) and DM(d, Haas), where d is a newly introduced “labeled null”.
The egds tell when to equate labeled nulls.
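The chase step for this one tgd, and the optional egd step, can be sketched as below. This is a simplified illustration hard-coded for the single tgd EM(e,m) → ∃d (ED(e,d) ∧ DM(d,m)) and the single egd (DM(d,m) ∧ DM(d',m)) → (d = d'); it is not a general chase engine. The function name, the `manager_has_one_dept` flag, and the three sample EM tuples are assumptions made for the example.

```python
import itertools

def chase(em_tuples, manager_has_one_dept=False):
    """Chase EM with the tgd EM(e,m) -> exists d (ED(e,d) and DM(d,m)).

    If manager_has_one_dept is True, also apply the egd
    (DM(d,m) and DM(d',m)) -> (d = d') by equating labeled nulls.
    """
    fresh = (f"d{i}" for i in itertools.count(1))  # labeled nulls d1, d2, ...
    ed, dm = [], []
    for e, m in em_tuples:
        d = next(fresh)        # a new labeled null for each source tuple
        ed.append((e, d))
        dm.append((d, m))
    if manager_has_one_dept:
        # egd step: nulls that share a manager are equated; the first null
        # introduced for each manager is kept as the representative.
        representative = {}
        rename = {}
        for d, m in dm:
            representative.setdefault(m, d)
            rename[d] = representative[m]
        ed = [(e, rename[d]) for e, d in ed]
        dm = sorted({(rename[d], m) for d, m in dm})
    return ed, dm

# Sample source tuples (assumed for illustration).
source = [("Fagin", "Haas"), ("Clarkson", "Haas"), ("Haas", "Welser")]

ed_tgd_only, dm_tgd_only = chase(source)
ed_with_egd, dm_with_egd = chase(source, manager_has_one_dept=True)
print(ed_tgd_only, dm_tgd_only)
print(ed_with_egd, dm_with_egd)
```

Without the egd, every source tuple gets its own labeled null. With the egd, the nulls for employees who share a manager are merged, so Fagin and Clarkson end up in the same department. The surviving nulls are named d1 and d3 here; the naming of labeled nulls is immaterial.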

Composing Schema Mappings
[Diagram: Schema S1, Schema S2, Schema S3; mapping Σ23 between S2 and S3, and composed mapping Σ13 between S1 and S3]
With Phokion Kolaitis, Lucian Popa, and Wang-Chiew Tan, we studied composition of schema mappings.
Composition can take us out of first-order logic!
We found the right language for composition (“second-order tgds”).
We gave an algorithm for composition.

Second-order tgds
An example of an SO tgd:
  ∃f ∀e ∀m (EM(e,m) → (ED(e, f(m)) ∧ DM(f(m), m)))
An SO tgd is a formula of the form
  ∃f ( ∀x1 (φ1 → ψ1) ∧ … ∧ ∀xk (φk → ψk) )
where (a) each φi is a conjunction of atomic formulas not involving function symbols, and possibly equalities of terms, and (b) each ψi is a conjunction of atomic formulas, not including equality.

Second-order tgds (cont.)
Recall that existential second-order logic = NP.
Therefore, we might suspect that the following theorem holds:
Theorem: There is a second-order tgd σ such that deciding if (I,J) ⊨ σ is an NP-complete problem.
Proof: Let σ be ∃f ∀x ∀y (E(x,y) → D(f(x), f(y))).
Let D = {(r,g), (r,b), (g,b), (g,r), (b,r), (b,g)}.
Then σ says that E is 3-colorable.
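The reading of σ as 3-colorability can be made concrete with a brute-force check that searches for the function f. This is an illustrative sketch only: `satisfies_sigma` and the two sample graphs are hypothetical, and the exhaustive search takes exponential time, as one would expect for an NP-complete problem.

```python
from itertools import product

COLORS = ("r", "g", "b")
# D holds exactly the pairs of distinct colors, as in the proof.
D = {(c1, c2) for c1 in COLORS for c2 in COLORS if c1 != c2}

def satisfies_sigma(E):
    """Is there a function f with D(f(x), f(y)) for every edge E(x, y)?"""
    nodes = sorted({v for edge in E for v in edge})
    for assignment in product(COLORS, repeat=len(nodes)):
        f = dict(zip(nodes, assignment))  # one candidate for f
        if all((f[x], f[y]) in D for x, y in E):
            return True
    return False

# Sample graphs (assumed for illustration).
triangle = {(1, 2), (2, 3), (1, 3)}
k4 = {(a, b) for a in range(1, 5) for b in range(1, 5) if a < b}

print(satisfies_sigma(triangle))  # a triangle can be 3-colored
print(satisfies_sigma(k4))        # the complete graph on 4 nodes cannot
```

A witness f is exactly a proper 3-coloring: it sends the endpoints of every edge to a pair in D, that is, to two different colors.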

Second-order tgds (cont.)
SO tgds are the right language for composing FO tgds, in the following sense:
  The composition of any number of FO tgds gives an SO tgd.
  Every SO tgd is the result of composing some finite number of FO tgds.
In fact, in joint work with Marcelo Arenas and Alan Nash, we showed that, surprisingly, every SO tgd is the result of composing just two FO tgds.

Measures of Success
Used in DB2 Control Center, Rational Data Architect, and Content Manager:
  Using universal solutions
  Using our algorithm to produce a universal solution, and our algorithm to compose schema mappings
Our initial paper won the International Conference on Database Theory (ICDT) Test of Time Award in 2013.
With over 1100 citations, our paper was the 2nd most highly cited paper of the decade in the journal TCS.
Our paper on composition won the PODS Test of Time Award in 2014, and our follow-up paper on composition won the ICDT 2010 Best Paper Award.
This work created a new subfield:
  Special sessions on data exchange in major database conferences

Morals
How did theory help?
  Established principles rather than ad hoc approaches
  Yielded algorithms for converting data, and for composing schema mappings
Theorists need a partner to keep us honest
Never too late to lay the foundations for an area, even for existing systems
  Can cause essential changes and improvement
Again, don’t stop at the original problem

Conclusions

Conclusions (for System Builders)
Consult with theoreticians
  Explaining the problem is useful by itself
Principled approaches can improve your product
Better or new algorithms can differentiate your product
Algorithm analysis can provide performance expectations and product guarantees
Abstractions can expand the function of your product

Conclusions (for Theoreticians)
Involvement with system builders can help your theory!
  Novel questions will be asked
  New models and new, interesting areas of study will arise
  Implementation can reveal weaknesses in the theory
Theory will be relevant
  Practical impact!
