1 Autocompletion for Mashups Ohad Greenshpan, Tova Milo, Neoklis Polyzotis Tel-Aviv University UCSC

2 Talk Roadmap
- Introduction on Mashups and Autocompletion
- Problem Definition
- The Algorithm
- Implementation & Experiments
- Conclusions & Related Work

3 Introduction - What is a mashup? A mashup integrates data, services, and applications available on the web into a single application.

4 Application Integration [Figure: several applications, each with its own Data, Logic, and GUI layers, integrated through a Mashup Platform]

5 Mashup Development is difficult...
- Choose some relevant components
- Decide which should be connected and learn their spec
[Figure labels: Components Repository, Glue, Mashup Repositories, 10 2]

6 knowledge ?

7 Introduction - Mashup Autocompletion

8 The Mashup Model [Figure labels: Data & logic; Mashlets & Mashlet-APIs; API; Mashlet; Glue Pattern]

9 Inheritance A B A B

10 Mashup Autocompletion – Problem Definition
Given a database of mashlets and glue patterns (GPs) and a set of mashlets selected by the user, identify and rank the GPs that link a subset of the selected mashlets.
Ranking is based on: popularity and relevance to the user query.
What would be the “ideal” GP? The most popular one that connects only the user mashlets and nothing else.
Relaxations:
- Less popular
- Connects variants of the user mashlets
- Connects a subset of the user mashlets
- Connects additional mashlets

11 Inheritance

12 Problem Abstraction
- Each glue pattern is represented as a point in a multidimensional space.
- One dimension represents the GP popularity.
- The remaining dimensions cover all mashlets: 1) user mashlets, 2) other mashlets.
- The goal of the algorithm is to find the top-k GPs that link the given user mashlets (the ones close to the optimal GP).
[Figure: a simplified 3D illustration with axes m1, m2, and GP Popularity. Vectors shown: (0, 0, 0, 0, 0, 0, 0, 0, 0, ...) and g = (0.4, 0.3, 0.2, 0, 1, 0, 0, 1, 0, 1, 0, ...)]
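The abstraction above can be illustrated with a small Python sketch. It is not the authors' scoring function: it assumes that every coordinate measures a gap from the ideal GP (so the ideal GP sits at the origin) and that closeness is plain Euclidean distance. The names `distance_to_ideal` and `closest_gps`, and the example GPs `h` and `i`, are hypothetical.

```python
from math import sqrt

def distance_to_ideal(point):
    """Euclidean distance from a GP's point to the origin.

    Assumption: the ideal GP is the all-zero vector, i.e. each coordinate
    is a gap (popularity gap, mismatch with a user mashlet, or a 0/1 flag
    for an additional mashlet).
    """
    return sqrt(sum(x * x for x in point))

def closest_gps(points, k):
    """Return the names of the k GPs whose points lie closest to the origin."""
    return sorted(points, key=lambda name: distance_to_ideal(points[name]))[:k]

# Hypothetical GPs; "g" reuses the first coordinates shown on the slide.
points = {
    "g": [0.4, 0.3, 0.2, 0, 1, 0],
    "h": [0.1, 0.0, 0.0, 0, 0, 0],
    "i": [0.5, 0.5, 0.0, 0, 0, 0],
}
print(closest_gps(points, 2))
```

Sorting every GP by distance is only practical for small collections, which is why a top-k algorithm over sorted lists is needed at scale.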

13 Data Structure & Basic Top-k Algorithm
Each list holds <gp,score> pairs:
L0: <g1,0.1> <g2,0.2> <g3,0.4> <g4,0.4> <g5,0.4> <g6,0.4> <g7,0.4>
L1: <g7,0.1> <g4,0.2> <g6,0.2> <g1,0.3> <g5,0.4> <g2,0.5> <g3,0.7>
L2: <g4,0.1> <g3,0.2> <g1,0.5> <g2,0.5> <g7,0.5> <g5,0.8> <g6,0.8>
L3: <g1,0.1> <g2,0.6> <g7,0.6> <g6,0.7> <g4,0.8> <g5,0.8> <g3,0.9>
[Figure labels: Glue Patterns, Mashlets, GP Popularity]
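A basic threshold-style top-k scan over the sorted lists above might look like the following Python sketch. It is an illustration, not the slide's exact algorithm, and it rests on several assumptions: lower scores are better, the aggregate is a plain sum (a monotonic function), every GP appears in every list, all lists have the same length, and lists are read round-robin. The name `threshold_top_k` is hypothetical.

```python
import heapq

def threshold_top_k(lists, k):
    """Return the k GPs with the smallest summed score.

    lists: one list of (gp, score) pairs per dimension, sorted ascending.
    Assumes every GP occurs in every list and all lists have equal length.
    """
    lookup = [dict(lst) for lst in lists]        # random access by GP name
    totals = {}                                  # GP -> aggregate score
    for depth in range(len(lists[0])):
        frontier = 0.0
        for lst in lists:                        # one sorted access per list
            gp, score = lst[depth]
            frontier += score
            if gp not in totals:                 # fetch its remaining scores
                totals[gp] = sum(table[gp] for table in lookup)
        best = heapq.nsmallest(k, totals.items(), key=lambda kv: kv[1])
        # No unseen GP can score below the frontier sum, so stop once the
        # k-th best candidate is at or under that threshold.
        if len(best) == k and best[-1][1] <= frontier:
            return [gp for gp, _ in best]
    best = heapq.nsmallest(k, totals.items(), key=lambda kv: kv[1])
    return [gp for gp, _ in best]

# The four lists from the slide.
slide_lists = [
    [("g1", 0.1), ("g2", 0.2), ("g3", 0.4), ("g4", 0.4), ("g5", 0.4), ("g6", 0.4), ("g7", 0.4)],
    [("g7", 0.1), ("g4", 0.2), ("g6", 0.2), ("g1", 0.3), ("g5", 0.4), ("g2", 0.5), ("g3", 0.7)],
    [("g4", 0.1), ("g3", 0.2), ("g1", 0.5), ("g2", 0.5), ("g7", 0.5), ("g5", 0.8), ("g6", 0.8)],
    [("g1", 0.1), ("g2", 0.6), ("g7", 0.6), ("g6", 0.7), ("g4", 0.8), ("g5", 0.8), ("g3", 0.9)],
]
print(threshold_top_k(slide_lists, 2))
```

Under these assumptions the scan can stop before reaching the end of the lists, but it still touches one list per dimension at every depth.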

14 Problems with the algorithm
- The number of lists the algorithm accesses is very large.
- Most of the mashlet lists are unrelated to the user selection (query).

15 Data Structure [Figure labels: Glue Patterns, Mashlets, GP Popularity, User mashlets]

16 Algorithm [Figure residue; recoverable condition: g′[m] = 0 for n < m ≤ |M_all|]

17 Correctness of AC*
Theorem 4.1: Algorithm AC* returns a correct solution.
The proof is based on a lemma showing that any candidate that has not been encountered by AC* has a total score lower than the threshold.
Optimality of AC*
Competing algorithms: C is the class of deterministic algorithms that operate under the same access model as AC*.
- Algorithms receive as input the lists, the monotonic function, and k.
- Algorithms can use any access order (i.e., not specifically round-robin) and any thresholding scheme, and can rely on accessed elements.
Instance optimality: AC* is instance optimal within class C if there are constants c and c0 such that for every input instance I, cost(AC*, I) ≤ c·cost(A, I) + c0 for any A ∈ C.

18 Calculating Popularity
Glue Pattern and Mashlets Rank
- PageRank-style algorithm
- Takes into account the popularity of mashlets and GPs, as well as the relationships between them.
[Figure: mashlets (M) linked to a glue pattern (GP)]
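To make the PageRank-style idea concrete, here is a minimal Python sketch of mutual reinforcement between mashlets and GPs. It is not the authors' formula: the undirected GP-mashlet graph, the damping factor of 0.85, the fixed iteration count, and the function name `popularity` are all assumptions made for illustration. It also assumes that GP names and mashlet names do not collide and that every GP links at least one mashlet.

```python
def popularity(gp_to_mashlets, damping=0.85, iterations=50):
    """Rank GPs and mashlets together on an undirected GP-mashlet graph.

    gp_to_mashlets: dict mapping a GP name to the mashlets it links.
    Returns a dict of scores that sum to 1 across all GPs and mashlets.
    """
    neighbors = {}
    for gp, mashlets in gp_to_mashlets.items():
        for m in mashlets:
            neighbors.setdefault(gp, set()).add(m)
            neighbors.setdefault(m, set()).add(gp)

    n = len(neighbors)
    rank = {node: 1.0 / n for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node in neighbors:
            # Each neighbor shares its current rank equally among its links.
            incoming = sum(rank[o] / len(neighbors[o]) for o in neighbors[node])
            updated[node] = (1 - damping) / n + damping * incoming
        rank = updated
    return rank

# Hypothetical travel-related data: "map" is used by every GP.
scores = popularity({
    "g1": ["map", "hotels"],
    "g2": ["map", "weather"],
    "g3": ["map"],
})
print(sorted(scores, key=scores.get, reverse=True))
```

In this sketch a mashlet becomes popular when many GPs use it, and a GP becomes popular when it links popular mashlets, which mirrors the relationship described on the slide.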

19 Implementation [Figure labels: IBM Mashup Center, WebSphere Application Server, MatchUp Algorithm, Knowledge base; numbered steps 1 to 5]

20 Experiments (synthetic dataset)
Synthetic dataset for large-scale experiments
- Generated a DB of 40k mashlets & GPs (ProgrammableWeb has 4k)
- Based on ProgrammableWeb characteristics
Experiments for synthetic dataset
- Varying # of total mashlets and GPs
- Varying k
- Varying # of user mashlets
- Varying GP complexity

21 Results (synthetic dataset): GP Complexity = 5, varying k

22 Results (synthetic dataset): GP Complexity = 10, varying k

23 Results (synthetic dataset): Varying # of user mashlets

24 Experiments (real dataset)
Real dataset
- Used real-life mashlets from ProgrammableWeb and IBM Mashup Center
- Scenario: development of a travel-related mashup
Experiments for quality assessment
- IBM Mashup Center as the mashup platform
- Users placed mashlets
- MatchUp offered top-10 GPs for their mashlets
- Users searched for alternatives
Results
- User satisfaction was high
- High correlation between suggestions and users’ lists
- Browsing for additional results was in general unsuccessful
- Gluing process was significantly expedited

25 Related Work
Autocompletion in many other domains
- Phrase prediction (Nandi & Jagadish, VLDB 2007)
- File locations (Myers, CHI 2000)
Web service composition
- Model for WS composition (Berardi et al., VLDB 2005)
- Optimized and customized algorithm (McIlraith and Son, KR 2002)
Mashup assembly tools
- MashMaker (Ennals & Garofalakis, SIGMOD 2007): data -> widgets
- MashupAdvisor (Elmeleegy et al., ICWS 2008): mashup -> output recommendation -> assembly to achieve this output

26 Future Work
- Infer semantic inheritance automatically
- Distributed environment
- Incorporating context and user preferences
Conclusions
- A novel autocompletion mechanism for rapid development of mashups
- Uses the collective wisdom of other users on the web
- A dedicated threshold-based top-k algorithm that reduces the search space
- PageRank-style calculation of the popularity of mashlets and glue patterns

