CS240B Midterm, Spring 2013
Your Name: ______ and your ID: ______

Problem      Max score   Score
Problem 1    40%
Problem 2    32%
Problem 3    28%
Total        100%



[Figure: query graph with Source1 feeding A and Source2 feeding B, both merged by the union U.]

Problem 1. 40%

We have a steady-state situation. The number of tuples produced by A is the same as the number in its input, but its output tuples are shorter and occupy half of the memory of its input. The same is true for B. Now A can process 1000 tuples per second, whereas B processes 500 tuples per second. The union processes 1000 tuples per second. Because punctuation marks are generated on demand, there is no idle waiting. After processing other queries, the DSMS starts processing this query graph, where it finds N1 tuples in S1 and N2 tuples in S2.

Q11: How long will the DSMS take to finish processing these N1+N2 tuples (write an expression)?

N1( ) + N2( ) = N1*4.5 + N2*5.5 (in ms).

With no change for A and B, let us assume now that C can process 400 tuples per second. Also say that the DSMS tries to minimize memory using a Chain-like optimization approach.

Q12: Under which input situation (i.e., values of N1 and N2, or perhaps their ratio) will the average delay of the tuples increase significantly?

Observe that the chain never breaks immediately after the union, which is flat, and that from there on the two paths are identical. Thus both paths see a memory release of one unit every 1 + 2.5 = 3.5 ms.

Q13: Still assuming the chain-based memory optimization and no idle waiting, estimate the average delay for 10,000 tuples in the following three cases:

(i) Almost all tuples are in S1. Observe that processing a tuple in A takes only 1 ms, so this is a much steeper path, and the chain will break this path. The average delay for tuples in the buffer is: 10000*1 + 10000*3.5/2 = 27500 ms = 27.5 sec.

(ii) Almost all tuples are in S2. Here we have: 10000*2 + 10000*3.5/2 = 37500 ms = 37.5 sec.

(iii) Half of the tuples are in S1 and the rest are in S2. A seemingly correct answer is to just take the average of the two: 27.5/2 + 37.5/2 seconds. But the reality is that the chain is greedy and will process all 5000 tuples in S1 first, which takes 5000*1 ms = 5 sec. The next steepest is B, which takes 5000*2 ms = 10 sec. At this point, the tuples are processed by U+C. This takes 10000*3.5 ms = 35 sec. So far there is no output. Then the output begins, and it ends 35 sec later. Thus the average delay is 5 + 10 + 35/2 = 15 + 17.5 = 32.5 sec.

Thus the delay increases with the S2/S1 ratio. Compare these numbers with the optimum average delay, which is 50/2 = 25 sec.

Q14: Illustrate the various cases with diagrams (it might be easier to start from here, before you answer the other questions).

[Diagram labels: S1, A (1), S2, B (2), C, Sink]
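The arithmetic behind Q11 and Q13 can be checked with a short Python sketch. It is only an illustration under stated assumptions: the per-tuple costs are derived from the rates given in the problem (1000, 500, 1000 and 400 tuples per second), and the function names and the simplified delay model (no output during the first stage, then uniform output from U+C) are hypothetical, not part of the exam.

```python
# Per-tuple service times in ms, derived from the rates stated in the problem.
COST_A = 1.0   # A: 1000 tuples/s
COST_B = 2.0   # B: 500 tuples/s
COST_U = 1.0   # U: 1000 tuples/s
COST_C = 2.5   # C: 400 tuples/s


def total_time_ms(n1, n2):
    """Q11: total work for n1 tuples on path A,U,C and n2 tuples on path B,U,C."""
    return n1 * (COST_A + COST_U + COST_C) + n2 * (COST_B + COST_U + COST_C)


def chain_avg_delay_ms(n1, n2):
    """Q13 estimate (assumed model): the greedy chain first drains A, then B,
    producing no output; U+C then emits results at a uniform rate, so a tuple
    waits on average for the whole first stage plus half of the emit phase."""
    first_stage = n1 * COST_A + n2 * COST_B
    emit_phase = (n1 + n2) * (COST_U + COST_C)
    return first_stage + emit_phase / 2


if __name__ == "__main__":
    for n1, n2 in [(10000, 0), (0, 10000), (5000, 5000)]:
        print(n1, n2, chain_avg_delay_ms(n1, n2) / 1000, "sec")
    # Half of the total processing time is the reference "optimum" in the text.
    print(total_time_ms(5000, 5000) / 2 / 1000, "sec")
```

Under this model the three cases give 27.5, 37.5 and 32.5 sec, and half of the total time for the mixed case is 25 sec, matching the figures in the answer.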

Problem 2: 32%

We have a stream of tuples describing employee histories, such as those above. The tuples arrive ordered by their timestamps. We want to project out Title and produce coalesced tuples in the output. (Q21) Write an SQL-TS query to do that. (Q22) Write an SQL-MR query to do that.

(Q21) SQL-TS:

    SELECT empno, first(B.start), max(B.end)
    FROM Emp AS PATTERN (B+)
    PARTITION BY empno
    ORDER BY start
    WHERE NOT EXISTS (SELECT empno AS cover
                      FROM Emp
                      WHERE cover.start = B.start)
      AND ( count(B.empno) = 1
            OR max(previous(B.end)) >= B.start )

So we begin with a period whose start is not covered by another period. Then we find a sequence of only one period, and we continue while the next start is before the end of the still-growing coalesced period.

(Q22) SQL-MR:

    SELECT empno,    /* employee number */
           startTS,  /* start of the current maximal period */
           endTS     /* end of the current maximal period */
    FROM Emp
    MATCH_RECOGNIZE (
        PARTITION BY empno
        ORDER BY start
        MEASURES B.empno AS Eno,
                 FIRST(B.start) AS startTS,
                 MAX(B.end) AS endTS
        ONE ROW PER MATCH
        AFTER MATCH SKIP PAST LAST ROW  /* a hole between periods */
        MAXIMAL MATCH
        PATTERN (B+)
        DEFINE B AS (COUNT(B.empno) = 1)
                    OR MAX(PREV(B.end)) >= B.start
    )
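As a plain illustration of what the coalescing queries are meant to compute, here is a minimal Python sketch. It is not SQL-TS or SQL-MR; the function name `coalesce`, the tuple layout (empno, start, end) and the sample data are assumptions made for the example. The merge test mirrors the condition that the next start must not fall after the end of the still-growing period.

```python
def coalesce(periods):
    """Merge overlapping or touching (empno, start, end) periods per employee.

    Periods are sorted by (empno, start), which plays the role of
    PARTITION BY empno ORDER BY start in the queries above.
    """
    result = []
    for empno, start, end in sorted(periods):
        if result and result[-1][0] == empno and result[-1][2] >= start:
            # The new period starts before the growing period ends: extend it.
            last_empno, last_start, last_end = result[-1]
            result[-1] = (last_empno, last_start, max(last_end, end))
        else:
            # A hole between periods (or a new employee): start a new period.
            result.append((empno, start, end))
    return result


if __name__ == "__main__":
    # Hypothetical history with Title already projected out.
    history = [(1001, 1, 5), (1001, 5, 9), (1001, 12, 15), (1002, 2, 4)]
    print(coalesce(history))
```

With the hypothetical history shown, the first two periods of employee 1001 are merged into (1001, 1, 9), while the period starting at 12 stays separate because of the hole between 9 and 12.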

Problem 3: 28%

3A. We have a time-stamped data stream that begins at time TAlpha, and we can call SQL 2003 aggregates with UNLIMITED PRECEDING. We want to compute the COUNT on a sliding window of 80 minutes without using windows, since they require too much memory. Write an ESL expression to do that.

Say that we have union-merged the live stream and the one coming from disk. Thus we might have something like SMstream(timestamp, IP, source), where source tells us whether a tuple is a live event or one from disk. The timestamps of events from disk have been incremented by 80 minutes. Also, source = +1 denotes the original stream, while source = -1 denotes a tuple that comes from disk.

    /* This counts all the tuples in a window of 80 minutes. */
    SELECT SUM(source) (UNLIMITED PRECEDING)
    FROM SMstream

    /* This counts all the tuples in a window of 80 minutes for each IP. */
    SELECT SUM(source) (UNLIMITED PRECEDING)
    FROM SMstream

3B. Now the input stream is filtered by a sampler that retains 5% of the input tuples and discards the rest. Is there an easy modification of query 3A that produces an accurate estimate of the exact count? Show and explain the modification.

One should multiply the result by 100/5 = 20.

3C. Now we want to compute count_distinct, and also perform an accurate estimation for it. Explain why approaches similar to those used in 3A or 3B will not work well.

Let U be the stream so far, W the window, and B the stream before W. If cntd denotes count distinct, in general cntd(B) + cntd(W) can be quite different from cntd(U). E.g., if B contains N distinct items, a W of size N could contain just the same items or totally new ones. In the second case N+N is correct. In the first case it is not.
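The +1/-1 idea of 3A and the non-additivity argument of 3C can be sketched in Python. This is an illustration only, not ESL: the function name, the use of `heapq.merge` to stand in for the union-merge of the live and delayed streams, and the sample timestamps are all assumptions made for the example.

```python
import heapq


def running_window_count(live_timestamps, window=80):
    """Sliding-window COUNT as a running SUM over a merged +1/-1 stream.

    live_timestamps must be sorted. Each live event contributes +1 at its
    timestamp, and its delayed copy contributes -1 at timestamp + window.
    """
    live = [(t, +1) for t in live_timestamps]
    delayed = [(t + window, -1) for t in live_timestamps]
    total = 0
    out = []
    for ts, source in heapq.merge(live, delayed):
        total += source  # SUM(source) over everything seen so far
        out.append((ts, total))
    return out


if __name__ == "__main__":
    print(running_window_count([0, 10, 50, 85, 100]))

    # 3B: with a 5% sampler, scale the running sum by 100/5 = 20.
    # 3C: count distinct is not additive, so the same trick fails.
    b = [1, 2, 3]
    w_same, w_new = [1, 2, 3], [4, 5, 6]
    print(len(set(b)) + len(set(w_same)), len(set(b + w_same)))  # 6 vs 3
    print(len(set(b)) + len(set(w_new)), len(set(b + w_new)))    # 6 vs 6
```

The running sum needs only one counter in memory, whereas the last two lines show that cntd(B) + cntd(W) equals 6 in both cases even though cntd(U) is 3 in one and 6 in the other.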
3D. Another approach to estimating count_distinct consists of (i) maintaining a fixed-size sample on a window (e.g., using the BOZ technique discussed in class), (ii) computing the count_distinct value on this sample, and (iii) scaling up the result according to the tuple count in the original window (this is known for physical windows and can be computed with technique 3A for logical ones). Would this approach produce reliable estimates? (Support your statements with formulas and examples.)

Again, there is no way to guess the correct value from that of the sample. For instance, say we have 5% sampling and we see 10 distinct items. The original window might contain 50 distinct items, or just the 10 we observed, with the rest being repetitions.
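A small Python sketch makes the 3D argument concrete. It uses deterministic systematic sampling (every 20th tuple) as a stand-in for the sampling technique discussed in class, which is an assumption made purely so the example is reproducible; the function names and the two synthetic windows are also hypothetical.

```python
def systematic_sample(window, rate_percent=5):
    """Keep every (100 / rate_percent)-th tuple; a stand-in for a real sampler."""
    step = 100 // rate_percent
    return window[::step]


def scaled_distinct_estimate(window, rate_percent=5):
    """Scale the sample's distinct count by the window/sample size ratio."""
    sample = systematic_sample(window, rate_percent)
    return len(set(sample)) * len(window) // len(sample)


if __name__ == "__main__":
    all_distinct = list(range(200))               # 200 distinct values
    repetitions = [i // 20 for i in range(200)]   # only 10 distinct values
    for window in (all_distinct, repetitions):
        sample = systematic_sample(window)
        print(len(set(sample)), scaled_distinct_estimate(window), len(set(window)))
```

Both windows yield a sample with 10 distinct items and therefore the same scaled estimate, yet their true distinct counts are 200 and 10, so the sample alone cannot tell the two situations apart.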