

Bruno Ribeiro CS69000-DM1 Topics in Data Mining

Announcement Reminder
• Reviews of next week’s papers due Friday 5pm (submission closes Sunday 11:59pm)
  ◦ Assignment on Blackboard
• Deadline to select projects
  ◦ Sept 29

Today
• Murai, F., Ribeiro, B., Towsley, D., & Wang, P. (2013). On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling. JSAC 2013.
• Veitch, D., & Tune, P. (2015). Optimal Skampling for the Flow Size Distribution. IEEE Transactions on Information Theory 2015.

Waiting Time Paradox
• Why is your bus often full?
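A minimal sketch of the effect behind the question, under assumptions: the bus loads are made-up numbers and the function names are illustrative. A passenger picked at random is more likely to be riding a crowded bus, so the average load experienced by passengers is size-biased and exceeds the average load per bus.

```python
def average_load_seen_by_buses(loads):
    """Plain average: pick a bus uniformly at random."""
    return sum(loads) / len(loads)


def average_load_seen_by_passengers(loads):
    """Size-biased average: pick a passenger uniformly at random.

    A bus carrying k passengers is picked with probability proportional
    to k, so the expected load is sum(k^2) / sum(k).
    """
    return sum(k * k for k in loads) / sum(loads)


# Hypothetical loads for three buses (illustrative numbers only).
loads = [10, 10, 80]
print(average_load_seen_by_buses(loads))       # about 33.3
print(average_load_seen_by_passengers(loads))  # 66.0
```

Most buses here are nearly empty, yet most passengers sit on the full one. The same size bias is what makes large sets over-represented under element sampling.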

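Set Size Estimation Problem
• Each element is sampled with probability p
• Sets with a large number of elements are more likely to be observed
• How much more likely are we to see the green set than the blue set?
[Figure: sets of different sizes and the observed sets after sampling]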
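The question can be made concrete with a small sketch, assuming elements are sampled independently with probability p and a set counts as observed when at least one of its elements is sampled. The function names are illustrative.

```python
def prob_set_observed(size, p):
    """Probability that at least one of `size` elements is sampled,
    assuming independent sampling with probability p per element."""
    return 1.0 - (1.0 - p) ** size


def relative_visibility(size_a, size_b, p):
    """How many times more likely a set of size_a is to be observed
    than a set of size_b (requires p > 0)."""
    return prob_set_observed(size_a, p) / prob_set_observed(size_b, p)


print(prob_set_observed(1, 0.25))    # 0.25
print(prob_set_observed(2, 0.5))     # 0.75
print(relative_visibility(2, 1, 0.5))  # 1.5
```

For small p the ratio is close to size_a / size_b, so under this assumption a set ten times larger is roughly ten times more likely to appear in the sample.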

Set Size Distribution Estimation
[Figure: original data → random sampling → observed data → estimation → set size distribution]
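The forward direction of this pipeline (original data to observed data) can be sketched as follows. This is an illustration under assumptions, not the paper's code: elements are sampled independently with probability p > 0, sets with no sampled element are invisible, and `theta` and `observed_distribution` are names chosen here.

```python
from math import comb


def observed_distribution(theta, p):
    """Distribution of the number of sampled elements per observed set.

    theta[i - 1] is the fraction of original sets with i elements (i = 1..W).
    Returns d, where d[j - 1] is the fraction of observed sets that show
    exactly j sampled elements. Sets with zero sampled elements are never
    observed, so the result is renormalised over j >= 1.
    """
    W = len(theta)
    joint = [0.0] * W
    for i in range(1, W + 1):
        for j in range(1, i + 1):
            # Binomial thinning: j of the i elements survive sampling.
            joint[j - 1] += theta[i - 1] * comb(i, j) * p**j * (1 - p) ** (i - j)
    total = sum(joint)  # probability that a set is observed at all
    return [x / total for x in joint]


# Half the sets have 1 element, half have 2; sample with p = 0.5.
print(observed_distribution([0.5, 0.5], 0.5))  # approximately [0.8, 0.2]
```

Estimation is the inverse problem: recover `theta` from a noisy empirical version of the returned distribution.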

Example Application
• Do we see c_0?

Problem Formulation (corrected)

Application 1: Estimate Latent Characteristics
• If edges arrive independently at random…
• Estimate original average degree
  ◦ Knowing the sampling probability p
[Figure: underlying “true network” (e.g. phone calls); each edge is observed with probability p during the window [0, T]]
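As a rough illustration of using a known p, here is a simple inverse-probability estimate of the average degree. It rests on assumptions not stated on the slide: the graph is undirected, each edge is observed independently with probability p, and the number of nodes is known. In the lecture's setting nodes with no sampled edge are invisible, which is what turns this into the harder set size distribution problem.

```python
def estimate_average_degree(num_sampled_edges, p, num_nodes):
    """Scale the observed edge count by 1/p, then convert to average degree.

    Assumes an undirected graph (each edge contributes 2 to the degree sum)
    and a known node count; both are assumptions made for this sketch.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if num_nodes <= 0:
        raise ValueError("num_nodes must be positive")
    estimated_edges = num_sampled_edges / p
    return 2.0 * estimated_edges / num_nodes


# 50 edges seen at p = 0.25 among 100 nodes.
print(estimate_average_degree(50, 0.25, 100))  # 4.0
```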

Application 2: TCP flow size estimation
• Estimate the original flow size distribution from counts of the number of sampled packets per flow
[Figure: TCP flow packets → packet sampling → sampled flows: no packet sampled (flow not sampled), 1 packet sampled, …, all packets sampled]
[Figure: original data → random sampling → observed data → estimation → set size distribution]

Maximum Likelihood Estimation in practice…
[Figure: accuracy of the proposed estimator at sampling rate 1/100, without protocol information vs. with protocol information; axis label: n]

What I will show
• Fisher information
• Data processing inequality
• “Debug” measurement methods
Lessons:
• Feature engineering by trial & error is tricky and expensive
• Analyze the last step
  ◦ Is there enough information to proceed to estimation?
  ◦ Does a better summary function exist?
  ◦ Where is information lost?

Estimating characteristics from sampling
• Data processing inequality: “No processing can increase the amount of statistical information already contained in the data”
[Figure: Nature → sampling → raw samples → sample summary → Estimator → characteristic summary; the data processing inequality applies along the chain]

“Debugging” the sampling design
• Fisher information
  ◦ Amount of information the observations carry about the unknown characteristic
• Cramér-Rao inequality
  ◦ Connects the Fisher information with the minimum Mean Squared Error (MSE) achievable by any unbiased estimator
[Figure: Nature → sampling → raw samples → sample summary → Best Estimator → characteristic summary (assumption: θ). Quality of estimates? Good: done. Poor: back to the drawing board]
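A toy instance of the Cramér-Rao connection, chosen here for illustration and not taken from the lecture: for n independent Bernoulli(θ) observations the Fisher information per observation is 1 / (θ(1 − θ)), so no unbiased estimator can have variance below θ(1 − θ) / n. The sample mean attains exactly that variance.

```python
def bernoulli_fisher_information(theta):
    """Fisher information of a single Bernoulli(theta) observation."""
    return 1.0 / (theta * (1.0 - theta))


def cramer_rao_bound(theta, n):
    """Lower bound on the variance of any unbiased estimator of theta
    built from n independent observations."""
    return 1.0 / (n * bernoulli_fisher_information(theta))


def sample_mean_variance(theta, n):
    """Exact variance of the sample mean of n Bernoulli(theta) draws."""
    return theta * (1.0 - theta) / n


print(cramer_rao_bound(0.5, 100))      # 0.0025
print(sample_mean_variance(0.5, 100))  # 0.0025
```

When the bound itself is large, the estimates will be poor regardless of how clever the estimator is, which points back at the sampling design instead of the estimator.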

“[The finding] that the amount of information extracted in the process of estimation could never exceed the quantity supplied by the data, combined with the practical fact that directly available processes of computation would extract almost always a very large fraction of the total available [information], shifted the moral balance. The weight of [the statistician’s] responsibility was thrown back on to the process by which the data had come into existence. […] what types of observational programs would yield the most information for a given expenditure in time, money and labor.”
R. A. Fisher

Problem Formulation

Fisher Information
[Equations: element-wise definition of the Fisher information, followed by its matrix form]

Cramér-Rao Lower Bound
• Suppose we obtain unbiased estimates from observations
• The mean squared error (covariance matrix) of the estimates is bounded below by the Cramér-Rao bound, which is given by the inverse Fisher information
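For reference, the textbook forms of the two quantities, in notation that is an assumption here (the lecture's own symbols may differ): f(x; θ) is the likelihood of one observation, J(θ) is the Fisher information matrix, and T is any unbiased estimator of θ built from N independent observations.

```latex
J(\theta)_{ik}
  = \mathbb{E}\!\left[
      \frac{\partial \log f(X;\theta)}{\partial \theta_i}\,
      \frac{\partial \log f(X;\theta)}{\partial \theta_k}
    \right],
\qquad
\operatorname{Cov}_\theta(T) \succeq \frac{1}{N}\, J(\theta)^{-1}.
```

The matrix inequality means that the difference between the two sides is positive semidefinite, so each diagonal entry of the inverse Fisher information lower-bounds the MSE of the corresponding component of T.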

Cramér-Rao Lower Bound
• But we must consider the parameter constraint
[Figure: CRLB without constraint vs. CRLB with constraint]

Fisher Information with Priors
• Fisher information with priors: total FI = FI of prior + original FI

Different Sampling Designs
• FS = Flow Sampling: sample sets with probability q
• SH = Sample and Hold: randomly sample the first element with probability q’ but collect all future elements of the same set
• DS = Dual Sampling: sample the first element with high probability; sample following elements with low probability and use “sequence numbers” to obtain elements lost “in the middle”
• PS = Packet Sampling: sample elements with probability p
[Figure: sets seen as a stream of elements]
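Three of these designs can be sketched on a stream of flow IDs. This is an illustration under assumptions: the stream is a list where each item is the flow ID of one packet, the function names are made up, and SH follows the common sample-and-hold convention in which every element of a not-yet-tracked set gets a chance to start tracking. DS is omitted because it depends on sequence numbers that this toy stream does not carry.

```python
import random
from collections import Counter


def packet_sampling(stream, p, rng):
    """PS: keep each element independently with probability p."""
    return Counter(flow for flow in stream if rng.random() < p)


def flow_sampling(stream, q, rng):
    """FS: decide once per flow with probability q, then keep all of it."""
    decision = {}
    counts = Counter()
    for flow in stream:
        if flow not in decision:
            decision[flow] = rng.random() < q
        if decision[flow]:
            counts[flow] += 1
    return counts


def sample_and_hold(stream, q_prime, rng):
    """SH: start tracking a flow with probability q' per element;
    once tracked, count every later element of that flow."""
    counts = Counter()
    for flow in stream:
        if flow in counts or rng.random() < q_prime:
            counts[flow] += 1
    return counts


# Toy stream: each item is the flow ID of one packet (made-up data).
stream = ["a", "b", "a", "c", "a", "b", "a", "d"]
rng = random.Random(0)  # fixed seed so runs are repeatable
print(packet_sampling(stream, 0.5, rng))
print(flow_sampling(stream, 0.5, rng))
print(sample_and_hold(stream, 0.5, rng))
```

FS either sees a flow's true size or nothing at all, PS sees a thinned count of every flow, and SH sees the tail of a flow from the first sampled element onward.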

Results: Different Sampling Designs (Veitch & Tune’14)
• FS = Flow sampling
• SH = Sample and hold
• DS = Dual sampling
• PS = Packet sampling

Today
• Murai, F., Ribeiro, B., Towsley, D., & Wang, P. (2013). On Set Size Distribution Estimation and the Characterization of Large Networks via Sampling. JSAC 2013.
• Veitch, D., & Tune, P. (2015). Optimal Skampling for the Flow Size Distribution. IEEE Transactions on Information Theory 2015.

Part 1: Random Sampling vs. Data Streaming

What if we decided to bypass sampling?
• Fisher information of the sample summary?

Sketching
[Figure: Sketch phase at the router: a universal hash function maps each packet to one of an array of counters; two flows mapped to the same counter is a collision. Estimation phase at a powerful back end server: counters summary → flow size distribution estimate]
• Prevent collisions: keep unique packet ID (flow sampling)
• Disambiguate
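A minimal sketch of the counter array in the sketch phase, under assumptions: SHA-256 with a seed stands in for the universal hash function, the stream is a list of flow IDs (one per packet), and all names are illustrative.

```python
import hashlib


def counter_index(flow_id, num_counters, seed=0):
    """Map a flow ID to a counter. SHA-256 is a stand-in for the
    universal hash function mentioned on the slide."""
    digest = hashlib.sha256(f"{seed}:{flow_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_counters


def build_sketch(stream, num_counters, seed=0):
    """Sketch phase: one counter increment per packet."""
    counters = [0] * num_counters
    for flow_id in stream:
        counters[counter_index(flow_id, num_counters, seed)] += 1
    return counters


def collided_counters(stream, num_counters, seed=0):
    """Indices of counters shared by more than one distinct flow."""
    flows_per_counter = {}
    for flow_id in set(stream):
        index = counter_index(flow_id, num_counters, seed)
        flows_per_counter.setdefault(index, set()).add(flow_id)
    return sorted(i for i, flows in flows_per_counter.items() if len(flows) > 1)


# Toy stream of flow IDs (made-up data).
stream = ["a", "b", "a", "c", "a", "b", "a", "d"]
print(build_sketch(stream, 4))
print(collided_counters(stream, 4))
```

A counter hit by a single flow holds that flow's exact size. A collided counter holds a sum of sizes, which is what the estimation phase has to disambiguate.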

Eviction Sketch
• Why?
  ◦ Fisher information analysis shows that a collided counter carries ≃ 0 information


Set Size Estimation Errors in Practice
• p = 0.25
• (a) N = 10,000 and (b) N = 50,000 sampled sets
• (c) N ∈ {5, 10, 20, 50, 100} × 10^3 sampled sets

Set Size Estimation Errors in Practice II
• p = 0.90
• (a) N = 10,000 and (b) N = 50,000 sampled sets
• (c) N ∈ {5, 10, 20, 50, 100} × 10^3 sampled sets

Scaling on max set size: Phase transition of estimation errors
• observable set sizes
• W: size of largest set
• T_i(S): estimate of θ_i

Infinite support & power laws
• If θ is a power law with infinite support (W → ∞)
  ◦ if p < ½, any unbiased estimator is inaccurate: might as well output random estimates
  ◦ if p > ½, estimates can be accurate if enough samples are collected

Next Class
• How to collect data!!