Should you trust your experimental results?
Amer Diwan, Google
Stephen M. Blackburn, ANU
Matthias Hauswirth, U. Lugano
Peter F. Sweeney, IBM Research
Attendees of the Evaluate '11 workshop

Why worry?
Research cycles between innovating and experimenting: for scientific progress we need sound experiments.

Unsound experiments
An unsound experiment can make a bad idea look great!

Unsound experiments
An unsound experiment can also make a great idea look bad!

Thesis
Sound experimentation is critical, but it requires creativity and diligence. As a community, we must learn how to design and conduct sound experiments, and we must reward sound experimentation.

A simple experiment
Goal: characterize the speedup of optimization O.
Experiment: measure program P on unloaded machine M with and without O, obtaining times T1 and T2.
Claim: O speeds up programs by 10%.
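
A minimal sketch of what such a measurement might look like, assuming hypothetical binaries ./P_without_O and ./P_with_O and wall-clock time as the metric; repeated trials and the standard deviation are reported so that the concerns raised in later slides are at least visible in the data.

```python
import statistics
import subprocess
import time

def run_once(cmd):
    """Run one trial of a benchmark command and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start

def measure(cmd, trials=30):
    """Repeat the measurement; a single run is rarely trustworthy."""
    return [run_once(cmd) for _ in range(trials)]

# Hypothetical binaries: program P built without and with optimization O.
baseline = measure(["./P_without_O"])
optimized = measure(["./P_with_O"])

speedup = statistics.mean(baseline) / statistics.mean(optimized)
print(f"mean speedup: {speedup:.2f}x")
print(f"baseline sd: {statistics.stdev(baseline):.3f}s, "
      f"optimized sd: {statistics.stdev(optimized):.3f}s")
```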

Why is this unsound?
The scope of the experiment is much smaller than the scope of the claim: one program on one machine versus "programs" in general. The relationship between the two scopes determines whether an experiment is sound.

Sound experiments
Sufficient condition for a sound experiment: scope of claim <= scope of experiment. To get there, either (option 1) reduce the claim or (option 2) extend the experiment. So what are the common causes of unsound experiments?

The four fatal sins
The deadly sins do not stand in the way of a PLDI acceptance ("It is our pleasure to inform you that your paper titled 'Envy of PLDI authors' was accepted to PLDI..."), but the four fatal sins might!

Sin 1: Ignorance
Definition: ignoring components necessary for the claim.
Example: the experiment uses a particular computer, but the claim is about all computers.

Sin 1: Ignorance (continued)
Another example: the experiment uses one benchmark (avrora), but the claim is about the full suite. Ignorance systematically biases results.

Ignorance is not obvious!
"A is better than B." "I found just the opposite." Have you ever had this conversation with a collaborator?

Ignoring Linux environment variables
Changing the environment can change the outcome of your experiment! In [Mytkowicz et al., ASPLOS 2009], the same experiment gave different results for different people ("Todd's results" versus "my results") because their environments differed.
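
In the spirit of that ASPLOS 2009 result, a minimal sketch of how one might check for this bias: re-run the same (hypothetical) ./benchmark binary while only padding an otherwise unused environment variable, which changes the size of the UNIX environment.

```python
import os
import subprocess
import time

def timed_run(cmd, env):
    """Run the benchmark once under a given environment; return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start

# Hypothetical benchmark binary; only the environment size varies between runs.
cmd = ["./benchmark"]
for pad in (0, 1024, 4096):
    env = dict(os.environ)
    env["EXPERIMENT_PADDING"] = "x" * pad   # not read by the program itself
    print(f"env padding {pad:5d} bytes: {timed_run(cmd, env):.3f}s")
```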

Ignoring heap size
Changing the heap size can change the outcome of your experiment! In the graph from [Blackburn et al., OOPSLA 2006], SemiSpace (SS) is the worst collector at one heap size and the best at another.
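
A minimal sketch of guarding against this bias by sweeping heap sizes rather than picking one, assuming a HotSpot-style JVM (so -Xmx sets the maximum heap) and a DaCapo-style benchmark jar at the hypothetical path dacapo.jar.

```python
import subprocess
import time

JAR = "dacapo.jar"   # assumed path to a DaCapo-style benchmark jar
BENCH = "avrora"     # assumed benchmark name

for heap_mb in (32, 64, 128, 256, 512):
    cmd = ["java", f"-Xmx{heap_mb}m", "-jar", JAR, BENCH]
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    # Wall-clock time includes JVM startup; a real harness would use the
    # benchmark's own reported time, but the heap-size sweep is the point here.
    print(f"heap {heap_mb:4d} MB: {elapsed:.2f}s")
```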

Ignoring profiler bias Different profilers can yield contradictory conclusions! [Mytkowicz et al., PLDI 2010]

Sin 2: Inappropriateness
Definition: using components irrelevant to the claim.
Example: the experiment uses server applications, but the claim is about mobile performance.

Sin 2: Inappropriateness (continued)
Another example: the experiment uses compute benchmarks, but the claim is about GC performance. Inappropriateness produces unsupported claims.

Inappropriateness is not obvious!
Has your optimization ever delivered a 10% improvement... which never materialized in the "wild"?

Inappropriate statistics
Have you ever been fooled by a lucky outlier? [Georges and Eeckhout, 2007] show how the same data can suggest "SemiSpace is best by far" (comparing individual best runs) or only "SemiSpace is one of the best" (using means with confidence intervals).
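
A minimal sketch of the difference, with made-up run times: comparing best runs declares a clear winner, while means with confidence intervals (in the spirit of Georges and Eeckhout) show the two collectors are statistically indistinguishable.

```python
import math
import statistics

def mean_ci(samples, z=1.96):
    """Mean and ~95% confidence-interval half-width (normal approximation)."""
    m = statistics.mean(samples)
    half = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return m, half

# Made-up run times (seconds) for two collectors on the same benchmark.
semispace = [10.2, 10.4, 9.1, 10.3, 10.5, 10.2, 10.4, 10.3, 10.6, 10.1]
marksweep = [10.0, 10.1, 10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 10.2, 10.0]

print("best run:  SS", min(semispace), " MS", min(marksweep))  # SS looks best by far
for name, runs in (("SS", semispace), ("MS", marksweep)):
    m, half = mean_ci(runs)
    print(f"{name}: mean {m:.2f}s +/- {half:.2f}s")             # intervals overlap
```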

Inappropriate data analysis
A single Google search fans out into hundreds of RPCs, so the 99th-percentile latency affects a majority of requests. A mean is inappropriate if long-tail latency matters: in the slide's example, server A has a mean of 45.0 but a 99th percentile of 450, while server B's 99th percentile is 50.
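
A small sketch of why the tail matters, using made-up latency samples: the mean hides the tail, and with roughly 100 RPCs per search the chance that at least one RPC lands in the worst 1% is already about 63% (1 - 0.99^100).

```python
import random
import statistics

def p99(samples):
    """99th percentile by sorting (adequate for a quick analysis)."""
    ordered = sorted(samples)
    return ordered[int(0.99 * (len(ordered) - 1))]

random.seed(0)
# Made-up latencies (ms): B is well behaved; A has the same typical latency
# but a long tail.
a = [random.gauss(40, 5) if random.random() > 0.02 else random.uniform(300, 500)
     for _ in range(10_000)]
b = [random.gauss(40, 5) for _ in range(10_000)]

for name, lat in (("A", a), ("B", b)):
    print(f"{name}: mean {statistics.mean(lat):6.1f} ms, p99 {p99(lat):6.1f} ms")

# With 100 RPCs per request, how often does at least one hit the worst 1%?
print("P(at least one of 100 RPCs in the worst 1%):", 1 - 0.99**100)
```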

Inappropriate data analysis (continued)
Layered systems often use caches at each level, so latency distributions are bimodal: one mode for cache hits, another for cache misses, with the mean falling in between where few requests actually lie. Do you check the shape of your data before summarizing it?
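
A minimal sketch of checking the shape before summarizing, with made-up hit/miss latencies: the mean lands between the two modes, which a crude text histogram makes obvious.

```python
import collections
import random
import statistics

random.seed(1)
# Made-up bimodal latencies (ms): fast cache hits and slow cache misses.
latencies = [random.gauss(2, 0.3) if random.random() < 0.8 else random.gauss(20, 2)
             for _ in range(10_000)]

print(f"mean: {statistics.mean(latencies):.1f} ms")  # lands between the two modes

# Crude shape check: bucket by whole milliseconds and print a text histogram.
buckets = collections.Counter(int(x) for x in latencies)
for bucket in sorted(buckets):
    print(f"{bucket:3d} ms | {'#' * (buckets[bucket] // 100)}")
```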

Inappropriate metric
Have you ever picked a metric that was not ends-based? [Figure: results for the same code with extra nops inserted.]

Inappropriate metric (continued)
Pointer analyses A and B are run on the same program, and both report a mean points-to-set size of 2, leading to the claim "B is simpler yet just as precise as A". But the same mean can come from very different points-to sets, so the metric is inconsistent with what "better" (more precise) actually means. Have you ever used a metric that was inconsistent with "better"?
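
A made-up illustration of how the mean hides precision: the two hypothetical results below have the same mean points-to-set size, yet they answer the query that matters (may p and q alias?) differently.

```python
# Hypothetical points-to results for three pointers over objects o1..o4.
analysis_a = {"p": {"o1", "o2"}, "q": {"o3", "o4"}, "r": {"o1", "o2"}}
analysis_b = {"p": {"o1", "o2"}, "q": {"o1", "o2"}, "r": {"o3", "o4"}}

def mean_size(result):
    """Mean points-to-set size: the metric used on the slide."""
    return sum(len(s) for s in result.values()) / len(result)

def may_alias(result, x, y):
    """Two pointers may alias if their points-to sets intersect."""
    return bool(result[x] & result[y])

print("mean size A:", mean_size(analysis_a), " B:", mean_size(analysis_b))  # both 2.0
print("A says p,q may alias:", may_alias(analysis_a, "p", "q"))  # False
print("B says p,q may alias:", may_alias(analysis_b, "p", "q"))  # True: less precise here
```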

Sin 3: Inconsistency
Definition: the experiment compares A to B in different contexts.

Sin 3: Inconsistency (continued)
Example: the claim is "B > A", but the experiment measures system A on suite P ("they used P") and system B on suite Q ("we used Q"). Inconsistency misleads!

Inconsistency is not obvious
When comparing system A to system B, the workload, measurement context, and metrics must be the same for both.

Inconsistent workload
"I want to evaluate a new optimization for Gmail." [Figure: performance over time, with the point where the optimization was enabled marked.] Has the workload ever changed from under you?

Inconsistent metric
"Issued instructions" versus "retired instructions": do you (or even the vendors) know what each hardware metric means?

Sin 4: Irreproducibility
Definition: others cannot reproduce your experiment, for example because the report omits parts of the experiment (system, workload, metrics, measurement context). Irreproducibility makes it harder to identify unsound experiments.

Irreproducibility is not obvious
Omitting any source of bias (in the system, workload, metrics, or measurement context) from the report can make results irreproducible.

Revisiting the thesis
The four fatal sins affect all aspects of experiments and cannot be eliminated with a silver bullet (even with a much longer history, other sciences still struggle with them). It will take creativity and diligence to overcome these sins!

But I can give you one tip: look your gift horse in the mouth!

Back of the envelope
Your optimization eliminates memory loads: can the count of eliminated loads explain the speedup? You blame "cache effects" for results you cannot explain: does the variation in cache misses actually explain the results? (A sketch of such a sanity check follows.)
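
A minimal back-of-the-envelope sketch under made-up numbers: compare the measured time saved against what the eliminated loads could plausibly account for, given a rough per-load cost.

```python
# Made-up measurements for a hypothetical optimization.
baseline_s = 10.0          # measured baseline run time
optimized_s = 9.0          # measured optimized run time
eliminated_loads = 2.0e8   # loads removed, e.g. from performance counters
cycles_per_load = 4        # rough L1-hit cost; an assumption, not a measurement
clock_hz = 3.0e9

saved_s = baseline_s - optimized_s
explained_s = eliminated_loads * cycles_per_load / clock_hz

print(f"measured saving: {saved_s:.2f}s")
print(f"explainable by eliminated loads: {explained_s:.2f}s")
if explained_s < 0.5 * saved_s:
    print("The eliminated loads cannot explain most of the speedup; dig deeper.")
```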

Rewarding good experimentation
[Figure: papers plotted by novelty of algorithm against quality of experiments, with regions labelled "Reject", "Loch Ness Monster" (no evidence that the idea works...), "Often rejected" (scope of a paper: evaluates existing ideas, no new algorithms...), and "Safe bet".] Is this where we want to be?

Novel ideas can stand on their own
Novel (and carefully reasoned) ideas expose new paths for exploration and new ways of thinking. A groundbreaking idea with no evaluation >> a groundbreaking idea with a misleading evaluation.

Insightful experiments can stand on their own!
An insightful experiment may give insight into leading alternatives, open up new investigations, or increase confidence in prior results or approaches. An insightful evaluation with no algorithm >> an insightful evaluation with a lame algorithm.

But sound experiments take time!
Not as much, though, as chasing a false lead for years... How would you feel if you built a product based on incorrect data? What would you prefer to build upon?

Why you should care (revisited)
Has your optimization ever yielded an improvement... even when you had not enabled it? Have you ever obtained fantastic results... which even your collaborators could not reproduce? Have you ever wasted time chasing a lead... only to realize your experiment was flawed? Have you ever read a paper... and immediately decided to ignore the results?

The end
Experiments are difficult, and not just for us (see Jonah Lehrer's "The truth wears off"). Other sciences have established methods; it is our turn to learn from them and establish ours! Want to learn more? See the Evaluate collaboratory.

Acknowledgements Todd Mytkowicz Evaluate 2011 attendees: José Nelson Amaral, Vlastimil Babka, Walter Binder, Tim Brecht, Lubomír Bulej, Lieven Eeckhout, Sebastian Fischmeister, Daniel Frampton, Robin Garner, Andy Georges, Laurie J. Hendren, Michael Hind, Antony L. Hosking, Richard E. Jones, Tomas Kalibera, Philippe Moret, Nathaniel Nystrom, Victor Pankratius, Petr Tuma My mentors: Mike Hind, Kathryn McKinley, Eliot Moss