
Applied Statistics for the Office of Science Understanding Variability and Bringing Rigor to Scientific Investigation George Ostrouchov Statistics and.


1 Applied Statistics for the Office of Science: Understanding Variability and Bringing Rigor to Scientific Investigation
George Ostrouchov
Statistics and Data Sciences Group, Computer Science and Mathematics Division, Oak Ridge National Laboratory

2 Filling a Gap in Statistics to Address Office of Science Needs
Slide header: Statistics and Data Sciences, Oak Ridge National Laboratory, U.S. Department of Energy, George Ostrouchov
ASCR Strategic Plan:
[AMR] weaknesses include an underinvestment or lack of investment in several critical areas: ... Underinvestment in statistics
The following gaps in the [AMR] program have been identified:
- Multiscale mathematics
- Ultrascale algorithms
- Discrete mathematics
- Statistics: investments in this area are required to deal with extracting knowledge from the oceans of data that large-scale simulations will produce.
- Multiphysics
Through Applied Statistics, ASCR has the opportunity to engage the dominant segment of Applied Mathematics for its goals.
Office of Science Response to the Data Challenge: The Office of Science will initiate a long-term research program to address the Curse of Dimensionality. (Raymond L. Orbach, AAAS, Feb. 19, 2006, U.S. Department of Energy Office of Science)
The ORNL Applied Statistics program can address the curse of dimensionality and other Office of Science goals.

3 Statistics Brings Rigor and Efficiency to Scientific Investigation and Technology
Photo caption: Conrad Habicht, Maurice Solovine, and Albert Einstein, the self-styled Olympia Academy, in about 1903. At Einstein's suggestion, the first book read was Pearson's The Grammar of Science. Credit: Image Archive ETH-Bibliothek, Zürich
Karl Pearson (1857-1936)
- The Grammar of Science (1892) - Relativity
- First Department of Statistics (1911), UCL
- Founding editor of Biometrika
Slide label: Experimental

4 Common Evolutionary Steps: Experimental Science and Computational Science
- Early computational science relies largely on intuitive design and visual validation
- Computational experiments are expensive
- Petascale data sets are nearly as opaque as real systems: statistical analysis must select what to visualize
- Uncertainty analysis is in its infancy
Statistics is a major partner in bringing computational science to the rigor and efficiency standards of experimental science:
- Methods to see through, examine, and classify variability
- Uncertainty quantification
- Statistical design of experiments
- Fusion of data and computational experiment

5 Statistics: the Study of Variability
The discipline concerned with the study of variability, with the study of uncertainty, and with the study of decision-making in the face of uncertainty.
- Large scale user of mathematical and computational tools with a focused scientific agenda
- Inherently interdisciplinary
Cuts through the fog of variability and brings efficiency to science.
Source: [NSF2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century.

6 Mathematics is Biology's Next Microscope, Only Better
Slide variant of the title: Mathematics, Computer Science, and Statistics are Biology's Next Microscope, Only Better
Here are five mathematical challenges that would contribute to the progress of biology.
(1) Understand computation. Find more effective ways to gain insight and prove theorems from numerical or symbolic computations and agent-based models. We recall Hamming: The purpose of computing is insight, not numbers (Hamming 1971, p. 31).
(2) Find better ways to model multi-level systems, for example, cells within organs within people in human communities in physical, chemical, and biotic ecologies.
(3) Understand probability, risk, and uncertainty. Despite three centuries of great progress, we are still at the very beginning of a true understanding. Can we understand uncertainty and risk better by integrating frequentist, Bayesian, subjective, fuzzy, and other theories of probability, or is an entirely new approach required?
(4) Understand data mining, simultaneous inference, and statistical de-identification (Miller 1981). Are practical users of simultaneous statistical inference doomed to numerical simulations in each case, or can general theory be improved? What are the complementary limits of data mining and statistical de-identification in large linked databases with personal information?
(5) Set standards for clarity, performance, publication and permanence of software and computational results.
Source: Cohen JE (2004). PLoS Biol 2(12): e439. Author credentials listed on the slide: Fellow AAAS, Fellow AmPhilSoc, Member NAS.
Slide overlay labels on the challenges: Statistics, Multiscale Math, Statistics, Computer Science, Computer Science and Mathematics
Other overlay labels: Chemistry's, Materials, Astrophysics, Telescope, Particle Physics, Device

7 Particle Physics Embraces Statistics
… since 1900 … statistics … takes over field after field … [as] … the methodology of choice …
… people in astronomy and physics … are starting to use statistics a lot more for the simple reason that they have to be efficient now. … I don't see any area where it's being resisted much.
Bradley Efron, Chair, Department of Statistics, Stanford University, and Max H. Stein Professor of Humanities and Sciences; 2005 National Medal of Science recipient

8 Citations to Statistics Comprise the Dominant Group within Mathematics
Highly Cited Journals in Mathematics, 1991-2001
Rank  Journal                         Citations
1     J. American Statistical Assn.      16,457
2     Biometrics                         10,854
3     J. Math. Analysis                   9,845
4     Annals of Statistics                9,702
5     Proc. Amer. Math. Soc.              9,237
6     C.R. Acad. Sci. Ser. I Math.        9,153
7     Trans. Amer. Math. Soc.             8,586
8     Journal of Algebra                  8,531
9     J. Functional Analysis              7,999
10    Biometrika                          7,911
11    SIAM J. Numer. Anal.                7,383
12    Inventiones Mathematicae            7,382
13    J. Royal Stat. Soc. B               6,575
14    Mathemat. Programming               6,444
15    Linear Algebra Appl.                6,112
SOURCE: ISI Essential Science Indicators, Sci. Citation Index (300 journals in pure mathematics, applied mathematics, statistics and probability)
Highly Cited Authors in Mathematics, 1991-2001
Rank  Name                   Affiliation               Department / Field  Papers  Citations
1     Pierre-Louis Lions     University of Paris 9     Mathematics             75       1207
2     David L. Donoho        Stanford University       Statistics              27       1182
3     Adrian F.M. Smith      Univ. London              Statistics              40       1026
4     Elizabeth A. Thompson  U. Washington             Biostatistics           11        973
5     Iain M. Johnstone      Stanford University       Statistics              17        968
6     Jianqing Fan           Chinese U. Hong Kong      Statistics              53        901
7     Donald B. Rubin        Harvard University        Statistics              38        854
8     Ingrid Daubechies      Princeton University      Mathematics             20        807
9     Adrian E. Raftery      U. Washington             Statistics/Sociol.      31        804
10    Alan E. Gelfand        U. Connecticut            Statistics              35        747
11    Sun-Wei Guo            Med. Coll. Wisconsin      Biostatistics            6        737
12    Scott L. Zeger         Johns Hopkins Univ.       Biostatistics           23        723
13    Peter J. Green         University of Bristol     Statistics              14        667
14    Bradley P. Carlin      University of Minnesota   Biostatistics           28        663
15    J. Stephen Marron      U. North Carolina         Statistics              43        618
16    David G. Clayton       MRC, Cambridge            Biostatistics            4        598
17    Gareth O. Roberts      Lancaster Univ.           Statistics              41        598
18    Albert Cohen           University of Paris       Mathematics             61        572
19    Michael Rockner        Univ. Bielefeld, Germany  Mathematics             69        572
20    Yangbo Ye              University of Iowa        Mathematics             42        567
21    Jinchao Xu             Pennsylvania St. U.       Mathematics             22        566
22    Xiao-Li Meng           University of Chicago     Statistics              27        561
23    Matthew P. Wand        Harvard University        Biostatistics           31        558
24    Wally R. Gilks         MRC                       Biostatistics           16        551
25    M. Chris Jones         Open University           Statistics              52        542
19 of the top 25 most cited mathematics authors are from Statistics or Biostatistics!
Statistics is highly interdisciplinary!
Citations per paper: Statistics and Biostatistics: 27; rest of Mathematics: 15
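The summary figures on this slide can be checked against the author listing above. The sketch below is only one plausible reading of "citations per paper": it assumes a pooled ratio (total citations divided by total papers within each group) and treats any field containing "statistics" as belonging to the Statistics and Biostatistics group. The names AUTHORS, is_statistics, and citations_per_paper are illustrative, not from the slides.

```python
# (field, papers, citations) for the 25 authors, transcribed from the slide.
AUTHORS = [
    ("Mathematics", 75, 1207), ("Statistics", 27, 1182),
    ("Statistics", 40, 1026), ("Biostatistics", 11, 973),
    ("Statistics", 17, 968), ("Statistics", 53, 901),
    ("Statistics", 38, 854), ("Mathematics", 20, 807),
    ("Statistics/Sociol.", 31, 804), ("Statistics", 35, 747),
    ("Biostatistics", 6, 737), ("Biostatistics", 23, 723),
    ("Statistics", 14, 667), ("Biostatistics", 28, 663),
    ("Statistics", 43, 618), ("Biostatistics", 4, 598),
    ("Statistics", 41, 598), ("Mathematics", 61, 572),
    ("Mathematics", 69, 572), ("Mathematics", 42, 567),
    ("Mathematics", 22, 566), ("Statistics", 27, 561),
    ("Biostatistics", 31, 558), ("Biostatistics", 16, 551),
    ("Statistics", 52, 542),
]


def is_statistics(field):
    """Assumed grouping rule: any field mentioning 'statistics' counts."""
    return "statistics" in field.lower()


def citations_per_paper(rows):
    """Pooled ratio: total citations over total papers for the group."""
    total_papers = sum(papers for _, papers, _ in rows)
    total_citations = sum(citations for _, _, citations in rows)
    return total_citations / total_papers


stat_rows = [row for row in AUTHORS if is_statistics(row[0])]
math_rows = [row for row in AUTHORS if not is_statistics(row[0])]

print(len(stat_rows), "of", len(AUTHORS), "authors are in statistics fields")
print("statistics citations per paper:", round(citations_per_paper(stat_rows), 1))
print("other mathematics citations per paper:", round(citations_per_paper(math_rows), 1))
```

Under these assumptions the pooled ratios come out near 26.6 and 14.8, which round to the 27 and 15 quoted on the slide.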

9 Statistics Disseminates Data Analysis Ideas Across Science Domains
Of 500 recent citations of Efron's Bootstrap paper, 348 were outside statistics. [NSF2004]
Mitchell's Detmax Algorithm paper: 200+ citations (funded by AMR at ORNL). In the slide's figure, citations outside statistics are shown in red.

10 Statistics Core Research Disseminates and Unifies Data Analysis Ideas
Tames the explosion of data analytic methods by:
- Providing portability between science domains
- Deriving properties of new data analytic methods
- Building bridges between data analytic methods
Examples:
- Latent Semantic Indexing (Dumais+ 1991) and Correspondence Analysis (Benzecri 1969, 1980, 1992; Greenacre 1984)
- Empirical Orthogonal Functions (Lorenz 1956) and a climate time series application of Principal Components Analysis (Pearson 1902, Hotelling 1935)
- Support Vector Machines (Vapnik 1995) and Logistic Regression (Cox 1970) via hinge loss function (Hastie+ 2001)
- FastMap approximation to Principal Components (Faloutsos+ 1995): bridge to Convex Hull and new methods, RobustMap (Ostrouchov+ 2005), and to right Householder transformations (Ostrouchov+ 2006)
Addressing the Curse of Dimensionality
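The bridge between Support Vector Machines and Logistic Regression mentioned above can be made concrete by writing both as losses on the margin y*f(x). The sketch below is illustrative only: it assumes labels coded as -1 and +1, and the function names hinge_loss and logistic_loss are hypothetical, not taken from the slides or the cited references.

```python
import math


def hinge_loss(margin):
    """SVM loss on the margin y*f(x): zero once the margin reaches 1."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic regression loss (negative log-likelihood) on the margin,
    log(1 + exp(-margin)), evaluated in a numerically stable way."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Compare the two losses across a range of margins.
for margin in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"margin={margin:+.1f}  hinge={hinge_loss(margin):.3f}  "
          f"logistic={logistic_loss(margin):.3f}")
```

Both losses grow roughly linearly for badly misclassified points and shrink toward zero for confidently correct ones; the hinge loss reaches exactly zero at a margin of 1, while the logistic loss only approaches zero. Seen this way the two methods differ mainly in the choice of loss function.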

11 Quantitative Rigor for Science: Transfer From Medicine via Core Statistics to Big Bang
Diagram labels: Statistics Core, Science Applications
I … emphasize the symbiotic relationship … between the Statisticians and Astrophysicists …. It is now … clear that there are core common problems … (Bob Nichol, CMU Physics)
Miller, CJ; Genovese, C; Nichol, RC; et al. Controlling the false-discovery rate in astrophysical data analysis. Astronomical Journal, 122 (6): 3492-3505, Dec 2001.
Miller, CJ; Nichol, RC; Batuski, DJ. Acoustic oscillations in the early universe and today. Science, 292 (5525): 2302-2303, Jun 22 2001.
Science publication on Big Bang while others still plow through plethora of data.
False Discovery Rate: interdisciplinary decision-making in the face of uncertainty
Family-wise error rate of statistical tests:
- One test: 0.05 probability of a false positive
- Fifty tests: about 0.92 probability of at least one false positive (1 - 0.95^50 for independent tests); need simultaneous inference (SI)
- Thousand tests: SI too conservative; need FDR
Statistics core is the hub that disseminates and unifies data analysis ideas.
Critical mass engagement is needed to reap short term and long term returns.
Source: [NSF2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century.
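The family-wise error rate figures above follow from a short calculation, sketched here under the assumption that the tests are independent and each is run at level 0.05. For fifty tests it gives roughly 0.92, close to the slide's figure. The same sketch contrasts a Bonferroni correction (one common form of simultaneous inference) with the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The slide does not name a specific FDR procedure, so Benjamini-Hochberg is an assumption here, and the p-values are made up for illustration.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive among independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_reject(p_values, alpha=0.05):
    """Simultaneous inference: test each p-value at alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """FDR control: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= k * q / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / m:
            cutoff_rank = rank
    reject = [False] * m
    for index in order[:cutoff_rank]:
        reject[index] = True
    return reject


for n in (1, 50, 1000):
    print(f"{n} tests: FWER = {familywise_error_rate(n):.3f}")

# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.012, 0.02, 0.04, 0.2, 0.5, 0.9]
print("Bonferroni rejections:", sum(bonferroni_reject(example_p)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg_reject(example_p)))
```

On these illustrative p-values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects four, which shows why a family-wise correction becomes too conservative as the number of tests grows.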

12 Engage Core Statistics for OASCR Goals
A gap exists between statistics research and simulation science
- Engage statistics with leadership computing
- Engage statistics with simulation science data
- Engage statistics with Office of Science experimental data (neutron science)
Diagram labels: Statistics Core, Science Applications
Science applications shown: Computational Chemistry, Climate Simulation, Fusion Simulation, Combustion Simulation, Superscalable Algorithms, Neutron Science, Astrophysics Simulation, Genome Science, Tuning Leadership Facilities, Ontologies for Energy

