We think you have liked this presentation. If you wish to download it, please recommend it to your friends in any social system. Share buttons are a little bit lower. Thank you!
Presentation is loading. Please wait.
Published bySamuel Kent
Modified over 3 years ago
1 Copyright © 2009 M. E. Kabay. All rights reserved. Making Sense of Statistics in Information Security ISSA-Hartford Meeting Tuesday 16 June 2009 M. E. Kabay, PhD, CISSP-ISSMP CTO, Adaptive Cyber Security Instruments, Inc. Assoc Prof Information Assurance School of Business & Management Norwich University http://www.mekabay.com
2 Copyright © 2009 M. E. Kabay. All rights reserved. Topics Introduction Fundamentals of Statistical Design and Analysis Resources for Further Study
3 Copyright © 2009 M. E. Kabay. All rights reserved. Introduction Professional Background in Applied Statistics Value of Statistical Knowledge Base Limitations on Our Knowledge of Computer Crime Limitations on Applicability of Computer- Crime Statistics
4 Copyright © 2009 M. E. Kabay. All rights reserved. Professional Background in Applied Statistics Studied biology, genetics at McGill 1966-1970 Fascinated by biometrics (statistics applied to biological research) taught by Prof Hugh Tyson 1969 using Sokal & Rohlfs Biometry text Continued study independently during MSc at McGill in teratology 1970-1972 Took PhD Dartmouth in invertebrate zoology & applied statistics 1972-1976; One of PhD examiners was Dr Thomas E. Kurtz, co-inventor of BASIC (and a statistician) Have taught applied statistics at universities since 1975 & served as statistical consultant to scientists and industry
5 Copyright © 2009 M. E. Kabay. All rights reserved. Value of Statistical Knowledge Base Security professionals often asked about Frequency and security breaches Severity of damage Bear upon risk management Quantitative Qualitative Competitive analysis Litigation Standards of due care and diligence Commonly-accepted or best practices
6 Copyright © 2009 M. E. Kabay. All rights reserved. Limitations on Knowledge of Computer Crime: Detection AKA problem of ascertainment Not always possible to detect breach of security E.g., data leakage using covert channel has no record and no evidence (until competitor steals the market) But DoD DISA research 1995-1996 showed experimental evidence of non-detection 68,000 non-classified DoD systems Penetration tests broke into 2/3 of them Only 4% of sysadmins noticed penetrations
7 Copyright © 2009 M. E. Kabay. All rights reserved. Limitations on Knowledge of Computer Crime: Reporting Few reported in systematic way Unquantified, anecdotal reports of information assurance specialists Only ~10% of all breaches known publicly DoD DISA studies support this view Only ~½% of all detected breaches were properly reported as required by procedures … COMPUTER CRIME STATISTICS SHOULD GENERALLY BE TREATED WITH SKEPTICISM.
8 Copyright © 2009 M. E. Kabay. All rights reserved. Limitations on Applicability of Computer-Crime Statistics Enormous variability in computer systems and networks Processors Operating systems Topologies Firewalls Encryption Applications … How do we generalize from specific cases? How do we build database of usable statistics?
9 Copyright © 2009 M. E. Kabay. All rights reserved. Fundamentals of Statistical Design and Analysis Descriptive Statistics Inference Hypothesis Testing Random Sampling Confidence Limits Contingency Tables Association vs Causality Control Groups Confounded Variables
10 Copyright © 2009 M. E. Kabay. All rights reserved. Descriptive Statistics (1) Presentation of data can greatly influence perception of reality Amateurs (e.g., some reporters and PR personnel) can inadvertently or deliberately distort information through elementary mistakes E.g., consider 3 companies who report following losses from security breaches: $1M $2M $6M Next page shows different ways of representing these data
11 Copyright © 2009 M. E. Kabay. All rights reserved. Descriptive Statistics (2) ClassFrequency $2M2 > $2M1 ClassFrequency < $1M0 $1M & < $2M1 $2M & < $3M1 $3M & < $4M0 $4M & < $5M0 $5M & < $6M0 $6M & < $7M1 $7M0 Left-hand table: Wrong impression of where the data lie No sense of lower or upper bounds No idea of gap between 1, 2 & 6 Cannot compute mean, median at all Right-hand table: Still wrong mean
12 Copyright © 2009 M. E. Kabay. All rights reserved. Descriptive Statistics (3) Measures of central tendency Mean (computed) – sum / total number Median (counted) – value of middle of sorted list May differ if distribution is skewed (asymmetric)
13 Copyright © 2009 M. E. Kabay. All rights reserved. Descriptive Statistics (4) Measures of dispersion (variability) Range – largest value – smallest value Variance – average of squared deviations from mean (σ 2 ) Standard deviation – square root of variance (σ) In a Gaussian (Normal) frequency distribution, standard deviation is distance between mean & inflection point of curve (where slope stops increasing)
14 Copyright © 2009 M. E. Kabay. All rights reserved. Inference (1) Population is entire set of all possible members E.g., population of residents of USA is all people residing in USA at a specific time Sample statistic is known as parametric value Sample is enumerated or measured set of observations E.g., 100,000 people selected from US population is a sample Statistic computed on sample is sample statistic or estimator of parametric value
15 Copyright © 2009 M. E. Kabay. All rights reserved. Inference (2) Statisticians try to infer population statistics from sample statistics Called statistical inference E.g., population mean is µ and sample mean is ; parametric variance is 2 and sample is s 2 Sample statistics sometimes have different formula from parametric statistic E.g., estimates µ But estimator s 2 of 2 is sum of squared deviations from mean divided by (n-1) instead of by n [where n is sample size]
16 Copyright © 2009 M. E. Kabay. All rights reserved. Hypothesis Testing (1) Often need to test an idea (hypothesis) about populations based on sample statistics; e.g., Testing idea that µ lies between 1.3 & 4.3 based on a sample mean = 2.8 Testing idea that σ 35.6 based on s = 52.8 Can also test hypotheses about relationships E.g., given observed data in table, test idea that firewalls and penetration
17 Copyright © 2009 M. E. Kabay. All rights reserved. Hypothesis Testing (2) Null hypothesis (H 0 ) is that there is no relationship Testing for relation between two independent variables Presence of firewall Detection of penetration Various calculations available to test for independence; e.g., Chi-square 2 Log-likelihood ratio G Both are 0 in a population where there is no relationship between variables Compute probability that sample statistic would occur by chance alone if really 0 in population
18 Copyright © 2009 M. E. Kabay. All rights reserved. Hypothesis Testing (3) Probability that the null hypothesis is true p(H 0 ) > 0.05: not statistically significant (symbols ns) 0.05 p(H 0 ) > 0.01: statistically significant (*) 0.01 p(H 0 ) > 0.001: highly statistically significant (**) p(H 0 ) 0.001: extremely statistically significant (***).
19 Copyright © 2009 M. E. Kabay. All rights reserved. Random Sampling (1) Randomization essential to all of statistical inference Sample is random when every member of population has equal likelihood of being selected for sample Non-random sample is biased E.g., population is all members of multinational company BUT most employees picked are disproportionately from US subsidiaries – biased toward US sub-group E.g., population is all adult US residents but 2x as many men are selected as women – gender bias
20 Copyright © 2009 M. E. Kabay. All rights reserved. Random Sampling (2) Surveys can suffer from response bias What if survey is known only to a subset of desired population? What if results report only those who respond? What if those who respond are different from those who do not respond? The response bias can confound variables: Subjects of the questions are confounded with Awareness of the survey Tendency to respond
21 Copyright © 2009 M. E. Kabay. All rights reserved. Confidence Limits (1) Point estimates not generally useful The average salary was $38,232 The cost of gasoline rose $0.12 per week last quarter Generally prefer to have a sense of reliability Often report mean ± standard deviation The average salary was $38,232 ± $1955 The cost of gasoline rose $0.12 ± $0.035 per week last quarter Should specify sample size to give intuitive sense of reliability The average salary was $38,232 ± $1955 (n = 12) The average salary was $38,232 ± $1955 (n = 12,000)
22 Copyright © 2009 M. E. Kabay. All rights reserved. Confidence Limits (2) Can compute ranges that have a known probability of including the parametric value being estimated: The probability that the average salary was between $36,277 & $40,187 based on the sample statistics is 95%. The 95% confidence limits of the average salary were $36,277 & $40,187 Confidence limit computations depend on Random sampling Known error distribution (e.g., Normal/Gaussian) Equal variances at all values Larger values no more variable than smaller values SAME
23 Copyright © 2009 M. E. Kabay. All rights reserved. Contingency Tables Contingency tables present counted (enumerated) data for two or more variables Common error: Presenting only part of contingency table Over 70% of systems without firewalls were penetrated last year Yes, but what % of systems with firewalls were penetrated?
24 Copyright © 2009 M. E. Kabay. All rights reserved. Association vs Causality Dont mistake association for causality Error of logic known as post hoc, ergo propter hoc – after the fact, thus because of the fact E.g., suppose study shows that organizations with lots of fire extinguishers have lower rate of computer network penetration than those with few fire extinguishers Do we conclude that presence of fire extinguishers causes better resistance to penetration? Many possible explanations for association other than causality
25 Copyright © 2009 M. E. Kabay. All rights reserved. Control Groups When associated variables may be confounded, one can control for the variables E.g., in fire-extinguisher case Measure state of security awareness Compare groups with similar level of awareness Statistical techniques exist to control for independent variables and their interactions Analysis of variance with regression Multivariate analysis of contingency tables
26 Copyright © 2009 M. E. Kabay. All rights reserved. More about Confounded Variables One in 10 employees admitted stealing data or corporate devices, selling them for a profit, or knowing fellow employees who did. Confounds Theft of data Theft of devices Selling things for profit Knowing of others who did such criminal acts Cannot tease out the individual contributions Knowing particularly bad: confounds occurrence with social networking If everyone knows everyones business, could have 100% +ve response even if only 1% were criminals
27 Copyright © 2009 M. E. Kabay. All rights reserved. For Further Reading Kabay, M. E. (2009). Understanding Studies and Surveys of Computer Crime: http://www.mekabay.com/methodology/crime_stats_methods.pdf (the apparent blanks are the underscore character, _ ) http://www.mekabay.com/methodology/crime_stats_methods.htm Any introductory text for applied statistics in the social sciences Any introductory text on survey design and analysis
28 Copyright © 2009 M. E. Kabay. All rights reserved. Sample Textbooks Babbie, E. R., F. S. Halley & J. Zaino (2003). Adventures in Social Research : Data Analysis Using SPSS 11.0/11.5 for Windows, 5 th Ed. Pine Science Press (ISBN 0-761-98758- 4). Sirkin, R. M. (2005). Statistics for the Social Sciences, 3 rd Ed. Sage Publications (ISBN 1- 412-90546-X). Schutt, R. K. (2003). Investigating the Social World: The Process and Practice of Research, Fourth Edition. Pine Science Press (0-761- 92928-2).
29 Copyright © 2009 M. E. Kabay. All rights reserved. Sample Web Sites Creative Research Systems Survey Design http://www.surveysystem.com/sdesign.htm http://www.surveysystem.com/sdesign.htm New York University Statistics & Social Science http://www.nyu.edu/its/socsci/statistics.html http://www.nyu.edu/its/socsci/statistics.html StatPac Survey & Questionnaire Design http://www.statpac.com/surveys/ http://www.statpac.com/surveys/ University of Miami Libraries Research Methods in the Social Sciences: An Internet Resource List http://www.library.miami.edu/netguides/psymeth.html http://www.library.miami.edu/netguides/psymeth.html
30 Copyright © 2009 M. E. Kabay. All rights reserved. Discussion
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Experimental Design and Analysis of Variance Chapter 11.
Chap 7-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 7 Sampling and Sampling Distributions Statistics for Business.
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Multiple Regression and Model Building Chapter 14.
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Statistical Inferences Based on Two Samples Chapter 10.
Chapter Thirteen The One-Way Analysis of Variance.
© The McGraw-Hill Companies, Inc., Chapter 12 Chi-Square and Analysis of Variance (ANOVA)
C82MCP Diploma Statistics School of Psychology University of Nottingham 1 Overview of Lecture Parametric vs Non-Parametric Statistical Tests. Single Sample.
1 Lecture 2 ANALYSIS OF VARIANCE: AN INTRODUCTION.
Copyright © 2007 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 22 Comparing Two Proportions.
Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Simple Linear Regression Analysis Chapter 13.
Hypothesis Tests: Two Independent Samples Cal State Northridge 320 Andrew Ainsworth PhD.
January Structure of the book Section 1 (Ch 1 – 10) Basic concepts and techniques Section 2 (Ch 11 – 15): Inference for quantitative outcomes Section.
Chapter 4 Inference About Process Quality Motivation Estimation – point estimation – interval estimation Hypothesis Testing – Definition – Testing on means.
Chap 16-1 Statistics for Business and Economics, 6e © 2007 Pearson Education, Inc. Chapter 16 Goodness-of-Fit Tests and Contingency Tables Statistics for.
1 Statistical Analysis SC504/HS927 Spring Term 2008 Introduction to Logistic Regression Dr. Daniel Nehring.
Chapter 6 Sampling: Theory and Methods Copyright © 2013 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin.
Working with MS-ACCESS IS 240 – Database Management Lecture #2 – Assoc. Prof. M. E. Kabay, PhD, CISSP Norwich University
1 Dr. Jerrell T. Stracener EMIS 7370 STAT 5340 Probability and Statistics for Scientists and Engineers Department of Engineering Management, Information.
1 Your lecturer and course convener Colin Gray Room S16 William Guild Building address: Telephone: (27) 2233.
Quantitative Analysis (Statistics Week 8) FDA B&M Level C Peter Matthews
Evaluation of precision and accuracy of a measurement 1.
21-1 Copyright 2010 McGraw-Hill Australia Pty Ltd PowerPoint slides to accompany Croucher, Introductory Mathematics and Statistics, 5e Chapter 21 Hypothesis.
Chapter 15 ANOVA. Comparing Means for Several Populations When we wish to test for differences in means for only 1 or 2 populations, we use one- or two-sample.
Copyright © 2014 Pearson Education, Inc. All rights reserved Chapter 10 Associations Between Categorical Variables.
© The McGraw-Hill Companies, Inc., Chapter 12 Chi-Square.
1 Chapter 11: The t Test for Two Related Samples.
NURS 306, Nursing Research Lisa Broughton, MSN, RN, CCRN RESEARCH STATISTICS.
Copyright © 2010 Pearson Education, Inc. Slide
STATISTICAL ANALYSIS. Your introduction to statistics should not be like drinking water from a fire hose!!
Measures of Location and Dispersion Central Tendency.
1 Contact details Colin Gray Room S16 (occasionally) address: Telephone: (27) 2233 Dont hesitate to get in touch.
You will need Your text Your calculator And the handout Steps In Hypothesis Testing Bluman, Chapter 81.
© The McGraw-Hill Companies, Inc., Chapter 10 Testing the Difference between Means and Variances.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 20 Testing Hypotheses About Proportions.
Copyright © 2011 Pearson Education, Inc. Putting Statistics to Work.
Chapter 10 Estimating Means and Proportions Stat-Slide-Show, Copyright by Quant Systems Inc.
Copyright © 2008 by the McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Managerial Economics, 9e Managerial Economics Thomas Maurice.
Chapter 11 Elementary Statistics Larson Farber Nonparametric Tests.
4/4/2015Slide 1 SOLVING THE PROBLEM A one-sample t-test of a population mean requires that the variable be quantitative. A one-sample test of a population.
C Del Siegle By Del Siegle, PhD (This presentation may be used for instructional purposes) Press the space.
Module 17: Two-Sample t-tests, with equal variances for the two populations This module describes one of the most utilized statistical tests, the.
PCB 3043L - General Ecology Data Analysis Organizing an ecological study What is the aim of the study? What is the main question being asked? What are.
1 Quantitative Methods Topic 5 Probability Distributions.
Copyright ©2005 Brooks/Cole, a division of Thomson Learning, Inc. Statistical Significance for 2 x 2 Tables Chapter 13.
1 Chapter 4: Commonly Used Distributions. 2 Section 4.1: The Bernoulli Distribution We use the Bernoulli distribution when we have an experiment which.
Non- Parametric Statistics A Presentation by Rob McMullen for AP Statistics.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Lecture Slides Elementary Statistics Eleventh Edition and the Triola.
Biostatistics Unit 5 Samples Needs to be completed. 12/24/13 1.
1/2/2014 (c) 2001, Ron S. Kenett, Ph.D.1 Parametric Statistical Inference Instructor: Ron S. Kenett Course Website:
STATISTICS Sampling and Sampling Distributions Professor Ke-Sheng Cheng Department of Bioenvironmental Systems Engineering National Taiwan University.
© 2017 SlidePlayer.com Inc. All rights reserved.