Statistics And Application


1 Statistics And Application
Revealing Facts From Data

2 What Is Statistics Statistics is a mathematical science pertaining to the collection, analysis, interpretation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities, as well as to business, government, medicine and industry.

3 Statistics Is … Almost every professional needs statistical tools.
Statistical skills enable you to intelligently collect, analyze and interpret data relevant to your decision-making. Statistical concepts enable us to solve problems in a diversity of contexts. Statistical thinking enables you to add substance to your decisions.

4 Statistics is a science
It assists you in making decisions under uncertainty. The decision-making process must be based on data, not on personal opinion or belief. It is already an accepted fact that "Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write." So, let us be ahead of our time. In the US, students learn statistics beginning in middle school.

5 Type Of Statistics Descriptive statistics deals with the description problem: Can the data be summarized in a useful way, either numerically or graphically, to yield insight about the population in question? Basic examples of numerical descriptors include the mean and standard deviation. Graphical summarizations include various kinds of charts and graphs. Inferential statistics is used to model patterns in the data, accounting for randomness and drawing inferences about the larger population. These inferences may take the form of answers to yes/no questions (hypothesis testing), estimates of numerical characteristics (estimation), prediction of future observations, descriptions of association (correlation), or modeling of relationships (regression). Other modeling techniques include ANOVA, time series, and data mining.
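
As a minimal illustration of the two types above, the following Python sketch computes descriptive summaries of a sample and then an inferential 95% confidence interval for the population mean; the data are synthetic and the numbers are hypothetical.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=170, scale=10, size=50)   # hypothetical measurements

    # Descriptive statistics: numerical summaries of the sample itself.
    print("mean:", sample.mean(), "standard deviation:", sample.std(ddof=1))

    # Inferential statistics: a 95% confidence interval for the population mean.
    ci = stats.t.interval(0.95, df=len(sample) - 1,
                          loc=sample.mean(), scale=stats.sem(sample))
    print("95% CI for the population mean:", ci)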

6 Type of Studies There are two major types of causal statistical studies, experimental studies and observational studies. In both types of studies, the effect of differences of an independent variable (or variables) on the behavior of the dependent variable is observed. The difference between the two types is in how the study is actually conducted. Each can be very effective. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation may have modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and the response are investigated.

7 Types of Statistical Courses
Two types: Greater statistics is everything related to learning from data, from the first planning or collection to the last presentation or report; it is rooted in a deep respect for data and truth. Lesser statistics is the body of statistical methodology pursued with no interest in data or truth, generally as arithmetic exercises. If a certain assumption is needed to justify a procedure, its practitioners will simply "assume the ... are normally distributed" -- no matter how unlikely that might be.

8 Statistical Models Statistical models are currently used in various fields of business and science. The terminology differs from field to field. For example, the fitting of models to data, variously called calibration, history matching, or data assimilation, is synonymous with parameter estimation.

9 Data Analysis Developments in statistical data analysis often parallel or follow advancements in other fields to which statistical methods are fruitfully applied. The decision-making process under uncertainty is largely based on the application of statistical data analysis for probabilistic risk assessment of your decision.

10 (cont.) Decision makers need to lead others to apply statistical thinking in day-to-day activities and, secondly, to apply the concept for the purpose of continuous improvement.

11 Is Data Information? The database in your office contains a wealth of information, yet the decision technology group members tap only a fraction of it. Employees waste time scouring multiple sources for data. Decision-makers are frustrated because they cannot get business-critical data exactly when they need it. Therefore, too many decisions are based on guesswork, not facts. Many opportunities are also missed, if they are even noticed at all. Data itself is not information, but it might generate information.

12 Knowledge Knowledge is what we know well. Information is the communication of knowledge. In every knowledge exchange, the sender makes common what is private, does the informing, the communicating. Information can be classified into explicit and tacit forms. Explicit information can be explained in a structured form, while tacit information is inconsistent and fuzzy to explain. Know that data are only crude information and not knowledge by themselves.

13 Data → Knowledge (?) Data is known to be crude information and not knowledge by itself. The sequence from data to knowledge is: from data to information, from information to facts, and finally, from facts to knowledge. Data becomes information when it becomes relevant to your decision problem. Information becomes fact when the data can support it. Facts are what the data reveal. However, decisive instrumental (i.e., applied) knowledge is expressed together with some statistical degree of confidence.

14 Fact → knowledge Fact becomes knowledge, when it is used in the successful completion of a statistical process.

15 Statistical Analysis As the exactness of a statistical model increases, the level of improvement in decision-making increases: this is the reason for using statistical data analysis. Statistical data analysis arose from the need to place knowledge on a systematic evidence base. Statistics is a study of the laws of probability, the development of measures of data properties and relationships, and so on.

16 Statistical Inference
Verify the statistical hypothesis: determine whether any statistical significance can be attached to results after due allowance is made for random variation as a source of error. Intelligent and critical inferences cannot be made by those who do not understand the purpose, the conditions, and the applicability of the various techniques for judging significance. Considering the uncertain environment, the chance that "good decisions" are made increases with the availability of "good information." The chance that "good information" is available increases with the level of structuring the process of Knowledge Management.

17 Knowledge Needs Wisdom
Wisdom is the power to put our time and our knowledge to the proper use. Wisdom is the accurate application of accurate knowledge. Wisdom is about knowing how technical staff can be best used to meet the needs of the decision-maker.

18 History Of Statistics The word statistics ultimately derives from the modern Latin term statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician"). The birth of statistics occurred in the mid-17th century. A commoner named John Graunt, a native of London, began reviewing a weekly church publication issued by the local parish clerk that listed the number of births, christenings, and deaths in each parish. These so-called Bills of Mortality also listed the causes of death. Graunt, who was a shopkeeper, organized these data in the form we call descriptive statistics, which he published as Natural and Political Observations Made upon the Bills of Mortality. Shortly thereafter, he was elected a member of the Royal Society. Thus, statistics has to borrow some concepts from sociology, such as the concept of "population". It has been argued that since statistics usually involves the study of human behavior, it cannot claim the precision of the physical sciences.

19 Statistics is for Government
The original principal purpose of Statistik was providing data to be used by governmental and (often centralized) administrative bodies. The collection of data about states and localities continues, largely through national and international statistical services. Censuses provide regular information about the population. During the 20th century, the creation of precise instruments for public health concerns (epidemiology, biostatistics, etc.) and for economic and social purposes (the unemployment rate, econometrics, etc.) necessitated substantial advances in statistical practices.

20 History of Probability
Probability has a much longer history. Probability is derived from the verb to probe, meaning to "find out" what is not too easily accessible or understandable. The word "proof" has the same origin, providing the necessary details to understand what is claimed to be true. Probability originated from the study of games of chance and gambling during the sixteenth century. Probability theory was a branch of mathematics studied by Blaise Pascal and Pierre de Fermat in the seventeenth century. Currently, in the 21st century, probabilistic modeling is used to control the flow of traffic through a highway system, a telephone interchange, or a computer processor; to find the genetic makeup of individuals or populations; and in quality control, insurance, investment, and other sectors of business and industry.

21 Stat Merge With Prob Statistics eventually merged with the field of inverse probability, referring to the estimation of a parameter from experimental data in the experimental sciences (most notably astronomy). Today the use of statistics has broadened far beyond the service of a state or government, to include such areas as business, natural and social sciences, and medicine, among others. Statistics emerged in part from probability theory, which can be dated to the correspondence of Pierre de Fermat and Blaise Pascal (1654). Christiaan Huygens (1657) gave the earliest known scientific treatment of the subject. Jakob Bernoulli's Ars Conjectandi (posthumous, 1713) and Abraham de Moivre's Doctrine of Chances (1718) treated the subject as a branch of mathematics.

22 Developments in the 18th and 19th Centuries
The theory of errors may be traced back to Roger Cotes's Opera Miscellanea (posthumous, 1722), but a memoir prepared by Thomas Simpson in 1755 (printed 1756) first applied the theory to the discussion of errors of observation. Daniel Bernoulli (1778) introduced the principle of the maximum product of the probabilities of a system of concurrent errors. The method of least squares, used to minimize errors in data measurement, is due to Adrien-Marie Legendre (1805), Robert Adrain (1808), and Carl Gauss (1809), motivated by the problems of survey measurement and of reconciling disparate physical measurements. Contributions to the general theory of statistics came from Laplace (1810, 1812), Gauss (1823), James Ivory (1825, 1826), Hagen (1837), Friedrich Bessel (1838), W. F. Donkin (1844, 1856), and Morgan Crofton (1870). Other contributors were Ellis (1844), De Morgan (1864), Glaisher (1872), and Giovanni Schiaparelli (1875).

23 Statistics in the 20th Century Karl Pearson (March 27, 1857 – April 27, 1936) was a major contributor to the early development of statistics. Pearson's work was all-embracing in the wide application and development of mathematical statistics, and encompassed the fields of biology, epidemiology, anthropometry, medicine and social history. His main contributions are: linear regression and correlation (the Pearson product-moment correlation coefficient was the first important effect size to be introduced into statistics); the classification of distributions, which forms the basis for much of modern statistical theory (in particular, the exponential family of distributions underlies the theory of generalized linear models); and Pearson's chi-square test. Sir Ronald Aylmer Fisher, FRS (17 February 1890 – 29 July 1962) invented the techniques of maximum likelihood and analysis of variance, and originated the concepts of sufficiency, ancillarity, Fisher's linear discriminator and Fisher information. His 1924 article "On a distribution yielding the error functions of several well known statistics" presented Karl Pearson's chi-squared and Student's t in the same framework as the normal distribution and his own analysis of variance distribution z (more commonly used today in the form of the F distribution). These contributions easily made him a major figure in 20th century statistics. He also began the field of non-parametric statistics; entropy as well as Fisher information were essential for developing Bayesian analysis.

24 Statistics in the 20th Century Gertrude Mary Cox (January 13, 1900 – 1978): experimental design. Charles Edward Spearman (September 10, 1863 – September 17, 1945): non-parametric analysis, the rank correlation coefficient. Chebyshev's inequality. Lyapunov's central limit theorem. John Wilder Tukey (June 16, 1915 – July 26, 2000): jackknife estimation, exploratory data analysis and confirmatory data analysis. George Bernard Dantzig (8 November 1914 – 13 May 2005): developed the simplex method and furthered linear programming; advanced the fields of decomposition theory, sensitivity analysis, complementary pivot methods, large-scale optimization, nonlinear programming, and programming under uncertainty. Bayes' theorem. Sir David Roxbee Cox (born Birmingham, England, 1924) has made pioneering and important contributions to numerous areas of statistics and applied probability, of which the best known is perhaps the proportional hazards model, which is widely used in the analysis of survival data.

25 Schools of Thought in Statistics
The Classical, attributed to Laplace; the Relative Frequency, attributed to Fisher; the Bayesian, attributed to Savage. What Type of Statistician Are You?

27 Classic Statistics The problem with the classical approach is that what constitutes an outcome is not objectively determined. One person's simple event is another person's compound event. One researcher may ask, of a newly discovered planet, "what is the probability that life exists on the new planet?" while another may ask "what is the probability that carbon-based life exists on it?" Bruno de Finetti, in the introduction to his two-volume treatise on Bayesian ideas, clearly states that "Probabilities Do not Exist". By this he means that probabilities are not located in coins or dice; they are not characteristics of things like mass, density, etc.

28 Relative Frequency Statistics
Considers probabilities as "objective" attributes of things (or situations) which are really out there (availability of data), and uses only the data at hand to make interpretations. Even when substantial prior information is available, frequentists do not use it, while Bayesians are willing to assign probability distribution function(s) to the population's parameter(s).

29 Bayesian approaches Consider probability theory as an extension of deductive logic (including dialogue logic, interrogative logic, informal logic, and artificial intelligence) to handle uncertainty. The first principle is that the uniquely correct starting point is your belief about the state of things (the prior), which is then updated in the light of the evidence. The laws of probability have the same status as the laws of logic. Bayesian approaches are explicitly "subjective" in the sense that they deal with the plausibility which a rational agent ought to attach to the propositions he/she considers, "given his/her current state of knowledge and experience."

30 Discussion From a scientist's perspective, there are good grounds to reject Bayesian reasoning: Bayesian analysis deals not with objective but with subjective probabilities. The result is that reasoning using a Bayesian approach cannot be checked -- something that makes it worthless to science, like non-replicable experiments. On the other hand, Bayesian perspectives often shed a helpful light on classical procedures; it is necessary to go into a Bayesian framework to give confidence intervals. This insight is helpful in drawing attention to the point that another prior distribution would lead to a different interval. A Bayesian may cheat by basing the prior distribution on the data; for coherence to hold, priors must be personal and fixed before the study, which is more complex. Objective Bayesians: there is a clear connection between probability and logic, for both appear to tell us how we should reason. But how, exactly, are the two concepts related? Objective Bayesians offer one answer to this question.

31 Steps Of The Analysis Defining the problem: an exact definition of the problem is imperative in order to obtain accurate data about it. Collecting the data: designing ways to collect data is an important job in statistical data analysis, and the population and the sample are key aspects of it. Analyzing the data: exploratory methods are used to discover what the data seem to be saying, using simple arithmetic and easy-to-draw pictures to summarize the data; confirmatory methods use ideas from probability theory in the attempt to answer specific questions. Reporting the results.

32 Types of Data, Levels of Measurement & Errors
Qualitative and quantitative; discrete and continuous; nominal, ordinal, interval and ratio. Types of error: recording error, typing error, transcription error (incorrect copying), inversion (e.g., digits typed in the wrong order), repetition (when a number is repeated), deliberate error, etc.

33 Data Collection: Experiments
An experiment is a set of actions and observations performed to solve a given problem, to test a hypothesis, or to research a phenomenon. It is an empirical approach to acquiring deeper knowledge about the physical world. Design of experiments: in the "hard" sciences it tends to focus on the elimination of extraneous effects; in the "soft" sciences it focuses more on the problems of external validity, by using statistical methods. Events also occur naturally from which scientific evidence can be drawn, which is the basis for natural experiments. Controlled experiments: to demonstrate a cause-and-effect hypothesis, an experiment must often show that, for example, a phenomenon occurs after a certain treatment is given to a subject, and that the phenomenon does not occur in the absence of the treatment. A controlled experiment generally compares the results obtained from an experimental sample against a control sample, which is practically identical to the experimental sample except for the one aspect whose effect is being tested.

34 Data Collection: Experiments
Natural experiments or quasi-experiments: natural experiments rely solely on observations of the variables of the system under study, rather than manipulation of just one or a few variables as occurs in controlled experiments. Much research in several important science disciplines, including geology, paleontology, ecology, meteorology, and astronomy, relies on quasi-experiments. Observational studies: observational studies are very much like controlled experiments except that they lack probabilistic equivalency between groups. These types of studies often arise in the area of medicine where, for ethical reasons, it is not possible to create a truly controlled group. Field experiments: named in order to draw a contrast with laboratory experiments, and often used in the social sciences, economics, etc. Field experiments suffer from the possibility of contamination: experimental conditions can be controlled with more precision and certainty in the lab.

35 Data Analysis It will follow different approaches!

36 Applied Statistics

37 Actuarial science Applies mathematical and statistical methods to finance and insurance, particularly to the assessment of risk. Actuaries are professionals who are qualified in this field.

38 Actuarial science Actuarial science is the discipline that applies mathematical and statistical methods to assess risk in the insurance and finance industries. Actuaries are professionals who are qualified in this field through examinations and experience. Actuarial science includes a number of interrelating subjects, including probability and statistics, finance, and economics. Historically, actuarial science used deterministic models in the construction of tables and premiums. The science has gone through revolutionary changes during the last 30 years due to the proliferation of high speed computers and the synergy of stochastic actuarial models with modern financial theory (Frees 1990). Many universities have undergraduate and graduate degree programs in actuarial science. In 2002, a Wall Street Journal survey on the best jobs in the United States listed “actuary” as the second best job (Lee 2002).

39 Where Do Actuaries Work and What Do They Do?
The insurance industry can't function without actuaries, and that's where most of them work. They calculate the costs to assume risk: how much to charge policyholders for life or health insurance premiums or how much an insurance company can expect to pay in claims when the next hurricane hits Florida. Actuaries provide a financial evaluation of risk for their companies to be used for strategic management decisions. Because their judgement is heavily relied upon, actuaries' career paths often lead to upper management and executive positions. When other businesses that do not have actuaries on staff need certain financial advice, they hire actuarial consultants. A consultant can be self-employed in a one-person practice or work for a nationwide consulting firm. Consultants help companies design pension and benefit plans and evaluate assets and liabilities. By delving into the financial complexities of corporations, they help companies calculate the cost of a variety of business risks. Consulting actuaries rub elbows with chief financial officers, operating and human resource executives, and often chief executive officers. Actuaries work for the government too, helping manage such programs as the Social Security system and Medicare. Since the government regulates the insurance industry and administers laws on pensions and financial liabilities, it also needs actuaries to determine whether companies are complying with the law. Who else asks an actuary to assess risks and solve thorny statistical and financial problems? You name it: banks and investment firms, large corporations, public accounting firms, insurance rating bureaus, labor unions, and fraternal organizations.

40 Typical actuarial projects:
Analyzing insurance rates, such as for cars, homes or life insurance. Estimating the money to be set aside for claims that have not yet been paid. Participating in corporate planning, such as mergers and acquisitions. Calculating a fair price for a new insurance product. Forecasting the potential impact of catastrophes. Analyzing investment programs.

41 VEE–Applied Statistical Methods
Courses that meet this requirement may be taught in the mathematics, statistics, or economics department, or in the business school. In economics departments, this course may be called Econometrics. The material could be covered in one course or two. The mathematical sophistication of these courses will vary widely and all levels are intended to be acceptable. Some analysis of real data should be included. Most of the topics listed below should be covered: Probability (3 pts); Statistical Inference (3 pts); Linear Regression Models (3 pts); Time Series Analysis (3 pts); Survival Analysis (3 pts); Elementary Stochastic Processes (3 pts); Simulation (3 pts); Introduction to the Mathematics of Finance (3 pts); Statistical Inference and Time-Series Modelling (3 pts); Stochastic Methods in Finance (3 pts); Stochastic Differential Equations and Applications (3 pts); Advanced Data Analysis (3 pts); Data Mining (3 pts); Statistical Methods in Finance (3 pts); Nonparametric Statistics (3 pts); Stochastic Processes and Applications (3 pts).

42 Some Books Generalized Linear Models for Insurance Data, by Piet de Jong and Gillian Z. Heller Stochastic Claims Reserving Methods in Insurance (The Wiley Finance Series) by Mario V. Wüthrich and Michael Merz Actuarial Modelling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems, by Michel Denuit, Xavier Marechal, Sandra Pitrebois and Jean-Francois Walhin Loss Models: From Data to Decisions (Wiley Series in Probability and Statistics) (Hardcover) by Stuart A. Klugman, Harry H. Panjer and Gordon E. Willmot

43 Biostatistics or Biometry
Biostatistics or biometry is the application of statistics to a wide range of topics in biology: public health, including epidemiology, nutrition and environmental health; the design and analysis of clinical trials in medicine; genomics, population genetics, and statistical genetics, in order to link variation in genotype with variation in phenotype; ecology; and biological sequence analysis.

44 Data Mining Data mining, also known as Knowledge Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns: the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. Data mining involves the process of analyzing data. It is a fairly recent and contemporary topic in computing, and it applies many older computational techniques from statistics, machine learning and pattern recognition.

45 Data Mining and Business Intelligence
The layers of a business-intelligence stack, with increasing potential to support business decisions toward the top: decision making (end user); data presentation and visualization techniques (business analyst); data mining and information discovery (data analyst); data exploration, statistical summary, querying, and reporting; data preprocessing/integration and data warehouses (DBA); and, at the base, the data sources: paper, files, Web documents, scientific experiments, and database systems.

46 Data Mining: Confluence of Multiple Disciplines
Data mining draws on database technology, statistics, machine learning, pattern recognition, algorithms, visualization, and other disciplines.

47 Data Mining: On What Kinds of Data?
Database-oriented data sets and applications: relational databases, data warehouses, transactional databases. Advanced data sets and advanced applications: data streams and sensor data; time-series data, temporal data, sequence data (incl. bio-sequences); structured data, graphs, social networks and multi-linked data; object-relational databases; heterogeneous databases and legacy databases; spatial and spatiotemporal data; multimedia databases; text databases; the World-Wide Web.

48 Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)
Classification: #1. C4.5: Quinlan, J. R., C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993. #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Wadsworth, 1984. #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R., Discriminant Adaptive Nearest Neighbor Classification. TPAMI 18(6). #4. Naive Bayes: Hand, D. J. and Yu, K., Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69. Statistical Learning: #5. SVM: Vapnik, V. N., The Nature of Statistical Learning Theory. Springer-Verlag. #6. EM: McLachlan, G. and Peel, D. (2000), Finite Mixture Models. J. Wiley, New York. Association Analysis: #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules. In VLDB '94. #8. FP-Tree: Han, J., Pei, J., and Yin, Y., Mining Frequent Patterns without Candidate Generation. In SIGMOD '00.

49 The 18 Identified Candidates (II)
Link Mining: #9. PageRank: Brin, S. and Page, L., The Anatomy of a Large-Scale Hypertextual Web Search Engine. In WWW-7, 1998. #10. HITS: Kleinberg, J. M., Authoritative Sources in a Hyperlinked Environment. SODA, 1998. Clustering: #11. K-Means: MacQueen, J. B., Some Methods for Classification and Analysis of Multivariate Observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967. #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M., BIRCH: An Efficient Data Clustering Method for Very Large Databases. In SIGMOD '96. Bagging and Boosting: #13. AdaBoost: Freund, Y. and Schapire, R. E., A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997).
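
As an aside, the clustering entry above (#11, k-means) is easy to try out. Here is a minimal, hypothetical sketch using scikit-learn on synthetic two-dimensional data; the data and the choice of k = 3 are assumptions for illustration only.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(1)
    # Three synthetic, well-separated groups of 2-D points.
    X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in (0, 5, 10)])

    km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
    print(km.cluster_centers_)   # one centroid per discovered cluster
    print(km.labels_[:10])       # cluster assignments of the first 10 points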

50 The 18 Identified Candidates (III)
Sequential Patterns: #14. GSP: Srikant, R. and Agrawal, R., Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996. #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu, PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01. Integrated Mining: #16. CBA: Liu, B., Hsu, W. and Ma, Y. M., Integrating Classification and Association Rule Mining. KDD-98. Rough Sets: #17. Finding reducts: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, 1992. Graph Mining: #18. gSpan: Yan, X. and Han, J., gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

51 Major Issues in Data Mining
Mining methodology: mining different kinds of knowledge from diverse data types (e.g., biological, stream, Web); performance (efficiency, effectiveness, and scalability); pattern evaluation (the interestingness problem); incorporation of background knowledge; handling noise and incomplete data; parallel, distributed and incremental mining methods; integration of the discovered knowledge with existing knowledge (knowledge fusion). User interaction: data mining query languages and ad-hoc mining; expression and visualization of data mining results; interactive mining of knowledge at multiple levels of abstraction. Applications and social impacts: domain-specific data mining and invisible data mining; protection of data security, integrity, and privacy.

52 Challenge Problems in Data Mining
Developing a Unifying Theory of Data Mining Scaling Up for High Dimensional Data and High Speed Data Streams Mining Sequence Data and Time Series Data Mining Complex Knowledge from Complex Data Data Mining in a Network Setting Distributed Data Mining and Mining Multi-agent Data Data Mining for Biological and Environmental Problems Data-Mining-Process Related Problems Security, Privacy and Data Integrity Dealing with Non-static, Unbalanced and Cost-sensitive Data

53 Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002 R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996 U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006 D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001 B. Liu, Web Data Mining, Springer 2006. T. M. Mitchell, Machine Learning, McGraw Hill, 1997 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005 S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005

54 Economic statistics Economic statistics is a branch of applied statistics focusing on the collection, processing, compilation and dissemination of statistics concerning the economy of a region, a country or a group of countries. Economic statistics is also referred to as a subtopic of official statistics, since most economic statistics are produced by official organizations (e.g. statistical institutes, supranational organizations, central banks, ministries, etc.). Economic statistics provide the empirical data needed in economic research (econometrics) and they are the basis for decision and economic policy making.

55 Econometrics Econometrics is concerned with the tasks of developing and applying quantitative or statistical methods to the study and elucidation of economic principles. Econometrics combines economic theory with statistics to analyze and test economic relationships. Theoretical econometrics considers questions about the statistical properties of estimators and tests, while applied econometrics is concerned with the application of econometric methods to assess economic theories. Although the first known use of the term "econometrics" was by Pawel Ciompa in 1910, Ragnar Frisch is given credit for coining the term in the sense that it is used today.

56 Method in Econometrics
Although many econometric methods represent applications of standard statistical models, there are some special features of economic data that distinguish econometrics from other branches of statistics. Economic data are generally observational, rather than being derived from controlled experiments. Because the individual units in an economy interact with each other, the observed data tend to reflect complex economic equilibrium conditions rather than simple behavioral relationships based on preferences or technology. Consequently, the field of econometrics has developed methods for identification and estimation of simultaneous equation models. These methods allow researchers to make causal inferences in the absence of controlled experiments. Early work in econometrics focused on time-series data, but now econometrics also fully covers cross-sectional and panel data.

57 Data in Econometrics Data is broadly classified according to the number of dimensions. A data set containing observations on a single phenomenon observed over multiple time periods is called time series. In time series data, both the values and the ordering of the data points have meaning. A data set containing observations on multiple phenomena observed at a single point in time is called cross-sectional. In cross-sectional data sets, the values of the data points have meaning, but the ordering of the data points does not. A data set containing observations on multiple phenomena observed over multiple time periods is called panel data. Alternatively, the second dimension of data may be some entity other than time. For example, when there is a sample of groups, such as siblings or families, and several observations from every group, the data is panel data. Whereas time series and cross-sectional data are both one-dimensional, panel data sets are two-dimensional. Data sets with more than two dimensions are typically called multi-dimensional panel data.
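
A minimal sketch of the three data shapes just described, with hypothetical households and numbers, using a pandas MultiIndex over entity and time for the panel case:

    import pandas as pd

    panel = pd.DataFrame(
        {"income": [50, 52, 48, 49, 60, 63]},
        index=pd.MultiIndex.from_product(
            [["household_A", "household_B", "household_C"], [2006, 2007]],
            names=["entity", "year"],
        ),
    )
    time_series   = panel.xs("household_A", level="entity")  # one entity over time
    cross_section = panel.xs(2006, level="year")             # all entities at one time
    print(panel, time_series, cross_section, sep="\n\n")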

58 Program Research area: theoretical econometrics, including time series analysis, nonparametric and semi-parametric estimation, panel data analysis, and financial econometrics; applied econometrics, including applied labor economics and empirical finance. Courses: Probability and Statistics; Advanced Econometrics; Time Series Models; Micro Econometrics; Panel Data Econometrics; Financial Econometrics; Nonparametric and Semi-parametric Econometrics; Lecture on Advanced Econometrics; Data Analysis in Academic Research (using SAS); Statistics and Data Analysis for Economics; Nonlinear Models

59 Some Research at the U. of Chicago
Proposal: "Selective Publicity and Stock Prices" By: David Solomon Proposal: "Activating Self-Control: Isolated vs. Interrelated Temptations" By: Kristian Myrseth Proposal: "Buyer's Remorse: When Evaluation is Based on Simulation Before You Chose but Deliberation After" By: Yan Zhang Proposal: "Brokerage, Second-Hand Brokerage and Difficult Working Relationships: The Role of the Informal Organization on Speaking Up about Difficult Relationships and Being Deemed Uncooperative by Co-Workers" By: Jennifer Hitler Proposal: "Resource Space Dynamics in the Evolution of Industries: Formation, Expansion and Contraction of the Resource Space and its Effects on the Survival of Organizations: By: Aleksios Gotsopoulos Defense: "An Examination of Status Dynamics in the U.S. Venture Capital Industry" By: Young-Kyu Kim Defense: "Group Dynamics and Contact: A Natural Experiment" By: Arjun Chakravarti Defense: "Essays in Corporate Governance" By: Ashwini Agrawal

60 Some Research at the U. of Chicago
Defense: "Essays on Consumer Finance" By: Brian Melzer Defense: "Male Incarceration and Teen Fertility" By: Amee Kamdar Defense: "Essays on Economic Fundamentals in Asset Pricing" By: Jie (Jennie) Bai Defense: "Asset-Intensity and the Cross-Section of Stock Returns" By: Raife Giovinazzo Defense: "Essays on Household Behavior" By: Marlena Lee Proposal: "How (Un)Accomplished Goal Actions Affect Goal Striving and Goal Setting" By: Minjung Koo Defense: "Empirical Entry Games with Complementarities: An Application to the Shopping Center Industry" By: Maria Ana Vitorino Defense: "Betas, Characteristics, and the Cross-Section of Hedge Fund Returns" By: Mark Klebanov Defense: "Expropriation Risk and Technology" By: Marcus Opp Defense: "Essays in Corporate Finance and Real Estate" By: Itzhak Ben-David Proposal: "Group Dynamics and Interpersonal Contact: A Natural Experiment" By: Arjun Chakravarti Proposal: "Structural Estimation of a Moral Hazard Model: An Application to Industrial Selling" By: Renna Jiang Proposal: "Status, Quality, and Earnings Announcements: An Analysis of the Effect of News which Confirms or Contradicts the Status-Quality Correlation on the Stock of a Company" By: Daniela Lup Defense: "Diversification and its Discontents: Idiosyncratic and Entrepreneurial Risk in the Quest for Social Status" By: Nick Roussanov

61 Summary of Econometrics
It is a combination of mathematical economics, statistics, economic statistics and economic theory. Regression analysis is popular. Time-series analysis and cross-sectional analysis are useful. Panel analysis relates to multi-dimensional regression. Fixed effect models: there are unique attributes of individuals that are not the result of random variation and that do not vary across time; adequate if we want to draw inferences only about the examined individuals. Random effect models: there are unique, time-constant attributes of individuals that are the result of random variation and do not correlate with the individual regressors; this model is adequate if we want to draw inferences about the whole population, not only the examined sample.
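
A minimal sketch of the fixed-effects idea above, using the "within" transformation on a hypothetical panel of three individuals: demean y and x within each individual and then fit an ordinary least-squares slope on the demeaned data.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "x":  [1, 2, 3, 2, 3, 4, 0, 1, 2],
        "y":  [2.1, 3.9, 6.2, 5.0, 7.1, 8.8, 1.2, 3.0, 5.1],
    })

    # Subtract each individual's own mean: time-constant attributes drop out.
    demeaned = df[["x", "y"]] - df.groupby("id")[["x", "y"]].transform("mean")
    beta = np.polyfit(demeaned["x"], demeaned["y"], 1)[0]  # slope on demeaned data
    print("within (fixed-effects) estimate of beta:", beta)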

62 References Arellano, Manuel. Panel Data Econometrics, Oxford University Press 2003. Hsiao, Cheng, Analysis of Panel Data, Cambridge University Press. Davies, A. and Lahiri, K., "Re-examining the Rational Expectations Hypothesis Using Panel Data on Multi-Period Forecasts," Analysis of Panels and Limited Dependent Variable Models, Cambridge University Press. Davies, A. and Lahiri, K., "A New Framework for Testing Rationality and Measuring Aggregate Shocks Using Panel Data," Journal of Econometrics 68: Frees, E., Longitudinal and Panel Data, Cambridge University Press.

63 Engineering Statistics
Design of experiments (DOE) uses statistical techniques to test and construct models of engineering components and systems. Quality control and process control use statistics as a tool to manage conformance to specifications of manufacturing processes and their products. Time and methods engineering uses statistics to study repetitive operations in manufacturing in order to set standards and find optimum (in some sense) manufacturing procedures.

64 Statistical Physics Statistical physics uses methods of statistics to solve physical problems of a stochastic nature. The term encompasses probabilistic and statistical approaches to classical mechanics and quantum mechanics, and hence the field might also be called statistical mechanics. It works well in classical systems when the number of degrees of freedom is so large that an exact solution is not possible, or not really useful. Statistical mechanics can also describe work in non-linear dynamics, chaos theory, thermal physics, fluid dynamics (particularly at low Knudsen numbers), and plasma physics.

65 Demography The study of human population dynamics. It encompasses the study of the size, structure and distribution of populations, and how populations change over time due to births, deaths, migration and ageing. Methods include the use of census returns and vital statistics registers, or the incorporation of survey data using indirect estimation techniques.

66 Psychological Statistics
The application of statistics to psychology. Some of the more commonly used statistical tests in psychology are: Student's t-test, chi-square, ANOVA, ANCOVA, MANOVA, regression analysis, correlation, survival analysis, clinical trials, etc.

67 Social Statistics Using statistical measurement systems to study human behavior in a social environment. Advanced statistical analyses have become popular in the social sciences; a new branch, quantitative social science, has emerged at Harvard. Techniques include structural equation modeling and factor analysis, multilevel models, cluster analysis, latent class models, item response theory, and survey methodology and survey sampling.

68 Chemometrics Apply mathematical or statistical methods to chemical data. Chemometrics is the science of relating measurements made on a chemical system or process to the state of the system via application of mathematical or statistical methods. Chemometric research spans a wide area of different methods which can be applied in chemistry. There are techniques for collecting good data (optimization of experimental parameters, design of experiments, calibration, signal processing) and for getting information from these data (statistics, pattern recognition, modeling, structure-property-relationship estimations). Chemometrics tries to build a bridge between the methods and their application in chemistry.

69 Reliability Engineering
Reliability engineers perform a wide variety of special management and engineering tasks to ensure that sufficient attention is given to details that will affect the reliability of a given system. Reliability engineers rely heavily on statistics, probability theory, and reliability theory. Many engineering techniques are used in reliability engineering, such as reliability prediction, Weibull analysis, thermal management, reliability testing and accelerated life testing.

70 Statistical Methods A common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on a response or dependent variable. Two major types of studies: Experimental and observational studies

71 Well Known Techniques Student's t-test: tests whether the means of two normally distributed populations are equal. Chi-square: tests whether two distributions are the same. Analysis of variance (ANOVA): tests differences of means or effects. Mann-Whitney U: tests the difference in medians between two observed distributions. Regression analysis: models relationships between random variables, determines the magnitude of the relationships between variables, and can be used to make predictions based on the models.

72 Correlation: indicates the strength and direction of a linear relationship between two random variables. Fisher's Least Significant Difference test: tests differences of means in multiple comparisons. Pearson product-moment correlation coefficient: a measure of how well a linear equation describes the relation between two variables X and Y measured on the same object or organism. Spearman's rank correlation coefficient: a non-parametric measure of correlation between two variables.
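
To make two of the techniques listed in the preceding two slides concrete, here is a minimal sketch using scipy.stats on made-up samples: a two-sample t-test and Spearman's rank correlation.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    a = rng.normal(10.0, 2.0, size=40)      # e.g., measurements under condition A
    b = rng.normal(11.0, 2.0, size=40)      # e.g., measurements under condition B

    # Student's t-test: are the two population means equal?
    print(stats.ttest_ind(a, b))

    # Spearman's rank correlation between two (monotonically related) variables.
    x = rng.normal(size=40)
    y = x ** 3 + 0.1 * rng.normal(size=40)  # monotone but not linear in x
    print(stats.spearmanr(x, y))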

73 Simple Statistical Applications
Compare two means Compare two proportions Compare two populations Estimate mean or proportion Find empirical distribution

74 Statistical Topics

75 Sampling Distribution
A sampling distribution describes the distribution of outcomes that one would observe from replication of a particular sampling plan. Know that to estimate means to esteem (to give value to). Know that estimates computed from one sample will be different from estimates that would be computed from another sample. Understand that estimates are expected to differ from the population characteristics (parameters) that we are trying to estimate, but that the properties of sampling distributions allow us to quantify, probabilistically, how they will differ. Understand that different statistics have different sampling distributions, with the distribution shape depending on (a) the specific statistic, (b) the sample size, and (c) the parent distribution. Understand the relationship between sample size and the distribution of sample estimates. Understand that the variability in a sampling distribution can be reduced by increasing the sample size.

76 Research Sequential sampling techniques; low response rates; biased responses.

77 Outlier Removal Outliers are a few observations that are not well fitted by the "best" available model. When they occur, one must first investigate the source of the data; if there is no doubt about the accuracy or veracity of the observation, then it should be removed and the model should be refitted. Robust statistical techniques are needed to cope with any undetected outliers; otherwise the results will be misleading. Because of potentially large variance, outliers could simply be an outcome of sampling: it is perfectly possible to have an observation that legitimately belongs to the study group by definition (say, with lognormally distributed data). Be very careful and cautious: before declaring an observation "an outlier," find out why and how such an observation occurred; it could even be an error at the data-entry stage. First, construct the boxplot of your data: form the Q1, Q2, and Q3 points which divide the sample into four equally sized groups (Q2 = median) and let IQR = Q3 - Q1. Outliers are then defined as those points outside the values Q3 + k*IQR and Q1 - k*IQR; in most cases one sets k = 1.5 or 3. An alternative definition flags points outside mean + k*s and mean - k*s, where s is the sample standard deviation and k is 2, 2.5, or 3.
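
A minimal sketch of the boxplot/IQR rule just described, with k = 1.5 on hypothetical data containing one injected extreme value:

    import numpy as np

    data = np.array([12, 14, 14, 15, 16, 17, 18, 19, 21, 95.0])
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    k = 1.5
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = data[(data < lower) | (data > upper)]
    print("fences:", lower, upper, "-> flagged outliers:", outliers)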

78 Central Limit Theorem The average of a sample of observations drawn from some population with any shape of distribution is approximately distributed as a normal distribution if certain conditions are met. It is well known that, whatever the parent population is, the standardized variable will have a distribution with mean 0 and standard deviation 1 under random sampling with a large sample size. The sample size needed for the approximation to be adequate depends strongly on the shape of the parent distribution; symmetry is particularly important. For a symmetric, short-tailed parent distribution, even one very different in shape from a normal distribution, an adequate approximation can be obtained with small samples (e.g., 10 or 12 for the uniform distribution). In some extreme cases (e.g., a binomial with success probability very close to 0 or 1), sample sizes far exceeding the typical guidelines (say, 30) are needed for an adequate approximation.
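
A minimal simulation sketch of this behavior: sample means drawn from a skewed (exponential) parent distribution become less skewed, i.e., closer to normal, as the sample size n grows. The sample sizes and replication count are arbitrary.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    for n in (5, 30, 200):
        # 10,000 replicated samples of size n from a skewed parent distribution.
        means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
        # Skewness of the sampling distribution shrinks toward 0 (normal) as n grows.
        print(n, stats.skew(means))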

79 P-values The P-value, which directly depends on a given sample, attempts to provide a measure of the strength of the results of a test, in contrast to a simple reject/do-not-reject decision. If the null hypothesis is true and the chance of random variation is the only reason for sample differences, then the P-value is a quantitative measure to feed into the decision-making process as evidence. The following is a reasonable interpretation of P-values: P < 0.01, very strong evidence against H0; 0.01 ≤ P < 0.05, moderate evidence against H0; 0.05 ≤ P < 0.10, suggestive evidence against H0; 0.10 ≤ P, little or no real evidence against H0. This interpretation is widely accepted, and many scientific journals routinely publish papers using it for the results of tests of hypotheses. For a fixed sample size, when the number of realizations is decided in advance, the distribution of p is uniform (assuming the null hypothesis); we would express this as P(p ≤ x) = x. That means the criterion p < 0.05 achieves a Type I error rate of 0.05. When a p-value is associated with a set of data, it is a measure of the probability that the data could have arisen as a random sample from some population described by the statistical (testing) model. A p-value is a measure of how much evidence you have against the null hypothesis: the smaller the p-value, the more evidence you have. One may combine the p-value with the significance level to make a decision on a given test of hypothesis; in such a case, if the p-value is less than some threshold (usually 0.05, sometimes a bit larger like 0.1 or a bit smaller like 0.01) then you reject the null hypothesis.
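
A minimal sketch checking the uniformity claim above by simulation: run many t-tests on data where the null hypothesis really is true and confirm that roughly 5% of the p-values fall below 0.05. The simulation settings are arbitrary.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    # Two samples from the SAME population, so H0 (equal means) is true.
    pvals = np.array([stats.ttest_ind(rng.normal(size=30),
                                      rng.normal(size=30)).pvalue
                      for _ in range(2_000)])
    # Under H0, P(p <= 0.05) should be close to 0.05.
    print("fraction of p-values below 0.05:", (pvals < 0.05).mean())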

80 Accuracy, Precision, Robustness, and Data Quality
Accuracy is the degree of conformity of a measured/calculated quantity to its actual (true) value. Precision is the degree to which further measurements or calculations will show the same or similar results. Robustness is the resilience of the system, especially when under stress or when confronted with invalid input. Data are of high quality "if they are fit for their intended uses in operations, decision making and planning." An "accurate" estimate has small bias. A "precise" estimate has both small bias and small variance. The robustness of a procedure is the extent to which its properties do not depend on those assumptions which you do not wish to make. Distinguish between bias robustness and efficiency robustness. Example: the sample mean is seen as a bias-robust estimator because the CLT guarantees a 0 bias for large samples regardless of the underlying distribution. It is clearly not efficiency robust, however, as its variance can increase endlessly; that variance can even be infinite if the underlying distribution is Cauchy or Pareto with a large scale parameter.

81 Bias Reduction Techniques
The most effective tools for bias reduction are the bootstrap and the jackknife. The bootstrap uses resampling from a given set of data to mimic the variability that produced the data in the first place; it has a rather dependable theoretical basis and can be a highly effective procedure for the estimation of error quantities in statistical problems. The bootstrap creates a virtual population by duplicating the same sample over and over, and then re-samples from the virtual population to form a reference set; you then compare your original sample with the reference set to get the exact p-value. Very often, a certain structure is "assumed" so that a residual is computed for each case; what is then re-sampled is from the set of residuals, which are then added to those assumed structures before some statistic is evaluated. The purpose is often to estimate a P-level. The jackknife re-computes the statistic by leaving one observation out each time; this bit of logical folding provides estimators of coefficients and error that have reduced bias. Bias reduction techniques have wide applications in anthropology, chemistry, climatology, clinical trials, cybernetics, ecology, etc.
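
A minimal sketch of both resampling ideas on synthetic data: a bootstrap standard error of the mean and the leave-one-out jackknife standard error. The data and replication counts are arbitrary.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.normal(50, 8, size=40)

    # Bootstrap: resample with replacement many times, recompute the statistic.
    boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                           for _ in range(2_000)])
    print("bootstrap SE of the mean:", boot_means.std(ddof=1))

    # Jackknife: recompute the statistic leaving one observation out each time.
    jack_means = np.array([np.delete(x, i).mean() for i in range(x.size)])
    n = x.size
    jack_se = np.sqrt((n - 1) / n * np.sum((jack_means - jack_means.mean()) ** 2))
    print("jackknife SE of the mean:", jack_se)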

82 Effect Size Effect size (ES) permits the comparative effect of different treatments to be compared, even when based on different samples and different measuring instruments. The ES is the standardized mean difference between the control group and the treatment group. Glass's method: suppose an experimental treatment group has a mean score of Xe and a control group has a mean score of Xc and a standard deviation of Sc; then the effect size is equal to (Xe - Xc)/Sc. Hunter and Schmidt (1990) suggested using a pooled within-group standard deviation, because it has less sampling error than the control group standard deviation under the condition of equal sample sizes. In addition, Hunter and Schmidt corrected the effect size for measurement error by dividing the effect size by the square root of the reliability coefficient of the dependent variable. Cohen's ES: (mean1 - mean2)/pooled SD.
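
A minimal sketch (made-up group scores) computing the two standardized effect sizes mentioned above, Glass's delta and Cohen's d:

    import numpy as np

    treatment = np.array([23., 25., 28., 30., 27., 26.])
    control   = np.array([20., 22., 21., 24., 23., 19.])

    # Glass's method: divide by the control-group standard deviation.
    glass_delta = (treatment.mean() - control.mean()) / control.std(ddof=1)

    # Cohen's d: divide by the pooled within-group standard deviation.
    n1, n2 = len(treatment), len(control)
    pooled_sd = np.sqrt(((n1 - 1) * treatment.var(ddof=1) +
                         (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2))
    cohens_d = (treatment.mean() - control.mean()) / pooled_sd

    print("Glass's delta:", glass_delta, "Cohen's d:", cohens_d)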

83 Nonparametric Techniques
Parametric techniques are more useful the more one knows about the subject matter, since knowledge about the data can be built into parametric models. Nonparametric methods, in both senses of the term (distribution-free tests and flexible functional forms), are more useful when one knows less about the subject matter. A statistical technique is called nonparametric if it satisfies at least one of the following criteria: 1. The data entering the analysis are enumerative - that is, count data representing the number of observations in each category or cross-category. 2. The data are measured and/or analyzed using a nominal or ordinal scale of measurement. 3. The inference does not concern a parameter in the population distribution. 4. The probability distribution of the statistic upon which the analysis is based is very general, such as continuous, discrete, or symmetric. The main tests are: the Mann-Whitney rank test, a nonparametric alternative to Student's t-test when one does not have normally distributed data, used with two independent groups (analogous to the independent-groups t-test); Wilcoxon, used with two related (i.e., matched or repeated) groups (analogous to the related-samples t-test); Kruskal-Wallis, used with two or more independent groups (analogous to the single-factor between-subjects ANOVA); Friedman, used with two or more related groups (analogous to the single-factor within-subjects ANOVA).
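
A minimal sketch (hypothetical scores) of two of these tests in scipy.stats: the Mann-Whitney U test for two independent groups and the Kruskal-Wallis test for three.

    from scipy import stats

    group_a = [3, 5, 4, 6, 8, 7]
    group_b = [9, 11, 10, 12, 8, 13]
    group_c = [15, 14, 16, 13, 17, 18]

    # Two independent groups (nonparametric alternative to the t-test).
    print(stats.mannwhitneyu(group_a, group_b))

    # Three or more independent groups (nonparametric alternative to ANOVA).
    print(stats.kruskal(group_a, group_b, group_c))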

84 Least Squares Models Many problems in analyzing data involve describing how variables are related. The simplest of all models describing the relationship between two variables is a linear, or straight-line, model. The conventional method is that of least squares, which finds the line minimizing the sum of squared vertical distances between the observed points and the fitted line. There is a simple connection between the numerical coefficients in the regression equation and the slope and intercept of the regression line. A summary statistic like a correlation coefficient does not tell the whole story; a scatter plot is an essential complement to examining the relationship between the two variables. Model checking is an essential part of the process of statistical modeling: after all, conclusions based on models that do not properly describe an observed set of data will be invalid. The impact of violations of the regression model assumptions (i.e., conditions), and possible solutions, can be assessed by analyzing the residuals.
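
A minimal sketch of the straight-line least-squares fit on made-up (x, y) values, using the closed-form slope and intercept formulas, with the residuals kept for model checking:

    import numpy as np

    x = np.array([1., 2., 3., 4., 5., 6.])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

    # Closed-form least-squares estimates for y = intercept + slope * x.
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    residuals = y - (intercept + slope * x)   # inspect these to check the model
    print("fit: y =", intercept, "+", slope, "* x")
    print("residuals:", residuals)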

85 Least Median of Squares and Least Absolute Deviation Models
The standard least squares techniques for estimation in linear models are not robust, in the sense that outliers or contaminated data can strongly influence the estimates. Robust techniques which protect against contamination are least median of squares (LMS) and least absolute deviation (LAD). An extension of LMS estimation to generalized linear models gives rise to the least median of deviance (LMD) estimator.
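
A minimal sketch of LAD regression on made-up data with one gross outlier: minimize the sum of absolute residuals numerically and compare the result with ordinary least squares. The data and starting values are arbitrary.

    import numpy as np
    from scipy.optimize import minimize

    x = np.array([1., 2., 3., 4., 5., 6.])
    y = np.array([2., 4., 6., 8., 10., 40.])        # last point is an outlier

    def sad(params):                                 # sum of absolute deviations
        a, b = params
        return np.sum(np.abs(y - (a + b * x)))

    lad = minimize(sad, x0=[0.0, 1.0], method="Nelder-Mead").x
    ols = np.polyfit(x, y, 1)                        # [slope, intercept]
    print("LAD (intercept, slope):", lad)
    print("OLS (slope, intercept):", ols)            # pulled toward the outlier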

86 Multivariate Data Analysis
Multivariate analysis is a branch of statistics involving the consideration of objects on each of which the values of a number of variables are observed. Multivariate techniques are used across the whole range of fields of statistical application. The main techniques are: principal components analysis, factor analysis, cluster analysis, and discriminant analysis. Principal component analysis is used for exploring data to reduce its dimension. Generally, PCA seeks to represent n correlated random variables by a reduced set of uncorrelated variables, which are obtained by transformation of the original set onto an appropriate subspace. Two closely related techniques, principal component analysis and factor analysis, are used to reduce the dimensionality of multivariate data. In these techniques, correlations and interactions among the variables are summarized in terms of a small number of underlying factors; the methods rapidly identify key variables or groups of variables that control the system under study. Cluster analysis is an exploratory data analysis tool which aims at sorting different objects into groups in such a way that the degree of association between two objects is maximal if they belong to the same group and minimal otherwise. Discriminant function analysis is used to classify cases into the values of a categorical dependent variable, usually a dichotomy.

87 Regression Analysis Models the relationship between one or more response variables (Y) and the predictors (X1,...,Xp). If there is more than one response variable, we speak of multivariate regression. Types of regression: Simple and multiple linear regression. Simple linear regression and multiple linear regression are related statistical methods for modeling the relationship between two or more random variables using a linear equation. Linear regression assumes the best estimate of the response is a linear function of some parameters (though not necessarily linear in the predictors). Nonlinear regression models: if the relationship between the variables being analyzed is not linear in the parameters, a number of nonlinear regression techniques may be used to obtain a more accurate regression. Other models: although these types are the most common, there also exist Poisson regression, supervised learning, and unit-weighted regression. Linear models: predictor variables may be defined quantitatively or qualitatively (i.e., categorically). Categorical predictors are sometimes called factors. Although the method of estimating the model is the same for each case, different situations are sometimes known by different names for historical reasons: if the predictors are all quantitative, we speak of multiple regression; if the predictors are all qualitative, one performs analysis of variance; if some predictors are quantitative and some qualitative, one performs an analysis of covariance.

88 General Linear Regression
The general linear model (GLM) is a statistical linear model. It may be written as Y = XB + U, where Y is a matrix containing a series of multivariate measurements, X is a design matrix, B is a matrix of parameters that are usually to be estimated, and U is a matrix of residuals (i.e., errors or noise). The residuals are usually assumed to follow a multivariate normal distribution, or another distribution such as one from the exponential family. The general linear model incorporates a number of different statistical models: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, the t-test and the F-test. If there is only one column in Y (i.e., one dependent variable), the model can also be referred to as the multiple regression model (multiple linear regression). When the response is not continuous, a generalized linear model is used instead: for example, if the response variable can take only binary values (a Boolean or Yes/No variable), logistic regression is preferred. The outcome of this type of regression is a function which describes how the probability of a given event (e.g., the probability of getting "yes") varies with the predictors. Hypothesis tests with the general linear model can be made in two ways: multivariate and mass-univariate.
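For the binary-response case, a hedged sketch using statsmodels' GLM with a binomial family (the data and coefficients are simulated for illustration only):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(4)
    x = rng.normal(size=200)
    p = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))      # true probability of "yes"
    y = rng.binomial(1, p)                       # binary response

    X = sm.add_constant(x)
    logit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
    print(logit.params)                          # how the log-odds of "yes" vary with x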

89 Semiparametric and Non-parametric modeling
The Generalized Linear Model (GLM) takes the form Y = G(X1*b1 + ... + Xp*bp) + e, where G is called the link function. All these models lead to the problem of estimating a multivariate regression. Parametric regression estimation has the disadvantage that the assumed parametric form already imposes certain properties on the resulting estimate. Nonparametric techniques allow diagnostics of the data without this restriction, and the model structure is not specified a priori; however, they require large sample sizes and cause problems in graphical visualization. Semiparametric methods are a compromise between the two: they support nonparametric modeling of certain features while profiting from the simplicity of parametric methods. Example: the Cox proportional hazards model.
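To illustrate the nonparametric alternative, here is a small hand-rolled Nadaraya-Watson kernel smoother in Python; the bandwidth, data, and function names are choices made for this sketch, not part of any particular package:

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.sort(rng.uniform(0, 10, 150))
    y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

    def kernel_smoother(x0, x, y, h=0.5):
        # Nadaraya-Watson estimator with a Gaussian kernel of bandwidth h:
        # a weighted average of the responses near x0, with no functional form assumed
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)
        return (w * y).sum() / w.sum()

    grid = np.linspace(0, 10, 50)
    fitted = np.array([kernel_smoother(g, x, y) for g in grid])
    print(fitted[:5])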

90 Survival analysis It deals with “death” in biological organisms and failure in mechanical systems. Death or failure is called an "event" in the survival analysis literature, and so models of death or failure are generically termed time-to-event models. Survival data arise in a literal form from trials concerning life-threatening conditions, but the methodology can also be applied to other waiting times, such as the duration of pain relief. Censoring: nearly every sample contains some cases that do not experience an event. If the dependent variable is the time of the event, what do you do with these "censored" cases? Survival analysis attempts to answer questions such as: what fraction of a population will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the odds of survival? Time-dependent covariates: many explanatory variables (like income or blood pressure) change in value over time. How do you put such variables into a regression analysis? Survival analysis is a group of statistical methods for the analysis and interpretation of survival data. Survival and hazard functions, together with methods for estimating parameters and testing hypotheses, form the main part of the analysis of survival data. The main topics relevant to survival data analysis are: survival and hazard functions; types of censoring; estimation of survival and hazard functions (the Kaplan-Meier and life table estimators); simple life tables; comparison of survival functions (the logrank, Mantel-Haenszel, and Wilcoxon tests); the proportional hazards model with time-independent and time-dependent covariates; recurrent-event models; and methods for determining sample sizes.
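A minimal sketch of the Kaplan-Meier idea, written directly in numpy so the handling of censored cases is visible; the survival times and censoring indicators are hypothetical:

    import numpy as np

    # Hypothetical survival data: time to event, and whether the event was observed
    # (event = 0 means the case was censored at that time)
    time  = np.array([5, 8, 12, 12, 15, 20, 21, 24, 30, 33], dtype=float)
    event = np.array([1, 1,  1,  0,  1,  0,  1,  1,  0,  1])

    def kaplan_meier(time, event):
        order = np.argsort(time, kind="stable")
        time, event = time[order], event[order]
        at_risk = len(time)
        surv, points = 1.0, []
        for t, d in zip(time, event):
            if d == 1:
                surv *= 1 - 1.0 / at_risk     # the curve steps down only at observed events
                points.append((t, surv))
            at_risk -= 1                       # censored cases still leave the risk set
        return points

    print(kaplan_meier(time, event))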

91 Repeated Measures and Longitudinal Data
Repeated measures and longitudinal data require special attention because they involve correlated data, which commonly arise when the primary sampling units are measured repeatedly over time or under different conditions. The experimental units are often subjects. Interest usually centers on between-subject and within-subject effects. Between-subject effects are those whose values change only from subject to subject and remain the same for all observations on a single subject, for example treatment and gender. Within-subject effects are those whose values may differ from measurement to measurement. Since measurements on the same experimental unit are likely to be correlated, repeated measures analysis must account for that correlation. Normal theory models for split-plot experiments and repeated measures ANOVA can be used to introduce the concept of correlated data. PROC GLM, PROC GENMOD and PROC MIXED in the SAS system may be used. Mixed linear models provide a general framework for modeling covariance structures, a critical first step that influences parameter estimation and tests of hypotheses. The primary objectives are to investigate trends over time and how they relate to treatment groups or other covariates. Techniques applicable to non-normal data, such as McNemar's test for binary data, weighted least squares for categorical data, and generalized estimating equations (GEE), are the main topics. The GEE method can be used to accommodate correlation when the means at each time point are modeled using a generalized linear model. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–22.
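A hedged sketch of a GEE fit for repeated measures, assuming the GEE interface in statsmodels (smf.gee with an exchangeable working correlation); the subjects, treatment indicator, and effect sizes are simulated for illustration:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(6)
    n_subj, n_times = 40, 4
    df = pd.DataFrame({
        "subject": np.repeat(np.arange(n_subj), n_times),
        "time": np.tile(np.arange(n_times), n_subj),
        "trt": np.repeat(rng.integers(0, 2, n_subj), n_times),   # between-subject effect
    })
    subj_effect = np.repeat(rng.normal(scale=1.0, size=n_subj), n_times)
    df["y"] = 1 + 0.3 * df["time"] + 0.8 * df["trt"] + subj_effect + rng.normal(size=len(df))

    # GEE with an exchangeable working correlation to account for repeated measures
    fit = smf.gee("y ~ time + trt", "subject", data=df,
                  cov_struct=sm.cov_struct.Exchangeable(),
                  family=sm.families.Gaussian()).fit()
    print(fit.params)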

92 Information Theory Information theory is a branch of probability and mathematical statistics that deals with communication systems, data transmission, cryptography, signal-to-noise ratios, data compression, etc. Claude Shannon is the father of information theory. His theory considered the transmission of information as a statistical phenomenon and gave communications engineers a way to determine the capacity of a communication channel in terms of the common currency of bits. Shannon defined a measure of entropy, H = - Σ p_i log p_i, that, when applied to an information source, could determine the capacity of the channel required to transmit the source as encoded binary digits. The entropy is a measure of the amount of uncertainty one has about which message will be chosen; it is defined as the average self-information of a message i from that message space. Entropy as defined by Shannon is closely related to entropy as defined by physicists in statistical thermodynamics, and this work was the inspiration for adopting the term entropy in information theory. Other useful measures of information include mutual information, which is a measure of the correlation between two event sets. Mutual information is defined for two events X and Y as M(X, Y) = H(X) + H(Y) - H(X, Y), where H(X, Y) is the joint entropy, defined as H(X, Y) = - Σ p(x_i, y_j) log p(x_i, y_j). Mutual information is closely related to the log-likelihood ratio test for the multinomial distribution and to Pearson's chi-square test. The field of information science has since expanded to cover the full range of techniques and abstract descriptions for the storage, retrieval and transmittal of information. Applications: coding theory, making and breaking cryptographic systems, intelligence work, Bayesian analysis, gambling, investing, etc.
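The entropy and mutual information formulas translate directly into code; in this small sketch the joint distribution is made up for illustration:

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        p = p[p > 0]
        return -(p * np.log2(p)).sum()          # H = -sum p_i log2 p_i, in bits

    # Joint distribution of two events X and Y (hypothetical probabilities)
    joint = np.array([[0.30, 0.10],
                      [0.15, 0.45]])
    px, py = joint.sum(axis=1), joint.sum(axis=0)

    # Mutual information: M(X, Y) = H(X) + H(Y) - H(X, Y)
    mi = entropy(px) + entropy(py) - entropy(joint.ravel())
    print(entropy(px), mi)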

93 Incomplete Data Methods that deal with the analysis of data containing missing values can be classified into: - analysis of complete cases, including weighting adjustments; - imputation methods, and extensions to multiple imputation; and - methods that analyze the incomplete data directly, without requiring a rectangular data set, such as maximum likelihood and Bayesian methods. Multiple imputation (MI) is a general paradigm for the analysis of incomplete data. Each missing datum is replaced by m > 1 simulated values, producing m simulated versions of the complete data. Each version is analyzed by standard complete-data methods, and the results are combined using simple rules to produce inferential statements that incorporate missing-data uncertainty. The focus is on the practice of MI for real statistical problems in modern computing environments.
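A deliberately crude sketch of the multiple-imputation workflow and Rubin's combining rules; a real imputation model would also reflect uncertainty in the imputation parameters, so treat this only as an outline of the mechanics:

    import numpy as np

    rng = np.random.default_rng(7)
    y = rng.normal(loc=50, scale=10, size=200)
    y[rng.random(200) < 0.2] = np.nan            # make roughly 20% of values missing

    obs = y[~np.isnan(y)]
    m = 5                                        # number of imputed data sets
    estimates, variances = [], []
    for _ in range(m):
        # crude imputation model: draw missing values from N(mean, sd) of the observed data
        filled = y.copy()
        n_mis = int(np.isnan(y).sum())
        filled[np.isnan(y)] = rng.normal(obs.mean(), obs.std(ddof=1), size=n_mis)
        estimates.append(filled.mean())
        variances.append(filled.var(ddof=1) / len(filled))

    # Rubin's rules: combine within- and between-imputation variance
    qbar = np.mean(estimates)
    total_var = np.mean(variances) + (1 + 1 / m) * np.var(estimates, ddof=1)
    print(qbar, np.sqrt(total_var))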

94 Interactions ANOVA programs generally produce all possible interactions, while regression programs generally do not produce any interactions, so it is up to the user to construct interaction terms by multiplying the relevant predictors together. If the standard error of a coefficient is high, multicollinearity might be the cause, but it is not the only factor that can produce large SEs for estimators of "slope" coefficients in regression models. SEs are inversely proportional to the range of variability in the predictor variable, so to increase the precision of the estimators we should increase the range of the input. Another cause of large SEs is a small number of "event" observations or a small number of "non-event" observations. Yet another cause is serial correlation, which arises when using time series. When X and W are categorical, the interaction describes a two-way analysis of variance (ANOVA) model; when X and W are (quasi-)continuous variables, the equation describes a multiple linear regression (MLR) model. In ANOVA contexts, the existence of an interaction can be described as a difference between differences. In MLR contexts, an interaction implies a change in the slope (of the regression of Y on X) from one value of W to another value of W.
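Constructing an interaction term by hand is just a column multiplication, as in this illustrative numpy sketch (data simulated for the example):

    import numpy as np

    rng = np.random.default_rng(8)
    x = rng.normal(size=200)
    w = rng.normal(size=200)
    y = 1 + 0.5 * x + 0.3 * w + 0.7 * (x * w) + rng.normal(size=200)

    # The user constructs the interaction term by multiplying the predictors together
    X = np.column_stack([np.ones_like(x), x, w, x * w])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(beta)   # a nonzero last coefficient: the slope of y on x changes with w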

95 Sufficient Statistic A sufficient statistic contains all the information about a parameter that is present in the raw data. For example, the sum of your data is sufficient to estimate the mean of the population; you do not have to know the data set itself. This saves a lot of effort: simply send out the total and the sample size. A sufficient statistic t for a parameter θ is a function of the sample data x1,...,xn which contains all the information in the sample about the parameter θ. More formally, sufficiency is defined in terms of the likelihood function for θ: for a sufficient statistic t, the likelihood L(x1,...,xn | θ) can be written as g(t | θ) * k(x1,...,xn). Since the second factor does not depend on θ, t is said to be a sufficient statistic for θ. To illustrate, let the observations be independent Bernoulli trials with the same probability of success. Suppose that there are n trials, that person A observes which observations are successes, and that person B only finds out the number of successes. Because the number of successes contains all the information in the sample about the success probability, B can draw exactly the same inferences about it as A.

96 Tests Significance tests are based on assumptions: the data have to be a random sample from a well-defined population, and one has to assume that some variables follow a certain distribution. Power of a test is the probability of correctly rejecting a false null hypothesis; it is one minus the probability of making a Type II error. A Type I error is rejecting a null hypothesis that is in fact true; a Type II error is failing to reject a null hypothesis that is in fact false. Decreasing the probability of making a Type I error increases the probability of making a Type II error. Power and the true difference between population means: the distance between the two population means affects the power of the test. Power as a function of sample size and variance: sample size has an indirect effect on power because it affects the measure of variance used in the test; when n is large we have a lower standard error than when n is small. Pilot studies: when the estimates needed for a sample size calculation are not available from an existing database, a pilot study is needed to obtain adequate estimates with a given precision.
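A small sketch of how power depends on the true difference, the variance, and the sample size, using a normal approximation for a two-sided two-sample test; the function name and the numbers are arbitrary choices for this illustration:

    import numpy as np
    from scipy.stats import norm

    def two_sample_power(delta, sigma, n, alpha=0.05):
        # Approximate power of a two-sided two-sample test with n per group:
        # a larger true difference delta, a larger n, or a smaller sigma all raise power
        se = sigma * np.sqrt(2.0 / n)
        z_crit = norm.ppf(1 - alpha / 2)
        return norm.cdf(delta / se - z_crit) + norm.cdf(-delta / se - z_crit)

    print(two_sample_power(delta=5, sigma=10, n=30))
    print(two_sample_power(delta=5, sigma=10, n=100))   # more power with larger n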

97 ANOVA: Analysis of Variance
Tests the difference between two or more means. ANOVA does this by examining the ratio of the variability between conditions to the variability within each condition. Say we give a drug that we believe will improve memory to one group of people and a placebo to another group. We might measure memory performance by the number of words recalled from a list we ask everyone to memorize. An ANOVA test would compare the variability that we observe between the two conditions to the variability observed within each condition. Recall that we measure variability as the sum of squared differences of each score from the mean. When the variability that we predict (between the groups) is much greater than the variability we don't predict (within each group), we conclude that our treatments produce different results.
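The memory example can be sketched with scipy's one-way ANOVA routine; the recall scores below are simulated, not real data:

    import numpy as np
    from scipy.stats import f_oneway

    rng = np.random.default_rng(9)
    drug    = rng.normal(loc=12, scale=3, size=25)   # words recalled, drug group
    placebo = rng.normal(loc=10, scale=3, size=25)   # words recalled, placebo group

    # F = between-group variability / within-group variability
    f_stat, p_value = f_oneway(drug, placebo)
    print(f_stat, p_value)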

98 Data Mining and Knowledge Discovery
It uses sophisticated statistical analysis and modeling techniques to uncover patterns and relationships hidden in organizational databases. It aims at tools and techniques for processing structured information, from databases to data warehouses to data mining and on to knowledge discovery. Data warehouse applications have become business-critical, and data mining can squeeze even more value out of these huge repositories of information. The continuing rapid growth of on-line data and the widespread use of databases necessitate the development of techniques for extracting useful knowledge and for facilitating database access. The challenge of extracting knowledge from data is of common interest to several fields, including statistics, databases, pattern recognition, machine learning, data visualization, optimization, and high-performance computing. The data mining process involves identifying an appropriate data set to "mine," or sift through, to discover relationships in the data content. Data mining tools include techniques like case-based reasoning, cluster analysis, data visualization, fuzzy query and analysis, and neural networks. Data mining sometimes resembles the traditional scientific method of identifying a hypothesis and then testing it using an appropriate data set. It is also reminiscent of what happens when data have been collected, no significant results are found, and an ad hoc, exploratory analysis is conducted to find a significant relationship.

99 Data mining is the process of extracting knowledge from data
Data mining is the process of extracting knowledge from data. For clever marketers, that knowledge can be worth as much as the stuff real miners dig from the ground. Data mining is an analytic process designed to explore large amounts of data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The process thus consists of three basic stages: exploration, model building or pattern definition, and validation/verification. What distinguishes data mining from conventional statistical data analysis is that data mining is usually done for the purpose of "secondary analysis" aimed at finding unsuspected relationships unrelated to the purposes for which the data were originally collected. Data warehousing is the process of organizing the storage of large, multivariate data sets in a way that facilitates the retrieval of information for analytic purposes. Data mining is now a rather vague term, but the element that is common to most definitions is "predictive modeling with large data sets as used by big companies". Therefore, data mining is the extraction of hidden predictive information from large databases. It is a powerful new technology with great potential, for example, to help marketing managers "preemptively define the information market of tomorrow." Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools. Data mining answers business questions that traditionally were too time-consuming to resolve. Data mining tools scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. Data mining techniques can be implemented rapidly on existing software and hardware platforms across large companies to enhance the value of existing resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high-performance client-server or parallel processing computers, data mining tools can analyze massive databases while a customer or analyst takes a coffee break, then deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"

100 Bayes and Empirical Bayes Methods
Bayes and empirical Bayes (EB) methods provide a structure for combining information from similar components and produce efficient inferences for both the individual components and the shared model characteristics. Many complex applied investigations are ideal settings for this type of synthesis. For example, county-specific disease incidence rates can be unstable due to small populations or low rates. 'Borrowing information' from adjacent counties by partial pooling produces better estimates for each county, and Bayes/empirical Bayes methods structure the approach. Importantly, recent advances in computing, and the consequent ability to evaluate complex models, have increased the popularity and applicability of Bayesian methods. Bayes and EB methods can be implemented using modern Markov chain Monte Carlo (MCMC) computational methods. Properly structured Bayes and EB procedures typically have good frequentist and Bayesian performance, both in theory and in practice. This in turn motivates their use in advanced high-dimensional model settings (e.g., longitudinal data or spatio-temporal mapping models), where a Bayesian model implemented via MCMC often provides the only feasible approach that incorporates all relevant model features.
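A rough sketch of empirical Bayes shrinkage for county-like rates, using a method-of-moments beta-binomial prior; the counts are simulated and the moment fit is deliberately simple, so this is an outline of "borrowing information" rather than a production method:

    import numpy as np

    rng = np.random.default_rng(10)
    # Hypothetical county-level data: cases observed out of small populations
    n = rng.integers(50, 500, size=30)
    true_rate = rng.beta(2, 50, size=30)
    cases = rng.binomial(n, true_rate)
    raw = cases / n                                   # unstable for small counties

    # Method-of-moments fit of a Beta(a, b) prior to the observed rates,
    # then shrink each county toward the overall mean (partial pooling)
    mean, var = raw.mean(), raw.var(ddof=1)
    common = mean * (1 - mean) / var - 1
    a, b = mean * common, (1 - mean) * common
    eb = (cases + a) / (n + a + b)
    print(np.round(raw[:5], 3), np.round(eb[:5], 3))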

101 Meta-Analysis combines the results of several studies that address a set of related research hypotheses. The first meta-analysis was performed by Karl Pearson in 1904. Because results from different studies investigating different independent variables are measured on different scales, the dependent variable in a meta-analysis is some standardized measure, such as an effect size, odds ratio, correlation coefficient, or p-value. Results from studies are combined using different approaches; one approach frequently used in meta-analysis in health care research is termed the 'inverse variance method'. Variations in sampling schemes can introduce heterogeneity into the result, which is the presence of more than one intercept in the solution. In such a case, a 'random effects model' should be adopted whenever statistical measures indicate the presence of heterogeneity. Modern meta-analysis does more than just combine the effect sizes of a set of studies. It can test whether the studies' outcomes show more variation than would be expected from sampling different research participants. If that is the case, study characteristics such as the measurement instrument used, the population sampled, or aspects of the studies' design are coded. These characteristics are then used as predictor variables to analyze the excess variation in the effect sizes.
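The inverse variance method is short enough to write out directly; the effect sizes and variances below are invented for illustration, and Cochran's Q is included only as a simple heterogeneity check:

    import numpy as np

    # Hypothetical effect sizes (e.g., standardized mean differences) and their variances
    effects   = np.array([0.30, 0.45, 0.10, 0.52, 0.25])
    variances = np.array([0.04, 0.09, 0.02, 0.12, 0.05])

    w = 1.0 / variances                       # inverse-variance weights
    pooled = (w * effects).sum() / w.sum()
    pooled_se = np.sqrt(1.0 / w.sum())

    # Cochran's Q: do the studies vary more than sampling error alone would suggest?
    Q = (w * (effects - pooled) ** 2).sum()
    print(pooled, pooled_se, Q)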

102 Spatial Data Analysis Data that are geographically or spatially referenced are encountered in a very wide variety of practical contexts. In the same way that data collected at different points in time may require specialized analytical techniques, there is a range of statistical methods devoted to the modeling and analysis of data collected at different points in space. Increased public sector and commercial recording and use of geographically referenced data, recent advances in computer hardware and software capable of manipulating and displaying spatial relationships in the form of digital maps, and an awareness of the potential importance of spatial relationships in many areas of research have all combined to produce an increased interest in spatial analysis. Spatial data analysis is concerned with the study of such techniques: the kinds of problems they are designed to address, their theoretical justification, and when and how to use them in practice. Many natural phenomena involve a random distribution of points in space. Biologists who observe the locations of cells of a certain type in an organ, astronomers who plot the positions of the stars, botanists who record the positions of plants of a certain species, and geologists detecting the distribution of a rare mineral in rock are all observing spatial point patterns in two or three dimensions. Such phenomena can be modeled by spatial point processes.

103 The spatial linear model is fundamental to a number of techniques used in image processing, for example for locating gold/ore deposits or creating maps. There are many unresolved problems in this area, such as the behavior of maximum likelihood estimators and predictors, and diagnostic tools. There are strong connections between kriging predictors for the spatial linear model and spline methods of interpolation and smoothing. The two-dimensional version of splines/kriging can be used to construct deformations of the plane, which are of key importance in shape analysis. For the analysis of spatially auto-correlated data, in logistic regression for example, one may use the Moran coefficient, which is available in some statistical packages such as Spacestat. This statistic tends to lie between -1 and +1, though values are not restricted to this range. Values near +1 indicate that similar values tend to cluster; values near -1 indicate that dissimilar values tend to cluster; values near -1/(n-1) indicate that values tend to be randomly scattered.
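A minimal sketch of the Moran coefficient computed from a user-supplied spatial weights matrix; the four locations and weights are invented for the example:

    import numpy as np

    def morans_i(x, w):
        # x: values at n locations; w: n-by-n spatial weights (w[i, j] > 0 for neighbours)
        x = np.asarray(x, dtype=float)
        z = x - x.mean()
        n, s0 = len(x), w.sum()
        return (n / s0) * (z @ w @ z) / (z @ z)

    # Four locations on a line, each weighted to its immediate neighbours
    w = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    print(morans_i([1.0, 1.2, 3.1, 3.0], w))   # positive: similar values cluster together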

104 Learning and decision trees Used in data mining and machine learning, a decision tree is a predictive model which maps observations about an item to conclusions about the item's target value. More descriptive names for such tree models are classification trees or regression trees. In these tree structures, leaves represent classifications and branches represent conjunctions of features that lead to those classifications.

105 General Questions/Problems
Learning phase: What or whom to learn from? What to learn? How to represent and store the learned knowledge? What learning method to use? Application phase: How to detect applicable knowledge? How to apply knowledge? How to detect and deal with misleading knowledge? How to combine knowledge from different sources? Which concepts of similarity are helpful?

106 Learning Learning is essential for unknown environments,
i.e., when the designer lacks omniscience. Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it all down. Learning modifies the agent's decision mechanisms to improve performance.

107 Learning element Design of a learning element is affected by
Which components of the performance element are to be learned, what feedback is available to learn these components, and what representation is used for the components. Types of feedback: Supervised learning: correct answers for each example. Unsupervised learning: correct answers not given. Reinforcement learning: occasional rewards.

108 Learning type Unsupervised learning: learn from own experiences
Teacher(s) provide knowledge. Teacher provides evaluation of own experiences. Teacher can be observed.

109 What learning method to use
Learning by heart (of prototypical cases) Decision-tree based learning (ID3, C4.5) Reinforcement learning Evolutionary methods, like Classifier systems Neural Networks ...

110 Inductive learning Simplest form: learn a function from examples
f is the target function. An example is a pair (x, f(x)). Problem: find a hypothesis h such that h ≈ f, given a training set of examples. (This is a highly simplified model of real learning: it ignores prior knowledge and assumes the examples are given.)

111 Inductive learning method
Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting:

115 Inductive learning method
Construct/adjust h to agree with f on training set (h is consistent if it agrees with f on all examples) E.g., curve fitting: Ockham’s razor: prefer the simplest hypothesis consistent with data
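Ockham's razor can be illustrated with polynomial curve fitting of increasing degree on a small training set; the data are simulated and the exact numbers will vary with the seed, but typically the higher-degree hypotheses track the noise and generalize worse:

    import numpy as np

    rng = np.random.default_rng(11)
    x = np.linspace(0, 1, 12)                       # small training set of (x, f(x)) pairs
    y = 1 + 2 * x + rng.normal(scale=0.1, size=x.size)

    x_test = np.linspace(0, 1, 100)
    y_test = 1 + 2 * x_test                          # the underlying target function f

    for degree in (1, 3, 9):
        h = np.poly1d(np.polyfit(x, y, degree))     # hypothesis fitted to the training set
        train_err = np.mean((h(x) - y) ** 2)
        test_err = np.mean((h(x_test) - y_test) ** 2)
        print(degree, round(train_err, 4), round(test_err, 4))
    # higher-degree curves agree more closely with the training examples
    # but usually generalize worse: prefer the simplest consistent hypothesis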

116 Example: Learning decision trees
Problem: decide whether to wait for a table at a restaurant, based on the following attributes: Alternate: is there an alternative restaurant nearby? Bar: is there a comfortable bar area to wait in? Fri/Sat: is today Friday or Saturday? Hungry: are we hungry? Patrons: number of people in the restaurant (None, Some, Full) Price: price range ($, $$, $$$) Raining: is it raining outside? Reservation: have we made a reservation? Type: kind of restaurant (French, Italian, Thai, Burger) WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

117 Decision trees One possible representation for hypotheses
E.g., here is the “true” tree for deciding whether to wait:

118 Hypothesis spaces How many distinct decision trees are there with n Boolean attributes? = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n). E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 trees.

119 Decision tree learning
Aim: find a small tree consistent with the training examples Idea: (recursively) choose "most significant" attribute as root of (sub)tree

120 Choosing an attribute Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative" Patrons? is a better choice

121 Using information theory
To implement Choose-Attribute in the DTL algorithm. Information content (entropy): I(P(v1), ..., P(vn)) = Σi -P(vi) log2 P(vi). For a training set containing p positive examples and n negative examples: I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n)).
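Using the entropy formula to implement Choose-Attribute amounts to computing the information gain of each candidate split; the sketch below uses the positive/negative counts from the restaurant example (6 positive, 6 negative examples):

    import numpy as np

    def info(p, n):
        # I(p/(p+n), n/(p+n)) in bits: information needed to classify an example at this node
        probs = np.array([p, n], dtype=float) / (p + n)
        probs = probs[probs > 0]
        return -(probs * np.log2(probs)).sum()

    def gain(splits, p, n):
        # splits: list of (p_i, n_i) counts in each subset produced by an attribute
        remainder = sum((pi + ni) / (p + n) * info(pi, ni) for pi, ni in splits)
        return info(p, n) - remainder

    # Patrons? splits the 12 examples into None (0+, 2-), Some (4+, 0-), Full (2+, 4-)
    # Type? splits them into French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-), Burger (2+, 2-)
    print(gain([(0, 2), (4, 0), (2, 4)], 6, 6))             # about 0.54 bits
    print(gain([(1, 1), (1, 1), (2, 2), (2, 2)], 6, 6))     # 0 bits, so Patrons? is better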

122 Performance measurement
How do we know that h ≈ f ? Use theorems of computational/statistical learning theory Try h on a new test set of examples (use same distribution over example space as training set) Learning curve = % correct on test set as a function of training set size

123 Summary Learning needed for unknown environments, lazy designers
Learning agent = performance element + learning element For supervised learning, the aim is to find a simple hypothesis approximately consistent with training examples Decision tree learning using information gain Learning performance = prediction accuracy measured on test set

124 Books worth reading
Applied Survival Analysis, by David W. Hosmer Jr., Stanley Lemeshow, and Sunny Kim (John Wiley & Sons, 2002). Statistical Models and Methods for Lifetime Data (Wiley Series in Probability and Statistics), by Jerald F. Lawless. Applied Multivariate Statistical Analysis, by Wolfgang Hardle and Leopold Simar (Springer Verlag, 2007). Introduction to Statistical Decision Theory, by John W. Pratt, Howard Raiffa, and Robert Schlaifer (MIT Press, 2008). Categorical Data Analysis (Wiley Series in Probability and Statistics), by Alan Agresti.

125 Pierre de Fermat Pierre de Fermat (August 17, 1601 – January 12, 1665) was a French lawyer at the Parlement of Toulouse, southwestern France, and a mathematician who is given credit for his contribution towards the development of modern calculus. With his insightful theorems Fermat created the modern theory of numbers. The depth of his work can be gauged by the fact that many of his results were not proved for over a century after his death, and one of them, the Last Theorem, took more than three centuries to prove. By the time he was 30, Pierre was a civil servant whose job was to act as a link between petitioners from Toulouse and the King of France, and as an enforcer of royal decrees from the King to the local people. Evidence suggests he was considerate and merciful in his duties. Since he was also required to act as an appeal judge in important local cases, he did everything he could to be impartial. To avoid socializing with those who might one day appear before him in court, he became involved in mathematics and spent as much free time as he could in its study. He was so skilled in the subject that he could be called a professional amateur. He was mostly isolated from other mathematicians, though he wrote regularly to two English mathematicians, Digby and Wallis. He also corresponded with the French mathematician Father Mersenne, who was trying to increase discussion and the exchange of ideas among French mathematicians. One of these was Blaise Pascal who, with Fermat, established a new branch of mathematics - probability theory. Besides probability theory, Fermat also helped lay the foundations for calculus, an area of mathematics that calculates the rate of change of one quantity in relation to another, for example velocity and acceleration. In particular, he is the precursor of differential calculus with his method of finding the greatest and the smallest ordinates of curved lines, analogous to that of the then unknown differential calculus. Fermat himself was secretive and, since he rarely wrote complete proofs or explanations of how he got his answers, was mischievously frustrating for others to understand. He loved to announce in letters that he had just solved a problem in mathematics but then refused to disclose the solution, leaving it for others to figure out. Fermat's passion in mathematics was in yet another branch - number theory, the relationships among numbers. While he was studying an ancient number puzzle book, he came up with a puzzle of his own that has been called Fermat's Enigma. Mathematicians worked for over three centuries to find its answer, but no one succeeded until Andrew Wiles, an English mathematician, created a proof and published it 330 years after Fermat's death in 1665. Although he carefully studied and drew inspiration from Diophantus, Fermat inaugurated a different tradition. Diophantus was content to find a single solution to his equations, even if it was a fraction. Fermat was only interested in integer solutions to his diophantine equations, and he looked for all solutions of the equation. He also proved that certain equations had no solution, an activity which baffled his contemporaries. He studied Pell's equation, and Fermat numbers, perfect numbers, and amicable numbers. It was while researching perfect numbers that he discovered Fermat's little theorem. He created the principle of infinite descent and Fermat's factorization method.
He created the two-square theorem, and the polygonal number theorem, which states that each number is a sum of 3 triangular numbers, 4 square numbers, 5 pentagonal numbers, ... He was the first to evaluate the integral of general power functions. Using an ingenious trick, he was able to reduce this evaluation to summing geometric series. The formula that resulted was a key hint to Newton and Leibniz when they independently developed the fundamental theorems of calculus. Although Fermat claimed to be able to prove all his arithmetical results, few of his proofs (if he had them) have survived. And considering that some of the results are so difficult (especially considering the mathematical tools at his disposal) many, including Gauss, believe that Fermat was unable to do so. Together with René Descartes, Fermat was one of the two leading mathematicians of the first half of the 17th century. Independently of Descartes, he discovered the fundamental principle of analytic geometry. Through his correspondence with Blaise Pascal, he was a co-founder of the theory of probability.

126 Blaise Pascal (June 19, 1623 – August 19, 1662)
Blaise Pascal (June 19, 1623 – August 19, 1662) was a French mathematician, physicist, and religious philosopher. Pascal was a child prodigy, who was educated by his father. Pascal's earliest work was in the natural and applied sciences, where he made important contributions to the construction of mechanical calculators and the study of fluids, and clarified the concepts of pressure and vacuum by expanding the work of Evangelista Torricelli. Pascal also wrote powerfully in defense of the scientific method. He was a mathematician of the first order. Pascal helped create two major new areas of research. He wrote a significant treatise on the subject of projective geometry at the age of sixteen and corresponded with Pierre de Fermat from 1654 on probability theory, strongly influencing the development of modern economics and social science. Following a mystical experience in late 1654, he left mathematics and physics and devoted himself to reflection and writing about philosophy and theology. His two most famous works date from this period: the Lettres provinciales and the Pensées. However, he had suffered from ill-health throughout his life and his new interests were ended by his early death two months after his 39th birthday.

127 Jakob Bernoulli Jakob Bernoulli (Basel, Switzerland, December 27, 1654 – August 16, 1705), also known as Jacob, Jacques or James Bernoulli, was a Swiss mathematician and scientist and the older brother of Johann Bernoulli. While travelling in England in 1676, Jakob Bernoulli met Robert Boyle and Robert Hooke. This contact inspired Jakob to devote his life to science and mathematics. He was appointed Lecturer at the University of Basel in 1682 and, in 1687, was promoted to Professor of Mathematics. He became familiar with calculus through a correspondence with Gottfried Leibniz, then collaborated with his brother Johann on various applications, notably publishing papers on transcendental curves (1696) and isoperimetry (1700, 1701). His masterwork was Ars Conjectandi (The Art of Conjecture), a groundbreaking work on probability theory. It was published in 1713, eight years after his death, by his nephew Nicholas. The terms Bernoulli trial and Bernoulli numbers result from this work. Bernoulli crater, on the Moon, is named after him jointly with his brother Johann.

128 Abraham de Moivre Abraham de Moivre (May 26, 1667, Vitry-le-François, Champagne, France – November 27, 1754, London, England) was a French mathematician famous for de Moivre's formula, which links complex numbers and trigonometry, and for his work on the normal distribution and probability theory. He was elected a Fellow of the Royal Society in 1697, and was a friend of Isaac Newton, Edmund Halley, and James Stirling. The social status of his family is unclear, but de Moivre's father, a surgeon, was able to send him to the Protestant academy at Sedan. De Moivre studied logic at Saumur, attended the Collège de Harcourt in Paris (1684), and studied privately with Jacques Ozanam. It does not appear that de Moivre received a college degree. De Moivre was a Calvinist. He left France after the revocation of the Edict of Nantes (1685) and spent the remainder of his life in England. Throughout his life he remained poor. It is reported that he was a regular customer of Slaughter's Coffee House, St Martin's Lane at Cranbourn Street, where he earned a little money from playing chess. He died in London and was buried at St Martin-in-the-Fields, although his body was later moved. De Moivre wrote a book on probability theory, entitled The Doctrine of Chances.

