Lecture 1: Data Science & Data Engineering CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier.

Presentation transcript:

Lecture 1: Data Science & Data Engineering CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier

Homework 1 Data Structures and Basic Programming Due: September 1st at the beginning of class. This assignment must be completed individually. It is worth 25% of your homework grade for the semester, and should help you judge if you are prepared for the course.

Homework 2 Presentations on Biomedical Data Science Due: September 10th during class. You will be divided into five groups, one at NG Xetron, four at UC. Each group will read an assigned article and prepare a 10-minute presentation on the topic for the rest of the class.

The Big News about Sanders

Data Science and Engineering

From the Information Age to the Data Age

What is Data Science?

What is Data Engineering?

Drew Conway’s Venn Diagram of Data Science

The Foundations of Data Science: Statistics, Computer Science, Domain Expertise

Doing Data Science

Back to Bernie and Clinton…

Problems with Anecdotal Data: small number of observations, selection bias, confirmation bias, inaccuracy

Some Basic Definitions Population – the set of objects or units to be measured.

Some Basic Definitions Population – the set of objects or units to be measured. Observations – extracted or measured characteristics about the objects.

Some Basic Definitions Population – the set of objects or units to be measured. Sample – the subset of objects examined in order to draw conclusions and make inferences about the population.

Example Let’s say we want to infer information about the quality of students admitted to UC. Define the population, a single observation, and a sample.

Example Let’s say we want to infer information about the quality of students admitted to UC. How might we introduce biases into the data? What might the consequences be?

Estimating the email generated by employees. Bearcats Health Insurance Inc has hired you to help them understand their email traffic. They have 5,000 employees, and it is infeasible to capture all mailing records. They have asked you to evaluate a possible method for sampling: – Select 10% of their employees at random, and sample all email they have ever sent.

Estimating the email generated by employees. Bearcats Health Insurance Inc has hired you to help them understand their email traffic. They have 5,000 employees, and it is infeasible to capture all mailing records. They have asked you to evaluate a possible method for sampling: – Select 10% of all email sent during the day at random.
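The two sampling proposals can be compared on simulated data. The sketch below is illustrative only: the heavy-tailed message volumes, the function names, and the seed are assumptions, not anything from the lecture, and the simulation ignores the time window ("ever sent" versus "during the day").

```python
import random

def simulate_mailboxes(num_employees, rng):
    """Hypothetical data: employee id -> list of message ids."""
    mailboxes = {}
    next_id = 0
    for emp in range(num_employees):
        # Assumption: a few employees send far more mail than most.
        volume = int(rng.paretovariate(1.5) * 10)
        mailboxes[emp] = list(range(next_id, next_id + volume))
        next_id += volume
    return mailboxes

def sample_by_employee(mailboxes, fraction, rng):
    """Method 1: pick a fraction of employees, keep every message they sent."""
    k = max(1, int(len(mailboxes) * fraction))
    chosen = rng.sample(sorted(mailboxes), k)
    return [m for emp in chosen for m in mailboxes[emp]]

def sample_by_message(mailboxes, fraction, rng):
    """Method 2: pick a fraction of all messages, regardless of sender."""
    all_messages = [m for msgs in mailboxes.values() for m in msgs]
    k = max(1, int(len(all_messages) * fraction))
    return rng.sample(all_messages, k)

rng = random.Random(0)
mailboxes = simulate_mailboxes(5000, rng)
total = sum(len(msgs) for msgs in mailboxes.values())
by_emp = sample_by_employee(mailboxes, 0.10, rng)
by_msg = sample_by_message(mailboxes, 0.10, rng)
print(total, len(by_emp), len(by_msg))
```

Under these assumptions the message-level sample always contains 10% of the messages, while the size of the employee-level sample depends on which senders happen to be chosen, which is one way the choice of sampling unit can bias an estimate.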

But this is the age of BIG DATA! Why not just sample every message?

Measurements: Measurements have inherent assumptions. Measurements are often stated very informally. – Formalize our measures!

Measurements Measure theory is a bit like grammar, many people communicate clearly without worrying about all the details, but the details do exist and for good reasons. - Maya Gupta, University of Washington

The Problem of Measures Physical intuition for the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points. Let’s take two bodies on the real number line – Body A is the line A = [0, 1] – Body B is the line B = [0, 2] Which is “longer”?

The Problem of Measures Physical intuition for the measure of length: given a body E, the measure of this body, m(E), might be the sum of its components, or points. Let’s take two bodies on the natural number line – Body A is the line A = [0, 1] – Body B is the line B = [0, 2] Which is “longer”?
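One way to see why the two versions of this question differ is to compute both candidate measures. This is a minimal sketch with hypothetical helper names, assuming the naturals start at 0: on the natural numbers, counting points works as a measure, while on the real line both intervals contain uncountably many points of zero length each, so length has to be defined directly rather than as a sum over points.

```python
def interval_length(a, b):
    """Length of the real interval [a, b]; assumes a <= b."""
    return b - a

def count_naturals(a, b):
    """Number of naturals (0, 1, 2, ...) in [a, b]; assumes integers 0 <= a <= b."""
    return b - a + 1

# Real line: B is twice as long as A, even though both have the same cardinality.
print(interval_length(0, 1), interval_length(0, 2))  # 1 2

# Natural numbers: counting points gives A = {0, 1} and B = {0, 1, 2}.
print(count_naturals(0, 1), count_naturals(0, 2))  # 2 3
```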

Solving the Problem of Measures What does it mean for some body (or subset) to be measurable? If a set E is measurable, how does one define its measure? What properties or axioms does measure (or the concept of measurability) obey?

Measure Theory Before we can measure anything we need something to measure! Let’s define a measurable space – A measurable space is a collection of events B, and the set of all outcomes, Ω, also called the sample space.

Events and Sample Spaces Each event, F, is a set containing zero or more outcomes. – Each outcome can be viewed as a realization of an event. The real world can be viewed as a player in a game that makes some move: – All events in B that contain the selected outcome are said to “have occurred”.

Events and Sample Space Take a deck of 52 cards + 2 jokers. Draw a single card from the deck. Sample space: 54 element set, each card is a possible outcome. An event is any subset of the sample space, including a singleton set, or the empty set.

Events and Sample Space Potential events: – “Red and black at the same time without being a joker” – (0 elements) – “The 5 of hearts” – (1 element) – “A king” – (4 elements) – “A face card” – (12 elements) – “A card” – (54 elements)
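The card events above can be written as Python sets and counted. This is a sketch only: the card encoding (rank, suit) and the joker labels are assumptions made for illustration.

```python
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]

# Sample space: 52 ordinary cards plus 2 jokers (hypothetical labels).
deck = set(product(RANKS, SUITS)) | {("joker", 1), ("joker", 2)}

red = {c for c in deck if c[1] in ("hearts", "diamonds")}
black = {c for c in deck if c[1] in ("clubs", "spades")}

# Each event is a subset of the sample space.
events = {
    "red and black": red & black,
    "five of hearts": {("5", "hearts")},
    "king": {c for c in deck if c[0] == "K"},
    "face card": {c for c in deck if c[0] in ("J", "Q", "K")},
    "a card": deck,
}

for name, event in events.items():
    print(name, len(event))

# Drawing one outcome: every event containing it "has occurred".
drawn = ("K", "hearts")
occurred = [name for name, event in events.items() if drawn in event]
print(occurred)  # ['king', 'face card', 'a card']
```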

Forming an Algebra on B and Ω In order to define measures on B, we need to make sure it has certain properties, those of a σ-algebra. A σ-algebra is a special kind of collection of subsets that is closed under countable-fold set operations (complement, union of countably many sets, and intersection of countably many sets). “Vanilla” algebras are closed only under finite set operations.

Countable Sets Countable sets are those that are finite or have the same cardinality as the natural numbers. Quick refresher: Prove that the cardinalities of the integers and the natural numbers are the same.
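One standard way to answer the refresher is to exhibit an explicit bijection that interleaves the non-negative and negative integers. The sketch below assumes the naturals start at 0; the function names are illustrative.

```python
def int_to_nat(z):
    """Map 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ..."""
    return 2 * z if z >= 0 else -2 * z - 1

def nat_to_int(n):
    """Inverse map: even naturals go to non-negative integers, odd to negative."""
    return n // 2 if n % 2 == 0 else -(n + 1) // 2

# Check on a finite window that the two maps undo each other.
window = range(-50, 51)
assert all(nat_to_int(int_to_nat(z)) == z for z in window)

# The window of 101 integers lands exactly on the naturals 0..100.
assert sorted(int_to_nat(z) for z in window) == list(range(101))
```

A finite check is not a proof, but the two formulas are inverse to each other for every integer, which is what establishes equal cardinality.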

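σ-algebra If we have a σ-algebra B on our sample space Ω, then: – The sample space is in B: $\Omega \in B$ – B is closed under complement: if $A \in B$, then $A^c = \Omega \setminus A \in B$ – B is closed under countable unions: if $A_1, A_2, \ldots \in B$, then $\bigcup_{i=1}^{\infty} A_i \in B$ (closure under countable intersections follows from De Morgan’s laws)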
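On a finite sample space the closure properties of a σ-algebra can be checked by brute force, since countable unions reduce to finite ones. The following sketch builds the smallest σ-algebra containing some generating sets; the function name and the example sets are assumptions for illustration.

```python
from itertools import combinations

def generate_sigma_algebra(omega, generators):
    """Smallest collection containing the generators that is closed
    under complement and union (finite sample space only)."""
    omega = frozenset(omega)
    sets = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        current = list(sets)
        # Close under complement.
        for a in current:
            comp = omega - a
            if comp not in sets:
                sets.add(comp)
                changed = True
        # Close under pairwise union (enough on a finite space).
        for a, b in combinations(current, 2):
            union = a | b
            if union not in sets:
                sets.add(union)
                changed = True
    return sets

sigma = generate_sigma_algebra({1, 2, 3, 4}, [{1}, {2}])
print(len(sigma))  # 8
for s in sorted(sigma, key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
```

Here the outcomes 3 and 4 are never separated by a generator, so {3, 4} behaves as a single atom and the result has 2^3 = 8 sets rather than all 16 subsets.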

Measures A measure µ takes a set A from a measurable collection of sets B and returns the measure of A, which is some nonnegative real number (or infinity). Formally: $\mu : B \to [0, \infty]$

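Example Measure Let’s define a measure of “Volume”. The triple $(\Omega, B, \mu)$ combines a measurable space and a measure; the triple is called a measure space. The measure is defined by two properties: – Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$ – Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$: $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$

Example Measure Does the ordinary concept of volume satisfy these two properties? – Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$ – Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$: $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$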
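A discretized version of volume makes the two properties easy to test. In this sketch a body is a set of unit cubes on an integer grid and its volume is the number of cubes; the grid model and the helper names are assumptions made for illustration, not the lecture's definition.

```python
def box(x0, x1, y0, y1, z0, z1):
    """Set of unit cubes (x, y, z) filling the half-open box [x0,x1) x [y0,y1) x [z0,z1)."""
    return {(x, y, z)
            for x in range(x0, x1)
            for y in range(y0, y1)
            for z in range(z0, z1)}

def volume(body):
    """Volume of a body made of unit cubes; never negative."""
    return len(body)

a = box(0, 2, 0, 2, 0, 2)  # 8 cubes
b = box(2, 5, 0, 1, 0, 1)  # 3 cubes, disjoint from a
c = box(1, 3, 1, 3, 1, 3)  # 8 cubes, shares one cube with a

# Disjoint bodies: volume of the union equals the sum of the volumes.
print(a.isdisjoint(b), volume(a | b), volume(a) + volume(b))  # True 11 11

# Overlapping bodies: additivity is not required, and indeed fails.
print(a.isdisjoint(c), volume(a | c), volume(a) + volume(c))  # False 15 16
```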

Two Special Kinds of Measures Signed measure – can be negative. Probability measure – defined over a probability space $(\Omega, B, P)$. – A probability measure, P, has the normal properties of a measure, but it is also normalized such that: $P(\Omega) = 1$
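As a small illustration of normalization, the sketch below defines a uniform probability measure on a 54-outcome sample space, such as a single draw from a deck with two jokers. The uniform assumption, the integer labels, and the function name are all hypothetical.

```python
from fractions import Fraction

# Assumption: 54 equally likely outcomes, labelled 0..53.
OMEGA = frozenset(range(54))

def prob(event):
    """Uniform probability measure: P(A) = |A| / |Omega|."""
    event = frozenset(event)
    if not event <= OMEGA:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(OMEGA))

kings = {0, 1, 2, 3}   # hypothetical labels for the four kings
jokers = {52, 53}      # hypothetical labels for the two jokers

print(prob(OMEGA))     # 1
print(prob(set()))     # 0
print(prob(kings))     # 2/27

# Additivity on disjoint events still holds, as for any measure.
print(prob(kings | jokers) == prob(kings) + prob(jokers))  # True
```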

Sets of Measure Zero A set of measure zero is some set $A \in B$ with $\mu(A) = 0$. For a probability measure, any set of measure zero has probability zero, so it almost surely does not occur. – It can thus be ignored when stating things about the collection of sets B.

Borel Sets A common σ-algebra is the Borel σ-algebra. A Borel set is an element of a Borel σ-algebra. – Almost any set you can describe on the real line is a Borel set, for example, the unit line segment [0,1], the irrational numbers, etc. – The Borel σ-algebra on the real line is a collection of sets that is the smallest σ-algebra that includes the open subsets of the real line.

Borel Sets For some space X, the collection of all Borel sets on X forms a σ-algebra known as the Borel algebra (or Borel σ-algebra) on X. Important! Why? Any measure defined on the open sets of a space, or the closed sets of a space, must also be defined on all Borel sets of that space.

Borel Sets Borel sets are powerful because if you know what a probability measure does on every interval, then you know what it does on all the Borel sets. Allows us to define equivalence of measures.

Borel Sets Let’s say we have two measures, say $\mu_1$ and $\mu_2$. To show they are equivalent we just need to show that: – They are equivalent on all intervals. By definition they are then equivalent for all Borel sets, and hence over the measurable space. Example: Given probability distributions A and B with equivalent cumulative distribution functions, the probability distributions must also be equal.
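The link between cumulative distribution functions and intervals is that the probability of a half-open interval (a, b] is F(b) - F(a). The sketch below checks this on two descriptions of a fair die; the distributions and names are hypothetical, and a finite check only illustrates the argument rather than proving it.

```python
from fractions import Fraction

def cdf(pmf, x):
    """F(x) = P(X <= x) for a discrete distribution given as value -> probability."""
    return sum(p for value, p in pmf.items() if value <= x)

def interval_prob(pmf, a, b):
    """P(a < X <= b), computed only from the CDF."""
    return cdf(pmf, b) - cdf(pmf, a)

# Two descriptions of the same distribution (assumed for illustration).
die_a = {k: Fraction(1, 6) for k in range(1, 7)}
die_b = {k: Fraction(2, 12) for k in reversed(range(1, 7))}

points = range(0, 8)

# Equal CDFs at every point ...
same_cdf = all(cdf(die_a, x) == cdf(die_b, x) for x in points)

# ... force equal probabilities on every interval (a, b].
same_intervals = all(
    interval_prob(die_a, a, b) == interval_prob(die_b, a, b)
    for a in points for b in points if a < b
)

print(same_cdf, same_intervals)       # True True
print(interval_prob(die_a, 2, 4))     # 1/3
```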

Measure Theory and Data Science Data Science is about working with, and deriving observations or features from data. Features are effectively measures of some sort, but often not for the underlying space of interest. Important to realize the limitations of measurable spaces for metrics of interest, and what can and cannot be measured.

Example Bearcats Elementary School had 300 students in their 5th grade class. 77% of them graduated to middle school. 12% failed their mathematics Standards of Learning, 11% failed their reading Standards of Learning. The new class of 1st graders had interventions in mathematics and grammar; their graduation rates improved to 88%, with 7% failing mathematics, and 5% failing reading. What can we infer? How does measure theory relate?

Measure Theory: Further Reading M. Capinski and E. Kopp, “Measure, Integral, and Probability”, Springer Undergraduate Mathematics Series, 2004 S. I. Resnick, “A probability path”, Birkhauser, A. Gut, “Probability: A Graduate Course”, Springer, R. M. Gray, “Entropy and Information Theory”, Springer Verlag (available free online), 1990.

The Data Science Pipeline: Metric identification, Data collection, Data exploration and summary statistics, Feature generation, Feature importance testing, Modeling, Validation

Automating the Data Pipeline Drake – Like make for data.

For next time Homework 1 Due this Tuesday!!!