Published by Everett Bradford. Modified over 9 years ago.
1
Lecture 2: Measures and Data Collection/Cleaning CS 6071 Big Data Engineering, Architecture, and Security Fall 2015, Dr. Rozier
2
Homework 2 Presentations on Biomedical Data Science Due: Next Week on Tuesday? Reorganizing Groups.
3
Measurements Measurements have inherent assumptions Measurements are often stated very informally – Formalize our measures!
4
Measurements Measure theory is a bit like grammar, many people communicate clearly without worrying about all the details, but the details do exist and for good reasons. - Maya Gupta, University of Washington
5
The Problem of Measures Physical intuition about length: given a body E, the measure of this body, m(E), might be the sum of its components, or points. Let’s take two bodies on the real number line – Body A is the line A = [0, 1] – Body B is the line B = [0, 2] Which is “longer”?
6
The Problem of Measures Physical intuition about length: given a body E, the measure of this body, m(E), might be the sum of its components, or points. Let’s take two bodies on the natural number line – Body A is the line A = [0, 1] – Body B is the line B = [0, 2] Which is “longer”?
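The contrast between the real-line and natural-number-line versions of the question can be sketched in a few lines. This is only an illustration: `length` and `count_naturals` are hypothetical helper names, and both bodies are assumed to be closed intervals with integer endpoints.

```python
def length(a, b):
    """Length of the closed interval [a, b] on the real line."""
    return b - a


def count_naturals(a, b):
    """Counting measure of [a, b] on the natural number line:
    the number of natural numbers n with a <= n <= b."""
    return len(range(a, b + 1))


A = (0, 1)
B = (0, 2)

# On the real line both bodies contain uncountably many points, so
# "summing the points" cannot tell them apart; length can.
real_lengths = (length(*A), length(*B))

# On the natural number line the bodies are finite sets, so counting
# their points is a perfectly good measure.
natural_counts = (count_naturals(*A), count_naturals(*B))

print("lengths on the real line:", real_lengths)
print("point counts on the natural number line:", natural_counts)
```

Both candidate measures agree that B is "longer" than A, but they assign different values, which is why the measure has to be stated explicitly.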
7
Solving the Problem of Measures What does it mean for some body (or subset) to be measurable? If a set E is measurable, how does one define its measure? What properties or axioms does measure (or the concept of measurability) obey?
8
Measure Theory Before we can measure anything we need something to measure! Let’s define a measurable space – A measurable space is a pair (Ω, B): the set of all outcomes, Ω, also called the sample space, together with a collection of events, B.
9
Events and Sample Spaces Each event, F, is a set containing zero or more outcomes. – Each outcome can be viewed as a realization of an event. The real world can be viewed as a player in a game that makes some move: – All events in B that contain the selected outcome are said to “have occurred”.
10
Events and Sample Space Take a deck of 52 cards + 2 jokers. Draw a single card from the deck. Sample space: a 54-element set; each card is a possible outcome. An event is any subset of the sample space, including a singleton set or the empty set.
11
Events and Sample Space Potential events: – “Red and black at the same time without being a joker” – (0 elements) – “The 5 of hearts” – (1 element) – “A king” – (4 elements) – “A face card” – (12 elements) – “A card” – (54 elements)
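The element counts listed above can be checked by building the sample space explicitly. This is a minimal sketch: the card encoding as `(rank, suit)` tuples and the names `omega`, `colors_of`, and `events` are assumptions made for illustration.

```python
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUIT_COLORS = {"hearts": "red", "diamonds": "red",
               "clubs": "black", "spades": "black"}

# Sample space: 13 ranks x 4 suits, plus 2 jokers.
standard_cards = set(product(RANKS, SUIT_COLORS))
jokers = {("joker", 1), ("joker", 2)}
omega = standard_cards | jokers


def colors_of(card):
    """Set of colors a card has; jokers are treated as having none."""
    _, suit = card
    return {SUIT_COLORS[suit]} if suit in SUIT_COLORS else set()


# Each event is just a subset of the sample space.
events = {
    "red and black at the same time": {
        c for c in omega if {"red", "black"} <= colors_of(c)
    },
    "the 5 of hearts": {("5", "hearts")},
    "a king": {c for c in omega if c[0] == "K"},
    "a face card": {c for c in omega if c[0] in {"J", "Q", "K"}},
    "a card": set(omega),
}

sizes = {name: len(event) for name, event in events.items()}
for name, size in sizes.items():
    print(f"{name}: {size} element(s)")
```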
12
Forming an Algebra on B and Ω In order to define measures on B, we need to make sure it has certain properties, those of a σ-algebra. A σ-algebra is a special kind of collection of subsets that is closed under countable-fold set operations (complement, union of countably many sets, and intersection of countably many sets). “Vanilla” algebras are closed only under finite set operations.
13
Countable Sets Countably infinite sets are those with the same cardinality as the natural numbers (finite sets are also countable). Quick refresher: Prove that the integers and the natural numbers have the same cardinality.
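One standard way to approach the refresher is to write down an explicit bijection that interleaves the non-negative and negative integers. The sketch below assumes the natural numbers start at 0; `nat_to_int` and `int_to_nat` are hypothetical names, and checking a finite range illustrates the pairing rather than proving it.

```python
def nat_to_int(n):
    """Map 0, 1, 2, 3, 4, ... to 0, -1, 1, -2, 2, ..."""
    half = (n + 1) // 2
    return -half if n % 2 else half


def int_to_nat(z):
    """Inverse map: non-negative integers go to even naturals,
    negative integers go to odd naturals."""
    return 2 * z if z >= 0 else -2 * z - 1


first_values = [nat_to_int(n) for n in range(7)]
print(first_values)

# Spot-check that the two maps undo each other on a finite range.
round_trip_ok = all(int_to_nat(nat_to_int(n)) == n for n in range(1000))
print(round_trip_ok)
```

Because each map is the inverse of the other, every integer is paired with exactly one natural number, which is what "same cardinality" means.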
14
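σ-algebra If we have a σ-algebra B on our sample space Ω, then: – $\Omega \in B$ – If $A \in B$, then its complement $A^c = \Omega \setminus A \in B$ – If $A_1, A_2, \ldots \in B$, then $\bigcup_{i=1}^{\infty} A_i \in B$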
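On a finite sample space the closure properties of a σ-algebra can be checked mechanically. The sketch below assumes a finite Ω, where a countable union of sets from a finite collection reduces to a finite union, so closure under pairwise unions is enough. The function name `is_sigma_algebra` and the example collections are hypothetical.

```python
from itertools import combinations


def is_sigma_algebra(omega, collection):
    """Check the sigma-algebra axioms for a collection of subsets of a
    finite sample space omega."""
    omega = frozenset(omega)
    sets = {frozenset(s) for s in collection}

    # The whole sample space must be an event.
    if omega not in sets:
        return False
    # Closed under complement.
    if any(omega - s not in sets for s in sets):
        return False
    # Closed under union (pairwise closure suffices for a finite collection).
    # Closure under intersection then follows from De Morgan's laws.
    return all(a | b in sets for a, b in combinations(sets, 2))


omega = {1, 2, 3, 4}
trivial = [set(), omega]
coarse = [set(), {1, 2}, {3, 4}, omega]
broken = [set(), {1}, omega]  # missing the complement {2, 3, 4}

print(is_sigma_algebra(omega, trivial))
print(is_sigma_algebra(omega, coarse))
print(is_sigma_algebra(omega, broken))
```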
15
Measures A measure µ takes a set A from a measurable collection of sets B and returns the measure of A, which is some nonnegative real number (possibly infinite). Formally: $\mu : B \to [0, \infty]$
16
Example Measure Let’s define a measure of “Volume”. The triple $(\Omega, B, \mu)$ combines a measurable space and a measure, and is called a measure space. The measure must satisfy two properties: – Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$ – Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$, that is, $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$
17
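Example Measure Does the ordinary concept of volume satisfy these two properties? – Nonnegativity: $\mu(A) \ge 0$ for all $A \in B$ – Countable additivity: if $A_i \in B$ are disjoint sets for i = 1, 2, …, then the measure of the union of the $A_i$ is equal to the sum of the measures of the $A_i$, that is, $\mu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \mu(A_i)$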
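A small worked case suggests the answer is yes for simple bodies. The example below is an illustration chosen here, not taken from the lecture: it cuts a unit-width box into countably many disjoint slabs $E_i$ and checks that ordinary volume adds up as countable additivity requires.

```latex
\[
  E_i = \bigl[\,1 - 2^{-(i-1)},\; 1 - 2^{-i}\bigr) \times [0,1]^2,
  \qquad i = 1, 2, \ldots
\]
\[
  \mu(E_i) = 2^{-i} \ge 0,
  \qquad
  E_i \cap E_j = \emptyset \quad (i \ne j),
  \qquad
  \bigcup_{i=1}^{\infty} E_i = [0,1) \times [0,1]^2 .
\]
\[
  \mu\Bigl(\bigcup_{i=1}^{\infty} E_i\Bigr)
  = \mu\bigl([0,1) \times [0,1]^2\bigr)
  = 1
  = \sum_{i=1}^{\infty} 2^{-i}
  = \sum_{i=1}^{\infty} \mu(E_i).
\]
```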
18
Two Special Kinds of Measures Signed measure – can be negative Probability measure – defined over a probability space $(\Omega, B, P)$. – A probability measure, P, has the normal properties of a measure, but it is also normalized such that: $P(\Omega) = 1$
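For the 54-card deck from the earlier slides, the uniform probability measure is an easy concrete case. This is a sketch under stated assumptions: cards are labelled 0 to 53, and the labels chosen for the kings and the 5 of hearts are arbitrary placeholders.

```python
from fractions import Fraction

# Sample space: 54 cards, labelled 0..53 (labelling is an assumption).
omega = frozenset(range(54))


def P(event):
    """Uniform probability measure: |event| / |omega|."""
    event = frozenset(event)
    if not event <= omega:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(omega))


kings = {0, 1, 2, 3}    # placeholder labels for the four kings
five_of_hearts = {4}    # placeholder label for the 5 of hearts

print(P(omega))                   # normalization
print(P(set()))                   # the empty event
print(P(kings))
print(P(kings | five_of_hearts))  # disjoint events: probabilities add
```

Exact fractions are used so that the additivity check is not blurred by floating-point rounding.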
19
Sets of Measure Zero A set of measure zero is some set $A \in B$ with $\mu(A) = 0$. For a probability measure, any set of measure zero has probability zero, so it almost surely does not occur. – It can thus be ignored when stating things about the collection of sets B.
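Two familiar examples, using length on the unit interval as the measure, show how such sets arise. They are illustrations added here and rely only on monotonicity and countable additivity.

```latex
% A single point x in [0,1] fits inside an arbitrarily short interval:
\[
  \{x\} \subseteq [x - \varepsilon,\, x + \varepsilon]
  \;\Longrightarrow\;
  0 \le \mu(\{x\}) \le 2\varepsilon
  \quad \text{for every } \varepsilon > 0,
  \qquad \text{so } \mu(\{x\}) = 0 .
\]
% The rationals in [0,1] are a countable union of single points q_1, q_2, ...:
\[
  \mu\bigl(\mathbb{Q} \cap [0,1]\bigr)
  = \mu\Bigl(\bigcup_{i=1}^{\infty} \{q_i\}\Bigr)
  = \sum_{i=1}^{\infty} \mu(\{q_i\})
  = 0 .
\]
```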
20
Borel Sets A common σ-algebra is the Borel σ-algebra. A Borel set is an element of a Borel σ-algebra. – Almost any set you can describe on the real line is a Borel set, for example, the unit line segment [0,1], the set of irrational numbers, etc. – The Borel σ-algebra on the real line is the smallest σ-algebra that includes the open subsets of the real line.
21
Borel Sets For some space X, the collection of all Borel sets on X forms a σ-algebra known as the Borel algebra (or Borel σ-algebra) on X. Important! Why? Any measure defined on the open sets of a space, or on the closed sets of a space, must also be defined on all Borel sets of that space.
22
Borel Sets Borel sets are powerful because if you know what a probability measure does on every interval, then you know what it does on all the Borel sets. Allows us to define equivalence of measures.
23
Borel Sets Let’s say we have two probability measures on the same measurable space. To show they are equivalent we just need to show that: – They are equivalent on all intervals Because the intervals generate the Borel σ-algebra, it follows that they are equivalent for all Borel sets, and hence over the measurable space. Example: Given probability distributions A and B with equivalent cumulative distribution functions, the probability distributions must also be equal.
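The cumulative distribution function example can be made concrete for a discrete distribution, where the CDF fixes the probability of every interval (a, b] and therefore of every outcome. The sketch below is an assumption-laden illustration: the three-point distribution and the helper names `cdf`, `pmf_from_cdf`, and `interval_probability` are made up for the example.

```python
from fractions import Fraction
from itertools import accumulate


def cdf(pmf):
    """Cumulative distribution function F(x) = P(X <= x) on the support."""
    xs = sorted(pmf)
    return dict(zip(xs, accumulate(pmf[x] for x in xs)))


def pmf_from_cdf(F):
    """Recover P(X = x) as the jump of the CDF at x."""
    recovered = {}
    previous = Fraction(0)
    for x in sorted(F):
        recovered[x] = F[x] - previous
        previous = F[x]
    return recovered


def interval_probability(F, a, b):
    """P(a < X <= b) = F(b) - F(a), for a and b in the support."""
    return F[b] - F[a]


pmf = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
F = cdf(pmf)

print(F)
print(interval_probability(F, 1, 3))
print(pmf_from_cdf(F) == pmf)
```

Since the probabilities can be rebuilt from the CDF alone, two distributions with the same CDF on this support must assign the same probability to every event.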
24
Measure Theory and Data Science Data Science is about working with data and deriving observations or features from it. Features are effectively measures of some sort, but often not for the underlying space of interest. It is important to realize the limitations of measurable spaces for metrics of interest, and what can and cannot be measured.
25
Example Bearcats Elementary School had 300 students in their 5th grade class. 77% of them graduated to middle school. 12% failed their mathematics Standards of Learning, 11% failed their reading Standards of Learning. The new class of 1st graders had interventions in mathematics and grammar, their graduation rates improved to 88%, with 7% failing mathematics, and 5% failing reading. What can we infer? How does measure theory relate?
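As a starting point for the discussion, the stated percentages can be turned into counts where the class size is known. The sketch below only restates the numbers from the example; the dictionary names are hypothetical, and the comments flag what the example does not tell us.

```python
n_5th_graders = 300  # class size given in the example

pct_5th = {"graduated": 77, "failed_math": 12, "failed_reading": 11}
pct_1st = {"graduated": 88, "failed_math": 7, "failed_reading": 5}

# Integer arithmetic avoids floating-point rounding in the counts.
counts_5th = {k: n_5th_graders * p // 100 for k, p in pct_5th.items()}

total_pct_5th = sum(pct_5th.values())
total_pct_1st = sum(pct_1st.values())

print(counts_5th)
print(total_pct_5th, total_pct_1st)

# Open questions the numbers alone cannot settle:
# - The size of the 1st grade class is not given, so only percentages,
#   not counts, are available for that group.
# - Adding the three percentages is only meaningful if the three outcomes
#   are disjoint; the example does not say whether a student can fail both
#   mathematics and reading.
# - The two percentages describe different populations (5th graders and
#   1st graders), so they are measures on different sample spaces.
```

Both sets of percentages sum to exactly 100, which is consistent with treating the three outcomes as disjoint and exhaustive, although the example does not state that assumption.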
26
Measure Theory: Further Reading – M. Capinski and E. Kopp, “Measure, Integral, and Probability”, Springer Undergraduate Mathematics Series, 2004. – S. I. Resnick, “A Probability Path”, Birkhäuser, 1999. – A. Gut, “Probability: A Graduate Course”, Springer, 2005. – R. M. Gray, “Entropy and Information Theory”, Springer-Verlag, 1990 (available free online).
27
The Data Science Pipeline Metric identification Data collection Data exploration and summary statistics Feature generation Feature importance testing Modeling Validation
28
Automating the Data Pipeline Drake – Like make for data.
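The "make for data" idea rests on one rule: a pipeline step is rerun only when its output is missing or older than its inputs. The sketch below illustrates that rule in plain Python. It is not Drake's syntax or implementation; the file names and the helpers `is_stale`, `build`, and `drop_blank_lines` are hypothetical.

```python
import os
import tempfile


def is_stale(target, inputs):
    """True if the target is missing or older than any of its inputs."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(path) > target_time for path in inputs)


def build(target, inputs, step):
    """Run the step only when the target is stale; report whether it ran."""
    if is_stale(target, inputs):
        step(inputs, target)
        return True
    return False


def drop_blank_lines(inputs, target):
    """Example cleaning step: copy the first input without blank lines."""
    with open(inputs[0]) as src, open(target, "w") as dst:
        dst.writelines(line for line in src if line.strip())


runs = []
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.csv")
    clean = os.path.join(tmp, "clean.csv")
    with open(raw, "w") as f:
        f.write("id,value\n\n1,10\n\n2,20\n")
    os.utime(raw, (1000, 1000))  # pretend the raw data is old

    runs.append(build(clean, [raw], drop_blank_lines))  # target missing
    runs.append(build(clean, [raw], drop_blank_lines))  # already up to date

    newer = os.path.getmtime(clean) + 10
    os.utime(raw, (newer, newer))  # simulate new raw data arriving
    runs.append(build(clean, [raw], drop_blank_lines))  # input changed

    with open(clean) as f:
        cleaned = f.read()

print(runs)
print(cleaned)
```

Timestamps are set explicitly with `os.utime` so the demonstration does not depend on how quickly the steps run.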
29
Getting your environments set for Data Science Over the next few weeks we will be introducing the projects and getting started with data science projects. Need to get the right tools installed!
30
Anaconda https://store.continuum.io/cshop/anaconda/ Grab the free distribution – Helps you maintain the appropriate Python distributions.
31
IPython/Jupyter Interactive Python with documentation features Installs easily with Anaconda – http://jupyter.readthedocs.org/en/latest/install.html
32
Markdown Markdown Syntax – http://daringfireball.net/projects/markdown/syntax Markdown Basics – http://daringfireball.net/projects/markdown/basics
33
Compute Lab Compute Server Minerva – Each group will get an account on Minerva with space and compute power for their project – Cloud-based Ubuntu server, similar to AWS, but private and secure.
34
For next time No homework this week, work on HWK 3 presentations Work with Jupyter examples on Minerva once accounts are set up. Learn Markdown Basics No class Thursday