1 Introduction Malathi Veeraraghavan Professor Charles L. Brown Dept. of Electrical and Computer Engineering University of Virginia



2 Outline
Increasing interest in data
Course: From Data to Knowledge
Summary

3 “The data deluge”
“Data, data everywhere”: Economist Special Issue, Feb. 27-Mar. 5, 2010
Walmart databases alone are estimated at more than 2.5 petabytes (a petabyte is 1 million gigabytes): 2010 numbers
From businesses to governments, data collection and analysis is rapidly becoming the next big thing.
2012: review/big-datas-impact-in-the-world.html?pagewanted=all

4 “The data deluge”
“A new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data.”
Hal Varian, Google’s chief economist, notes that “Data are widely available; what is scarce is the ability to extract wisdom from them.”

5 Business intelligence
Nestle sells more than 100,000 products in 200 countries using 550,000 suppliers
Problem: it was not using its huge buying power effectively
Used SAP software and analyzed its data
For just one ingredient, vanilla, its American operation reduced the number of specifications and used fewer suppliers, saving $30M per year
Annual savings from such operational improvements: $1 billion
Source: Economist special issue

6 Medical use
Dr. Carolyn McGregor from University of Ontario
Goal: spot fatal infections in premature babies
Monitors subtle changes in 7 streams of real-time data, such as heart rate and blood pressure
ECG alone takes 1000 readings/second
Infections are detected before obvious symptoms emerge
The naked eye cannot see them, but the computer can!
Who programs these? Stats experts.
Another term: Evidence-Based Medicine
Source: Economist special issue

7 Government usage
An add-on to a 1986 law required firms to disclose the harmful chemicals they release.
Once the public started tracking these numbers, American businesses reduced their emissions of the chemicals covered under the law by 40% by 2000.
Source: Economist special issue

8 Best-sellers
“Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart” by Ian Ayres
“Moneyball: The Art of Winning an Unfair Game” by Michael Lewis
“The Long Tail” by Chris Anderson
Malcolm Gladwell's books, e.g., “Outliers”
“Microtrends” by Mark Penn (elections)
“Freakonomics” by S. Dubner and S. Levitt

9 Moneyball example
2002 season: the richest team, the NY Yankees, had a payroll of $126 million, while the Oakland A’s had a payroll of less than a third of that, about $40 million. Yet the A’s had reached the playoffs three years in a row and took the Yankees close to elimination.
How did they do it?
Billy Beane, general manager of the Oakland A’s
–Respected statistics
–Hired Paul DePodesta, Harvard MBA, who applied Bill James’ formulas and selected players based on their statistics.
–Runs created = (Hits + Walks) × Total Bases / (At Bats + Walks)
–Jeremy Brown: the only player in the history of the SEC with 300 hits and 200 walks, but he was overweight
–Scouts vs. statisticians!
The tendency of everyone to generalize wildly from his own experience: most people think their own experience is typical!
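The runs-created formula on this slide is simple enough to compute directly. The sketch below is only an illustration: the function name and the sample season line are invented here, not taken from the slides or from any real player.

```python
def runs_created(hits, walks, total_bases, at_bats):
    """Runs created = (Hits + Walks) x Total Bases / (At Bats + Walks)."""
    opportunities = at_bats + walks
    if opportunities == 0:
        raise ValueError("at_bats + walks must be positive")
    return (hits + walks) * total_bases / opportunities


# Hypothetical season line, made up for illustration only.
example = runs_created(hits=150, walks=50, total_bases=250, at_bats=500)
print(round(example, 1))
```

Walks count in both the numerator and the denominator, which is why a patient hitter can score well on this measure.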

10 Malcolm Gladwell's “Outliers” hockey players story
Why Canadian hockey players born early in the year have a big advantage: the cutoff date was Jan. 1
ESPN conducted a little study of all the 2008 season NHL players who were born from 1980 to 1990. [Later disputed for 2011 players]
Sure enough: many more were born early in the year than late.
Number of players by birth month:
Jan. 51, Feb. 46, Mar. 61, Apr. 49, May 46, June 49
Jul. 36, Aug. 41, Sep. 36, Oct. 34, Nov. 33, Dec. 30
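One quick way to read the birth-month counts above is to compare the two halves of the year. This is a minimal sketch using only the counts on the slide; splitting at June/July is a choice made here for illustration, not something the slide specifies.

```python
# Birth-month counts copied from the slide (ESPN's 2008-season NHL sample).
births = {
    "Jan": 51, "Feb": 46, "Mar": 61, "Apr": 49, "May": 46, "Jun": 49,
    "Jul": 36, "Aug": 41, "Sep": 36, "Oct": 34, "Nov": 33, "Dec": 30,
}

months = list(births)  # dicts keep insertion order, so this is Jan..Dec
first_half = sum(births[m] for m in months[:6])
second_half = sum(births[m] for m in months[6:])
total = first_half + second_half

# Share of players born January through June; 0.5 would mean no skew.
share_first_half = first_half / total
print(first_half, second_half, round(share_first_half, 3))
```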

11 Examples from “The Long Tail”
Rhapsody, an online music store that had 1.5M tracks in Dec. 2005, reported that the number of downloads/month for even the 100,000th track was in the 1000s, whereas a Walmart store, the largest brick-and-mortar music retailer, stocks only 55,000 tracks.
Rhapsody reported that 40% of its total sales came from Long Tail products, i.e., those not available in retail stores.
Anderson gives several such examples, calling these businesses Long-Tail aggregators
–Google as the long-tail aggregator of advertising
–eBay of goods
–Amazon of books
–Apple of music
–Netflix of movies

12 Experts vs. intuition
Ian Ayres’ book
–“The future belongs to people like Wolfers who are comfortable with both intuition and numbers”
–Wolfers analyzed 44,000 college basketball games (over 16 years)
Also see Jonah Lehrer’s “How We Decide”, another bestseller
Source: Ian Ayres’ book, page 220

13 What Wolfers did
Plot density function of number of games that beat the Las Vegas spread
–Perfect normal bell curve!
Just look at games with point spreads less than or equal to 12
–Perfect normal bell curve
Look at games with point spread > 12
–47% chance that the favored team beat the spread (53% failed to cover the spread)
–More than 20% of games fell in this category of games with >12 spreads
–Is it point shaving?
Look at the score five minutes before the end of the game
–Right on track to beat the spread 50% of the time!
–Indeed a stronger case for point shaving
Source: Ian Ayres’ book, page 216
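To see why a 47% cover rate is suspicious rather than noise, one can ask how many standard errors it sits from the 50% expected of a fair spread. The sketch below is a rough check, not Wolfers' actual analysis: the sample size is an assumption built from the slides' "44,000 games" and "more than 20%" (taken here as exactly 20%).

```python
import math

TOTAL_GAMES = 44_000      # from the slides
BIG_SPREAD_SHARE = 0.20   # assumption: "more than 20%" taken as exactly 20%
n = round(TOTAL_GAMES * BIG_SPREAD_SHARE)  # assumed number of >12-point games

observed = 0.47   # share of favorites that beat the spread (from the slide)
expected = 0.50   # what a fair spread would produce

# Standard error of a proportion under the null hypothesis p = 0.5.
std_error = math.sqrt(expected * (1 - expected) / n)
z = (observed - expected) / std_error

print(n, round(z, 2))
```

Under these assumptions the gap is well beyond two standard errors, so chance alone is an unlikely explanation.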

14 2SD Rule: To understand variability
There is a 95% chance that a normally distributed variable will fall within two standard deviations (plus or minus) of its mean
Statistical significance is a simple, intuitive concept: there is less than a 5% chance that a normally distributed variable will be more than two standard deviations away from its mean.
Stanford Law School students knew that professors were required to give a 3.2 mean. They wanted to know if the professor was a “spreader” or a “clumper”!
Source: Ian Ayres’ book, page 221
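The 95% figure in the 2SD rule can be checked from the normal distribution itself. This sketch uses the standard identity P(|Z| < k) = erf(k / sqrt(2)) for a standard normal Z; the variable names are chosen here for illustration.

```python
import math

k = 2  # number of standard deviations

# Probability that a normal variable lies within k SDs of its mean.
p_within = math.erf(k / math.sqrt(2))
p_outside = 1 - p_within

print(round(p_within, 4), round(p_outside, 4))
```

The exact value is slightly above 95%, so "95% within two standard deviations" is a convenient rounding.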

15 Technology trends enabling all this data analysis
Cloud computing
–Amazon, Google, Yahoo, Microsoft
Open source software
–R programming language (NY Times article, Jan. 7, 2009)
–Hadoop allows ordinary PCs to analyze huge quantities of data that previously required supercomputers
Source: Economist special issue

16 Technology or techniques?
Moore’s Law
–Processing power doubles every two years
–Supercrunching does need CPUs, but computing power has been available
More important: Kryder’s Law
–Storage capacity of hard drives has been doubling every two years
–Named after Mark Kryder, chief technology officer of hard drive manufacturer Seagate
Source: Ian Ayres’ book, page 151
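Doubling every two years compounds quickly. The sketch below turns the doubling claim into a growth factor; the function name and the ten-year horizon are illustrative choices made here, not figures from the slides.

```python
def growth_factor(years, doubling_period=2):
    """Multiplicative growth after `years` if a quantity doubles every `doubling_period` years."""
    return 2 ** (years / doubling_period)


# Ten years of doubling every two years is five doublings.
decade_factor = growth_factor(10)
print(decade_factor)
```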

17 Three techniques
Regressions
–error term ~ N(0, σ²)
Randomization
–Run experiments by treating different samples in different ways
Neural networks
–Functional form is not assumed to be linear or anything specific
Source: Ian Ayres’ book
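As a small illustration of the first technique, the sketch below simulates data from a straight line with a normally distributed error term and recovers the line by ordinary least squares. Everything in it (the true intercept and slope, the error standard deviation, the sample size, the helper name fit_line) is an assumption made for the example; the course itself uses R, and Python is used here only for a self-contained demonstration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Illustrative "true" model: y = 1 + 2x + error, with error ~ N(0, sigma^2).
TRUE_INTERCEPT, TRUE_SLOPE, SIGMA = 1.0, 2.0, 1.0

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(0, 10) for _ in range(1000)]
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0, SIGMA) for x in xs]

intercept, slope = fit_line(xs, ys)
print(round(intercept, 2), round(slope, 2))
```

With a thousand points, the estimates should land close to the true values, and the leftover scatter is what the error term describes.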

18 Course material
From Data to Knowledge
Focus on data sets
Less on details of statistical techniques
Learn R programming through class-provided R programs and assignments
K/index.htm

19 Summary
Importance of data analysis
–in every walk of life!
How to extract the “story” hidden in the data set?
