Joe Hummel, PhD. Visiting Researcher: U. of California, Irvine. Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago


1 Joe Hummel, PhD
Visiting Researcher: U. of California, Irvine
Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago
Materials: http://www.joehummel.net/downloads.html
Email: joe@joehummel.net

2 Hadoop on Azure
• A little history…
• Why Hadoop?
• How it works
• Demos
• Summary

3
• Map-Reduce is from functional programming
    // function returns 1 if i is prime, 0 if not:
    let isPrime(i) = ...
    // sums 2 numbers:
    let sum(x, y) = return x + y
    // count the number of primes in 1..N:
    let countPrimes(N) =
        let L = [ 1 .. N ]          // [ 1, 2, 3, 4, 5, 6, ... ]
        let T = map isPrime L       // [ 0, 1, 1, 0, 1, 0, ... ]
        let count = reduce sum T    // 42
        return count
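The pseudocode on this slide can be tried out directly in JavaScript, the language used for the later demos. The following is a minimal sketch, not part of the original deck: the trial-division body of `isPrime` is an assumption (the slide leaves it elided), and the built-in `Array.prototype.map` and `Array.prototype.reduce` stand in for the slide's `map` and `reduce`.

```javascript
// Returns 1 if i is prime, 0 if not.
// Assumption: simple trial division; the slide does not show this body.
function isPrime(i) {
  if (i < 2) return 0;
  for (let d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the primes in 1..N: map isPrime over the list, then reduce with sum.
function countPrimes(N) {
  const L = Array.from({ length: N }, (_, k) => k + 1); // [1, 2, 3, ..., N]
  const T = L.map(isPrime);                             // [0, 1, 1, 0, 1, 0, ...]
  return T.reduce(sum, 0);
}

console.log(countPrimes(10)); // primes up to 10 are 2, 3, 5, 7
```

Because each `isPrime` call is independent of the others, the map step can be split across machines, which is the property Map-Reduce frameworks exploit.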

4
• Created by Google to drive internet search
  ◦ BIG data: scalable to TBs and beyond
  ◦ Parallelism: to get the performance
  ◦ Data partitioning: to drive the parallelism
  ◦ Fault tolerance: at this scale, machines are going to crash, a lot…
[Diagram: BIG data → page hits]

5
• Search engines: Google, Yahoo, Bing
• Facebook
• Twitter
• Financials
• Health industry
• Insurance
• Credit card companies
• Just about any company collecting user data…

6
• Freely-available framework for big data
  ◦ http://hadoop.apache.org/
• Based on concept of Map-Reduce:
[Diagram: BIG data → Map … → intermediate results → Reduce → R]

7 [Diagram: Mapper and Reducer]

8 [Diagram: Data → parallel Map → Sort stages → Merge → Reduce → R]
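The Map, Sort, Merge, and Reduce stages named on this slide can be imitated in a few lines of single-process JavaScript. This is only an illustrative sketch: `runMapReduce`, the partition layout, and the page-hit records are hypothetical, and Hadoop itself runs these stages across many machines rather than in one loop.

```javascript
// Hypothetical in-memory runner.
// mapFn(record) returns a list of [key, value] pairs;
// reduceFn(key, values) returns one result for that key.
function runMapReduce(partitions, mapFn, reduceFn) {
  // Map + Sort: each partition is mapped independently, then sorted by key.
  const sortedRuns = partitions.map((records) =>
    records
      .flatMap(mapFn)
      .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0))
  );

  // Merge: combine the runs, grouping values that share a key
  // (a simple grouping, not a true k-way merge).
  const groups = new Map();
  for (const run of sortedRuns) {
    for (const [key, value] of run) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    }
  }

  // Reduce: one call per key, in key order.
  return [...groups.keys()]
    .sort()
    .map((key) => [key, reduceFn(key, groups.get(key))]);
}

// Made-up example: count page hits spread over two partitions.
const partitions = [["/home", "/about", "/home"], ["/search", "/home"]];
const hits = runMapReduce(
  partitions,
  (page) => [[page, 1]],
  (key, values) => values.reduce((x, y) => x + y, 0)
);
console.log(hits);
```

Each partition is processed without looking at any other partition until the merge step, which is what lets the real framework place partitions on different machines.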

9
• Netflix data-mining…
[Diagram: Netflix Movie Reviews (.txt) → Netflix Data Mining App → Average rating…]
    movieid,userid,rating,date
    1,2390087,3,2005-09-06
    217,5567801,5,2006-01-03
    42,1121098,3,2006-03-25
    1,8972234,5,2003-12-02
    ...

10 [Diagram: Data → parallel Map → Sort stages → Merge → Reduce → R]

11
• To compute average rating for every movie:
    // Javascript version:
    var map = function (key, value, context) {
        var values = value.split(",");
        // field 0 contains movieid, field 2 the rating:
        context.write(values[0], values[2]);
    };
    var reduce = function (key, values, context) {
        var sum = 0;
        var count = 0;
        while (values.hasNext()) {
            count++;
            sum += parseInt(values.next());
        }
        context.write(key, sum/count);
    };
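To see what these two functions compute without a cluster, they can be exercised locally. The sketch below repeats the slide's `map` and `reduce` and adds stand-ins for what the framework would supply: `makeContext` and `makeIterator` are hypothetical helpers, not part of any Hadoop API, and the input is just the four sample rows from the Netflix slide.

```javascript
// map and reduce as on the slide (with an explicit radix for parseInt).
const map = function (key, value, context) {
  const values = value.split(",");
  // field 0 contains movieid, field 2 the rating:
  context.write(values[0], values[2]);
};

const reduce = function (key, values, context) {
  let sum = 0;
  let count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum / count);
};

// Hypothetical stand-ins for the objects the framework would pass in.
function makeContext() {
  const pairs = [];
  return { pairs, write: (k, v) => pairs.push([k, v]) };
}
function makeIterator(items) {
  let i = 0;
  return { hasNext: () => i < items.length, next: () => items[i++] };
}

// Sample rows from the Netflix slide (header line omitted).
const rows = [
  "1,2390087,3,2005-09-06",
  "217,5567801,5,2006-01-03",
  "42,1121098,3,2006-03-25",
  "1,8972234,5,2003-12-02",
];

// Map phase: one call per input line.
const mapped = makeContext();
rows.forEach((row, offset) => map(offset, row, mapped));

// Group the emitted ratings by movie id (the framework's sort/merge step).
const byMovie = new Map();
for (const [movieId, rating] of mapped.pairs) {
  if (!byMovie.has(movieId)) byMovie.set(movieId, []);
  byMovie.get(movieId).push(rating);
}

// Reduce phase: one call per movie id.
const reduced = makeContext();
for (const [movieId, ratings] of byMovie) {
  reduce(movieId, makeIterator(ratings), reduced);
}
const averages = Object.fromEntries(reduced.pairs);
console.log(averages);
```

Movie 1 appears twice in the sample (ratings 3 and 5), so it is the only key whose reduce call sees more than one value.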

12
• Upload data to HDFS
  ◦ Hadoop file system
• Write map / reduce functions
  ◦ Default is to use Java
  ◦ Most languages supported: C, C++, C#, JavaScript, Python, …
• Compile and upload code
  ◦ For Java, you upload a .jar file
  ◦ For others, an .exe or script
• Submit MapReduce job
• Wait for job to complete

13
• Queries against big datasets
• Embarrassingly-parallel problems
  ◦ Solution must fit into map-reduce framework
• Non-real-time demands
• Hadoop is not for:
  ◦ Small datasets (< 1GB?)
  ◦ Sub-second / real-time needs (though clearly Google makes it work)

14
• We’ll be working with Chicago crime data…
  ◦ https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
  ◦ http://www.cityofchicago.org/city/en/narr/foia/CityData.html
  ◦ Size: 1 GB, 5M rows

15
• Compute top-10 crimes…
    IUCR   Count
    0486   366903
    0820   308074
    ...
    0890   166916
IUCR = Illinois Uniform Crime Reporting codes
https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e

16
• Hadoop on Azure…
• Supports traditional Hadoop usage
  ◦ Upload data
  ◦ Write MapReduce program
  ◦ Submit job
• Additional features:
  ◦ Allows access to persistent data from Azure Storage Vault
  ◦ Provides interactive JavaScript console
  ◦ Built-in higher-level query languages (PIG, HIVE)

17
    // Javascript version:
    var map = function (key, value, context) {
        var values = value.split(",");
        context.write(values[4], 1);
    };
    var reduce = function (key, values, context) {
        var sum = 0;
        while (values.hasNext()) {
            sum += parseInt(values.next());
        }
        context.write(key, sum);
    };
Output:
    0486 366903
    0820 308074
    ...

18
    // interactive PIG with explicit Map-Reduce functions:
    pig.from("asv://datafiles/CC-from-2001.txt").
        mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
        orderBy("Count DESC").
        take(10).
        to("output-from-2001")
    // visualize the results:
    file = fs.read("output-from-2001/part-r-00000")
    data = parse(file.data, "IUCR, Count:long")
    graph.bar(data)
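For readers without access to the interactive console, the effect of `orderBy("Count DESC")` followed by `take(10)` can be approximated in plain JavaScript. This is a sketch under assumptions, not the console's implementation: `parseCounts` and `topN` are hypothetical helpers, the tab-separated line format is assumed, and the sample uses the three counts shown on the top-10 slide.

```javascript
// Parses output lines of the form "<IUCR>\t<Count>" into records.
// Assumption: fields are tab-separated. IUCR stays a string so that
// leading zeros (e.g. "0486") are preserved.
function parseCounts(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const [iucr, count] = line.split("\t");
      return { IUCR: iucr, Count: Number(count) };
    });
}

// Plain-JavaScript counterpart of orderBy("Count DESC") then take(n).
// Copies the array first so the caller's records are not reordered.
function topN(records, n) {
  return [...records].sort((a, b) => b.Count - a.Count).slice(0, n);
}

// Sample counts from the top-10 slide, deliberately out of order.
const sample = "0820\t308074\n0890\t166916\n0486\t366903\n";
const top2 = topN(parseCounts(sample), 2);
console.log(top2);
```

Sorting happens after the reduce step because the reducer only sees one crime code at a time; ranking needs all the totals together.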

19
• Microsoft is offering free access to Hadoop
  ◦ Request invitation @ http://www.hadooponazure.com/
• Hadoop connector for Excel
  ◦ Process data using Hadoop, analyze/visualize using Excel

21
• Hadoop is all about big data processing
  ◦ Scalable, parallel, fault-tolerant
• Easy to understand programming model
  ◦ Map-Reduce
  ◦ But then solution must fit into this framework…
• Rich ecosystem developing around Hadoop
  ◦ Technologies: PIG, HIVE, HBase, …
  ◦ Companies: Cloudera, Hortonworks, MapR, …

22
• Presenter: Joe Hummel
  ◦ Email: joe@joehummel.net
  ◦ Materials: http://www.joehummel.net/downloads.html
• For more info:
  ◦ http://www.hadooponazure.com/
  ◦ http://msdn.microsoft.com/en-us/magazine/jj190805.aspx
  ◦ Overview, including how to access via .NET API:
    http://www.simple-talk.com/cloud/data-science/analyze-big-data-with-apache-hadoop-on-windows-azure-preview-service-update-3/

