
Big Data and Hadoop on Windows, .NET SIG Cleveland



1 Big Data and Hadoop on Windows  .NET SIG Cleveland

2 About Me  Serkan Ayvaz, Sn. Systems Analyst, Cleveland Clinic PhD Candidate, Computer Science, Kent State Univ.  LinkedIn:  

3 Agenda  Introduction to Big Data  Hadoop Framework  Hadoop On Windows  Ecosystem  Conclusions

4 What is Big Data? (“Hype?”)  Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. (Wikipedia)

5 What is new?  Enterprise data grows rapidly  Emerging market for vendors  New data sources  Competitive industries need more insights  Asking different questions  Generating models instead of transforming data into models

6 What is the problem?
- Size of data: rapid growth; TBs to PBs are the norm for many organizations
  - As of 2012, limits on the size of data sets that are feasible to process in a reasonable amount of time were on the order of exabytes of data
- Variety of data: relational, device-generated data, mobile, logs, web data, sensor networks, social networks, etc.
  - Structured
  - Unstructured
  - Semi-structured
- Rate of data growth
  - As of 2012, 2.5 quintillion (2.5×10^18) bytes of data were created every day (Wikipedia)
- Particularly large datasets: meteorology, genomics, complex physics simulations, biological and environmental research, Internet search, finance, and business informatics

7 Critique  Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, "big data", no matter how comprehensive or well analyzed, needs to be complemented by "big judgment", according to an article in the Harvard Business Review.  Consumer privacy concerns raised by the increasing storage and integration of personal information

8 Things to consider  Return on investment may differ  Asking the wrong questions won’t get the right answers  Experts need to fit into the organization  Requires a leadership decision  Traditional systems might be fine (for now)

9 What is Hadoop?
- Scalability
  - Scales horizontally; vertical scaling has limits
  - Scales seamlessly
- Moves processing to the data, as opposed to traditional methods
  - Network bandwidth is a limited resource
- Processes data sequentially in chunks and avoids random access
  - Seeks are expensive; disk throughput is reasonable
- Fault tolerance
  - Data replication
- Economical
  - Commodity servers (“not low-end”) vs. specialized servers
- Ecosystem
  - Integration with other tools
- Open source
  - Innovative, extensible
- Hadoop Core: HDFS (storage) and MapReduce (processing)

10 What can I do with Hadoop?  Distributed Programming(MapReduce)  Storage, Archive Legacy data  Transform Data  Analysis, Ad Hoc Reporting  Look for Patterns  Monitoring/ Processing logs  Abnormality detection  Machine Learning and advanced algorithms  Many more

11 HDFS
- Blocks
  - Large enough to minimize the cost of seeks (64 MB by default)
  - Using the block as the unit of abstraction makes storage management simpler than managing whole files
  - Fits well with the replication strategy and availability
- NameNode
  - Maintains the filesystem tree and the metadata for all files and directories
  - Stores the namespace image and edit log
- DataNode
  - Stores and retrieves blocks
  - Reports its blocks back to the NameNode periodically
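The point about block size and seek cost can be checked with a quick calculation. The sketch below assumes a 10 ms seek time and a 100 MB/s sequential transfer rate; both figures are illustrative assumptions rather than numbers from the talk, and `seekOverhead` is a hypothetical helper name.

```javascript
// Illustrative disk figures (assumptions, not measurements from the talk).
const SEEK_TIME_MS = 10;        // time to position the disk head once
const TRANSFER_MB_PER_S = 100;  // sustained sequential read rate

// Fraction of total read time spent seeking when one seek is paid per block.
function seekOverhead(blockSizeMb) {
  const transferMs = (blockSizeMb / TRANSFER_MB_PER_S) * 1000;
  return SEEK_TIME_MS / (SEEK_TIME_MS + transferMs);
}

for (const sizeMb of [1, 64, 128]) {
  const pct = (seekOverhead(sizeMb) * 100).toFixed(1);
  console.log(`${sizeMb} MB block: ${pct}% of read time is seeking`);
}
```

Under these assumptions a 1 MB block spends about half of its read time seeking, while a 64 MB block spends under 2%, which is the motivation for large blocks.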

12 HDFS
Good:
- Designed for, and shines with, large files
- Fault tolerance: data replication within and across racks
- Hadoop breaks data into smaller blocks
- Data locality
- Most efficient with a write-once, read-many-times pattern
Not so good:
- Low-latency data access
  - Optimized for high throughput, possibly at the expense of latency
  - Consider HBase for low latency
- Lots of small files
  - The NameNode holds filesystem metadata in memory
  - That memory limits the number of files in a filesystem
- Multiple writers, arbitrary file modifications
  - A file in HDFS may be written to by only a single writer
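To see why lots of small files strain the NameNode, the following sketch estimates metadata size for the same data stored two ways. It assumes roughly 150 bytes of NameNode memory per file or block entry, a commonly quoted rule of thumb and not a figure from the slides; the function names are hypothetical.

```javascript
const BLOCK_SIZE_MB = 64;      // default block size named on the previous slide
const BYTES_PER_OBJECT = 150;  // assumed rule-of-thumb cost per file or block entry

// Number of HDFS blocks a file needs; a file smaller than one block still needs one.
function blocksForFile(fileSizeMb) {
  return Math.max(1, Math.ceil(fileSizeMb / BLOCK_SIZE_MB));
}

// Rough NameNode memory needed to track `fileCount` files of equal size.
function namenodeBytes(fileCount, fileSizeMb) {
  const objectsPerFile = 1 + blocksForFile(fileSizeMb); // one file entry plus its blocks
  return fileCount * objectsPerFile * BYTES_PER_OBJECT;
}

// The same 1 TB (1,048,576 MB) stored two ways.
const asLargeFiles = namenodeBytes(1024, 1024);  // 1,024 files of 1 GB each
const asSmallFiles = namenodeBytes(1048576, 1);  // 1,048,576 files of 1 MB each
console.log(asLargeFiles, asSmallFiles);
```

With these assumed numbers the large-file layout needs about 2.6 MB of metadata and the small-file layout about 315 MB, roughly 120 times more for the same amount of data.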

13 Data Flow  [Figures: HDFS read and write data flow. Source: Hadoop: The Definitive Guide]

14 MapReduce Programming
- Splits input files into blocks
- Operates on key-value pairs
- Mappers filter and transform input data
- Reducers aggregate the mappers' output
- Handles processing efficiently in parallel
  - Moves code to the data (data locality)
  - The same code runs on all machines
- Can be difficult to implement some algorithms
- Can be implemented in almost any language
  - Streaming MapReduce for Python, Ruby, Perl, PHP, etc.
  - Pig Latin as a data flow language
  - Hive for SQL users
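The flow described on this slide can be imitated in a single process. The sketch below is a hypothetical stand-in for the framework, not Hadoop code: `runMapReduce` and the `emit` callbacks are names invented for illustration, and the log lines are made-up input.

```javascript
// Hypothetical single-process stand-in for the framework: map, shuffle/sort, reduce.
function runMapReduce(records, map, reduce) {
  // Map phase: each input record may emit any number of [key, value] pairs.
  const emitted = [];
  for (const record of records) {
    map(record, (key, value) => emitted.push([key, value]));
  }

  // Shuffle and sort: group values by key, then order the keys.
  const groups = new Map();
  for (const [key, value] of emitted) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  const sortedKeys = [...groups.keys()].sort();

  // Reduce phase: one call per key with all of that key's values.
  const output = [];
  for (const key of sortedKeys) {
    reduce(key, groups.get(key), (k, v) => output.push([k, v]));
  }
  return output;
}

// Example job: count HTTP status codes in log lines (made-up input).
const logLines = ["GET /a 200", "GET /b 404", "GET /c 200"];
const statusCounts = runMapReduce(
  logLines,
  (line, emit) => emit(line.split(" ")[2], 1),
  (status, ones, emit) => emit(status, ones.length)
);
console.log(statusCounts);
```

On a real cluster the map calls run on the nodes that hold the input blocks and the grouping step moves data across the network; here everything happens in one array, which is enough to show the contract between mapper and reducer.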

15 MapReduce
- Programmers write two functions:
    map (k, v) → <k’, v’>*
    reduce (k’, v’) → <k’, v’>*
  - All values with the same key are reduced together
- For efficiency, programmers typically also write:
    partition (k’, number of partitions) → partition for k’
  - Often a simple hash of the key, e.g., hash(k’) mod n
  - Divides up the key space for parallel reduce operations
    combine (k’, v’) → <k’, v’>*
  - Mini-reducers that run in memory after the map phase
  - Used as an optimization to reduce network traffic
- The framework takes care of the rest of the execution
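The two optional functions can be sketched as follows. This is a minimal illustration under stated assumptions: `hashKey` is an illustrative string hash and not Hadoop's implementation, and `partition` and `combine` are hypothetical helpers that mirror the signatures on the slide.

```javascript
// Deterministic string hash (illustrative only; not Hadoop's implementation).
function hashKey(key) {
  let h = 0;
  for (let i = 0; i < key.length; i++) {
    h = (h * 31 + key.charCodeAt(i)) >>> 0; // keep the value as an unsigned 32-bit integer
  }
  return h;
}

// partition(k', n): which of n reducers receives key k', here hash(k') mod n.
function partition(key, numPartitions) {
  return hashKey(key) % numPartitions;
}

// combine: pre-aggregate one mapper's output so fewer pairs cross the network.
function combine(pairs) {
  const sums = new Map();
  for (const [key, value] of pairs) {
    sums.set(key, (sums.get(key) || 0) + value);
  }
  return [...sums.entries()];
}

const mapperOutput = [["x", 1], ["y", 1], ["x", 1], ["z", 1], ["x", 1]];
const combined = combine(mapperOutput);
for (const [key, value] of combined) {
  console.log(key, value, "goes to reducer", partition(key, 3));
}
```

Five pairs shrink to three before the shuffle. Summing is safe to use as a combiner because addition is associative and commutative; an operation such as a mean would need partial sums and counts instead.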

16 Simple example - Word Count
    // Map and reduce functions in JavaScript
    var map = function (key, value, context) {
        // Split the line on anything that is not a letter.
        var words = value.split(/[^a-zA-Z]/);
        for (var i = 0; i < words.length; i++) {
            if (words[i] !== "") {
                // Emit each word in lower case with a count of 1.
                context.write(words[i].toLowerCase(), 1);
            }
        }
    };
    var reduce = function (key, values, context) {
        // Add up the counts emitted for this word.
        var sum = 0;
        while (values.hasNext()) {
            sum += parseInt(values.next());
        }
        context.write(key, sum);
    };

17 Divide and Conquer  [Figure: MapReduce data flow from input key-value pairs through map, combine, partition, shuffle and sort (aggregate values by keys), and reduce to output]

18 How does MapReduce work?
    Map(String docid, String text):
        for each word w in text:
            Emit(w, 1);
    Reduce(String term, Iterator values):
        int sum = 0;
        for each v in values:
            sum += v;
        Emit(term, sum);
Source: Hadoop: The Definitive Guide

19 How is it different from other systems?
- Parallel computing with the Message Passing Interface (MPI)
  - Suits compute-intensive jobs
  - Has issues with larger data volumes
    - Network bandwidth becomes the bottleneck and compute nodes sit idle
  - Hard to implement
    - Challenge of coordinating the processes in a large-scale distributed computation
    - Handling partial failure
    - Managing checkpointing and recovery

20 Comparing MapReduce to RDBMSs
             Traditional RDBMS            MapReduce
Data size    Gigabytes                    Petabytes
Access       Interactive and batch        Batch
Updates      Read and write many times    Write once, read many times
Structure    Static schema                Dynamic schema
Integrity    High                         Low
Scaling      Nonlinear                    Linear

21 MapReduce
- MapReduce is complementary to RDBMSs, not a competitor
- MapReduce is a good fit for analyzing the whole dataset in batch
- An RDBMS is good for point queries or updates
  - Indexed to deliver low-latency retrieval
  - Of a relatively small amount of data
- MapReduce suits applications where the data is written once and read many times
- An RDBMS is good for datasets that are continually updated
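The two access patterns can be contrasted in a few lines. This is only an in-memory illustration with made-up records: a `Map` stands in for a database index, and a plain loop stands in for a batch job that reads the whole dataset.

```javascript
// Made-up order records.
const orders = [
  { id: 101, region: "east", amount: 40 },
  { id: 102, region: "west", amount: 25 },
  { id: 103, region: "east", amount: 35 },
];

// RDBMS-style access: an index on id answers a point query without scanning.
const indexById = new Map(orders.map((order) => [order.id, order]));
function findOrder(id) {
  return indexById.get(id);
}

// MapReduce-style access: read every record once and aggregate.
function totalByRegion(records) {
  const totals = {};
  for (const { region, amount } of records) {
    totals[region] = (totals[region] || 0) + amount;
  }
  return totals;
}

console.log(findOrder(102).amount);
console.log(totalByRegion(orders));
```

The lookup touches one record regardless of table size, while the aggregate must read every record, which is why the first favors an indexed store and the second favors high-throughput batch processing.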

22 Hadoop on Windows Overview
- Apache Hadoop Core: common framework, open source community, shared by all distributions
- Hortonworks Data Platform: Windows platform, 100% open source, contributions to the community
- HDInsight: HDInsight Server, HDInsight on cloud, familiar tools and functionality

23 Hadoop on Windows
- Standard Hadoop modules
  - HDFS
  - MapReduce
  - Pig
  - Hive
  - Monitoring pages
- Easy installation and configuration
- Integration with Microsoft systems
  - Active Directory
  - System Center
  - etc.

24 Why is Hadoop on Windows important?  Windows Server has a large market share  Large developer and user community  Existing enterprise tools  Familiarity  Simplicity of use and management  Deployment options on both Windows Server and Windows Azure

25 [Architecture diagram. Labels: Hadoop (server and cloud); HDFS; data (unstructured, semi-structured, structured); sources: external data, web, mobile devices, social media, legacy data, RDBMS, NoSQL, SQL; job languages: Java, Streaming, HiveQL, Pig Latin, .NET, other languages; user self-service tools: data viewers, BI, visualization]

26 Run Jobs  Submit a JAR file (Java MapReduce)  HiveQL  Pig Latin  .NET wrapper through Streaming  .NET MapReduce  LINQ to Hive  JavaScript console  Excel Hive Add-In

27 .NET MapReduce Example
- Install the NuGet packages:
    install-package Microsoft.Hadoop.MapReduce
    install-package Microsoft.Hadoop.Hive
    install-package Microsoft.Hadoop.WebClient
- Reference “Microsoft.Hadoop.MapReduce.DLL”
- Create a class that implements “HadoopJob<FirstMapper>”
- Create a class called “FirstMapper” that implements “MapperBase”
- Run the DLL using the MRRunner utility:
    > MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2
- Or invoke the job from an executable:
    var hadoop = Hadoop.Connect();
    hadoop.MapReduceJob.ExecuteJob<JobType>(arguments);

28 .NET MapReduce Example
    // Job definition: names the mapper type and the input and output locations.
    public class FirstJob : HadoopJob<SqrtMapper>
    {
        public override HadoopJobConfiguration Configure(ExecutorContext context)
        {
            HadoopJobConfiguration config = new HadoopJobConfiguration();
            config.InputPath = "input/SqrtJob";
            config.OutputFolder = "output/SqrtJob";
            return config;
        }
    }

    // Mapper: reads one integer per input line and emits its square root.
    public class SqrtMapper : MapperBase
    {
        public override void Map(string inputLine, MapperContext context)
        {
            int inputValue = int.Parse(inputLine);
            // Perform the work.
            double sqrt = Math.Sqrt((double)inputValue);
            // Write output data.
            context.EmitKeyValue(inputValue.ToString(), sqrt.ToString());
        }
    }

29 Hadoop Ecosystem
- Hadoop: Common, MapReduce, HDFS
- HBase: column-oriented distributed database
- Hive: distributed data warehouse with a SQL-like query platform
- Pig: data transformation language
- Sqoop: tool for bulk import/export between HDFS, HBase, Hive, and relational databases
- Mahout: data mining algorithms
- ZooKeeper: distributed coordination service
- Oozie: job running and workflow scheduling service

30 What’s HBase?  Column-oriented distributed DB  Inspired by Google BigTable  Uses HDFS  Interactive processing  Can be used with or without MapReduce  PUT, GET, SCAN commands
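The shape of the three commands can be shown with a toy in-memory table. `TinyTable` is a hypothetical class written for this illustration and is not the HBase client API; the row keys and column names are made up. It keeps the two properties that matter for the slide: rows are addressed by key, and a scan walks a sorted key range from an inclusive start to an exclusive stop.

```javascript
// Hypothetical in-memory table; it imitates the operations, not the HBase client API.
class TinyTable {
  constructor() {
    this.rows = new Map(); // rowKey -> Map(column -> value)
  }

  // PUT: write one cell.
  put(rowKey, column, value) {
    if (!this.rows.has(rowKey)) this.rows.set(rowKey, new Map());
    this.rows.get(rowKey).set(column, value);
  }

  // GET: read one row by key, or undefined if the row is absent.
  get(rowKey) {
    const row = this.rows.get(rowKey);
    return row ? Object.fromEntries(row) : undefined;
  }

  // SCAN: rows with startKey <= key < stopKey, in sorted key order.
  scan(startKey, stopKey) {
    return [...this.rows.keys()]
      .filter((key) => key >= startKey && key < stopKey)
      .sort()
      .map((key) => [key, this.get(key)]);
  }
}

const table = new TinyTable();
table.put("user#002", "info:name", "Bo");
table.put("user#001", "info:name", "Al");
table.put("user#001", "info:city", "Cleveland");
table.put("visit#001", "info:page", "/home");

console.log(table.get("user#001"));
console.log(table.scan("user#", "user$"));
```

Because keys are ordered, choosing a key prefix such as `user#` lets one range scan return all related rows, which is a common reason to design row keys carefully.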

31 What’s Hive?  Translates HiveQL, which is similar to SQL, into MapReduce  A distributed data warehouse  HDFS table file format  Integrates with BI products on tabular data through Hive ODBC and JDBC drivers
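To make the translation idea concrete, the sketch below shows how a simple GROUP BY query corresponds to a map step and a reduce step. It is a conceptual illustration only: the table, columns, and data are made up, and the plan Hive actually generates is more involved.

```javascript
// HiveQL being illustrated (table and column names are made up):
//   SELECT page, COUNT(*) FROM visits GROUP BY page;

const visits = [
  { user: "al", page: "/home" },
  { user: "bo", page: "/docs" },
  { user: "al", page: "/docs" },
];

// Map: the GROUP BY column becomes the key; each row contributes a count of 1.
const mapped = visits.map((row) => [row.page, 1]);

// Shuffle: collect the values for each key.
const grouped = new Map();
for (const [page, one] of mapped) {
  grouped.set(page, (grouped.get(page) || []).concat(one));
}

// Reduce: COUNT(*) is the sum of the 1s collected for each key.
const result = [...grouped].map(([page, ones]) => [
  page,
  ones.reduce((sum, n) => sum + n, 0),
]);
console.log(result);
```

The query author writes one line of HiveQL and never sees these steps, which is the appeal for SQL users.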

32 Hive
Good for:
- HiveQL: a familiar, high-level language
- Batch jobs and ad hoc queries
- Self-service BI tools via ODBC and JDBC
- Schema, but not as strict as in a traditional RDBMS
- Supports UDFs
- Easy access to Hadoop data
Not so good for:
- No updates or deletes; insert only
- Limited indexes, built-in optimizer, no caching
- Not OLTP
- Not as fast as MapReduce

33 Conclusion
- Hadoop is great for its purposes and is here to stay
  - BUT it is not a cure for every problem
- Developing standards and best practices is very important
  - Users may abuse the resources and scalability
- Integration with the Windows platform
  - Existing systems, tools, expertise
- Parallelization
  - Easier to scale as needed
- Economical
  - Commodity hardware
- Relatively short training and application development time with Windows

34 Resources & References
- Hadoop: The Definitive Guide, Tom White
- Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, Morgan & Claypool Publishers
- Apache Hadoop
- Microsoft Big Data page: intelligence/big-data.aspx
- Hortonworks Data Platform
- Hadoop SDK

35 Thank you! Any Questions?
