Zoie Barrett and Brian Lam: Big Data and Hadoop
Agenda. What is Big Data? What Tools are There? Hadoop. Hadoop vs SQL. Examples. Questions?
What is Big Data? Large, complex, rapidly growing, unstructured data sets that are difficult to process using traditional methods. Analyzing Big Data is complex and requires the skills of both programmers and statisticians.
Dimensions. Volume: determining relevance; how to use analytics to create value. Velocity: unprecedented speeds for data streaming; reacting quickly. Variety: multiple formats, many unstructured; managing, merging and governing.
Big Statistics. 90% of the world's data was created in the last 2 years. We create 2.5 quintillion bytes of data a day. 48 hours of video are uploaded to YouTube every minute (nearly 8 years of content every day). 100 terabytes of data are uploaded to Facebook daily. 230 million Tweets are sent a day. Source: http://analyzingmedia.com/2012/infographic-big-flood-of-big-data-in-digital-marketing/
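The YouTube figure above can be checked with simple arithmetic. The short sketch below only uses the 48-hours-per-minute number from the slide; the 365-day year is an assumption made for the conversion.

```python
# Sanity-check the slide's YouTube figure: 48 hours of video uploaded per minute.
HOURS_UPLOADED_PER_MINUTE = 48
MINUTES_PER_DAY = 60 * 24

hours_per_day = HOURS_UPLOADED_PER_MINUTE * MINUTES_PER_DAY  # 69,120 hours
days_of_content = hours_per_day / 24                         # 2,880 days
years_of_content = days_of_content / 365                     # assumes a 365-day year

print(f"{hours_per_day:,} hours = {years_of_content:.2f} years of video per day")
# prints: 69,120 hours = 7.89 years of video per day
```

The result, about 7.9 years, agrees with the "nearly 8 years" quoted on the slide.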
Concerns with Big Data. Data storage is becoming cheaper and cheaper, but how do we manage it? Read/write speeds are not keeping up with the amount of data being generated. Data is unstructured and hard to analyze. What is the solution? Source: http://www.slideshare.net/martyhall/hadoop-tutorial-overview-of-hadoop
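To make the read-speed concern concrete, here is a rough back-of-the-envelope sketch. The data size, disk throughput, and disk count are assumed round numbers chosen for illustration; they are not taken from the slides.

```python
# Assumed figures for illustration only (not from the slides).
DATA_BYTES = 10**12               # 1 terabyte of data to scan
DISK_BYTES_PER_SEC = 100 * 10**6  # 100 MB/s sequential read for one disk
NUM_DISKS = 100                   # hypothetical number of disks read in parallel

def read_seconds(total_bytes, bytes_per_sec, disks=1):
    """Time to scan the data when it is split evenly across `disks` drives."""
    return total_bytes / (bytes_per_sec * disks)

single = read_seconds(DATA_BYTES, DISK_BYTES_PER_SEC)
parallel = read_seconds(DATA_BYTES, DISK_BYTES_PER_SEC, NUM_DISKS)

print(f"one disk: {single / 3600:.1f} hours")       # one disk: 2.8 hours
print(f"{NUM_DISKS} disks: {parallel:.0f} seconds")  # 100 disks: 100 seconds
```

Under these assumptions a single disk needs hours to read what many disks working side by side can read in minutes, which is why spreading data across machines is attractive.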
Tools
Hadoop. Open-source software framework for storing and processing large data sets. Fundamental assumption: hardware failures are common. Runs on clusters of commodity hardware. Batch processing, not real time. Hadoop-based projects exist for real-time analysis.
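One way to see why a system that expects hardware failures keeps several copies of its data is the small calculation below. It is a simplified sketch: it assumes each node fails independently with the same probability, and the 1% failure probability is a made-up value. The replica count of 3 matches the HDFS default replication factor.

```python
def prob_all_replicas_lost(node_failure_prob, replicas):
    """Chance that every copy of a block is lost, assuming independent node failures."""
    return node_failure_prob ** replicas

NODE_FAILURE_PROB = 0.01  # hypothetical per-node failure probability

for replicas in (1, 2, 3):
    risk = prob_all_replicas_lost(NODE_FAILURE_PROB, replicas)
    print(f"{replicas} replica(s): chance of losing the block = {risk:.0e}")
```

Each extra copy multiplies the risk by the per-node failure probability, so cheap, unreliable machines can still hold data safely as a group.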
History of Hadoop. Doug Cutting and Mike Cafarella wanted to develop a better open-source search engine. Created Nutch (web crawler), based on Lucene (search engine library).
How Hadoop works. Hadoop Distributed File System (HDFS): designed to run on commodity hardware; data is stored across multiple servers; fault tolerant. MapReduce processes data. Map: divides jobs into pieces and distributes them. Reduce: combines results.
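The map and reduce steps described above can be sketched in a few lines of plain Python. This is a single-process illustration of the idea using the classic word-count task, not Hadoop's actual API; the function names and sample text are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    mapped = []
    for chunk in chunks:  # on a real cluster each chunk would go to a different node
        mapped.extend(map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical input, already split into pieces.
chunks = ["big data and hadoop", "hadoop stores big data", "hadoop processes data"]
counts = word_count(chunks)
print(counts["hadoop"], counts["data"], counts["big"])  # 3 3 2
```

Because each chunk is mapped independently, the map work can be spread over many machines, and only the grouped intermediate pairs need to be brought together for the reduce step.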
Who Uses Hadoop?
Hadoop vs SQL. SQL data storage: logical, interrelated tables with defined columns. Hadoop data storage: compressed files of text or other data types. Source: https://gigaom.com/2013/03/04/the-history-of-hadoop-from-4-nodes-to-the-future-of-data/
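The storage contrast above can be illustrated with a small sketch. It uses Python's built-in sqlite3 module to stand in for a SQL database and a gzip-compressed block of tab-separated text to stand in for a file that Hadoop might store. The table name, column names, and records are hypothetical, and no Hadoop code is involved.

```python
import gzip
import sqlite3

# Hypothetical sample records, made up for illustration.
rows = [("alice", "login", 3), ("bob", "upload", 5), ("alice", "upload", 2)]

# SQL style: define the table and its columns first, then load and query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_name TEXT, action TEXT, n INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
sql_total = conn.execute(
    "SELECT SUM(n) FROM events WHERE action = 'upload'"
).fetchone()[0]
conn.close()

# File style: store raw compressed text; structure is applied only when reading.
raw_text = "".join(f"{user}\t{action}\t{n}\n" for user, action, n in rows)
compressed = gzip.compress(raw_text.encode("utf-8"))

file_total = 0
for line in gzip.decompress(compressed).decode("utf-8").splitlines():
    user_name, action, n = line.split("\t")
    if action == "upload":
        file_total += int(n)

print(sql_total, file_total)  # 7 7
```

Both approaches give the same answer here; the difference is where the structure lives. The database enforces columns when data is written, while the raw file leaves interpretation to whatever program reads it.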
Examples. UPS: reduced maintenance costs. Schwan’s: analyzed customer feedback. Memphis PD: used analytics to reduce crime.
Dilbert
Questions?