Hadoop in the Wild CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook.

Agenda
– Check out some use cases
– Discuss some architectures

USE CASES

Common Use Cases
– Log Processing
– Image Identification
– Extract Transform Load
– Recommendation Engines
– Time-Series Storage and Processing
– Building Search Indexes
– Long-Term Archive
– Audit Logging

Non-Use Cases
– Data processing handled by one large server
– ACID Transactions

A Bank
Problem
– Need to analyze customer activity across multiple products to predict credit risk
– Acquired a number of banks
Solution
– Set up a single Hadoop cluster with data from multiple EDWs
– Bank added new sources of customer service data to get a clear picture of a customer’s financial situation

A Mobile Carrier
Problem
– Why are our customers terminating their service contracts?
Solution
– Combined transactional and event data with social network data
– Combined coverage maps with account data

An Online Dating Service
Problem
– Survey, demographic, and web activity data used to build a picture of each customer
– Customers wanted better recommendations
– Algorithms improved and number of users grew
Solution
– Moved data and analysis to Hadoop
– Able to size system to meet needs of customers

Ad Targeting
Problem
– Advertising is a special kind of recommendation
– Need to select best ad for a particular visitor, but each advertiser is paying to have its ad seen
Solution
– Collect stream of user activity with continuous analysis
– Build sophisticated models of user behavior

POS Transaction Analysis
Problem
– Retailers able to collect much more data in stores and online
– EDWs generally do not support the sophisticated analysis needed for better forecasting
Solution
– Loaded 20 years of sales transactions and used Hive to do same analysis as before
– Now able to use new algorithms with new data sets

Sensor Data
Problem
– Volume of sensor data from every generator across multiple grids is enormous
– Clear picture depends on real-time and forensic analysis
Solution
– Capture and store all streaming sensor data
– Built continuous analysis system to watch performance of generators

Threat Analysis
Problem
– How do we detect threats and fraudulent activity in an online world?
Solution
– Use of HBase to store virus signatures
– Use of MapReduce to compare spam or malware
– Lambda Architecture
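The signature-comparison idea on this slide can be sketched in a single process. This is a minimal illustration, not the architecture any particular vendor uses: the SHA-256 fingerprint, the in-memory set standing in for an HBase signature table, and the names `signature`, `map_phase`, and `scan` are all assumptions made for the example.

```python
import hashlib

def signature(payload: bytes) -> str:
    """Fingerprint a payload; SHA-256 is an assumption, not the slide's choice."""
    return hashlib.sha256(payload).hexdigest()

# Stand-in for the signature table; in the slide's design this would live in HBase.
known_signatures = {signature(b"malicious attachment"), signature(b"spam template")}

def map_phase(message_id, payload):
    """Emit (message_id, signature) only when the payload matches a known signature."""
    sig = signature(payload)
    if sig in known_signatures:
        yield message_id, sig

def scan(messages):
    """Run the map step over every message and collect the flagged IDs."""
    return sorted(
        flagged_id
        for message_id, payload in messages.items()
        for flagged_id, _ in map_phase(message_id, payload)
    )

# Hypothetical messages for illustration only.
messages = {
    "m1": b"quarterly report",
    "m2": b"spam template",
    "m3": b"malicious attachment",
}
flagged = scan(messages)
```

Because each message is checked independently, the map step parallelizes across a cluster without coordination, which is what makes the comparison a natural MapReduce job.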

Trade Surveillance
Problem
– Difficult to monitor trades for compliance, and impossible to catch rogue traders
Solution
– Store trade data and trading party data
– Continuously monitor activity and build connections
– Provides cheap storage for legally required auditing

Search
Problem
– Indexing is fairly easy until you have to index the Internet
– User preferences make it harder
Solution
– MapReduce was designed for indexing
– Online retailers depend on search for users finding and buying products
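Why index building maps well onto MapReduce can be seen in a small single-process sketch of an inverted index. The function names, the whitespace tokenizer, and the sample documents are assumptions for illustration; a real job would run the map and reduce steps on separate cluster nodes.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, doc_ids):
    """Collapse the grouped document IDs into a sorted posting list."""
    return term, sorted(set(doc_ids))

def build_index(docs):
    """Chain map, shuffle, and reduce to produce term -> posting list."""
    pairs = (
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    )
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())

# Hypothetical documents for illustration only.
docs = {
    "d1": "hadoop stores data",
    "d2": "hadoop indexes data quickly",
}
index = build_index(docs)
```

Each document is mapped independently and each term is reduced independently, so both phases scale out by adding machines.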

Data Sandbox
Problem
– ???
Solution
– Simple storage mechanism with diverse tools for data analysis and exploration

ARCHITECTURES

Building your Data Lake


Lambda Architecture
[Diagram. Layers: BATCH LAYER, SERVING LAYER, SPEED LAYER. Labels: All Data, Precompute Views, Batch recompute, Batch views, New Data Stream, Process Stream, Increment Views, Real-Time Increment, Real-time views, Query, QFD 1 ... QFD N. Technologies: Hadoop, Storm, Apache HBase, HDFS/SQL]
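The interplay of the batch, speed, and serving layers named on this slide can be sketched as follows. This is a toy model under stated assumptions: the page-view counting workload, the `Counter`-based views, and the function names are hypothetical, and in practice the batch view would come from a Hadoop job and the real-time view from a stream processor such as Storm.

```python
from collections import Counter

def batch_recompute(all_data):
    """Batch layer: recompute the view from scratch over all stored events."""
    return Counter(event["page"] for event in all_data)

def increment_view(realtime_view, event):
    """Speed layer: fold one new event into the real-time view."""
    realtime_view[event["page"]] += 1

def query(batch_view, realtime_view, page):
    """Serving side: combine the batch view with the real-time increment."""
    return batch_view[page] + realtime_view[page]

# Hypothetical page-view events already stored in the master data set.
all_data = [{"page": "/home"}, {"page": "/home"}, {"page": "/ads"}]
batch_view = batch_recompute(all_data)

# Events that arrived after the last batch run.
realtime_view = Counter()
for event in [{"page": "/home"}, {"page": "/jobs"}]:
    increment_view(realtime_view, event)

home_views = query(batch_view, realtime_view, "/home")
```

The design choice this illustrates is that the real-time view only has to cover events since the last batch run; once a new batch view is computed, the corresponding real-time state can be discarded.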

Facebook
– EDW (Oracle) was unable to scale and perform
– Investigated small Hadoop system
– Engineers loved it
– Began developing Hive

Facebook
– Time-series summaries
– Ad hoc jobs over historical data
– Long-term archival store for logs
– Look up log events by specific attributes

Facebook Architecture

Facebook Messaging
– Needed a short set of temporal data
– A growing set of data that is rarely accessed
– HBase fit their needs more than other open-source technologies

Twitter Architecture

LinkedIn Architecture

LinkedIn Applications

LinkedIn Future
– MapReduce is not suited for large graph processing
– Batch-oriented nature is not suited for “breaking news”
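One commonly cited reason for the graph-processing limitation is that graph algorithms are iterative, and each iteration becomes a separate full pass over the data. The sketch below illustrates this with hop counts on a toy graph, assuming each call to the hypothetical `bfs_round` stands in for one complete MapReduce job; the graph and function names are invented for the example.

```python
def bfs_round(graph, distances):
    """One map/reduce-style pass: every reached node proposes distance + 1 to its neighbors."""
    proposals = {}
    for node, dist in distances.items():          # "map": emit candidate distances
        for neighbor in graph.get(node, []):
            candidate = dist + 1
            if candidate < proposals.get(neighbor, float("inf")):
                proposals[neighbor] = candidate   # "reduce": keep the minimum per node
    updated = dict(distances)
    for node, dist in proposals.items():
        if dist < updated.get(node, float("inf")):
            updated[node] = dist
    return updated

def shortest_hops(graph, source):
    """Repeat full passes until nothing changes; return distances and the pass count."""
    distances, rounds = {source: 0}, 0
    while True:
        updated = bfs_round(graph, distances)
        rounds += 1
        if updated == distances:
            return distances, rounds
        distances = updated

# Hypothetical connection graph: a chain a -> b -> c -> d.
graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
distances, rounds = shortest_hops(graph, "a")
```

Even this four-node chain needs four passes (three to reach every node, one to detect that nothing changed). On a cluster, every pass would pay job startup and disk I/O costs, which is why iterative graph workloads tend to favor other processing models.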
