1
USING HADOOP & HBASE TO BUILD CONTENT RELEVANCE & PERSONALIZATION
Tools to build your big data application
Ameya Kanitkar
2
Ameya Kanitkar – That’s me!
Big Data Infrastructure Engineer @ Groupon, Palo Alto, USA
(Working on Deal Relevance & Personalization Systems)
ameya.kanitkar@gmail.com
http://www.linkedin.com/in/ameyakanitkar
@aktwits
3
Agenda
Basics of Hadoop & HBase
How you can use Hadoop & HBase for a big data application
Case Study: Deal Relevance and Personalization Systems at Groupon with Hadoop & HBase
4
Big Data Application Examples
Recommendation Systems
Ad Targeting
Personalization Systems
BI / DW
Log Analysis
Natural Language Processing
5
So what is Hadoop?
A general-purpose framework for processing huge amounts of data.
Open source
Batch / offline oriented
6
Hadoop – HDFS
Open source distributed file system.
Stores large files.
Can easily be accessed via applications built on top of HDFS.
Data is distributed and replicated over multiple machines.
Linux-style commands, e.g. ls, cp, mv, touchz, etc.
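As an illustration of programmatic access, here is a minimal sketch that reads a file through the HDFS Java FileSystem API; the namenode address and file path are hypothetical.

```java
// Minimal sketch: reading a file from HDFS via the Java FileSystem API.
// The fs.defaultFS address and the file path are hypothetical.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    try (FileSystem fs = FileSystem.get(conf);
         BufferedReader reader = new BufferedReader(
             new InputStreamReader(fs.open(new Path("/data/logs/part-00000"))))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line); // process each line of the distributed file
      }
    }
  }
}
```

The application sees one logical file; HDFS handles the distribution and replication underneath.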
7
Hadoop – HDFS
Example: hadoop fs -dus /data/
185453399927478 bytes =~ 168 TB
(One of the folders from one of our Hadoop clusters)
8
Hadoop – MapReduce
Application framework built on top of HDFS to process your big data.
Operates on key-value pairs.
Mappers filter and transform input data.
Reducers aggregate mapper output.
9
Example
Given web logs, calculate the landing-page conversion rate for each product.
So we need to count how many impressions each product received, how many led to a purchase, and then calculate the conversion rate for each product.
10
MapReduce Example
Map Phase
Map 1: Process log file. Output: Key (Product ID), Value (Impression Count)
Map 2: Process log file. Output: Key (Product ID), Value (Impression Count)
...
Map N: Process log file. Output: Key (Product ID), Value (Impression Count)
Reduce Phase
Reducer: Here we receive all data for a given product. A simple loop over the values calculates the conversion rate. Output: (Product ID, Conversion Rate)
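As a sketch, here is roughly what that job could look like with the Hadoop MapReduce Java API. It assumes (hypothetically) that each log line is tab-separated as product ID and event type, where the event is either an impression or a purchase.

```java
// Minimal sketch of the conversion-rate job above. Assumes each log line is
// "<productId>\t<eventType>" where eventType is "impression" or "purchase";
// this field layout is hypothetical.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConversionRate {

  // Mapper: filter and transform each log line into (productId, event flag).
  // Purchases are emitted as 1, impressions as 0.
  public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final Text productId = new Text();
    private final IntWritable event = new IntWritable();

    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 2) return;            // skip malformed lines
      productId.set(fields[0]);
      event.set("purchase".equals(fields[1]) ? 1 : 0);
      ctx.write(productId, event);
    }
  }

  // Reducer: receives all events for one product; a simple loop computes
  // conversion rate = purchases / impressions.
  public static class RateReducer extends Reducer<Text, IntWritable, Text, Text> {
    @Override
    protected void reduce(Text productId, Iterable<IntWritable> events, Context ctx)
        throws IOException, InterruptedException {
      long impressions = 0, purchases = 0;
      for (IntWritable e : events) {
        if (e.get() == 1) purchases++; else impressions++;
      }
      double rate = impressions == 0 ? 0.0 : (double) purchases / impressions;
      ctx.write(productId, new Text(String.format("%.4f", rate)));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "conversion-rate");
    job.setJarByClass(ConversionRate.class);
    job.setMapperClass(LogMapper.class);
    job.setReducerClass(RateReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The shuffle between the two phases groups every event for a given product onto one reducer, which is what makes the "simple loop" per product possible.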
11
Recap
We just processed terabytes of data and calculated conversion rates across millions of products.
Note: This is a batch process only. It takes time. You cannot start this process after someone visits your website.
How about we generate recommendations in a batch process and serve them in real time?
12
HBase
Provides real-time random read/write access over HDFS.
Built on Google’s ‘Bigtable’ design.
Open source.
This is not an RDBMS, so no joins. Access patterns are generally simple, like get(key), put(key, value), etc.
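For illustration, a minimal sketch of those access patterns with the HBase Java client API; the table name, column family, and qualifier are hypothetical.

```java
// Minimal sketch of HBase's simple access patterns: put(key, value) and
// get(key). Table, column family, and qualifier names are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseAccessExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("user_profiles"))) {

      // put(key, value): write one cell under family "cf1", qualifier "click_history"
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("click_history"),
                    Bytes.toBytes("{\"clicks\": []}"));
      table.put(put);

      // get(key): read the row back
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      byte[] value = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("click_history"));
      System.out.println(Bytes.toString(value));
    }
  }
}
```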
13
Row    | Cf:...                  | Cf:...
Row 1  | Cf1:qual1  | Cf1:qual2
Row 11 | Cf1:qual2  | Cf1:qual22 | Cf1:qual3
Row 2  | Cf2:qual1
Row N  |
Dynamic column names: no need to define columns upfront.
Both rows and columns are (lexicographically) sorted.
14
Row    | Cf:...
user1  | Cf1:click_history:{actual_clicks_data} | Cf1:purchases:{actual_purchases}
user11 | Cf1:purchases:{actual_purchases}
user20 | Cf1:mobile_impressions:{actual_mobile_impressions} | Cf1:purchases:{actual_purchases}
Note: Each row has different columns, so think about this as a hash map rather than a table with rows and columns.
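That hash-map view shows up directly in the client API: reading a row back gives you whatever columns it happens to contain. A short sketch (the table handle and family name are hypothetical):

```java
// Sketch: a row is effectively a sorted map of qualifiers to values, not a
// fixed-width table row. Family name "cf1" is hypothetical.
import java.util.Map;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RowAsMapExample {
  // Print whatever columns this user's row happens to have under family "cf1".
  static void dumpRow(Table table, String userKey) throws java.io.IOException {
    Result row = table.get(new Get(Bytes.toBytes(userKey)));
    for (Map.Entry<byte[], byte[]> cell :
             row.getFamilyMap(Bytes.toBytes("cf1")).entrySet()) {
      System.out.println(Bytes.toString(cell.getKey()) + " = "
                         + Bytes.toString(cell.getValue()));
    }
  }
}
```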
15
Putting it all together
Store data in HDFS → Analyze data (MapReduce) → Generate recommendations (MapReduce) → Serve real-time requests (HBase) → Web / Mobile
Do offline analysis in Hadoop, and serve real-time requests with HBase.
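To make the hand-off concrete, here is a hedged sketch of a reducer that writes generated recommendations straight into an HBase table via TableOutputFormat; the table name, column names, and value layout are hypothetical, not the deck's actual pipeline.

```java
// Sketch of the "generate in batch, serve in real time" glue: a MapReduce
// reducer whose output is HBase Puts. Table/column names and the value
// layout are hypothetical.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class RecsToHBaseReducer
    extends TableReducer<Text, Text, ImmutableBytesWritable> {

  @Override
  protected void reduce(Text userId, Iterable<Text> recs, Context ctx)
      throws IOException, InterruptedException {
    StringBuilder joined = new StringBuilder();
    for (Text deal : recs) {
      if (joined.length() > 0) joined.append(',');
      joined.append(deal);                      // e.g. a ranked list of deal ids
    }
    Put put = new Put(Bytes.toBytes(userId.toString()));
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("recommendations"),
                  Bytes.toBytes(joined.toString()));
    ctx.write(null, put);                       // TableOutputFormat applies the Put
  }

  // Wire the reducer to the target table when configuring the batch job:
  public static void configure(Job job) throws IOException {
    TableMapReduceUtil.initTableReducerJob("user_recs", RecsToHBaseReducer.class, job);
  }
}
```

Once the batch job lands its Puts, the online path is just get(userKey) against the same table.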
16
Use Case: Deal Relevance & Personalization @ Groupon
17
What are Groupon Deals?
18
Our Relevance Scenario
[Diagram: users]
19
Our Relevance Scenario
How do we surface relevant deals?
Deals are perishable (deals expire or are sold out)
No direct user intent (as in traditional search advertising)
Relatively limited user information
Deals are highly local
20
Two Sides to the Relevance Problem
Algorithmic Issues: How to find relevant deals for individual users given a set of optimization criteria
Scaling Issues: How to handle relevance for all users across multiple delivery platforms
21
Developing Deal Ranking Algorithms
Exploring Data: Understanding signals, finding patterns
Building Models/Heuristics: Employ both classical machine learning techniques and heuristic adjustments to estimate user purchasing behavior
Conducting Experiments: Try out ideas on real users and evaluate their effect
22
Data Infrastructure
Growing deals: 20+ (2011) → 400+ (2012) → 2000+ (2013)
Growing users: 100 million+ subscribers
We need to store data like user click history, email records, service logs, etc. This turns into billions of data points and TBs of data.
23
Deal Personalization Infrastructure Use Cases
Offline System (Email): Deliver personalized emails. Personalize billions of emails for hundreds of millions of users.
Online System (Web & Mobile): Deliver a personalized website & mobile experience. Personalize one of the most popular e-commerce mobile & web apps for hundreds of millions of users & page views.
24
Architecture
Data Pipeline → Map/Reduce → HBase for Offline System (Email Relevance) → Replication → HBase for Online System (Real-Time Relevance)
We can now maintain different SLAs on online and offline systems.
We can tune each HBase cluster differently for online and offline systems.
25
HBase Schema Design
Row Key: User ID            | Column Family 1                      | Column Family 2
Unique identifier for users | User history and profile information | Email history for users
Most of our data access patterns are via "User Key". This makes it easy to design the HBase schema.
The actual data is kept in JSON.
Overwrite user history and profile info; append email history for each day as a separate column. (On average, each row has over 200 columns.) A sketch of these two write patterns follows below.
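A minimal sketch of the overwrite-vs-append patterns described above; the column family names and the dated qualifier scheme are hypothetical, and values are JSON strings.

```java
// Sketch of the schema described above: overwrite profile info in place, and
// append each day's email record as its own dated column qualifier.
// Family/qualifier names are hypothetical; values are JSON strings.
import java.time.LocalDate;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UserHistoryWriter {
  private static final byte[] PROFILE_CF = Bytes.toBytes("profile");
  private static final byte[] EMAIL_CF = Bytes.toBytes("email");

  // Overwrite: same qualifier every time, so the latest profile wins.
  static void writeProfile(Table table, String userId, String profileJson)
      throws java.io.IOException {
    Put put = new Put(Bytes.toBytes(userId));
    put.addColumn(PROFILE_CF, Bytes.toBytes("info"), Bytes.toBytes(profileJson));
    table.put(put);
  }

  // Append: a new qualifier per day (e.g. "sent_2013-06-14"), so the row
  // accumulates one column per send date -- hundreds of columns over time.
  static void appendEmailRecord(Table table, String userId, String emailJson)
      throws java.io.IOException {
    Put put = new Put(Bytes.toBytes(userId));
    put.addColumn(EMAIL_CF, Bytes.toBytes("sent_" + LocalDate.now()),
                  Bytes.toBytes(emailJson));
    table.put(put);
  }
}
```

Because rows are hash-map-like, the append pattern costs nothing to "schema": each day's column simply appears when it is first written.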
26
Cluster Sizing
Hadoop + HBase Cluster: 100+ machine Hadoop cluster; this runs heavy MapReduce jobs. The same cluster also hosts a 15-node HBase cluster.
Online HBase Cluster: 10-machine dedicated HBase cluster, fed via HBase replication, to serve the real-time SLA.
Machine Profile: 96 GB RAM (25 GB for HBase), 24 virtual CPU cores, 8 x 2 TB disks.
Data Profile: 100 million+ records, 2 TB+ data, over 4.2 billion data points.
27
Questions? Thank You! (We are hiring!) www.groupon.com/techjobs