Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark Gavin Li, Jaebong Kim, Andy Feng Yahoo.

1 Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark Gavin Li, Jaebong Kim, Andy Feng Yahoo

2 Agenda
Audience Expansion Spark application
Spark scalability: problems and our solutions
Performance tuning

3 AUDIENCE EXPANSION How we built audience expansion on Spark

4 Audience Expansion
Train a model to find users who behave similarly to sample users
Find more potential “converters”

5 System
Large-scale machine learning system
Logistic regression
TBs of input data, up to TBs of intermediate data
Hadoop pipeline is using mappers, 2000 reducers, 16 hrs run time
All Hadoop Streaming, ~20 jobs
Use Spark to reduce latency and cost

6 Pipeline
Labeling
– Label positive/negative samples
– 6-7 hrs, IO intensive, 17 TB intermediate IO in Hadoop
Feature Extraction
– Extract features from raw events
Model Training
– Logistic regression phase, CPU bound
Score/Analyze models
– Validate trained models and parameter combinations, select new model
Validation/Metrics
– Validate and publish new model

7 How to adopt Spark efficiently?
Very complicated system
– 20+ Hadoop Streaming MapReduce jobs
– 20k+ lines of code
– TBs of data, person-months to do data validation
– 6+ people, 3 quarters to rewrite the system in Scala from scratch

8 Our migration solution
Build a transition layer that automatically converts Hadoop Streaming jobs to Spark jobs
– No need to change any Hadoop Streaming code
– 2 person-quarters
Private Spark

9 ZIPPO: Hadoop Streaming Over Spark
[Diagram: Audience Expansion Pipeline (20+ Hadoop Streaming jobs), Hadoop Streaming, ZIPPO, Spark, HDFS]

10 ZIPPO
A layer (ZIPPO) between Spark and the application
Implements all Hadoop Streaming interfaces
Migrate the pipeline without rewriting code
Can focus on rewriting performance bottlenecks
Plan to open source
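
The idea of running unmodified streaming jobs on a different engine can be sketched as follows. This is a minimal illustration, not ZIPPO's implementation (the slides do not show it): `run_streaming_command` and `streaming_job` are hypothetical names, an in-process dictionary stands in for Spark's shuffle, and `zlib.crc32` is an assumed partition hash. The mapper and reducer are ordinary commands that read lines on stdin and write tab-separated key/value lines on stdout, which is the Hadoop Streaming contract.

```python
import subprocess
import sys
import zlib
from collections import defaultdict

# Stand-ins for existing streaming scripts: a word-count mapper and reducer.
MAPPER_SRC = r"""
import sys
for line in sys.stdin:
    for word in line.split():
        print(word + "\t1")
"""

REDUCER_SRC = r"""
import sys
from itertools import groupby
rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
for key, group in groupby(rows, key=lambda r: r[0]):
    print(key + "\t" + str(sum(int(r[1]) for r in group)))
"""

def key_of(line):
    """Streaming convention: the key is everything before the first tab."""
    return line.split("\t", 1)[0]

def run_streaming_command(cmd, lines):
    """Feed lines to an external command on stdin and return its stdout lines."""
    data = "".join(line + "\n" for line in lines)
    done = subprocess.run(cmd, input=data, capture_output=True, text=True, check=True)
    return done.stdout.splitlines()

def streaming_job(mapper_cmd, reducer_cmd, input_splits, num_partitions):
    # Map phase: run the unmodified mapper command once per input split
    # and route each output line to a partition by hashing its key.
    buckets = defaultdict(list)
    for split in input_splits:
        for line in run_streaming_command(mapper_cmd, split):
            partition = zlib.crc32(key_of(line).encode("utf-8")) % num_partitions
            buckets[partition].append(line)
    # Reduce phase: sort each partition by key, then pipe it to the reducer.
    output = []
    for partition in sorted(buckets):
        ordered = sorted(buckets[partition], key=key_of)
        output.extend(run_streaming_command(reducer_cmd, ordered))
    return output

mapper = [sys.executable, "-c", MAPPER_SRC]
reducer = [sys.executable, "-c", REDUCER_SRC]
lines_out = streaming_job(mapper, reducer, [["a b a"], ["b c"]], num_partitions=2)
counts = {k: int(v) for k, v in (line.split("\t") for line in lines_out)}
```

The point of the design is that the mapper and reducer commands are never touched; only the layer that feeds them and shuffles their output changes.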

11 ZIPPO - Supported Features
Partition related
– Hadoop Partitioner class (-partitioner)
– Num.map.key.fields, num.map.parition.fields
Distributed cache
– -cacheArchive, -file, -cacheFile
Independent working directory for each task instead of each executor
Hadoop Streaming Aggregation
Input data combination (to mitigate many small files)
Customized OutputFormat, InputFormat

12 Performance Comparison (1 TB data)
ZIPPO Hadoop Streaming
– Spark cluster: 1 hard drive, 40 hosts
– Perf data: 1 hr 25 min
Original Hadoop Streaming
– Hadoop cluster: 1 hard drive, 40 hosts
– Perf data: 3 hrs 5 min

13 SPARK SCALABILITY

14 Spark Shuffle
The mapper side of the shuffle writes all output to disk (shuffle files)
Data can be large, so it cannot all be held in memory
Reducers transfer all the shuffle files for each partition, then process them

15 Spark Shuffle
[Diagram: each mapper writes shuffle files 1 … n; reducer partitions 1 … n each read their shuffle file from every mapper]
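
The shuffle described on these two slides can be modeled in a few lines. This is a simplified in-memory sketch, not Spark's code: `map_side_write` and `reduce_side_fetch` are hypothetical names, Python lists stand in for shuffle files on disk, and `zlib.crc32` is an assumed partition hash.

```python
import zlib
from collections import defaultdict

def partition_of(key, num_partitions):
    """Deterministic hash partitioning of a string key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def map_side_write(records, num_partitions):
    """One mapper: append every record to the shuffle 'file' of its partition."""
    shuffle_files = [[] for _ in range(num_partitions)]
    for key, value in records:
        shuffle_files[partition_of(key, num_partitions)].append((key, value))
    return shuffle_files

def reduce_side_fetch(all_mapper_outputs, partition):
    """One reducer partition: pull its shuffle file from every mapper,
    then aggregate everything in an in-memory hash map."""
    table = defaultdict(int)
    for shuffle_files in all_mapper_outputs:
        for key, value in shuffle_files[partition]:
            table[key] += value
    return dict(table)

mapper_inputs = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("c", 1)],
]
num_partitions = 4
outputs = [map_side_write(records, num_partitions) for records in mapper_inputs]

# Every mapper produces one shuffle file per partition.
total_shuffle_files = len(outputs) * num_partitions

result = {}
for p in range(num_partitions):
    result.update(reduce_side_fetch(outputs, p))
```

Note that the number of shuffle files is mappers × partitions, so it grows quickly as the partition count goes up.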

16 On each Reducer
Every partition needs to hold all the data from all the mappers
– In a hash map
– In memory
– Uncompressed
[Diagram: reducer i with 4 cores; partitions 1 to 4, each holding shuffle data from mappers 1 … n]

17 How many partitions?
Need partitions small enough to fit entirely in memory
[Diagram: Host 1 (4 cores) and Host 2 (4 cores), with partitions 1 … n spread across them]

18 Spark needs many partitions
So a common pattern when using Spark is to have a large number of partitions

19 On each Reducer
For a host with 64 GB memory and a 16-core CPU
For compression ratio 30:1, 2 times overhead
To process 3 TB of data, needs 46k partitions
To process 3 PB of data, needs 46 million partitions
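
The partition counts on this slide can be reproduced with a short calculation. This is a sketch under one assumption that the slide does not spell out: each core works on one partition at a time, and that partition, once decompressed (30:1) and with 2x overhead, must fit in the core's share of host memory. The function name `partitions_needed` is hypothetical.

```python
GB, TB, PB = 2**30, 2**40, 2**50

def partitions_needed(input_bytes, host_memory_bytes, cores_per_host,
                      compression_ratio, overhead_factor):
    """Smallest partition count such that one expanded partition fits
    in a single core's share of host memory (integer ceiling division)."""
    expanded = input_bytes * compression_ratio * overhead_factor * cores_per_host
    return -(-expanded // host_memory_bytes)

# 64 GB host, 16 cores, 30:1 compression, 2x overhead (figures from the slide).
for_3tb = partitions_needed(3 * TB, 64 * GB, 16, 30, 2)
for_3pb = partitions_needed(3 * PB, 64 * GB, 16, 30, 2)
```

With binary units this gives 46,080 partitions for 3 TB, matching the 46k quoted on the following slide, and about 47 million for 3 PB, close to the 46 million quoted here. The total number of hosts does not appear in the formula at all, which is why adding hosts does not reduce the partition count.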

20 Non-Scalable
Not linearly scalable. No matter how many hosts we have in total, we always need 46k partitions

21 Issues of huge number of partitions
Issue 1: OOM on the mapper side
– Each mapper core needs to write to 46k shuffle files simultaneously
– 1 shuffle file = OutputStream + FastBufferStream + CompressionStream
– Memory overhead:
    FD and related kernel overhead
    FastBufferStream (for turning random IO into sequential IO), default 100k buffer per stream
    CompressionStream, default 64k buffer per stream
– So by default total buffer size: 164k * 46k * 16 = 100+ GB

22 Issues of huge number of partitions
Our solution to mapper OOM
– Set spark.shuffle.file.buffer.kb to 4k for FastBufferStream (kernel block size)
– Based on our contributed patch https://github.com/mesos/spark/pull/685
    Set spark.storage.compression.codec to spark.storage.SnappyCompressionCodec to enable Snappy and reduce the footprint
    Set spark.snappy.block.size to 8192 to reduce buffer size (Snappy still achieves a good compression ratio)
– Total buffer size after this: 12k * 46k * 16 = 10 GB
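
The buffer totals before and after tuning can be checked with a few lines of arithmetic. This is a sketch only: `mapper_buffer_bytes` is a hypothetical helper, 46k is taken as 46,000, and the 12k per-file figure is read here as the 4k file buffer plus the 8192-byte Snappy block, which is an interpretation of the slide rather than something it states.

```python
KB = 1024
GB = 10**9  # decimal gigabytes, for comparison with the rounded slide figures

def mapper_buffer_bytes(per_file_buffer_bytes, shuffle_files_per_core, cores):
    """Total buffer memory when every core keeps all shuffle files open."""
    return per_file_buffer_bytes * shuffle_files_per_core * cores

# Defaults: 100k FastBufferStream + 64k CompressionStream per shuffle file.
default_total = mapper_buffer_bytes((100 + 64) * KB, 46_000, 16)

# Tuned: 4k file buffer + 8k Snappy block per shuffle file (assumed split of 12k).
tuned_total = mapper_buffer_bytes((4 + 8) * KB, 46_000, 16)

default_gb = default_total / GB
tuned_gb = tuned_total / GB
```

Under these assumptions the default comes to roughly 124 GB (the slide's "100+ GB") and the tuned setup to roughly 9 GB (the slide's rounded 10 GB), a reduction of about 14x.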

23 Issues of huge number of partitions
Issue 2: large number of small files
– Each input split in the mapper is broken down into at least 46k partitions
– A large number of small files causes lots of random R/W IO
– When each shuffle file is less than 4k (kernel block size), overhead becomes significant
– Significant metadata overhead in the FS layer
– Example: just manually deleting the whole tmp directory can take 2 hours because there are so many small files
– Especially bad when splits are not balanced
– 5x slower than Hadoop
[Diagram: input splits 1 … n, each writing its own set of shuffle files 1, 2, 3, …]

24 Reduce side compression
Currently, shuffle data held in memory on the reducer side is not compressed
Can take 30 times more memory
With our patch https://github.com/mesos/spark/pull/686, we reduced memory consumption by 30x, while compression overhead is less than 3%
Without this patch it doesn’t work for our case
5x-10x performance improvement
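
The idea behind reduce-side compression can be sketched as follows: keep fetched shuffle blocks compressed in memory and expand only one block at a time while aggregating. This is an illustration of the technique, not the code of the patch. The names `encode_block` and `aggregate_compressed` are hypothetical, and `zlib` from the Python standard library is swapped in for Snappy.

```python
import zlib
from collections import defaultdict

def encode_block(records):
    """Serialize (key, value) records as tab-separated lines and compress them."""
    text = "".join(f"{key}\t{value}\n" for key, value in records)
    return zlib.compress(text.encode("utf-8"))

def aggregate_compressed(blocks):
    """Sum values per key, decompressing one block at a time so that only
    a single block is ever held uncompressed."""
    totals = defaultdict(int)
    for block in blocks:
        text = zlib.decompress(block).decode("utf-8")
        for line in text.splitlines():
            key, value = line.split("\t")
            totals[key] += int(value)
    return dict(totals)

records = [(f"user{i % 100}", 1) for i in range(10_000)]
blocks = [encode_block(records[i:i + 1000]) for i in range(0, len(records), 1000)]

raw_bytes = sum(len(f"{key}\t{value}\n") for key, value in records)
compressed_bytes = sum(len(block) for block in blocks)
totals = aggregate_compressed(blocks)
```

The trade-off is extra CPU for decompression in exchange for a much smaller resident footprint; the actual ratio depends on the data and codec.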

25 Reduce side compression
Reducer side
– Compression: 1.6k files
– No compression: 46k shuffle files

26 Reducer Side Spilling
[Diagram: reduce feeds compression buckets 1 … n, which are written out as spills 1 … n]

27 Reducer Side Spilling
Spills over-size data in the aggregation hash table to disk
Spilling: more IO, more sequential IO, fewer seeks
All in memory: less IO, more random IO, more seeks
Fundamentally resolved Spark’s scalability issue
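
A spilling aggregation of the kind described here can be sketched as below. This is a minimal illustration, not Spark's implementation: `aggregate_with_spilling` is a hypothetical name, and the memory limit is simplified to a maximum number of keys in the hash table. When the table grows past the limit it is written to disk as a sorted run, and the runs are merged sequentially at the end.

```python
import heapq
import os
import tempfile
from itertools import groupby

def spill(table, spill_dir):
    """Write the in-memory table to disk as one key-sorted run; return its path."""
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".spill")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for key in sorted(table):
            f.write(f"{key}\t{table[key]}\n")
    return path

def read_run(path):
    """Stream (key, value) pairs back from a spilled run in sorted order."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t")
            yield key, int(value)

def aggregate_with_spilling(records, max_keys_in_memory):
    """Sum values per key while holding at most max_keys_in_memory + 1 keys."""
    with tempfile.TemporaryDirectory() as spill_dir:
        table, runs = {}, []
        for key, value in records:
            table[key] = table.get(key, 0) + value
            if len(table) > max_keys_in_memory:
                runs.append(spill(table, spill_dir))
                table = {}
        if table:
            runs.append(spill(table, spill_dir))
        # Sequential k-way merge of sorted runs; equal keys become adjacent.
        merged = heapq.merge(*(read_run(path) for path in runs))
        return {key: sum(value for _, value in group)
                for key, group in groupby(merged, key=lambda kv: kv[0])}

records = [(f"k{i % 7}", 1) for i in range(70)]
result = aggregate_with_spilling(records, max_keys_in_memory=3)
```

Because the runs are sorted, the final merge reads each file front to back, which is the "more sequential IO, fewer seeks" trade-off from the slide.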

28 Align with previous Partition function
Our input data comes from another MapReduce job
We use exactly the same hash function to reduce the number of shuffle files

29 Align with previous Partition function
New hash function, more even distribution
[Diagram: the previous job generating the input data partitions keys with mod 4 (keys 0, 4, 8…; 1, 5, 9…; 2, 6, 10…; 3, 7, 11…). The Spark job partitions with mod 5, so each input data file writes shuffle files 0 to 4]

30 Align with previous Partition function
Use the same hash function
[Diagram: the previous job generating the input data partitions keys with mod 4 (keys 0, 4, 8…; 1, 5, 9…; 2, 6, 10…; 3, 7, 11…). The Spark job also partitions with mod 4, so each input data file writes 1 shuffle file]
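
The effect shown in these two diagrams can be reproduced with a small simulation. This is a sketch with integer keys and plain modulo partitioning, as in the diagrams; the function name `shuffle_files_per_split` is hypothetical.

```python
def shuffle_files_per_split(upstream_partitions, spark_partitions, num_keys=1000):
    """For each input split produced by the upstream job (key % upstream_partitions),
    count how many distinct shuffle files the Spark mapper has to write."""
    counts = []
    for split in range(upstream_partitions):
        keys = [k for k in range(num_keys) if k % upstream_partitions == split]
        targets = {k % spark_partitions for k in keys}
        counts.append(len(targets))
    return counts

# New hash function (mod 5) on input partitioned by mod 4:
# every split scatters its keys over all 5 shuffle files.
misaligned = shuffle_files_per_split(4, 5)

# Same function as the upstream job (mod 4): one shuffle file per split.
aligned = shuffle_files_per_split(4, 4)
```

In the misaligned case each of the 4 splits writes 5 shuffle files (20 in total); in the aligned case each split writes exactly 1 (4 in total).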

31 Align with previous Hash function
Our case:
– 16M shuffle files, 62 KB on average (5-10x slower)
– 8k shuffle files, 125 MB on average
Several different input data sources
Partition function from the major one

32 PERFORMANCE TUNING

33 All About Resource Utilization
Maximize resource utilization
Use as much CPU, memory, disk, and network as possible
Monitor with vmstat, iostat, sar

34 Resource Utilization
(This is an old diagram, to be updated)

35 Resource Utilization
Ideally CPU/IO should be fully utilized
Mapper phase: IO bound
Final reducer phase: CPU bound

36 Shuffle file transfer
Spark transfers all shuffle files to reducer memory before it starts processing
Non-streaming (very hard to change to streaming)
Makes for poor resource utilization
– So need to make sure maxBytesInFlight is set big enough
– Consider allocating 2x more threads than the number of physical cores

37 Thanks.
Gavin Li, Jaebong Kim, Andrew Feng

