Presentation is loading. Please wait.

Presentation is loading. Please wait.

Flexible Distributed Reporting for Millions of Publishers and Thousands of Advertisers Berlin | 22.02.2019.

Similar presentations


Presentation on theme: "Flexible Distributed Reporting for Millions of Publishers and Thousands of Advertisers Berlin | 22.02.2019."— Presentation transcript:

1 Flexible Distributed Reporting for Millions of Publishers and Thousands of Advertisers
Berlin |

2 TABLE OF CONTENTS ABOUT zanox AND Me Reporting in general
QUERY ROUTING QUESTIONS Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

3 THE ZANOX NETWORK Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

4 WHO am i? Senior Architect at Zanox
6+ years experience in building distributed search systems based on Lucene Lucid Certified Apache Solr/Lucene Developer 4+ years of using Hadoop for mining billions of views and millions of clicks Cloudera Certified Developer for Apache Hadoop I have applied different machine-learning techniques mainly to optimise resource usage while performing distributed search during my PhD See my book: “Beyond Centralised Search Engines An Agent-Based Filtering Framework” How can you reach me? Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

5 ABOUT zanox and me Reporting in general QUERY ROUTING QUESTIONS
Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

6 QUERY ROUTING Reporting in general 01:00 00:00 02:00 05:00 06:00 21:00
04:00 03:00 00:00 18:00 03:00 06:00 15:00 09:00 12:00 00:10:20 00:10:40 00:11:00 00:10:00 00:11:20 00:00:00 00:11:40 00:02:00 00:04:00 00:06:00 00:08:00 00:12:00 00:22:00 00:24:00 00:26:00 00:20:00 00:28:00 00:18:00 00:14:00 00:16:00 00:00:00.065 00:00:00.060 00:00:00.050 00:00:00.070 00:00:00.055 00:00:00.080 00:00:00.045 00:00:00.090 00:00:00.085 00:00:00.075 00:00:00.020 00:00:00.010 00:00:00.000 00:00:00.005 00:00:00.015 00:00:00.095 00:00:00.035 00:00:00.030 00:00:00.025 00:00:00.040 00:00:00.200 00:00:00.900 00:00:00.850 00:00:00.800 00:00:00.750 00:00:00.910 00:00:00.920 00:00:00.100 00:00:00.950 00:00:00.940 00:00:00.700 00:00:00.930 00:00:00.500 00:00:00.300 00:00:00.400 00:00:00.600 00:00:00.650 QUERY ROUTING Tracking Map Reduce Distributed Reporting JBoss Data-Import Data producers are pushing data during a whole day MR Jobs Reporting jobs are started at midnight It takes around 6 hours to process data from yesterday Index loading normally lasts no more that 30 minutes Reporting There are multiple search clusters for redundancy Every cluster has one merger that coordinates searchers The health of every cluster is ensured by observer Searchers are grouped based on their capabilities Within 100ms queries are propagated to best searchers Capable to provide reports in sub-second time T D D D D T D D D D S R S S S C T D D D T Name Merger Role Knows how to build sub-queries for searchers Able to assign indexes to searchers Aware of available searchers Design RPC Server Report aggregation component Sorting and master-data management Name Searcher Role Knows how to build reports for queries Able to load assigned indexes from HDFS Design RPC Server Query optimiser Lucene searcher with RAM or FS directories Report aggregation component D D D Name Google Yahoo Miva Role Provides search engine cost data Design … Name Database server FTP server Log server Role Provides data needed for reporting Design … Name Tracking Server Role Tracks events such as sales, clicks, views … Writes tracked events to HDFS Design Proprietary application Name Observer Role Checks the health of mergers and searchers Automatic deployment of clusters Repair unhealthy mergers and searchers Provide information about observed clusters Design RPC Server Name Client Role Keep track of available mergers Rerouting of failed queries Take care of load when assigning queries Design JBoss application RPC client for communication with mergers S R S R S S R M Q Q Q Q Q C T Q D T D D D D D D D S R S R S S OK OK O T D D D T D D D C S S S S S S S S C M Custom D D D D D D D D D S S S S O C I I I D D I I I I I I I I I D D D D D D D D I I I I I I I I I S S S S L L C D D D D D D D S S S S M G F I I I I I I I I I I I I I I I D D D D D D D C S S S S O Y D D D D M D D D D Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

7 ABOUT zanox and me Reporting in general QUERY ROUTING Questions
Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

8 QUERY ROUTING H M M H Indexing Routing Estimation Analytics
Distributed Reporting M S O M S O S S S S S S S S M S S S S O Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

9 Auto insurance company
QUERY ROUTING H M M H Indexing Routing Estimation Analytics Date Publisher Advertiser Keyword Commission :25:54.650 :00:00.000 Jim HUK Auto insurance company 23.56 € M :00:00.000 :05:15.762 Tom DEVK Health insurance 9.24 € S S :00:00.000 :29:31.987 Tom DEVK Health insurance 43.81 € S S S S :00:00.000 :09:39.250 Jim HUK Cheap auto insurance 7.22 € S S :00:00.000 :05:14.098 Jim HUK Auto insurance company 73.05 € 43.07 € 43.07 € 43.07 S S :15:32.870 :00:00.000 Jim HUK Auto insurance company 29.98 € S S :00:00.000 :25:59.990 Tom DEVK Health insurance 12.03 € :00:00.000 :20:05.030 Jim DEVK Cheap auto insurance 9.72 € Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

10 QUERY ROUTING H M M H Indexing Routing Estimation Analytics M S S S S
Date Publisher Advertiser Keyword Commission :00:00.000 Jim HUK Auto insurance company 23.56 € M :00:00.000 Tom DEVK Health insurance 9.24 € 9.24 € 53.05 € 9.24 S S :00:00.000 Tom DEVK Health insurance 43.81 € S S S S :00:00.000 Jim HUK Cheap auto insurance 7.22 € S S :00:00.000 Jim HUK Auto insurance company 73.05 € S S S S :00:00.000 Tom DEVK Health insurance 12.03 € :00:00.000 Jim DEVK Cheap auto insurance 9.72 € Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

11 QUERY ROUTING H M M H Indexing Routing Estimation Analytics M S S S S
Date Publisher Advertiser Keyword Commission Jim HUK Auto insurance company 23.56 € 96.61 € 23.56 € 23.56 M Tom DEVK Health insurance 65.08 € +12.03 53.05 € 53.05 € 53.05 S S S S S S Jim HUK Cheap auto insurance 7.22 € S S Jim HUK Auto insurance company 73.05 € S S S S Tom DEVK Health insurance 12.03 € Jim DEVK Cheap auto insurance 9.72 € Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

12 QUERY ROUTING H M M H Indexing Routing Estimation Analytics M S S S S
Date Publisher Advertiser Keyword Commission Jim HUK Auto insurance company 23.56 € 30.78 € 23.56 € + 7.22 23.56 M Tom DEVK Health insurance 53.05 € S S S S S S Jim HUK Cheap auto insurance 7.22 € S S Jim HUK Auto insurance company 73.05 € S S S S Tom DEVK Health insurance 12.03 € Jim DEVK Cheap auto insurance 9.72 € Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

13 QUERY ROUTING H M M H Indexing Routing Estimation Analytics
MR Indexing Aggregated records are saved as Lucene indexes Optimised mappers for simultaneous aggregation Partitioners enable parallel building of different indexes Searcher Group Merger has several searcher groups on its disposal Searchers from same group have same group fields Date aggregation makes searchers different Date Field Aggregated date value is stored in this field Several aggregation levels are available (Hour, Day,…) Make possible to optimise searcher selection Multiple searchers can have same date aggregation Date Publisher Advertiser Keyword Commission Jim HUK 30.78 € M Tom DEVK 53.05 € 65.08 € 53.05 € +12.03 53.05 S S S S S S S S Jim HUK 73.05 € 30.78 € 30.78 S S S S Tom DEVK 12.03 € Jim DEVK 9.72 € Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

14 Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

15 QUERY ROUTING H M M H Indexing Routing Estimation Analytics 1st Group
2nd Group 3rd Group M Publisher Publisher Publisher S S Advertiser Advertiser S S Keyword Keyword S S Hour S S Day Hour S S 5-Days Day Hour S S Month Day Day Year Month Month Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

16 QUERY ROUTING 2012-09-04 2012-10-01 2012-11-01 2012-11-08 2012-09-04
H M M H Indexing Routing Estimation Analytics Q 1st Group 2nd Group 3rd Group M GROUP BY GROUP BY CHECK Publisher Publisher Publisher Advertiser Advertiser Advertiser Keyword FILTER CHECK Keyword Keyword DAY FILTER S Hour PUBLISHER GROUP SELECTION Jim S Day S Hour MONTH FROM S 5-Days S Day S Hour S Month S Day S Day TO SEARCHER SELECTION DAY S Year S Month S Month Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

17 OR OR Exactly what Query needs Post aggregation is needed
QUERY ROUTING H M M H Indexing Routing Estimation Analytics OR OR Date Publisher Advertiser Keyword Commission Jim HUK Auto insurance company Cheap auto insurance 7.22 € 23.56 € 30.78 € 3rd Group 1st Group Q Advertiser FILTER GROUP BY Jim PUBLISHER TO FROM Q Exactly what Query needs 1st Group Date Publisher Advertiser DAY 1st Group 2nd Group 3rd Group M GROUP BY GROUP BY CHECK Publisher Publisher Publisher Advertiser 5-DAYS Advertiser Advertiser FILTER CHECK Both 1st and 3rd group can be used, but which one is better? Keyword Keyword MONTH FILTER S Hour PUBLISHER GROUP SELECTION Jim 3rd Group Post aggregation is needed S Day S Hour Date Publisher Advertiser Keyword 5-DAYS FROM S 5-Days S Day S Hour S Month S Day S Day TO SEARCHER SELECTION DAY S Year S Month S Month Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

18 Available Data 01.01.2012 – present
QUERY ROUTING H M M H Indexing Routing Estimation Analytics Q 0.39 0.42 0.19 1st Group 2nd Group 3rd Group M GROUP BY GROUP BY CHECK Group Selection is faced with a very hard challenge because: Each and every Group passed both group-by and filter checks, There is no Group that offers exactly what is requested by query, Solution is to somehow estimate the applicability of every group. Publisher Publisher Publisher Advertiser Advertiser FILTER CHECK Keyword Keyword DAY FILTER S Hour PUBLISHER GROUP SELECTION Jim S Day Aggregation Day Available Data – present S Hour MONTH FROM S 5-Days S Day S Hour S Month S Aggregation Day Available Data – Day S Day TO SEARCHER SELECTION DAY S Year S Month S Month Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

19 Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

20 QUERY ROUTING H M M H Indexing Routing Estimation Analytics 1st Group
2nd Group 3rd Group M 2nd Group S S S 3rd Group S S S S S S S S S Berlin | 07/02/2012 | zanox | Presentation title

21 Exhaustive Estimation based on number of Hits
QUERY ROUTING H M M H Indexing Routing Estimation Analytics Exhaustive Estimation based on number of Hits 1st Group EQ EQ EQ EQ EQ EQ EQ EQ EQ EQ EQ EQ Q M Merger creates Estimation Query to asks every single Searcher in each Group to provide information about how many records it has to aggregate for given query ER S ER S 1st Group 2nd Group 3rd Group ER S ER S S ER 1712 847 2nd Group EQ 1398 984 3065 781 2156 ER S S ER S ER 3rd Group Every Searcher only retrieves the number of hits and sends this back as Estimation Result 3026 2696 5221 S ER ER 0.39 0.42 0.19 ER S ER S ER S Berlin | 07/02/2012 | zanox | Presentation title

22 Selective Estimation based on number of Hits
QUERY ROUTING H M M H Indexing Routing Estimation Analytics Selective Estimation based on number of Hits 1st Group EQ EQ EQ EQ EQ EQ EQ EQ Q M Merger creates Estimation Query to asks every single Searcher only in selected groups of searchers about how many records it has to aggregate for given query ER S ER S 1st Group 2nd Group 3rd Group ER S S ER S ER 1712 847 2nd Group EQ 1398 984 781 ER S S ER S ER 3rd Group Selected Searchers retrieve the number of hits and send this back as their estimation 3026 2696 S ER 0.47 0.53 S S S Berlin | 07/02/2012 | zanox | Presentation title

23 Rule Estimation based on unneeded fields
QUERY ROUTING H M M H Indexing Routing Estimation Analytics Rule Estimation based on unneeded fields 1st Group 1st Group Publisher M Q Merger creates estimation based on rules that can be as simple as being based on the cardinality of fields that have to be aggregated due to not being requested by query S Advertiser S FILTER GROUP BY TO FROM Jim PUBLISHER 1st Group 2nd Group 3rd Group S S S Advertiser Keyword Advertiser 92328 76854 92328 2nd Group 2nd Group Publisher Keyword S Keyword 76854 S S 3rd Group 3rd Group Publisher Hadoop can provide further insight into underlying data and improve estimation rules 92328 76854 169182 Advertiser S 0.39 0.42 0.19 S Keyword S S Berlin | 07/02/2012 | zanox | Presentation title

24 Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

25 There are so many groups but no one has fields
QUERY ROUTING H M M H Indexing Routing Estimation Analytics Creating a separate searcher group for each and every possible group-field combination in query is simply not feasible because number of searcher-groups grows extremely fast (exponentially) We have created separate searcher group for every possible combination of group-fields. Therefore we are well prepared for each query that can hit us. 1st Group There are so many groups but no one has fields Website Publisher Keyword 7th Group Keyword 10th Group Website 13th Group Website Publisher Publisher M Q Advertiser GROUP BY Website Number of group fields Number of searcher groups 3 23 – 1 = 7 4 24 – 1 = 15 9 29 – 1 = 511 2nd Group Keyword Publisher 6th Group Advertiser Keyword 9th Group Website Advertiser Publisher 12th Group Website Advertiser Publisher FILTER Insurance KEYWORD Jim PUBLISHER HUK ADVERTISER Keyword But we are attacked where we haven’t thought that it is possible. One new field have been requested and that has spoiled everything 3rd Group 4th Group Publisher 5th Group Advertiser 8th Group Advertiser Website Keyword Publisher 11th Group Website Keyword Advertiser Solution is to analyze already processed queries to discard low performing groups and to create new ones for emerging queries Keyword Berlin | 07/02/2012 | zanox | Presentation title

26 There are so many groups but no one has fields
QUERY ROUTING H M M H Indexing Routing Estimation Analytics 1st Group queries 451236 There are so many groups but no one has fields Website Publisher Keyword 7th Group Keyword queries 4 10th Group Website queries 5644 13th Group Website Publisher queries 15 Publisher M Q FILTER GROUP BY Jim PUBLISHER Keyword HUK ADVERTISER Insurance KEYWORD Website Publisher Advertiser response 45.54 ms response ms response 29.76 ms response ms 2nd Group queries 9 14th Group Website Publisher Keyword 6th Group Advertiser Keyword queries 32 9th Group Website Advertiser Publisher queries 543765 12th Group Website Advertiser queries 89 Publisher response 15.54 ms response 87.03 ms response 35.87 ms response ms Keyword 3rd Group queries 221128 4th Group Publisher queries 5th Group Advertiser Publisher queries 236 8th Group Advertiser Website Keyword queries 2 11th Group Website Keyword queries 228732 Advertiser response 82.87 ms response 23.54 ms response 92.33 ms response ms response 61.03 ms Keyword Berlin | 07/02/2012 | zanox | Presentation title

27 There is no group that is optimized for queries having fields
QUERY ROUTING H M M H Indexing Routing Estimation Analytics We optimized system by adding specialized group that is excellent for found slow queries 1st Group queries 20236 There is no group that is optimized for queries having fields Publisher Keyword Publisher M Q Advertiser response 45.54 ms GROUP BY Website = 9th Group queries 91234 Publisher 14th Group Website Publisher Keyword 15th Group Keyword Publisher queries 32546 Website Fields Website queries 9874 Advertiser FILTER HUK ADVERTISER Insurance KEYWORD Jim PUBLISHER response 15.54 ms response ms Publisher Publisher response ms Keyword 3rd Group queries 221128 4th Group Publisher queries 286136 11th Group Website Keyword Publisher queries 22336 Fields queries 22678 Advertiser response 82.87 ms response 23.54 ms response 92.33 ms Publisher Keyword response ms Keyword Berlin | 07/02/2012 | zanox | Presentation title

28 ABOUT zanox and me Reporting in general QUERY ROUTING questions
Berlin | 22/02/2019 | zanox | Zanox Reporting Systems

29 QUESTIONS Berlin | 22/02/2019 | zanox | Zanox Reporting Systems


Download ppt "Flexible Distributed Reporting for Millions of Publishers and Thousands of Advertisers Berlin | 22.02.2019."

Similar presentations


Ads by Google