Published by Trevin Shaker
Modified over 2 years ago
Simply Top Talkers
Jeroen Massar, Andreas Kind and Marc Ph. Stoecklin
IBM Research - Zurich
© 2009 IBM Corporation

© 2010 IBM Corporation, IBM Research – Zurich
Motivation and Outline
Need to understand and correctly handle dominant aspects within the overall flow of traffic
– Top-k problem
– Optimize peering relationships (top autonomous systems)
– Analyze congestion cases (top hosts)
Technical challenge
– Compute sorted top-n views from very large numbers of flow information records
– No possibility to store individual counters per aspect component
This presentation
– Proposes and analyzes simple techniques to compute top-k listings for single and composed traffic aspects
– Shows that the techniques are suitable for in-memory aggregation databases

Related Work
– E. D. Demaine, A. Lopez-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proceedings of the 10th Annual European Symposium on Algorithms, pages 348–360.
– C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270–313.
– A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the 2005 International Conference on Database Theory (ICDT05), Edinburgh, UK, pages 398–412, 2005.
– X. Dimitropoulos, P. Hurley, and A. Kind. "Probabilistic Lossy Counting: An Efficient Algorithm for Finding Heavy Hitters," ACM SIGCOMM Computer Communication Review, Jan
– X. Dimitropoulos, M. Stoecklin, P. Hurley, and A. Kind. "The Eternal Sunshine of the Sketch Data Structure," Elsevier Computer Networks, 2008.

Top-k Problem
Large number of unique items per observation period
– Large alphabet
– Counters and usage arrays don't fit into memory
How to cut back the sorted list?
– Correct ordering
– Correct volume information for each item
– Reduce the cost of inserting new items (that would not make it into the sorted top-k list anyway)
[Figure: a primary observation period (e.g. 5 minutes) with a list of counters and usage variation information (i.e., arrays) for all unique items (i.e., colors) seen during the period, alongside secondary observation periods (e.g. hours, weeks, months, years)]

One Threshold Scheme (max1)
Building the primary list
– Start: read next item record
– Entry exists for item in list? If yes, update item in list; otherwise create item in list
Merging primary list into secondary list
– Start: take next primary list
– Add or update entries in list
– |IP list| > max1? If yes, cut list back to max1
Notes
– The primary list can grow up to the number of unique items
– Secondary lists are limited by max1
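
The one threshold scheme can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, records are assumed to be (item, byte count) pairs, and the lists are modeled with Python's collections.Counter.

```python
from collections import Counter

# Hypothetical names and record layout; the slides do not specify either.
# Each record is assumed to be an (item, byte_count) pair.

def build_primary(records):
    """Aggregate one primary observation period; this list is unbounded."""
    primary = Counter()
    for item, nbytes in records:
        primary[item] += nbytes  # updates the entry, creating it if missing
    return primary

def merge_into_secondary(secondary, primary, max1):
    """Add or update primary entries in the secondary list, then cut back."""
    merged = Counter(secondary)
    merged.update(primary)  # Counter.update adds the counts per key
    if len(merged) > max1:
        # Cut back: keep only the max1 largest entries.
        merged = Counter(dict(merged.most_common(max1)))
    return merged
```

In this sketch, an item that was cut from the secondary list in an earlier period starts again from zero when it reappears, so its reported volume can be lower than its true volume.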

Two Threshold Scheme (max1, max2)
Building the primary list
– Start: read next item record
– Entry exists for item in list? If yes, update item in list
– Otherwise: |list| > max2? If yes, cut list back to max1; then add item in list
Merging primary list into secondary list
– Start: take next primary list
– Add or update entries in list
– |IP list| > max1? If yes, cut list back to max1
Notes
– Secondary lists are limited by max1
– The primary list is limited by max2
– The cut back goes below max2 (down to max1) to reduce the number of costly cut back operations (incl. sorting)
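
A possible sketch of the primary list under the two threshold scheme is shown below. The names are hypothetical, records are assumed to be (item, byte count) pairs, and the cut is assumed to happen just before a new item is added once the list already holds max2 entries, so that the list never exceeds max2; the slide's flowchart may order these steps slightly differently.

```python
# Hypothetical names; records are assumed to be (item, byte_count) pairs.

def cut_back(counters, limit):
    """Keep only the `limit` largest counters (sorting is the costly step)."""
    top = sorted(counters.items(), key=lambda kv: kv[1], reverse=True)[:limit]
    return dict(top)

def build_primary_two_thresholds(records, max1, max2):
    """Build a primary list bounded by max2, cutting back to max1 when full."""
    primary = {}
    cuts = 0  # number of cut back operations, for illustration only
    for item, nbytes in records:
        if item in primary:
            primary[item] += nbytes  # entry exists: update it
            continue
        # Assumption: cut before inserting so the list stays within max2.
        if len(primary) >= max2:
            primary = cut_back(primary, max1)
            cuts += 1
        primary[item] = nbytes  # add the new item
    return primary, cuts
```

Because each cut frees max2 - max1 slots, at least that many new items must arrive before the next sort is needed, which is the cost saving the scheme aims for.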

Two Threshold Scheme (max1, max2) with Heuristic
Building the primary list
– Start: read next item record
– Entry exists for item in list? If yes, update item in list
– Otherwise: |list| > max1? If yes, accept record? If the record is not accepted, read the next record
– |list| > max2? If yes, cut list back to max1; then add item in list
– Use a heuristic to avoid inserting record information that is likely to have no impact on the top items
Merging primary list into secondary list
– Start: take next primary list
– Add or update entries in list
– |IP list| > max1? If yes, cut list back to max1
Notes
– Secondary lists are limited by max1
– The primary list is limited by max2
– The cut back goes below max2 (down to max1) to reduce the number of costly cut back operations (incl. sorting)
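
The heuristic variant might look like the sketch below. It is one reading of the flowchart, not a confirmed implementation: the function name, the dictionary record layout (keys "item", "bytes", "packets") and the accept callback are all assumptions made for illustration.

```python
# Hypothetical names and record layout (dicts with "item", "bytes", "packets").

def build_primary_with_heuristic(records, max1, max2, accept):
    """Two threshold primary list that filters new items via `accept`."""
    primary = {}
    skipped = 0  # records dropped by the heuristic, for illustration only
    for record in records:
        item, nbytes = record["item"], record["bytes"]
        if item in primary:
            primary[item] += nbytes  # existing entries are always updated
            continue
        # Only consult the heuristic once the list has grown past max1.
        if len(primary) > max1 and not accept(record):
            skipped += 1
            continue
        # Assumption: cut before inserting so the list stays within max2.
        if len(primary) >= max2:
            top = sorted(primary.items(), key=lambda kv: kv[1], reverse=True)
            primary = dict(top[:max1])
        primary[item] = nbytes
    return primary, skipped
```

A call could pass, for example, `accept=lambda r: r["packets"] >= 4` to drop new items whose first record has a low packet count.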

Heuristics
0. Exact max1, max2 style: no heuristic
1. Short duration (< 1 second)
2. Low packet count in one flow (< 4 packets)
3. Bytes < total_bytes_seen / total_flows_seen (averaging)
These heuristics are very simple and, more importantly, cheap to implement in terms of memory and CPU.
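
The three heuristics could be written as small predicates that return True when a record is considered unlikely to matter. The Flow type, its field names, and the choice to include the current flow in the running average are assumptions for this sketch; the slides only give the thresholds.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    # Hypothetical record layout; field names are not from the slides.
    item: str
    bytes: int
    packets: int
    duration_s: float

def is_short(flow):
    """Heuristic 1: flow lasted less than one second."""
    return flow.duration_s < 1.0

def is_low_packet(flow):
    """Heuristic 2: flow carried fewer than four packets."""
    return flow.packets < 4

class BelowAverageBytes:
    """Heuristic 3: bytes below total_bytes_seen / total_flows_seen."""

    def __init__(self):
        self.total_bytes_seen = 0
        self.total_flows_seen = 0

    def __call__(self, flow):
        # Assumption: the current flow is counted before comparing.
        self.total_bytes_seen += flow.bytes
        self.total_flows_seen += 1
        return flow.bytes < self.total_bytes_seen / self.total_flows_seen
```

Each predicate needs at most two running totals, which is consistent with the slide's point that the heuristics cost little memory and CPU.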

Data Set
– We picked three days of data collected at one of the IBM datacenters
– We ran an expensive query: calculate the exact top-k entries
– Fortunately there are machines with large amounts of memory and many cores
– Then we applied the defined heuristics to the data set; these run in near real time, as we don't need a lot of memory or CPU power to store all the entries
– One of the biggest advantages is that one can almost stay in cache on suitable CPUs (4 or 8 MiB cache per core)
– Results on the next slides
– Note that the dataset represents several petabytes of network traffic; the error rate in comparison is thus effectively very low

Error rate for Top K without heuristic
[Figure: amount of bytes that are not seen for the top X items]
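
One plausible way to compute "bytes not seen for the top X items" is to compare exact per-item byte counts with the counts from a bounded list. The slides do not define the metric precisely, so the function name and the exact definition below are assumptions.

```python
# Hypothetical metric; `exact` and `approx` map item -> byte count.

def missing_bytes(exact, approx, x):
    """Return (missing bytes, missing fraction) for the exact top-x items."""
    top = sorted(exact, key=exact.get, reverse=True)[:x]
    # Items absent from the approximate list count as fully missing.
    missing = sum(exact[item] - approx.get(item, 0) for item in top)
    total = sum(exact[item] for item in top)
    fraction = missing / total if total else 0.0
    return missing, fraction
```

For instance, with exact counts {"a": 1000, "b": 500, "c": 100} and approximate counts {"a": 990, "c": 100}, the top 2 items are missing 510 of 1500 bytes.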

Error rate for Top K when applying various heuristics

Conclusion
Simple heuristics and thus cheap processing
– Differences are minimal for this datacenter dataset. Other datasets, e.g. for an enterprise network, seem to prefer the bytes < avg heuristic a bit more, which makes sense: it is hard to get into the top k unless you can at least beat the average.
Two thresholds
– max2 = 2 * max1
– Correct order up to k = X * max1

AURORA
Questions?