The Stream Star Schema. Stephen A. Broeker.

1 The Stream Star Schema. Stephen A. Broeker.

2 Conclusion: The Stream Star Schema processes data streams in real time, up to gigabits per second. Stream Star performance is O(1).

4 Large Fast Dynamic Data Streams: Phone calls, road traffic, network traffic, website traffic, power supplies, credit card transactions, sensor arrays, and financial markets are data rich but poor in real-time analysis.

5 Large Fast Dynamic Data Streams: What are the consequences?

6 Large Fast Dynamic Data Streams: Patterns are hard to see, and therefore problems are difficult to detect.

7 Challenge of Network Monitoring: Network monitoring at high speed is difficult. Bits arrive every nanosecond on a 1 Gbps NIC, and minimum-size packets can arrive as often as every 672 ns. Per-packet processing must use SRAM. The traditional solution, sampling, is inherently inaccurate because it discards data.
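The timing pressure on a 1 Gbps link can be worked out directly. The sketch below is an illustration only: it assumes the decimal convention (1 Gbps = 10^9 bits per second) and the standard minimum Ethernet frame of 64 bytes plus 8 bytes of preamble and a 12-byte inter-frame gap. The function names are made up for this example and do not come from the slides.

```python
# Assumption: decimal convention, 1 Gbps = 10**9 bits per second.
LINK_RATE_BPS = 1_000_000_000

# Smallest Ethernet frame as it occupies the wire:
# 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap.
MIN_FRAME_WIRE_BYTES = 64 + 8 + 12


def bit_time_ns(rate_bps: int) -> float:
    """Time to transmit one bit, in nanoseconds."""
    return 1e9 / rate_bps


def min_frame_interval_ns(rate_bps: int) -> float:
    """Shortest possible gap between packet arrivals, in nanoseconds."""
    return MIN_FRAME_WIRE_BYTES * 8 * bit_time_ns(rate_bps)


def max_packets_per_second(rate_bps: int) -> float:
    """Worst-case packet rate when every frame is minimum size."""
    return rate_bps / (MIN_FRAME_WIRE_BYTES * 8)


if __name__ == "__main__":
    print("bit time (ns):", bit_time_ns(LINK_RATE_BPS))
    print("min frame interval (ns):", min_frame_interval_ns(LINK_RATE_BPS))
    print("max packets/s:", int(max_packets_per_second(LINK_RATE_BPS)))
```

Under these assumptions a monitor has well under a microsecond per packet in the worst case, roughly 1.49 million packets per second, which is why per-packet state has to live in fast memory.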

8 Vision: Achieve real-time OLAP for massive data streams. Achieve cybernetic control for systems that depend on rapid data analysis.

9 Detection

10 Forensics

12 Data Rates versus Data Storage: Data RATES are measured in bits per second (lowercase b). Data STORAGE is measured in Bytes (uppercase B). So, Gigabits (Gb) ≠ Gigabytes (GB).

13 Data Storage based on Data Rate: An Ethernet Network Interface Card transfers data at 1 Gbps. Data accumulates at 450 GB per hour. That's 10.5 TB per day, 73.8 TB per week!
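The storage figures can be reproduced with a few lines of arithmetic. This is a sketch under an assumption the slides do not state: that 1 Gbps is taken as 2^30 bits per second and that GB and TB are binary multiples (2^30 and 2^40 bytes). Under that reading the hourly total comes out in gigabytes, and the daily and weekly figures of 10.5 TB and 73.8 TB follow. All names below are illustrative.

```python
# Assumption: 1 Gbps is read as 2**30 bits per second, and sizes use
# binary multiples (1 GB = 2**30 bytes, 1 TB = 2**40 bytes).
RATE_BPS = 2**30
SECONDS_PER_HOUR = 3600


def bytes_accumulated(rate_bps: int, seconds: int) -> int:
    """Bytes received when a link runs at full rate for `seconds`."""
    return rate_bps * seconds // 8


def in_gb(nbytes: int) -> float:
    return nbytes / 2**30


def in_tb(nbytes: int) -> float:
    return nbytes / 2**40


per_hour = bytes_accumulated(RATE_BPS, SECONDS_PER_HOUR)
per_day = bytes_accumulated(RATE_BPS, 24 * SECONDS_PER_HOUR)
per_week = bytes_accumulated(RATE_BPS, 7 * 24 * SECONDS_PER_HOUR)
per_30_days = bytes_accumulated(RATE_BPS, 30 * 24 * SECONDS_PER_HOUR)

if __name__ == "__main__":
    print("GB per hour:", in_gb(per_hour))
    print("TB per day:", round(in_tb(per_day), 1))
    print("TB per week:", round(in_tb(per_week), 1))
    print("TB per 30 days:", round(in_tb(per_30_days), 1))
```

With decimal units instead (10^9 bits per second, 10^12 bytes per TB) the daily figure would be 10.8 TB, so the choice of convention matters at this scale.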

14 Picturing Orders of Magnitude: What if BYTES were pennies? $10^{6} \approx 2^{20}$, $10^{9} \approx 2^{30}$, $10^{12} \approx 2^{40}$, $10^{15} \approx 2^{50}$. Used with permission: © Copyright 2001 Alan Taylor – The Mega Penney Project - KOKOGIAK MEDIA

17 Picturing Orders of Magnitude: What if BYTES were pennies? At 1 Gbps, about 0.3 PB accumulate per month (10.5 TB per day for 30 days).

18 Picturing Orders of Magnitude: What if BYTES were pennies? $10^{18} \approx 2^{60}$.

19 From Streaming Data to Database: The network stream is segmented into flows, which are inserted into a database. Observed database input rate for a 1 Gb Ethernet NIC: 700,000 flows per hour. Existing databases can't keep up!

20 Consider 2 Database Schemas: Disk Star Schema and STREAM Star Schema.

21 Disk Star Schema, From Fact Table to Dimension Tables: So where's the star? Fact table columns: Content, Destination IP, Sender, Recipient, Subject. Dimension tables: Content Table, Sender Table, Subject Table, Recipient Table, Destination IP Table. Here's the star. That's all there is to the star concept.

22 Value of the Disk Star Schema: Conserve Disk Space.

23 Dimensions: Each Dimension gets a key.

24 Resulting in a Dimension Table. 1NF: No Repeating Groups.

25 Substitute Keys for Facts, thus deriving a Fact Table.
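The key substitution described on these slides can be sketched in a few lines. This is an in-memory illustration, not the presenter's implementation: the class and function names are hypothetical, the column names are borrowed from the star diagram, and a Python dict stands in for what a disk database would do with an indexed dimension table.

```python
class DimensionTable:
    """Stores each distinct string once and hands out an integer key for it."""

    def __init__(self):
        self._key_by_value = {}
        self._values = []

    def key_for(self, value: str) -> int:
        # Look the value up first; only unseen values get a new row.
        key = self._key_by_value.get(value)
        if key is None:
            key = len(self._values)
            self._values.append(value)
            self._key_by_value[value] = key
        return key

    def value_of(self, key: int) -> str:
        return self._values[key]

    def __len__(self) -> int:
        return len(self._values)


# Hypothetical column order for the fact table in this example.
COLUMNS = ("sender", "recipient", "subject")


def insert_fact(fact_table, dimensions, record):
    """Replace each string in the record with its dimension key."""
    row = tuple(dimensions[name].key_for(record[name]) for name in COLUMNS)
    fact_table.append(row)
    return row


dims = {name: DimensionTable() for name in COLUMNS}
facts = []
first = insert_fact(facts, dims, {
    "sender": "alice@example.com",
    "recipient": "bob@example.com",
    "subject": "hello",
})
second = insert_fact(facts, dims, {
    "sender": "alice@example.com",
    "recipient": "carol@example.com",
    "subject": "hello",
})
```

The repeated sender and subject are stored once, which is the space saving the Disk Star Schema is after. The cost is that every insert has to consult every dimension table before the fact row can be written.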

26 Bottleneck: Disk Star Schema = slow data insertion time. Relational databases are normalized to conserve space. Speed is sacrificed, so real-time analysis is compromised.

27 Disk Star Schema

28 Disk Star Schema

29 Disk Star Schema

30 Disk Star Schema

31 Bottleneck: Insertion time for a dimension table depends on the table size: it is $O(\log n)$, where $n$ is the number of records in the table. Disk Star Schema insertion time is the sum of all dimension table insert times, $O\left(\sum_{1 \le i \le l} \log n_i\right)$, where $l$ is the number of attributes in the database and $n_i$ is the number of values for attribute $i$. Can't fill dimension tables fast enough!
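To make the bound concrete, here is a minimal cost model. It is an assumption-laden sketch: it counts one abstract "index step" per halving of a dimension table, uses base-2 logarithms, and ignores constants, caching, and disk latency. The function name is made up for this example.

```python
import math


def disk_star_insert_cost(dimension_sizes):
    """Sum of log2(n_i) over all dimension tables, one lookup per table.

    Each size must be at least 1.
    """
    return sum(math.log2(n) for n in dimension_sizes)


# Five dimension tables with about a thousand values each.
small = disk_star_insert_cost([2**10] * 5)

# The same five tables after growing to about a million values each.
large = disk_star_insert_cost([2**20] * 5)

if __name__ == "__main__":
    print("steps per insert, small tables:", small)
    print("steps per insert, large tables:", large)
```

In this model the per-insert cost doubles as the tables grow from a thousand to a million entries, and it scales with the number of attributes, which is the bottleneck the slide describes.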

32 Short Pause to Review Numbers: 1,000,000,000-bit Ethernet NIC (1 Gb). 700,000 observed flows per hour. 450 GB per hour, 10.5 TB a day. All we can get is a snapshot analysis!

33 Consider 2 Database Schemas: Disk Star Schema and STREAM Star Schema.

34 Stream Star Schema

35 Stream Star Schema

36 Stream Star Schema

37 Disk Star Schema: Nearly 1:1 correspondence between string attributes and dimension tables.

38 Disk Star Schema: Two kinds of tables: fact and dimension. All string dimensions have dimension tables. Minimizes disk space. Dimension tables can be large. Long insert time = $O\left(\sum_{1 \le i \le l} \log n_i\right)$. No string duplication.

39 Stream Star Schema: Many:1

40 Stream Star Schema: Three kinds of tables: fact, dimension, and string. Few dimension tables. Dimension tables are small. Minimizes insertion time. Insert time is constant. Allows string duplication.
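The slides do not show how the string table is implemented, so the following is only one plausible reading of "allow string duplication": strings are appended without checking whether they already exist, which removes the lookup from the insert path. Every name here is hypothetical, and the small dimension tables are left out for brevity.

```python
class StringTable:
    """Append-only table: every string gets a new row, duplicates included."""

    def __init__(self):
        self.rows = []

    def append(self, value: str) -> int:
        # No search for an existing copy, so the work per insert does not
        # depend on how many rows the table already holds.
        self.rows.append(value)
        return len(self.rows) - 1


def insert_flow(fact_table, string_table, record, string_attrs):
    """Store each string attribute and keep its row id in the fact row."""
    row = {name: string_table.append(record[name]) for name in string_attrs}
    fact_table.append(row)
    return row


strings = StringTable()
facts = []
attrs = ("sender", "subject")

first = insert_flow(facts, strings,
                    {"sender": "alice@example.com", "subject": "hello"}, attrs)
second = insert_flow(facts, strings,
                     {"sender": "alice@example.com", "subject": "hello"}, attrs)
```

Here the second insert stores the same two strings again rather than reusing the first copies. That trades disk space for an insert cost that stays flat as the table grows, the opposite trade-off from the Disk Star Schema.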

41 Side x Side Comparison: Slow / Fast. Old / New.

42 Test Results

43 Test Results: The magnified area is different because I measured the insert time for (1, 10, 100) as opposed to (1000, 2000, 3000) streams.

44 Test Results: The magnified area is different because of how MySQL works. I can only present a hypothesis since I don't have the MySQL source code. But I suspect that MySQL is optimized for fewer than 100 streams for this problem.

45 Conclusion

46 Conclusion: The Stream Star Schema processes data streams in real time, up to gigabits per second. Stream Star performance is O(1).

47 Hope: Detection, Forensics, RFID.

48 There's data flow

49 And then there's DATA FLOW!

50 Disk Star Schema handles 3 million flows per hour, about this much.

51 The Stream Star Schema handles 113 million flows per hour! Disk Star Schema handles 3 million flows per hour, about this much.

52 Nearly 40x Faster!

53 For The Future: Implement the Stream Star Schema in the Cloud. Use multiple Stream Star Schema computer nodes to handle an infinite stream. Storage could be handled similarly to S3.

54 For The Future: The Stream Star Schema fully supports the analysis of high-speed data streams, thus enabling security applications and forensic processing.

55 END

