Large-Scale Data Collection Using Redis C. Aaron Cois, Ph.D. -- Tim Palko CMU Software Engineering Institute © 2011 Carnegie Mellon University.


1 Large-Scale Data Collection Using Redis C. Aaron Cois, Ph.D. -- Tim Palko CMU Software Engineering Institute © 2011 Carnegie Mellon University

2 Us C. Aaron Cois, Ph.D. Software Architect, Team Lead CMU Software Engineering Institute Digital Intelligence and Investigations Directorate Tim Palko Senior Software Engineer CMU Software Engineering Institute Digital Intelligence and Investigations Directorate © 2011 Carnegie Mellon University @aaroncois

3 Overview Problem Statement Sensor Hardware & System Requirements System Overview – Data Collection – Data Modeling – Data Access – Event Monitoring and Notification Conclusions and Future Work

4 The Goal Critical infrastructure/facility protection via Environmental Monitoring

5 Why? Stuxnet Two major components: 1) Send centrifuges spinning wildly out of control 2) Record ‘normal operations’ and play them back to operators during the attack [1] Environmental monitoring provides secondary indicators, such as abnormal heat/motion/sound [1] http://www.nytimes.com/2011/01/16/world/middleeast/16stuxnet.html?_r=2&

6 The Broader Vision Quick, flexible out-of-band monitoring Set up monitoring in minutes Versatile sensors, easily repurposed Data communication is secure (P2P VPN) and requires no existing systems other than outbound networking

7 The Platform A CMU research project called Sensor Andrew Features: – Open-source sensor platform – Scalable and generalist system supporting a wide variety of applications – Extensible architecture that can integrate diverse sensor types

8 Sensor Andrew

9 Sensor Andrew Overview Nodes Gateway Server End Users

10 What is a Node? Environment Node Sensors Light Audio Humidity Pressure Motion Temperature Acceleration Power Node Sensors Current Voltage True Power Energy A node collects data and sends it to a collector, or gateway Radiation Node Sensors Alpha particle count per minute Particulate Node Sensors Small Part. Count Large Part. Count

11 What is a Gateway? A gateway receives UDP data from all nodes registered to it An internal service: – Receives data continuously – Opens a server on a specified port – Continually transmits UDP data over this port Gateway

12 Requirements We need to:
1. Collect data from nodes once per second
2. Scale to 100 gateways, each with 64 nodes
3. Detect events in real time
4. Notify users about events in real time
5. Retain all data collected for years, at least
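The first two requirements fix the write rate the data store has to sustain. The arithmetic below is a small sketch of that load; the constants come from the requirements slide, and the per-day and per-year totals assume every node reports without gaps.

```python
# Ingest-rate arithmetic for the stated requirements.
GATEWAYS = 100
NODES_PER_GATEWAY = 64
SAMPLES_PER_NODE_PER_SECOND = 1

nodes = GATEWAYS * NODES_PER_GATEWAY
writes_per_second = nodes * SAMPLES_PER_NODE_PER_SECOND
writes_per_day = writes_per_second * 60 * 60 * 24
writes_per_year = writes_per_day * 365  # assumes no gaps in reporting

print(nodes, writes_per_second, writes_per_day, writes_per_year)
```

At full scale that is 6,400 datapoints per second, every second, before any reads for event detection are counted.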

13 What Is Big Data?

14 “When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze and share it.”

15 Problems Size Transmission Storage Rate

20 Problems Size Transmission Storage Rate Retrieval

21 Collecting Data Problem: Data cannot remain on the nodes or gateways due to security concerns. Limited infrastructure. Constraints: Store and retrieve immense amounts of data at a high rate. ? ? Gateway 8 GB / hour Complex Analytics

22 We Tried PostgreSQL… Advantages: – Reliable, tested and scalable – Relational => complex queries => analytics Problems: – Performance problems reading while writing at a high rate; real-time event detection suffers – ‘COPY FROM’ doesn’t permit horizontal scaling

23 Q: How can we decrease I/O load?

24 A: Read and write collected data directly from memory

25 Enter Redis Redis is an in-memory NoSQL database Commonly used as a web application cache or pub/sub server

26 Redis Created in 2009 Fully In-memory key-value store – Fast I/O: R/W operations are equally fast – Advanced data structures Publish/Subscribe Functionality – In addition to data store functions – Separate from stored key-value data

27 Persistence Snapshotting – Data is asynchronously transferred from memory to disk AOF (Append Only File) – Each modifying operation is written to a file – Can recreate data store by replaying operations – Without interrupting service, will rebuild AOF as the shortest sequence of commands needed to rebuild the current dataset in memory
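The AOF idea can be illustrated with a toy model. This is a sketch only, not Redis's actual file format or rewrite algorithm: the log is a Python list of hypothetical (operation, key, value) tuples, `replay` rebuilds the store from it, and `rewrite` compacts it to the shortest sequence of commands that produces the same dataset.

```python
def replay(log):
    """Rebuild a key-value store by replaying logged operations in order."""
    store = {}
    for op, key, value in log:
        if op == "SET":
            store[key] = value
        elif op == "DEL":
            store.pop(key, None)
    return store


def rewrite(log):
    """Compact the log to one SET per surviving key."""
    return [("SET", key, value) for key, value in replay(log).items()]


# Toy append-only log: every modifying operation is recorded as it happens.
aof = [
    ("SET", "a", 1),
    ("SET", "a", 2),
    ("SET", "b", 5),
    ("DEL", "b", None),
    ("SET", "c", 7),
]
compacted = rewrite(aof)
print(replay(aof), compacted)
```

Five logged operations shrink to two, and replaying either log yields the same store.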

28 Replication Redis supports master-slave replication Master-slave replication can be chained Be careful: – Slaves are writeable! – Potential for data inconsistency Fully compatible with Pub/Sub features

29 Redis Features Advanced Data Structures
List: [A, B, C, D] (ordered elements "A", "B", "C", "D")
Set: {A, B, C, D} (unique members "A", "B", "C", "D")
Sorted Set: {C:1, D:2, A:3, B:4} {value:score} (members A:3, C:1, D:2, B:4 kept in score order)
Hash: {field1:"A", field2:"B", …} {key:value} (fields field1, field2, field3, field4)

30 Our Data Model

31 Constraints Our data store must: – Hold time-series data – Be flexible in querying (by time, node, sensor) – Allow efficient querying of many records – Accept data out of order

32 Tradeoffs: Efficiency vs. Flexibility One record per timestamp (Motion, Audio, Light, Pressure, Humidity, Acceleration, Temperature stored together) VS One record per sensor data type (Motion, Light, Audio, Pressure, Temperature, Humidity, Acceleration stored separately)

33 Our Solution: Sorted Set Datapoint: sensor:env:101 Score: 1357542004000 Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
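A minimal sketch of how one datapoint could be mapped onto that layout: one sorted set per node, the timestamp as the score, and the whole reading serialized as JSON for the value. The helper `make_entry` and its `node_id` parameter are hypothetical; the slides do not show how the node identifier in the key is derived.

```python
import json


def make_entry(node_id, datapoint):
    """Return (key, score, member) for one datapoint.

    node_id is an assumed parameter; the slides do not say how it is chosen.
    """
    key = "sensor:{}:{}".format(datapoint["node_type"], node_id)
    score = datapoint["timestamp"]                  # milliseconds since the epoch
    member = json.dumps(datapoint, sort_keys=True)  # the stored value
    return key, score, member


datapoint = {
    "node_type": "env",
    "mac_address": "20f",
    "timestamp": 1357542004000,
    "temp": 523,
    "humidity": 22,
}
key, score, member = make_entry("101", datapoint)
print(key, score)
# With redis-py 3.x or later, the write itself would be: r.zadd(key, {member: score})
```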

37 Sorted Set
1357542004000: {"temp":523,..}
1357542005000: {"temp":523,..}
1357542007000: {"temp":530,..}
1357542008000: {"temp":531,..}
1357542009000: {"temp":540,..}
1357542010000: {"temp":545,..}
…

38 Sorted Set
1357542004000: {"temp":523,..}
1357542005000: {"temp":523,..}
1357542006000: {"temp":527,..} <- fits nicely
1357542007000: {"temp":530,..}
1357542008000: {"temp":531,..}
1357542009000: {"temp":540,..}
1357542010000: {"temp":545,..}
…
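The same behaviour can be sketched in plain Python with the standard `bisect` module: a late-arriving reading is placed by its timestamp, not by its arrival order, and a time window is then a contiguous slice. This models the idea only; Redis uses its own internal sorted-set implementation, and the helper `in_window` is hypothetical.

```python
import bisect

series = []  # kept sorted as (timestamp_ms, temp)

# The 1357542006000 reading arrives last, after 1357542007000.
arrivals = [
    (1357542004000, 523),
    (1357542005000, 523),
    (1357542007000, 530),
    (1357542006000, 527),
]
for point in arrivals:
    bisect.insort(series, point)  # insert at the position that keeps order

timestamps = [ts for ts, _ in series]


def in_window(points, start_ms, end_ms):
    """Return the readings with start_ms <= timestamp <= end_ms."""
    lo = bisect.bisect_left(points, (start_ms,))
    hi = bisect.bisect_right(points, (end_ms, float("inf")))
    return points[lo:hi]


window = in_window(series, 1357542005000, 1357542006000)
print(timestamps, window)
```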

39 Know your data structure! A set is still a set… Score: 1357542004000 Value: {"bat": 192, "temp": 523, "digital_temp": 216, "mac_address": "20f", "humidity": 22, "motion": 203, "pressure": 99007, "node_type": "env", "timestamp": 1357542004000, "audio_p2p": 460, "light": 820, "acc_z": 464, "acc_y": 351, "acc_x": 311}
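Why the timestamp appears inside the value as well as in the score: members of a sorted set are unique, so adding an identical member again only updates its score. The sketch below mimics that rule with a plain dict (member to score); it is a stand-in for illustration, not the Redis API.

```python
import json

zset = {}  # member -> score, mimicking sorted-set uniqueness


def zadd(member, score):
    """Add a member, or overwrite the score if the member already exists."""
    zset[member] = score


# Without a timestamp in the value, two identical readings collapse into one.
reading = json.dumps({"temp": 523}, sort_keys=True)
zadd(reading, 1357542004000)
zadd(reading, 1357542005000)
count_without_ts = len(zset)

# With the timestamp embedded, every member is distinct and both survive.
zset.clear()
for ts in (1357542004000, 1357542005000):
    zadd(json.dumps({"temp": 523, "timestamp": ts}, sort_keys=True), ts)
count_with_ts = len(zset)

print(count_without_ts, count_with_ts)
```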

40 Requirement Satisfied Redis Gateway

41 There is a disturbance in the Force..

42 Collecting Data Redis Gateway

43 “In Memory” Means Many Things The data store capacity is aggressively capped – Redis can only store as much data as the server has RAM
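A rough sketch of what that cap means at the ingest rate quoted earlier (8 GB / hour). The 64 GB server size is a hypothetical figure, and the estimate assumes the in-memory footprint equals the raw ingest volume, which ignores Redis's per-entry overhead.

```python
def hours_until_full(ram_gb, ingest_gb_per_hour):
    """Hours of data an in-memory store can hold before RAM is exhausted."""
    return ram_gb / ingest_gb_per_hour


INGEST_GB_PER_HOUR = 8  # figure from the "Collecting Data" slide
ASSUMED_RAM_GB = 64     # hypothetical server size, not from the slides

hours = hours_until_full(ASSUMED_RAM_GB, INGEST_GB_PER_HOUR)
gb_per_year = INGEST_GB_PER_HOUR * 24 * 365

print(hours, gb_per_year)
```

Under these assumptions the server fills in 8 hours, while a year of retention needs roughly 70 TB.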

44 Collecting Big Data Redis Gateway

45 We could throw away data… If we only cared about current values However, our data – Must be stored for 1+ years for compliance – Must be able to be queried for historical/trend analysis

46 We Still Need Long-term Data Storage Solution? Migrate data to an archive with expansive storage capacity

47 Winning Redis Gateway PostgreSQL Archiver

48 Winning? Redis Gateway PostgreSQL Archiver ? ? ? Some Poor Client

49 Yes, Winning Redis Gateway PostgreSQL Archiver API API Some Happy Client

50 Best of both worlds Gateway Redis PostgreSQL Archiver API API Redis allows quick access to real-time data, for monitoring and event detection PostgreSQL allows complex queries and scalable storage for deep and historical analysis
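One way the archiver step could work, sketched under loud assumptions: `FakeSortedSets` is an in-memory stand-in that mirrors the redis-py calls `zrangebyscore` and `zremrangebyscore`, SQLite stands in for PostgreSQL, and the `archive` table layout is invented for illustration. The slides do not describe the real archiver's internals.

```python
import sqlite3


class FakeSortedSets:
    """Stand-in exposing only the sorted-set calls the archiver needs."""

    def __init__(self):
        self.data = {}  # key -> {member: score}

    def zadd(self, key, mapping):
        self.data.setdefault(key, {}).update(mapping)

    def zrangebyscore(self, key, lo, hi, withscores=False):
        members = self.data.get(key, {})
        hits = sorted((s, m) for m, s in members.items() if lo <= s <= hi)
        return [(m, s) for s, m in hits] if withscores else [m for _, m in hits]

    def zremrangebyscore(self, key, lo, hi):
        members = self.data.get(key, {})
        doomed = [m for m, s in members.items() if lo <= s <= hi]
        for m in doomed:
            del members[m]
        return len(doomed)


def archive(store, db, key, cutoff_ms):
    """Copy datapoints scored <= cutoff_ms to the archive, then evict them."""
    rows = store.zrangebyscore(key, 0, cutoff_ms, withscores=True)
    db.executemany(
        "INSERT INTO archive (sensor_key, ts_ms, payload) VALUES (?, ?, ?)",
        [(key, int(score), member) for member, score in rows],
    )
    db.commit()  # evict from the hot store only after the archive write succeeds
    return store.zremrangebyscore(key, 0, cutoff_ms)


store = FakeSortedSets()
store.zadd("sensor:env:101",
           {'{"temp": 523}': 1000, '{"temp": 530}': 2000, '{"temp": 540}': 3000})
db = sqlite3.connect(":memory:")  # stand-in for the PostgreSQL archive
db.execute("CREATE TABLE archive (sensor_key TEXT, ts_ms INTEGER, payload TEXT)")

moved = archive(store, db, "sensor:env:101", cutoff_ms=2000)
archived = db.execute("SELECT COUNT(*) FROM archive").fetchone()[0]
remaining = store.zrangebyscore("sensor:env:101", 0, 10**13)
print(moved, archived, remaining)
```

Committing before evicting means a failure between the two steps leaves duplicates rather than lost data.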

51 We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events

52 We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events What is “significant”?

53 We Have the Data, Now What? Incoming data must be monitored and analyzed, to detect significant events What is “significant”? What about new data types?

54 Gateway Redis PostgreSQL Archiver API API Django App App DB New guy: provide a way to read the data and create rules motion > x && pressure < y && audio > z

55 Gateway Redis PostgreSQL Archiver API API Django App App DB Event Monitor New guy: read the rules and data, trigger alarms motion > x pressure < y audio > z All true?
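The "all true?" check can be sketched as a list of conditions evaluated against one datapoint. The rule representation, the helper `rule_fires`, and the thresholds are all hypothetical; the slides only give the general form motion > x, pressure < y, audio > z.

```python
import operator

COMPARATORS = {">": operator.gt, "<": operator.lt}


def rule_fires(conditions, datapoint):
    """True only when every (field, comparator, threshold) condition holds."""
    return all(
        field in datapoint and COMPARATORS[op](datapoint[field], threshold)
        for field, op, threshold in conditions
    )


# Hypothetical thresholds standing in for x, y, and z.
rule = [("motion", ">", 200), ("pressure", "<", 100000), ("audio_p2p", ">", 400)]

reading = {"motion": 203, "pressure": 99007, "audio_p2p": 460}
quiet_reading = dict(reading, audio_p2p=12)

alarm = rule_fires(rule, reading)
no_alarm = rule_fires(rule, quiet_reading)
print(alarm, no_alarm)
```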

56 Gateway Redis PostgreSQL Archiver API API Django App App DB Event Monitor Event monitor services can be scaled independently

57 Getting The Message Out

58 Getting The Message Out Considerations Event monitor already has a job; avoid re-tasking it as a notification engine

59 Getting The Message Out Considerations Event monitor already has a job; avoid re-tasking it as a notification engine Notifications should be pushed rather than polled for, which is more efficient

60 Getting The Message Out Considerations Event monitor already has a job; avoid re-tasking it as a notification engine Notifications should be pushed rather than polled for, which is more efficient Notification system should be generalized, e.g. SMTP, SMS

61 If only…

62 Gateway Redis Data PostgreSQL Archiver API API Django App App DB Event Monitor Redis Pub/Sub Worker Notification Worker SMTP Pub/Sub with synchronized workers is an optimal solution to real-time event notifications. No need to add another system, Redis offers pub/sub services as well!
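The push model can be sketched with a toy in-process broker: the event monitor publishes once, and each notification worker subscribed to the channel receives the message without polling. `ToyBroker`, the channel name, and the message text are illustrative assumptions; in the real system Redis's own publish and subscribe commands play this role across processes.

```python
class ToyBroker:
    """In-process stand-in for a pub/sub server: channels map to callbacks."""

    def __init__(self):
        self.channels = {}

    def subscribe(self, channel, callback):
        self.channels.setdefault(channel, []).append(callback)

    def publish(self, channel, message):
        """Push the message to every subscriber; return how many received it."""
        subscribers = self.channels.get(channel, [])
        for callback in subscribers:
            callback(message)
        return len(subscribers)


emails, texts = [], []
broker = ToyBroker()

# Two hypothetical notification workers, one per delivery method.
broker.subscribe("events", lambda msg: emails.append("SMTP: " + msg))
broker.subscribe("events", lambda msg: texts.append("SMS: " + msg))

# The event monitor only publishes; it never talks to SMTP or SMS directly.
delivered = broker.publish("events", "sensor:env:101 motion alarm")
unheard = broker.publish("no-subscribers", "dropped")
print(delivered, unheard, emails, texts)
```

Adding a new delivery method is then one more subscriber, with no change to the event monitor.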

63 Conclusions Redis is a powerful tool for collecting large amounts of data in real-time In addition to maintaining a rapid pace of data insertion, we were able to concurrently query, monitor, and detect events on our Redis data collection system Bonus: Redis also enabled a robust, scalable real-time notification system using pub/sub

64 Things to watch Data persistence – If Redis needs to restart, it takes 10-20 seconds per gigabyte to reload all data into memory [1] – Redis is unresponsive during startup [1] http://oldblog.antirez.com/post/redis-persistence-demystified.html
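A small sketch of what the 10-20 seconds per gigabyte figure implies for downtime. The 32 GB dataset size is a hypothetical example, and the helper `reload_seconds` simply scales the quoted range linearly.

```python
def reload_seconds(dataset_gb, low_s_per_gb=10, high_s_per_gb=20):
    """Estimated (best, worst) restart time using the quoted 10-20 s/GB range."""
    return dataset_gb * low_s_per_gb, dataset_gb * high_s_per_gb


ASSUMED_DATASET_GB = 32  # hypothetical in-memory dataset size

best, worst = reload_seconds(ASSUMED_DATASET_GB)
print(best / 60, worst / 60)  # minutes during which Redis is unresponsive
```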

65 Future Work Improve scalability through: – Data encoding – Data compression – Parallel batch inserts for all nodes on a gateway Deep historical data analytics

66 Acknowledgements Project engineers Chris Taschner and Jeff Hamed @ CMU SEI Prof. Anthony Rowe & CMU ECE WiSE Lab http://wise.ece.cmu.edu/ Our organizations CMU https://www.cmu.edu CERT http://www.cert.org SEI http://www.sei.cmu.edu Cylab https://www.cylab.cmu.edu

67 Thank You

68 Questions?

69 Slides of Live Redis Demo

70 A Closer Look at Redis Data
redis> keys *
1) "sensor:environment:f80"
2) "sensor:environment:f81"
3) "sensor:environment:f82"
4) "sensor:environment:f83"
5) "sensor:environment:f84"
6) "sensor:power:f85"
7) "sensor:power:f86"
8) "sensor:radiation:f87"
9) "sensor:particulate:f88"

71 A Closer Look at Redis Data
redis> keys sensor:power:*
1) "sensor:power:f85"
2) "sensor:power:f86"

72 A Closer Look at Redis Data
redis> zcount sensor:power:f85 -inf +inf
(integer) 3565958
(45.38s)

73 A Closer Look at Redis Data
redis> zcount sensor:power:f85 1359728113000 +inf
(integer) 47

74 A Closer Look at Redis Data
redis> zrange sensor:power:f85 -1000 -1
1) "{\"long_energy1\": 73692453, \"total_secs\": 6784, \"energy\": [49, 175, 62, 0, 0, 0], \"c2_center\": 485, \"socket_state\": 1, \"node_type\": \"power\", \"c_p2p_low2\": 437, \"socket_state1\": 0, \"mac_address\": \"103\", \"c_p2p_low\": 494, \"rms_current\": 6, \"true_power\": 1158, \"timestamp\": 1359728143000, \"v_p2p_low\": 170, \"c_p2p_high\": 511, \"rms_current1\": 113, \"freq\": 60, \"long_energy\": 4108081, \"v_center\": 530, \"c_p2p_high2\": 719, \"energy1\": [37, 117, 100, 4, 0, 0], \"v_p2p_high\": 883, \"c_center\": 509, \"rms_voltage\": 255, \"true_power1\": 23235}"
2) …

75 Redis Python API
import redis

pool = redis.ConnectionPool(host="127.0.0.1", port=6379, db=0)
r = redis.Redis(connection_pool=pool)

# Last 50 entries by index (rank)
byindex = r.zrange("sensor:env:f85", -50, -1)
# ['{"acc_z":663,"bat":0,"gpio_state":1,"temp":663,"light":…

# Entries whose score (timestamp in ms) falls within a one-second window
byscore = r.zrangebyscore("sensor:env:f85", 1361423071000, 1361423072000)
# ['{"acc_z":734,"bat":0,"gpio_state":1,"temp":734,"light":…

# Number of entries in the sorted set
size = r.zcount("sensor:env:f85", "-inf", "+inf")
# 237327L

