Ricardo Jimenez-Peris Universidad Politecnica de Madrid

Scalable Autonomic Streaming Middleware for Real-Time Processing of Massive Data Flows
Ricardo Jimenez-Peris Universidad Politecnica de Madrid Project Coordinator

Project Data Start: February 2008. Duration: 3 years. Partners:
UPM – Spain (coord.). FORTH - Greece. TU Dresden - Germany. Telefonica - Spain. Exodus - Greece. Epsilon - Italy.

Background Data streaming is a new paradigm developed in the database community to process large data flows in memory in an online fashion. It allows continuous queries to be run over flowing data. Most existing platforms are centralized, a few are distributed, and they perform 1-2 orders of magnitude better than relational DBs.
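To make the idea of a continuous query concrete, here is a minimal sketch of a time-based sliding-window average that is updated in memory as each tuple arrives. The class name `SlidingWindowAverage`, its parameters, and the window semantics are illustrative assumptions, not part of the Stream project or of any particular streaming engine.

```python
from collections import deque


class SlidingWindowAverage:
    """Hypothetical continuous query: average of the values seen in the
    last `window_seconds` seconds, recomputed on every incoming tuple."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0       # running sum of the values in the window

    def push(self, timestamp, value):
        """Add one tuple and return the current windowed average."""
        self.events.append((timestamp, value))
        self.total += value
        # Evict tuples that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)


if __name__ == "__main__":
    query = SlidingWindowAverage(window_seconds=10)
    for ts, value in [(0, 10), (5, 20), (12, 30)]:
        print(ts, query.push(ts, value))
```

The query never stores the full stream: only the tuples inside the current window are kept in memory, which is what distinguishes this style of processing from loading data into a relational table and querying it afterwards.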

Background: Data Streaming Operators

Background: Data Streaming Query

Scope Many potential Internet applications today need to process huge amounts of information in an online fashion: Mitigation of DDoS attacks. Spam filtering. Processing the output of sensor networks. Detecting fraud in cellular telephony. Financial applications. QoS monitoring for enforcing SLAs. Real-time data mining. Etc.

Objectives Stream aims at developing a highly scalable middleware infrastructure to process massive data flows in real time. The innovation lies in the sheer scale targeted by the project: 1-2 orders of magnitude higher than current technology.

Innovation Parallelizing data streaming operators:
Currently a query operator can be deployed only on a single site, where it has to process the full data flow and thus becomes the bottleneck. Stream is developing distributed versions of query operators that allow an individual query operator to run on a cluster of sites.
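One common way to spread a single operator over several sites is to partition the stream by key, so that each instance sees only a share of the flow. The sketch below assumes hash partitioning and a simple per-key count operator; the slides do not say which partitioning scheme Stream uses, and the names `route`, `CountInstance`, `run_parallel_count` and `merge` are hypothetical.

```python
import zlib
from collections import Counter


def route(key, num_instances):
    """Pick the operator instance for a key (hash partitioning, assumed)."""
    return zlib.crc32(key.encode("utf-8")) % num_instances


class CountInstance:
    """One instance of a per-key count operator, as it might run on one site."""

    def __init__(self):
        self.counts = Counter()

    def process(self, key):
        self.counts[key] += 1


def run_parallel_count(keys, num_instances):
    """Send each tuple to exactly one instance, chosen by its key."""
    instances = [CountInstance() for _ in range(num_instances)]
    for key in keys:
        instances[route(key, num_instances)].process(key)
    return instances


def merge(instances):
    """Combine the partial results of all instances into a global result."""
    merged = Counter()
    for instance in instances:
        merged.update(instance.counts)
    return merged


if __name__ == "__main__":
    stream = ["a", "b", "a", "c", "b", "a"]
    partial = run_parallel_count(stream, num_instances=3)
    print(merge(partial))
```

Because all tuples with the same key reach the same instance, no instance has to process the full data flow, and the merged result matches what a single centralized operator would have produced.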

Innovation: Parallel Data Streaming
[Figure: operators Op1, Op2 and Op3 connected from upstream to downstream]

Innovation Exploiting leading-edge high-performance networks and IO systems: Reaching 40 Gb/s for both networking and IO. This results in high-throughput, very low-latency communication among sites. Low-cost storage system: 1 PC controlling 40 disks.

Architecture Data Mining Layer. Autonomic Controller Layer. Parallel Data Streaming Layer. Data Streaming Layer. High Performance IO & Storage Layer.

Innovation Self-healing: Able to tolerate failures (novel approach). Able to recover new nodes online. Self-configuring: Dynamic load balancing. Self-provisioning: Nodes are added and removed as needed depending on the load.
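The self-provisioning behaviour can be illustrated with a simple threshold rule: keep the cluster size while the average load stays within a band, and resize otherwise. This is only a sketch under assumed thresholds; the function `target_node_count`, the `low` and `high` bounds, and the midpoint sizing rule are hypothetical and are not taken from the project.

```python
import math


def target_node_count(loads, low=0.3, high=0.8, min_nodes=1):
    """Return how many nodes the cluster should have.

    `loads` holds the current utilization of each node (0.0 to 1.0).
    The cluster size is kept while the average load lies in [low, high];
    otherwise it is resized so the average lands near the band's midpoint.
    """
    if not loads:
        return min_nodes
    current = len(loads)
    total = sum(loads)
    average = total / current
    if low <= average <= high:
        return current  # load is acceptable: no nodes added or removed
    target_utilization = (low + high) / 2
    needed = math.ceil(total / target_utilization)
    return max(min_nodes, needed)


if __name__ == "__main__":
    print(target_node_count([0.5, 0.6]))            # within the band
    print(target_node_count([0.9, 0.95, 0.9]))      # overloaded: grow
    print(target_node_count([0.1, 0.1, 0.1, 0.1]))  # underloaded: shrink
```

A real controller would also need to damp oscillations and account for the cost of moving operator state when nodes join or leave, which this sketch ignores.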

Expected Outcome Highly scalable and autonomic infrastructure to process massive data flows. 2 orders of magnitude more scalable than current distributed data streaming platforms. Application to 3 different markets: Telco: Fighting fraud in cellular telephony. Services: Real-time checking of SLA fulfillment. Financial/banking: Detection of money-laundering operations/Fraud detection in credit card payments/Real-time data warehousing.

Current Status Month 8 of the project.
Prototypes of all layers (except the autonomic controller, foreseen for the 2nd year). A 50-node cluster interconnected with Myrinet10G has been set up. First tests of parallel data streaming exhibit high scalability. Prototypes of the IO and storage tiers are in an advanced state.

Questions?
