Ernestina Menasalvas Ruiz Pedro Sousa. GOAL Extract knowledge from aviation data sources to obtain patterns that help detection of incidents Learn behaviour.

1 Ernestina Menasalvas Ruiz Pedro Sousa

2 GOAL
– Extract knowledge from aviation data sources to obtain patterns that help the detection of incidents
– Learn behaviour models

3 What is Data Mining?
Many definitions:
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004
[Figure: KDD process]

4 …. CRISP-DM: Business Understanding, Data Understanding, Data Preparation, Model, Evaluate. ARSS fleet

5 Challenges
Data integration:
– Aircraft information
– Context: sensors, space weather, location, weather
– Operations: pre-flight, departure, climb, en route, arrival, taxiing, post-flight
– Aviation safety reports
Dynamic and complex data:
– The theoretical and practical aspects of the algorithms have to be analyzed to discover the most appropriate techniques: trend analysis, association of events, data stream methods, context integration, resource awareness

6 GOAL (cont.)
Apply algorithms to mine the various data sources for information:
– To identify patterns: atypical flights, anomalous cockpit procedures
– Groups of safety reports
BUT:
– KDD is a process
– Static vs dynamic

7 KDD process Approx. 80% effort

8 Data Exploration and transformation
Exploration of the data to better understand its characteristics:
– Helps to select the right tool for preprocessing or analysis
– Makes use of humans’ ability to recognize patterns
– Integrates the semantics of the data
– Clustering and anomaly detection will be used as exploratory techniques
Transform the data prior to mining so that the useful patterns can be extracted

9 Data Mining Tasks
Prediction (supervised learning):
– Use historical information to learn a model that helps to predict unknown or future values of some variable
– Basis for forecasting
– Classification
– Regression
– Deviation detection
Description (unsupervised learning):
– Find patterns that describe the data
– Clustering
– Association rule discovery
– Sequential pattern discovery

10 Classification
Given a collection of records in which the class is known:
– Find a model that describes the class given the values of the other attributes
– Measures have to be used to validate the model and determine the accuracy of its predictions: train and test
Techniques:
– Decision tree induction (C4.5, ID3): very efficient in execution time; very intuitive results
– Neural networks: the result is a neural network, which is a black box; robust; not intuitive
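The train-and-test idea on this slide can be illustrated with a minimal sketch. It fits a one-level decision tree (a "stump"), which is a deliberate simplification and not C4.5 or ID3: those algorithms grow full trees using information gain or gain ratio. The toy data, the feature meaning and the names `train_stump`, `predict` and `accuracy` are assumptions made for illustration only.

```python
import random

def train_stump(rows):
    """Fit a one-level decision tree on (features, label) rows.

    Tries every feature and every midpoint between consecutive distinct
    values, and keeps the split with the fewest training errors.
    """
    labels = sorted({label for _, label in rows})
    best = None
    for f in range(len(rows[0][0])):
        values = sorted({x[f] for x, _ in rows})
        thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for low in labels:
                for high in labels:
                    errors = sum(
                        1 for x, y in rows
                        if (low if x[f] <= threshold else high) != y
                    )
                    if best is None or errors < best[0]:
                        best = (errors, f, threshold, low, high)
    if best is None:
        raise ValueError("no feature has more than one distinct value")
    return best[1:]  # (feature index, threshold, label below, label above)

def predict(model, x):
    f, threshold, low, high = model
    return low if x[f] <= threshold else high

def accuracy(model, rows):
    return sum(predict(model, x) == y for x, y in rows) / len(rows)

# Hypothetical toy data: two made-up flight features and a known class.
rng = random.Random(0)
data = [((rng.uniform(0, 1), rng.uniform(0, 1)), "normal") for _ in range(50)]
data += [((rng.uniform(2, 3), rng.uniform(0, 1)), "atypical") for _ in range(50)]
rng.shuffle(data)

train, test = data[:70], data[70:]   # train on 70%, validate on the rest
model = train_stump(train)
print("test accuracy:", accuracy(model, test))
```

Measuring accuracy on records that were not used for training is what makes the estimate meaningful; accuracy on the training set alone tends to be optimistic.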

11 Regression analysis
Linear regression: $Y = \alpha + \beta X$
– $\alpha$ and $\beta$ specify the line and are estimated using the data
Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$
Log-linear models:
– The table of joint probabilities is approximated by lower-order tables
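As a small worked example of estimating the two coefficients of the regression line from data, the sketch below uses the ordinary least-squares formulas. The function name `fit_line` and the sample values are illustrative assumptions, not part of the slides.

```python
def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up points that lie exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```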

12 Clustering
Given a set of unclassified records, group them in such a way that:
– records in the same cluster are more similar to one another
– records in separate clusters are less similar to one another
Similarity measures have to be defined:
– Special attention must be paid to how distance is understood
Approaches:
– Partitioning algorithms: they first build different partitions, which are then evaluated (e.g. K-means)
– Hierarchical: they build a hierarchical decomposition
– Density-based: density functions are used
– Kohonen networks [Kohonen ‘95]
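A minimal K-means sketch makes the role of the distance measure concrete. It assumes squared Euclidean distance and takes the initial centroids as an argument, because the result of K-means depends on where it starts. The function names and the toy points are assumptions for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))

def kmeans(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [
            centroid(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:   # assignments are stable: converged
            break
        centroids = updated
    return centroids, clusters

# Made-up data: two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centroids, clusters = kmeans(points, [points[0], points[-1]])
print(centroids)
```

Starting both centroids inside the same group can leave the algorithm in a poor local optimum, which is one reason practical implementations run several initializations.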

13 Association Rule Discovery
Given a set of records described by a set of attributes:
– Find associations between the values of attributes
– Once associations are discovered, rules can be obtained
– Confidence vs support
– Apriori algorithm
Example: At1=1 and At3=1 and At4=1

14 Association algorithms
The problem of finding association rules can be divided in two:
– Find the itemsets (sets of products) that reach the minimum support
– Use the frequent itemsets to generate the rules
Apriori [Agrawal ‘93]
– Advantages: Apriori and its variants are the most widely used in this kind of analysis; efficient on large volumes of data
– Disadvantages: memory consumption
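The two steps above can be sketched as follows: a level-wise, Apriori-style search for frequent itemsets, followed by rule generation filtered by confidence. This is a compact illustration rather than an optimized implementation; the transactions reuse the attribute names At1 to At4 as hypothetical items, and the thresholds are arbitrary.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets reaching min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    while candidates:
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        next_candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == len(a) + 1 and all(
                frozenset(s) in level for s in combinations(union, len(a))
            ):
                next_candidates.add(union)
        candidates = list(next_candidates)
    return frequent

def generate_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in frequent.items():
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                antecedent = frozenset(subset)
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append(
                        (antecedent, itemset - antecedent, support, confidence)
                    )
    return rules

# Hypothetical records: each set lists the attributes whose value is 1.
transactions = [
    {"At1", "At3", "At4"},
    {"At1", "At3"},
    {"At1", "At3", "At4"},
    {"At2", "At4"},
    {"At1", "At3", "At4"},
]
frequent = frequent_itemsets(transactions, min_support=0.5)
found = generate_rules(frequent, min_confidence=0.9)
for antecedent, consequent, support, confidence in found:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence)
```

Support measures how often an itemset occurs in the data, while confidence measures how often the consequent holds when the antecedent does, which is why a rule can have high confidence and still be rare.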

15 Challenges of the algorithms
An algorithm to find anomalies in large datasets has to be:
– fast
– scalable
– accurate
Algorithms have to be able to deal with:
– continuous sequences, representing sensor data such as airspeed and altitude
– discrete sequences, such as sequences of pilot switch presses

16 Data streams vs static data

17 Data streams
Challenges for algorithms:
– Processing data in a single pass
– Generating models incrementally
– Detecting model changes over time
– Limiting the use of memory and computing time
– Possibly automating the evaluation process
A data stream:
– is potentially unbounded in size
– needs to be analyzed over time
– arrives at a very high rate
– has an underlying model that evolves over time
[Aggarwal et al.] “Data Streams: Models and Algorithms”. Advances in Database Systems, Springer, 2007
[Aguilar-Ruiz, Gama] “Data Streams”. Journal of Universal Computer Science, 2005
[Barbará] “Requirements for clustering data streams”. SIGKDD’02

18 Goal
New challenges introduced by evolving data, such as:
– resource-aware learning
– change detection
– novelty detection
– important application areas where data evolution must be taken into account
– how learning under constraints (time, storage capacity and other resources) is affected by data evolution
– how context can help the learning process

19 Change and concept drift [Joao Gama 2010]
Concept drift: the underlying concept may shift unexpectedly from time to time.
Changes appear because of:
– Adversary actions
– Varying personal interest
– Changing population
– Complex environment

20 Required features [Joao Gama 2010]
– Examples have to be processed as they arrive
– Each example should be processed:
  - in small constant time
  - with a fixed amount of main memory
  - in a single scan of the data
  - without revisiting old records (or with reduced revisiting)
– Produce models equivalent to the one that would be obtained by a batch data-mining algorithm
– Detect and react to concept drift
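The per-example requirements (constant time, fixed memory, single scan) can be illustrated with a running mean and variance computed by Welford's online update. This is a generic sketch of incremental processing, not a method taken from the slides; the class name and the sample readings are assumptions.

```python
class RunningStats:
    """Mean and sample variance of a stream in O(1) time and memory per value."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0   # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

def sensor_stream():
    """Stand-in for an unbounded sensor feed (made-up readings)."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

stats = RunningStats()
for reading in sensor_stream():   # each value is seen once and then discarded
    stats.update(reading)
print(stats.mean, stats.variance)
```

The final values match what a batch computation over the stored data would give, which mirrors the requirement that the incremental model be equivalent to the batch one.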

21 Recurrent concepts
Many learning algorithms deal with concept drift:
– Based on: time windows, ensembles, drift detection
– FLORA, SEA, DWM, DMM, ...
What about recurrent concepts?
– A particular type of concept drift
– With forgetting mechanisms, past data and models are discarded
– However, it is common for concepts to reappear

22 Context and data stream

23 Context
Context representation:
Context similarity:
– numeric:
– nominal:

24 Context integration
We want to integrate context information with previously learned models.
– freqC is the most frequent context in a sequence of context states $\{C_1, C_2, \ldots, C_n\}$
– Concept history with associated context: $h(M_k \mid C_i)$
– Estimate that $M_k$ represents the current underlying concept given the current context

25 Model Storage
Model storage for a model $M_k$:
– the period $k$ in which the model was used
– using NB requires storing the CV
– the frequent context freqC for period $k$
– the accuracy of the model when it was in use
Represented as the tuple:

26 Model Retrieval
Model retrieval for a model $M_k$:
– using a sample $S_n$ of recent records,
– compute the MSE for $M_k$
– get the freqC for $S_n$
– use the history $h(M_k \mid \mathrm{freqC})$
The utility is based on model accuracy (the higher the better) and on how similar the stored context is to the current one (the smaller the distance the better).
Retrieve the model with the highest utility as:
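The exact utility and similarity formulas are not preserved in this transcript, so the following is only a hypothetical sketch of the idea: score each stored model by combining its recorded accuracy with the closeness of its stored context to the current one, and retrieve the best. The distance (range-normalized difference for numeric attributes, 0/1 mismatch for nominal ones), the equal weighting, the use of stored accuracy instead of MSE on a recent sample, and all names and values are assumptions.

```python
def context_distance(a, b, numeric_ranges):
    """Average per-attribute distance between two contexts, in [0, 1].

    Assumed measure: numeric attributes use the absolute difference divided
    by the attribute range; nominal attributes use 0 if equal, 1 otherwise.
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            low, high = numeric_ranges[key]
            total += abs(a[key] - b[key]) / (high - low)
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)

def retrieve(stored_models, current_context, numeric_ranges, weight=0.5):
    """Pick the stored (name, context, accuracy) entry with highest utility."""
    def utility(entry):
        _, context, accuracy = entry
        distance = context_distance(context, current_context, numeric_ranges)
        # Hypothetical utility: reward accuracy and context closeness equally.
        return weight * accuracy + (1 - weight) * (1 - distance)
    return max(stored_models, key=utility)

# Made-up model history: (model name, most frequent context, accuracy in use).
ranges = {"wind_kt": (0.0, 50.0)}
stored = [
    ("M1", {"phase": "climb", "wind_kt": 10.0}, 0.80),
    ("M2", {"phase": "cruise", "wind_kt": 30.0}, 0.90),
    ("M3", {"phase": "climb", "wind_kt": 40.0}, 0.70),
]
current = {"phase": "climb", "wind_kt": 12.0}
best = retrieve(stored, current, ranges)
print(best[0])
```

In this toy history the most accurate model is not the one retrieved, because its context is far from the current one; that trade-off is the point of combining both terms.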

27 CALDS: learning process
– Incrementally learn the underlying concept
– When a warning is signaled: prepare a new base learner for the possible new concept, anticipating the drift
– When drift is detected: store the current model, and reuse a previously learned model when the underlying concept is recurrent
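The warning and drift signals mentioned above can be produced by monitoring the error rate of the current model. The slides do not say which detector CALDS uses, so the sketch below is an assumption: it is modelled on the error-rate rule of the Drift Detection Method, with a warning level at two standard deviations above the best observed error and a drift level at three. The class name, the parameters and the synthetic error stream are illustrative.

```python
import math

class DriftDetector:
    """Error-rate monitor with warning and drift levels (DDM-style sketch)."""

    def __init__(self, min_samples=30, warning_factor=2.0, drift_factor=3.0):
        self.min_samples = min_samples
        self.warning_factor = warning_factor
        self.drift_factor = drift_factor
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Feed True if the model misclassified the example, else False."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n                # running error rate
        s = math.sqrt(p * (1 - p) / self.n)     # its standard deviation
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:     # remember the best point so far
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_factor * self.s_min:
            self.reset()                        # start monitoring a new concept
            return "drift"
        if p + s > self.p_min + self.warning_factor * self.s_min:
            return "warning"
        return "stable"

# Synthetic stream: 10% errors for 100 examples, then 50% errors.
stream = [i % 10 == 9 for i in range(100)] + [i % 2 == 1 for i in range(100)]
detector = DriftDetector()
signals = [detector.update(error) for error in stream]
drift_at = signals.index("drift")
print("first warning at", signals.index("warning"), "drift at", drift_at)
```

In a learning loop of the kind described on this slide, a "warning" would be the cue to start training a new base learner, and "drift" the cue to store the current model and switch.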

28 CALDS: learning process

29 Improvements integrating context Overall accuracy: 72.5%; 69.6%; 62.2%


31 Other current applications ESA (European Space Agency) – Event Reporting Tool for non-manned satellite passes (Cryosat monitoring)

32 Current applications ESA (European Space Agency) / Galileo Industries – Galileo: Ground Control Segment, Central Monitoring & Control Facility

33 Some current applications Portuguese Navy – Singrar: Integrated System for Ship Repair and Resource Allocation

34 The process Integrated Risk Plans Activation / Maintenance Drillings Training Application Input

35 Space Weather

36 Why – Space Weather? To protect systems and people that might be at risk from space weather effects, we need to understand the causes of space weather.

37 Space Weather Decision Support System (SWDSS)
Third project financed by the European Space Agency (ESA) on space weather.
The main objective of SWDSS is to develop software capable of storing, manipulating and reacting to adverse space weather situations in spacecraft:
– Providing tools for analyzing the collected data
– Supplying reporting facilities for systems management
– Supplying a knowledge discovery tool for nowcasting, forecasting and data mining

38 Data sources and providers
– Mission telemetry (payload and/or housekeeping) data and processed data
– Mission auxiliary data, e.g. orbital coordinates, apogee and perigee crossings, station coverage and hand-over, events, 3D models, metadata
– Data available from other sources, e.g. NOAA, SIDC, SWENET, national agencies
– Data from ground-based measurements

39 Satellite Monitoring

40 Conclusion
Huge amount of aviation data:
1. Integrate data (micro and macro level)
2. Enrich data with semantics
3. Map data to techniques to discover patterns (static and streams):
   1. Anomalies
   2. Predictive
   3. Sequences
   4. Context influence
Data mining in other similar domains has obtained results.
Next step: data mining for aviation safety

41 Ernestina Menasalvas Ruiz Pedro Sousa
