5.1.2 Analysis of stressors-responses relations with decision trees
Lidija Globevnik, Nataša Atanasova, Mateja Škerjanec, Maja Koprivšek (University of Ljubljana, Faculty of Civil and Geodetic Engineering)
WP5 Meeting: Ljubljana, 17-18 June, 2015

Investigation of pressures/stressors correspondence with water quality data and geo-climatic factors
The Geodatabase will contain datasets on multiple stressors, ecological status, water quantity, water quality, and ecosystem services. We will use a data-driven modelling approach (namely regression and classification trees) to investigate the relationship between pressures/stressors, geo-climatic factors and the state of the water bodies.

Data-driven modelling approach and decision trees
The goal of data-driven methods is to learn the dependencies between the inputs and the outputs of the observed system from the measured data. Decision tree learning is a commonly used method in data mining. A tree can be learned by splitting the source dataset into subsets based on attribute value tests.
Two types of decision trees:
- Classification trees: when the predicted (target) variable is a class
- Regression trees: when the target variable is numeric or continuous
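To illustrate what "splitting the source dataset into subsets based on attribute value tests" means, here is a minimal sketch of how one split on a single numeric attribute could be chosen. It makes two assumptions that are not in the slides: Gini impurity is used as the split criterion (the slides do not say which criterion was applied), and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure subset)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best test 'value <= threshold'."""
    best = (None, gini(labels))  # start with "no split": impurity of the whole set
    for t in sorted(set(values))[:-1]:  # the largest value cannot separate anything
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: forest share (%) and ecological status.
forest = [95.0, 92.0, 91.0, 60.0, 55.0, 40.0]
status = ["Good", "Good", "Good", "Moderate", "Moderate", "Moderate"]

threshold, impurity = best_split(forest, status)
print(threshold, impurity)
```

A full tree learner would apply the same search to every attribute and repeat it recursively on each resulting subset.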

Classification trees
Classification trees are used to separate the dataset into classes.
Data set (examples):
ATT 1 | ATT 2 | ATT 3 | … | TARGET (CLASS)
3.67 | 8.500 | 0.005 | … | Poor
4.15 | 7.207 | 0.005 | … | Moderate
5.32 | 8.357 | 0.011 | … | Good
7.80 | 7.929 | 0.005 | … | Good
8.11 | 7.096 | 0.005 | … | Poor
9.36 | 7.804 | 0.005 | … | Poor
10.87 | 6.018 | 0.005 | … | Moderate
11.10 | 7.400 | 0.006 | … | Poor
10.23 | 5.457 | 0.011 | … | Good
8.39 | 5.486 | 0.014 | … | Moderate
7.42 | 5.486 | 0.013 | … | Moderate
4.06 | 8.307 | 0.005 | … | Good
… | … | … | … | …
Classification tree (leaves: class1, class2, class3), written as a set of IF THEN rules:
IF (ATT 1_value ≤ value1) THEN (class_value = class1)
IF (ATT 1_value > value1 and ATT 2_value ≤ value2) THEN (class_value = class2)
IF (ATT 1_value > value1 and ATT 2_value > value2) THEN (class_value = class3)
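The three IF THEN rules of the classification tree can be read as a small function. The sketch below keeps the slide's placeholders `value1` and `value2` as parameters; the numeric thresholds in the usage lines are hypothetical and serve only to show how examples are routed to a leaf.

```python
def classify(att1, att2, value1, value2):
    """Apply the three IF THEN rules; value1 and value2 are the tree's thresholds."""
    if att1 <= value1:
        return "class1"
    if att2 <= value2:  # reached only when att1 > value1
        return "class2"
    return "class3"     # att1 > value1 and att2 > value2

# Hypothetical thresholds, chosen only for illustration.
VALUE1, VALUE2 = 5.0, 7.0

print(classify(3.67, 8.500, VALUE1, VALUE2))   # first rule applies
print(classify(10.87, 6.018, VALUE1, VALUE2))  # second rule applies
print(classify(7.80, 7.929, VALUE1, VALUE2))   # third rule applies
```

Because the rules are mutually exclusive, every example ends up in exactly one leaf.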

Regression trees
Regression trees are needed when the target variable is numeric or continuous. They are used for the prediction of the target value.
Data set (examples):
ATT 1 | ATT 2 | ATT 3 | … | TARGET
3.67 | 8.500 | 0.005 | … | 2.133
4.15 | 7.207 | 0.005 | … | 2.601
5.32 | 8.357 | 0.011 | … | 3.718
7.80 | 7.929 | 0.005 | … | 3.481
8.11 | 7.096 | 0.005 | … | 1.791
9.36 | 7.804 | 0.005 | … | 1.128
10.87 | 6.018 | 0.005 | … | 1.471
11.10 | 7.400 | 0.006 | … | 1.521
10.23 | 5.457 | 0.011 | … | 0.869
8.39 | 5.486 | 0.014 | … | 0.535
7.42 | 5.486 | 0.013 | … | 1.034
4.06 | 8.307 | 0.005 | … | 1.636
… | … | … | … | …
Regression tree: the nodes test attribute values (first ATT 1 < 10, then ATT 2 < 0.011); the leaves hold the equations with which the target variable is predicted.
Set of equations for the prediction of a target value (i.e. regression model):
IF (ATT 1 < 10) THEN (target = 0.2 ∙ ATT 1 + 3 ∙ ATT 2 + 4 ∙ ATT 3)
IF (ATT 1 ≥ 10 and ATT 2 < 0.011) THEN (target = 0.01 ∙ ATT 1 + 0 ∙ ATT 2 + 5 ∙ ATT 3)
IF (ATT 1 ≥ 10 and ATT 2 ≥ 0.011) THEN (target = 2 ∙ ATT 1 + 0.4 ∙ ATT 2 + 0 ∙ ATT 3)
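The regression tree from this slide can be sketched as a piecewise linear function. The coefficients and thresholds are the illustrative ones shown on the slide. One assumption is made: the boundary cases (ATT 1 = 10, ATT 2 = 0.011) go to the NO branch of the corresponding "<" test.

```python
def predict_target(att1, att2, att3):
    """Evaluate the example regression tree: two tests, three linear leaf models."""
    if att1 < 10:
        return 0.2 * att1 + 3 * att2 + 4 * att3
    if att2 < 0.011:  # reached only when att1 >= 10
        return 0.01 * att1 + 0 * att2 + 5 * att3
    return 2 * att1 + 0.4 * att2 + 0 * att3

# One hypothetical input per leaf.
print(predict_target(5, 1, 1))
print(predict_target(10, 0.005, 2))
print(predict_target(20, 0.5, 7))
```

Each example is first routed to a leaf by the tests and only then evaluated with that leaf's own linear equation, which is what distinguishes such a tree from a single global regression.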

Finding the most important relationship between state and pressure/drivers data
The MARS Geodatabase is prepared in a way that allows pressure (multiple stressors) and state (water quality) data to be linked to spatial objects.

Preparing datasets:
Spatial objects:
- River segments with a unique identifier „tr“
- Main drains in FEC and other river segments
- Linkage of river segments with FEC and its „Hinterland“ (all FECs in drainage area): „tr“ linked to „zhyd“
- Linkage of SoE monitoring stations with main drains
Water quality and quantity data:
- Data on nutrients
Pressure data:
- From EUROSTAT, E-PRTR and UWWTD; to include also modelled data (Moneris, JRC - GreenModel?)

FEC and hinterland polygons data:
- Surface area
- Have (average altitude), Hmin, Hmax
- Average slope (%)
- Precipitation, Temperature (1950-2000)
- Population number, density
- Hydroecoregion, Bioregion, Ecoregion (FEC)
- Corine land cover (1st order)
- River Strahler order (for main drain)
- River name (main drain)
- WFD WB - ID
- WFD ecological status
- Monitoring station ID on main drain
- Available water quality data (state)
Field | Description | Unit
Prefix F | data applies to FEC |
Prefix H | data applies to hinterland |
Prefix S | state |
WaterbaseI | WISE SoE quality station ID |
tr | ID of ECRINS river segment on which SoE quality station is located |
ZHYD_FEC | ID of ECRINS FEC on which SoE quality station is located |
hinterland | Does SoE quality station have hinterland or not (YES/NO) |
zhyd_hinterland | ID of hinterland of SoE quality station |
Hinterland_Area_km2 | Hinterland area in km2 |
strahler | Strahler order of tr where SoE quality station is located |
SoE_RivM | Name of river |
WaterBody_ID | WFD Water body ID |
WFD_ecol_st | Ecological status of river segment from WFD |
SoE_RiverDisch | River discharge (data from SoE quality database) |
DEM_altitu | Altitude of SoE quality station from DEM |
H_CLC1 | Agricultural areas | Share of hinterland area
H_CLC2 | Artificial surfaces | Share of hinterland area
H_CLC3 | Forest and semi natural areas | Share of hinterland area
H_CLC4 | Water bodies | Share of hinterland area
H_CLC5 | Wetlands | Share of hinterland area
F_DEM_A | Average (mean) altitude derived from DEM | [m a.s.l.]
F_DEM_Mi | Minimal altitude derived from DEM | [m a.s.l.]
F_DEM_Mx | Maximal altitude derived from DEM | [m a.s.l.]
F_SLOP_A | Average slope derived from DEM | [percent rise]
F_PON_A_W | Population count of the World Version 3 (Wv3), year 2000 | [people]
F_POD_A_W | Population density of the World Version 3 (Wv3), year 2000 | [people/km2]
F_POD_A_JRC | Population density disaggregated with Corine land cover 2000, year 2000 | [people/km2]
F_AR_km2 | Functional elementary catchment (FEC) area | [m2]
F_PRE_5000 | Average yearly precipitation for period 1950-2000 | [mm/year]
F_PRE1_5000 | Average January precipitation for period 1950-2000 | [mm/month]
F_PRE7_5000 | Average July precipitation for period 1950-2000 | [mm/month]
F_TEM_5000 | Average yearly temperature for period 1950-2000 | [°C]
F_TEM1_5000 | Average January temperature for period 1950-2000 | [°C]
F_TEM7_5000 | Average July temperature for period 1950-2000 | [°C]
F_ECOR_ID | Eco regions ID (AREA_ID) 1-25 | Share of FEC area
F_BIOR_ID | Biogeographical regions ID (ABBRE) | Share of FEC area
F_HER_ID | Hydro eco region ID (HERCODE) - European Hydro-Ecoregions | Share of FEC area

Data from EUROSTAT (mainly on agriculture)
- Farms
- Livestock
- Crops
- Irrigation
All data are on NUTS 2 level, for the year 2010. Where nothing was reported for 2010, we used averages (from 2005 to 2009) or the last reported values.
Field | Description | Unit
H_beehives | Beehives | number
H_cattle | Cattle | heads
H_dairy_cattle | Dairy cattle | heads
H_equidae | Equidae | heads
H_farms_ls | Farms with livestock | number
H_irr_area | Total irrigable area | ha
H_irr_volume | Irrigation water volume | m3
H_maize | Maize yield | 100 kg/ha
H_oth_cattle | Other cattle | heads
H_oth_pigs | Other pigs | heads
H_pigs | Pigs total | heads
H_potatoes | Potato yields | 100 kg/ha
H_poultry | Poultry | 1000 heads
H_rabbits | Rabbits | heads
H_sheep | Sheep | heads
H_sows | Sows | heads
H_uaa | Utilized agricultural area | ha
H_vineyards | Vineyards | 100 kg/ha
H_wheat | Wheat yield | 100 kg/ha
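The gap-filling rule for missing 2010 values could look roughly like the sketch below. The slide does not state the order of the two fallbacks, so the order used here (2005 to 2009 average first, then the last value reported before 2005) is an assumption. The function name and the sample series are hypothetical too.

```python
def value_for_2010(series):
    """Return the value used for 2010 from a {year: value} series.

    Assumed fallback order (not stated on the slide):
    1. the reported 2010 value,
    2. the mean of the values reported for 2005-2009,
    3. the last value reported before 2005,
    4. None if nothing usable was reported.
    """
    reported = {year: v for year, v in series.items() if v is not None}
    if 2010 in reported:
        return reported[2010]
    window = [reported[y] for y in range(2005, 2010) if y in reported]
    if window:
        return sum(window) / len(window)
    earlier = [y for y in reported if y < 2005]
    if earlier:
        return reported[max(earlier)]
    return None

# Hypothetical series for one NUTS 2 region.
print(value_for_2010({2009: 3.0, 2010: 5.0}))
print(value_for_2010({2007: 2.0, 2009: 4.0, 2010: None}))
print(value_for_2010({2000: 1.0, 2003: 7.0}))
```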

Data from WISE: UWWTD and SoE water quality
Field | Description | Unit
H_BOD_discharge | sum of BOD discharges in hinterland (UWWTD) | [t/y]
H_COD_discharge | sum of COD discharges in hinterland (UWWTD) | [t/y]
H_P_discharge | sum of P discharges in hinterland (UWWTD) | [t/y]
H_N_discharge | sum of N discharges in hinterland (UWWTD) | [t/y]
H_uwws_count | number of UWW systems in hinterland (UWWTD) | [t/y]
H_TN_release | total nitrogen release (E-PRTR) | [t/y]
H_TP_release | total phosphorus release (E-PRTR) | [t/y]
S_ammonium | | mg/l N
S_total ammonium | | mg/l N
S_bod5 | | mg/l O2
S_bod7 | | mg/l O2
S_chlorophyll_a | | µg/l
S_codcr | | mg/l O2
S_codmn | | mg/l O2
S_DOC | dissolved organic carbon | mg/l C
S_DO | dissolved oxygen | mg/l O2
S_EC | electrical conductivity | µS/cm
S_KN | Kjeldahl nitrogen | mg/l N
S_nitrate | | mg/l N
S_orthophosphates | | mg/l P
S_OS | oxygen saturation | %
S_ph | |
S_silicate | | mg/l Si
S_T | water temperature | °C
S_TOC | total organic carbon (TOC) | mg/l C
S_TP | total phosphorus | mg/l P

Case study: Drava river catchment (1)
[Maps: Temperature (°C); Precipitation (mm/year)]

Case study: Drava river catchment (2)
[Maps: Ecoregions (Illies); Hydroecoregions (Rebecca project)]

Drava river: 107 monitoring stations (water quantity)

[Figures: hinterland of HR_RV_29111; hinterland of AT_RV_FW61400127]

[Map panels: Temperature; Population density (people/km2); Maize yield (100 kg/ha, 2010); Irrigation water volume (m3/year, 2010); Pigs (heads, 2010)]

Modelling exercise – Drava river catchment
The previously mentioned data were used to generate different classification trees using the WEKA software. We decided to use the ecological status of water bodies (according to the WFD) as the target variable. The target variable can fall into one of the following three classes:
- Good: 33 examples
- Moderate: 42 examples
- Poor: 13 examples
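A useful reference point for any accuracy figure on this dataset is the share of the most frequent class, that is, the accuracy of a model that always predicts "Moderate". The short calculation below derives it from the class counts given above. The baseline itself is not part of the original slides and is added here only as an aid to interpretation.

```python
# Class counts as listed on the slide.
class_counts = {"Good": 33, "Moderate": 42, "Poor": 13}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)
baseline_accuracy = class_counts[majority_class] / total

for name, count in class_counts.items():
    print(f"{name}: {count} examples ({100 * count / total:.1f} %)")
print(f"Total: {total} examples")
print(f"Majority-class baseline: {majority_class}, {100 * baseline_accuracy:.1f} %")
```

With 88 examples in total, always predicting "Moderate" would be right in about 47.7 % of the cases.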

Test results – Drava river catchment (1)
1st Classification Tree (no. of parameters: 32; cross-validation (CV): 64 %; training data: 85 %, i.e. 85 % of all cases are classified correctly by these rules)
- The most important parameter is eco-region: the eco-region "Hungarian Lowlands" has poor ecological status, while the eco-region "Dinaric Western Balkans" has moderate ecological status.
- In the eco-region "Alps" the most significant parameter is the percentage of urban areas, followed by the percentage of water surface.
- Interestingly, a greater number of beehives resulted in a better ecological status of the watercourse.
[Tree figure legend: Number of beehives; CLC1: agriculture; CLC2: artificial; CLC3: forest; CLC4: water bodies]

Test results – Drava river catchment (2)
2nd Classification Tree (no. of parameters: 12; CV: 64 %; training set: 84 %)
- Here the most important parameter is altitude. If it is lower than 161.24 m a.s.l., the ecological status is poor.
- At higher altitudes, the most important parameter is the percentage of forest land. If forest covers more than 89.61 % of the area, the status is good; otherwise the percentage of urban areas becomes important, and if it is less than 1.22 %, the status is good.
- Interestingly (and logically), the ecological status of the higher-lying sections (Strahler <= 5) is better (good vs. moderate).
[Tree figure annotations: threshold; 17 cases prove the rule, 3 failed. Legend: CLC1: agriculture; CLC2: artificial; CLC3: forest; CLC4: water bodies]
If more than 90 % of the hinterland of a station is covered with forest, there is a large probability that this WB will have good ES. If not, it depends on other land uses: with less than 90 % forest and less than 1.2 % urban areas, we can also expect good ES; otherwise we check the forest coverage again. If it is less than 90 % but more than 70 %, we check the river discharge. If forest is less than 70 % (that is, urban area more than 1.2 % and forest less than 70 %), we check the agricultural coverage. If it is more than 30 %, the status is moderate. Otherwise we go to the last check.
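The verbal walk through the 2nd tree can be sketched as a chain of checks. This is only a partial transcription of the text on this slide, not the WEKA model itself. The slide gives no river discharge threshold and does not describe the "last check", so those two branches return a hint instead of a status. The function name, the argument names and the handling of values exactly on a threshold are assumptions.

```python
def tree2_status(altitude_m, forest_pct, urban_pct, agri_pct):
    """Partial sketch of the 2nd tree as described in the slide text.

    Returns a status where the slide states one, otherwise a hint naming
    the next check (whose threshold is not given on the slide).
    """
    if altitude_m < 161.24:
        return "Poor"
    if forest_pct > 89.61:
        return "Good"
    if urban_pct < 1.22:
        return "Good"
    if forest_pct > 70:
        return "check river discharge"  # discharge threshold not given on the slide
    if agri_pct > 30:
        return "Moderate"
    return "further checks"             # the "last check" is not described on the slide

# Hypothetical stations.
print(tree2_status(150, 95, 0.5, 10))
print(tree2_status(500, 80, 3.0, 10))
print(tree2_status(500, 50, 3.0, 40))
```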

Test results – Drava river catchment (3)
3rd Classification Tree (no. of parameters: 13; CV: 65 %; training data: 73 %): a more robust tree that performs better on the validation dataset.
In this case we used techniques to obtain smaller trees. Some information may be lost, but the tree is more robust against new (validation) data. The tree is similar to tree no. 2, only slightly shorter and easier to interpret. The important attributes are altitude, percentage of forest areas, and percentage of water surface.
If more than 90 % of the hinterland of a station is covered with forest, there is a large probability that this WB will have good ES. If not, it depends on other land uses: with less than 90 % forest and more than 0.18 % water surface, we can expect moderate status. Otherwise, with less than 73 % forest, we can hardly expect moderate status.
The model indicates the importance of forest and water surface: if we have enough forest surface, we can afford other activities.
[Tree figure legend: CLC1: agriculture; CLC2: artificial; CLC3: forest; CLC4: water bodies]

Conclusions (1)
1) The most important driver/pressure is land use.
- Forest cover is the dominant land use class, and the threshold of forest coverage for good status is 89 % in the hinterland.
- Water surface in the hinterland is the second most important land use class: the threshold is 0.179 % (with more than 0.179 % water in the hinterland, one can expect moderate status).
- (These are drivers that reduce pressures from other drivers.)
2) For high altitude: if we have enough forest surface (more than 90 %), we can expect good ecological status; if not, we have to see what else we are doing in the hinterland:
- If the forest area is between 70 and 90 %, then the ecological status depends on river discharge (more is better).
- If agriculture covers more than 30 %, then we can expect moderate status; if it covers less than 30 %, further checks are needed.
3) Irrigation and urban areas also seem important.

How to interpret and use the trees
The clear message from all models is that forest coverage is the most important attribute for the ecological status of water bodies in the Drava catchment.
The models provide threshold values of the attributes, based on which a strategy for land use management in the hinterland can be developed. For example, a clear guideline for managers is:
- In the Drava catchment, hinterlands below 161 m a.s.l. tend to be problematic regardless of the land use and need more attention.
- Hinterlands above 161 m a.s.l.: if we keep more than 90 % of the land as forest, there is a high probability of good ES. With less forest, pay attention to the percentages of water surface, agricultural areas and urban areas, and to river discharge. These thresholds are given in the models from the previous slides.
Important to note: the classification trees were trained on the Drava catchment, thus the information they disclose is valid for this catchment only.

Modelling exercise – further work
- Each SoE station is affected by the corresponding drainage area. Therefore, it is more reasonable to aggregate data on hinterland level instead of on FEC level, especially for the geo-climatic factors (e.g., average slope, average annual precipitation, etc.).
- Besides ecological status, we can model other variables from the SoE stations as well (e.g. P and N ranges).
- We will model other catchments and look for similarities. The size of the catchments is still to be discussed.
- We still need to include point source data (from the E-PRTR and UWWTD databases), which will hopefully improve the interpretability of the models.
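One possible way to aggregate a geo-climatic factor from FEC level to hinterland level is an area-weighted mean over all FECs that belong to the hinterland. The slides do not say how the aggregation will be done, so area weighting is an assumption here. The field names `F_AR_km2` and `F_SLOP_A` follow the attribute table, while the FEC records and the function name are hypothetical.

```python
def hinterland_mean(fecs, field, area_field="F_AR_km2"):
    """Area-weighted mean of `field` over the FECs that form one hinterland."""
    total_area = sum(fec[area_field] for fec in fecs)
    if total_area == 0:
        raise ValueError("hinterland has zero total area")
    return sum(fec[field] * fec[area_field] for fec in fecs) / total_area

# Hypothetical hinterland made of two FECs.
hinterland = [
    {"zhyd": "A", "F_AR_km2": 10.0, "F_SLOP_A": 4.0},
    {"zhyd": "B", "F_AR_km2": 30.0, "F_SLOP_A": 12.0},
]

mean_slope = hinterland_mean(hinterland, "F_SLOP_A")
print(mean_slope)
```

Area weighting keeps a small, steep FEC from dominating the hinterland value. Sums such as population count would instead be added up rather than averaged.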

Points for further discussion
- Which additional attributes should we include in our modelling tasks?
- Which target variables should we predict?
- Which type of decision trees seems more useful – classification or regression trees?
- Should we perform modelling tasks for single test cases, river groups with common properties, or for Europe as a whole?
- Where do you see a potential use of the constructed decision trees within the other MARS Tasks?
