Hands-on Soil Infrared Spectroscopy Training Course: Getting the best out of light, 11 – 15 November 2013. R package “randomForest”. Erick Towett.


1 Hands-on Soil Infrared Spectroscopy Training Course: Getting the best out of light
11 – 15 November 2013
R package “randomForest”
Erick Towett

2 Welcome
Outline
- Introduction
- Usage
  - Total element composition of African soils using total reflection X-ray fluorescence (TXRF).
  - Combining MIR and TXRF for the prediction of soil properties.
  - MIRS randomForest prediction models for soil properties.
- Demo: application of RF to MIRS calibration.

3 Introduction I
“randomForest” (RF) implements Breiman’s random forest algorithm for classification and regression based on a forest of trees using random inputs.
- Version: 4.6-7
- Depends: R (>= 2.5.0)
- Description: Classification and regression based on a forest of trees using random inputs.
- URL: http://stat-www.berkeley.edu/users/breiman/RandomForests
- Reference: Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.

4 Features of Random Forests
- RF is fast and easy to implement, and produces highly accurate predictions.
- It runs efficiently on large databases.
- It can handle thousands of input variables without variable deletion and without overfitting.
- It gives estimates of variable importance in the classification.
- RF handles complex data types well.
- It obviates the need to transform predictors to approximate normal distributions.

5 Features of RF
What are the challenges of RF?
- There are many possible alternative nodes.
- Reseeding will give different models.
How does RF work? The out-of-bag (OOB) error estimate:
- In RF, each tree is constructed using a different bootstrap sample from the original data.
- About 1/3 of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
- These left-out (out-of-bag) cases are used to get a running unbiased estimate of the classification error as trees are added to the forest.
- They are also used to get estimates of variable importance.
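The "about 1/3 left out" figure can be checked with a short simulation. The sketch below is plain Python, not the randomForest R API, and the function names are illustrative. For n cases drawn n times with replacement, the expected share never drawn is (1 - 1/n)^n, which approaches e^-1 (about 0.368) as n grows.

```python
import random

def bootstrap_oob_indices(n_cases, rng):
    """Draw one bootstrap sample of size n_cases (with replacement).

    Returns the in-bag indices (with repeats) and the out-of-bag indices,
    i.e. the cases that were never drawn.
    """
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_cases) if i not in drawn]
    return in_bag, out_of_bag

def mean_oob_fraction(n_cases, n_trees, seed=0):
    """Average share of cases left out of the bag over n_trees bootstrap samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        _, out_of_bag = bootstrap_oob_indices(n_cases, rng)
        total += len(out_of_bag) / n_cases
    return total / n_trees

# Simulated share for 1000 cases and 200 trees, next to the theoretical value.
oob_fraction = mean_oob_fraction(n_cases=1000, n_trees=200)
theoretical = (1 - 1 / 1000) ** 1000
print(round(oob_fraction, 3), round(theoretical, 3))
```

Each tree can then be evaluated on its own out-of-bag cases, which is what makes a separate validation set unnecessary for a running error estimate.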

6 How does RF work?
RF can output a list of the predictor variables that are important in predicting the outcome. The randomForest package in R has two measures of importance:
- One is the "total decrease in node impurities from splitting on the variable, averaged over all trees."
- The other is based on a permutation test.
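The permutation idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the package's implementation: randomForest permutes each variable within the out-of-bag cases of every tree, whereas the sketch shuffles one column of a single data set and uses a fixed function (`predict`, hypothetical) as a stand-in for a fitted forest. The toy data are invented.

```python
import random

def mse(predict, rows, targets):
    """Mean squared error of a prediction function over a data set."""
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, column, n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling the values of one predictor column."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, targets)
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild the rows with only the chosen column permuted.
        permuted = [r[:column] + [v] + r[column + 1:] for r, v in zip(rows, shuffled)]
        increases.append(mse(predict, permuted, targets) - baseline)
    return sum(increases) / n_repeats

# Invented toy data: the target depends on column 0 only; column 1 is noise.
data_rng = random.Random(1)
rows = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
targets = [3.0 * r[0] for r in rows]

def predict(row):
    """Stand-in for a fitted model (assumption: not a real forest)."""
    return 3.0 * row[0]

imp_signal = permutation_importance(predict, rows, targets, column=0)
imp_noise = permutation_importance(predict, rows, targets, column=1)
print(imp_signal, imp_noise)
```

A variable the model relies on shows a large increase in error when shuffled; a variable the model ignores shows none.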

7 Usage
Study 1: Variability and patterns in total element composition of sub-Saharan Africa (SSA) soils using TXRF.
The objectives were to:
1. quantify the variability in total element composition of a diverse set of soils from across SSA using TXRF, and
2. explore the patterns in total element composition of the soils analysed.

8 Materials and Methods
Soils from 34 randomly located 100-km² sentinel sites across Africa.

9 Land degradation surveillance framework (LDSF)
- Consistent field protocol
- Soil spectroscopy
- Sentinel sites
- Randomized sampling schemes
LDSF = a hierarchical, spatially stratified random sampling scheme with ten 100 m² plots nested within sixteen 1 km² clusters, nested within 100 km² sites.

10 Materials and Methods
- Soil samples were collected at two depths, 0-20 and 20-50 cm.
- A total of 1074 samples (16 samples per cluster x 2 soil depths x 34 sentinel sites) were used for exploring spectral (TXRF) patterns.
- Total element concentrations for 17 elements: Al, P, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn, Ga, Sr, Y, Ta, and Pb.

11 Materials and Methods
- PCA on the TXRF data.
- RF regression of factors vs the first 5 PCs of the TXRF element concentrations, to confirm whether site or soil-forming factors (e.g., mineralogy, climate, topography and vegetation) are important drivers of total elemental concentrations in the soil, and to view the importance of the predictor variables.
- Site factors were extracted for each site from the LDSF database and Worldclim data; mineralogy data came from XRD analysis (raw semi-quantitative mineralogy data and dominant mineralogy grouping).
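As a rough illustration of the PCA step, the sketch below log-transforms and centres a small concentration table, then extracts the first principal component by power iteration. It is plain Python under stated assumptions: the two-element data are invented, only the leading component is computed, and the course material does not say which PCA routine was actually used.

```python
import math

def log_center(rows):
    """Log-transform concentrations and centre each column on its mean."""
    logged = [[math.log(v) for v in row] for row in rows]
    n, p = len(logged), len(logged[0])
    means = [sum(r[j] for r in logged) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in logged]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centred data."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_component(cov, n_iter=200):
    """Power iteration: leading eigenvector (loadings) and its eigenvalue."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, eigenvalue

# Invented concentrations (mg/kg) for two elements in five samples.
samples = [[a, 3.0 * a * a] for a in (1.0, 2.0, 4.0, 8.0, 16.0)]
centered = log_center(samples)
cov = covariance_matrix(centered)
loadings, eigenvalue = leading_component(cov)

# Share of total variance on the first component, and the sample scores on it.
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = [sum(l * x for l, x in zip(loadings, row)) for row in centered]
print(loadings, explained)
```

Scores such as these (for the first five components in the study) would then serve as the response variables in the RF regression on site factors.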

12 Results
Total element concentration values were within the range reported globally for soil Cr, Mn, Zn, Ni, V, Sr, and Y, and in the high range for Al, Cu, Ta, Pb, and Ga.
Table columns: Element; values compiled from this study (mg kg⁻¹): Mean, Range; reported mean and ranges of background contents of elements in crust and worldwide soils (mg kg⁻¹): Worldwide ranges, Crustal average, Worldwide mean, Median values Ghana soil.
Al 3392794 - 89068 10000-40000---
P 14325 – 2358 ----
K 10893291 - 77898 ----
Ca 978082 - 426431 ----
Ti 42642.6 - 25611 200-240004400--
V 370.7 - 393 5.0-50013560-
Cr 640.7 - 598 1-15001004272
Mn 4661.6 - 6575 9000900418-
Fe 2795420 - 181691 1000-550000---
Ni 190.3 - 364 0.2-500201839
Cu 170.3 - 114 1.0-250551417-29
Zn 290.3 - 138 10-602706245-47
Ga 80.2 - 31 0.4-70151.2-
Sr 1181.2 - 1985 32->1000375147-
Y 130.2 - 109 16-333312-
Ta 30.1 - 16 0.8-5.32.01.1 -
Pb 370.3 - 638 2.0-16338142518-22

13 Results
Significant variations (P < 0.05) in total element composition within and between the sites for the 17 elements analysed. The greatest proportion of total variance and number of significant variance components occurred at the site level (55-88%), followed by the cluster nested within site level (10-40%).
Variance components (Est. = estimate; % = % of total variance):
Element  n     Site          Site*Cluster  Site*Depth    Depth              Residual
               Est.   %      Est.   %      Est.   %      Est.        %      Est.   %
Al       1068  0.966  88     0.112  10     0.004  0.4    0.005       0.45   0.016  1.4
P        1059  0.718  76     0.198  21     0.002  0.2    1.4*10^-21  <0.01  0.025  2.6
K        1065  0.913  71     0.354  28     0.003  0.2    6.8*10^-21  <0.01  0.010  0.8
Ca       1068  2.186  79     0.480  17     0.034  1.2    0.017       0.60   0.051  1.8
Ti       1067  1.398  87     0.199  12     0.001  0.1    0.001       0.04   0.014  0.9
V        1067  1.463  77     0.379  20     0.009  0.5    0.008       0.39   0.053  2.8
Cr       1068  0.808  65     0.384  31     0.005  0.4    0.006       0.46   0.039  3.2
Mn       1067  1.007  68     0.393  27     0.023  1.6    0.008       0.51   0.040  2.7
Fe       1066  1.459  80     0.335  18     0.005  0.3    0.009       0.47   0.026  1.4

14 Results
PCA revealed that patterns in total element concentrations between sites appeared to relate to differences in mineralogical ‘functional groups’. The pattern of clustering of the individual minerals and sorting of heavy minerals (V, Pb, Ni, Cr, Cu, Ti, and Fe) along the positive Dim 1 axis is apparent.
Figure: Biplots (arrow sizes are proportional to the “initial” variability in the elements present) based on the principal components Dim 1 vs Dim 2 and Dim 1 vs Dim 3, on the log-transformed soil total element concentration data from all sites analysed.

15 Results
- Strong observed within-site and between-site variations in many elements can serve as a diagnostic of soil fertility potential.
- Elements clustered differently in the sample sets from different sentinel sites, indicating a wide variation in associations.
- Some elements are poorly represented (short arrows in the PCA).
Figure: Biplots based on PCA of element concentrations for 4 sentinel sites.

16 Results
- RF model performances were acceptable, with R² > 0.75.
- The most important variables were cluster, topography, land use, precipitation and temperature.
- The importance of cluster is explained by spatial correlation at distances of < 1 km.
Figure: Variable importance plots showing the model accuracies and mean decrease in accuracy (%IncMSE) of the Random Forests regression of TXRF element concentrations against mineralogy + site/soil-forming factors, (a) including cluster and (b) excluding cluster.

17 Usage
Study 2: Potential of combining MIR and TXRF spectroscopy for the prediction of soil properties.
Objective: to evaluate whether TXRF can complement MIR for predicting soil test values, especially for tests that are poorly predicted by MIR (e.g. extractable P and K, and some micronutrients).

18 Materials and Methods
Georeferenced soil samples associated with the AfSIS Project:
- a total of 700 soil samples
- from 44 random 100-km² sentinel sites
- stratified according to Köppen-Geiger climatic zones
- distributed across SSA.

19 Materials and Methods
- Samples were analysed using a Fourier-transform MIR spectrometer.
- Infrared absorbance spectra were recorded at 4 cm⁻¹ intervals in the range of 400 to 4000 cm⁻¹.
- The average of the spectra for 4 replicates was taken.
- TXRF methodology was used for total elemental concentrations in each soil sample.
Figures: Fourier-transform MIR spectrometer; TXRF spectrometer.

20 Materials and Methods
RF-OOB calibration models were developed (n = 700):
- to predict the reference properties from the TXRF total element composition, using the raw total element concentration data as ‘spectra’;
- using raw TXRF spectra in conjunction with 1st derivative MIR spectra to predict the reference soil properties.
RF was used to calibrate the residuals of the predictions from the MIR spectral data to the raw TXRF total element data, as mixing different data types in the predictor variables might affect the variable importance weights in the fitted models.
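The residual-calibration idea (fit MIR first, then model what MIR leaves unexplained using TXRF) can be sketched as a two-stage fit. To stay self-contained, the sketch swaps the random forest for a one-variable least-squares line at each stage and evaluates in-sample rather than out-of-bag; the data and variable names are invented for illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b*x; a simple stand-in for a random forest."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    return lambda v: a + b * v

def rmse(predicted, observed):
    """Root mean squared error."""
    return (sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)) ** 0.5

# Invented toy data: one MIR-derived predictor, one TXRF-derived predictor.
mir = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
txrf = [0.5, -1.0, 2.0, 0.0, -1.5, 1.0]
soil_property = [2.0 * m + 0.8 * t for m, t in zip(mir, txrf)]

# Stage 1: predict the soil property from MIR alone.
stage1 = fit_line(mir, soil_property)
pred_mir = [stage1(m) for m in mir]

# Stage 2: calibrate the stage-1 residuals against the TXRF data.
residuals = [y - p for y, p in zip(soil_property, pred_mir)]
stage2 = fit_line(txrf, residuals)
pred_combined = [p + stage2(t) for p, t in zip(pred_mir, txrf)]

rmse_mir_only = rmse(pred_mir, soil_property)
rmse_combined = rmse(pred_combined, soil_property)
print(rmse_mir_only, rmse_combined)
```

Keeping the two data types in separate stages means each model's variable importances refer to one kind of predictor only, which is the concern the slide raises about mixing them.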

21 Results
- MIR spectra resulted in very good prediction models using RF out-of-bag validation (R² > 0.80) for organic C and N, total C and N, exchangeable Ca, Mehlich-3 Al and pH.
- Also predicted well (R² > 0.60) were Ca/Mg ratio, exchangeable bases, exchangeable Mg, phosphorus sorption index (PSI), and water- and calgon-dispersed particles analysed by laser diffraction for sand, clay and silt content.

22 Results
Calibration models were not satisfactory (R² < 0.60) for Mehlich-3 extractable K, Mn, Fe, Cu, B, Zn, P, S, and Na, exchangeable acidity, electrical conductivity (ECd), exchangeable sodium percentage (ESP), exchangeable sodium ratio (ESR), and air-dispersed particles for silt, clay and sand content.
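The R² bands used on these results slides can be made explicit with a small helper. The sketch assumes R² is computed as 1 - SS_res / SS_tot from observed and out-of-bag predicted values; the slides do not state the exact formula, and the function names and the handling of a value exactly at 0.60 are illustrative choices.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rate_model(r2):
    """Label a calibration using the thresholds quoted on the slides."""
    if r2 > 0.80:
        return "very good"
    if r2 > 0.60:
        return "good"
    # The slides only say R2 < 0.60; treating exactly 0.60 the same way is an assumption.
    return "not satisfactory"

# Invented observed and predicted values for one soil property.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.3, 3.8, 5.1]
r2 = r_squared(observed, predicted)
print(round(r2, 3), rate_model(r2))
```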

23 Results
- RF was able to improve prediction accuracies when the raw TXRF spectra were added to the MIR data, e.g. ECd (63% reduction in RMSE), Mehlich-3 S (54%), exchangeable Na (53%), ESP (50%), ESR (50%), total C (29%), Mehlich-3 B (28%), Mehlich-3 Mn (26%), exchangeable Mg (17%), Mehlich-3 Cu (15%), Mehlich-3 Fe (11%), organic C (10%), Mehlich-3 Zn (6%), and silt content (8-50 microns) of air-dispersed particles by laser diffraction (4%).
- The improvement in the predictions was mostly due to TXRF detecting a few outlier samples that were different from the rest of the samples.
- TXRF data used as a predictor did not add value to MIR beyond identifying outlying samples; these could not be detected as MIR spectral outliers, hence TXRF may be used as an outlier detector.
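The percentage reductions in RMSE quoted above follow from a simple ratio. The sketch below shows the calculation in plain Python; the RMSE values in the example are invented and are not taken from the study.

```python
def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(observed)
    return (sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n) ** 0.5

def percent_rmse_reduction(rmse_before, rmse_after):
    """Relative drop in RMSE, in percent, when moving from one model to another."""
    return 100.0 * (rmse_before - rmse_after) / rmse_before

# Invented example: RMSE of an MIR-only model vs a model with TXRF data added.
reduction = percent_rmse_reduction(rmse_before=2.0, rmse_after=0.74)
print(round(reduction, 1))
```

Because a few extreme samples can dominate a squared-error measure, a large reduction can come from correcting a handful of outliers, which matches the explanation given on the slide.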

24 Usage
Study 3: Analysis of MIRS randomForest prediction models for soil properties.
Ongoing study: an attempt to offer an in-depth analysis of random forest models for the prediction of a number of soil properties using MIR spectroscopy.

25 Materials and Methods
- 1907 soil samples were scanned with an MIR spectrometer at a resolution of 4 cm⁻¹.
- The 1st derivative of the spectral range 601.7-4001.6 cm⁻¹ was calculated with a smoothing interval of 21 data points using the soil.spec package in R.
- RF-OOB models were built to predict the reference properties from the MIRS 1st derivative spectra using the entire data set.
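A smoothed first derivative over a 21-point window can be sketched as below. The sketch assumes a Savitzky-Golay style filter (a least-squares polynomial of degree 1 or 2 fitted in a moving window, for which the derivative at the window centre is sum(k * y_k) / sum(k^2) per unit spacing). Whether soil.spec uses exactly this filter is not stated in the slides, and the function name and edge handling are illustrative.

```python
def smoothed_first_derivative(values, window=21, spacing=1.0):
    """First derivative from a least-squares fit in a moving window.

    Valid for a fitted polynomial of degree 1 or 2. Points closer than
    window // 2 to either end are dropped, so the output is shorter than
    the input by window - 1 points.
    """
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd number >= 3")
    half = window // 2
    offsets = range(-half, half + 1)
    denominator = sum(k * k for k in offsets) * spacing
    derivative = []
    for i in range(half, len(values) - half):
        derivative.append(sum(k * values[i + k] for k in offsets) / denominator)
    return derivative

# Invented check signals: a straight line (slope 2) and a parabola (slope 2x).
line = [2.0 * x + 1.0 for x in range(30)]
parabola = [float(x * x) for x in range(30)]
d_line = smoothed_first_derivative(line)
d_parabola = smoothed_first_derivative(parabola)
print(d_line[0], d_parabola[0])
```

For real spectra, `spacing` would be the wavenumber step between adjacent points, so that the derivative is expressed per cm⁻¹.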

26 Preliminary Results

27 Demo: R package “randomForest”

28 R package “randomForest”
Thank you for your attention

