Hands-on Soil Infrared Spectroscopy Training Course: Getting the best out of light. 11 – 15 November 2013. R package “randomForest”. Erick Towett.


2 Welcome
Outline
- Introduction
- Usage
  - Total element composition of African soils using total reflection X-ray fluorescence (TXRF).
  - Combining MIR and TXRF for the prediction of soil properties.
  - MIRS randomForest prediction models for soil properties.
- Demo: application of RF to MIRS calibration.

3 Introduction I
“randomForest” (RF) implements Breiman’s random forest algorithm for classification and regression, based on a forest of trees using random inputs.
- Depends: R (>= 2.5.0)
- Reference: Breiman, L. (2001), Random Forests, Machine Learning 45(1).

4 Features of Random Forests
- RF is fast, easy to implement, and produces highly accurate predictions.
- It runs efficiently on large databases.
- It can handle thousands of input variables without variable deletion and without overfitting.
- It gives estimates of variable importance in the classification.
- RF handles complex data types well.
- It obviates the need to transform predictors to approximate normal distributions.

5 Features of RF
What are the challenges of RF?
- There are many possible alternative nodes.
- Reseeding will give different models.
How does RF work? The out-of-bag (OOB) error estimate
- In RF, each tree is constructed using a different bootstrap sample from the original data.
- About 1/3 of the cases are left out of the bootstrap sample and are not used in the construction of the kth tree.
- These left-out (out-of-bag) cases are used to get a running unbiased estimate of the classification error as trees are added to the forest.
- They are also used to get estimates of variable importance.
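The "about 1/3" figure follows from sampling with replacement: the chance that a given case is missed by all n draws is (1 - 1/n)^n, which approaches exp(-1), roughly 0.368, as n grows. The sketch below checks this by simulation using only the Python standard library. The function name `oob_fraction` and the sizes used are illustrative assumptions, not part of the randomForest package.

```python
import math
import random


def oob_fraction(n_cases, rng):
    """Draw one bootstrap sample of size n_cases and return the
    fraction of cases that were never drawn (the out-of-bag cases)."""
    in_bag = {rng.randrange(n_cases) for _ in range(n_cases)}
    return 1.0 - len(in_bag) / n_cases


rng = random.Random(2013)     # fixed seed so the run is repeatable
n_cases, n_trees = 700, 200   # illustrative sizes only (assumption)
fractions = [oob_fraction(n_cases, rng) for _ in range(n_trees)]
mean_oob = sum(fractions) / n_trees
expected = (1.0 - 1.0 / n_cases) ** n_cases  # close to exp(-1)

print(f"mean OOB fraction over {n_trees} bootstrap samples: {mean_oob:.3f}")
print(f"(1 - 1/n)^n = {expected:.3f}, exp(-1) = {math.exp(-1):.3f}")
```

Each tree therefore has its own set of unseen cases, which is what allows an error estimate without a separate validation set.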

6 How does RF work?
RF can output a list of predictor variables that are important in predicting the outcome. The randomForest package in R has two measures of importance.
- One is the "total decrease in node impurities from splitting on the variable, averaged over all trees."
- The other is based on a permutation test.
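The permutation idea can be sketched independently of any forest: shuffle one predictor, predict again, and see how much the error grows. In the sketch below, `toy_model` is a hypothetical stand-in for a fitted forest and the data are synthetic. The randomForest package applies the permutation to the out-of-bag cases of each tree, which this simplified version does not reproduce.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, column, rng):
    """Increase in MSE after shuffling the values of one predictor column."""
    baseline = mse(y, [predict(r) for r in rows])
    shuffled = [r[column] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = []
    for r, v in zip(rows, shuffled):
        new_row = list(r)
        new_row[column] = v
        permuted_rows.append(new_row)
    return mse(y, [predict(r) for r in permuted_rows]) - baseline


def toy_model(row):
    # Hypothetical stand-in for a fitted model: it uses column 0 only.
    return 3.0 * row[0]


rng = random.Random(1)
# Synthetic data (assumption): column 0 drives y, column 1 is pure noise.
rows = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(500)]
y = [3.0 * r[0] + rng.gauss(0, 0.1) for r in rows]

importance = [permutation_importance(toy_model, rows, y, j, rng) for j in range(2)]
print("increase in MSE per column:", importance)
```

Shuffling the informative column raises the error, while shuffling a column the model never uses leaves it unchanged.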

7 Usage
Study 1: Variability and patterns in total element composition of sub-Saharan Africa (SSA) soils using TXRF.
The objectives were to:
1. quantify the variability in total element composition of a diverse set of soils across SSA using TXRF, and
2. explore the patterns in total element composition of the soils analysed.

8 Materials and Methods
Soils from 34 randomly located 100-km² sentinel sites across Africa.

9 Land degradation surveillance framework (LDSF)
- Consistent field protocol
- Soil spectroscopy
- Sentinel sites
- Randomized sampling schemes
LDSF is a hierarchical, spatially stratified random sampling scheme with ten 100 m² plots nested within sixteen 1 km² clusters, nested within 100 km² sites.

10 Materials and Methods
- Soil samples were collected at two depths (0-20 cm and a deeper layer).
- A total of 1074 samples (16 samples per cluster x 2 soil depths x 34 sentinel sites) were used for exploring spectral (TXRF) patterns.
- Total element concentrations were determined for 17 elements: Al, P, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn, Ga, Sr, Y, Ta and Pb.
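The product quoted in parentheses can be checked directly. Reading it as 16 x 2 x 34 gives a nominal design size slightly larger than the 1074 samples reported. The slide does not explain the difference, so the variable names below and any interpretation of the shortfall are assumptions.

```python
# Figures quoted on the slide; the variable names are illustrative.
samples_per_depth_per_site = 16
soil_depths = 2
sentinel_sites = 34

nominal = samples_per_depth_per_site * soil_depths * sentinel_sites
reported = 1074
shortfall = nominal - reported

print(f"nominal design size: {nominal}")
print(f"reported samples:    {reported}")
print(f"difference:          {shortfall}")
```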

11 Materials and Methods
- PCA on the TXRF data.
- RF regression relating site and soil-forming factors to the first 5 PCs of the TXRF element concentrations, to confirm whether site or soil-forming factors (e.g., mineralogy, climate, topography and vegetation) are important drivers of total elemental concentrations in the soil, and to view the importance of the predictor variables.
- Site factors were extracted for each site from the LDSF database and WorldClim data. Mineralogy data came from XRD analysis (raw semi-quantitative mineralogy data and dominant mineralogy grouping).
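As a rough illustration of the PCA step, the sketch below extracts the first principal component of a small synthetic data set by power iteration on the covariance matrix. The data stand in for log-transformed element concentrations and are entirely made up. All function names are illustrative, and in practice a library routine would be used rather than this hand-rolled version.

```python
import math
import random


def centre_columns(data):
    """Subtract the column mean from every value."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]


def covariance_matrix(centred):
    """Sample covariance matrix of already-centred data."""
    n, p = len(centred), len(centred[0])
    return [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(cov, iterations=200):
    """Power iteration: returns (loadings, variance) of the first PC.
    Assumes the start vector is not orthogonal to the leading eigenvector."""
    p = len(cov)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    variance = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(p))
                   for i in range(p))
    return vec, variance


rng = random.Random(7)
data = []
for _ in range(300):
    factor = rng.gauss(0, 1)  # shared driver of the first two columns (assumption)
    data.append([factor + rng.gauss(0, 0.2),
                 factor + rng.gauss(0, 0.2),
                 rng.gauss(0, 0.2)])

centred = centre_columns(data)
cov = covariance_matrix(centred)
loadings, variance = first_principal_component(cov)
scores = [sum(l * v for l, v in zip(loadings, row)) for row in centred]
explained = variance / sum(cov[i][i] for i in range(len(cov)))

print("PC1 loadings:", [round(l, 3) for l in loadings])
print(f"share of total variance on PC1: {explained:.2f}")
```

The resulting `scores` play the role of the PC values that the study then related to site and soil-forming factors with RF.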

12 Results
Total element concentrations were within the range reported globally for soil Cr, Mn, Zn, Ni, V, Sr and Y, and in the high range for Al, Cu, Ta, Pb and Ga.
Table: values compiled from this study (mg kg⁻¹; mean and range per element) alongside reported means and ranges of background contents of elements in the crust and in worldwide soils (mg kg⁻¹): worldwide ranges, crustal average, worldwide mean, and median values for Ghana soil. Elements listed: Al, P, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn, Ga, Sr, Y, Ta and Pb.

13 Results
- Significant variations (P < 0.05) in total element composition within and between the sites for the 17 elements analysed.
- The greatest proportion of total variance and number of significant variance components occurred at the site level (55-88%), followed by the cluster nested within site level (10-40%).
Table: variance component estimates and percentage of total variance per element (n; Site, Site*Cluster, Site*Depth, Depth, Residual). Elements shown: Al, P, K, Ca, Ti, V, Cr, Mn and Fe.

14 Results
- PCA revealed that patterns in total element concentrations between sites appeared to relate to differences in mineralogical ‘functional groups’.
- The pattern of clustering of the individual minerals and sorting of heavy minerals (V, Pb, Ni, Cr, Cu, Ti and Fe) along the positive Dim 1 axis is apparent.
Figure: Biplots (arrow sizes are proportional to the “initial” variability in the elements present) based on principal components Dim 1 vs Dim 2 and Dim 1 vs Dim 3, on the log-transformed soil total element concentrations from all sites analysed.

15 Results
- The strong within-site and between-site variations observed in many elements can serve to diagnose soil fertility potential.
- Elements clustered differently in the sample sets from different sentinel sites, indicating a wide variation in associations.
- Some elements are poorly represented (short arrows in the PCA).
Figure: Biplots based on PCA of element concentrations for 4 sentinel sites.

16 Results
- RF model performances were acceptable, with R² > 0.75.
- The most important variables were cluster, topography, land use, precipitation and temperature.
- The importance of cluster is explained by spatial correlation at distances of < 1 km.
Figure: Variable importance plots showing the model accuracies and mean decrease in accuracy (%IncMSE) of the Random Forests regression of TXRF element concentrations against mineralogy plus site/soil-forming factors, (a) including cluster and (b) excluding cluster.

17 Usage
Study 2: Potential of combining MIR and TXRF spectroscopy for the prediction of soil properties.
Objective:
- To evaluate whether TXRF can complement MIR for predicting soil test values, especially for tests that are poorly predicted by MIR (e.g. extractable P and K, and some micronutrients).

18 Materials and Methods
Georeferenced soil samples associated with the AfSIS Project:
- a total of 700 soil samples
- from 44 random 100-km² sentinel sites,
- stratified according to Köppen-Geiger climatic zones,
- distributed across SSA.

19 Materials and Methods
- Samples were analysed using a Fourier-transform MIR spectrometer.
- Infrared absorbance spectra were recorded at 4 cm⁻¹ intervals in the range of 400 to 4000 cm⁻¹.
- The average of the spectra for 4 replicates was taken.
- TXRF methodology was used for total elemental concentrations in each soil sample.
Figures: Fourier-transform MIR spectrometer; TXRF spectrometer.
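The replicate averaging is a point-by-point mean across the 4 spectra of a sample. The sketch below builds the wavenumber grid implied by the slide (400 to 4000 cm⁻¹ in steps of 4 cm⁻¹) and averages four synthetic replicates. The absorbance values are invented and the function names are illustrative.

```python
import random


def wavenumber_grid(start=400, stop=4000, step=4):
    """Wavenumbers in cm^-1, both end points included."""
    return list(range(start, stop + 1, step))


def average_replicates(replicates):
    """Point-by-point mean of several spectra recorded on the same grid."""
    n = len(replicates)
    return [sum(values) / n for values in zip(*replicates)]


grid = wavenumber_grid()
rng = random.Random(19)
# Four synthetic replicate spectra (assumption): a flat baseline plus noise.
replicates = [[1.0 + rng.gauss(0, 0.01) for _ in grid] for _ in range(4)]
mean_spectrum = average_replicates(replicates)

print(f"{len(grid)} points from {grid[0]} to {grid[-1]} cm^-1")
print(f"first averaged absorbance: {mean_spectrum[0]:.4f}")
```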

20 Materials and Methods
RF-OOB calibration models were developed (n = 700):
- to predict the reference properties from the TXRF total element composition, using the raw total element concentration data as ‘spectra’;
- using raw TXRF spectra in conjunction with 1st derivative MIR spectra to predict the reference soil properties;
- using RF to calibrate the residuals of the predictions from the MIR spectral data to the raw TXRF total element data, as mixing different data types in the predictor variables might affect the variable importance weights in the fitted models.
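The residual approach in the last bullet is a two-stage calibration: fit the property on MIR, then fit what is left over on TXRF, and add the two predictions. The sketch below shows only that structure. It swaps in ordinary least squares on a single synthetic feature per instrument in place of RF, and it reports in-sample errors rather than out-of-bag ones, so none of the numbers relate to the study.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((v - mean_x) * (w - mean_y) for v, w in zip(x, y))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


rng = random.Random(42)
# Synthetic stand-ins (assumption): one MIR feature, one TXRF feature.
mir = [rng.uniform(0, 1) for _ in range(200)]
txrf = [rng.uniform(0, 1) for _ in range(200)]
y = [2.0 * m + 1.0 * t + rng.gauss(0, 0.05) for m, t in zip(mir, txrf)]

# Stage 1: calibrate the soil property on the MIR feature only.
a1, b1 = fit_line(mir, y)
stage1 = [a1 + b1 * m for m in mir]
residuals = [obs - pred for obs, pred in zip(y, stage1)]

# Stage 2: calibrate the stage-1 residuals on the TXRF feature.
a2, b2 = fit_line(txrf, residuals)
combined = [p + a2 + b2 * t for p, t in zip(stage1, txrf)]

rmse_stage1 = rmse(y, stage1)
rmse_combined = rmse(y, combined)
print(f"RMSE, MIR only:             {rmse_stage1:.3f}")
print(f"RMSE, MIR + TXRF residuals: {rmse_combined:.3f}")
```

Keeping the two data types in separate stages means the importance weights of each model refer to one kind of predictor only, which is the concern raised on the slide.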

21 Results
- MIR spectra resulted in very good prediction models using RF out-of-bag validation (R² > 0.80) for organic C and N, total C and N, exchangeable Ca, Mehlich-3 Al and pH.
- Also predicted well (R² > 0.60) were Ca/Mg ratio, exchangeable bases, exchangeable Mg, phosphorus sorption index (PSI), and water- and calgon-dispersed particles analysed by laser diffraction for sand, clay and silt content.
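The R² thresholds used on these slides can be applied to any set of observed values and out-of-bag predictions. The sketch below uses the common definition R² = 1 - SS_res / SS_tot. The slides do not state which definition was used, so that choice, the handling of a value of exactly 0.60, and the toy numbers are assumptions.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rating(r2):
    """Labels follow the thresholds quoted on the slides."""
    if r2 > 0.80:
        return "very good"
    if r2 > 0.60:
        return "predicted well"
    return "not satisfactory"  # exactly 0.60 lands here (assumption)


# Toy values, not taken from the study.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
r2 = r_squared(observed, predicted)
print(f"R2 = {r2:.2f} ({rating(r2)})")
```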

22 Results
Calibration models were not satisfactory (R² < 0.60) for Mehlich-3 extractable K, Mn, Fe, Cu, B, Zn, P, S and Na, exchangeable acidity, electrical conductivity (ECd), exchangeable sodium percentage (ESP), exchangeable sodium ratio (ESR), and air-dispersed particles for silt, clay and sand content.

23 Results
- RF was able to improve prediction accuracies when the raw TXRF spectra were added to the MIR data, e.g. ECd (63% reduction in RMSE), Mehlich-3 S (54%), exchangeable Na (53%), ESP (50%), ESR (50%), total C (29%), Mehlich-3 B (28%), Mehlich-3 Mn (26%), exchangeable Mg (17%), Mehlich-3 Cu (15%), Mehlich-3 Fe (11%), organic C (10%), Mehlich-3 Zn (6%), and silt content (8-50 microns) of air-dispersed particles by laser diffraction (4%).
- The improvement in the predictions was mostly due to TXRF detecting a few outlier samples that were different from the rest of the samples.
- TXRF data used as a predictor did not add value to MIR beyond identifying outlying samples. These could not be detected as MIR spectral outliers, hence TXRF may be used as an outlier detector.
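A percentage reduction in RMSE, as quoted above, is presumably the drop in RMSE relative to the MIR-only model. The sketch below states that reading explicitly. The formula is an assumption about how the figures were derived, and the two RMSE values are hypothetical numbers chosen only to give a round result.

```python
def percent_reduction(rmse_before, rmse_after):
    """Relative drop in RMSE, in percent of the starting value."""
    return 100.0 * (rmse_before - rmse_after) / rmse_before


# Hypothetical RMSE values, not taken from the study.
rmse_mir_only = 2.0
rmse_mir_plus_txrf = 0.74
reduction = percent_reduction(rmse_mir_only, rmse_mir_plus_txrf)
print(f"reduction in RMSE: {reduction:.0f}%")
```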

24 Usage
Study 3: Analysis of MIRS randomForest prediction models for soil properties.
- Ongoing study.
- An attempt to offer an in-depth analysis of random forest models for the prediction of a number of soil properties using MIR spectroscopy.

25 Materials and Methods
- 1907 soil samples were scanned with an MIR spectrometer at a resolution of 4 cm⁻¹.
- The 1st derivative over the spectral range used was calculated with a smoothing interval of 21 data points, using the soil.spec package in R.
- RF-OOB models were built to predict the reference properties from the MIRS 1st derivative spectra using the entire data set.
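To show what a smoothed first derivative involves, the sketch below applies a 21-point centred moving average followed by a central difference. The slide does not name the filter used by soil.spec, so this plain moving average is a swapped-in technique, not the package's method. The wavenumber grid is borrowed from Study 2 and the linear toy spectrum is invented for checking the arithmetic.

```python
def moving_average(values, window=21):
    """Centred moving average; only positions with a full window are returned."""
    half = window // 2
    return [sum(values[i - half:i + half + 1]) / window
            for i in range(half, len(values) - half)]


def first_derivative(values, step):
    """Central-difference derivative with respect to wavenumber."""
    return [(values[i + 1] - values[i - 1]) / (2.0 * step)
            for i in range(1, len(values) - 1)]


step = 4.0  # cm^-1 spacing, borrowed from Study 2 (assumption)
wavenumbers = [400.0 + step * i for i in range(901)]
# Toy spectrum (assumption): a straight line with slope 0.001 per cm^-1.
spectrum = [0.5 + 0.001 * w for w in wavenumbers]

smoothed = moving_average(spectrum, window=21)
derivative = first_derivative(smoothed, step)

print(f"{len(spectrum)} raw points, {len(smoothed)} smoothed, "
      f"{len(derivative)} derivative points")
print(f"derivative at the first retained point: {derivative[0]:.4f}")
```

Smoothing before differencing matters because differentiation amplifies high-frequency noise in the raw absorbance values.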

26 Preliminary Results

27 Demo: R package “randomForest”

28 R package “randomForest”
Thank you for your attention