Longer-Lead Water-Supply Forecasts - Statistical Forecasting with Optimal Climate Predictor Selection
Hamid Moradkhani, Department of Civil and Environmental Engineering
Seasonal to two year forecasting workshop for the Colorado Basin, March 21-22
Background and Motivation
- Interest in water supply forecasting has grown markedly in the western US because of population growth and increasing demands for water.
- In the western US, the NWRFC and the NRCS jointly issue seasonal water supply outlook (WSO) forecasts of naturalized or unimpaired flow.
- Successful management of the West's water supply is necessary to provide an uninterrupted and dependable supply that meets all needs.
- One important aspect of that management is accurate and reliable forecasts of seasonal streamflow volumes.
- Longer-lead forecasts are useful to water managers and decision makers but difficult to make because of the uncertainty in future winter and spring climate conditions and the lack of snowpack information.
Developing a Forecast Model
- Model selection: statistical regression-based models
- A regression model consists of:
  - a dependent variable
  - predictor variable(s)
- A regression model establishes a linear relationship between the predictor variable(s) and the dependent variable (predictand).
Developing a Forecast Model
- The dependent variable is the total volumetric flow over a particular period at a specific point in a basin.
- Total Volumetric Flow = Sum(Apr to Sept Volumes)
- Total volume is what we want to predict.
Yakima River Basin
- The Yakima River is located in central Washington in the Yakima River Basin and is approximately 215 miles long.
- The Yakima Basin drains approximately 6,150 square miles.
- The basin is bordered on the west by the Cascade Mountains, on the north by the Wenatchee Mountains, on the east by the Columbia, and on the south by the Simcoe Mountains.
- The climate in most of the basin is dry, with a mean annual precipitation over the entire basin of 27 inches. Most of the precipitation falls during the winter months as snow in the mountains.
Sprague River Basin
- The Sprague River is located in southwestern Oregon in the Upper Klamath River Basin and is approximately 75 miles long.
- The Sprague River drains an area east of the Cascade Mountains of approximately 1,600 square miles.
- The climate in most of the basin is much drier than that of western Oregon and has more extreme temperatures, especially in the winter months.
- Snowfall accounts for 30 percent of the annual precipitation in the valleys and as much as 50 percent in the mountains.
Rogue River Basin
- The Rogue River is located in southwestern Oregon in the Rogue River Basin and is approximately 220 miles long.
- The Rogue River drains an area between the Cascade Mountains and the Pacific Ocean of approximately 5,160 square miles.
- The climate of southwestern Oregon is cool and wet in the winter and among the hottest and driest in the western Cascades in the summer.
- Mean annual precipitation ranges from 20 to 30 inches in the headwaters of the basin and from 80 to 100 inches near the Oregon coast.
Study Basins
- Yakima River Basin: location central Washington; size 6,150 square miles
- Rogue River Basin: location southwestern Oregon; size 5,160 square miles
- Sprague River Basin: location southwestern Oregon; size 1,600 square miles
[Figure: for each basin, a plot of monthly streamflow (1000-AF) and MAP (in) by month.]
Developing a Forecast Model
- Our objective is to establish a linear relationship between several predictor variables (x) and a predictand (y).
- Given a set of data that consists of n observations and j predictor variables, the problem becomes finding the regression function.
- Each of the j predictor variables has its own regression parameter, b_j, and the model has a single regression constant, b_0.
- The best possible estimates of the regression parameters are the ones that minimize the sum of the squared errors.
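The regression function itself is not preserved on the slide. Assuming the standard multiple linear regression form, y = b_0 + b_1 x_1 + ... + b_j x_j, a minimal least-squares sketch in Python (NumPy) could look like the following. The function names and the synthetic data are illustrative only and are not taken from the study.

```python
import numpy as np

def fit_ols(X, y):
    """Return [b0, b1, ..., bj] that minimize the sum of squared errors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # A leading column of ones carries the regression constant b0.
    A = np.column_stack([np.ones(len(y)), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def predict_ols(coeffs, X):
    """Evaluate y = b0 + sum_j b_j * x_j for each row of X."""
    X = np.asarray(X, dtype=float)
    return coeffs[0] + X @ coeffs[1:]

# Synthetic, noise-free example: 30 observations, 3 predictors (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 3))
y_demo = 5.0 + X_demo @ np.array([2.0, -1.0, 0.5])
b = fit_ols(X_demo, y_demo)
```

Because the synthetic predictand is exactly linear in the predictors, `b` recovers the constant and the three slopes used to build it.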
Identifying Potential Predictors
- Explore the relationship between streamflow and:
  - snow water equivalent (SWE)
  - precipitation for past months
  - streamflow for past months
- Streamflow in the West is the result of:
  - accumulation of seasonal snowpack over the winter months
  - melting of this snowpack over the spring and summer
Identifying Potential Predictors
- Fall and winter precipitation can also provide information about spring runoff.
- This gives us information about the soil moisture state in the basin.
- Total Fall and Winter Precipitation = Sum(Oct to Forecast Issue Date)
Identifying Potential Predictors
- Past months' streamflow also provides useful information about future streamflow volumes.
- Combining these three predictors results in the following:
[Figure: timeline from January of the previous year through the Apr-Sep forecast window, marking the forecast issue date and the periods covered by antecedent precipitation, antecedent streamflow, and SWE.]
Identifying Potential Predictors
- Recent research has shown that forecast accuracy can be improved by including large-scale climate information in forecast models.
- Climate teleconnection indices commonly used for streamflow forecasting in the western US:
  - Southern Oscillation Index (SOI)
  - Pacific Decadal Oscillation (PDO)
  - Multivariate El Nino Southern Oscillation Index (MEI)
  - Pacific North American Index (PNA)
  - Trans-Nino Index (TNI)
Identifying Potential Predictors
- So how do we incorporate large-scale climate information into seasonal forecasts?
- Goal: establish a relationship between climate information and spring runoff.
- Analysis: investigate how climate information from the previous year relates to spring runoff of the forecast year.
Identifying Potential Predictors
- Correlation analysis can be performed by taking monthly or 3-month aggregations of climate data and correlating them with spring runoff.
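As a rough illustration of this correlation screening, the sketch below builds running 3-month means from a monthly index and computes Pearson's r against a runoff series. The array layout (years by months), the function names, and the synthetic data are assumptions made for the example, not the study's data or code.

```python
import numpy as np

def three_month_means(monthly):
    """(n_years, n_months) monthly values -> (n_years, n_months - 2) running 3-month means."""
    monthly = np.asarray(monthly, dtype=float)
    return (monthly[:, :-2] + monthly[:, 1:-1] + monthly[:, 2:]) / 3.0

def correlate_with_runoff(climate, runoff):
    """Pearson correlation between each column of `climate` and the runoff series."""
    climate = np.asarray(climate, dtype=float)
    runoff = np.asarray(runoff, dtype=float)
    c = climate - climate.mean(axis=0)
    r = runoff - runoff.mean()
    return (c.T @ r) / np.sqrt((c ** 2).sum(axis=0) * (r ** 2).sum())

# Synthetic example: 30 years of a monthly index; runoff is driven by months 4-6.
rng = np.random.default_rng(1)
index_monthly = rng.normal(size=(30, 12))
runoff = 3.0 * index_monthly[:, 4:7].mean(axis=1) + rng.normal(scale=0.1, size=30)

seasonal = three_month_means(index_monthly)
r_values = correlate_with_runoff(seasonal, runoff)
best_window = int(np.argmax(np.abs(r_values)))  # index of the first month in the best window
```

Ranking windows by the absolute value of r keeps strongly negative relationships in play as well as positive ones.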
Identifying Potential Predictors
- Combining SWE, climate information from past months, precipitation from past months, and streamflow from past months:
[Figure: timeline from January of the previous year through the Apr-Sep forecast window, marking the forecast issue date and the periods covered by 3-month climate data, antecedent precipitation, antecedent streamflow, and SWE.]
Identifying Potential Predictors
- We have a set of potential predictors that capture some of the physical processes driving streamflow. How do we use them in a regression model?
- Looking at the data, we find that our predictors are intercorrelated.
Intercorrelated Predictor Variables
- If the predictor variables have minimal correlation among themselves, then ordinary least squares (OLS) regression works as intended.
- If, on the other hand, the predictor variables are mutually correlated, they contain redundant information, and the parameter estimates become unreliable: their variances are inflated and they are sensitive to small changes in the data.
- Several remediation techniques are available for such datasets:
  - Principal component regression
  - Partial least squares regression
  - Z-score regression
  - Independent component regression
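One common way to quantify how much predictors overlap is the variance inflation factor (VIF). It is not mentioned in the slides and is shown here only as an illustrative diagnostic. The sketch uses the identity that the VIFs are the diagonal of the inverse predictor correlation matrix; the data are synthetic.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Synthetic example: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vif = variance_inflation_factors(np.column_stack([x1, x2, x3]))
```

A VIF near 1 indicates a predictor that shares little information with the others, while large values flag the redundancy that the remediation techniques above are meant to handle.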
Principal Component Regression
- Principal component regression (PCR) combines principal component analysis (PCA) with multiple linear regression.
- PCA is a technique that creates new variables from the original predictor variables that can be scored and ordered.
- Principal components:
  - are composites of many of the original predictor variables
  - represent a smaller number of variables that maintain the important patterns in the data
- Principal components are generated from linear combinations of the predictor variables.
Principal Component Regression
- Principal components are ordered:
  - The first principal component accounts for as much of the variability in the data set as possible.
  - Each succeeding component accounts for as much of the remaining variability as possible.
- Most importantly, the principal components created from PCA are uncorrelated, which allows them to be used in multiple linear regression.
Principal Component Regression
- PCR steps:
  1. Standardize the predictor dataset: subtract the sample mean and divide by the sample standard deviation.
  2. Compute the sample variance-covariance matrix of the standardized dataset.
  3. Calculate the eigenvectors and eigenvalues of the variance-covariance matrix.
  4. Calculate the principal component time series by multiplying the standardized predictor variables by the corresponding eigenvectors.
  5. Use the principal components obtained from step 4 in a linear regression model.
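The five PCR steps can be sketched in a few lines of NumPy. This is a minimal illustration under the assumption that the leading components (by eigenvalue) are the ones retained; the function names, the dictionary layout, and the synthetic data are hypothetical.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0, ddof=1)
    Z = (X - mean) / std                        # step 1: standardize
    cov = np.cov(Z, rowvar=False)               # step 2: variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # step 3: eigen-decomposition (ascending)
    order = np.argsort(eigvals)[::-1]           # reorder so the largest variance comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    V = eigvecs[:, :n_components]
    scores = Z @ V                              # step 4: principal component time series
    A = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # step 5: regress y on the components
    return {"mean": mean, "std": std, "V": V, "beta": beta,
            "eigvals": eigvals, "scores": scores}

def pcr_predict(model, X):
    Z = (np.asarray(X, dtype=float) - model["mean"]) / model["std"]
    return model["beta"][0] + (Z @ model["V"]) @ model["beta"][1:]

# Synthetic example: four intercorrelated predictors sharing one common signal.
rng = np.random.default_rng(2)
base = rng.normal(size=(30, 1))
X_demo = base + 0.3 * rng.normal(size=(30, 4))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.1, size=30)
model = pcr_fit(X_demo, y_demo, n_components=2)
y_hat = pcr_predict(model, X_demo)
```

When all components are retained, the fitted values coincide with ordinary least squares on the original predictors; the benefit of PCR comes from dropping the trailing components.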
Principal Component Regression
- PCA produces the same number of principal components as there are predictor variables.
- Some measure must be used to select the number of principal components to retain in the regression model.
- We investigated two methods:
  - The regression parameters are tested using a standard t-test and a user-specified critical t-value.
  - The Prediction REsidual Sum of Squares (PRESS) statistic is calculated using a leave-one-out cross-validation procedure.
Principal Component Regression
- Principal components are added into the regression model one by one.
- The addition of each principal component is checked for statistical significance using the standard t-test.
- The principal components are retained in the regression model as long as the regression parameters are statistically significant (t-value > 1.6).
Principal Component Regression
- The PRESS statistic is calculated for each of the extracted factors j using:
  $\mathrm{PRESS}(j) = \sum \left(q_{\mathrm{obs}} - q_{\mathrm{sim}}(j)\right)^2$
- where
  - $q_{\mathrm{obs}}$ is the observed streamflow value
  - $q_{\mathrm{sim}}(j)$ is the predicted streamflow value for the extracted factor j
- The PRESS statistic with the minimum value is then used to determine the number of extracted components to keep in the final model.
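A leave-one-out PRESS calculation for a regression on the first j principal components might be sketched as follows. Whether the study recomputed the PCA inside each cross-validation fold is not stated; this sketch assumes it did, and all names and data are illustrative.

```python
import numpy as np

def pc_scores(X_train, X_test, j):
    """Project both sets onto the first j PCs computed from the training rows only."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0, ddof=1)
    Z_train, Z_test = (X_train - mean) / std, (X_test - mean) / std
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z_train, rowvar=False))
    V = eigvecs[:, np.argsort(eigvals)[::-1][:j]]
    return Z_train @ V, Z_test @ V

def press(X, y, j):
    """PRESS(j): sum of squared leave-one-out prediction errors with j components."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i                       # hold out observation i
        S_train, S_test = pc_scores(X[keep], X[i:i + 1], j)
        A = np.column_stack([np.ones(n - 1), S_train])
        beta, *_ = np.linalg.lstsq(A, y[keep], rcond=None)
        q_sim = beta[0] + S_test[0] @ beta[1:]
        total += (y[i] - q_sim) ** 2
    return total

def select_components(X, y):
    """Return the number of components with minimum PRESS, plus all PRESS values."""
    scores = [press(X, y, j) for j in range(1, np.asarray(X).shape[1] + 1)]
    return int(np.argmin(scores)) + 1, scores

# Synthetic example with four intercorrelated predictors.
rng = np.random.default_rng(3)
base = rng.normal(size=(30, 1))
X_demo = base + 0.3 * rng.normal(size=(30, 4))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.5, size=30)
best_j, press_values = select_components(X_demo, y_demo)
```

Holding out one year at a time means every PRESS term is an out-of-sample error, which is what makes the statistic usable for choosing model size.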
Partial Least Squares Regression
- Partial least squares regression (PLSR) is based on the principal components of both the predictor (X) and the dependent variable (Y).
- PLSR differs from PCR in that PLSR searches for a set of components that explains the maximum covariance between X and Y, whereas PCR concentrates on the variance of X only.
- PLSR decomposes X and Y into a score matrix ($S_x$, $S_y$) times a loading matrix ($L_x$, $L_y$) plus a residual matrix ($E_x$, $E_y$):
  $X = S_x L_x + E_x$
  $Y = S_y L_y + E_y$
- These are referred to as the outer relations.
Partial Least Squares Regression
- PLSR seeks to minimize $E_y$ while maintaining the correlation between X and Y through an inner relation.
- Outer relations:
  $X = S_x L_x + E_x$  (1)
  $Y = S_y L_y + E_y$  (2)
- Inner relation:
  $S_y = D S_x + E_i$  (3)
  where D is the diagonal correlation matrix between X and Y and $E_i$ is an error term.
- Inserting equation (3) into equation (2) gives a predictive model for Y:
  $Y = D S_x L_y + E_{yy}$
  where $E_{yy}$ is to be minimized.
Partial Least Squares Regression
- PLSR also produces the same number of components as there are predictor variables.
- In this study the PRESS statistic is used for determining the optimal number of components to retain:
  $\mathrm{PRESS}(j) = \sum \left(q_{\mathrm{obs}} - q_{\mathrm{sim}}(j)\right)^2$
  where $q_{\mathrm{obs}}$ is the observed streamflow value and $q_{\mathrm{sim}}(j)$ is the predicted streamflow value for the extracted factor j.
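For a single predictand, the score and loading decomposition can be illustrated with the standard PLS1 (NIPALS-style) algorithm. This is a generic textbook sketch, not necessarily the implementation used in the study: `t` plays the role of a column of the score matrix, `p` of the X loadings, and `c` of the Y loading. The data are centered but not scaled, which is an assumption.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                  # direction of maximum covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                     # score vector
        tt = t @ t
        p = Xk.T @ t / tt              # X loading
        c = yk @ t / tt                # y loading
        Xk = Xk - np.outer(t, p)       # deflate X
        yk = yk - c * t                # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients in terms of centered X
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return y_mean + (np.asarray(X, dtype=float) - x_mean) @ coef

# Synthetic example with four intercorrelated predictors.
rng = np.random.default_rng(4)
base = rng.normal(size=(30, 1))
X_demo = base + 0.3 * rng.normal(size=(30, 4))
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.2, size=30)
y_hat_1 = pls1_predict(pls1_fit(X_demo, y_demo, 1), X_demo)
y_hat_4 = pls1_predict(pls1_fit(X_demo, y_demo, 4), X_demo)
```

With as many components as predictors the fit reproduces ordinary least squares, so, as with PCR, the practical question is how few components can be kept, which is where the PRESS statistic comes in.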
Development of Climate Predictors using ICA
- This research investigated the use of Independent Component Analysis (ICA) for the decomposition of large-scale climate data, to see whether statistically significant climate predictors could be extracted.
- The idea is that the oceanic and atmospheric systems receive contributions from many sources and that the observations represent a linear mixture of independent signals.
- Before ICA was implemented, a quick correlation analysis was performed using the NCEP Reanalysis data provided by the Physical Sciences Division, NOAA/OAR/ESRL (http://www.cdc.noaa.gov/).
Climate Signals Correlated With Spring Runoff Volumes
Label | Pattern | Label | Pattern
PNA | Pacific North American Index | AMO | Atlantic Multidecadal Oscillation (unsmoothed)
EPNP | Eastern Pacific Oscillation | EA | East Atlantic Index
WP | Western Pacific Index | PE | Polar/Eurasia Index
NAO | North Atlantic Oscillation | SCAND | Scandinavia Index
SOI | Southern Oscillation Index | SOLAR FLUX | Solar Flux (10.7cm)
PDO | Pacific Decadal Index | OLR | Outgoing Long Wave Radiation Equator
QB30 | Quasi-Biennial Oscillation (30mb zonal wind) | SLP | Sea Level Pressure
QB50 | Quasi-Biennial Oscillation (50mb zonal wind) | SLP | Sea Level Pressure
GLAAM | Globally Integrated Angular Momentum | 850MB TRADE WINDS | 850mb Trade Wind Index
MEI | Multivariate ENSO Index | 1000MB ZWIND | NCEP 1000mb Zonal Wind
NINO1+2 | Extreme Eastern Tropical Pacific SST | KAPLAN SST | Sea Surface Temperature
NINO3 | Eastern Tropical Pacific SST | NCEP SST | Sea Surface Temperature
NINO3.4 | Eastern Central Tropical Pacific SST | NCEP SUR PRESS | NCEP Surface Pressure
NINO4 | Central Tropical Pacific SST | NCEP SLP | NCEP Sea Level Pressure
ONI | Oceanic Nino Index | NCEP AIR TEMP | NCEP Surface Air Temperature
TNI | Trans-Nino Index | NCEP 1000MB GEO | NCEP 1000mb Geopotential Height
WHWP | Western Hemisphere warm pool | NCEP REL HUMID | NCEP Surface Relative Humidity
TNA | Tropical Northern Atlantic Index | NCEP PRCP RATE | NCEP Surface Precipitation Rate
TSA | Tropical Southern Atlantic Index | NCEP PRCP WATER | NCEP Surface Precipitable Water
Climate Signal Correlation With Apr-Sept Streamflow Volume
[Figure: correlation maps for the Yakima River Basin and the Rogue River Basin. Panel titles: Sea Level Pressure, Geopotential Height, Relative Humidity, Surface Air Temperature, Surface Pressure, Precipitable Water.]
Independent Component Analysis
- ICA is a method that separates mutually independent signals from observation data.
- The ICA model: $X = A S$
  - X = observed data
  - A = some unknown mixing matrix
  - S = independent components
- Assumptions:
  - Observed signals are linear mixtures of unobserved independent signals.
  - Independent signals are non-Gaussian and mutually independent.
Independent Component Analysis
- What is ICA? “Independent component analysis (ICA) is a method for finding underlying factors or components from multivariate (multi-dimensional) statistical data. What distinguishes ICA from other methods is that it looks for components that are both Statistically Independent and NonGaussian.”
- The simple “Cocktail Party” problem: sources $s_1$, $s_2$ pass through a mixing matrix A to give observations $x_1$, $x_2$ (n sources, m observations):
  $x_1 = a_{11} s_1 + a_{12} s_2$
  $x_2 = a_{21} s_1 + a_{22} s_2$
  $x = A s$
Independent versus Uncorrelated
- Two variables $y_1$ and $y_2$ are independent if and only if $P(y_1, y_2) = P_1(y_1) P_2(y_2)$.
- Uncorrelatedness is a weaker form of independence.
- When two variables are independent they are also uncorrelated; however, the converse is not true.
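A small numerical illustration of the last point, using synthetic data: a variable that is symmetric about zero and its square are essentially uncorrelated, yet one fully determines the other, so the joint probability does not factor into the product of the marginals.

```python
import numpy as np

rng = np.random.default_rng(3)
y1 = rng.uniform(-1.0, 1.0, size=100_000)
y2 = y1 ** 2                      # fully determined by y1, so clearly dependent

corr = np.corrcoef(y1, y2)[0, 1]  # close to zero: the variables are uncorrelated

# Independence would require P(A and B) = P(A) * P(B) for every pair of events.
a = np.abs(y1) > 0.5
b = y2 > 0.25
p_joint = np.mean(a & b)
p_product = np.mean(a) * np.mean(b)
```

Here the two events are in fact the same event, so `p_joint` is about twice `p_product` even though the correlation is negligible.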
The independent components must be non-Gaussian for ICA to be possible. Maximizing the non-Gaussianity of $W^T x$ results in the independent components.
Measures of Non-Gaussianity
- We need a quantitative measure of non-Gaussianity for ICA estimation.
- Kurtosis: Gaussian = 0 (sensitive to outliers)
- Entropy: Gaussian = largest
- Negentropy: Gaussian = 0 (difficult to estimate)
- Approximations: [equations not preserved in the transcript], where v is a standard Gaussian random variable.
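The approximation formulas are missing from the slide. The sketch below assumes the form widely used in the FastICA literature, J(y) ≈ (E[G(y)] − E[G(v)])² with G(u) = log cosh(u) and v a standard Gaussian variable, alongside excess kurtosis. The function names and test samples are illustrative.

```python
import numpy as np

def standardize(y):
    y = np.asarray(y, dtype=float)
    return (y - y.mean()) / y.std()

def excess_kurtosis(y):
    """E[z^4] - 3 for standardized z; zero for a Gaussian variable."""
    z = standardize(y)
    return np.mean(z ** 4) - 3.0

def _expected_G_gaussian(G):
    """E[G(v)] for v ~ N(0, 1), by a simple Riemann sum on a fine grid."""
    v = np.linspace(-10.0, 10.0, 20001)
    pdf = np.exp(-0.5 * v ** 2) / np.sqrt(2.0 * np.pi)
    return np.sum(G(v) * pdf) * (v[1] - v[0])

def negentropy_approx(y, G=lambda u: np.log(np.cosh(u))):
    """Assumed approximation: (E[G(z)] - E[G(v)])^2, which is near zero for Gaussian data."""
    z = standardize(y)
    return (np.mean(G(z)) - _expected_G_gaussian(G)) ** 2

rng = np.random.default_rng(6)
n = 200_000
gauss, laplace, uniform = rng.normal(size=n), rng.laplace(size=n), rng.uniform(-1, 1, size=n)
k = [excess_kurtosis(s) for s in (gauss, laplace, uniform)]
J = [negentropy_approx(s) for s in (gauss, laplace, uniform)]
```

Kurtosis is positive for the heavy-tailed Laplace sample and negative for the uniform sample, while the negentropy approximation is non-negative and smallest for the Gaussian sample.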
ICA vs PCA
- Principal Component Analysis (PCA) transforms a set of possibly correlated variables into a set of uncorrelated variables called principal components.
- Independent Component Analysis (ICA) goes further and finds independent variables.
- Some preprocessing steps therefore make the problem of ICA estimation simpler:
  - Centering
  - Whitening
ICA Preprocessing
- Centering is achieved by subtracting the mean vector (m) from the mixed signal x.
- After estimating the mixing matrix with centered data, one can add the mean vector of s (given by $A^{-1} m$) back to the centered estimates of s.
- Whitening linearly transforms the observed vector x so that its components are uncorrelated and their variances equal unity.
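Both preprocessing steps fit in a few lines. The sketch below whitens with the eigen-decomposition of the covariance matrix, using the symmetric form E D^(-1/2) E^T; this is one of several valid whitening transforms, and the choice here is an assumption.

```python
import numpy as np

def center_and_whiten(X):
    """X: (n_samples, n_signals). Returns whitened data, the mean vector, and the whitening matrix."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centering
    eigvals, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = E @ np.diag(eigvals ** -0.5) @ E.T          # symmetric whitening matrix
    return Xc @ W, mean, W

# Synthetic example: two correlated observed signals with non-zero means.
rng = np.random.default_rng(8)
s = rng.uniform(-1, 1, size=(1000, 2))
x = s @ np.array([[1.0, 0.5], [0.7, 1.2]]).T + np.array([3.0, -2.0])
x_white, m, W = center_and_whiten(x)
```

After this transform the sample covariance of `x_white` is the identity matrix, so the remaining ICA problem reduces to finding a rotation.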
Independent Component Analysis
[Figure: original signals $S_o$ passed through the mixing matrix A to produce the mixed signal X.]
Independent Component Analysis
[Figure: estimated signals S compared with the original signals $S_o$.]
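The mixing and recovery shown in these figures can be reproduced on synthetic data. The slides do not name the ICA algorithm that was used; the sketch below assumes a symmetric FastICA iteration with a tanh nonlinearity, and the sources, mixing matrix, and function names are all made up for the illustration.

```python
import numpy as np

def whiten(X):
    """Center the data and apply PCA whitening (unit variance, no correlation)."""
    Xc = X - X.mean(axis=0)
    eigvals, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ E @ np.diag(eigvals ** -0.5)

def sym_decorrelate(W):
    """Make the rows of W orthonormal: W <- (W W^T)^(-1/2) W."""
    s, U = np.linalg.eigh(W @ W.T)
    return U @ np.diag(s ** -0.5) @ U.T @ W

def fast_ica(X, n_iter=200, tol=1e-8, seed=0):
    """Estimate independent components from mixed signals X of shape (n_samples, n_signals)."""
    Z = whiten(np.asarray(X, dtype=float))
    n, m = Z.shape
    rng = np.random.default_rng(seed)          # random start: results vary with the seed
    W = sym_decorrelate(rng.normal(size=(m, m)))
    for _ in range(n_iter):
        Y = Z @ W.T                            # current component estimates
        g = np.tanh(Y)
        g_prime = 1.0 - g ** 2
        # Fixed-point update per row: w <- E[z g(w.z)] - E[g'(w.z)] w
        W_new = sym_decorrelate((g.T @ Z) / n - np.diag(g_prime.mean(axis=0)) @ W)
        converged = np.max(np.abs(np.abs(np.diag(W_new @ W.T)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return Z @ W.T

# Two independent non-Gaussian sources, mixed linearly as X = S A^T.
rng = np.random.default_rng(5)
n = 5000
S_true = np.column_stack([rng.uniform(-1, 1, n), rng.laplace(size=n)])
A_mix = np.array([[1.0, 0.5], [0.7, 1.2]])
X_mixed = S_true @ A_mix.T
S_est = fast_ica(X_mixed)
```

The recovered components match the sources only up to sign, scaling, and ordering, so they are best compared with the originals through the absolute value of their correlations.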
Seasonal Climate Signal Correlation with Rogue Apr-Sept Streamflow Volume NCEP/NCAR Reanalysis
Seasonal Climate Signal Correlation with Sprague Apr-Sept Streamflow Volume NCEP/NCAR Reanalysis
Development of Climate Predictors using ICA
- ICA climate predictor identification steps:
  1. Select one of the large-scale climate data sets to analyze. The climate data should consist of monthly values and will have dimensions n x m, where n is the number of historical observations and m is the number of months. The number of historical observations, n, used in this study was 30 (1979-2008).
  2. Perform the following steps for i = 1 to m:
     - Use PCA to find i principal components for the selected climate data set.
     - Whiten the data using the eigenvectors and eigenvalues from the PCA above.
     - Use the ICA algorithm to estimate i independent components. In this study it is assumed that the number of signals is equal to the number of principal components.
     - Retain each of the estimated i independent components. Because pass i contributes i components, the total number retained is r = 1 + 2 + ... + m = m(m+1)/2.
  3. Calculate Pearson's correlation coefficient between each of the r independent climate signals and spring runoff. Retain the climate signal that has the highest correlation coefficient.
Development of Climate Predictors using ICA
  4. Repeat steps 2 through 3 k times, where k = 50 for this study. Repetition is necessary because the ICA program implements a stochastic algorithm and the ICA decomposition is only unique up to sign, scaling, and permutation. This bootstrapping technique is a way of ordering and selecting only those signals that are linearly correlated with spring runoff. The result of this step is a matrix of dimensions n x k.
  5. Perform PCA on the n x k matrix from step 4, keeping only the first principal component.
  6. Whiten the data using the eigenvectors and eigenvalues from step 5.
  7. Perform ICA using the whitened data from step 6 to estimate one independent component. The result is an n x 1 column vector, which becomes the climate predictor variable in the multiple linear regression model.
ICA climate predictor selection for the Yakima River Basin
- New predictors have strong correlation with spring runoff.
- The ICA procedure found 9 new predictors to include in the regression model.
ICA climate predictor selection for the Rogue River Basin
- New predictors have strong correlation with spring runoff.
- The ICA procedure found 9 new predictors to include in the regression model.
ICA climate predictor selection for the Sprague River Basin
- New predictors have strong correlation with spring runoff.
- The ICA procedure found 7 new predictors to include in the regression model.
Data Description
Sprague River Basin Data type and Sources
Data Type | Site No | Site ID | Agency | Elev. | Lat | Long | Period of Record
SNOTEL FISH LAKE SWE | 478 | 21B04S | NRCS | 3371' | 47º 32' | -121º 4' | 1984-2008
SNOTEL OLALLIE MEADOWS SWE | 672 | 21B55S | NRCS | 3700' | 47º 22' | -121º 26' | 1984-2008
SNOTEL SASSE RIDGE SWE | 734 | 21B51S | NRCS | 4200' | 47º 23' | -121º 3' | 1984-2008
SNOTEL STAMPEDE PASS SWE | 788 | 21B10S | NRCS | 3860' | 47º 16' | -121º 20' | 1984-2008
SNOW TUNNEL AVENUE SWE | N/A | 21B08 | NRCS | N/A | 47º 26' | -121º 31' | 1984-2008
SNOTEL FISH LAKE PRCP | 478 | 21B04S | NRCS | 3371' | 47º 32' | -121º 4' | 1984-2008
SNOTEL OLALLIE MEADOWS PRCP | 672 | 21B55S | NRCS | 3700' | 47º 22' | -121º 26' | 1984-2008
SNOTEL SASSE RIDGE PRCP | 734 | 21B51S | NRCS | 4200' | 47º 23' | -121º 3' | 1984-2008
SNOTEL STAMPEDE PASS PRCP | 788 | 21B10S | NRCS | 3860' | 47º 16' | -121º 20' | 1984-2008
YAKIMA RIVER AT CLE ELUM, WA STRM | 12479500 | N/A | USGS | 1902' | 47º 11' | -120º 56' | 1984-2008
ICA GLAAM SIGNAL ICA | NA | | NOAA | NA | 90N 90S | -180W 180E | 1979-2008
ICA QBO30 SIGNAL ICA | NA | | NOAA | NA | 60N 60S | 90W -73E | 1979-2008
ICA NCEP SST SIGNAL ICA | NA | | NOAA | NA | 60N 30S | 200W 240E | 1979-2008
ICA NCEP SLP SIGNAL ICA | NA | | NOAA | NA | 60N 30S | 140W 240E | 1979-2008
ICA NCEP AIR TEMP SIGNAL ICA | NA | | NOAA | NA | 60N 30S | 200W 240E | 1979-2008
ICA NCEP 1000MB GEO SIGNAL ICA | NA | | NOAA | NA | 60N 30S | 140W 240E | 1979-2008
ICA NCEP PRCP WATER SIGNAL ICA | NA | | NOAA | NA | 5N 5S | 150W 160E | 1979-2008
Sprague River Basin Predictor Selection
[Table: predictors available (rows) against monthly predictors used in the ICA climate model (columns: Sept, Oct, Nov, Dec, Jan, Feb, Mar, Apr). The per-month entries are not preserved in the transcript.]
Predictors available:
- SNOTEL TAYLOR BUTTE SWE
- SNOTEL SILVER CREEK SWE
- SNOTEL SUMMER RIM SWE
- SNOTEL QUARTZ MOUNTAIN SWE
- SNOTEL STRAWBERRY SWE
- SNOTEL TAYLOR BUTTE PRCP
- SNOTEL SILVER CREEK PRCP
- SNOTEL SUMMER RIM PRCP
- SNOTEL QUARTZ MOUNTAIN PRCP
- SNOTEL STRAWBERRY PRCP
- SPRAGUE RIVER ANTECEDENT STRMFLW
- ICA EPNP SIGNAL
- ICA GLAAM SIGNAL
- ICA AMO SIGNAL
- ICA TNI SIGNAL
- ICA NCEP 1000MB ZWIND SIGNAL
- ICA NCEP PRCP WATER SIGNAL
- ICA NCEP PRCP RATE SIGNAL
Predictor Identification & Forecast Model Development
Volumetric Streamflow Forecast For Sep. 2008-Apr. 2009: Rogue River Basin
Volumetric Streamflow Forecast For Sep. 2008-Apr. 2009: Yakima River Basin
Model Evaluation and Performance Measures
- The model skill (coefficient of determination and root mean square error)
- A leave-one-out (LOO) cross-validation procedure
- The forecast skill of each cross-validation model is evaluated using the Linear Error in Probability Space (LEPS) score:
  $L = 3\left(1 - |P_f - P_0| + P_f^2 - P_f + P_0^2 - P_0\right) - 1$
- Benchmark Efficiency: [equation not preserved in the transcript]
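The LEPS formula translates directly into code. The slide does not define $P_f$ and $P_0$; the sketch below assumes they are the cumulative probabilities of the forecast and the observed value within the historical (climatological) distribution, estimated with a simple empirical CDF. The helper names and the sample volumes are hypothetical.

```python
import math

def empirical_cdf(value, climatology):
    """Fraction of historical values <= value (assumed way of obtaining P_f and P_0)."""
    return sum(1 for c in climatology if c <= value) / len(climatology)

def leps_score(p_f, p_o):
    """L = 3 * (1 - |P_f - P_0| + P_f^2 - P_f + P_0^2 - P_0) - 1"""
    return 3.0 * (1.0 - abs(p_f - p_o) + p_f ** 2 - p_f + p_o ** 2 - p_o) - 1.0

def rmse(observed, simulated):
    """Root mean square error between paired observed and simulated values."""
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(observed, simulated)) / len(observed))

# Made-up seasonal volumes (1000-AF), for illustration only.
climatology = [410, 520, 630, 480, 750, 560, 690, 450, 800, 610]
forecast, observed = 700, 660
score = leps_score(empirical_cdf(forecast, climatology),
                   empirical_cdf(observed, climatology))
```

The score rewards a correct forecast of an extreme more than a correct forecast near the median: `leps_score(0.5, 0.5)` is 0.5, while `leps_score(0.0, 0.0)` is 2.0, and the worst case `leps_score(0.0, 1.0)` is -1.0.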
Yakima River Basin Error Statistics
Forecast Issue Date | Sept 1st | Oct 1st | Nov 1st | Dec 1st | Jan 1st | Feb 1st | Mar 1st | Apr 1st
No. of Comps | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1
Calibration RMSE | 122.34 | 108.85 | 88.32 | 85.75 | 74.51 | 75.4 | 75.97 | 64.96
Cross-Validation RMSE | 137.93 | 122.32 | 94.79 | 95.16 | 82.07 | 81.25 | 84.14 | 73.21
Calibration R^2 | 0.72 | 0.78 | 0.85 | 0.86 | 0.9 | 0.89 | | 0.92
Jackknife Cross-Validation R^2 | 0.64 | 0.72 | 0.83 | | 0.87 | 0.88 | 0.87 | 0.9
Benchmark Efficiency | - | - | - | | 0.84 | 0.8 | 0.79 | 0.72
Linear Error in Prob. Space | 0.59 | 0.62 | 0.69 | 0.72 | 0.77 | 0.78 | 0.73 | 0.76
Model Forecast Skill while using Climate Predictors
[Figure: BE and LEPS plotted against forecast issue date, one panel each for the Yakima River Basin and the Rogue River Basin.]