
1 Scalable Stochastic Programming
Cosmin G. Petra
Mathematics and Computer Science Division, Argonne National Laboratory
petra@mcs.anl.gov
Joint work with Mihai Anitescu, Miles Lubin, and Victor Zavala

2 Motivation
– Sources of uncertainty in complex energy systems: weather, consumer demand, market prices
– Applications at Argonne (Anitescu, Constantinescu, Zavala):
  – Stochastic unit commitment with wind power generation
  – Energy management of co-generation
  – Economic optimization of a building energy system

3 Stochastic Unit Commitment with Wind Power
– Wind forecast: WRF (Weather Research and Forecasting) model
  – Real-time, grid-nested 24 h simulation
  – 30 samples require 1 h on 500 CPUs (Jazz@Argonne)
– Decision problem (figure: wind farms and thermal generators on the grid): how should the thermal units be scheduled so as to minimize cost, satisfy demand while adopting the available wind power, keep a reserve, and respect technological constraints?
Slide courtesy of V. Zavala & E. Constantinescu

4 Optimization under Uncertainty
– Two-stage stochastic programming with recourse ("here-and-now"): the first-stage decision is made before the uncertainty is revealed; the second-stage (recourse) decision reacts to the observed scenario.
– The expectation over the continuous distribution is approximated by sampling (statistical inference over M batches), replacing it with a discrete one and yielding the sample average approximation (SAA) problem, which is the problem we actually solve.
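For reference, a standard textbook statement of the two-stage problem and its SAA; the symbols (c, A, b, q, T, W, h) are generic notation, not necessarily those used on the original slide:

```latex
\min_{x}\; c^T x + \mathbb{E}_{\xi}\!\left[ Q(x,\xi) \right]
\quad \text{s.t.}\quad A x = b,\; x \ge 0,
\qquad
Q(x,\xi) = \min_{y}\; q(\xi)^T y
\quad \text{s.t.}\quad W(\xi)\, y = h(\xi) - T(\xi)\, x,\; y \ge 0 .
```

With N sampled scenarios the SAA problem couples one recourse variable y_i per scenario to the single first-stage variable x:

```latex
\min_{x,\,y_1,\dots,y_N}\; c^T x + \frac{1}{N}\sum_{i=1}^{N} q_i^T y_i
\quad \text{s.t.}\quad A x = b,\quad T_i x + W_i y_i = h_i,\quad x \ge 0,\; y_i \ge 0 .
```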

5 Solving the SAA problem – the PIPS solver
– Interior-point methods (IPMs) have polynomial iteration complexity in theory and perform even better in practice (infeasible primal-dual path-following).
– No more than 30–50 iterations have been observed for n up to 10 million; we can confirm this remains true when n is a hundred times larger.
– Two linear systems are solved at each iteration.
– Direct solvers need to be used because the IPM linear systems are ill-conditioned and need to be solved accurately.
– We solve the SAA problems with a standard IPM (Mehrotra's predictor-corrector) and specialized linear algebra: the PIPS solver.

6 Linear Algebra of Primal-Dual Interior-Point Methods
– The SAA problem is a convex quadratic program; each IPM iteration requires the solution of a KKT linear system.
– For two-stage stochastic programs, this linear system is arrow-shaped (bordered block-diagonal) modulo a permutation, with one diagonal block per scenario; S denotes the number of scenarios.
– For multi-stage stochastic programs the structure is nested.
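A sketch of the arrow-shaped system; the block labels K_i (scenario blocks), B_i (coupling blocks), and K_0 (first-stage block) are this note's own notation, not the slide's:

```latex
H z = r, \qquad
H =
\begin{pmatrix}
K_1 &        &      & B_1^T \\
    & \ddots &      & \vdots \\
    &        & K_S  & B_S^T \\
B_1 & \cdots & B_S  & K_0
\end{pmatrix},
\qquad
z = \begin{pmatrix} z_1 \\ \vdots \\ z_S \\ z_0 \end{pmatrix},\quad
r = \begin{pmatrix} r_1 \\ \vdots \\ r_S \\ r_0 \end{pmatrix}.
```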

7 The Direct Schur Complement Method (DSC)
Uses the arrow shape of H (a small numerical sketch follows below):
1. Implicit factorization: factor each scenario block and form the first-stage Schur complement.
2. Solving Hz = r:
  2.1. Backward substitution (condense the right-hand side onto the first stage)
  2.2. Diagonal solve (the dense first-stage system)
  2.3. Forward substitution (recover the scenario unknowns)
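A minimal dense-algebra sketch of these steps; numpy solves stand in for the sparse (MA57) and dense (LAPACK) factorizations used in PIPS, and the block names K_i, B_i, K_0 follow the arrow-system sketch above:

```python
import numpy as np

def dsc_solve(K0, Ks, Bs, r0, rs):
    """Direct Schur complement (DSC) solve of the arrow-shaped system
         K_i z_i             + B_i^T z_0 = r_i   (i = 1..S)
         sum_i B_i z_i + K_0 z_0         = r_0
       Dense numpy solves stand in for the sparse/dense factorizations."""
    # 1. implicit factorization: form the first-stage Schur complement
    C = K0.copy()
    for K, B in zip(Ks, Bs):
        C -= B @ np.linalg.solve(K, B.T)        # C = K_0 - sum_i B_i K_i^{-1} B_i^T

    # 2.1 backward substitution: condense the right-hand side onto stage 1
    r0_hat = r0 - sum(B @ np.linalg.solve(K, r) for K, B, r in zip(Ks, Bs, rs))

    # 2.2 diagonal solve: dense first-stage system
    z0 = np.linalg.solve(C, r0_hat)

    # 2.3 forward substitution: recover the scenario unknowns
    zs = [np.linalg.solve(K, r - B.T @ z0) for K, B, r in zip(Ks, Bs, rs)]
    return z0, zs
```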

8 Parallelizing DSC – 1. Factorization phase
– The scenario blocks are factorized in parallel across processes 1 … p (sparse linear algebra, MA57).
– The factorization of the 1st-stage Schur complement matrix is done with dense linear algebra (LAPACK) and is the BOTTLENECK.

9 Parallelizing DSC – 2. Triangular solves
– The scenario backsolves (sparse linear algebra) run in parallel on processes 1 … p.
– The 1st-stage backsolve (dense linear algebra) is not parallelized and is the BOTTLENECK.

10 Implementation of DSC
Per-iteration timeline (processes 1 … p):
– Factorization phase: each process factorizes its own scenario blocks and performs the corresponding backsolves; an MPI_Allreduce then sums the local contributions to the 1st-stage Schur complement, and the dense factorization (and backsolve) of that matrix is replicated on every process.
– Triangular-solve phase: after a communication step, the forward substitutions and the dense 1st-stage solve are likewise replicated on each process.
The 1st-stage computations are replicated on each process.
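A sketch of the corresponding communication pattern using mpi4py (illustrative names; PIPS itself is C++ built on MA57/LAPACK): each process sums the Schur-complement contributions of the scenarios it owns, and MPI_Allreduce replicates the assembled matrix so every process can perform the dense factorization locally, as on the slide.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def assemble_schur_allreduce(K0, my_Ks, my_Bs):
    """DSC assembly: local scenario contributions + MPI_Allreduce.
       my_Ks / my_Bs are the blocks of the scenarios assigned to this rank."""
    local = np.zeros_like(K0)
    for K, B in zip(my_Ks, my_Bs):
        local -= B @ np.linalg.solve(K, B.T)    # -B_i K_i^{-1} B_i^T
    C = np.empty_like(K0)
    comm.Allreduce(local, C, op=MPI.SUM)        # summed contributions on every rank
    C += K0                                     # add the first-stage block once
    return C                                    # each rank factorizes C (replicated work)
```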

11 Scalability of DSC
On the unit commitment problem, DSC reaches 76.7% parallel efficiency, but this is not always the case: for a problem with a large number of 1st-stage variables, efficiency drops to 38.6% (on Fusion @ Argonne).

12 BOTTLENECK SOLUTION 1: STOCHASTIC PRECONDITIONER

13 The Stochastic Preconditioner
– The 1st-stage Schur complement matrix C has an exact structure: a sum of contributions over all scenarios.
– The stochastic preconditioner (P. & Anitescu, COAP 2011) approximates this sum using an IID subset of n scenarios.
– For C, the constraint preconditioner (Keller et al., 2000) is used, built from this stochastic approximation.
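A sketch of the idea in the notation of the earlier arrow-system sketch (so the symbols, and the exact placement of the SAA 1/N weights, are assumptions of this note rather than the slide's formulas): the exact Schur complement sums over all N scenarios, while the preconditioner rescales a sum over a random subset of n scenarios so that its expectation matches the full sum.

```latex
C \;=\; K_0 \;-\; \sum_{i=1}^{N} B_i K_i^{-1} B_i^T,
\qquad
C_{\text{prec}} \;=\; K_0 \;-\; \frac{N}{n} \sum_{i \in \mathcal{S}} B_i K_i^{-1} B_i^T,
\qquad \mathcal{S} \subset \{1,\dots,N\},\ |\mathcal{S}| = n \ \text{(IID sample)} .
```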

14 Implementation of PSC
Per-iteration timeline (processes 1 … p plus an extra process p+1): the scenario factorizations and backsolves still run on processes 1 … p, but the preconditioner contributions are combined with an MPI_Reduce onto process p+1, which performs the dense factorization of the preconditioner while the other processes continue (and is otherwise idle). In the solve phase the condensed right-hand side is combined with MPI_Allreduce / MPI_Reduce (to process 1), the 1st-stage system is solved with a Krylov method whose preconditioner triangular solves run on process p+1 (with the associated communication), and an MPI_Bcast distributes the result before the forward substitutions.
– REMOVES the factorization bottleneck.
– Leaves a slightly larger solve bottleneck.

15 The "Ugly" Unit Commitment Problem
– DSC on P processes vs. PSC on P+1 processes (120 scenarios).
– With optimal use of PSC, the scaling is linear.
– Beyond that point, the factorization of the preconditioner cannot be hidden anymore.

16 Quality of the Stochastic Preconditioner
– "Exponentially" better preconditioning (P. & Anitescu, 2011).
– Proof: Hoeffding's inequality.
– Assumptions on the problem's random data (not restrictive):
  1. Boundedness
  2. Uniform full rank of the scenario constraint matrices
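For reference, the scalar Hoeffding inequality that drives such exponential bounds (standard statement; the slide's exact matrix-valued application is not reproduced here): for independent random variables X_i with a_i ≤ X_i ≤ b_i,

```latex
\Pr\!\left( \left| \frac{1}{n}\sum_{i=1}^{n} \bigl(X_i - \mathbb{E}[X_i]\bigr) \right| \ge t \right)
\;\le\; 2 \exp\!\left( - \frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right).
```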

17 Quality of the Constraint Preconditioner
– The preconditioned matrix has the eigenvalue 1 with a large order of multiplicity.
– The rest of the eigenvalues satisfy explicit bounds.
– Proof: based on Bergamaschi et al., 2004.

18 The Krylov Methods Used
– BiCGStab using the constraint preconditioner M (a usage sketch follows below).
– Preconditioned Projected CG (PPCG) (Gould et al., 2001):
  – preconditioned projection onto the null space of the constraints;
  – does not compute a basis for the null space; instead, the projection is applied implicitly through solves with the constraint preconditioner.
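As a minimal illustration of the first option (not the PIPS implementation), SciPy's BiCGStab accepts the preconditioner as a LinearOperator; C_matvec and apply_prec below are hypothetical callbacks that apply the Schur complement and M^{-1} (through the factorization of the constraint preconditioner), respectively.

```python
from scipy.sparse.linalg import LinearOperator, bicgstab

def krylov_solve_schur(C_matvec, apply_prec, b):
    """Solve C x = b with BiCGStab, preconditioned by the constraint
       preconditioner M (applied via apply_prec, i.e. v -> M^{-1} v).
       C_matvec and apply_prec are placeholders for the user's operators."""
    n = b.shape[0]
    C = LinearOperator((n, n), matvec=C_matvec)
    M = LinearOperator((n, n), matvec=apply_prec)
    x, info = bicgstab(C, b, M=M)
    if info != 0:
        raise RuntimeError("BiCGStab did not converge (info=%d)" % info)
    return x
```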

19 Performance of the preconditioner
– Eigenvalue clustering & number of Krylov iterations.
– Both are affected by the well-known ill-conditioning of IPM systems.

20 SOLUTION 2: PARALLELIZATION OF STAGE 1 LINEAR ALGEBRA

21 Parallelizing the 1st-stage linear algebra
– We distribute the 1st-stage Schur complement system; C is treated as dense (its dense block is symmetric positive definite, its constraint block sparse and of full rank).
– Alternative to PSC for problems with a large number of 1st-stage variables.
– Removes the memory bottleneck of PSC and DSC.
– We investigated ScaLAPACK and Elemental (the successor of PLAPACK):
  – neither has a solver for symmetric indefinite matrices (Bunch-Kaufman);
  – they offer LU or Cholesky only;
  – so we had to think of modifying one of them.

22 Cholesky-based L D Lᵀ-like factorization
– Can be viewed as an "implicit" normal-equations approach.
– In-place implementation inside Elemental: no extra memory needed.
– Idea: modify the Cholesky factorization by changing the sign after processing the first p columns (a small sketch follows below).
– This is much easier to do in Elemental, since it distributes elements, not blocks.
– Twice as fast as LU.
– Works for more general saddle-point linear systems, i.e., with a positive semidefinite (2,2) block.
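A minimal unblocked, dense numpy sketch of the sign-change idea, assuming a symmetric matrix whose leading p-by-p block is positive definite and whose elimination keeps the trailing block negative definite (e.g. [[Q, Aᵀ], [A, 0]] with Q positive definite and A of full rank); the actual PIPS/Elemental code is an in-place, distributed, blocked variant of this loop.

```python
import numpy as np
from scipy.linalg import solve_triangular

def signed_cholesky(K, p):
    """Factor K = L D L^T with D = diag(+1,...,+1, -1,...,-1), where the
       first p diagonal signs are +1: a Cholesky loop whose sign flips
       after processing p columns."""
    K = K.astype(float).copy()
    n = K.shape[0]
    L = np.zeros((n, n))
    d = np.ones(n)
    d[p:] = -1.0                                  # sign change after p columns
    for j in range(n):
        s = d[j]
        L[j, j] = np.sqrt(s * K[j, j])            # s*K[j,j] > 0 for such matrices
        L[j+1:, j] = K[j+1:, j] / (s * L[j, j])
        # right-looking update of the trailing submatrix
        K[j+1:, j+1:] -= s * np.outer(L[j+1:, j], L[j+1:, j])
    return L, d

def signed_cholesky_solve(L, d, r):
    """Solve (L D L^T) z = r using the factors above (note D^{-1} = D)."""
    y = solve_triangular(L, r, lower=True)
    return solve_triangular(L.T, d * y, lower=False)
```

As with ordinary Cholesky, the cost is roughly half that of LU and no pivoting is performed, which is consistent with the "twice as fast as LU" observation on the slide.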

23 Distributing the 1st-stage Schur complement matrix
– All processes contribute to all of the elements of the (1,1) dense block, so a large amount of inter-process communication occurs.
– Each term is too big to fit in a node's memory, and the communication is possibly more costly than the factorization itself.
– Solution: collective MPI_Reduce_scatter calls (sketched below):
  – reduce (sum) the terms, then partition and send each piece to its destination (scatter);
  – elements need to be reordered (packed) to match the matrix distribution;
  – columns of the Schur complement matrix are distributed as they are calculated.
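A sketch of that collective with mpi4py; the block layout and packing are simplified away, and `counts[r]` (the number of packed entries kept by rank r) is this sketch's own bookkeeping, not PIPS's.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def reduce_scatter_schur_block(local_contrib, counts):
    """Sum identically shaped local contributions to a block of Schur-complement
       columns across all ranks and leave each rank only with its own slice.
       local_contrib: this rank's contribution to the block (2-D array),
                      already packed so entries are ordered by destination rank.
       counts: counts[r] = number of packed entries owned by rank r."""
    sendbuf = np.ascontiguousarray(local_contrib, dtype='d').ravel()
    recvbuf = np.empty(counts[comm.Get_rank()], dtype='d')
    # one collective call: reduce (sum) + scatter of the summed result
    comm.Reduce_scatter(sendbuf, recvbuf, recvcounts=counts, op=MPI.SUM)
    return recvbuf
```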

24 DSC with a distributed first stage
Per-iteration timeline (processes 1 … p): the Schur complement matrix is computed and reduced block-wise, in B blocks of columns; for each block b = 1 … B, every process computes its scenario contributions (factorizations and backsolves) and an MPI_Reduce_scatter distributes the summed block. The dense 1st-stage factorization and solve, together with the communication (MPI_Allreduce) around the forward substitutions, are handled by Elemental across all processes.

25 Reduce operations
– Streamlined copying procedure – Lubin and Petra (2010):
  – loop over contiguous memory and copy elements into the send buffer;
  – avoids the division and modulus operations otherwise needed to compute the positions.
– "Symmetric" reduce:
  – only the lower triangle is reduced;
  – fixed buffer size, with a variable number of columns reduced per call;
  – effectively halves the communication (both data volume and number of MPI calls).
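A toy numpy version of the packing step behind the "symmetric" reduce (PIPS's streamlined copy avoids even the index computations shown here):

```python
import numpy as np

def pack_lower_triangle(block):
    """Copy only the lower triangle of a symmetric block into a contiguous
       send buffer, roughly halving the data to be reduced."""
    rows, cols = np.tril_indices(block.shape[0])
    return np.ascontiguousarray(block[rows, cols])

def unpack_lower_triangle(buf, n):
    """Rebuild the symmetric block from the packed lower triangle."""
    block = np.zeros((n, n))
    rows, cols = np.tril_indices(n)
    block[rows, cols] = buf
    block[cols, rows] = buf          # mirror to the upper triangle
    return block
```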

26 Large-scale performance
– First-stage linear algebra: ScaLAPACK (LU), Elemental (LU), and the Cholesky-based L D Lᵀ-like factorization.
– Strong scaling of PIPS: 90.1% efficiency from 64 to 1024 cores and 75.4% from 64 to 2048 cores, with more than 4,000 scenarios, on Fusion.
– SAA problem: 82,000 1st-stage variables, 189 million variables in total, 1,000 thermal units, 1,200 wind farms.
– Lubin, P., Anitescu, in OMS 2011.

27 Towards real-life models – Economic dispatch with transmission constraints
– Current status: ISOs (independent system operators) use
  – deterministic wind profiles, market prices, and demand;
  – network (transmission) constraints;
  – an outer 24 h horizon simulation with 1 h time steps;
  – inner 1 h horizon corrections with 5 min time steps.
– Stochastic ED with transmission constraints (V. Zavala et al., 2010):
  – stochastic wind profiles and transmission constraints;
  – deterministic market prices and demand;
  – 24 h horizon with 1 h time steps;
  – Kirchhoff's laws are part of the constraints;
  – the problem is huge: the KKT systems are 1.8 billion x 1.8 billion.
(Figure: transmission network with generators and load nodes/buses.)

28 Solving ED with transmission constraints on Intrepid BG/P
– 32k wind scenarios (k = 1024).
– 32k nodes (131,072 cores) on Intrepid BG/P.
– Hybrid programming model, SMP inside MPI:
  – sparse 2nd-stage linear algebra: WSMP (IBM);
  – dense 1st-stage linear algebra: Elemental with SMP BLAS, plus OpenMP for packing/unpacking buffers.
– Very good strong scaling for a 4 h horizon problem.
– Lubin, P., Anitescu, Zavala – in proceedings of SC11.

29 Stochastic programming – a scalable computation pattern
– Scenario parallelization in a hybrid MPI+SMP programming model: DSC, PSC (for 1st stages with fewer than 10,000 variables).
– Hybrid MPI/SMP running on Blue Gene/P: 131k cores (96% strong scaling) for the Illinois ED problem with grid constraints; 2 billion variables, possibly the largest such problem ever solved.
– Close to real-time solutions (24 h horizon in 1 h wallclock).
– Further development is needed, since users aim for:
  – more uncertainty, more detail (x 10);
  – faster dynamics, i.e., a shorter decision window (x 10);
  – longer horizons (California requires 72 hours) (x 3).

30 Thank you for your attention! Questions?

