Multi-Model Multi-Domain Computational Methods: Highlights

For an NSF Multidisciplinary Computing Challenges group to live up to expectations, it should integrate the contributions of a variety of disciplines and through that integration excite the communities associated with each, either through raising the "high water mark" of what is computable in a discipline (intrinsic), or through enhancing the relevance of one of the disciplines to another (extrinsic). Some groups will be fortunate enough to be in existence during and contribute to paradigm shifts in computational practice. Several aspects of our progress encourage us along these lines. Highlights of our recent work on parallel implicit methods for 3D PDE problems include:
  • resolution capacity --- over ten million unknowns
  • parallel efficiency --- 85% on up to 128 processors of a T3E
  • sustained per-processor computation rates --- sparse unstructured solvers rendered competitive with sparse structured solvers, and both running within a factor of three of dense solvers
  • algorithmic tuning --- interprocedural optimization of the nested family of preconditioners, accelerators, Newton steppers, and continuation methods
  • algorithmic pathfinding --- new forms of Schwarz methods for convective, diffusive, and wavelike phenomena
  • software engineering --- high visibility for our "rescue" of legacy codes and adoptions of our released software

  • resolution capacity. For those who model the continuum, no amount of computer memory is ever "enough", but drawing upon the physically distributed aggregate memories of today's multiprocessor systems is sufficient to convert some dream computations into reality. Our demonstrated capacity to solve implicitly in parallel nonlinear systems of more than ten million degrees of freedom makes possible new levels of modeling fidelity and geometric complexity. We are within one order of magnitude of the problem size of one hundred million unknowns often contemplated by NASA and DOE engineers and scientists for Teraflops-scale applications, as we await our initial access to Teraflops testbeds to run problems of importance to the design of high-lift systems in commercial aircraft and with three-dimensional wave propagation problems in acoustics.

    Vertices Unknowns Lift Drag Moment
    2,800 4,092 .110 .023 .115
    9,428 37,712 .105 .020 .105
    53,961 215,844 .104 .012 .101
    357,900 1,431,600 .104 .005 .098
    2,761,774 11,047,096 .104 .002 .098

    Table showing the grid convergence of lift, drag, and moment coefficients for an M6 wing at a 3-degree angle of attack in an incompressible, inviscid flow, over a range of three orders of magnitude in discrete problem size, on an unstructured tetrahedral grid. Reference: D. E. Keyes, D. K. Kaushik and B. F. Smith, 1997, "Prospects for CFD on Petaflops Systems," (to appear in CFD Review 1997, M. Hafez, et. al., eds., Wiley).

     

  • parallel efficiency. Running large problems on large numbers of processors is a stunt of limited value if the resources are not being used in a scalable manner, which may loosely be defined as sustaining a relative parallel efficiency over over a range of problem sizes and numbers of processors that is at most weakly dependent on either. By attention to load balance, surface-to-volume minimization, inspector/executor optimizations, message merging, communication/computation overlap, and data copy elimination; and by explicit creation of concurrency in the critical preconditioner phase of any implicit method, we have achieved high relative efficiencies for Grand Challenge-sized problems --- from the point where they first fit on the aggregate memory of a subcollection of processors up to the capacities of the machines presently available to us.

    Number of Procs Iterations Time (sec) Speedup Efficiency
    16 77 2588 1.00 1.00
    32 75 1262 2.05 1.03
    64 75 662 3.91 0.98
    128 82 382 6.77 0.85

    Table showing a relative efficiency of 85% on a fixed-size (1.4 million degree of freedom) unstructured grid Euler simulation on up to 128 processors of a Cray T3E (from 16 processors, the smallest configuration on which it fits). Reference: D. E. Keyes, D. K. Kaushik and B. F. Smith, 1997, "Prospects for CFD on Petaflops Systems," (to appear in CFD Review 1997, M. Hafez, et. al., eds., Wiley).

     

  • sustained per-processor floating point rates. High parallel efficiency is easy to attain in an artificial manner when the SPMD code generated for execution on each individual processor makes such inefficient use of the arithmetic capabilities of the processor that the parallel thieves of synchronization and latency are dwarfed. By attention to RISC pipeline efficiency by unrolling, cache locality by vertex reordering and component reordering, cache workingset size through subdomain-blocked preconditioning, and capture of uniform substructure by expression in the algorithms (rather than leaving it to the discovery of the compiler), we have achieved sustained per-processor floating point execution rates that are about as high as we know of for explicit or implicit PDE codes run on our architectures.

    Application Number of Procs Iterations Time (sec) Mflop/s Mflop/s per Proc
    Euler Code 64 163 9,160.91 5,480 85.6
    Euler Code 80 162 7,330.73 6,810 85.1
    LINPACK 1 -- -- 233 233

    Table comparing sustained parallel floating point rates for a complete unstructured grid (11.4 million degree of freedom) Euler simulation on 64 and 80 processors of an IBM SP (with thin 120MHz P2SC nodes), compared with the sequential floating point rate for the dense LINPACK100 benchmark on the same processor. References: D. E. Keyes, D. K. Kaushik and B. F. Smith, 1997, "Prospects for CFD on Petaflops Systems," (to appear in CFD Review 1997, M. Hafez, et. al., eds., Wiley) and J. J. Dongarra, 1997, "Performance of Various Computers Using Standard Linear Equations Software."

     

  • algorithmic tuning. High performance on a per inner iteration basis does not necessarily improve wall clock performance if the inner iterations are relatively weak, so that many of them (or many outer iterations around them) are needed. Our algorithm development takes into account interprocedural considerations in an attempt to tune the various levels of iteration so as to minimize the overall execution time. We have separately studied the convergence rate and cost per iteration effects of Schwarz preconditioner, Krylov accelerator, Newton method, and continuation variants and can demonstrate algorithmic superiority for many of our decisions via robustness and execution time records, when contrasted to other implementations of comparable implementation quality. Our research on solution algorithms has also fed back into discretization research, including the need for differentiable limiters in high-order convective schemes in transonic and supersonic flows. We expect deeper solution/discretization interactions in our work on Helmholtz problems.

    Method Iterations Time (sec)
    Explicit BC, Explicit Jacobian (Constant CFL) 781 1793.4
    Implicit BC, Explicit Jacobian (Constant CFL) 675 1346.8
    Implicit BC, Explicit Jacobian (Advancing CFL) 267 702.8
    Implicit BC, Matrix-Free Jacobian (Advancing CFL) 56 474.6

    Table comparing the convergence history (in nonlinear outer iterations and in overall execution time) of our preferred matrix-free pseudo-transient NKS technique over less aggressively implicit algorithms on the same problem (98x18x18 structured grid, run on 2 processors of an IBM SP2). Corresponding convergence graphs are also available. Reference: W. D. Gropp, D. E. Keyes, L. C. McInnes and M. D. Tidriri, 1997, "Parallel Implicit PDE Computations: Algorithms and Software", in Proceedings of Parallel CFD'97, A. Ecer et al., eds., Elsevier (to appear).

     

  • algorithmic pathfinding. Algorithmic experience that is not transferable is mainly of academic interest. We have a strong interest in influencing algorithmic design and parallel implementation. Such influence is, in principle, measureable in a citation study conducted over a sufficiently long period of time. Over shorter periods, it is measurable in invitations to give seminars, tutorials, short courses, and in the adoption of one's nomenclature and perspective in the work of others. Newton-Krylov-Schwarz methods, the composite term having been first strung together in collaborations of Cai, Gropp and Keyes that slightly predated the formation of our group, are becoming more and more popular in the high-performance applications community. During the time of our project, we have contributed to several innovations in the Schwarz-style preconditioning that is the most critical and innovative aspect of the overall NKS framework. Variable-degree-Schwarz, anisotropic-sensitive Schwarz, and restricted Schwarz preconditioners have all been developed and tested to support our applications, and novel Schwarz features --- including overlapping Sommerfeld interface conditions for wave problems and mortar schemes for nonconforming discretizations, and the significance of novel Schwarz parameters --- including "wavelap" in wave problems have all been developed.

    The results of our algorithmic work have been invited for presentation during the twelve months ending November 1997 at the following meetings:

  • Workshop on Adaptive Mesh Refinement, HPC Year, Mar 1997, IMA
  • Workshop on Computational Heat Transfer, Mar 1997, Las Vegas
  • Petaflops Algorithms Workshop, Apr 1997, Williamsburg
  • Workshop on the Future of HPC, May 1997, Bergen (Norway)
  • Defense Modernization Parallel Computing Short Course, May 1997, Wright-Patterson AFB
  • Defense Modernization Parallel Computing Short Course, Jun 1997, Arnold AFB
  • Workshop on PDEs, HPCC Year, Jun 1997, IMA
  • SIAM Annual Meeting, Jul 1997, Stanford
  • Tenth International Conference on Domain Decomposition, Aug 1997, Boulder
  • Gordon Conference on HPCC, Jul 1997, NH
  • and at the following universities, industries and laboratories:

  • Boeing Commercial Airplane Company
  • Florida State University
  • Lawrence Livermore National Laboratory
  • NASA Langley Research Center
  • Northwestern University
  • University of Bath (UK)
  • University of Chicago
  • University of Greenwich (UK)
  • Stanford University
  • Yale University
  • along with many additional contributed presentations.

     

  • software engineering. When significant effort is devoted to the creation of efficient parallel software, the actual code --- and not just the algorithmic expression --- has transferable value, both as code that others can build and execute and as a software exemplar for others to imitate in their own coding. As our project matures, our computational examples are making their way, as planned, into the tutorial subdirectories of the PETSc library distribution. We have already been honored by the availability of PETSc demos in software packages released by others. Moreover, our paradigm for "rescuing" legacy codes for the parallel environment --- preserving the value of the discretization while discarding the sequentially bound solver --- is becoming influential.

    Our software has been a focus of presentation during the twelve months ending November 1997 at the following meetings:

  • "Bring Your Own Code" Short Course, Dec 1996, ICASE
  • "Bring Your Own Code" Short Course, Apr 1997, Cornell Theory Center
  • Workshop on Scientific Software, HPCC Year, Jun 1997, IMA
  • Defense Modernization Parallel Computing Short Course, May 1997, Wright-Patterson AFB
  • Defense Modernization Parallel Computing Short Course, May 1997, Arnold AFB
  • PETSc Tutorial, Nov 1997, SC97 Conference
  • Our work on the NASA FUN3D code was featured in the CRPC Newsletter, Parallel Computing Research .


  • [ MMMDCM Home Page | Highlights | Overview | Distinctives | Applications | People | Papers | Related Links ]