We present a new approach for automatic adjoint differentiation (AAD) with a special focus on computations where derivatives ∂F(X)/∂X are required for multiple instances of vectors X. In practice, the presented approach is able to calculate all the differentials faster than the primal (original) C++ program for F.
(1) Logic, M. "AAD: Breaking the Primal Barrier." Wilmott 2019(103) (2019): 8–11.
(2) Goloubentsev, Dmitri, and Evgeny Lakshtanov. "A New Approach to Parallel Computing Using Automatic Differentiation: Getting Top Performance on Modern Multicore Systems." Parallel Universe Magazine (40) (2020).
(3) Goloubentsev, Dmitri, and Evgeny Lakshtanov. "Automatic Adjoint Differentiation for Gradient Descent and Model Calibration." International Journal of Wavelets, Multiresolution and Information Processing (2020): 2040004.
Deterministic methods for global optimization use convex relaxations to furnish crucial global bounding information. In global dynamic optimization, convex relaxations must be generated for the solutions of parametric ordinary differential equations (ODEs). Established approaches construct these relaxations via auxiliary ODE systems, whose right-hand sides are derived from convex relaxations of the original ODEs’ right-hand side. However, there is thus far no established method for computing these ODE relaxations’ subgradients, which would be useful when computing bounds for an overarching global optimization method.
This presentation outlines a new approach based on AD techniques for computing subgradients of these ODE relaxations. First, we present new forward/tangent and reverse/adjoint AD methods for efficiently propagating subgradients of convex relaxations for composite functions, based on recent advances in convex relaxation generation. Next, we show that subgradients of ODE relaxations can be constructed by modified adjoint ODE sensitivity systems with the new AD subgradient methods embedded. Thus, subgradients for ODE relaxations may be computed by feeding appropriate inputs to existing ODE adjoint solvers. We expect this approach to generally speed up the computation of lower bounds in global dynamic optimization.
We demonstrate the application of automatic differentiation (AD) in solving inverse problems with complicated forward models in the field of 3D x-ray computational imaging. With AD, we successfully developed a new algorithm that addresses the multiple scattering problem common in thick samples beyond the depth-of-focus limit, and extended it to multiple microscopy setups. We also show that by exploiting the flexibility of AD, one can build a generic framework that works for a wide range of gradient-based image reconstruction problems. Experimental parameters that may suffer from imprecision, such as probe positions and the affine alignment of near-field images, can be refined using the gradients provided by AD. We have also devised a memory-efficient distribution scheme so that the framework can run on high-performance computing clusters with multiple MPI ranks.
HPC applications running on our supercomputers are used to solve critical problems. These systems are expected to perform tasks not just quickly, but also correctly. In this talk, I will focus on two factors that can affect the correctness of a program, namely faults and reduced precision, and describe our work in analyzing the impact of these errors using Automatic Differentiation. In the first part of the talk, I will describe an analytical method we developed that uses Automatic Differentiation to identify critical variables that have a high impact on the output when corrupted; by selectively protecting these variables, we can improve the resilience of an application. The second part of the talk will be on floating-point precision tuning. I will present our work on using Automatic Differentiation to estimate the error introduced by lower precision, and how we used that estimate to develop a mixed-precision version of the program that improves performance while satisfying a specified error threshold.
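The second idea can be sketched with a first-order error model (a minimal sketch, not the talk's tool: `f` is a hypothetical function, and its hand-written derivative stands in for an AD tangent). To first order, the output error from storing a value in lower precision is the derivative at that value times the rounding error:

```python
import math
import struct

def to_f32(v):
    # simulate rounding a double to single precision
    return struct.unpack("f", struct.pack("f", v))[0]

def f(x):
    # hypothetical computation whose precision we want to tune
    return math.exp(x) * math.sin(x)

def dfdx(x):
    # analytic derivative, standing in for an AD-computed tangent
    return math.exp(x) * (math.sin(x) + math.cos(x))

x = 1.3
predicted = abs(dfdx(x)) * abs(to_f32(x) - x)  # first-order error estimate
actual = abs(f(to_f32(x)) - f(x))              # error actually incurred
print(predicted, actual)
```

The estimate and the measured error agree to first order, which is the information a precision tuner needs to decide whether a variable can safely be demoted.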
Implementing efficient automatic differentiation in C++ is not easy, due to the nature of the language. Challenges include language rules that are too numerous, too complex, and too fast-evolving for a custom parser to handle. Alternatives based on operator overloading require the introduction of a new floating-point type and are ill-suited to legacy code, or to code not written with a particular tool in mind.
In this talk we introduce another approach: a compiler extension using the Clang parser, which becomes a phase in the compiler toolchain able to algorithmically differentiate complex C++ constructs. The extension, clad, has full access to Clang compiler internals. This allows it to influence the generation of LLVM code and to provide several useful tools to the user, including retargeting to accelerator devices. Clad can generate derivatives, gradients, Hessians and Jacobians.
If time permits we will briefly show how to use clad in the C++ interpreter cling to generate derivatives on the fly.
This talk presents Enzyme, a high-performance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme can synthesize gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional source-to-source and operator-overloading tools, Enzyme performs AD on optimized IR. On a machine-learning focused benchmark suite including Microsoft's ADBench, AD on optimized IR achieves a geometric mean speedup of 4.5x over AD on IR before optimization allowing Enzyme to achieve state-of-the-art performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with state-of-the-art performance, enabling foreign code to be directly incorporated into existing machine learning workflows.
Automatic differentiation (AD) is a well-known technique for evaluating analytic derivatives of calculations implemented on a computer, with numerous software tools available for incorporating AD technology into complex applications. However, a growing challenge for AD is the efficient differentiation of parallel computations implemented on emerging manycore computing architectures such as multicore CPUs, GPUs, and accelerators as these devices become more pervasive. In this work, we explore forward-mode, operator overloading-based differentiation of C++ codes on these architectures using the widely available Sacado AD software package. In particular, we leverage Kokkos, a C++ tool providing APIs for implementing parallel computations that is portable to a wide variety of emerging architectures. We describe the challenges that arise when differentiating code for these architectures using Kokkos, and two approaches for overcoming them that ensure optimal memory access patterns as well as expose additional dimensions of fine-grained parallelism in the derivative calculation. We describe the results of several computational experiments that demonstrate the performance of the approach on a few contemporary CPU and GPU architectures. We then conclude with applications of these techniques to the simulation of discretized systems of partial differential equations.
We propose a more general definition of separability by considering intermediate values that separate the set of independent variables. To exploit this property in (unconstrained) global optimization with optima in the interior of the domain, the first-order optimality condition can be simplified whenever the derivative of the objective function with respect to a separator is nonzero on the domain. In that case, the optimization problem decomposes into smaller optimization problems. Such a reduction of problem size is indispensable for applying branch-and-bound algorithms to optimization problems with larger domains. We show how to use (discrete) interval-valued adjoints to implement the algorithm and to find separator candidates automatically.
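As a toy illustration (a hypothetical example, not from the talk): let v = (x1 - 1)^2 + x2^2 separate {x1, x2} from x3 in f = exp(v) + x3^2. Since df/dv = exp(v) > 0 on any domain, the first-order condition decomposes into the two smaller problems "minimize v" and "minimize x3^2":

```python
import math

def f(x1, x2, x3):
    # v is a separator: x1 and x2 enter f only through v
    v = (x1 - 1.0) ** 2 + x2 ** 2
    return math.exp(v) + x3 ** 2

# Because df/dv = exp(v) is nonzero everywhere, stationarity of f forces
# grad v = 0 and 2*x3 = 0, so the subproblems can be solved independently.
x1, x2 = 1.0, 0.0   # minimizes v
x3 = 0.0            # minimizes x3**2
print(f(x1, x2, x3))  # 1.0, the global minimum
```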
Subgradients of functions provide sensitivity information that is used in subgradient methods for nonsmooth convex optimization and bundle methods for nonconvex optimization. However, for several classes of nonsmooth functions including optimal-value functions and solutions of nonsmooth parametric ordinary differential equations, evaluating directional derivatives is far simpler than evaluating correct subgradients by established methods. This presentation shows that a subgradient for a convex function of two variables may always be computed by four forward AD passes, assuming nothing beyond the availability of a forward AD oracle. This is a new result in convex analysis, but surprisingly it also extends to nonconvex functions, where the evaluated “subgradient” is now an element of Clarke’s generalized gradient instead. These results do not extend directly to functions of three or more variables, but may nevertheless be leveraged in these cases to compute lower bounds of black-box convex functions efficiently.
Regarding the approximation of abs-smooth functions by abs-linear models, we have made two observations this year. Firstly, as suggested by Tom Streubel, the generalized Taylor approximation can be extended to arbitrary order, and in fact to an absolutely convergent series of abs-polynomial terms. Secondly, we can split the abs-linear approximation into a convex and a concave part in the fashion of the "Difference of Convex Functions" approach.
Naumann (RWTH Aachen University)
Optimization of Generalized Jacobian Chain Products without Memory Constraints
The efficient computation of Jacobians represents a fundamental challenge in computational science and engineering. Large-scale modular numerical simulation programs can be regarded as sequences of evaluations of (in our case differentiable) modules with corresponding local Jacobians. The latter are typically not available; instead, tangent and adjoint versions of the individual modules are assumed to be given as results of algorithmic differentiation. The classical (Jacobian) matrix chain product formulation is extended with the optional evaluation of matrix-free Jacobian-matrix and matrix-Jacobian products as tangents and adjoints. We propose a dynamic programming algorithm for minimizing the computational cost of such generalized Jacobian chain products without considering constraints on the available persistent system memory. In other words, the naive evaluation of an adjoint of the entire simulation program is assumed to be a feasible option; no checkpointing is required. Under the given assumptions we obtain optimal solutions which improve on the best state-of-the-art methods by factors of up to seven on a set of randomly generated problem instances of growing size.
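The classical dense Jacobian chain that this formulation generalizes can be sketched with the textbook matrix-chain dynamic program (a baseline sketch only, not the talk's extended algorithm; the dimensions are illustrative):

```python
# Dynamic program for the classical dense Jacobian chain J_n * ... * J_1,
# where J_i has shape dims[i] x dims[i-1]; returns the minimal number of
# fused multiply-adds over all bracketings.
def chain_cost(dims):
    n = len(dims) - 1  # number of Jacobians in the chain
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):
        for i in range(1, n - length + 2):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i - 1] * dims[k] * dims[j]
                for k in range(i, j)
            )
    return cost[1][n]

print(chain_cost([10, 100, 5, 50]))  # 7500: bracket as (J2 J1) first, then J3
```

The talk's generalized problem enlarges the search space of this recurrence with matrix-free tangent and adjoint evaluations, so the dense bracketing above is only one of the candidate strategies.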
Various attempts have been made at CD-adapco/Siemens to develop a suitable AD tool for the fluid dynamics simulation software Star-CCM+. This work presents the latest iteration and is considered a synthesis of several years' work.
The approach developed leverages a number of language features found in C++14 in order to differentiate a top-level function, which in turn will cause the differentiation of arbitrary nested functions.
Nested functions are facilitated by having them construct an expression node from a generic node template taking an arbitrary number of parameters. The approach enables them to be treated in a similar way to built-in functions.
The generic approach to handling nested functions has proved invaluable to the development of the adjoint models in Star-CCM+, particularly with the added capacity to support arbitrary virtual functions.
Given all the different overloading-based C++ tools, I see many benefits in defining a common interface which covers all or most of the tools for simple tangent and tape-based adjoint modes:
Creating tests and benchmarks, for example, allows tool developers to increase the quality of their software.
A common interface might also convince other library vendors (e.g. Eigen) to integrate fast external adjoints for standard linear algebra kernels.
The talk will give a draft of the interface I have in mind to cover standard tangent and tape-based adjoint modes, and will also give an outlook on a more fine-grained interface covering more advanced features. A GitHub repo for discussion and collaboration will also be released.
Participation of and discussion with other tool developers is very welcome.
CoDiPack was initially released five years ago. Since then, additions such as primal value taping and several helper structures have been added to the code base. Recent developments for SIMD, tasking, OpenMP and DSL support have shown that the initial design of CoDiPack makes it difficult to add such special-interest tapes in a maintainable way; CoDiPack is therefore being redesigned for the 2.0 release. The redesign focuses mainly on maintainability, with additional changes for ease of development and the application of modern coding guidelines. The talk will cover the lessons learned, the new design choices and a performance analysis of the new implementation.
1700–1800 Breakouts + Socials
1800 Closing Remarks
Thursday, August 13, 2020
1430–1600 Welcome, Contributed Talks V
Torsten Bosse (FSU Jena)
How I tortured Torch and how Torch tortured me...
JAX is a system for high-performance machine-learning research. It offers the familiarity of Python+NumPy together with hardware acceleration, and it enables the definition and composition of user-wielded function transformations useful for machine-learning programs. These transformations include automatic differentiation, automatic batching, end-to-end compilation (via XLA) and parallelizing over multiple accelerators.
These transformations were designed with a heavy focus on composability, which allows a very rich set of rewrites to be built from very simple building blocks. In this talk I will focus on the transformations used within the automatic differentiation subsystem. For example, I will discuss how the derivative formulas for reverse-mode AD can be derived from the derivative formulas for forward-mode AD fully automatically, through a transformation for linear function transposition. Then we'll move on to discuss how this factorization of the different AD methods enables easy extensions to more advanced techniques such as gradient checkpointing, as well as memory-efficient AD for point-wise and invertible functions.
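This composition is visible in JAX's public API (a minimal sketch with a toy function `f`, assuming a recent JAX release): reverse mode can be obtained by linearizing with forward mode and then transposing the resulting linear map.

```python
import jax
import jax.numpy as jnp

def f(x):
    # toy scalar-valued function
    return jnp.sum(jnp.sin(x) ** 2)

x = jnp.arange(3.0)

# forward mode: linearize f at x, giving the linear map v -> J v
y, f_lin = jax.linearize(f, x)

# transpose the linear map to get w -> J^T w, i.e. the VJP behind jax.grad
f_vjp = jax.linear_transpose(f_lin, x)
(grad_via_transpose,) = f_vjp(jnp.ones_like(y))

# agrees with reverse mode computed directly
assert jnp.allclose(grad_via_transpose, jax.grad(f)(x))
```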
Optimal neural network architecture is a very important factor in the computational complexity and memory footprint of neural networks. In this regard, a robust pruning method based on interval adjoint significance analysis is presented to prune irrelevant and redundant nodes from a neural network. The significance of a node is defined as the product of the node's interval width and the absolute maximum of the first-order derivative over that node's interval. Based on the significance of nodes, one can decide how much to prune from each layer. We show that the proposed method works effectively on hidden and input layers through experiments on well-known and complex machine learning datasets. In the proposed method, a node is removed based on its significance, and the bias is updated for the remaining nodes.
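The significance score as defined above can be sketched as follows (names and intervals are hypothetical; in a real implementation, both intervals would come from an interval-adjoint evaluation of the network):

```python
def significance(node_interval, adjoint_interval):
    """Interval width of a node times the max |adjoint| over its interval."""
    lo, hi = node_interval
    alo, ahi = adjoint_interval
    width = hi - lo
    max_abs_adjoint = max(abs(alo), abs(ahi))
    return width * max_abs_adjoint

# a node whose value varies little and whose adjoint stays small scores
# low, making it a candidate for pruning
print(significance((0.4, 0.6), (-0.01, 0.02)))
```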
In min-min optimization or max-min optimization, one has to compute the gradient of a function defined as a minimum. In most cases, the minimum has no closed form, and an approximation is obtained via an iterative algorithm. There are two usual ways of estimating the gradient of the function: using either an analytic formula obtained by assuming exactness of the approximation, or automatic differentiation through the algorithm. In this paper, we study the asymptotic error made by these estimators as a function of the optimization error. We find that the error of the automatic estimator is close to the square of the error of the analytic estimator, reflecting a super-efficiency phenomenon. The convergence of the automatic estimator greatly depends on the convergence of the Jacobian of the algorithm. We analyze it for gradient descent and stochastic gradient descent and derive convergence rates for the estimators in these cases. Our analysis is backed by numerical experiments on toy problems and on Wasserstein barycenter computation. Finally, we discuss the computational complexity of these estimators and give practical guidelines to choose between them.
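The super-efficiency phenomenon can be observed directly on a toy problem (hypothetical, chosen so everything has a closed form: g(x, y) = (y - x)^2 + x^2, so F(x) = min_y g(x, y) = x^2 and the true gradient is F'(x) = 2x):

```python
def estimators(x, steps, lr=0.1):
    """Run gradient descent on y and return both gradient estimators."""
    y, dy_dx = 0.0, 0.0                          # iterate and its derivative w.r.t. x
    for _ in range(steps):
        y -= lr * 2.0 * (y - x)                  # gradient step in y
        dy_dx -= lr * 2.0 * (dy_dx - 1.0)        # forward-mode derivative of the step
    dg_dx = -2.0 * (y - x) + 2.0 * x             # partial derivative of g w.r.t. x
    analytic = dg_dx                             # analytic (envelope) estimator
    autodiff = dg_dx + 2.0 * (y - x) * dy_dx     # chain rule through y_K(x)
    return analytic, autodiff

a, b = estimators(1.0, steps=20)
# the autodiff error is roughly the square of the analytic error
print(abs(a - 2.0), abs(b - 2.0))
```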
1600–1630 Break, Invited Talk II
Robert Gower (Facebook AI Research)
Calculating the Hessian matrix
The gradient can famously be calculated with a single forward and reverse pass of the function's computational graph. What about the Hessian? In this talk I will show that the Hessian matrix can be efficiently computed using only a single forward and reverse pass. To do this, I will carefully explain how to make use of the graph interpretation of reverse-mode differentiation, showing how the computation of both gradients and Hessian matrices is represented on the computational graph of the underlying function. Using this graph-based intuition, I will deduce a method for computing Hessian matrices using a single reverse sweep. The resulting algorithm (called edge pushing) can calculate Hessian matrices of dimension 10^5 x 10^5 in a matter of seconds. I will conclude by showing experimental results demonstrating that the edge-pushing method is faster than alternative coloring-based algorithms.
We explore both the classical “perturbation confusion” that arises from nesting AD in first-order code, and the so-called “amazing bug”: the confusion that can arise in AD of higher-order functions when the one-to-one correspondence between derivative-taking and invocations of the derivative-taking operator, guaranteed in first-order code, is lost. These issues arise in both forward- and reverse-mode AD, and are becoming more important in practice as demand from users encourages implementations to support both nesting (e.g., for learning-to-learn) and AD of higher-order functions (e.g., for AD through ODE solvers or optimization routines). In exploring these issues we cover what it means to AD-transform a higher-order function, and various methods for ameliorating perturbation confusion and the amazing bug.
Dougal Maclaurin (Google Research)
Tackling the array indexing problem in functional AD
Functional array languages offer O(1) array indexing (subscripting). But the transpose of this operation, building a one-hot array of mostly zeros, is O(N), which violates the cheap gradient principle. We describe how we solve this problem in the array language Dex, with an effect system that can express in-place accumulations. Our effect system also preserves opportunities for parallelism and other optimizations, which can be a struggle for purely imperative AD systems.
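The indexing/transpose asymmetry and the accumulation fix can be sketched in plain Python (a sketch of the idea only, not Dex code):

```python
def index(x, i):
    # O(1): the primal operation
    return x[i]

def index_transpose(ct, i, n):
    # naive functional transpose: scatter the cotangent ct into a one-hot
    # array of length n -- O(n), violating the cheap gradient principle
    return [ct if j == i else 0.0 for j in range(n)]

def index_transpose_accum(acc, ct, i):
    # with an in-place accumulation effect, the same cotangent is instead
    # added into an existing accumulator -- O(1)
    acc[i] += ct
```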
Mehrdad Maleki (National University of Ireland Maynooth)
Many Formulations of One Transformation: Reverse AD in a Functional Context
We compare various formalizations of reverse-mode AD in a higher-order functional context. For each formulation, we give a brief description of the mathematical theories used, and describe how this machinery transforms fanout in the primal (forward phase) into addition in the adjoint (reverse phase). One strategy for transforming higher-order functions is closure conversion of their inputs and outputs, and we describe how these frameworks can be regarded in that light. We also touch upon the complexity, laziness, compiler-friendliness, and ability to nest, of each formalization.