
Programme of the 23rd Euro AD Workshop
Tuesday, August 11, 2020
 14^{30} –16^{00} Welcome, Contributed Talks I (all times in CEST)
 DMITRI GOLOUBENTSEV(1), Evgeny Lakshtanov(2,1) ((1) Matlogica LTD, London, UK. (2).University of Aveiro, Portugal.)
AAD: Breaking the Primal Barrier
We present a new approach for automatic adjoint differentiation (AAD) with a special focus on computations where derivatives ∂F(X)/∂X are required for multiple instances of vectors X. In practice, the presented approach is able to calculate all the differentials faster than the primal (original) C++ program for F.
(1) Logic, M., 2019. AAD: Breaking the Primal Barrier. Wilmott, 2019(103), pp.811.
(2) Goloubentsev, Dmitri, and Evgeny Lakshtanov. "A New Approach to Parallel Computing Using Automatic Differentiation: Getting Top Performance on Modern Multicore Systems", Parallel Universe Magazine, (40), 2020
(3) Goloubentsev, Dmitri, and Evgeny Lakshtanov. "Automatic adjoint differentiation for gradient descent and model calibration." International Journal of Wavelets, Multiresolution and Information Processing (2020): 2040004.
 Yingkai Song (McMaster University)
Generating Adjoint Subgradients for Global Dynamic Optimization
Deterministic methods for global optimization use convex relaxations to furnish crucial global bounding information. In global dynamic optimization, convex relaxations must be generated for the solutions of parametric ordinary differential equations (ODEs). Established approaches construct these relaxations via auxiliary ODE systems, whose righthand sides are derived from convex relaxations of the original ODEs’ righthand side. However, there is thus far no established method for computing these ODE relaxations’ subgradients, which would be useful when computing bounds for an overarching global optimization method.
This presentation outlines a new approach based on AD techniques for computing subgradients of these ODE relaxations. First, we present new forward/tangent and reverse/adjoint AD methods for efficiently propagating subgradients of convex relaxations for composite functions, based on recent advances in convex relaxation generation. Next, we show that subgradients of ODE relaxations can be constructed by modified adjoint ODE sensitivity systems with the new AD subgradient methods embedded. Thus, subgradients for ODE relaxations may be computed by feeding appropriate inputs to existing ODE adjoint solvers. We expect this approach to generally speed up the computation of lower bounds in global dynamic optimization.
 Markus Towara (RWTH Aachen University)
Towards Discrete Adjoint Optimization of CHT Problems in OpenFOAM
Conjugate Heat Transfer (CHT) simulations allow the prediction of complex interactions between fluid and solid mediums.
We will present the adaption of existing primal solvers into our discrete adjoint OpenFOAM framework and the arising challenges.
Our application is the optimization of heat transfer between heat sinks and a cooling fluid, used to extract the heat from server infrastructure.
 Ming Du (Argonne National Laboratory)
Applications of automatic differentiation in image reconstruction and experimental parameter refinement for 3D microscopy
We demonstrate the application of automatic differentiation (AD) in solving inverse problems with complicated forward models in the field of 3D xray computational imaging. With AD, we successfully developed a new algorithm that addresses the multiple scattering problem common in thick samples beyond the depth of focus limit, and extended it for multiple microscopy setups. We also show that by exploiting the flexibility of AD, one can build a generic framework that works for a wide range of gradientbased image reconstruction problems. Experimental parameters that may suffer from imprecisions, such as probe positions and the affine alignment of nearfield images, can be refined using the gradients provided by AD. We have also devised a memoryefficient distribution scheme so that the framework can run on highperformance computing clusters with multiple MPI ranks.
 16^{00} –16^{30} Break, Invited Talk 1
 Harshitha Gopalakrishnan Menon (Lawrence Livermore National Laboratory)
Error Analysis Using Automatic Differentiation in HPC Applications
HPC applications running on our supercomputers are used to solve critical problems. These systems are expected to perform tasks not just quickly, but also correctly. In this talk, I will focus on two factors that can affect the correctness of a program, namely, faults and reduced precision, and describe our work in analyzing the impact of these errors using Automatic Differentiation. In the first part of the talk I will describe an analytical method that we developed using Automatic Differentiation to identify critical variables that have a high impact on the output when corrupted, and by selectively protecting them we can improve the resilience of an application. The second part of the talk will be on floatingpoint precision tuning. I will present our work on using Automatic Differentiation to estimate the error of using lower precision and how we used that to develop a mixedprecision version of the program to improve the performance while satisfying a specified error threshold.
 16^{30} –17^{00} Social Event
 17^{00} –18^{00} Contributed Talks II
 Vassil Vassilev (Princeton University)
Clad  Automatic Differentiation in C++ and Clang
Implementation of efficient automatic differentiation in C++ is not easy due to the nature of the language. Challenges include language rules that are too numerous, too complex and that evolve too often for a custom parser to handle. Alternatives based on operator overloading require the introduction of a new floating point type and are unfit to legacy code, or to code which is not written with a particular tool in mind.
In this talk we introduce another approach, a compiler extension using the clang parser, which becomes a phase in the compiler toolchain that is able to algorithmically differentiate complex C++ constructs. The extension, clad, has full access to Clang compiler internals. This allows it to influence the generation of LLVM code and to provide several useful tools to the user, including retargeting to accelerator devices. Clad can generate derivatives, gradients, hessians and jacobians.
If time permits we will briefly show how to use clad in the C++ interpreter cling to generate derivatives on the fly.
 William Moses (MIT)
PostOptimization Automatic Differentiation by Synthesizing LLVM
This talk presents Enzyme, a highperformance automatic differentiation (AD) compiler plugin for the LLVM compiler framework capable of synthesizing gradients of statically analyzable programs expressed in the LLVM intermediate representation (IR). Enzyme can synthesize gradients for programs written in any language whose compiler targets LLVM IR including C, C++, Fortran, Julia, Rust, Swift, MLIR, etc., thereby providing native AD capabilities in these languages. Unlike traditional sourcetosource and operatoroverloading tools, Enzyme performs AD on optimized IR. On a machinelearning focused benchmark suite including Microsoft's ADBench, AD on optimized IR achieves a geometric mean speedup of 4.5x over AD on IR before optimization allowing Enzyme to achieve stateoftheart performance. Packaging Enzyme for PyTorch and TensorFlow provides convenient access to gradients of foreign code with stateoftheart performance, enabling foreign code to be directly incorporated into existing machine learning workflows.
 Eric Phipps (Sandia National Laboratories)
Automatic Differentiation of C++ Codes on Emerging Manycore Architectures with Sacado
Automatic differentiation (AD) is a wellknown technique for evaluating analytic derivatives of calculations implemented on a computer, with numerous software tools available for incorporating AD technology into complex applications. However, a growing challenge for AD is the efficient differentiation of parallel computations implemented on emerging manycore computing architectures such as multicore CPUs, GPUs, and accelerators as these devices become more pervasive. In this work, we explore forwardmode, operator overloadingbased differentiation of C++ codes on these architectures using the widely available Sacado AD software package. In particular, we leverage Kokkos, a C++ tool providing APIs for implementing parallel computations that is portable to a wide variety of emerging architectures. We describe the challenges that arise when differentiating code for these architectures using Kokkos, and two approaches for overcoming them that ensure optimal memory access patterns as well as expose additional dimensions of finegrained parallelism in the derivative calculation. We describe the results of several computational experiments that demonstrate the performance of the approach on a few contemporary CPU and GPU architectures. We then conclude with applications of these techniques to the simulation of discretized systems of partial differential equations.
 18^{00} Closing Remarks

Wednesday, August 12, 2020
 14^{30} –16^{00} Welcome, Contributed Talks III
 Jens Deussen (RWTH Aachen University)
Generalization of Separability for Optimization Problems
We come up with a more general definition of separability by considering intermediate values that separate the set of independent variables. For the exploitation of this property in (unconstrained) global optimization with optima in the interior domain the first order optimality condition can be simplified if the derivative of the objective function with respect to a separator is nonzero on the domain. In that case the optimization problem decomposes into smaller optimization problems. A reduction of the problem size is indispensable for the application of branch and bound algorithms to optimization problems with larger domain sizes. We show how to utilize (discrete) intervalvalued adjoints to implement the algorithm and to find separator candidates automatically.
 Kamil Khan (McMaster University)
Computing a subgradient by forward/tangent AD for functions of two variables
Subgradients of functions provide sensitivity information that is used in subgradient methods for nonsmooth convex optimization and bundle methods for nonconvex optimization. However, for several classes of nonsmooth functions including optimalvalue functions and solutions of nonsmooth parametric ordinary differential equations, evaluating directional derivatives is far simpler than evaluating correct subgradients by established methods. This presentation shows that a subgradient for a convex function of two variables may always be computed by four forward AD passes, assuming nothing beyond the availability of a forward AD oracle. This is a new result in convex analysis, but surprisingly it also extends to nonconvex functions, where the evaluated “subgradient” is now an element of Clarke’s generalized gradient instead. These results do not extend directly to functions of three or more variables, but may nevertheless be leveraged in these cases to compute lower bounds of blackbox convex functions efficiently.
 Andreas Griewank (HU Berlin)
New stuff on abssmooth/linear functions
Regarding the approximation of abssmooth functions by abslinear models we have made two observations this year. Firstly as suggested by Tom Streubel the generalized Taylor approximation can be extended to arbitrary order and in fact an absolutely convergent series of abspolynomial terms. Secondly, we can split the abslinear approximation into a convex and a concave part in the fashon of the "Difference of Convex Functions" approach.
 Naumann (RWTH Aachen University)
Optimization of Generalized Jacobian Chain Products without Memory Constraints
The efficient computation of Jacobians represents a fundamental challenge in computational science and engineering. Largescale modular numerical simulation programs can be regarded as sequences of evaluations of in our case differentiable modules with corresponding local Jacobians. The latter are typically not available. Tangent and adjoint versions of the individual modules are assumed to be given as results of algorithmic differentiation instead. The classical (Jacobian) matrix chain product formulation is extended with the optional evaluation of matrixfree Jacobianmatrix and matrixJacobian products as tangents and adjoints. We propose a dynamic programming algorithm for the minimization of the computational cost of such generalized Jacobian chain products without considering constraints on the available persistent system memory. In other words, the naive evaluation of an adjoint of the entire simulation program is assumed to be a feasible option. No checkpointing is required. Under the given assumptions we obtain optimal solutions which improve the best state of the art methods by factors of up to seven on a set of randomly generated problem instances of growing size.
 16^{00} –17^{00} Break, Contributed Talks IV
 Dominic Jones (Siemens PLM, London)
Recursive compile time adjoint in C++
Various attempts have been made at CDadapco/Siemens to develop a suitable AD tool for the fluid dynamics simulation software StarCCM+. This work presents the latest iteration and is considered a synthesis of several years work.
The approach developed leverages a number of language features found in C++14 in order to differentiate a toplevel function, which in turn will cause the differentiation of arbitrary nested functions.
Nested functions are facilitated by having them construct an expression node from a generic node template taking an arbitrary number of parameters. The approach enables them to be treated in a similar way to builtin functions.
The generic approach to handling nested functions has proved invaluable to the development of the adjoint models in StarCCM+, particularly with the added capacity to support arbitrary virtual functions.
 Klaus Leppkes (RWTH Aachen University)
Standard Open Overloading AD for C++
Given all the different overloading C++ Tools, I see lots of benefits in defining a common Interface which covers all/most of the tools for simple tangent and tape based adjoint mode:
Creating tests and benchmarks allow tool developers to increase the quality of their software for example.
Having a common interface might convince other library vendors (e.g. Eigen) to integrate fast external adjoints for stand linear algebra kernels.
The talk will give a draft of the Interface I have in mind to cover standard tangent and tape based adjoint, but also gives an outlook on a more fine grain interface to cover more advance features. A github repo for discussion/collaboration will also be released.
Participation of and discussion with other tool developer is very welcome.
 Max Sagebaum (TU Kaiserslautern)
CoDiPack 2.0: Lessons learned and new ideas
CoDiPack has initially been released 5 years ago. Since then, additions like primal value taping and several helper structures have been added to the code base. Recent developments for SIMD, tasking, OpenMP and DSL support have shown that the initial design of CoDiPack makes it difficult to add such special interest tapes in a maintainable way and therefore CoDiPack is redesigned for the 2.0 release. The focus for the redesign is mainly on the maintainability aspects. In addition, there are also changes for the ease of development and applying modern coding guidelines. The talk will cover the lessons learned, new design choices and a performance analysis of the new implementation.
 17^{00} –18^{00} Breakouts + Socials
 18^{00} Closing Remarks

Thursday, August 13, 2020
 14^{30} –16^{00} Welcome, Contributed Talks V
 Torsten Bosse (FSU Jena)
How I tortured Torch and how Torch tortured me...
In this presentation, I report some results/experiences of my experiments approximating nonlinear FNN by their piecewise linear approximations using the ML library Torch.
 Adam Paszke (Google Research)
JAX: AD through composable function transformations
JAX is a system for highperformance machinelearning research. It offers the familiarity of Python+NumPy together with hardware acceleration, and it enables the definition and composition of userwielded function transformations useful for machinelearning programs. These transformations include automatic differentiation, automatic batching, endtoend compilation (via XLA) and parallelizing over multiple accelerators.
All those rewrites were designed with a heavy focus on composability, which allows a very rich set of rewrites to be built from very simple building blocks. In this talk I will focus on the transformations used within the automatic differentiation subsystem. For example, I will discuss how the derivative formulas for reversemode AD can be derived from the derivative formulas for forwardmode AD fully automatically, through a transformation for linear function transposition. Then, we’ll move on to discuss how this factorization of different AD methods enables easy extensions to more advanced techniques such as gradient checkpointing as well as memoryefficient AD for pointwise and invertible functions.
 Sher Afghan (RWTH Aachen University)
Interval Adjoint Significance Analysis for Neural Networks
Optimal neural network architecture is a very important factor for computational complexity and memory footprints of neural networks. In this regard, a robust pruning method based on interval adjoints significance analysis is presented to prune irrelevant and redundant nodes from a neural network. The significance of a node is defined as a product of a node's interval width and an absolute maximum of the firstorder derivative of that node's interval. Based on the significance of nodes, one can decide how much to prune from each layer. We show that the proposed method works effectively on hidden and input layers by experimenting on famous and complex datasets of machine learning. In the proposed method, a node is removed based on its significance, and bias is updated for remaining nodes.
 Pierre Ablin (ENS/PSL)
Superefficiency of automatic differentiation for functions defined as a minimum
In minmin optimization or maxmin optimization, one has to compute the gradient of a function defined as a minimum. In most cases, the minimum has no closedform, and an approximation is obtained via an iterative algorithm. There are two usual ways of estimating the gradient of the function: using either an analytic formula obtained by assuming exactness of the approximation, or automatic differentiation through the algorithm. In this paper, we study the asymptotic error made by these estimators as a function of the optimization error. We find that the error of the automatic estimator is close to the square of the error of the analytic estimator, reflecting a superefficiency phenomenon. The convergence of the automatic estimator greatly depends on the convergence of the Jacobian of the algorithm. We analyze it for gradient descent and stochastic gradient descent and derive convergence rates for the estimators in these cases. Our analysis is backed by numerical experiments on toy problems and on Wasserstein barycenter computation. Finally, we discuss the computational complexity of these estimators and give practical guidelines to chose between them.
 16^{00} –16^{30} Break, Invited Talk II
 Robert Gower (Facebook AI Research)
Calculating the Hessian matrix
The gradient can be famously calculated with a single forward and reverse pass of the functions computational graph. What about the Hessian? In this talk I will show that the Hessian matrix can be efficiently computed using only a single forward and reverse pass. To do this, I will carefully explain how to make use of the graph interpretation of reverse mode differentiation, showing both how the computation of gradients and Hessian matrices are represented on the computational graph of the underlying function. Using this graph based intuition I will deduce a method for computing Hessian matrices using a single reverse sweep. The resulting algorithm (called edge pushing) can calculate Hessian matrices of dimensions 10^5 x 10^5 in a matter of seconds. I will conclude by showing experimental results on how the edge pushing method is faster than alternative coloring based algorithms.
 16^{30} –17^{00} Social Event
 17^{00} –18^{00} Contributed Talks VI
 Barak A. Pearlmutter (Maynooth University)
Perturbation Confusion, Nesting, and The Amazing Bug
We explore both classical “perturbation confusion” that arises from nesting AD in firstorder code, and the socalled “amazing bug”, the confusion that can arise in AD of higherorder functions when the onetoone correspondence between derivativetaking and invocations of the derivativetaking operator, guaranteed in firstorder code, is lost. These issues arise in both forward and reverse mode AD, and are becoming more important in practice as demand from users encourages implementations to support both nesting (e.g., for learningtolearn) and AD of higherorder functions (e.g., for AD through ODE solvers or optimization routines.). In exploring these issues we cover what it means to ADtransform a higherorder function, and various methods for ameliorating perturbation confusion and the amazing bug.
 Dougal Maclaurin (Google Research)
Tackling the array indexing problem in functional AD
Functional array languages offer O(1) array indexing (subscripting). But the
transpose of this operation, building a onehot array of mostly zeros, is O(N),
which violates the cheap gradient principle. We describe how we solve this
problem in the array language Dex, with an effect system that can express
inplace accumulations. Our effect system also preserves opportunities for
parallelism and other optimizations, which can be a struggle for purely
imperative AD systems.
 Mehrdad Maleki (National University of Ireland Maynooth)
Many Formulations of One Transformation: Reverse AD in a Functional Context
We compare various formalizations of reversemode AD in a higherorder functional context. For each formulation, we give a brief description of the mathematical theories used, and describe how this machinery transforms fanout in the primal (forward phase) into addition in the adjoint (reverse phase). One strategy for transforming higherorder functions is closure conversion of their inputs and outputs, and we describe how these frameworks can be regarded in that light. We also touch upon the complexity, laziness, compilerfriendliness, and ability to nest, of each formalization.
 18^{00} Closing Remarks

