|
In mathematics and computer algebra, automatic differentiation, or AD, sometimes alternatively called algorithmic differentiation, is a method to numerically evaluate the derivative of a function specified by a computer program. Two classical ways of doing differentiation are: Euclid, Greek mathematician, 3rd century BC, as imagined by by Raphael in this detail from The School of Athens. ...
A computer algebra system (CAS) is a software program that facilitates symbolic mathematics. ...
For a non-technical overview of the subject, see Calculus. ...
Figure 1: How automatic differentiation relates to symbolic differentiation The drawback with symbolic differentiation is low speed, and the difficulty of converting a computer program into a single expression. Moreover, many functions grow quite complex as higher derivatives are calculated. Two important drawbacks with finite differences are round-off errors in the discretization process, and cancellation. Both classical methods have problems with calculating higher derivatives, where the complexity and errors increase. Automatic differentiation solves all of the mentioned problems. Image File history File links Size of this preview: 800 Ã 324 pixelsFull resolution (2244 Ã 908 pixel, file size: 75 KB, MIME type: image/png) I made this myself using TikZ, Beamer and LaTeX It displays how automatic differentiation relates to symbolic differentiation, the two upper boxes represent a source code...
Image File history File links Size of this preview: 800 Ã 324 pixelsFull resolution (2244 Ã 908 pixel, file size: 75 KB, MIME type: image/png) I made this myself using TikZ, Beamer and LaTeX It displays how automatic differentiation relates to symbolic differentiation, the two upper boxes represent a source code...
Numerical differentiation is a technique of numerical analysis to produce an estimate of the derivative of a mathematical function or function subroutine using values from the function and perhaps other knowledge about the function. ...
A round-off error, also called rounding error, is the difference between the calculated approximation of a number and its exact mathematical value. ...
Discretization concerns the process of transferring continuous models and equations into discrete counterparts. ...
AD exploits the fact that any computer program that implements a vector function y = F(x) (generally) can be decomposed into a sequence of elementary assignments, any one of which may be trivially differentiated by a simple table lookup. These elemental partial derivatives, evaluated at a particular argument, are combined in accordance with the chain rule from derivative calculus to form some derivative information for F (such as gradients, tangents, the Jacobian matrix, etc.). This process yields exact (to numerical accuracy) derivatives. Because the symbolic transformation occurs only at the most basic level, AD avoids the computational problems inherent in complex symbolic computation. In calculus, the chain rule is a formula for the derivative of the composite of two functions. ...
For other uses, see Gradient (disambiguation). ...
In mathematics, the word tangent has two distinct but etymologically-related meanings: one in geometry and one in trigonometry. ...
In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. ...
Derivation of augmented arithmetic for Forward-Mode AD using dual numbers
Forward-Mode Automatic Differentiation is accomplished by augmenting the algebra of real numbers and obtaining a new arithmetic. An additional component is added to every number which will represent the derivative of a function at the number, and all arithmetic operators are extended for the augmented algebra. The augmented algebra is the algebra of dual numbers. Algebra is a branch of mathematics concerning the study of structure, relation and quantity. ...
Please refer to Real vs. ...
Arithmetic tables for children, Lausanne, 1835 Arithmetic or arithmetics (from the Greek word αÏιθμÏÏ = number) is the oldest and most elementary branch of mathematics, used by almost everyone, for tasks ranging from simple daily counting to advanced science and business calculations. ...
A variety of dualities in mathematics are listed at duality (mathematics). ...
Replace every number with the number , where x' is a real number, but is nothing but a symbol with the property . Using only this, we get for the regular arithmetic   and likewise for subtraction and division. Now, we may calculate polynomials in this augmented arithmetic. If , then In mathematics, polynomial functions, or polynomials, are an important class of simple and smooth functions. ...
x' can be chosen arbitrarily for now, it will be denoted a seed later. The new arithmetic consists of ordered pairs, elements written , with ordinary arithmetics on the first component, and first order differentiation arithmetic on the second component, as can be derived as above. Extending the above results on polynomials to analytic functions we obtain a list of the basic arithmetic and some standard functions for the new arithmetic: In mathematics, an ordered pair is a collection of two objects such that one can be distinguished as the first element and the other as the second element (the first and second elements are also known as left and right projections). ...
In mathematics, an analytic function is one that is locally given by a convergent power series. ...
          and in general for the primitive function g,  where g(1) and g(2) are the derivatives of g with respect to its first and second arguments, respectively. Also, when a binary basic arithmetic operation is applied to mixed arguments, say the ordered pair and the real number c, the real number is first lifted to . The derivative of a function at the point x0 is now found by calculating using the above arithmetic, which gives as the result.
Vector arguments and functions Multivariate functions can be handled with the same efficiency and mechanisms as univariate functions by adopting a directional derivative operator, which finds the directional derivative y' in of at x in in the direction x' in by calculating using the same arithmetic as above.
Higher order differentials The above arithmetic can be generalized, in the natural way, to second order and higher derivatives. However, the arithmetic rules quickly grow very complicated, complexity will be quadratic in the highest derivative degree. Instead, truncated Taylor series arithmetic is used. This is possible because the Taylor summands in a Taylor series of a function are products of known coefficients and derivatives of the function. Computations of Hessians using AD has proven useful in some optimization contexts. As the degree of the Taylor series rises, it approaches the correct function. ...
In mathematics, the Hessian matrix is the square matrix of second partial derivatives of a scalar-valued function. ...
The chain rule, forward and reverse accumulation Fundamental to AD is the decomposition of differentials provided by the chain rule. For the simple composition f(x) = g(h(x)) the chain rule gives In calculus, the chain rule is a formula for the derivative of the composite of two functions. ...
 Usually, two distinct modes of AD are presented, forward accumulation (or forward mode) and reverse accumulation (or reverse mode, the use of "backward" is discouraged). Forward accumulation specifies that one traverses the chain rule from right to left, that is first one computes dh / dx and then dg / dh, and reverse accumulation from left to right.
Figure 2: Example of forward accumulation with computational graph Image File history File links Size of this preview: 800 Ã 409 pixelsFull resolution (1346 Ã 688 pixel, file size: 56 KB, MIME type: image/png) I made this myself using the graphics programming language TikZ, Beamer (LaTeX) and LaTeX. It displays how derivative values propagate according to the chain rule in...
Image File history File links Size of this preview: 800 Ã 409 pixelsFull resolution (1346 Ã 688 pixel, file size: 56 KB, MIME type: image/png) I made this myself using the graphics programming language TikZ, Beamer (LaTeX) and LaTeX. It displays how derivative values propagate according to the chain rule in...
Forward accumulation Forward accumulation automatic differentiation is easiest to understand and to implement. The function f(x1,x2) = x1x2 + sin(x1) is by a computer (or the human programmer) interpreted as the sequence of elementary operations on the work variables wi, and an AD tool for forward accumulation adds the corresponding operations on the second component of the augmented arithmetic. | Original code statements | Added statements for derivatives | | w1 = x1 | w'1 = 1 (seed) | | w2 = x2 | w'2 = 0 (seed) | | w3 = w1w2 | w'3 = w'1w2 + w1w'2 = 1x2 + x10 = x2 | | w4 = sin(w1) | w'4 = cos(w1)w'1 = cos(x1)1 | | w5 = w3 + w4 | w'5 = w'3 + w'4 = x2 + cos(x1) | The derivative computation for f(x1,x2) = x1x2 + sin(x1) needs to be seeded in order to distinguish between the derivative with respect to x1 or x2. The table above seeds the computation with w'1 = 1 and w'2 = 0 and we see that this results in x2 + cos(x1) which is the derivative with respect to x1. Note that although the table displays the symbolic derivative, in the computer it is always the evaluated (numeric) value that is stored. Figure 2 represents the above statements in a computational graph. In order to compute the gradient of this example function, that is and , two sweeps over the computational graph is needed, first with the seeds w'1 = 1 and w'2 = 0, then with w'1 = 0 and w'2 = 1. The computational complexity of one sweep of forward accumulation is proportional to the complexity of the original code. As a branch of the theory of computation in computer science, computational complexity theory describes the scalability of algorithms, and the inherent difficulty in providing scalable algorithms for specific computational problems. ...
Forward accumulation is superior to reverse accumulation for functions with as only one sweep is necessary, compared to m sweeps for reverse accumulation.
Figure 3: Example of reverse accumulation with computational graph Image File history File links Size of this preview: 800 Ã 466 pixelsFull resolution (1319 Ã 769 pixel, file size: 69 KB, MIME type: image/png) I made this myself using the graphics programming language TikZ, Beamer (LaTeX) and LaTeX It displays how derivative values propagate according to the chain rule in...
Image File history File links Size of this preview: 800 Ã 466 pixelsFull resolution (1319 Ã 769 pixel, file size: 69 KB, MIME type: image/png) I made this myself using the graphics programming language TikZ, Beamer (LaTeX) and LaTeX It displays how derivative values propagate according to the chain rule in...
Reverse accumulation Reverse accumulation traverses the chain rule from left to right, or in the case of the computational graph in Figure 3, from top to bottom. The example function is real-valued, and thus there is only one seed for the derivative computation, and only one sweep of the computational graph is needed in order to calculate the (two-component) gradient. This is only half the work when compared to forward accumulation, but reverse accumulation requires the storage of some of the work variables wi, which may represent a significant memory issue. The data flow graph of a computation can be manipulated to calculate the gradient of its original calculation. This is done by adding an adjoint node for each primal node, connected by adjoint edges which parallel the primal edges but flow in the opposite direction. The nodes in the adjoint graph represent multiplication by the derivatives of the functions calculated by the nodes in the primal. For instance, addition in the primal causes fanout in the adjoint; fanout in the primal causes addition in the adjoint; a unary function y = f(x) in the primal causes x' = f'(x)y' in the adjoint; etc. Reverse accumulation is superior to forward accumulation for functions with . Backpropagation of errors in multilayer perceptrons, a technique used in machine learning, is a special case of reverse mode AD. Backpropagation is a supervised learning technique used for training artificial neural networks. ...
Jacobian computation The Jacobian J of is an matrix. The Jacobian can be computed using n sweeps of forward accumulation, of which each sweep can yield a column vector of the Jacobian, or with m sweeps of reverse accumulation, of which each sweep can yield a row vector of the Jacobian. In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. ...
Beyond forward and reverse accumulation Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule. In order to compute a full Jacobian of , there is a path somewhere in between which is optimal, however, locating that path is probably an NP-hard problem. In computational complexity theory, NP-hard (Non-deterministic Polynomial-time hard) refers to the class of decision problems that contains all problems H such that for all decision problems L in NP there is a polynomial-time many-one reduction to H. Informally this class can be described as containing...
Implementation Forward-mode AD is implemented by a Nonstandard interpretation of the program in which real numbers are replaced by dual numbers, constants are lifted to dual numbers with a zero epsilon coefficient, and the numeric primitives are lifted to operate on dual numbers. This nonstandard interpretation is generally implemented using one of two strategies: source code transformation or operator overloading.
Source code transformation
Figure 4: Example of how source code transformation could work The source code for a function is replaced by an automatically generated source code that includes statements for calculating the derivatives interleaved with the original instructions. Image File history File links Size of this preview: 800 Ã 138 pixelsFull resolution (2488 Ã 430 pixel, file size: 56 KB, MIME type: image/png) I made this using TikZ, Beamer and LaTeX The figure displays how source code transformation in automatic differentiation could work. ...
Image File history File links Size of this preview: 800 Ã 138 pixelsFull resolution (2488 Ã 430 pixel, file size: 56 KB, MIME type: image/png) I made this using TikZ, Beamer and LaTeX The figure displays how source code transformation in automatic differentiation could work. ...
Source code transformation can be implemented for all programming languages, and it is also easier for the compiler to do compile time optimizations. However, the implementation of the AD tool itself is more difficult. Examples: - ADIC (C/C++, forward mode)
- ADIFOR (Fortran77)
- OpenAD (Fortran77, Fortran95, C/C++)
- TAPENADE (Fortran77, Fortran95)
Operator overloading
Figure 5: Example of how operator overloading could work Operator overloading is a possibility for source code written in a language supporting it. Objects for real numbers and elementary mathematical operations must be overloaded to cater for the augmented arithemtic depicted above. This requires no change in the original source code for the function to be differentiated. Image File history File links Size of this preview: 800 Ã 199 pixelsFull resolution (2498 Ã 620 pixel, file size: 53 KB, MIME type: image/png) I made this using TikZ, Beamer and LaTeX The figure displays how operator overloading in automatic differentiation could work. ...
Image File history File links Size of this preview: 800 Ã 199 pixelsFull resolution (2498 Ã 620 pixel, file size: 53 KB, MIME type: image/png) I made this using TikZ, Beamer and LaTeX The figure displays how operator overloading in automatic differentiation could work. ...
In computer programming, operator overloading (less commonly known as operator ad-hoc polymorphism) is a specific case of polymorphism in which some or all of operators like +, = or == have different implementations depending on the types of their arguments. ...
Operator overloading for forward accumulation is easy to implement, and also possible for backward accumulation. However, current compilers lag behind in optimizing the code when compared to forward accumulation. Example: References - Andreas Griewank (2000), Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM, ISBN 0-89871-451-6
External links |