FACTOID # 109: What is in a name? More than 90% of people in Bhutan, Burundi and Burkina Faso are involved in agriculture.
 
 Home   Encyclopedia   Statistics   Countries A-Z   Flags   Maps   Education   Forum   FAQ   About 
 
WHAT'S NEW
RECENT ARTICLES
More Recent Articles »
 

FACTS & STATISTICS    Simple view

  1. Select countries to view: (hold down Control key and click to select several)

     

     

    Compare:

     

     

  1. Select fact or statistic: (* = graphable)

     

     

     

  2. (OPTIONAL) Compare to statistic: (both need to be graphable)

     

     

     

  3. View result as:

     

       
(OR) SEARCH ALL encyclopedia, stats & forums:   

Encyclopedia > Automatic differentiation

In mathematics and computer algebra, automatic differentiation, or AD, sometimes alternatively called algorithmic differentiation, is a method to numerically evaluate the derivative of a function specified by a computer program. Two classical ways of doing differentiation are: Euclid, Greek mathematician, 3rd century BC, as imagined by by Raphael in this detail from The School of Athens. ... A computer algebra system (CAS) is a software program that facilitates symbolic mathematics. ... For a non-technical overview of the subject, see Calculus. ...

Figure 1: How automatic differentiation relates to symbolic differentiation
Figure 1: How automatic differentiation relates to symbolic differentiation

The drawback with symbolic differentiation is low speed, and the difficulty of converting a computer program into a single expression. Moreover, many functions grow quite complex as higher derivatives are calculated. Two important drawbacks with finite differences are round-off errors in the discretization process, and cancellation. Both classical methods have problems with calculating higher derivatives, where the complexity and errors increase. Automatic differentiation solves all of the mentioned problems. Image File history File links Size of this preview: 800 × 324 pixelsFull resolution (2244 × 908 pixel, file size: 75 KB, MIME type: image/png) I made this myself using TikZ, Beamer and LaTeX It displays how automatic differentiation relates to symbolic differentiation, the two upper boxes represent a source code... Image File history File links Size of this preview: 800 × 324 pixelsFull resolution (2244 × 908 pixel, file size: 75 KB, MIME type: image/png) I made this myself using TikZ, Beamer and LaTeX It displays how automatic differentiation relates to symbolic differentiation, the two upper boxes represent a source code... Numerical differentiation is a technique of numerical analysis to produce an estimate of the derivative of a mathematical function or function subroutine using values from the function and perhaps other knowledge about the function. ... A round-off error, also called rounding error, is the difference between the calculated approximation of a number and its exact mathematical value. ... Discretization concerns the process of transferring continuous models and equations into discrete counterparts. ...


AD exploits the fact that any computer program that implements a vector function y = F(x) (generally) can be decomposed into a sequence of elementary assignments, any one of which may be trivially differentiated by a simple table lookup. These elemental partial derivatives, evaluated at a particular argument, are combined in accordance with the chain rule from derivative calculus to form some derivative information for F (such as gradients, tangents, the Jacobian matrix, etc.). This process yields exact (to numerical accuracy) derivatives. Because the symbolic transformation occurs only at the most basic level, AD avoids the computational problems inherent in complex symbolic computation. In calculus, the chain rule is a formula for the derivative of the composite of two functions. ... For other uses, see Gradient (disambiguation). ... In mathematics, the word tangent has two distinct but etymologically-related meanings: one in geometry and one in trigonometry. ... In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. ...

Contents

Derivation of augmented arithmetic for Forward-Mode AD using dual numbers

Forward-Mode Automatic Differentiation is accomplished by augmenting the algebra of real numbers and obtaining a new arithmetic. An additional component is added to every number which will represent the derivative of a function at the number, and all arithmetic operators are extended for the augmented algebra. The augmented algebra is the algebra of dual numbers. Algebra is a branch of mathematics concerning the study of structure, relation and quantity. ... Please refer to Real vs. ... Arithmetic tables for children, Lausanne, 1835 Arithmetic or arithmetics (from the Greek word αριθμός = number) is the oldest and most elementary branch of mathematics, used by almost everyone, for tasks ranging from simple daily counting to advanced science and business calculations. ... A variety of dualities in mathematics are listed at duality (mathematics). ...


Replace every number ,x with the number x + x'varepsilon, where x' is a real number, but varepsilon is nothing but a symbol with the property varepsilon^2=0. Using only this, we get for the regular arithmetic

(x + x'varepsilon) + (y + y'varepsilon) = x + y + (x' + y')varepsilon
(x + x'varepsilon) cdot (y + y'varepsilon) = xy + xy'varepsilon + yx'varepsilon + x'y'varepsilon^2 = xy + (x y' + yx')varepsilon

and likewise for subtraction and division.


Now, we may calculate polynomials in this augmented arithmetic. If P(x) = p_0 + p_1 x + p_2x^2 + cdots + p_n x^n, then In mathematics, polynomial functions, or polynomials, are an important class of simple and smooth functions. ...

P(x + x'varepsilon) =, p_0 + p_1(x + x'varepsilon) + cdots + p_n (x + x'varepsilon)^n
=, p_0 + p_1 x + cdots + p_n x^n
, + p_1x'varepsilon + 2p_2xx'varepsilon + cdots + np_n x^{n-1} x'varepsilon
=, P(x) + P'(x)x'varepsilon

x' can be chosen arbitrarily for now, it will be denoted a seed later.


The new arithmetic consists of ordered pairs, elements written langle x, x' rangle, with ordinary arithmetics on the first component, and first order differentiation arithmetic on the second component, as can be derived as above. Extending the above results on polynomials to analytic functions we obtain a list of the basic arithmetic and some standard functions for the new arithmetic: In mathematics, an ordered pair is a collection of two objects such that one can be distinguished as the first element and the other as the second element (the first and second elements are also known as left and right projections). ... In mathematics, an analytic function is one that is locally given by a convergent power series. ...

langle u,u'rangle +langle v,v'rangle = langle u+v, u'+v' rangle
langle u,u'rangle -langle v,v'rangle = langle u-v, u'-v' rangle
langle u,u'rangle *langle v,v'rangle = langle u v, u'v+uv' rangle
langle u,u'rangle /langle v,v'rangle = leftlangle frac{u}{v}, frac{u'v-uv'}{v^2} rightrangle quad ( vne 0)
sinlangle u,u'rangle = langle sin(u) , u' cos(u) rangle
coslangle u,u'rangle = langle cos(u) , -u' sin(u) rangle
explangle u,u'rangle = langle exp u , u' exp u rangle
loglangle u,u'rangle = langle log(u) , u'/u rangle quad (u>0)
langle u,u'rangle^k = langle u^k , k u^{k-1} u' rangle quad (u ne 0)
left| langle u,u'rangle right| = langle left| u right| , u' mbox{sign} u rangle quad (u ne 0)

and in general for the primitive function g,

g(langle u,u' rangle , langle v,v' rangle ) = langle g(u,v) , g^{(1)}(u,v) u' + g^{(2)}(u,v) v' rangle

where g(1) and g(2) are the derivatives of g with respect to its first and second arguments, respectively.


Also, when a binary basic arithmetic operation is applied to mixed arguments, say the ordered pair langle u,u' rangle and the real number c, the real number is first lifted to langle c,0 rangle. The derivative of a function f : mathbb{R}rightarrowmathbb{R} at the point x0 is now found by calculating f(langle x_0, 1 rangle) using the above arithmetic, which gives langle f ( x_0 ) , f' ( x_0 ) rangle as the result.


Vector arguments and functions

Multivariate functions can be handled with the same efficiency and mechanisms as univariate functions by adopting a directional derivative operator, which finds the directional derivative y' in mathbb{R}^m of f:mathbb{R}^nrightarrowmathbb{R}^m at x in mathbb{R}^n in the direction x' in mathbb{R}^n by calculating (langle y_1,y'_1rangle, ldots, langle y_m,y'_mrangle) = f(langle x_1,x'_1rangle, ldots, langle x_n,x'_nrangle) using the same arithmetic as above.


Higher order differentials

The above arithmetic can be generalized, in the natural way, to second order and higher derivatives. However, the arithmetic rules quickly grow very complicated, complexity will be quadratic in the highest derivative degree. Instead, truncated Taylor series arithmetic is used. This is possible because the Taylor summands in a Taylor series of a function are products of known coefficients and derivatives of the function. Computations of Hessians using AD has proven useful in some optimization contexts. As the degree of the Taylor series rises, it approaches the correct function. ... In mathematics, the Hessian matrix is the square matrix of second partial derivatives of a scalar-valued function. ...


The chain rule, forward and reverse accumulation

Fundamental to AD is the decomposition of differentials provided by the chain rule. For the simple composition f(x) = g(h(x)) the chain rule gives In calculus, the chain rule is a formula for the derivative of the composite of two functions. ...

frac{df}{dx} = frac{dg}{dh} frac{dh}{dx}

Usually, two distinct modes of AD are presented, forward accumulation (or forward mode) and reverse accumulation (or reverse mode, the use of "backward" is discouraged). Forward accumulation specifies that one traverses the chain rule from right to left, that is first one computes dh / dx and then dg / dh, and reverse accumulation from left to right.

Figure 2: Example of forward accumulation with computational graph
Figure 2: Example of forward accumulation with computational graph

Image File history File links Size of this preview: 800 × 409 pixelsFull resolution (1346 × 688 pixel, file size: 56 KB, MIME type: image/png) I made this myself using the graphics programming language TikZ, Beamer (LaTeX) and LaTeX. It displays how derivative values propagate according to the chain rule in... Image File history File links Size of this preview: 800 × 409 pixelsFull resolution (1346 × 688 pixel, file size: 56 KB, MIME type: image/png) I made this myself using the graphics programming language TikZ, Beamer (LaTeX) and LaTeX. It displays how derivative values propagate according to the chain rule in...

Forward accumulation

Forward accumulation automatic differentiation is easiest to understand and to implement. The function f(x1,x2) = x1x2 + sin(x1) is by a computer (or the human programmer) interpreted as the sequence of elementary operations on the work variables wi, and an AD tool for forward accumulation adds the corresponding operations on the second component of the augmented arithmetic.

Original code statements Added statements for derivatives
w1 = x1 w'1 = 1 (seed)
w2 = x2 w'2 = 0 (seed)
w3 = w1w2 w'3 = w'1w2 + w1w'2 = 1x2 + x10 = x2
w4 = sin(w1) w'4 = cos(w1)w'1 = cos(x1)1
w5 = w3 + w4 w'5 = w'3 + w'4 = x2 + cos(x1)

The derivative computation for f(x1,x2) = x1x2 + sin(x1) needs to be seeded in order to distinguish between the derivative with respect to x1 or x2. The table above seeds the computation with w'1 = 1 and w'2 = 0 and we see that this results in x2 + cos(x1) which is the derivative with respect to x1. Note that although the table displays the symbolic derivative, in the computer it is always the evaluated (numeric) value that is stored. Figure 2 represents the above statements in a computational graph.


In order to compute the gradient of this example function, that is partial f/partial x_1 and partial f / partial x_2, two sweeps over the computational graph is needed, first with the seeds w'1 = 1 and w'2 = 0, then with w'1 = 0 and w'2 = 1.


The computational complexity of one sweep of forward accumulation is proportional to the complexity of the original code. As a branch of the theory of computation in computer science, computational complexity theory describes the scalability of algorithms, and the inherent difficulty in providing scalable algorithms for specific computational problems. ...


Forward accumulation is superior to reverse accumulation for functions f:mathbb{R} rightarrow mathbb{R}^m with m gg 1 as only one sweep is necessary, compared to m sweeps for reverse accumulation.

Figure 3: Example of reverse accumulation with computational graph
Figure 3: Example of reverse accumulation with computational graph

Image File history File links Size of this preview: 800 × 466 pixelsFull resolution (1319 × 769 pixel, file size: 69 KB, MIME type: image/png) I made this myself using the graphics programming language TikZ, Beamer (LaTeX) and LaTeX It displays how derivative values propagate according to the chain rule in... Image File history File links Size of this preview: 800 × 466 pixelsFull resolution (1319 × 769 pixel, file size: 69 KB, MIME type: image/png) I made this myself using the graphics programming language TikZ, Beamer (LaTeX) and LaTeX It displays how derivative values propagate according to the chain rule in...

Reverse accumulation

Reverse accumulation traverses the chain rule from left to right, or in the case of the computational graph in Figure 3, from top to bottom. The example function is real-valued, and thus there is only one seed for the derivative computation, and only one sweep of the computational graph is needed in order to calculate the (two-component) gradient. This is only half the work when compared to forward accumulation, but reverse accumulation requires the storage of some of the work variables wi, which may represent a significant memory issue.


The data flow graph of a computation can be manipulated to calculate the gradient of its original calculation. This is done by adding an adjoint node for each primal node, connected by adjoint edges which parallel the primal edges but flow in the opposite direction. The nodes in the adjoint graph represent multiplication by the derivatives of the functions calculated by the nodes in the primal. For instance, addition in the primal causes fanout in the adjoint; fanout in the primal causes addition in the adjoint; a unary function y = f(x) in the primal causes x' = f'(x)y' in the adjoint; etc.


Reverse accumulation is superior to forward accumulation for functions f:mathbb{R}^n rightarrow mathbb{R} with n gg 1.


Backpropagation of errors in multilayer perceptrons, a technique used in machine learning, is a special case of reverse mode AD. Backpropagation is a supervised learning technique used for training artificial neural networks. ...


Jacobian computation

The Jacobian J of f:mathbb{R}^n rightarrow mathbb{R}^m is an m times n matrix. The Jacobian can be computed using n sweeps of forward accumulation, of which each sweep can yield a column vector of the Jacobian, or with m sweeps of reverse accumulation, of which each sweep can yield a row vector of the Jacobian. In vector calculus, the Jacobian is shorthand for either the Jacobian matrix or its determinant, the Jacobian determinant. ...


Beyond forward and reverse accumulation

Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule. In order to compute a full Jacobian of f:mathbb{R}^n rightarrow mathbb{R}^m, there is a path somewhere in between which is optimal, however, locating that path is probably an NP-hard problem. In computational complexity theory, NP-hard (Non-deterministic Polynomial-time hard) refers to the class of decision problems that contains all problems H such that for all decision problems L in NP there is a polynomial-time many-one reduction to H. Informally this class can be described as containing...


Implementation

Forward-mode AD is implemented by a Nonstandard interpretation of the program in which real numbers are replaced by dual numbers, constants are lifted to dual numbers with a zero epsilon coefficient, and the numeric primitives are lifted to operate on dual numbers. This nonstandard interpretation is generally implemented using one of two strategies: source code transformation or operator overloading.


Source code transformation

Figure 4: Example of how source code transformation could work
Figure 4: Example of how source code transformation could work

The source code for a function is replaced by an automatically generated source code that includes statements for calculating the derivatives interleaved with the original instructions. Image File history File links Size of this preview: 800 × 138 pixelsFull resolution (2488 × 430 pixel, file size: 56 KB, MIME type: image/png) I made this using TikZ, Beamer and LaTeX The figure displays how source code transformation in automatic differentiation could work. ... Image File history File links Size of this preview: 800 × 138 pixelsFull resolution (2488 × 430 pixel, file size: 56 KB, MIME type: image/png) I made this using TikZ, Beamer and LaTeX The figure displays how source code transformation in automatic differentiation could work. ...


Source code transformation can be implemented for all programming languages, and it is also easier for the compiler to do compile time optimizations. However, the implementation of the AD tool itself is more difficult.


Examples:

  • ADIC (C/C++, forward mode)
  • ADIFOR (Fortran77)
  • OpenAD (Fortran77, Fortran95, C/C++)
  • TAPENADE (Fortran77, Fortran95)

Operator overloading

Figure 5: Example of how operator overloading could work

Operator overloading is a possibility for source code written in a language supporting it. Objects for real numbers and elementary mathematical operations must be overloaded to cater for the augmented arithemtic depicted above. This requires no change in the original source code for the function to be differentiated. Image File history File links Size of this preview: 800 × 199 pixelsFull resolution (2498 × 620 pixel, file size: 53 KB, MIME type: image/png) I made this using TikZ, Beamer and LaTeX The figure displays how operator overloading in automatic differentiation could work. ... Image File history File links Size of this preview: 800 × 199 pixelsFull resolution (2498 × 620 pixel, file size: 53 KB, MIME type: image/png) I made this using TikZ, Beamer and LaTeX The figure displays how operator overloading in automatic differentiation could work. ... In computer programming, operator overloading (less commonly known as operator ad-hoc polymorphism) is a specific case of polymorphism in which some or all of operators like +, = or == have different implementations depending on the types of their arguments. ...


Operator overloading for forward accumulation is easy to implement, and also possible for backward accumulation. However, current compilers lag behind in optimizing the code when compared to forward accumulation.


Example:

References

  • Andreas Griewank (2000), Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM, ISBN 0-89871-451-6

External links


  Results from FactBites:
 
Derivative - Wikipedia, the free encyclopedia (2137 words)
A function is differentiable at a point x if its derivative exists at that point; a function is differentiable on an interval if it is differentiable at every x within the interval.
Perhaps the most natural situation is that of functions between differentiable manifolds; the derivative at a certain point then becomes a linear transformation between the corresponding tangent spaces and the derivative function becomes a map between the tangent bundles.
For complex functions of a complex variable differentiability is a much stronger condition than that the real and imaginary part of the function are differentiable with respect to the real and imaginary part of the argument.
Automatic Differentiation (436 words)
The automatic differentiation in contrast with symbolic differentiation propagates numerical values of the derivatives rather symbolic expressions.
Differentiation of a composite function is performed in the forward mode: first differential tuples for independent variables are generated, then they are propagated through the computational graph of the function.
The automatic differentiation algorithms are implemented through overloaded elementary functions (exp, pow, sin, cos, etc.), arithmetic operations(+, -, *, /), various utility functions including differentiation operators, and a function to generate differential tuples corresponding to independent variables [6].
  More results at FactBites »


 

COMMENTARY     


Share your thoughts, questions and commentary here
Your name
Your comments
Please enter the 5-letter protection code

Want to know more?
Search encyclopedia, statistics and forums:

 


Lesson Plans | Student Area | Student FAQ | Reviews | Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms.