FACTOID # 123: The top ten countries for tourist destinations account for 49.6 percent of all tourist arrivals worldwide.
 
 Home   Encyclopedia   Statistics   Countries A-Z   Flags   Maps   Education   Forum   FAQ   About 
 
WHAT'S NEW
RECENT ARTICLES
More Recent Articles »
 

Encyclopedia > Jensen's inequality

In mathematics, Jensen's inequality, named after the Danish mathematician Johan Jensen, relates the value of a convex function of an integral to the integral of the convex function. It was proved by Jensen in 1906[1]. Given its generality, the inequality appears in many forms depending on the context, some of which are presented below. Euclid, Greek mathematician, 3rd century BC, known today as the father of geometry; shown here in a detail of The School of Athens by Raphael. ... Johan Ludwig William Valdemar Jensen, mostly known as Johan Jensen, (May 8, 1859—March 5, 1925) was a Danish mathematician and engineer. ... In mathematics, convex function is a real-valued function f defined on an interval (or on any convex subset C of some vector space), if for any two points x and y in its domain C and any t in [0,1], we have Convex function on an interval. ... In calculus, the integral of a function is an extension of the concept of a sum. ... 1906 (MCMVI) was a common year starting on Monday (see link for calendar). ...

Contents

Statements

The classical form of Jensen's inequality involves several numbers and weights. The inequality can be stated quite generally using measure theory, and can be further generalized to its full strength in a probabilistic setting. In mathematics, a measure is a function that assigns a number, e. ...


Finite form

For a real convex function φ, numbers xi in its domain, and positive weights ai, Jensen's inequality can be stated as:

varphileft(frac{sum a_{i} x_{i}}{sum a_{i}}right) le frac{sum a_{i} varphi (x_{i})}{sum a_{i}};

and the inequality is clearly reversed if φ is concave. In calculus, a differentiable function f is convex on an interval if its derivative function f ′ is increasing on that interval: a convex function has an increasing slope. ...


As a particular case, if the weights ai are all equal to unity, then

varphileft(frac{sum x_{i}}{n}right) le frac{sum varphi (x_{i})}{n}.

For instance, the log(x) function is concave, so substituting varphi(x)=-log(x) in the previous formula, this establishes the (logarithm of) the familiar arithmetic mean-geometric mean inequality: In mathematics, the inequality of arithmetic and geometric means, or more briefly the AM-GM inequality, states that the arithmetic mean of a list of non-negative real numbers is greater than or equal to the geometric mean of the same list; and further, that the two means are equal...

frac{x_1 + x_2 + cdots + x_n}{n} ge sqrt[n]{x_1 x_2 cdots x_n}.

The variable x may, if required, be a function of another variable (or set of variables) t, so that xi = g(ti). All of this carries directly over to the general continuous case: the weights ai are replaced by a non-negative integrable function f(x), such as a probability distribution, for example; and the summations replaced by integrals.


In measure-theoretic notation

Let (Ω,A,μ) be a measure space, such that μ(Ω) = 1. If g is a real-valued function that is μ-integrable, and if φ is a measurable convex function on the real axis, then: In mathematics, a measure is a function that assigns a number, e. ... In mathematics, the real numbers may be described informally in several different ways. ... The integral can be interpreted as the area under a curve. ... In mathematics, measurable functions are well-behaved functions between measurable spaces. ... In mathematics, convex function is a real-valued function f defined on an interval (or on any convex subset C of some vector space), if for any two points x and y in its domain C and any t in [0,1], we have Convex function on an interval. ...

varphileft(int_{Omega} g, dmuright) le int_Omega varphi circ g, dmu.

In probability-theory notation (real space)

The same result can be stated in a probability theory setting. Let (Omega, mathfrak{F},mathbb{P}) be a probability space, X an integrable real-valued random variable and φ a measurable convex function. Then: Probability theory is the mathematical study of phenomena characterized by randomness or uncertainty. ... In mathematics, a probability space or probability measure is a set S, together with a σ-algebra X on S and a measure P on that σ-algebra such that P(S) = 1. ... In mathematics, the term integrable function refers to a function whose integral may be calculated. ... A random variable is a mathematical function that maps outcomes of random experiments to numbers. ...

varphileft(mathbb{E}{X}right) leq mathbb{E}{varphi(X)}.

In this probability setting, the measure μ is intended as a probability mathbb{P}, the integral with respect to μ as an expected value mathbb{E}, and the function g as a random variable X. In probability theory the expected value (or mathematical expectation) of a random variable is the sum of the probability of each possible outcome of the experiment multiplied by its payoff (value). Thus, it represents the average amount one expects as the outcome of the random trial when identical odds are... A random variable is a mathematical function that maps outcomes of random experiments to numbers. ...


In probability-theory notation (general)

More generally, let T be a real topological vector space, and X a T-valued integrable random variable. In this general setting, integrable means that for any element z in the dual space of T: mathbb{E}|langle z, X rangle|<infty, and there exists an element mathbb{E}{X} in T, such that langle z, mathbb{E}{X}rangle=mathbb{E}{langle z, X rangle}. Then, for any measurable convex function φ and any sub-σ-algebra mathfrak{G} of mathfrak{F}: In mathematics a topological vector space is one of the basic structures investigated in functional analysis. ... In mathematics, the term integrable function refers to a function whose integral may be calculated. ... In mathematics, the existence of a dual vector space reflects in an abstract way the relationship between row vectors (1×n) and column vectors (n×1). ... In mathematics, a σ-algebra (pronounced sigma-algebra) or σ-field over a set X is a collection Σ of subsets of X that is closed under countable set operations; σ-algebras are mainly used in order to define measures on X. The concept is important in mathematical analysis and probability theory. ...

varphileft(mathbb{E}{X|mathfrak{G}}right) leq mathbb{E}{varphi(X)|mathfrak{G}}.

Here mathbb{E}{cdot|mathfrak{G} } stands for the expectation conditioned to the σ-algebra mathfrak{G}. This general statement reduces to the previous ones when the topological vector space T is the real axis, and mathfrak{G} is the trivial σ-algebra {emptyset, Omega}. In probability theory, a conditional expectation is the expected value of a real random variable with respect to a conditional probability distribution. ... ...


Proofs

A graphical "proof" of Jensen's inequality for the probabilistic case. The dashed curve along the X axis is the hypothetical distribution of X, while the dashed curve along the Y axis is the corresponding distribution of Y values. Note that the convex mapping Y(X) increasingly "stretches" the distribution for increasing values of X.
A graphical "proof" of Jensen's inequality for the probabilistic case. The dashed curve along the X axis is the hypothetical distribution of X, while the dashed curve along the Y axis is the corresponding distribution of Y values. Note that the convex mapping Y(X) increasingly "stretches" the distribution for increasing values of X.

A proof of Jensen's inequality can be provided in several ways, and three different proofs corresponding to the different statements above will be offered. Before embarking on these mathematical derivations, however, it is worth analyzing an intuitive graphical argument based on the probabilistic case where X is a real number (see figure). Assuming a hypothetical distribution of X values, one can immediately identify the position of mathbb{E}{X} and its image varphi(mathbb{E}{X}) in the graph. Noticing that for convex mappings Y=varphi(X) the corresponding distribution of Y values is increasingly "stretched out" for increasing values of X, it is easy to see that the distribution of Y is broader than that of X in the interval corresponding to X > X0 and narrower in X < X0 for any X0; in particular, this is also true for X_0 = mathbb{E}{ X }. Consequently, in this picture the expectation of Y will always shift upwards with respect to the position of varphi(mathbb{E}{ X } ), and this "proves" the inequality, i.e. Image File history File links Jensen_graph. ... Image File history File links Jensen_graph. ...

mathbb{E}{ Y(X) } geq Y(mathbb{E}{ X } ),

the equality taking place when varphi(X) is not strictly convex, e.g. when it is a straight line.


The proofs below formalize this intuitive notion.


Proof 1 (using the finite form)

If lambda_1,,lambda_2 are two arbitrary positive real numbers such that λ1 + λ2 = 1, then convexity of varphi implies varphi(lambda_1 x_1+lambda_2 x_2)leq lambda_1,varphi(x_1)+lambda_2,varphi(x_2) for any x_1,,x_2. This can be easily generalized: if lambda_1,,lambda_2,ldots,lambda_n are n positive real numbers such that lambda_1+lambda_2+ldots+lambda_n=1, then

varphi(lambda_1 x_1+lambda_2 x_2+ldotslambda_n x_n)leq lambda_1,varphi(x_1)+lambda_2,varphi(x_2)+ldots+lambda_n,varphi(x_n),

for any x_1,,x_2,ldots,,x_n. This finite form of the Jensen's inequality can be proved by induction: by convexity hypotheses, the statement is true for n = 2. Suppose it is true also for some n, one needs to prove it for n+1. At least one of the λi is strictly positive, say λ1; therefore by convexity inequality: Mathematical induction is a method of mathematical proof typically used to establish that a given statement is true of all natural numbers. ...

varphileft(sum_{i=1}^{n+1}lambda_i x_iright)= varphileft(lambda_1 x_1+(1-lambda_1)sum_{i=2}^{n+1} frac{lambda_i}{1-lambda_1} x_iright)leq lambda_1,varphi(x_1)+(1-lambda_1) varphileft(sum_{i=2}^{n+1}left( frac{lambda_i}{1-lambda_1} x_iright)right).

Since sum_{i=2}^{n+1} frac{lambda_i}{1-lambda_1} =1, one can apply the induction hypotheses to the last term in the previous formula to obtain the result, namely the finite form of the Jensen's inequality.


In order to obtain the general inequality from this finite form, one needs to use a density argument. The finite form can be re-written as:

varphileft(int x,dmu_n(x) right)leq int varphi(x),dmu_n(x),

where μn is a measure given by an arbitrary convex combination of Dirac deltas: A convex combination is a linear combination of data points (which can be vectors or scalars) where all coefficients are positive and sum up to 1. ... The Dirac delta function, introduced by Paul Dirac, can be informally thought of as a function &#948;(x) that has the value of infinity for x = 0, the value zero elsewhere, and a total integral of one. ...

mu_n=sum_{i=1}^n lambda_i delta_{x_i}.

Since convex functions are continuous, and since convex combinations of Dirac deltas are weakly dense in the set of probability measures (as could be easily verified), the general statement is obtained simply by a limiting procedure. In mathematics, a continuous function is a function for which, intuitively, small changes in the input result in small changes in the output. ... In mathematics, weak topology is an alternative term for initial topology. ... In topology and related areas of mathematics, a subset A of a topological space X is called dense (in X) if, intuitively, any point in X can be well-approximated by points in A. Formally, A is dense in X if for any point x in X, any neighborhood of...


Proof 2 (measure theoretic notation)

Let g be a real-valued μ-integrable function on a measure space Ω, and let φ be a convex function on the real numbers. Define the right-handed derivative of φ at x as

varphi^prime(x):=lim_{tto0^+}frac{varphi(x+t)-varphi(x)}{t}.

Since φ is convex, the quotient of the right-hand side is decreasing when t approaches 0 from the right, and bounded below by any term of the form

frac{varphi(x+t)-varphi(x)}{t}

where t < 0, and therefore, the limit does always exist.


Now, let us define the following:

x_0:=int_Omega g, dmu,
a:=varphi^prime(x_0),
b:=varphi(x_0)-x_0varphi^prime(x_0).

Then for all x, ax+bleqvarphi(x). To see that, take x>x0, and define t = x − x0 > 0. Then,

varphi^prime(x_0)leqfrac{varphi(x_0+t)-varphi(x_0)}{t}.

Therefore,

varphi^prime(x_0)(x-x_0)+varphi(x_0)leqvarphi(x)

as desired. The case for x < x0 is proven similarly, and clearly ax_0+b=varphi(x_0).


φ(x0) can then be rewritten as

ax_0+b=aleft(int_Omega g,dmuright)+b.

But since μ(Ω) = 1, then for every real number k we have

int_Omega k,dmu=k.

In particular,

aleft(int_Omega g,dmuright)+b=int_Omega(ag+b),dmuleqint_Omegavarphicirc g,dmu.

Proof 3 (general inequality in probabilistic notation)

Let X be an integrable random variable that takes value in a real topological vector space T. Since varphi:T mapsto mathbb{R} is convex, for any x,y in T, the quantity

frac{varphi(x+theta,y)-varphi(x)}{theta},

is decreasing as θ approaches 0. In particular, it is well defined the subdifferential of varphi evaluated at x in the direction y, defined by:

(Dvarphi)(x)cdot y:=lim_{theta to 0} frac{varphi(x+theta,y)-varphi(x)}{theta}=inf_{theta neq 0} frac{varphi(x+theta,y)-varphi(x)}{theta}.

It is easily seen that the subdifferential is linear in y and, since the infimum taken in the right-hand side of the previous formula is smaller than the value of the same term for θ = 1, one gets:

varphi(x)leq varphi(x+y)-(Dvarphi)(x)cdot y.

In particular, for an arbitrary sub-σ-algebra mathfrak{G} we can evaluate the last inequality when x=mathbb{E}{X|mathfrak{G}},,y=X-mathbb{E}{X|mathfrak{G}} to obtain:

varphi(mathbb{E}{X|mathfrak{G}})leq varphi(X)-(Dvarphi)(mathbb{E}{X|mathfrak{G}})cdot (X-mathbb{E}{X|mathfrak{G}}).

Now, if we take the expectation conditioned to mathfrak{G} on both sides of the previous expression, we get the result since:

mathbb{E}{left[(Dvarphi)(mathbb{E}{X|mathfrak{G}})cdot (X-mathbb{E}{X|mathfrak{G}})right]|mathfrak{G}}=(Dvarphi)(mathbb{E}{X|mathfrak{G}})cdot mathbb{E}{ left( X-mathbb{E}{X|mathfrak{G}} right) |mathfrak{G}}=0,

by the linearity of the subdifferential in the y variable, and well-known properties of the conditional expectation. In probability theory, a conditional expectation is the expected value of a real random variable with respect to a conditional probability distribution. ...


Applications and special cases

Form involving a probability density function

Suppose Ω is a measurable subset of the real line and f(x) is a non-negative function such that

int_{-infty}^infty f(x),dx = 1.

In probabilistic language, f is a probability density function. In mathematics, a probability density function (pdf) serves to represent a probability distribution in terms of integrals. ...


Then Jensen's inequality becomes the following statement about convex integrals:


If g is any real-valued measurable function and φ is convex over the range of g, then

varphileft(int_{-infty}^infty g(x)f(x), dxright) le int_{-infty}^infty varphi(g(x)) f(x), dx.

If g(x) = x, then this form of the inequality reduces to a commonly used special case:

varphileft(int_{-infty}^infty x, f(x), dxright) le int_{-infty}^infty varphi(x),f(x), dx.

Alternative finite form

If Ω is some finite set {x_1,x_2,ldots,x_n}, and if μ is a counting measure on Ω, then the general form reduces to a statement about sums: In mathematics, the counting measure is an intuitive way to put a measure on any set: the size of a subset is taken to be the number of the subsets elements if this is finite, and ∞ if the subset is infinite. ...

varphileft(sum_{i=1}^{n} g(x_i)lambda_i right) le sum_{i=1}^{n} varphi(g(x_i))lambda_i,

provided that lambda_1 + lambda_2 + cdots + lambda_n = 1, lambda_i ge 0.


There is also an infinite discrete form.


Statistical physics

Jensen's equality is of particular importance in statistical physics when the convex function is an exponential, giving:

e^{langle X rangle} leq leftlangle e^X rightrangle,

where angle brackets denote expected values with respect to some probability distribution in the random variable X. In probability theory the expected value (or mathematical expectation) of a random variable is the sum of the probability of each possible outcome of the experiment multiplied by its payoff (value). Thus, it represents the average amount one expects as the outcome of the random trial when identical odds are... In mathematics and statistics, a probability distribution, more properly called a probability density, assigns to every interval of the real numbers a probability, so that the probability axioms are satisfied. ... A random variable is a mathematical function that maps outcomes of random experiments to numbers. ...


The proof in this case is very simple (cf. Chandler, Sec. 5.5). The desired inequality follows directly, by writing

leftlangle e^X rightrangle = e^{langle X rangle} leftlangle e^{X - langle X rangle} rightrangle

and then applying the inequality

e^X geq 1+X ,

to the final exponential.


Information theory

If p(x) is the true probability distribution for x, and q(x) is another distribution, then applying Jensen's inequality for the random variable Y(x) = q(x)/p(x) and the function φ(y) = −log(y) gives

Bbb{E}{varphi(Y)} ge varphi(Bbb{E}{Y})
Rightarrow int p(x) log frac{p(x)}{q(x)}dx ge - log int p(x) frac{q(x)}{p(x)}dx
Rightarrow int p(x) log frac{p(x)}{q(x)}dx ge 0
Rightarrow - int p(x) log q(x) ge - int p(x) log p(x),

a result called Gibbs' inequality. In information theory, Gibbs inequality is a statement about the mathematical entropy of a discrete probability distribution. ...


It shows that the average message length is minimised when codes are assigned on the basis of the true probabilities p rather than any other distribution q. The quantity that is greater than zero is called the Kullback-Leibler distance of q from p. In probability theory and information theory, the Kullback-Leibler divergence, or relative entropy, is a quantity which measures the difference between two probability distributions. ...


Rao-Blackwell theorem

Main article: Rao-Blackwell theorem

If L is a convex function, then from Jensen's inequality we get In statistics, the Rao-Blackwell theorem describes a technique that can transform an absurdly crude estimator into an estimator that is optimal by the mean-squared-error criterion or any of a variety of similar criteria. ...

L(Bbb{E}{delta(X)}) le Bbb{E}{L(delta(X))}
Rightarrow Bbb{E}{L(Bbb{E}{delta(X)})} le Bbb{E}{L(delta(X))}.

So if δ(X) is some estimator of an unobserved parameter θ given a vector of observables X; and if T(X) is a sufficient statistic for θ; then an improved estimator, in the sense of having a smaller expected loss L, can be obtained by calculating In statistics, an estimator is a function of the observable sample data that is used to estimate an unknown population parameter; an estimate is the result from the actual application of the function to a particular set of data. ... In statistics, one often considers a family of probability distributions for a random variable X (and X is often a vector whose components are scalar-valued random variables, frequently independent) parameterized by a scalar- or vector-valued parameter, which let us call &#952;. A quantity T(X) that depends on...

delta_{1}(X) = Bbb{E}_{theta}{delta(X') ,|, T(X')= T(X)},

the expectated value of δ with respect to θ, taken over all possible vectors of observations X compatible with the same value of T(X) as that observed.


This result is known as the Rao-Blackwell theorem. In statistics, the Rao-Blackwell theorem describes a technique that can transform an absurdly crude estimator into an estimator that is optimal by the mean-squared-error criterion or any of a variety of similar criteria. ...


See also

There are very few or no other articles that link to this one. ...

References

Johan Ludwig William Valdemar Jensen, mostly known as Johan Jensen, (May 8, 1859&#8212;March 5, 1925) was a Danish mathematician and engineer. ... Acta Mathematica is a journal publishing original research papers in all fields of mathematics. ...

External links

Footnotes

  1. ^ Jensen, J. Sur les fonctions convexes et les inégalités entre les valeurs moyennes.

 

COMMENTARY     


Share your thoughts, questions and commentary here
Your name
Your location
Your comments
Please enter the 5-letter protection code


Lesson Plans | Student Area | Student FAQ | Reviews | Press Releases |  Feeds | Contact
The Wikipedia article included on this page is licensed under the GFDL.
Images may be subject to relevant owners' copyright.
All other elements are (c) copyright NationMaster.com 2003-5. All Rights Reserved.
Usage implies agreement with terms.