Jensen’s Inequality


Jensen’s Inequality is one of those lovely maths tricks that comes up in the most random of places and is just useful to have in your back pocket.  In particular it’s great for proving a number of other useful inequalities by choosing an appropriate function to use within the theorem and doing some rearranging.

The goal in this post is to prove the weighted version of Jensen’s Inequality and then give an example usage of it.  We start by talking about convex functions and then use convex functions when talking about Jensen’s Inequality.

Convex Functions

Definition (Convex Function)

The function f(x) such that f: I \mapsto \mathbb{R} is convex on I = [a,b] \subseteq \mathbb{R} if for every x_1, x_2 \in I and 0 \leq \lambda \leq 1 we have 

    \begin{displaymath} f\left\{ \lambda x_1 + (1 - \lambda) x_2 \right\} \leq \lambda f(x_1) + (1-\lambda) f(x_2) \end{displaymath}

At a high level a function is convex on an interval if any line segment between two points within that interval lie above the function.  The diagram below shows this quite nicely.


The definition is all good and well … but how do we tell if a given function is actually convex?  For certain (common) functions we can use a little bit of calculus to help us out here and we state the below theorem without proof.

Theorem (How to Show Convexity)

Let f(x) be twice differentiable on the interval I \subseteq \mathbb{R}.  Then f(x) is convex on I \Leftrightarrow f''(x) \geq 0

As an example a well known convex function is f(x) = e^x where x \in \mathbb{R}.  The proof that this is convex is very straightforward since f''(x) = e^x \geq 0.  In addition I managed to find a proof which doesn’t rely on derivatives here: How to prove e^x is convex.

Formal Statement of Jensen’s Inequality

Theorem (Jensen)

Let f(x) be a convex function defined on an interval I \subseteq \mathbb{R}.  If x_1, x_2, \ldots, x_N \in I and \lambda_1, \lambda_2, \ldots, \lambda_N \geq 0 are such that \sum_i^N \lambda_i = 1 then

    \begin{displaymath} f \left( \sum_{i = 1}^N \lambda_i x_i \right) \leq \sum_{i=1}^N \lambda_i f(x_i). \end{displaymath}


In addition to the above and to ensure the requirement that \sum_{i=1}^N \lambda_i = 1 we can instead re-write the inequality for Jensen as

Theorem (Jensen – alternative)

Let f(x) be a convex function defined on an interval I \subseteq \mathbb{R}.  If x_1, x_2, \ldots, x_N \in I and \p_1, \p_2, \ldots, \p_N \geq \in \mathbb{R}_{\geq 0} then

    \begin{displaymath} f \left( \frac{ \sum_{i = 1}^N p_i x_i }{\sum_{i=1}^N p_i} \right) \leq \frac{\sum_{i=1}^N p_i f(x_i)}{\sum_{i=1}^N p_i}. \end{displaymath}

Proving the Theorem

We give a short proof by induction.

  • For the case N = 1 the theorem trivially holds;
  • For N = 2 we have f(\lambda_1 x_1 + \lambda_2 x_2) \leq \lambda_1 f(x_1) + \lambda_2 f(x_2) from the definition of a convex function so again the theorem holds;

For the inductive step we assume the theorem holds for k-1 where k \geq 2 and show it holds for k.  At a high level we do this be re-expressing the various summations in terms of k-1 and by scaling the weights \lambda_i to fit this re-scaling.  We use

    \begin{displaymath} f\left( \sum_{i=1}^k x_i \lambda_i \right) = f\left( \sum_{i=1}^{k-1} x_i \lambda_i + x_k \lambda_k \right). \end{displaymath}

Then since \sum_{i=1}^k \lambda_k = 1 \implies \sum_{i=1}^{k-1} \lambda_i = 1 - \lambda_k.  We define \mu_i = \frac{\lambda_i}{1 - \lambda_k}.

The values of \mu_i are our scaled \lambda_i values that go from 1 to k-1.  From this we can write

    \begin{displaymath} \sum_{i=1}^{k-1} \mu_i = \frac{\sum_{i=1}^{k-1} \lambda_i}{1 - \lambda_k} = \frac{1 - \lambda_k}{1 - \lambda_k} = 1. \end{displaymath}

Hence the \mu_i can be used as weighting factors for x_i from 1 to k-1.  Then

(1)   \begin{align*} f\left( \sum_{i=1}^k x_i \lambda_i \right) &= f\left( \left( 1 - \lambda_k \right) \sum_{i=1}^{k-1} x_i \mu_i + x_k \lambda_k \right) \nonumber \\ & =  f\left( y_1 \gamma_1 + y_2 \gamma_2 \right) \nonumber \end{align*}

Where y_1 = \sum_{i=1}^{k-1} x_i \mu_i, y_2 = x_k, \gamma_1 = 1 - \lambda_k and \gamma_2 = \lambda_k.  Since \gamma_1 + \gamma_2 = 1 - \lambda_k + \lambda_k = 1 and we have a convex function we can use case of N = 2 and write

    \begin{displaymath} f\left( \sum_{i=1}^k x_i \lambda_i \right)  \leq (1 - \lambda_k) f\left( \sum_{i=1}^{k-1} x_i \mu_i \right) + \lambda_k f(x_k). \end{displaymath}

Next we use the inductive assumption that our theorem holds for k-1 to get

    \begin{displaymath} f\left( \sum_{i=1}^k x_i \lambda_i \right) & \leq (1 - \lambda_k) \sum_{i = 1}^{k-1} f (x_i) \mu_i + \lambda_k f(x_k). \end{displaymath}

Then by using \frac{\lambda_i}{1 - \lambda_k} = \mu_i we can re-write the RHS of the above inequality as \sum_{i=1}^k \lambda_i f(x_i) which gives us our final result

    \begin{displaymath} f\left( \sum_{i=1}^k x_i \lambda_i \right)  \leq \sum_{i=1}^k \lambda_i f(x_i). \end{displaymath}

And we are done.

Young’s Inequality and the Arithmetic-Geometric Mean (Cauchy)

To arrive at this we start with the Jensen Inequality and use f(x) = - \ln (x) = \ln (1/x).  We also use the properties of logs:

    \begin{displaymath} \sum_{i=1}^N \ln \frac{1}{x_i} & = \ln \left( \prod_{i=1}^N \frac{1}{x_i} \right). \end{displaymath}

From these we can write

    \begin{displaymath} \sum_{i=1}^N \lambda_i \ln \frac{1}{x_i} = \ln \left( \prod_{i=1}^N \frac{1}{x_i^{\lambda_i}} \right) \geq \ln \left( \sum_{i=1}^N \frac{1}{x_i \lambda_i}\right). \end{displaymath}

Dropping the logs in the above from both sides gives Young’s Inequality (the weighted Arithmetic-Geometric Mean):

    \begin{displaymath} \sum_{i=1}^N x_i \lambda_i \geq \prod_{i=1}^N x_i^{\lambda_i}. \end{displaymath}

The special case of \lambda_i = 1/N for all i gives the Arithmetic-Geometric Mean Inequality – also known as the Cauchy Inequality:

    \begin{displaymath} \frac{\sum_{i=1}^N x_i}{N} \geq \sqrt[\leftroot{-2}\uproot{2} N ]{ \prod_{i=1}^N x_i}. \end{displaymath}

Harmonic Means

In the case of the Arithmetic-Geometric Mean Inequality we can replace the x_i with \frac{1}{x_i} to arrive at an inequality involving harmonic means:

    \begin{displaymath} \sqrt[\leftroot{-2}\uproot{2} N ]{ \prod_{i=1}^N x_i} \geq \frac{N}{\sum_{i=1}^N \frac{1}{x_i}}. \end{displaymath}

In words, the geometric mean of a finite collection of positive numbers is always greater than their harmonic mean, unless all the numbers are equal when the two means coincide.

An inequality with Natural Exponents

For this example we take f(x) = \e^x and \lambda_i = 1/N for all i.  This is strictly convex on \mathbb{R} so we can use

    \begin{displaymath} \sum_{i=1}^N \lambda_i f(x_i) > f \left( \lambda_i x_i \right). \end{displaymath}

Hence we get

    \begin{displaymath} \frac{\sum_{i=1}^N \e^{x_i}}{N} > \e^{\frac{\sum_{i=1}^N x_i}{N}} \end{displaymath}

and for the special case of N = 2 we use x_1 = a and x_2 = b to get

    \begin{displaymath} \frac{\e^a + \e^b}{2} > e^{\frac{a+b}{2}}. \end{displaymath}

Leave a Reply

Your email address will not be published. Required fields are marked *