Jensen’s inequality

If $f$ is convex, averaging first and then applying $f$ never gives a larger answer than applying $f$ first and then averaging:

$E[f(X)] \geq f(E[X])$

For concave $f$ the inequality reverses.
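A tiny exact example makes the statement concrete. The sketch below assumes a hypothetical two-point distribution (X is 1 or 3 with equal probability) and the convex choice f(x) = x²; the helper name `expectation` is illustrative only.

```python
from fractions import Fraction

def expectation(values, probs, g=lambda x: x):
    """Exact expectation of g(X) for a finite discrete distribution."""
    return sum(p * g(v) for v, p in zip(values, probs))

# Hypothetical toy distribution: X is 1 or 3, each with probability 1/2.
values = [Fraction(1), Fraction(3)]
probs = [Fraction(1, 2), Fraction(1, 2)]

def square(x):
    return x * x  # a convex function

mean_x = expectation(values, probs)             # E[X] = 2
f_of_mean = square(mean_x)                      # f(E[X]) = 4
mean_of_f = expectation(values, probs, square)  # E[f(X)] = (1 + 9) / 2 = 5

print(f_of_mean, mean_of_f)  # prints: 4 5
```

Averaging first gives 4, applying f first gives 5, so the convex direction of the inequality holds with a gap of 1.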

Demo readout (controls: f(x); μ = 1.2, lognormal location; σ = 0.70, lognormal scale; n = 3000 samples): E[X] ≈ 4.178, f(E[X]) ≈ 17.454, E[f(X)] ≈ 28.284, gap ≈ 10.830 (convex, expected sign +).

$X$ is lognormal. The indigo dot is $f(E[X])$, the amber dot is $E[f(X)]$. For convex $f$, amber sits above indigo ($E[f(X)] \geq f(E[X])$); for concave $f$, below ($E[f(X)] \leq f(E[X])$). The gap grows with σ.
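The experiment described here can be approximated offline. This is a minimal sketch, not the page's own code: it assumes f(x) = x² for the convex case and log for the concave case, and the function name `jensen_gap` and the seed are made up for illustration. The defaults mirror the μ, σ and n values shown in the readout.

```python
import math
import random

def jensen_gap(f, mu=1.2, sigma=0.70, n=3000, seed=0):
    """Estimate (f(E[X]), E[f(X)], gap) from n lognormal draws."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    xs = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    mean_x = sum(xs) / n
    f_of_mean = f(mean_x)                  # the "indigo dot"
    mean_of_f = sum(f(x) for x in xs) / n  # the "amber dot"
    return f_of_mean, mean_of_f, mean_of_f - f_of_mean

convex = jensen_gap(lambda x: x * x)  # assumed convex choice: x^2
concave = jensen_gap(math.log)        # assumed concave choice: log

print("convex gap :", convex[2])   # positive
print("concave gap:", concave[2])  # negative
```

Because Jensen's inequality also holds for the empirical distribution of the sample, the signs come out right for any seed; only the magnitudes vary from run to run.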

What to notice

Convex $f$ puts the amber dot above the indigo. $E[f(X)]$ is at least $f(E[X])$. The chord joining two points on the curve always sits on or above the curve for convex functions; Jensen is the distributional version of that geometric fact.
Concave f flips the sign. Switch to log or √x. Now indigo sits above amber. Same inequality, other direction.
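The gap grows with variance. Crank σ up. The sample spreads out, the chord gets longer, and the distance between the chord’s midpoint and the curve widens. Linear (affine) $f$ means zero gap. For strictly convex $f$, Jensen is an equality only when $X$ is constant; in general, equality requires $f$ to be affine over the range where $X$ puts its mass.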
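For one specific choice the growth of the gap can be written in closed form. The sketch below assumes f(x) = x², for which the gap is exactly Var(X), and uses the standard lognormal moments E[X] = exp(μ + σ²/2) and E[X²] = exp(2μ + 2σ²). The function name and the grid of σ values are illustrative.

```python
import math

def lognormal_square_gap(mu, sigma):
    """Exact E[X^2] - (E[X])^2 for X ~ Lognormal(mu, sigma); this is Var(X)."""
    mean_x = math.exp(mu + sigma**2 / 2)
    mean_x2 = math.exp(2 * mu + 2 * sigma**2)
    return mean_x2 - mean_x**2

sigmas = [0.0, 0.35, 0.70, 1.05]  # illustrative grid; 0.70 is the demo's value
gaps = [lognormal_square_gap(1.2, s) for s in sigmas]

for s, g in zip(sigmas, gaps):
    print(f"sigma={s:.2f}  gap={g:.3f}")
```

At σ = 0 the distribution collapses to a point and the gap is zero up to floating-point rounding; it then increases with every step in σ.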

Why it matters

Jensen gives a one-line proof of a whole family of inequalities:

Log-mean vs. mean-log (AM–GM for logs):

$\log E[X] \geq E[\log X]$
Apply Jensen with the concave log, and you get the arithmetic-mean–geometric-mean inequality for free.
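Non-negativity of KL divergence. Apply Jensen with the convex $-\log$ to the ratio $q / p$, and use $E_{p}[q(X) / p(X)] \leq 1$:

$D_{\mathrm{KL}}(p \parallel q) = E_{p}\left[\log \frac{p(X)}{q(X)}\right] = E_{p}\left[-\log \frac{q(X)}{p(X)}\right] \geq -\log E_{p}\left[\frac{q(X)}{p(X)}\right] \geq 0$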
Rao–Blackwell. Replacing an estimator with its conditional expectation given a sufficient statistic never increases mean squared error — because squared error is convex.
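Log-sum-exp. Jensen with the convex exp gives $\log E[e^{X}] \geq E[X]$: the log-mean-exp of a set of scores is never smaller than their plain average. (The softmax’s shift-invariance, which underlies the usual numerical-stability trick, is a separate algebraic identity.)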
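Two of the consequences above are easy to check numerically. This is a small sketch with made-up inputs: the numbers 1, 4, 16 and the distributions `p` and `q` are arbitrary examples, and the helper names are illustrative.

```python
import math

def am_gm(xs):
    """Return (arithmetic mean, geometric mean) of positive numbers."""
    am = sum(xs) / len(xs)
    # exp(mean of logs) is the geometric mean; log(AM) >= mean(log) gives AM >= GM.
    gm = math.exp(sum(math.log(x) for x in xs) / len(xs))
    return am, gm

def kl_divergence(p, q):
    """D_KL(p || q) for discrete distributions, assuming q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

am, gm = am_gm([1.0, 4.0, 16.0])  # AM = 7, GM = 64 ** (1/3) = 4

p = [0.5, 0.3, 0.2]  # arbitrary example distributions
q = [0.2, 0.5, 0.3]
kl_pq = kl_divergence(p, q)
kl_pp = kl_divergence(p, p)

print(f"AM={am:.3f} GM={gm:.3f}")
print(f"KL(p||q)={kl_pq:.4f} KL(p||p)={kl_pp:.4f}")
```

The arithmetic mean is at least the geometric mean, and the divergence is positive for p ≠ q and zero when the two distributions coincide.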

Proof sketch

Every convex function has a supporting line at any interior point $m$ of its domain: there exists $c$ such that $f(x) \geq f(m) + c(x - m)$ for all $x$. Take $m = E[X]$ and take expectations on both sides. The linear term vanishes, since $E[c(X - m)] = c(E[X] - m) = 0$, and you’re left with $E[f(X)] \geq f(E[X])$.
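The two steps of the argument can be traced on a concrete sample. The sketch assumes f(x) = x², whose supporting line at m is the tangent with slope c = 2m, and an arbitrary five-point sample treated as equally likely outcomes; the names are illustrative.

```python
def supporting_line(m):
    """Tangent to f(x) = x^2 at m: returns (f(m), slope c = 2m)."""
    return m * m, 2 * m

xs = [0.5, 1.0, 2.5, 4.0, 7.0]  # arbitrary sample, equally weighted
m = sum(xs) / len(xs)           # m = E[X] = 3.0
f_m, c = supporting_line(m)

# Step 1: pointwise, f(x) >= f(m) + c * (x - m); the slack is (x - m)^2.
slack = [x * x - (f_m + c * (x - m)) for x in xs]

# Step 2: averaging kills the linear term, leaving E[f(X)] >= f(E[X]).
mean_linear = sum(c * (x - m) for x in xs) / len(xs)
mean_f = sum(x * x for x in xs) / len(xs)

print(slack)           # every entry is non-negative
print(mean_linear)     # zero
print(mean_f, f_m)     # E[f(X)] and f(E[X])
```

Here the average slack equals the gap between E[f(X)] and f(E[X]), which for x² is the variance of the sample.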