Concentration inequalities bound the probability that a random variable deviates far from some value (typically its mean), and they show up everywhere. The treatment here closely follows Chapter 2 of the excellent book High Dimensional Probability, by Vershynin. I have added some intuition, solved exercises, and included some simulations that I felt to be enlightening.

The motivating question will be the following: Consider a coin flipped N times. What is the probability of observing at least 3N/4 heads? We will find that the probability decreases exponentially with N.

Central Limit Theorem

First, let’s discuss what the central limit theorem (CLT) says about our coin-flipping experiment. Denote the outcome of the i-th coin flip by the Bernoulli random variable X_i, which takes the value 1 if the outcome is heads and 0 if it is tails. Denote the number of heads in an experiment with N tosses by S_N = \sum_{i=1}^N X_i. The CLT (in this case, the de Moivre-Laplace theorem) states

Z_N \xrightarrow{D} N(0,1),

where Z_N = \frac{S_N - Np}{\sqrt{Np(1-p)}} is simply the number of heads, centered and rescaled. That is, the distribution of the number of heads in our experiment will approach a normal distribution as the number of coin tosses N goes to infinity.

Let’s check it out in R:

library(ggplot2)

ns = c(10,20,50,1000) #Number of coin flips 
p <- 0.5; # It's a fair coin
test_size <- 10000; #Number of times we will simulate the experiment.
toplots <- data.frame();
for (ni in 1:length(ns)){
  n <- ns[ni]
  S_N <- rbinom(n=test_size,size=n,prob=p); # Simulating the experiment by drawing from a binomial distribution
  Z_N <- (S_N-n*p)/sqrt(p*(1-p)*n)
  Z <- rnorm(test_size)
  toplot <- data.frame(x= c(Z_N,Z),
                       group=c(rep("# of Heads",test_size),rep("Normal Distribution",test_size)), 
                       N=rep(n,test_size))
  toplots <- rbind(toplots,toplot)
}
ggplot() + geom_density(data=toplots, aes(x=x,color=group)) +facet_wrap(~N,scales="free")+theme(legend.title  = element_blank(), panel.background = element_blank(), legend.position = "bottom") + xlab("")

(Figure: densities of Z_N and of the standard normal, faceted by N.)

Clearly, the density of Z_N is approaching that of a standard normal distribution. Can this help us get an exponential bound on the probability that the number of heads exceeds 3N/4? After all, the tails of a Gaussian decay exponentially, so if our random variable of interest approaches a Gaussian, shouldn’t that be enough?

Interestingly, it turns out this does not work at all! The problem is that the rate at which Z_N converges in distribution is not nearly fast enough. A quantitative version of the central limit theorem, the Berry-Esseen CLT, says that

\left| \mathrm{P}\{Z_N \geq t\} - \mathrm{P}\{g \geq t\} \right| \leq \frac{C}{\sqrt{N}},

where g \sim N(0,1) and C is a constant that we won’t worry about here. The convergence is of order 1/\sqrt{N}, so it can’t be used to obtain exponential decay. Furthermore, this estimate is sharp. Let’s see for ourselves on some simulated data:

p=.5
ns = c(seq(from=10,to=1000, by=10))
empirical_diff <- rep(-1,length(ns)) 
test_size = 1000;
for (ni in 1:length(ns) ){
  n <- ns[ni]
  
  # We want to estimate P{Z_N > t}, and we know that S_N ~ Binomial(n, p). To use
  # the CDF of S_N, we rescale t: P{Z_N > t} = P{S_N > t*sqrt(p*(1-p)*n) + n*p}
  ts <- seq(from=-3,to=3,by = 0.1) 
  ts_ <- ts*(sqrt(p*(1-p)*n))+n*p 
  Z_N_gt_t <- 1-pbinom(ts_,size = n,prob=p) # P{Z_N > t}
  g_gt_t <- 1-pnorm(ts)                     # P{g > t} for g ~ N(0,1)
  empirical_diff[ni] <- max(abs(Z_N_gt_t - g_gt_t)); # LHS of our inequality, maximized over t
}

berry_esseen <- predict(lm(empirical_diff ~ I(ns^-0.5))) # Let's fit a curve 1/sqrt(N) to compare
toplot <- data.frame(ns=c(ns,ns), probability= c(empirical_diff,berry_esseen),
                     group=c(rep("Empirical Difference",length(ns)),rep("Berry Esseen Bound",length(ns))))
ggplot(toplot, aes(x=ns, probability, color=group)) +geom_line(aes(linetype=group),size=2)+theme(legend.title  = element_blank(), legend.position = "bottom") + xlab("")

(Figure: the empirical difference and the fitted 1/\sqrt{N} curve, as a function of N.)

The LHS of our inequality (in blue) decays exactly like 1/\sqrt{N}. This is really slow, much slower than exponential. This is why we need concentration inequalities: the CLT is not going to help us here.

Concentration Inequalities

In these notes, the bounds will increase in strength as they use more information (i.e. stronger assumptions) about the random variables (or sums of random variables) they are bounding. We begin with the Markov inequality, which uses only the non-negativity of the random variable. We then use it to prove the Chebyshev inequality, which uses the variance. By assuming independence, we obtain an exponential bound via Hoeffding's inequality. And finally, by taking into account the means of our random variables, we obtain the Chernoff bound.

Markov & Chebyshev

To get us warmed up, let’s start with the Markov inequality:

For any non-negative random variable X and any t > 0, \mathrm{P}\{X \geq t\} \leq \frac{\mathrm{E}X}{t}.

Proof.
Let (\Omega,\Sigma,\mathrm{P}) be a probability space. Then \mathrm{E}X = \int_\Omega X \, d\mathrm{P} \geq \int_{\{X \geq t\}} X \, d\mathrm{P} \geq t \int_{\{X \geq t\}} d\mathrm{P} = t\,\mathrm{P}\{X \geq t\}.

This is often a very weak bound, as it does not use anything else we know about the random variable X. But it is used in the proof of every other bound in these notes, such as the Chebyshev inequality, which uses the variance of X to obtain a concentration bound.

For any random variable X and any t > 0, \mathrm{P}\{|X - \mathrm{E}X| \geq t\} \leq \frac{\mathrm{Var}(X)}{t^2}.

Proof.
We obtain the Chebyshev inequality by squaring both sides of |X - \mathrm{E}X| \geq t and applying the Markov inequality to the non-negative variable |X - \mathrm{E}X|^2: \mathrm{P}\{|X - \mathrm{E}X| \geq t\} = \mathrm{P}\{|X - \mathrm{E}X|^2 \geq t^2\} \leq \frac{\mathrm{E}|X - \mathrm{E}X|^2}{t^2} = \frac{\mathrm{Var}(X)}{t^2}.
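
As a quick sanity check (a sketch of my own, not from the text), here is a minimal R snippet comparing the Markov and Chebyshev bounds to the empirical tail of S_N for N = 100 fair coin flips; all variable names are my own.

set.seed(1)
n <- 100; p <- 0.5; test_size <- 1e6
S_N <- rbinom(n = test_size, size = n, prob = p)
t <- 3*n/4                              # threshold 3N/4
markov    <- (n*p)/t                    # Markov: P{S_N >= t} <= E S_N / t (S_N is non-negative)
chebyshev <- (n*p*(1-p))/(t - n*p)^2    # Chebyshev: P{S_N - E S_N >= t - E S_N} <= Var(S_N)/(t - E S_N)^2
empirical <- mean(S_N >= t)
c(markov = markov, chebyshev = chebyshev, empirical = empirical)

Both bounds hold, but they sit far above the true (tiny) tail probability; this is exactly the gap that Hoeffding's inequality will close.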

Hoeffding’s Inequality

Hoeffding's inequality is going to give us the exponential bound for the deviation of \sum_i X_i that we are looking for. To do so, it makes a crucial assumption: independence.

To start off, we will prove a special case, where the variables X_1,\ldots,X_N are symmetric Bernoulli. This means that they take the values 1 and -1 with equal probability.

If X_1,\ldots,X_N are independent symmetric Bernoulli random variables and a = (a_1,\ldots,a_N) \in \mathbb{R}^N, then for any t \geq 0, we have \mathrm{P}\left\{\sum_{i=1}^N a_i X_i \geq t\right\} \leq \exp\left(-\frac{t^2}{2\|a\|_2^2}\right).
Proof.
Fix \lambda > 0. Then \begin{align*} \mathrm{P} \left\{\sum_{i=1}^N a_i X_i \geq t \right\} & = \mathrm{P} \left\{\exp \left( \lambda \sum_{i=1}^N a_i X_i \right)\geq e^{\lambda t} \right\} \\ & \leq e^{-\lambda t}\mathrm{E} \left\{\exp \left( \lambda \sum_{i=1}^N a_i X_i \right) \right\}, \end{align*} where Markov's inequality was used in the second step. Now, by independence: \begin{align*} \mathrm{E} \left\{\exp \left( \lambda \sum_{i=1}^N a_i X_i \right) \right\} &= \mathrm{E} \left\{\prod_{i=1}^N \exp(\lambda a_i X_i)\right\}\\ &= \prod_{i=1}^N \mathrm{E} \left\{\exp(\lambda a_i X_i)\right\}. \end{align*} Noting that X_i takes the values -1 and 1 with probability 1/2 each, we can compute each expectation directly: \mathrm{E} \left\{\exp(\lambda a_i X_i)\right\} = \frac{e^{\lambda a_i} + e^{-\lambda a_i}}{2} \leq e^{\lambda^2 a_i^2/2}. The inequality follows from the Taylor series of the exponential, e^x = \sum_{k=0}^\infty \frac{x^k}{k!}, which gives \frac{e^x + e^{-x}}{2} = \sum_{k=0}^\infty \frac{x^{2k}}{(2k)!} \quad\text{and}\quad e^{x^2/2} = \sum_{k=0}^\infty \frac{x^{2k}}{2^k k!}; since (2k)! \geq 2^k k! for every k, we get \frac{e^x + e^{-x}}{2} \leq e^{x^2/2}. Now we plug this into the product above. By homogeneity we may assume \|a\|_2 = 1: replacing a_i by a_i/\|a\|_2 and t by t/\|a\|_2 leaves the event unchanged, and turns the bound e^{-t^2/2} for unit a into e^{-t^2/(2\|a\|_2^2)} for general a. With \|a\|_2 = 1: \begin{align*} \mathrm{P} \left\{\sum_{i=1}^N a_i X_i \geq t \right\} &\leq e^{-\lambda t}\left( \prod_{i=1}^Ne^{\lambda^2 a_i^2/2}\right) \\ &= e^{-\lambda t}\, e^{\lambda^2 \sum_{i=1}^Na_i^2/2} \\ &= e^{\lambda^2 /2 - \lambda t }. \end{align*} This holds for any \lambda > 0, and the exponent is minimized by \lambda = t, so we obtain \mathrm{P} \left\{\sum_{i=1}^N a_i X_i \geq t \right\} \leq e^{-t^2/2}, and, for general a, \mathrm{P} \left\{\sum_{i=1}^N a_i X_i \geq t \right\} \leq e^{-\frac{t^2}{2\|a\|_2^2}}.
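
Here is a small R sketch of my own (not from the text) that checks the bound numerically for an arbitrary coefficient vector a: we compare the empirical tail of \sum_i a_i X_i for symmetric Bernoulli X_i against \exp(-t^2/(2\|a\|_2^2)).

set.seed(1)
N <- 50; test_size <- 1e5
a <- runif(N)                                     # an arbitrary coefficient vector
X <- matrix(sample(c(-1, 1), N*test_size, replace = TRUE), nrow = test_size) # rows of +/-1 signs
sums <- as.vector(X %*% a)                        # each entry is one realization of sum_i a_i X_i
ts <- seq(0, 3*sqrt(sum(a^2)), length.out = 20)
empirical <- sapply(ts, function(t) mean(sums >= t))
hoeffding <- exp(-ts^2/(2*sum(a^2)))
all(empirical <= hoeffding)                       # the bound should hold at every t (up to Monte Carlo noise)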

With this theorem in hand, we can obtain bounds for our coin-flipping problem. Let Y_i = 2(X_i-\frac{1}{2}), which is a symmetric Bernoulli variable. Then

\mathrm{P}\left\{ S_N \geq \frac{3N}{4} \right\} = \mathrm{P}\left\{ \sum_{i=1}^N \left(X_i - \frac{1}{2}\right) \geq \frac{N}{4} \right\} = \mathrm{P}\left\{ \sum_{i=1}^N Y_i \geq \frac{N}{2} \right\} \leq \exp\left(-\frac{(N/2)^2}{2N}\right),

where the theorem was applied with a = [1,1,\ldots,1] (so \|a\|_{2}^2 = N) and t = N/2. So

\mathrm{P}\left\{ S_N \geq \frac{3N}{4} \right\} \leq e^{-N/8}.

Simulating the coin flips, we see that Hoeffding’s inequality matches the exponential decay observed in the simulation, whereas the Chebyshev bound, which here works out to \mathrm{Var}(S_N)/(N/4)^2 = 4/N, decays only polynomially.

library(ggplot2)
library(patchwork) # devtools::install_github("thomasp85/patchwork")
p=.5
ns = c(seq(from=10,to=100, by=10))
chebyshev_upper_bounds <- 4/ns;        # Chebyshev: Var(S_N)/(N/4)^2 = 4/N
hoeffding_upper_bounds <- exp(-ns/8 )  # Hoeffding: exp(-N/8)
empirical_upper_bounds <- rep(0,length(ns))
test_size = 1E7;
for (ni in 1:length(ns) ){
  n <- ns[ni]
  expvec <- rbinom(n=test_size,size=n,prob=p);
  empirical_upper_bounds[ni] <- (sum(( expvec - n*p) > n/4 ))/test_size; # empirical P{S_N - Np > N/4}
}

toplot <- data.frame(ns=c(ns,ns,ns), probability= c(empirical_upper_bounds, chebyshev_upper_bounds, hoeffding_upper_bounds),
                     group=c(rep("Empirical",length(ns)),rep("Chebyshev",length(ns)), rep("Hoeffding",length(ns))))
g_linear <- ggplot(toplot, aes(x=ns, probability, color=group)) + geom_line()  +xlab("N")+ theme(legend.title = element_blank(), legend.position = "bottom")
g_log <- ggplot(toplot, aes(x=ns, probability, color=group)) + geom_line()  + scale_y_log10() +xlab("N") + theme(legend.title = element_blank(), legend.position = "bottom")
g <- g_linear + g_log

(Figure: concentration inequalities for the binomial — empirical tail probability vs. the Chebyshev and Hoeffding bounds, on linear and log scales.)

We can prove a slightly more general statement for sums of arbitrary bounded random variables.

If X_1,...,X_N are independent random variables with X_i \in [m_i,M_i] almost surely, then for any t>0, \mathrm{P} \left\{ \sum_{i=1}^N (X_i - \mathop{\mathrm E} X_i) \geq t \right\} \leq \exp\left(- \frac{2 t^2}{\sum_{i=1}^N (M_i - m_i)^2}\right)
Proof.
As before, we multiply by \lambda > 0, exponentiate, and use Markov's inequality. \begin{align*} \mathrm P \left\{ \sum ( X_i - \mathop{\mathrm E} X_i) \geq t \right\} &= \mathrm P \left\{ \exp\left(\lambda \sum ( X_i - \mathop{\mathrm E} X_i)\right) \geq e^{\lambda t} \right\}\\ &\leq \mathop{\mathrm E} \left\{ \exp\left(\lambda \sum ( X_i - \mathop{\mathrm E} X_i)\right) \right\}e^{-\lambda t} \\ &= \prod_i \mathop{\mathrm E} \left\{ \exp\left(\lambda ( X_i - \mathop{\mathrm E} X_i)\right) \right\}e^{-\lambda t}, \\ \end{align*} where the last step uses independence. Now, we must bound each factor. To do so, we use a technique called symmetrization, where we introduce an independent copy of X_i, which we call X_i'. Since X_i' has the same distribution as X_i, \mathop{\mathrm E} X_i = \mathop{\mathrm E} X_i', and \begin{align*} \mathop{\mathrm E} \left\{ \exp\left(\lambda ( X_i - \mathop{\mathrm E} X_i)\right) \right\} &= \mathop{\mathrm E}_{X_i} \left\{ \exp\left(\lambda ( X_i - \mathop{\mathrm E}_{X_i'} X_i')\right) \right\}\\ &\leq \mathop{\mathrm E}_{X_i} \mathop{\mathrm E}_{X_i'}\left\{ \exp\left(\lambda ( X_i - X_i')\right) \right\}\\ &= \mathop{\mathrm E} \left\{ \exp\left(\lambda ( X_i - X_i')\right) \right\}, \end{align*} where the inequality is Jensen's inequality applied to \mathop{\mathrm E}_{X_i'} (the exponential is convex). Interestingly, X_i - X_i' is symmetric around zero: its distribution is the same as that of S(X_i - X_i'), where S is a Rademacher variable (a random sign that is 1 or -1 with equal probability, independent of everything else). So, we have \begin{align*} \mathop{\mathrm E} \left\{ \exp\left(\lambda ( X_i - X_i')\right) \right\} &= \mathop{\mathrm E}_{X_i,X_i'} \left\{ \mathop{\mathrm E}_S \exp\left(\lambda S( X_i - X_i')\right) \right\}\\ &\leq \mathop{\mathrm E}_{X_i,X_i'} \left\{ \exp(\lambda^2(X_i - X_i')^2/2)\right\}\\ &\leq \exp(\lambda^2(M_i-m_i)^2/2), \\ \end{align*} where the first inequality follows from the Taylor expansion of the exponential (exactly as in the proof above) and the second uses the boundedness of X_i and X_i', which gives |X_i - X_i'| \leq M_i - m_i. Now, plugging into our first inequality above, \begin{align*} \mathrm P \left\{ \sum ( X_i - \mathop{\mathrm E} X_i) \geq t \right\} &\leq \prod_i\exp(\lambda^2(M_i-m_i)^2/2)\, e^{-\lambda t} \\ &= \exp\left(\lambda^2\sum_i(M_i-m_i)^2/2-\lambda t\right) \\ &\leq\exp\left(-\frac{t^2}{2\sum_i(M_i-m_i)^2}\right), \\ \end{align*} where the last inequality chooses the \lambda that minimizes the exponent, \lambda = t/\sum_i(M_i-m_i)^2. Note that this symmetrization argument loses a constant in the exponent; to obtain the sharp constant 2 stated in the theorem, replace the symmetrization step with Hoeffding's lemma, \mathop{\mathrm E} \exp(\lambda(X_i - \mathop{\mathrm E} X_i)) \leq \exp(\lambda^2 (M_i - m_i)^2/8), and repeat the same optimization over \lambda.
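
As a quick check of my own (not in the original), the same inequality can be tested numerically for bounded, non-Bernoulli variables, for example X_i ~ Uniform[0,1], where every M_i - m_i = 1:

set.seed(1)
N <- 30; test_size <- 1e5
X <- matrix(runif(N*test_size), nrow = test_size) # each row: N independent Uniform[0,1] draws
centered_sums <- rowSums(X) - N*0.5               # sum_i (X_i - E X_i), with E X_i = 1/2
t <- 3                                            # an arbitrary deviation level
c(empirical = mean(centered_sums >= t),           # empirical P{ sum (X_i - E X_i) >= t }
  hoeffding = exp(-2*t^2/N))                      # the bound exp(-2 t^2 / N), since (M_i - m_i)^2 = 1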

Here is an exercise from Vershynin’s book that nicely demonstrates the utility of this inequality.

Suppose we have a binary classification problem and a randomized algorithm that correctly classifies a data point with probability \frac{1}{2}+\delta, for some \delta >0. The algorithm is randomized, so any two runs are independent. A natural way to construct an improved classifier is to run the algorithm N times and take the majority vote. Suppose we want the majority vote to classify incorrectly with probability at most \epsilon >0. What is the minimum number of times we need to run the algorithm?
Solution.
We can model each run of the algorithm as a Bernoulli random variable X_i which is 1 if the run is incorrect (with probability \frac{1}{2}-\delta) and 0 if it is correct (with probability \frac{1}{2} +\delta). The majority vote is wrong when at least half of the runs are wrong, so we need \mathrm{P} \left\{ \frac{1}{N} \sum_{i=1}^N X_i \geq \frac{1}{2} \right\} < \epsilon. To solve this problem, we start by writing out Hoeffding's inequality (here every M_i - m_i = 1): \mathrm{P} \left\{ \sum_{i=1}^N (X_i - \mathop{\mathrm E} X_i) \geq t \right\} \leq \exp\left(- \frac{2 t^2}{N}\right). This holds for any t > 0, so let's pick a t that gives us the event on the left-hand side of our requirement. The event \sum_{i=1}^N ( X_i - \mathop{\mathrm E} X_i ) \geq t is the same as \frac{1}{N} \sum_{i=1}^N X_i \geq \frac{t}{N} + \mathop{\mathrm E} X_i = \frac{t}{N} + \frac{1}{2} - \delta. We want the right-hand side to be \frac{1}{2}, so we choose t = \delta N, giving \mathrm{P} \left\{ \frac{1}{N}\sum_{i=1}^N X_i \geq \frac{1}{2} \right\} \leq \exp\left(- 2 \delta^2N\right). We need to choose N large enough that \exp\left(- 2 \delta^2N\right) < \epsilon. Solving for N, \begin{align*} \exp\left(- 2 \delta^2N\right) &< \epsilon\\ - 2 \delta^2N &< \ln(\epsilon)\\ N &> \frac{1}{2}\delta^{-2}\ln(\epsilon^{-1}). \end{align*} For example, if we want 99% accuracy (\epsilon = 0.01) from a weak classifier with \delta = 0.05, then we need to run the algorithm at least \frac{1}{2}(0.05)^{-2}\ln(100) \approx 921 times.
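
To make the exercise concrete, here is a small simulation (a sketch of my own, with made-up variable names) of the majority-vote classifier with \delta = 0.05 and N taken from the bound above; the observed error rate should come out below \epsilon = 0.01.

set.seed(1)
delta <- 0.05; eps <- 0.01
N <- ceiling(0.5*delta^-2*log(1/eps))           # N from the Hoeffding bound (922 here)
n_trials <- 1e4                                 # number of majority-vote classifications to simulate
incorrect_counts <- rbinom(n_trials, size = N, prob = 0.5 - delta) # incorrect runs out of N
majority_error <- mean(incorrect_counts >= N/2) # majority vote is wrong when >= half the runs are wrong
c(N = N, observed_error = majority_error, target_eps = eps)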

Chernoff Bounds

The Hoeffding bound for Bernoulli random variables with p=0.5 was quite sharp. Let’s see what happens for small p. As before, let X_i \sim \text{Bernoulli}(p) and S_N = \sum_{i=1}^N X_i. Suppose we need to bound the probability that S_N \geq 10pN. Hoeffding's inequality (with M_i - m_i = 1) gives

\mathrm{P}\left\{ S_N \geq 10pN \right\} = \mathrm{P}\left\{ \sum_{i=1}^N (X_i - p) \geq 9pN \right\} \leq \exp\left(-\frac{2(9pN)^2}{N}\right) = \exp\left(-162\, p^2 N\right).

This is a bound on the probability that the binomial random variable exceeds its mean by more than 9 times the mean. Intuitively, we would expect this to be highly improbable. Indeed, with N=100 and p=0.5, the bound above is essentially zero. But with p=0.01, we get about 0.2, which is completely useless! Hoeffding is just not enough for small p; we need something stronger.
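
A quick way to see those two numbers is to evaluate the Hoeffding bound from above directly in R:

exp(-2*(9*0.5)^2*100)   # p = 0.5,  N = 100: essentially zero
exp(-2*(9*0.01)^2*100)  # p = 0.01, N = 100: about 0.198 -- useless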

Let X_1,...,X_N be independent Bernoulli random variables with parameters p_i, and let S_N = \sum_i X_i and \mu = \mathop{\mathrm E} S_N. Then for t > \mu, \mathrm{P} \left\{ S_N \geq t \right\} \leq \exp\left(- \mu \right) \left( \frac{e \mu}{t} \right)^t.
Proof.
Again, we multiply by \lambda > 0, exponentiate, and use Markov's inequality. \begin{align*} \mathrm P \left\{ S_N \geq t \right\} &\leq \mathop{\mathrm E} \left\{ \exp\left(\lambda \sum_i X_i \right) \right\}e^{-\lambda t} \\ &= \prod_i \mathop{\mathrm E} \left\{ \exp\left(\lambda X_i \right) \right\}e^{-\lambda t}, \\ \end{align*} using independence in the second step. Now, we bound each expectation separately. We use the inequality 1+x \leq e^x, which comes from noting that 1+x is the tangent line of e^x at 0, and e^x is convex, so it must lie above its tangent lines. \begin{align*} \mathop{\mathrm E} \left\{ \exp\left(\lambda X_i \right) \right\} &= e^\lambda p_i + (1-p_i)\\ &= 1 + (e^\lambda -1)p_i \\ &\leq \exp \left( p_i (e^\lambda -1) \right). \\ \end{align*} This gives \begin{align*} \mathrm P \left\{ S_N \geq t \right\} &\leq \prod_i \exp \left( p_i (e^\lambda -1) \right)e^{-\lambda t} \\ &= \exp \left((e^\lambda -1)\sum_i p_i \right)e^{-\lambda t} \\ &= \exp \left((e^\lambda -1)\mu \right)e^{-\lambda t}. \\ \end{align*} We are free to choose any \lambda > 0, so we substitute \lambda = \ln(t/\mu), which is positive because t > \mu, giving \begin{align*} \mathrm P \left\{ S_N \geq t \right\} &\leq \exp \left(\left(\frac{t}{\mu} -1\right)\mu \right)\left( \frac{\mu}{t} \right)^t \\ &= \exp \left(t-\mu \right)\left( \frac{\mu}{t} \right)^t \\ &= \exp \left(-\mu \right)\left( \frac{e\mu}{t} \right)^t. \\ \end{align*}

In our case, \mu = pN and t = 10pN, so the Chernoff bound gives

\mathrm{P}\left\{ S_N \geq 10pN \right\} \leq e^{-pN}\left(\frac{e\, pN}{10pN}\right)^{10pN} = e^{-pN}\left(\frac{e}{10}\right)^{10pN}.

require(ggplot2)
library(patchwork) # devtools::install_github("thomasp85/patchwork")
p=.01
ns = c(seq(from=10,to=100, by=10))
chernoff_upper_bounds <- exp(-p*ns)*((exp(1)/10))^(10*p*ns); # Chernoff: e^{-pN} (e/10)^{10pN}
hoeffding_upper_bounds <- exp(-(2*(9*p)^2*ns) )              # Hoeffding: exp(-2 (9p)^2 N)
empirical_upper_bounds <- rep(0,length(ns))
test_size = 5E7;
for (ni in 1:length(ns) ){
  n <- ns[ni]
  expvec <- rbinom(n=test_size,size=n,prob=p);
  empirical_upper_bounds[ni] <- (sum( expvec > 10*n*p ))/test_size; # empirical P{S_N > 10pN}
}

toplot <- data.frame(ns=c(ns,ns,ns), probability= c(empirical_upper_bounds, chernoff_upper_bounds, hoeffding_upper_bounds),
                     group=c(rep("Empirical",length(ns)),rep("Chernoff",length(ns)), rep("Hoeffding",length(ns))))
g_linear <- ggplot(toplot, aes(x=ns, probability, color=group)) + geom_line()  +xlab("N")+ theme(legend.title = element_blank(), legend.position = "bottom")
g_log <- ggplot(toplot, aes(x=ns, probability, color=group)) + geom_line()  + scale_y_log10() +xlab("N") + theme(legend.title = element_blank(), legend.position = "bottom")
g <- g_linear + g_log

(Figure: empirical tail probability vs. the Chernoff and Hoeffding bounds for p = 0.01, on linear and log scales.)

Now this is much better! By taking into account the means of X_i, we get a much tighter bound when those means are very small.

Final Notes

The primary reference for these notes is Prof. Roman Vershynin’s High Dimensional Probability, which I highly recommend. I also found these notes by Prof. John Duchi to be helpful. Also many thanks to Prof. Stefan Steinerberger and Prof. John Lafferty, whose classes first got me interested in the topic.

These notes focused on sums of Bernoulli variables, but the field of concentration inequalities is remarkably deep. For example, see Vershynin’s book for extensions of these inequalities to sub-gaussian and sub-exponential variables, which I did not cover here. Also, please contact me (Twitter or Email) with comments or questions!