
7.4 Characteristic functions (\(\Delta \))

In this section we introduce the main tool that we will need to prove the central limit theorem. It relies on Lebesgue integration in \(\C \), which we studied in Section 4.8. We will also use complex versions of exercises that were earlier set for real-valued functions; in such cases the complex version follows by applying the real version to the real and imaginary parts.

  • Definition 7.4.1 Let \(X\) be a random variable defined on a probability space \((\Omega , {\cal F}, \P )\). The characteristic function \(\phi _{X}:\R \rightarrow \C \) of \(X\) is defined, for each \(u \in \R \), by

    \begin{equation} \label {eq:char_func} \phi _{X}(u) = \E \l [e^{iuX}\r ] = \int _{\R } e^{iuy} \, dp_{X}(y). \end{equation}

The integral formula for \(\E \l [e^{iuX}\r ]\) on the right-hand side of (7.7) follows by Exercise 5.14. Note also that \(y \mapsto e^{iuy}\) is measurable, since \(e^{iuy} = \cos (uy) + i\sin (uy)\), and is in \(\mc {L}^1\) by Exercise 4.5, since \(|e^{iuy}| \leq 1\) for all \(y \in \R \) and \(p_X\) is a finite measure.
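
For example, if \(\P (X=1)=\P (X=-1)=\frac {1}{2}\) then, directly from the definition,

\[ \phi _{X}(u) = \tfrac {1}{2}e^{iu} + \tfrac {1}{2}e^{-iu} = \cos (u) \quad \text { for all } u \in \R . \]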

  • Example 7.4.2 Suppose that \(X \sim N(\mu , \sigma ^{2})\), that is \(X\) has a normal distribution (sometimes known as a Gaussian) with mean \(\mu \) and variance \(\sigma ^{2}\). In Problem 7.9 you can show for yourself that in this case \(\phi _{X}(u) = \exp \l (i\mu u - \frac {1}{2}\sigma ^{2}u^{2}\r )\) for all \(u \in \R \).
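
Although the proof of this formula is deferred to Problem 7.9, it is easy to check numerically. The following is a minimal Monte Carlo sketch in Python; it is illustrative only, it assumes that NumPy is available, and the values of \(\mu \), \(\sigma \) and \(u\) used below are arbitrary choices.

\begin{verbatim}
# Numerical check of Example 7.4.2 (illustrative only).
import numpy as np

rng = np.random.default_rng(seed=1)
mu, sigma = 1.0, 2.0
X = rng.normal(loc=mu, scale=sigma, size=10**6)   # samples of X ~ N(mu, sigma^2)

for u in [0.0, 0.5, 1.0, 2.0]:
    estimate = np.mean(np.exp(1j * u * X))                 # Monte Carlo estimate of E[e^{iuX}]
    formula = np.exp(1j * mu * u - 0.5 * sigma**2 * u**2)  # exp(i mu u - sigma^2 u^2 / 2)
    print(u, estimate, formula)
\end{verbatim}

The two values printed for each \(u\) should agree up to Monte Carlo error.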

Equation (7.7) states that the characteristic function of \(X\) is the Fourier transform of the law \(p_{X}\) of the random variable \(X\). In elementary probability theory courses we often meet the Laplace transform \(\E [e^{uX}]\) of \(X\), which is called the moment generating function of \(X\). The moment generating function has the disadvantage that it only exists when \(|u|\) is small enough, because the function \(y \mapsto e^{uy}\) may not be \(\Lone \). How small \(|u|\) is required to be depends on \(X\), but unfortunately there are random variables for which \(\E [e^{uX}]\) is undefined for all \(u\neq 0\). In such cases the moment generating function is useless. The characteristic function has the important advantage that it is always defined for all \(u\in \R \).
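
For example, suppose that \(X\) has the Cauchy distribution, with probability density function \(f(y)=\frac {1}{\pi (1+y^{2})}\). For every \(u\neq 0\) we have \(\int _{\R }e^{uy}f(y)\,dy=\infty \), because \(e^{uy}\) grows exponentially in one tail while \(f\) decays only polynomially, so the moment generating function of \(X\) is undefined for all \(u\neq 0\). By contrast, it can be shown that \(\phi _{X}(u)=e^{-|u|}\) for all \(u\in \R \).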

Here is a useful property of characteristic functions, which is another instance of the ‘independence means multiply’ philosophy that we developed in Section 5.4.

  • Lemma 7.4.3 If \(X\) and \(Y\) are independent random variables then for all \(u \in \R \),

    \[ \phi _{X+Y}(u) = \phi _{X}(u)\phi _{Y}(u).\]

Proof: We have \(\phi _{X+Y}(u) = \E \big [e^{iu(X+Y)}\big ] = \E \l [e^{iuX}e^{iuY}\r ] = \E \l [e^{iuX}\r ]\E \l [e^{iuY}\r ] = \phi _{X}(u)\phi _{Y}(u)\), where the key step follows by the complex version of Theorem 5.4.2.   ∎

We’ll end this section with two important properties of characteristic functions, neither of which will be proved within this course. The first says that characteristic functions determine laws uniquely, in the sense that any two random variables with the same characteristic function also have the same law. The second relates characteristic functions to convergence in distribution.

  • Theorem 7.4.4 If \(X\) and \(Y\) are two random variables for which \(\phi _{X}(u)=\phi _{Y}(u)\) for all \(u \in \R \) then \(p_{X} = p_{Y}\).
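
For example, combining Example 7.4.2 with Lemma 7.4.3 and Theorem 7.4.4: if \(X\sim N(\mu _{1},\sigma _{1}^{2})\) and \(Y\sim N(\mu _{2},\sigma _{2}^{2})\) are independent then

\[ \phi _{X+Y}(u) = \exp \l (i\mu _{1}u - \tfrac {1}{2}\sigma _{1}^{2}u^{2}\r )\exp \l (i\mu _{2}u - \tfrac {1}{2}\sigma _{2}^{2}u^{2}\r ) = \exp \l (i(\mu _{1}+\mu _{2})u - \tfrac {1}{2}(\sigma _{1}^{2}+\sigma _{2}^{2})u^{2}\r ), \]

which is the characteristic function of the \(N(\mu _{1}+\mu _{2},\sigma _{1}^{2}+\sigma _{2}^{2})\) distribution. By Theorem 7.4.4 we deduce that \(X+Y\sim N(\mu _{1}+\mu _{2},\sigma _{1}^{2}+\sigma _{2}^{2})\).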

  • Theorem 7.4.5 Let \(X_n,X\) be random variables with laws \(p_{X_n}\) and \(p_X\) (respectively), and characteristic functions \(\phi _n\) and \(\phi \). The following statements are equivalent:

    • 1. \(X_n\tod X\),

    • 2. for all \(u\in \R \) we have \(\phi _n(u)\to \phi (u)\).
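
As a simple illustration of Theorem 7.4.5, suppose that \(X_n\sim N(0,1+\frac {1}{n})\). By Example 7.4.2 we have \(\phi _n(u)=e^{-\frac {1}{2}(1+\frac {1}{n})u^2}\to e^{-\frac {1}{2}u^2}\) for every \(u\in \R \), and \(e^{-\frac {1}{2}u^2}\) is the characteristic function of the \(N(0,1)\) distribution, so Theorem 7.4.5 gives \(X_n\tod N(0,1)\). The same strategy, applied with more care, is how we will prove the central limit theorem in Section 7.5.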

7.4.1 Approximating characteristic functions with polynomials \((\Delta )\)

We now establish an inequality that will be used in our proof of the central limit theorem. Let \(x \in \R \) and let \(R_{n}(x)\) be the remainder term in the Taylor series expansion of \(e^{ix}\),

\[ R_{n}(x) = e^{ix} - \sum _{k=0}^{n}\frac {(ix)^{k}}{k!}.\]

We start with an upper bound on \(R_n(x)\), which we then convert into a bound on the distance between a characteristic function and an approximating polynomial.

  • Lemma 7.4.6 For all \(n\in \N \cup \{0\}\) and \(x\in \R \),

    \[ |R_{n}(x)| \leq \min \left \{\frac {2|x|^{n}}{n!}, \frac {|x|^{n+1}}{(n+1)!}\right \}.\]

Proof: A simple calculation based on Exercise 4.14 shows that

\[R_{0}(x) = e^{ix} - 1 = \begin {cases} \int _{0}^{x}ie^{iy}\,dy & \text { if }x > 0, \\ -\int _{x}^{0}ie^{iy}\,dy & \text { if }x < 0. \end {cases} \]

Using that \(|ie^{iy}|\leq 1\) and the absolute values property of integrals (from the complex version of Theorem 4.5.3) gives that \(|R_0(x)|\leq |x|\). Moreover, since \(|e^{ix}|=1\), the triangle inequality gives \(|R_0(x)|=|e^{ix}-1|\leq |e^{ix}|+1=2\). Putting all this together, we have \(|R_{0}(x)| \leq \min \{2, |x|\}\).

A slightly longer calculation, again using Exercise 4.14, shows that

\[R_{n}(x) = \begin {cases} \int _{0}^{x}iR_{n-1}(y)\,dy&\text { if }x > 0,\\ -\int _{x}^{0}iR_{n-1}(y)\,dy &\text { if }x < 0. \end {cases}\]

Using this relationship, the result can be shown via induction (which is left for you to check), starting from the base case \(n=0\) above.   ∎
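
For instance, taking \(n=1\) and \(x>0\), the recursion gives \(R_{1}(x)=\int _{0}^{x}iR_{0}(y)\,dy\), and hence

\[ |R_{1}(x)| \leq \int _{0}^{x}|R_{0}(y)|\,dy \leq \int _{0}^{x}\min \{2,y\}\,dy \leq \min \l \{2x,\frac {x^{2}}{2}\r \}, \]

which is the bound claimed in Lemma 7.4.6 with \(n=1\). The inductive step for general \(n\) works in exactly the same way.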

  • Lemma 7.4.7 Let \(X\) be a random variable such that \(\E [|X|^n]<\infty \) and \(\E [X]=0\). Let \(\phi \) be the characteristic function of \(X\). Then

    \[ \l |\phi (u) - \sum _{k=0}^{n}\frac {(iu)^{k}\E [X^{k}]}{k!}\r | \leq \E \l [\min \l \{\frac {2|uX|^{n}}{n!},\frac {|uX|^{n+1}}{(n+1)!}\r \}\r ] \quad \text { for all } u \in \R . \]
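
    The inequality of Lemma 7.4.7 can be obtained by applying Lemma 7.4.6 with \(x=uX\), taking expectations, and using that \(|\E [Z]|\leq \E [|Z|]\) (the complex version of Theorem 4.5.3). The case that we will use in the proof of the central limit theorem is \(n=2\), for which the bound reads

    \[ \l |\phi (u) - \l (1 + iu\E [X] - \frac {u^{2}}{2}\E [X^{2}]\r )\r | \leq \E \l [\min \l \{|uX|^{2},\frac {|uX|^{3}}{6}\r \}\r ]. \]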

    7.5 The central limit theorem (\(\Delta \))

    The law of large numbers in Section 7.3 told us that if \((X_n)\) is an i.i.d. sequence of \(\mc {L}^1\) random variables then the sample mean \(\overline {X_n}\) becomes close to \(\mu =\E [X_n]\) for large \(n\). The central limit theorem, which we study in this section, examines how close it becomes.

    Let \(S_n=\sum _{i=1}^n X_i\), and let us write

    \begin{equation} \label {eq:standardize} Y_{n} = \ds \frac {\overline {X_{n}} - \mu }{\sigma /\sqrt {n}} \end{equation}

    where \(\mu =\E [X_n]\) and \(\sigma ^2=\var (X_n)\), both assumed to be finite. We saw in Section 7.3 that \(\E [\ov {X_n}]=\mu \) and \(\var (\ov {X_n})=\sigma ^2/n\), which means that \(\E [Y_{n}] = 0\) and \(\var (Y_{n}) = 1\) for all \(\nN \). For this reason (7.8) is often known as a standardization of \(\ov {X_n}\). If \(\ov {X_n}=\mu \) then \(Y_n=0\). More generally, how far \(Y_n\) is from zero corresponds to how far \(\ov {X_n}\) is from \(\mu \).
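
    Since \(\ov {X_n}=S_n/n\), we may equivalently write

    \[ Y_{n} = \frac {S_{n}-n\mu }{\sigma \sqrt {n}}, \]

    and in particular \(Y_n=S_n/\sqrt {n}\) when \(\mu =0\) and \(\sigma =1\); this is the form that will appear in the proof of Theorem 7.5.1 below.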

    It is difficult to overestimate the importance of the next result. It shows that if the \(X_n\) have finite variance then \(Y_n\tod N(0,1)\), where \(N(0,1)\) denotes the standard normal distribution, regardless of the distribution of the i.i.d. random variables \(X_n\). This is extraordinarily useful from an experimental point of view, because it tells you something that you should expect to see happen in every experiment – provided you can take repeated independent samples and they have finite variance. It allows statistical tests to be constructed without knowing the precise distribution of the random quantities involved, and it allows experimental science to quantify how likely it is that a particular experiment (repeated sufficiently many times) has observed a genuine effect rather than just random chance. In uncertain situations, the scientific basis for how well we understand the world around us comes primarily from the central limit theorem. We will discuss its history in Section 7.5.1.

    • Theorem 7.5.1 (Central Limit Theorem) Let \((X_{n})\) be a sequence of i.i.d. random variables each having finite mean \(\mu \) and finite variance \(\sigma ^{2}\). Let \(Y_n\) be given by (7.8). Then \(Y_n\) converges in distribution to the \(N(0,1)\) distribution, as \(n\to \infty \).
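
    Before turning to the proof, here is a minimal simulation sketch of Theorem 7.5.1 in Python. It is illustrative only and not part of the formal development; it assumes that NumPy is available, and the choice of Exponential(1) samples (for which \(\mu =\sigma =1\)) and of the sample sizes is arbitrary.

\begin{verbatim}
# Simulation sketch of the central limit theorem (illustrative only).
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(seed=0)
n, trials = 500, 20000

# Exponential(1) samples: mean mu = 1, variance sigma^2 = 1, heavily skewed.
X = rng.exponential(scale=1.0, size=(trials, n))
Y = (X.mean(axis=1) - 1.0) * np.sqrt(n)    # Y_n as in (7.8), with mu = sigma = 1

def Phi(x):                                # standard normal cdf
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, np.mean(Y <= x), Phi(x))      # empirical P(Y_n <= x) versus Phi(x)
\end{verbatim}

    The empirical probabilities should lie close to \(\Phi (x)\), even though the underlying samples are far from normally distributed.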

    Our proof will be based on Lemma 7.4.7 and the following lemma, which is a slight extension of the famous fact that \((1+\frac {x}{n})^n\to e^x\) as \(n\to \infty \).

    • Lemma 7.5.2 Let \((\alpha _{n})\) be a sequence of real or complex numbers such that \(\lim _n\alpha _{n} = 0\). Then for all \(y\in \R \), as \(n\to \infty \) we have

      \begin{equation} \label {eq:e_lim} \l (1 + \frac {y + \alpha _{n}}{n}\r )^{n} \to e^{y} \end{equation}

    The proof of Lemma 7.5.2 is an exercise in real analysis, which is included as Problem 7.11. We are now ready to prove the central limit theorem.

    Proof of Theorem 7.5.1: Without loss of generality we assume that \(\mu = 0\) and \(\sigma = 1\); we can recover the general result from this special case by replacing \(X_{n}\) by \((X_{n} - \mu )/\sigma \). Hence \(\E [X_1]=\mu =0\) and \(\var (X_1)=\E [X_1^2]=1\), and in particular \(\E [X_1^2]\) is finite.

    Our strategy is to show that the characteristic function of \(Y_n\) converges to the characteristic function of the \(N(0,1)\) distribution, and then apply Theorem 7.4.5. Let \(\psi \) be the common characteristic function of the \(X_{n}\), given by \(\psi (u) = \E [e^{iuX_{1}}]\) for all \(u \in \R \). Let \(\phi _{n}\) be the characteristic function of \(Y_{n}\) for each \(\nN \). Using independence and Lemma 7.4.3 we have that

    \begin{equation} \label {eq:phinu_1} \phi _{n}(u) = \E \l [e^{i\frac {u}{\sqrt {n}}(X_{1} + X_{2} + \cdots + X_{n})}\r ] = \psi (u/\sqrt {n})^{n}. \end{equation}

    Applying Lemma 7.4.7 with \(n=2\) to \(\psi \), for all \(y\in \R \) we have

    \[\l |\psi (y)-\l (1+iy\E [X_1]-\frac {y^2}{2}\E [X_1^2]\r )\r |\leq \E \l [\min \l \{|yX_1|^2,\frac {|yX_1|^3}{6}\r \}\r ].\]

    Setting \(y=u/\sqrt {n}\) and using that \(\E [X_1]=0\) and \(\E [X_1^2]=1\), we obtain

    \[\l |\psi (u/\sqrt {n})-\l (1-\frac {u^2}{2n}\r )\r |\leq \E \l [\min \l \{\frac {u^2|X_1|^2}{n},\frac {|uX_1|^3}{6n^{3/2}}\r \}\r ].\]

    Let us write \(\theta _n(u)=\E \l [\min \l \{\frac {u^2|X_1|^2}{n},\frac {|uX_1|^3}{6n^{3/2}}\r \}\r ]\). Putting the above into (7.10), we obtain

    \[\l (1-\frac {u^2/2}{n}-\theta _n(u)\r )^n\leq \phi _n(u)\leq \l (1-\frac {u^2/2}{n}+\theta _n(u)\r )^n\]

    and thus

    \begin{equation} \label {eq:phinu_2} \l (1+\frac {-u^2/2-n\theta _n(u)}{n}\r )^n\leq \phi _n(u)\leq \l (1+\frac {-u^2/2+n\theta _n(u)}{n}\r )^n. \end{equation}

    We have \(n\theta _n(u)=\E \l [\min \l \{u^2|X_1|^2,\frac {|uX_1|^3}{6n^{1/2}}\r \}\r ]\). Note that \(\min \l \{u^2|X_1|^2,\frac {|uX_1|^3}{6n^{1/2}}\r \}\in \Lone \) by Lemma 4.6.1 because \(|X_1|^2 \in \Lone \), and that \(\min \l \{u^2|X_1|^2,\frac {|uX_1|^3}{6n^{1/2}}\r \}\to 0\) pointwise as \(n\to \infty \). By the dominated convergence theorem, with dominating function \(u^2|X_1|^2\), we have \(n\theta _n(u)\to 0\) as \(n\to \infty \). Applying Lemma 7.5.2 to both sides of (7.11), with \(y=-u^2/2\) and \(\alpha _n=\mp n\theta _n(u)\), we thus have \(\phi _n(u)\to e^{-\frac {u^2}{2}}\) for all \(u\in \R \). By Problem 7.9, \(e^{-\frac {u^2}{2}}\) is the characteristic function of the \(N(0,1)\) distribution. Hence from Theorem 7.4.5 we have \(Y_n\tod N(0,1)\), as required.   ∎

    7.5.1 Further discussion (\(\star \))

    The central limit theorem is arguably the single most important result in probability and statistics. It is likely to be one of the earliest things you learned about, even though we could not give a rigorous proof until now.

    The picture we know today was pieced together gradually by many different people. The sheer number of mathematicians that were involved in efforts to prove central limit theorems, particularly in the late 19th and early 20th century, makes a concise description of its history all but impossible. The modern statement of Theorem 7.5.1 is not attributed to any single author. The term ‘central limit’ is generally thought to have been introduced by the Hungarian mathematician George Pólya in 1920.

    One of the first examples is the case of tosses of a fair coin (taking e.g. \(+1\) for heads and \(-1\) for tails), studied by de Moivre in 1733 and later extended by Laplace – see Exercise 7.10 for this case. Around 1890, Chebyshev was the first mathematician to consider formulating the central limit theorem in terms of a sequence of independent random variables. Before that point, the results were formulated in terms of the convergence of particular probabilities. As we noted at the start of Chapter 5, the foundation of probability theory in terms of Lebesgue measure was also not established at that time. In fact, early proofs of what became the central limit theorem were based on extremely difficult calculations. Substantial effort was made to find less convoluted proofs, eventually leading to the argument given in these notes.

    The version given in Theorem 7.5.1 has been extensively generalised during the 20th century. For example, conditions are known under which the central limit theorem holds for dependent sequences of random variables, for martingales, for stochastic processes with independent increments, and for cases where the normalization differs from that of subtracting the mean and dividing by the standard deviation.

    We will discuss only a few such results here. If the i.i.d. sequence \((X_{n})\) is such that \(\mu = 0\) and \(\E [|X_{n}|^{3}] = \rho ^{3} < \infty \), the Berry-Esseen theorem gives a useful bound for the difference between the cdf of the normalised sum and the cdf \(\Phi \) of the standard normal. To be precise we have that for all \(x \in \R , \nN \):

    \[ \left |\P \left (\frac {S_{n}}{\sigma \sqrt {n}} \leq x\right ) - \Phi (x)\right | \leq C\frac {\rho }{\sqrt {n}\sigma ^{3}},\]

    where \(C > 0\) is an absolute constant, which does not depend on \(n\), on \(x\), or on the distribution of the \(X_{n}\).
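
    For example, if the \(X_{n}\) take the values \(\pm 1\) with probability \(\frac {1}{2}\) each, then \(\mu =0\), \(\sigma =1\) and \(\rho ^{3}=\E [|X_{n}|^{3}]=1\), so the bound becomes \(\l |\P \l (\frac {S_{n}}{\sqrt {n}}\leq x\r )-\Phi (x)\r |\leq \frac {C}{\sqrt {n}}\): the error in the normal approximation decays at rate \(n^{-1/2}\), uniformly in \(x\).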

    We can also relax the requirement that the sequence \((X_{n})\) be independent. Consider the triangular array \((X_{nk}, k=1,\ldots ,n, \nN )\) of random variables which we may list as follows:

    \[\begin {array}{c c c c c } X_{11} & ~ & ~ & ~ & ~\\ X_{21} & X_{22} & ~ & ~ & ~\\ X_{31} & X_{32} & X_{33} & ~ & ~\\ \vdots & \vdots & \vdots & ~ & ~ \\ X_{n1} & X_{n2} & X_{n3} & \ldots & X_{nn}\\ \vdots & \vdots & \vdots & \vdots & \vdots \end {array}\]

    We assume that the random variables within each row are independent, but we allow random variables in different rows to be dependent. Assume further that \(\E [X_{nk}] = 0\) and \(\sigma _{nk}^{2} = \E [X_{nk}^{2}] < \infty \) for all \(k,n\). Define the row sums \(S_{n} = X_{n1} + X_{n2} + \cdots + X_{nn}\) for all \(\nN \) and define \(\tau _{n}^{2} = \var (S_{n}) = \sum _{k=1}^{n}\sigma _{nk}^{2}\). Lindeberg’s central limit theorem states that if we have the asymptotic tail condition

    \[ \lim _{n \rightarrow \infty }\sum _{k=1}^{n}\frac {1}{\tau _{n}^{2}}\int _{|X_{nk}| \geq \eps \tau _{n}}X_{nk}^{2}(\omega )\,d\P (\omega ) = 0,\]

    for all \(\eps > 0\) then \(\frac {S_{n}}{\tau _{n}}\) converges in distribution to a standard normal as \(n \rightarrow \infty \).
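
    To see that this generalises Theorem 7.5.1 (in the mean-zero case), take \(X_{nk}=X_{k}\), where \((X_{k})\) is an i.i.d. sequence with \(\E [X_{1}]=0\) and \(\var (X_{1})=\sigma ^{2}<\infty \). Then \(\tau _{n}^{2}=n\sigma ^{2}\) and the sum in the condition above is equal to

    \[ \frac {1}{\sigma ^{2}}\int _{|X_{1}|\geq \eps \sigma \sqrt {n}}X_{1}^{2}(\omega )\,d\P (\omega ), \]

    which tends to \(0\) as \(n\to \infty \) by the dominated convergence theorem, with dominating function \(X_{1}^{2}\in \Lone \), because \(\eps \sigma \sqrt {n}\to \infty \). The conclusion is then that \(\frac {S_{n}}{\sigma \sqrt {n}}\) converges in distribution to a standard normal, which is Theorem 7.5.1 with \(\mu =0\).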

    The highlights of this chapter have been the proofs of the law of large numbers and central limit theorem. There is a third result that is often grouped together with the other two as one of the key results about sums of i.i.d. random variables. It is called the law of the iterated logarithm and it gives bounds on the fluctuations of \(S_{n}\) for an i.i.d. sequence with \(\mu = 0\) and \(\sigma = 1\). The result is quite remarkable. It states that almost surely,

    \begin{equation*} \ds \li \frac {S_{n}}{\sqrt {2n \log \log (n)}} = -1, \quad \quad \ds \ls \frac {S_{n}}{\sqrt {2n \log \log (n)}} = 1. \end{equation*}

    This means that (with probability one) if \(c > 1\) then only finitely many of the events \(S_{n} > c\sqrt {2n \log \log (n)}\) occur, but if \(c < 1\) then infinitely many such events occur. The analogous statement holds, the other way up, at \(-1\). This gives a very precise description of the long-term behaviour of \(S_n\).
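
    Note that \(\sqrt {2n\log \log (n)}/n\to 0\) as \(n\to \infty \), so the law of the iterated logarithm also implies that \(S_{n}/n\to 0\) almost surely; it recovers, and considerably refines, the law of large numbers in this setting.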
