0$, choose $N$ s.t. $m,n>N\Rightarrow\Vert f_{m}-f_{n}\Vert_{p}<\delta^{1/p}\epsilon$.
Now $\Vert f_{m}-f_{n}\Vert_{p}^{p}\geq\epsilon^{p}\mu(\{\omega\colon\vert f_{m}(\omega)-f_{n}(\omega)\vert>\epsilon\})$
so $m,n>N\Rightarrow\mu(\{\omega\colon\vert f_{m}(\omega)-f_{n}(\omega)\vert>\epsilon\})<\delta$.
\medskip{}
We earlier had a theorem that if a sequence converges in measure,
then a subsequence converges almost everywhere (compare the next part
of this proof to the proof of that theorem). In fact, it is sufficient
for the sequence to be Cauchy in measure.
Choose a subsequence $(f_{n_{k}})$ s.t. $m,n\geq n_{k}\Rightarrow\mu(\{\omega\colon\vert f_{m}(\omega)-f_{n}(\omega)\vert>2^{-k}\})<2^{-k}$,
so that\[
\sum_{k}\mu(\{\omega\colon\vert f_{n_{k}}(\omega)-f_{n_{k+1}}(\omega)\vert>2^{-k}\})<\infty\]
and so by Borel-Cantelli I, $\mu(A)=0$ where\[
A=\limsup_{k\rightarrow\infty}\{\omega\colon\vert f_{n_{k}}(\omega)-f_{n_{k+1}}(\omega)\vert>2^{-k}\}\]
Unpacking the definition of $A$ shows that if $\omega\in A^{c}$,
then there is some $k$ (depending on $\omega$) s.t. $j\geq k\Rightarrow\vert f_{n_{j}}(\omega)-f_{n_{j+1}}(\omega)\vert\leq2^{-j}$,
and so $i,j\geq k\Rightarrow\vert f_{n_{i}}(\omega)-f_{n_{j}}(\omega)\vert\leq\sum_{l\geq k}2^{-l}=2^{-k+1}$.
Hence for $\omega\in A^{c}$, the sequence $(f_{n_{k}}(\omega))$
is Cauchy (in $\R$), so has a limit.
\medskip{}
Define $f(\omega)=\limsup f_{n_{k}}(\omega)$. This is measurable,
and if $\omega\in A^{c}$ then $f(\omega)=\lim f_{n_{k}}(\omega)$
so $f_{n_{k}}\rightarrow f$ almost everywhere.
We need to check that $f\in L^{p}$. This follows from Fatou since
$\vert f\vert^{p}=\liminf_{k}\vert f_{n_{k}}\vert^{p}$ a.e., so $\int\vert f\vert^{p}d\mu\leq\liminf_{k}\int\vert f_{n_{k}}\vert^{p}d\mu$
and $\int\vert f_{n_{k}}\vert^{p}d\mu\leq(\Vert f_{n_{k}}-f_{n_{1}}\Vert_{p}+\Vert f_{n_{1}}\Vert_{p})^{p}\leq(1+\Vert f_{n_{1}}\Vert_{p})^{p}$
for all $k$ (taking $n_{1}$ large enough that $m,n\geq n_{1}\Rightarrow\Vert f_{m}-f_{n}\Vert_{p}<1$,
which is possible since $(f_{n})$ is Cauchy).
\medskip{}
Finally we show that $\Vert f_{n}-f\Vert_{p}\rightarrow0$.
Given $\epsilon>0$, choose $K$ s.t. $m,n\geq n_{K}\Rightarrow\Vert f_{n}-f_{m}\Vert_{p}<\epsilon$.
For $m\geq n_{K}$, $\vert f_{m}-f\vert^{p}=\lim_{k}\vert f_{m}-f_{n_{k}}\vert^{p}=\liminf_{k}\vert f_{m}-f_{n_{k}}\vert^{p}$
a.e. so by Fatou, \[
\int\vert f_{m}-f\vert^{p}d\mu\leq\liminf_{k\rightarrow\infty}\int\vert f_{m}-f_{n_{k}}\vert^{p}d\mu\leq\epsilon^{p}\]
So $m\geq n_{K}\Rightarrow\Vert f_{m}-f\Vert_{p}\leq\epsilon$.\end{proof}
\subsection{The space \texorpdfstring{$L^\infty$}{L-infinity} }
There is one more space to consider: the space $L^{\infty}$ of essentially
bounded functions.
A function $f\in\mS$ is \defterm{essentially bounded} if there is
some $K$ s.t. $\mu(\{\omega\colon\vert f\vert>K\})=0$.
We define a norm (the \defterm{essential supremum}) by $\Vert f\Vert_{\infty}={\displaystyle \inf_{N\in\cS,\mu(N)=0}\sup_{\omega\not\in N}\vert f(\omega)\vert}$.
Let $\cL^{\infty}$ be the set of essentially bounded functions $(S,\cS)\rightarrow\R$,
and identify a.e.-equal functions to form the quotient $L^{\infty}$
as before. $\Vert\cdot\Vert_{\infty}$ is a norm on $L^{\infty}$;
the triangle inequality is a piece of straightforward analysis.
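\medskip{}
For example, take Lebesgue measure on $[0,1]$ and let $f$ be the
indicator of the set of rationals in $[0,1]$. Then $\sup_{\omega}\vert f(\omega)\vert=1$,
but the rationals form a null set $N$, and taking this $N$ in the
definition gives $\Vert f\Vert_{\infty}=0$; indeed $f=0$ in $L^{\infty}$.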
A sequence $f_{n}\rightarrow f$ in $L^{\infty}$ iff there is a null
set $N$ s.t. $f_{n}\rightarrow f$ uniformly on $S\backslash N$
(this uses the fact that a countable union of null sets is null).
The relationship between $L^{\infty}$ convergence and uniform convergence
is the same as the relationship between almost everywhere convergence
and pointwise convergence.
We can use this to show that $L^{\infty}$ is complete, by showing
that a Cauchy sequence in $L^{\infty}$ is uniformly Cauchy except
on a null set, then using the completeness of the uniform norm.
\section{$L^{2}$ and conditional expectation}
%
\begin{framed}%
{[}1{]} $L^{2}$ as a Hilbert space. Orthogonal projection, relation
with conditional probability. Variance and covariance.\end{framed}
\subsection{Conditional probability}
In elementary probability, for an event $B$ with $\P(B)>0$, and
for any event $A$, we define the \defterm{conditional probability of $A$ given $B$}
as\[
\P(A|B)=\frac{\P(A\cap B)}{\P(B)}\]
Suppose we perform an experiment with a finite set of possible outcomes,
with events $B_{1},\ldots,B_{n}$ corresponding to each outcome (these
events are disjoint and cover $\Omega$, and must all have non-zero
probability). After gaining this information, we want to know the
probability of some other event $A$; this will be $\P(A|B_{j})$
for the appropriate $j$. Hence our guess as to the probability of
$A$ after the experiment is itself a random variable:\[
g(\omega)=\P(A\cap B_{j})/\P(B_{j})\mbox{ when }\omega\in B_{j}\]
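For example, roll a fair die: $\Omega=\{1,\ldots,6\}$ with uniform
probability, and suppose the experiment reports the parity of the
roll, so $B_{1}=\{2,4,6\}$ and $B_{2}=\{1,3,5\}$. Taking $A=\{4,5,6\}$
gives $g(\omega)=\P(A\cap B_{1})/\P(B_{1})=\frac{1/3}{1/2}=\frac{2}{3}$
when $\omega$ is even, and $g(\omega)=\frac{1/6}{1/2}=\frac{1}{3}$
when $\omega$ is odd.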
Now consider a more sophisticated experiment, where the set of possible
outcomes may be infinite and may contain zero-probability events.
Let $\cG$ be the set of all events for which the experiment tells
us whether they occur or not; these form a sub-$\sigma$-field of
$\cF.$ (In the above finite case, $\cG=\sigma(B_{1},\ldots,B_{n})$
is the set of all possible unions of the $B_{j}$. If the experiment
measures some random variable $X$, then $\cG=\sigma(X)$.)
Again, we want a random variable $g$ which gives the probability
of $A$ after knowing the results of this experiment.
The value of $g$ must of course depend only on events in $\cG$;
for example, in the case $\cG=\sigma(B_{1},\ldots,B_{n})$, $g$ is
constant on each $B_{j}$. This condition is made precise by saying
that $g$ must be $\cG$-measurable (note that since $\cG\subset\cF$,
this is a stronger condition than being $\cF$-measurable).
Secondly, consider $\P(A\cap B)$ for any $B\in\cG$. Once we have
performed the experiment, we know either that $B$ did not occur,
so $A\cap B$ certainly does not occur, or $B$ did occur, in which
case $A\cap B$ occurs with conditional probability $g(\omega)$.
If we add (integrate) this for all $\omega$, then we should get $\P(A\cap B)$:
$\int_{B}gd\P=\P(A\cap B)$.
\medskip{}
So we define a \defterm{(version of the) conditional probability of $A$ given $\cG$}
to be a random variable $\P(A|\cG)$ s.t.
\begin{enumerate}
\item $\P(A|\cG)$ is $\cG$-measurable
\item For any $B\in\cG$, $\int_{B}\P(A|\cG)d\P=\P(A\cap B)$
\end{enumerate}
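\medskip{}
In the finite case above, $g$ is indeed a version of $\P(A|\cG)$:
it is constant on each $B_{j}$, hence $\sigma(B_{1},\ldots,B_{n})$-measurable,
and any $B\in\cG$ is a disjoint union $B=B_{j_{1}}\cup\ldots\cup B_{j_{r}}$,
so\[
\int_{B}g\, d\P=\sum_{i=1}^{r}\frac{\P(A\cap B_{j_{i}})}{\P(B_{j_{i}})}\P(B_{j_{i}})=\sum_{i=1}^{r}\P(A\cap B_{j_{i}})=\P(A\cap B)\]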
It is not at all obvious that such a random variable exists in general,
or that it is unique (hence the need to talk about versions); we shall
worry about this later.
\subsection{Conditional expectation}
If we fix $\cG$ and $\omega$, and allow $A$ to vary, then $\P(A|\cG)(\omega)$
gives a probability measure on $(\Omega,\cF)$. We can integrate with
respect to this measure to get the conditional expectation of a random
variable $X$:\[
\E(X|\cG)(\omega)=\int_{\Omega}Xd(\P(-|\cG)(\omega))\]
Now fixing $X$ and considering this as a function of $\omega$, we
get a new random variable $\E(X|\cG)$. This will have the properties
(these follow from the corresponding properties for conditional probability,
by the standard machine):
\begin{enumerate}
\item $\E(X|\cG)$ is $\cG$-measurable
\item For any $B\in\cG$, $\int_{B}\E(X|\cG)d\P=\int_{B}Xd\P$
\end{enumerate}
We define a \defterm{(version of the) conditional expectation of $X$ given $\cG$}
to be a random variable with these two properties.
Again it is not clear that such a random variable exists. And it is
only unique up to almost sure equality: if $g$ is a version of $\E(X|\cG)$
and $f$ a $\cG$-measurable r\@.v\@. which is zero almost surely,
then $g+f$ is still a version of $\E(X|\cG)$.
However, if $g_{1},g_{2}$ are two versions of $\E(X|\cG)$, then
let $B=\{\omega:g_{1}(\omega)\geq g_{2}(\omega)+\epsilon\}$. $B\in\cG$
since $g_{1},g_{2}$ are $\cG$-measurable so $\int_{B}g_{1}d\P=\int_{B}Xd\P=\int_{B}g_{2}d\P$
but $\int_{B}(g_{1}-g_{2})d\P\geq\epsilon\P(B)$, so $\P(B)=0$ for
every $\epsilon>0$, i.e. $g_{1}\leq g_{2}$ almost surely. Exchanging
$g_{1}$ and $g_{2}$ gives $g_{1}=g_{2}$ almost surely.
\subsection{$L^{2}$ as a Hilbert space}
We shall prove the existence of conditional expectation for square-integrable
random variables, by exploiting the fact that $L^{2}(\Omega,\cF,\P)$
is a Hilbert space, i.e. its norm is given by an inner product. (Linear
Analysis sheds more light on why this is the case, specifically that
$L^{2}$ is self-dual.) The inner product is \[
\left\langle f,g\right\rangle =\int fgd\mu\]
which is finite for $f,g\in L^{2}$ by Hölder (the case $p=q=2$,
often called the Schwarz or Cauchy-Schwarz inequality).
In the case of probability, the inner product has an interpretation
in terms of covariance: the \defterm{covariance} of $X,Y\in L^{2}$
is $\cov(X,Y)=\left\langle X-\E X,Y-\E Y\right\rangle =\left\langle X,Y\right\rangle -\E(X)\E(Y)$.
The \defterm{variance} of a random variable is $\var(X)=\cov(X,X)=\Vert X-\E X\Vert_{2}^{2}$,
which is finite iff $X\in L^{2}$.
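The second expression for the covariance follows from bilinearity
of the inner product, using $\left\langle X,1\right\rangle =\E X$
and $\left\langle 1,1\right\rangle =\P(\Omega)=1$:\[
\left\langle X-\E X,Y-\E Y\right\rangle =\left\langle X,Y\right\rangle -\E X\left\langle 1,Y\right\rangle -\E Y\left\langle X,1\right\rangle +\E X\,\E Y=\left\langle X,Y\right\rangle -\E(X)\E(Y)\]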
The benefit of Hilbert space is that we can apply geometrical ideas.
In particular there is an idea of orthogonality: $f,g\in L^{2}$ are
\defterm{orthogonal} if $\left\langle f,g\right\rangle =0$. We also
have the parallelogram law: \[
\Vert f+g\Vert^{2}+\Vert f-g\Vert^{2}=2(\Vert f\Vert^{2}+\Vert g\Vert^{2})\]
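The parallelogram law is immediate from the inner product: expanding\[
\Vert f\pm g\Vert^{2}=\left\langle f\pm g,f\pm g\right\rangle =\Vert f\Vert^{2}\pm2\left\langle f,g\right\rangle +\Vert g\Vert^{2}\]
and adding the two identities makes the cross terms cancel.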
\subsection{Orthogonal projection}
An important geometrical property of Hilbert spaces is the following,
proved in Linear Analysis:
\begin{theorem}If $G$ is a closed subspace of a Hilbert space $H$,
then there is a unique continuous linear map $\pi:H\rightarrow G$
s.t. $\left\langle f-\pi f,g\right\rangle =0$ for all $f\in H,g\in G$.
This map has the properties:
\begin{enumerate}
\item $\Vert f-\pi f\Vert\leq\Vert f-g\Vert$ for all $g\in G$
\item $\pi g=g$ for all $g\in G$
\item $\left\langle \pi f,g\right\rangle =\left\langle f,\pi g\right\rangle $
for all $f,g\in H$
\end{enumerate}
\end{theorem}
This map $\pi$ is called the \defterm{orthogonal projection} from
$H$ to $G$. Note that {}``$G$ a closed subspace'' means that
$G$ must be closed in the topology on $H$ and must be a subspace
of $H$ as a vector space.
Work with $H=L^{2}(\Omega,\cF,\P)$ and let $\cG$ be a sub-$\sigma$-field
of $\cF$. The space $G=L^{2}(\Omega,\cG,\P)$ is a closed subspace
of $H$ (closed because it is complete), so we may let $\pi$ be the
orthogonal projection $H\rightarrow G$.
For any $X\in H$, $\pi(X)$ is $\cG$-measurable since $\pi(X)\in G$.
And for any $B\in\cG$, $I_{B}\in G$ so\[
\int_{B}\pi(X)d\P=\left\langle I_{B},\pi(X)\right\rangle =\left\langle \pi(I_{B}),X\right\rangle =\left\langle I_{B},X\right\rangle =\int_{B}Xd\P\]
Hence $\pi(X)$ is a conditional expectation of $X$ given $\cG$.
Since we are working with $L^{2}$ rather than $\cL^{2}$, this is
only defined up to almost sure equality, corresponding to the uniqueness
property of conditional expectation.
Intuitively, a projection is an operation into a subspace which discards
information lying outside that subspace: for example, consider $\R^{n}$
with basis $\{e_{1},\ldots,e_{n}\}$. The orthogonal projection onto
the subspace generated by $\{e_{1},\ldots,e_{m}\}$ corresponds to
throwing away coordinates after the $m$-th. In the conditional expectation
case, $\pi$ is throwing away information not contained in the $\sigma$-field
$\cG$.
The property $\Vert X-\E(X|\cG)\Vert_{2}\leq\Vert X-Y\Vert_{2}$ for
all $Y\in G$ shows that for $X\in L^{2}$, $\E(X|\cG)$ is the best
estimate we can give for $X$ after knowing the information represented
by $\cG$, if we measure {}``best'' by minimising $\E(\vert X-Y\vert^{2})$.
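\medskip{}
As a concrete finite example, let $\Omega=\{1,\ldots,6\}$ with uniform
probability, $X(\omega)=\omega$ and $\cG=\sigma(\{1,3,5\},\{2,4,6\})$.
A $\cG$-measurable random variable is constant on $\{1,3,5\}$ and
on $\{2,4,6\}$, and the defining property $\int_{B}\E(X|\cG)d\P=\int_{B}Xd\P$
with $B=\{1,3,5\}$ forces the constant value $c$ there to satisfy
$c/2=(1+3+5)/6$, i.e. $c=3$; similarly $\E(X|\cG)=4$ on $\{2,4,6\}$.
So $\E(X|\cG)$ replaces $X$ by its average over each cell of $\cG$,
and this is exactly the $\cG$-measurable $Y$ minimising $\E(\vert X-Y\vert^{2})$.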
\section{Ergodicity}
%
\begin{framed}%
{[}4{]} The strong law of large numbers, proof for independent random
variables with bounded fourth moments. Measure preserving transformations,
Bernoulli shifts. Statements {*}and proofs{*} of maximal ergodic theorem
and Birkhoff's almost everywhere ergodic theorem, proof of the strong
law.\end{framed}
\subsection{Strong law of large numbers}
\begin{theorem}[Strong law of large numbers]If $X_{n}$ are i.i.d.
r.v.s with $\E\vert X_{1}\vert<\infty$ then\[
\overline{X}_{n}=\frac{X_{1}+\ldots+X_{n}}{n}\rightarrow\E X_{1}\mbox{ almost surely.}\]
\end{theorem}
The strong law of large numbers tells us that if we take finite samples
from some probability distribution, as we increase the sample size,
the sample mean converges almost surely to the expectation of the
distribution. This is important because it justifies the intuitive
understanding of expectation and many applications in statistics.
The weak law is a strictly weaker result than the strong law: it says
that, under the same hypothesis, the sample mean converges in probability
to the expectation. It is still useful because it has a much simpler
proof. One sign that the strong law is more delicate: it is not even
obvious that $\{\omega\colon\overline{X}_{n}\rightarrow\E X_{1}\}$
is measurable. It is, because each $\overline{X}_{n}$ is certainly
measurable, so $\limsup_{n}\overline{X}_{n}$ and $\liminf_{n}\overline{X}_{n}$
are measurable, and hence so is $\{\limsup\overline{X}_{n}=\liminf\overline{X}_{n}=\E X_{1}\}$.
No such issue arises for the weak law, which only talks about the
events $\{\vert\overline{X}_{n}-\E X_{1}\vert>\epsilon\}$.
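\medskip{}
If we additionally assume that the variables have finite variance
$\sigma^{2}$ (a stronger hypothesis than $\E\vert X_{1}\vert<\infty$),
the weak law follows in one line from Chebyshev's inequality: by
independence $\var(\overline{X}_{n})=\sigma^{2}/n$, so\[
\P(\vert\overline{X}_{n}-\E X_{1}\vert>\epsilon)\leq\frac{\var(\overline{X}_{n})}{\epsilon^{2}}=\frac{\sigma^{2}}{n\epsilon^{2}}\rightarrow0\]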
There are many proofs of the strong law, which vary in the conditions
they require and techniques they use. We begin with an easy proof,
subject to the assumption that the random variables have bounded fourth
moments. Note that this proof does not require the variables to be
identically distributed; the condition that $\E X_{n}=0$ is just
for simplicity and can be obtained by translating.
\begin{theorem}If $X_{n}$ are independent r.v.s with $\E X_{n}=0$
and $\E(X_{n}^{4})\leq K$ for all $n$, then $\frac{X_{1}+\ldots+X_{n}}{n}\rightarrow0$
almost surely.\end{theorem}
\begin{theorem}[Maximal Ergodic Theorem]If $T$ is measure-preserving,
$f\in L^{1}$, $S_{n}=\sum_{k=1}^{n}f(T^{k-1}\omega)$ and $A=\{\omega\colon\sup_{n}S_{n}>0\}$,
then $\int_{A}fd\mu\geq0$.\end{theorem}
\begin{proof}Let $M_{n}=\max\{0,S_{1},\ldots,S_{n}\}$, and $A_{n}=\{\omega\colon M_{n}>0\}$.
Then $M_{n}$ is increasing, and $A_{n}\uparrow A$ so $fI_{A_{n}}\rightarrow fI_{A}$.
And $\vert fI_{A_{n}}\vert\leq\vert f\vert\in L^{1}$ so by the Dominated
Convergence Theorem,\[
\int_{A_{n}}fd\mu\rightarrow\int_{A}fd\mu\]
Hence we just need to show $\int_{A_{n}}fd\mu\geq0$ for each $n$.
Consider the shifted sums $S_{n}'=S_{n}\circ T=\sum_{k=1}^{n}f(T^{k}\omega)$.
Let $M_{n}'=M_{n}\circ T=\max\{0,S_{1}',\ldots,S_{n}'\}$ and $A_{n}'=T^{-1}(A_{n})=\{\omega\colon M_{n}'>0\}$.
Now $S_{n}=f+S_{n-1}'$ (with $S_{0}'=0$), so if $M_{n}>0$ then $M_{n}=\max\{S_{1},\ldots,S_{n}\}=f+\max\{0,S_{1}',\ldots,S_{n-1}'\}=f+M_{n-1}'$.
Hence $M_{n}I_{A_{n}}-M_{n-1}'I_{A_{n}}\leq fI_{A_{n}}$.
But we can control $M_{n-1}'I_{A_{n}}$ using $M_{n-1}'I_{A_{n}}\leq M_{n}'I_{A_{n}}\leq M_{n}'I_{A_{n}'}=(M_{n}I_{A_{n}})\circ T$,
so\[
\int_{A_{n}}f\geq\int(M_{n}-M_{n-1}')I_{A_{n}}\geq\int M_{n}I_{A_{n}}-\int(M_{n}I_{A_{n}})\circ T=0\]
(the last equality since $T$ is measure-preserving).\end{proof}
\begin{theorem}[Birkhoff's Ergodic Theorem]If $T:(\Omega,\cF,\P)\rightarrow(\Omega,\cF,\P)$
is measure-preserving, $\cI$ is the invariant $\sigma$-field of
$T$, and $f\in L^{1}$ then\[
\frac{1}{n}\sum_{k=1}^{n}f(T^{k-1}\omega)\rightarrow\E(f|\cI)\mbox{ almost surely.}\]
\end{theorem}
\begin{proof}Assume wlog that $\E(f|\cI)=0$.
Let $X_{n}=f(T^{n-1}\omega),S_{n}=\sum_{k=1}^{n}X_{k}$.
Given $\epsilon>0$, let $A=\{\omega\colon\limsup_{n}S_{n}/n>\epsilon\}$.
We need to show that $\P(A)=0$.
Let $f'=(f-\epsilon)I_{A}$, $X_{n}'=(X_{n}-\epsilon)I_{A}$ and $S_{n}'=\sum_{k=1}^{n}X_{k}'=(S_{n}-n\epsilon)I_{A}$.
Now if $A$ occurs then $S_{n}>n\epsilon$ for some (in fact infinitely
many) $n$, so $\sup S_{n}'>0$.
And if $\sup S_{n}'>0$, then $I_{A}=1$ so $A$ occurs.
So $A=\{\omega\colon\sup S_{n}'>0\}$.
Now for any fixed $\omega$, $X_{1}/n\rightarrow0$ as $n\rightarrow\infty$,
and $S_{n}=X_{1}+S_{n-1}\circ T$, so $\limsup_{n}S_{n}/n=\limsup_{n}(S_{n-1}\circ T)/n=\limsup_{n}(S_{n}\circ T)/n$.
Hence $A=T^{-1}(A)$, so $I_{A}\circ T=I_{A}$ and $X_{n}'=f'(T^{n-1}\omega)$:
the $S_{n}'$ are the ergodic sums of $f'$.
Hence by the maximal ergodic theorem, \[
0\leq\int_{A}X_{1}'d\P=\int_{A}X_{1}d\P-\epsilon\P(A)\]
Now $A\in\cI$ so the definition of conditional expectation gives
\[
\int_{A}X_{1}d\P=\int_{A}\E(X_{1}|\cI)d\P=0\]
so we get $\epsilon\P(A)\leq0$, giving $\P(A)=0$. Since $\epsilon>0$
was arbitrary, $\limsup_{n}S_{n}/n\leq0$ almost surely; applying the
same argument to $-f$ gives $\liminf_{n}S_{n}/n\geq0$ almost surely,
so $S_{n}/n\rightarrow0=\E(f|\cI)$ almost surely.\end{proof}
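\medskip{}
In outline, the strong law for i.i.d. $X_{n}$ with $\E\vert X_{1}\vert<\infty$
follows from Birkhoff's theorem: take $\Omega$ to be the space of
real sequences, carrying the law of $(X_{1},X_{2},\ldots)$, let $T$
be the shift $T(x_{1},x_{2},\ldots)=(x_{2},x_{3},\ldots)$, and let
$f(x)=x_{1}$. Since the $X_{n}$ are i.i.d., $T$ is measure-preserving,
so $\overline{X}_{n}=\frac{1}{n}\sum_{k=1}^{n}f(T^{k-1}x)\rightarrow\E(f|\cI)$
almost surely. But an invariant event $A$ satisfies $A=T^{-n}(A)$,
which depends only on the coordinates $x_{n+1},x_{n+2},\ldots$, so
$\cI$ is contained in the tail $\sigma$-field; by Kolmogorov's zero-one
law $\cI$ is trivial, and hence $\E(f|\cI)=\E X_{1}$ almost surely.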
\end{document}