What is probability? There are two main approaches to understanding this concept. In the **frequentist** interpretation, probability represents the long-run frequency of events. In the **Bayesian** interpretation, probability is used to quantify our **uncertainty** about a specific event.

In the frequentist view, for example, if we flip a fair coin many times, the fraction of tosses that land heads will approach 0.5.
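The long-run frequency claim can be illustrated with a small simulation. This is only a sketch: the helper name `heads_fraction`, the fixed seed, and the sample sizes are assumptions made for the example.

```python
import random

def heads_fraction(n_flips, seed=0):
    """Simulate n_flips tosses of a fair coin and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

# The fraction of heads should settle near 0.5 as the number of flips grows.
for n in (10, 1_000, 100_000):
    print(n, heads_fraction(n))
```

For small n the fraction can be far from 0.5; only the long-run behavior is stable.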

In the Bayesian view, the same statement means that we believe the coin is equally likely to land heads or tails on the next toss.

One big advantage of the Bayesian interpretation is that it can model our uncertainty about events that have no long-run frequency, because it is tied to information rather than to repeated trials. When repeated trials are not available, the Bayesian approach is the natural choice.

## Brief review of probability theory

### 1.1 Discrete random variables

The expression p(A) denotes the probability that event A is true. We require 0 \le p(A) \le 1, where p(A)=0 means the event definitely will not happen and p(A)=1 means it definitely will. We write p(\bar{A}) for the probability of the event not A, which is given by p(\bar{A})=1-p(A).

We can extend the notion of binary events by defining a **discrete random variable** $X$, which can take any value from a finite or countably infinite set $\chi$. The probability that $X=x$ is denoted $p(X=x)$, or just $p(x)$ for short. Here $p()$ is called the **probability mass function** or **PMF**. It must satisfy two conditions: 0\le p(x)\le 1 and \sum_{x\in \chi}p(x)=1.

### 1.2 Fundamental rules

#### 1.2.1 Probability of a union of two events

Given two events A and B, we define the probability of A \cup B as follows:

p(A \cup B) =p(A) + p(B) -p(A\cap B)
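As a quick check of the union rule, the sketch below uses a fair six-sided die. The events "even" and "greater than 3" are illustrative choices, not part of the original text.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                 # faces of a fair six-sided die
A = {x for x in outcomes if x % 2 == 0}     # event "even": {2, 4, 6}
B = {x for x in outcomes if x > 3}          # event "greater than 3": {4, 5, 6}

def prob(event):
    """Probability of an event under the uniform distribution on outcomes."""
    return Fraction(len(event), len(outcomes))

lhs = prob(A | B)                           # p(A or B), computed directly
rhs = prob(A) + prob(B) - prob(A & B)       # union rule
print(lhs, rhs)
```

Both sides evaluate to 2/3; subtracting p(A\cap B) corrects for counting the faces 4 and 6 twice.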

#### 1.2.2 Joint probability

We define the probability of the joint event A and B as follows:

p(A,B)=p(A\cap B)=p(A|B)p(B)

This is sometimes called the product rule. Given a joint distribution p(A,B) on two events, we define the marginal distribution as:

p(A)=\sum_b p(A,B=b)=\sum_b p(A|B=b)p(B=b)

where we sum over all possible states of B. We can define p(B) similarly. This is sometimes called the sum rule or the rule of total probability. Applying the product rule repeatedly yields the chain rule:

p(X_{1:D})=p(X_1)p(X_2|X_1)p(X_3|X_2,X_1)...p(X_D|X_{1:D-1})

#### 1.2.3 Conditional Probability

We define the conditional probability of event A, given the event B is true, as follows:

p(A|B)=\frac{p(A,B)}{p(B)} \text{ if } p(B)>0
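The product rule, sum rule, and conditional probability can be checked on a small joint table. The numbers in `joint` are hypothetical and the function names are introduced only for this sketch.

```python
from fractions import Fraction

# Hypothetical joint distribution p(A=a, B=b) over two binary variables.
joint = {
    (0, 0): Fraction(3, 10), (0, 1): Fraction(1, 10),
    (1, 0): Fraction(2, 10), (1, 1): Fraction(4, 10),
}

def marginal_a(a):
    """Sum rule: p(A=a) = sum over b of p(A=a, B=b)."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_b(b):
    """Sum rule: p(B=b) = sum over a of p(A=a, B=b)."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def cond_a_given_b(a, b):
    """Conditional probability p(A=a | B=b) = p(A=a, B=b) / p(B=b)."""
    return joint[(a, b)] / marginal_b(b)

# Product rule: p(A=1, B=1) = p(A=1 | B=1) p(B=1)
assert joint[(1, 1)] == cond_a_given_b(1, 1) * marginal_b(1)
print(marginal_a(0), marginal_a(1), cond_a_given_b(1, 1))
```

Exact fractions are used so that the identities hold without floating-point tolerance.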

### 1.3 Bayes Rule

Combining the definition of conditional probability with the product and the sum rules yields **Bayes Rule**

p(X=x|Y=y)=\frac{p(X=x,Y=y)}{p(Y=y)}=\frac{p(X=x)p(Y=y|X=x)}{\sum_{x'}p(X=x')p(Y=y|X=x')}
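A minimal sketch of Bayes rule for a discrete X follows. The diagnostic-test numbers (1% prevalence, 90% sensitivity, 5% false-positive rate) are hypothetical and serve only to exercise the formula.

```python
def bayes_posterior(prior, likelihood):
    """Return p(X=x | Y=y) for every x.

    prior[x]      = p(X=x)
    likelihood[x] = p(Y=y | X=x) for the observed value y
    """
    # Denominator p(Y=y), obtained with the sum rule.
    evidence = sum(prior[x] * likelihood[x] for x in prior)
    return {x: prior[x] * likelihood[x] / evidence for x in prior}

# Hypothetical numbers; the observation Y=y is "the test came back positive".
prior = {"ill": 0.01, "healthy": 0.99}
likelihood = {"ill": 0.90, "healthy": 0.05}

posterior = bayes_posterior(prior, likelihood)
print(posterior)
```

With these numbers the posterior probability of being ill is about 15%, far below the 90% sensitivity, because the prior is so small.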

### 1.4 Independence and conditional independence

If the joint probability can be written as the product of the two marginals, we say that X and Y are **unconditionally independent** or **marginally independent**, denoted X \perp Y:

X \perp Y \Longleftrightarrow p(X,Y) = p(X)p(Y)

Unconditional independence is rare, however, because most variables influence each other. Usually this influence is mediated by other variables rather than being direct. We therefore say that X and Y are **conditionally independent (CI)** given Z if and only if the conditional joint can be written as a product of conditional marginals:

X \perp Y |Z \Longleftrightarrow p(X,Y|Z)=p(X|Z)p(Y|Z)
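The difference between the two notions can be seen in a toy model, sketched below under the assumption that a binary variable Z acts as a common cause of X and Y. All probabilities are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical model: X and Y each depend only on the binary variable Z.
p_z = {0: F(1, 2), 1: F(1, 2)}
p_x1_given_z = {0: F(1, 10), 1: F(9, 10)}   # p(X=1 | Z=z)
p_y1_given_z = {0: F(2, 10), 1: F(8, 10)}   # p(Y=1 | Z=z)

def bern(p_one, value):
    """Probability of `value` (0 or 1) when the probability of 1 is p_one."""
    return p_one if value == 1 else 1 - p_one

# Joint p(x, y, z) = p(z) p(x|z) p(y|z): X and Y are independent given Z
# by construction.
joint = {
    (x, y, z): p_z[z] * bern(p_x1_given_z[z], x) * bern(p_y1_given_z[z], y)
    for x, y, z in product((0, 1), repeat=3)
}

def p_xy(x, y):
    return sum(joint[(x, y, z)] for z in (0, 1))

def p_x(x):
    return sum(p_xy(x, y) for y in (0, 1))

def p_y(y):
    return sum(p_xy(x, y) for x in (0, 1))

# Does p(X, Y) = p(X) p(Y) hold for every pair of values?
marginally_independent = all(
    p_xy(x, y) == p_x(x) * p_y(y) for x, y in product((0, 1), repeat=2)
)
print(marginally_independent, p_xy(1, 1), p_x(1) * p_y(1))
```

Here conditional independence holds by construction, yet the marginal check fails: once Z is summed out, X and Y carry information about each other.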

### 1.5 Continuous random variables

Suppose X is some uncertain continuous quantity. The probability that X lies in the interval a\le X\le b can be described as follows.

Define the events A=(X\leq a), B=(X\leq b), and W=(a<X\leq b). Then B=A\lor W, and since A and W are mutually exclusive, the sum rule gives

p(B) = p(A) + p(W) \Longleftrightarrow p(W)=p(B)-p(A)

Define F(q)\doteq p(X\le q) as the **cumulative distribution function (CDF)** of X. This is a monotonically increasing function. Using this notation, we have

p(a< X \le b)=F(b)-F(a)

Now define f(x)=\frac{d}{dx}F(x) (assuming the derivative exists), which is called the **probability density function (PDF)**. Given a pdf, we can compute the probability that a continuous variable lies in a finite interval:

P(a\le X \le b)=\int_a^bf(x)dx

As the size of the interval gets smaller, we can write

P(x\le X \le x+dx)\approx f(x)dx
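The relationship between the CDF and the PDF can be checked numerically. The sketch below assumes an exponential distribution with rate 1 purely as a convenient example with a known CDF; `integrate` is a simple midpoint rule written for this illustration.

```python
import math

def cdf(x):
    """CDF of an exponential distribution with rate 1 (illustrative choice)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0

def pdf(x):
    """Derivative of cdf for x > 0."""
    return math.exp(-x) if x > 0 else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

a, b = 0.5, 2.0
via_cdf = cdf(b) - cdf(a)          # F(b) - F(a)
via_pdf = integrate(pdf, a, b)     # integral of the density over [a, b]
print(via_cdf, via_pdf)

# Small-interval approximation: P(x <= X <= x + dx) is close to f(x) dx.
x, dx = 1.0, 1e-4
small_exact = cdf(x + dx) - cdf(x)
small_approx = pdf(x) * dx
print(small_exact, small_approx)
```

The two routes agree up to the discretization error of the midpoint rule.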

### 1.6 Quantiles

Since the cdf F is a monotonically increasing function, it has an inverse F^{-1}. If F is the cdf of X, then F^{-1}(\alpha) is the value x_\alpha such that P(X\le x_\alpha)=\alpha. This is called the \alpha **quantile** of F. The value F^{-1}(0.5) is the **median** of the distribution, and the values F^{-1}(0.25) and F^{-1}(0.75) are the lower and upper **quartiles**.
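For a concrete case, Python's standard library exposes both F and F^{-1} for the normal distribution through `statistics.NormalDist`. Using the standard normal here is an assumption made for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

median = std_normal.inv_cdf(0.5)            # F^{-1}(0.5)
lower_quartile = std_normal.inv_cdf(0.25)   # F^{-1}(0.25)
upper_quartile = std_normal.inv_cdf(0.75)   # F^{-1}(0.75)
print(lower_quartile, median, upper_quartile)

# Applying F to F^{-1}(alpha) recovers alpha.
recovered = std_normal.cdf(upper_quartile)
print(recovered)
```

By symmetry the two quartiles are mirror images around the median at 0, roughly at -0.674 and +0.674.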

### 1.7 Mean and variance

The most familiar property of a distribution is its **mean**, or **expected value**, denoted by \mu. For discrete rv's, it is defined as:

E[X] \doteq \sum_{x\in \chi}xp(x)

For continuous rv's, it is defined as

E[X] \doteq \int_{\chi}xp(x)dx

The variance is a measure of the "spread" of a distribution, denoted by \sigma^2. This is defined as :

var[X]\doteq E[(X-\mu)^2]=\int (x-\mu)^2p(x)dx=E[X^2]-\mu^2
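For a discrete variable the integral in the variance becomes a sum. The sketch below computes the mean and variance of a fair six-sided die (an illustrative choice) and checks that E[(X-\mu)^2] equals E[X^2]-\mu^2.

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}   # fair six-sided die

mean = sum(x * p for x, p in pmf.items())
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())
var_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mean ** 2
print(mean, var_definition, var_shortcut)
```

Both expressions give 35/12, with mean 7/2.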

## Some common discrete distributions

### 2.1 The binomial and Bernoulli distributions

Suppose we toss a coin n times. Let X\in \{0,\ldots,n\} be the number of heads. If the probability of heads is \theta, then we say X has a **binomial** distribution, written as X \sim Bin(n,\theta). The pmf is given by

Bin(k|n,\theta)\doteq \binom{n}{k} \theta^k(1-\theta)^{n-k}

where

\binom{n}{k} \doteq \frac{n!}{(n-k)!k!}

is the number of ways to choose k items from n (C_n^k, known as the binomial coefficient). The distribution has the following mean and variance:

mean = n\theta , var = n \theta (1-\theta)
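The binomial pmf can be evaluated directly with `math.comb`. The sketch below also checks numerically that the pmf sums to 1 and that the mean and variance equal n\theta and n\theta(1-\theta); the values n=10 and \theta=0.3 are hypothetical.

```python
import math

def binomial_pmf(k, n, theta):
    """Bin(k | n, theta) = C(n, k) theta^k (1 - theta)^(n - k)."""
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

n, theta = 10, 0.3   # hypothetical values
pmf = [binomial_pmf(k, n, theta) for k in range(n + 1)]

total = sum(pmf)
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
print(total, mean, var)
```

Up to floating-point rounding, the three printed values are 1, 3, and 2.1.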

Now suppose we toss a coin only once. Let X\in \{0,1\} be a binary random variable, with probability of "success" or "heads" equal to \theta. We say that X has a **Bernoulli** distribution. This is written as X\sim Ber(\theta), where the pmf is defined as:

Ber(x|\theta)=\theta^{I(x=1)}(1-\theta)^{I(x=0)}

In other words,

Ber(x|\theta)=
\begin{cases}
\theta, & x = 1 \\
1-\theta, & x = 0
\end{cases}

### 2.2 The multinomial and multinoulli distributions

The binomial distribution can be used to model the outcomes of coin tosses. To model the outcomes of tossing a K-sided die, we can use the **multinomial** distribution. This is defined as follows: let x=(x_1,\ldots,x_K) be a random vector, where x_j is the number of times side j of the die occurs. Then x has the following pmf:

Mu(x|n,\theta)\doteq \binom{n}{x_1,\ldots,x_K}\prod_{j=1}^K\theta_j^{x_j}

where \theta_j is the probability that side j shows up, and

\binom{n}{x_1,\ldots,x_K}\doteq \frac{n!}{x_1!x_2!\cdots x_K!}

is the **multinomial coefficient** (the number of ways to divide a set of size n=\sum_{k=1}^K x_k into subsets with sizes x_1 up to x_K).

Now suppose n=1. This is like rolling a K-sided die once, so x will be a vector of 0s and 1s in which exactly one bit is turned on. Specifically, if the die shows up as face k, then the k'th bit will be on. In this case, we can think of x as a scalar categorical random variable with K states, and the vector x is its **one-hot encoding** or **dummy encoding**. This special case is known as the **multinoulli** (or categorical) distribution, with pmf

Mu(x|1,\theta)=\prod_{j=1}^K \theta_j^{I(x_j=1)}
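The multinomial pmf and its n=1 special case can be sketched as follows. The three-sided die with probabilities (0.2, 0.3, 0.5) is hypothetical, and `one_hot` uses 0-based face indices as an implementation convenience.

```python
import math

def multinomial_pmf(counts, theta):
    """Mu(x | n, theta) with n = sum(counts)."""
    n = sum(counts)
    coef = math.factorial(n)
    for c in counts:
        coef //= math.factorial(c)   # exact integer: n! / (x_1! ... x_K!)
    prob = 1.0
    for c, t in zip(counts, theta):
        prob *= t ** c               # product of theta_j^{x_j}
    return coef * prob

def one_hot(k, K):
    """Encode face k (0-based) of a K-sided die as a vector of 0s and 1s."""
    return [1 if j == k else 0 for j in range(K)]

theta = [0.2, 0.3, 0.5]                       # hypothetical 3-sided die
p_counts = multinomial_pmf([1, 1, 2], theta)  # four rolls: one, one, two
p_single = multinomial_pmf(one_hot(2, 3), theta)
print(p_counts, p_single)
```

With a one-hot vector the coefficient is 1 and the product picks out a single \theta_j, so the second value is just the probability of the selected face.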

### 2.3 The Poisson distribution

We say that X\in \{0,1,2,\ldots\} has a **Poisson** distribution with parameter \lambda>0, written X\sim Poi(\lambda), if its pmf is

Poi(x|\lambda)=e^{-\lambda}\frac{\lambda^x}{x!}
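A direct implementation of this pmf is sketched below. The rate \lambda=4 and the truncation of the infinite sum at 60 terms are assumptions for the example; the mean of a Poisson distribution equals \lambda, which the truncated sum reproduces closely.

```python
import math

def poisson_pmf(x, lam):
    """Poi(x | lam) = exp(-lam) lam^x / x!"""
    return math.exp(-lam) * lam ** x / math.factorial(x)

lam = 4.0          # hypothetical rate
xs = range(60)     # the tail beyond 60 is negligible for this rate
total = sum(poisson_pmf(x, lam) for x in xs)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
print(total, mean)
```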

### 2.4 The empirical distribution

Given a set of data D=\{x_1,\ldots,x_N\}, we define the **empirical distribution**, also called the empirical measure, as follows:

p_{emp}(A)\doteq \frac{1}{N}\sum_{i=1}^N\delta_{x_i}(A)

where \delta_x(A) is the **Dirac measure**, defined by

\delta_x(A)=
\begin{cases}
0, & x\notin A \\
1, & x\in A
\end{cases}
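In code, the empirical distribution amounts to counting how many data points fall in A. In this sketch the set A is represented by a predicate, which is a modeling choice for the example, and the data values are hypothetical.

```python
def empirical_prob(data, event):
    """p_emp(A): fraction of data points x_i for which event(x_i) is true.

    `event` is a predicate playing the role of the set A; the Dirac measure
    delta_{x_i}(A) is 1 when event(x_i) holds and 0 otherwise.
    """
    return sum(1 for x in data if event(x)) / len(data)

data = [1.2, 0.7, 3.1, 2.4, 0.7]     # hypothetical observations
p_small = empirical_prob(data, lambda x: x <= 1.0)
print(p_small)
```

Two of the five points satisfy x \le 1, so the empirical probability of that event is 0.4.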

## Continuous Distributions

### 3.1 Gaussian distributions

The most widely used distribution in statistics and machine learning is the Gaussian or normal distribution. Its pdf is given by

N(x|\mu,\sigma^2) \doteq \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}

Here \mu=E[X] is the mean (and mode), and \sigma^2=var[X] is the variance. \sqrt{2\pi \sigma^2} is the normalization constant needed to ensure the density integrates to 1. We write X\sim N(\mu,\sigma^2) to denote that p(X=x)=N(x|\mu,\sigma^2). If X\sim N(0,1), we say X follows a **standard normal** distribution; its pdf is sometimes called the **bell curve**. We will often talk about the **precision** of a Gaussian, by which we mean the inverse variance \lambda=1/\sigma^2.

Note that, since this is a pdf, we can have p(x)>1. To see this, consider evaluating the density at its center, x=\mu. We have N(\mu|\mu, \sigma^2)=(\sigma\sqrt{2\pi})^{-1}e^0, so if \sigma<1/\sqrt{2\pi}, we have p(x)>1.
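The following sketch evaluates the Gaussian density at its center for two values of \sigma, one above and one below the threshold 1/\sqrt{2\pi} (about 0.399). The function name `gaussian_pdf` and the chosen values are illustrative.

```python
import math

def gaussian_pdf(x, mu, sigma2):
    """N(x | mu, sigma^2), parameterized by the variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

wide = gaussian_pdf(0.0, 0.0, 1.0)         # sigma = 1, above the threshold
narrow = gaussian_pdf(0.0, 0.0, 0.1 ** 2)  # sigma = 0.1, below the threshold
print(wide, narrow)
```

The narrow Gaussian has a peak density of about 3.99, which is a valid density value because only the area under the curve has to equal 1.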

The cumulative distribution function or cdf of the Gaussian distribution is

\Phi(x;\mu,\sigma^2)\doteq \int_{-\infty}^{x}N(z|\mu,\sigma^2)dz

This integral has no closed form expression, but it is built into most software packages. In particular, we can compute it in terms of the **error function** (erf):

\Phi(x;\mu,\sigma^2)=\frac{1}{2}[1+erf(z/\sqrt{2})]

z=(x-\mu)/\sigma

erf(x)\doteq \frac{2}{\sqrt{\pi}}\int_0^xe^{-t^2}dt
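A minimal sketch of the Gaussian CDF using `math.erf`, with the argument scaled as z/\sqrt{2}, is shown below. It is compared against `statistics.NormalDist.cdf` from the standard library as an independent check; the test point and parameters are arbitrary.

```python
import math
from statistics import NormalDist

def gaussian_cdf(x, mu, sigma):
    """Phi(x; mu, sigma^2) = 0.5 * (1 + erf(z / sqrt(2))), z = (x - mu) / sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

ours = gaussian_cdf(1.3, 1.0, 2.0)
reference = NormalDist(mu=1.0, sigma=2.0).cdf(1.3)
print(ours, reference)
print(gaussian_cdf(1.96, 0.0, 1.0))   # close to 0.975
```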

There are several reasons why the Gaussian is the most widely used distribution:

+ It has only two parameters, the mean and the variance, which are easy to interpret.

+ By the central limit theorem, sums of independent random variables are approximately Gaussian.

+ It makes the fewest assumptions (it has maximum entropy) among distributions with a specified mean and variance.

+ It has a simple mathematical form.

### 3.2 Degenerate pdf

In the limit that \sigma^2 \rightarrow 0, the gaussian becomes an infinitely tall and thin "spike" centered at \mu

\lim_{\sigma^2 \rightarrow 0}N(x|\mu,\sigma^2)=\delta(x-\mu)

where \delta is called the **Dirac delta function**, defined as

\delta (x)=
\begin{cases}
\infty, & x=0 \\
0, & x\ne 0
\end{cases}

such that

\int_{-\infty}^{+\infty}\delta (x)dx=1

A useful property of delta functions is the **sifting property**, which selects out a single term from a sum or integral:

\int_{-\infty}^{+\infty}f(x)\delta (x-\mu)dx=f(\mu)

since the integrand is only non-zero if x-\mu=0.
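The sifting property can be approximated numerically by replacing the delta function with a narrow Gaussian. This is a sketch, not an exact evaluation: the test function `math.cos`, the point \mu=0.7, and the integration window of \pm 8\sigma are all assumptions.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def smoothed_value(f, mu, sigma, n=20_001):
    """Midpoint-rule integral of f(x) N(x | mu, sigma^2) over mu +/- 8 sigma."""
    lo, hi = mu - 8 * sigma, mu + 8 * sigma
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += f(x) * gaussian_pdf(x, mu, sigma)
    return h * total

mu = 0.7
for sigma in (1.0, 0.1, 0.001):
    # As sigma shrinks, the integral approaches f(mu).
    print(sigma, smoothed_value(math.cos, mu, sigma), math.cos(mu))
```

As \sigma decreases, the Gaussian weights only values of x very close to \mu, so the integral converges to f(\mu).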

One problem with the Gaussian distribution is that it is sensitive to outliers, since the log-probability only decays quadratically with distance from the center. A more robust distribution is the **Student t distribution**. Its pdf is as follows:

\mathcal{T}(x|\mu,\sigma^2,v) \propto \left[ 1+\frac{1}{v}\left(\frac{x-\mu}{\sigma}\right)^2\right]^{-\frac{v+1}{2}}

where \mu is the mean, \sigma^2>0 is the scale parameter, and v>0 is called the **degrees of freedom**. The distribution has the following properties:

mean = \mu , mode = \mu, var = \frac{v\sigma^2}{(v-2)}

The mean is only defined if v>1, and the variance is only defined if v>2. If v=1, the t distribution is known as the **Cauchy** or **Lorentz** distribution.
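The robustness argument can be made concrete by comparing how fast the two log-densities fall off. The sketch below works with unnormalized log-densities only (normalizing constants are ignored), and the choice v=3 is hypothetical.

```python
import math

def gaussian_log_kernel(x, mu, sigma):
    """Log of the unnormalized Gaussian density: quadratic in the distance."""
    return -0.5 * ((x - mu) / sigma) ** 2

def student_log_kernel(x, mu, sigma, v):
    """Log of the unnormalized Student t density: logarithmic in the distance."""
    return -0.5 * (v + 1) * math.log1p(((x - mu) / sigma) ** 2 / v)

for x in (1.0, 5.0, 20.0):
    print(x, gaussian_log_kernel(x, 0.0, 1.0), student_log_kernel(x, 0.0, 1.0, v=3.0))
```

At x=20 the Gaussian term is -200 while the Student t term is only about -9.8, so a single distant outlier is penalized far less heavily under the t distribution.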

### 3.3 The Laplace distribution

Another distribution with heavy tails is the **Laplace distribution**, also known as the double sided exponential distribution. This has the following pdf:

Lap(x|\mu,b) \doteq \frac{1}{2b} \exp\left(-\frac{|x-\mu|}{b}\right)

Here \mu is a location parameter and b > 0 is a scale parameter. This distribution has the following properties:

mean = \mu, mode = \mu, var = 2b^2
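These properties can be verified by numerical integration of the density exp(-|x-\mu|/b)/(2b). The parameters \mu=1 and b=0.5, the integration window, and the midpoint-rule helper are assumptions for this sketch.

```python
import math

def laplace_pdf(x, mu, b):
    """Lap(x | mu, b) = exp(-|x - mu| / b) / (2b)."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

mu, b = 1.0, 0.5                      # hypothetical parameters
lo, hi = mu - 40 * b, mu + 40 * b     # tails beyond this window are negligible

total = integrate(lambda x: laplace_pdf(x, mu, b), lo, hi)
mean = integrate(lambda x: x * laplace_pdf(x, mu, b), lo, hi)
var = integrate(lambda x: (x - mu) ** 2 * laplace_pdf(x, mu, b), lo, hi)
print(total, mean, var)
```

The results should be close to 1, \mu=1, and 2b^2=0.5 respectively.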

### 3.4 The gamma distribution

The gamma distribution is a flexible distribution for positive real valued rv’s, x > 0. It is defined in terms of two parameters, called the shape a > 0 and the rate b > 0

Ga(T|a, b)\doteq \frac{b^a}{\Gamma (a)}T^{a-1}e^{-Tb}

where \Gamma (a) is the gamma function:

\Gamma(x) \doteq \int_{0}^{\infty}u^{x-1}e^{-u}du

Such distributions have the following properties:

mean=\frac{a}{b}, mode = \frac{a-1}{b}, var=\frac{a}{b^2}
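The gamma density can be written down directly with `math.gamma`. The sketch below uses hypothetical parameters a=3 and b=2 and checks that the density peaks at the stated mode (a-1)/b.

```python
import math

def gamma_pdf(t, a, b):
    """Ga(t | a, b) with shape a and rate b, for t > 0."""
    return b ** a / math.gamma(a) * t ** (a - 1) * math.exp(-t * b)

a, b = 3.0, 2.0          # hypothetical shape and rate
mode = (a - 1) / b

at_mode = gamma_pdf(mode, a, b)
left = gamma_pdf(mode - 0.1, a, b)
right = gamma_pdf(mode + 0.1, a, b)
print(left, at_mode, right)   # the middle value is the largest

# For integer arguments the gamma function reduces to a factorial:
# Gamma(5) = 4! = 24.
print(math.gamma(5))
```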

There are several distributions which are special cases of the gamma distribution:

+ **Exponential distribution**: This is defined by Expon(x|λ) \doteq Ga(x|1,λ), where λ is the rate parameter. This distribution describes the times between events in a Poisson process, i.e. a process in which events occur continuously and independently at a constant average rate λ.

+ **Erlang distribution**: This is the same as the gamma distribution where a is an integer. It is common to fix a = 2, yielding the one-parameter Erlang distribution, Erlang(x|λ) = Ga(x|2,λ), where λ is the rate parameter.

+ **Chi-squared distribution**: This is defined by \chi^2(x|v)\doteq Ga(x|\frac{v}{2}, \frac{1}{2}). This is the distribution of the sum of squared Gaussian random variables. More precisely, if Z_i \sim N(0,1) and S=\sum_{i=1}^v Z_i^2, then S \sim \chi^2_v.
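The chi-squared case can be checked by simulation: summing v squared standard normal draws should give samples whose mean and variance match those of Ga(v/2, 1/2), namely a/b=v and a/b^2=2v. The value v=4, the seed, and the sample size are assumptions for this sketch.

```python
import random

def chi2_sample(v, rng):
    """Draw S = sum of v squared standard normal variables."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(v))

rng = random.Random(0)     # fixed seed for reproducibility
v = 4
samples = [chi2_sample(v, rng) for _ in range(50_000)]

sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

a, b = v / 2, 1 / 2        # shape and rate of Ga(v/2, 1/2)
print(sample_mean, a / b)          # gamma mean a/b = v
print(sample_var, a / b ** 2)      # gamma variance a/b^2 = 2v
```

The sample statistics only match approximately, with an error that shrinks as the number of samples grows.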

Another useful result is : if X \sim Ga(a,b), then one can show that 1/X \sim IG(a,b), where IG is the **Inverse Gamma** distribution defined by:

IG(x|a,b) \doteq \frac{b^a}{\Gamma (a)}x^{-(a+1)}e^{-b/x}

The distribution has these properties:

mean=\frac{b}{a-1}, mode =\frac{b}{a+1},var=\frac{b^2}{(a-1)^2(a-2)}

The mean only exists if a>1. The variance only exists if a>2.
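The relationship between the gamma and inverse gamma distributions can be illustrated by sampling. Note that `random.gammavariate` takes a shape and a scale, so the rate b is passed as 1/b. The parameters a=6 and b=2, the seed, and the sample size are hypothetical.

```python
import random

a, b = 6.0, 2.0            # hypothetical shape and rate, chosen with a > 2
rng = random.Random(0)     # fixed seed for reproducibility

# Draw X ~ Ga(a, b) and invert each sample, so that 1/X ~ IG(a, b).
# random.gammavariate expects (shape, scale); scale = 1 / rate.
inverse_samples = [1.0 / rng.gammavariate(a, 1.0 / b) for _ in range(100_000)]

sample_mean = sum(inverse_samples) / len(inverse_samples)
expected_mean = b / (a - 1)
print(sample_mean, expected_mean)
```

The sample mean should land close to b/(a-1)=0.4; for a \le 1 the same experiment would not settle down, reflecting the fact that the mean does not exist.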