## Basic Introduction to Machine Learning

## 1.1 What and Why?

Definition of **Machine Learning**: automated methods of data analysis that can detect patterns in data, then use the discovered patterns to predict future data or to support other kinds of decision making.

There are many ways to represent machine learning problems, but a common view is that the best way to approach them is through the **probability model**. Probability models focus on the **uncertainty** inherent in machine learning, for instance the confidence of a prediction, or the cluster a data point is most likely to belong to. This book focuses on the various kinds of probability models.

## 1.2 Types of ML

Machine learning algorithms can be divided into two main types: **Supervised Learning** and **Unsupervised Learning**.

In predictive or supervised learning, the goal is to learn a mapping from inputs x to outputs y, given a *training set* D = \{(x_i, y_i)\}_{i=1}^{N} with N cases. In the simplest case, each input x_i is a D-dimensional vector containing the *features / attributes / covariates*. The output y_i is called the *response variable*; in most cases it takes values from a finite set and is then called *categorical* or *nominal*. There are three common output styles:

+ y_i is **categorical**: the problem is known as **Classification / Pattern Recognition**

+ y_i is **real-valued**: the problem is known as **Regression**

+ y_i takes **discrete, ordered values** (e.g. grades A–F): the problem is known as **Ordinal Regression**

In descriptive or unsupervised learning, we are given only the inputs D = \{x_i\}_{i=1}^{N}, and the goal is to find interesting patterns in the data. This kind of learning is much harder to define precisely.

There is also a third kind of learning called **reinforcement learning**.

## Various Learning Methods

## 2.1 Supervised Learning

### 2.1.1 Classification

The goal of classification is to learn a mapping from inputs x to outputs y \in \lbrace 1,...,C \rbrace. If C=2, the problem is called binary classification; if C>2, it is called multiclass classification. If the class labels are not mutually exclusive, the problem is called multilabel classification, which can be viewed as a series of binary classification problems.

The target in this case is the probability distribution over all classes, p(y=c|x,D), given the input x and the training set D. We make it explicit that the probability is conditional on the specific input and training set. In some cases the choice of model also matters, so the probability is written p(y=c|x,D,M).

Given the probability model and the data, we want to make the best prediction of which class the input belongs to. That means picking the class with the highest posterior probability, which is called the **MAP (Maximum A Posteriori)** estimate:

\hat{y}=\hat{f}(x)=\mathop{\arg\max}_{c=1}^{C} p(y=c|x,D,M)
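The MAP decision rule above can be sketched in a few lines. The posterior vector here is a hypothetical example, standing in for the output of some fitted probabilistic classifier:

```python
import numpy as np

# Hypothetical class posteriors p(y=c | x, D, M) for a single input x,
# as would be produced by some fitted probabilistic classifier (assumed).
posterior = np.array([0.1, 0.7, 0.2])  # C = 3 classes

# MAP estimate: pick the class with the highest posterior probability.
y_hat = int(np.argmax(posterior))
print(y_hat)  # 1 (0-based index of the mode of the posterior)
```

Whatever model produces the posterior, the decision step is always this single argmax over classes.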

### 2.1.2 Regression

Regression is just like classification except that the response variable is *continuous*.

## 2.2 Unsupervised Learning

An unsupervised learning model receives input data but no corresponding outputs. The goal of unsupervised learning is to discover interesting structure in the data. Unlike supervised learning, we are not told what the desired output is for each input. Instead, we formalize the task as **Density Estimation**, i.e., building models of the form p(x_i|\theta). There are two main differences between supervised and unsupervised learning. First, in supervised learning we construct the conditional density p(y_i|x_i,D), while in unsupervised learning we construct the unconditional density p(x_i|\theta). Second, since x_i is a vector of features, unsupervised learning requires multivariate probability models.
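The simplest instance of density estimation is fitting a single multivariate Gaussian p(x|\theta) by maximum likelihood, where \theta = (\mu, \Sigma). A sketch on synthetic unlabeled data (the true mean used to generate the data is an assumption of this example):

```python
import numpy as np

# Unsupervised data D = {x_i}: 2-D points with no labels, drawn (for
# illustration) from a Gaussian with true mean [2, -1].
rng = np.random.default_rng(1)
X = rng.normal(loc=[2.0, -1.0], scale=1.0, size=(500, 2))

# Maximum-likelihood estimate of theta = (mu, Sigma) for p(x | theta):
# the sample mean and sample covariance.
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

print(mu)  # close to the true mean [2, -1]
```

Richer unsupervised models (mixtures, latent-factor models) generalize this same idea of fitting p(x_i|\theta) to multivariate data.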

Unsupervised learning is closer to how humans and animals learn, and it is also more widely applicable than supervised learning, since unlabeled datasets require no manual work to "tag" the data.

### 2.2.1 Discovering clusters

A canonical example of unsupervised learning is data clustering: dividing the data into groups. The first goal in clustering is to estimate the distribution over the number of clusters, p(K|D); this tells us whether there are subpopulations within the data. In supervised learning the number of classes is fixed, but in unsupervised learning K is flexible: we can choose as many or as few clusters as we like. The process of determining K is called model selection.

Our second goal is to estimate which cluster each point belongs to. Let z_i \in \lbrace 1,\dots,K \rbrace represent the cluster to which data point i is assigned; z_i is an example of a hidden or latent variable.
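One simple way to estimate the assignments z_i, once K has been chosen, is K-means. A sketch on two synthetic, well-separated blobs (the data, and the choice K = 2, are assumptions of this example; for simplicity the centroids are initialized with one point from each region):

```python
import numpy as np

# Two well-separated 2-D blobs; K = 2 is chosen here (model selection).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),
               rng.normal(5.0, 0.3, size=(50, 2))])

K = 2
centroids = np.array([X[0], X[50]])  # one seed point per region
for _ in range(10):
    # E-like step: assign each point to its nearest centroid (this is z_i).
    z = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
    # M-like step: move each centroid to the mean of its assigned points.
    centroids = np.array([X[z == k].mean(axis=0) for k in range(K)])

print(set(z[:50]), set(z[50:]))  # each blob ends up with one distinct label
```

The vector z is exactly the latent variable described above: it is never observed, only inferred from the data.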

### 2.2.2 Discovering latent factors

When dealing with high-dimensional data, it is often useful to reduce the dimensionality by projecting the data onto a lower-dimensional subspace that captures the "essence" of the data. This is called **dimensionality reduction**. The motivation behind this technique is that although the data may appear high-dimensional, there may be only a small number of degrees of variability, corresponding to **latent factors**. When used as input to other statistical models, such lower-dimensional representations often yield better predictive accuracy, because they filter out the "noise" and keep the essential structure. The most common method for dimensionality reduction is **PCA (Principal Component Analysis)**.
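PCA can be sketched via the SVD of the centered data matrix. Here the data is synthetic: 3-D points that actually vary along a single latent direction (the direction and noise level are assumptions of this example), so one principal component should capture almost all the variance:

```python
import numpy as np

# 3-D data generated from a single latent factor t plus small noise.
rng = np.random.default_rng(3)
t = rng.normal(size=200)                   # the hidden latent factor
direction = np.array([1.0, 2.0, -1.0])     # assumed latent direction
X = np.outer(t, direction) + rng.normal(scale=0.01, size=(200, 3))

# PCA via SVD of the centered data: rows of Vt are the principal axes,
# ordered by the variance they explain (singular values s).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)

print(explained)   # first component explains almost all the variance
Z = Xc @ Vt[0]     # 1-D projection capturing the "essence" of the data
```

The explained-variance ratios make the "small number of degrees of variability" concrete: here the data is effectively one-dimensional despite living in three dimensions.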

## Basic concepts in machine learning

**!Contents here will be released afterwards**