CS229 Study Notes


Introduction

  1. What is machine learning?
    A program is said to learn from experience E with respect to task T and performance measure P if its performance on T, as measured by P, improves with experience E.
  2. Types:
    • Supervised learning & Unsupervised learning
    • Reinforcement learning & recommender systems
  3. How to apply learning algorithm
  4. Basic knowledge of linear algebra
    • Matrix: rows x columns
    • Vector: An n x 1 matrix; 1-indexed or 0-indexed (the first element of vector y is called y1 or y0, respectively)
    • Scalar means real number
    • PredictionVector = DataMatrix x ParametersVector
    • PredictionMatrix = DataMatrix x Multi_Pairs_ParametersMatrix
    • Matrix multiplication: not commutative, but associative; Inverse (a matrix with no inverse is called singular or degenerate); Transpose

Supervised learning

  1. Def: We give the computer a dataset with the correct answer for each example
  2. Regression: Predict continuous value output
  3. Classification: Discrete valued output
  4. Deal with infinite features: we have algorithm to do that

Unsupervised learning

  1. Def: We just give the computer a dataset, and let the computer find the structure of the data
    • Type 1: Split the data into different groups
    • Type 2: Audio separation



Linear Regression

  1. Training set -> Learning algorithm -> h: h(x) = y (hypothesis)
    • x input, y output, m num of training examples
    • h_theta(x) = theta0 + theta1 * x (called univariate linear regression)
    • h_theta(xi) = (xi)^T * theta

LMS(Least Mean Square) Algorithm

  1. Def Loss Function — how to choose thetas
    • The idea: minimize the separation between the h(x)'s and the y's over the training set
    • In this example minimize(sum((h(xi) - yi) ^ 2) / (2m)) (what we minimize is also called Square Error Cost Function)
    • Def cost function (J) as the function we try to minimize
  2. Understanding of Hypothesis and Cost Function
    • J is a function of parameters; For fixed parameters, h is a function of x
    • For each value of parameters, J([parameters]) corresponds to a specific hypothesis
    • The high-D understanding of h & J
  3. Gradient Descent — How to automatically find the parameters that minimize J
    • Generally, we have a function J
      • Start with some parameters
      • Keep changing them to reduce J, until hopefully we end up at a minimum
      • Gradient: the direction of steepest change of a function
      • thetaj := thetaj - alpha * pd(J) / pd(thetaj) , in which alpha is called the learning rate; too small — slow; too large — may overshoot the minimum, fail to converge, or even diverge
      • BTW: To simultaneously update all parameters correctly, we need to save all the new values temporarily and write them back together after all the calculations are done. Otherwise, parameters updated later will not be computed from the right (old) values
    • Specified to cost func J
      • That’s why there is a ‘2’ in the denominator of the cost function (it cancels when differentiating)
      • theta_j := theta_j + alpha * (yi - h(xi)) * xi_j
      • Convex function — bowl shape
      • Batch Gradient Descent: Each step of gradient descent uses all the training examples (the sum covers the whole training set); there are other variants, discussed next
      • Stochastic (Incremental) Gradient Descent: Each step uses a single training example (or a small sample of examples); see the sketch below
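A minimal NumPy sketch of the batch gradient descent update above, on a toy univariate problem (the data and hyperparameters are illustrative, not from the notes):

```
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iters=1000):
    """Linear regression h(x) = theta^T x, minimizing J = sum((h - y)^2) / (2m)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # dJ/dtheta, averaged over all m examples
        theta -= alpha * grad              # simultaneous update of all theta_j
    return theta

# toy data: y ~ 2 + 3x plus noise; a first column of ones gives the intercept theta0
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
X = np.column_stack([np.ones(50), x])
y = 2 + 3 * x + 0.1 * rng.standard_normal(50)
print(batch_gradient_descent(X, y))        # approximately [2, 3]
```

The stochastic variant replaces the averaged gradient with the gradient on one example at a time.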

The normal equations

  1. Some more Matrix calc
    • Derivative: Df(A)=[pd(f) / pd(Aij)]
    • Trace: trA; trAB = trBA (invariant under cyclic permutation)
    • Some equation:
      • D(_A)trAB = B^T
      • D(_A^T)f(A) = (D(_A)f(A))^T
      • D(_A)trABA^TC = CAB + C^TAB^T
      • D(_A)|A| = |A| * (A^-1)^T
  2. The Normal Equation: In order to let DJ(theta) = 0
    X^T * X * theta = X^T * y

  3. So by theta = (X^T * X)^-1 * X^T * y, we can minimize J without resorting to an iterative algorithm
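A small sketch of this closed-form solution (using `np.linalg.solve` on the normal equation rather than forming an explicit inverse; the data setup is hypothetical):

```
import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) theta = X^T y directly instead of inverting X^T X
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.standard_normal((100, 2))])
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.standard_normal(100)
print(normal_equation(X, y))   # close to [1, -2, 0.5]
```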

Probabilistic interpretation

  1. The justification of LMS: minimizing J = maximizing the likelihood or log-likelihood
    • L(theta) = pi(p(yi | xi; theta))
    • l(theta) = log(L(theta))
    • In the example above, y | x; theta ~ N(theta^T * x, sigma^2) (N denotes the Gaussian distribution)

Locally weighted linear regression(LWR)

  1. Underfitting & Overfitting

  2. Suppose we have enough data that the choice of features is less critical
  3. In LWR, instead of minimizing sum((yi - theta^T * xi)^2), we minimize sum(wi * (yi - theta^T * xi)^2), where the wi's are non-negative weights (see the sketch after this list)
  4. Fairly standard choice of w
    wi = e^(- (xi - x)^2 / (2 * tao^2))
    where x is the query point, tao is called the bandwidth parameter

  5. Non-parametric: the amount of stuff we have to keep around grows with the training set size (we need the whole training set to make a prediction)
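A sketch of one LWR prediction at a single query point, solving the weighted least-squares problem via the weighted normal equation (the function and data names are illustrative):

```
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict at x_query by fitting theta that minimizes sum(w_i * (y_i - theta^T x_i)^2)."""
    # Gaussian bump weights centered at the query point; tau is the bandwidth
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
print(lwr_predict(X, y, np.array([1.0, 1.0])))   # roughly sin(1) ~ 0.84
```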



Classification and logistic regression

For now, we’ll focus on the binary classification problem: 0 — negative class (‘-‘); 1 — positive class (‘+’). Given x(i), the corresponding y(i) is also called the label for the training example

Logistic regression

  1. Treated as continuous values prediction
  2. Logistic function(also called Sigmoid function)
    g(z) = 1 / (1 + e^-z)
    Also g'(z) = g(z) * (1 - g(z))
  3. Hypothesis: h_theta(x) = g(theta^T * x)
  4. Probabilistic assumption:
    • P(y = 1 | x; theta) = h_theta(x)
    • P(y = 0 | x; theta) = 1 - h_theta(x)
    • Generally: p(y | x; theta) = (h_theta(x))^y * (1 - h_theta(x))^(1 - y) (the Bernoulli distribution Bernoulli(phi = h_theta(x)))

  5. Maximizing the log likelihood l(theta) by gradient ascent, we derive the same update formula:
    theta_j := theta_j + alpha * (yi - h_theta(xi)) * xi_j
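A minimal sketch of logistic regression trained by batch gradient ascent on the log-likelihood; note the update has the same form as in linear regression (synthetic data, illustrative names):

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, iters=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * X.T @ (y - h) / len(y)   # gradient ascent on l(theta)
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.standard_normal((200, 2))])
y = (X @ np.array([0.5, 2.0, -1.0]) + 0.5 * rng.standard_normal(200) > 0).astype(float)
theta = logistic_regression(X, y)
print(((sigmoid(X @ theta) > 0.5) == y).mean())   # training accuracy, close to 1
```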

Digression: The perceptron learning algorithm

  • All the same as above but g(z) = z<0?0:1

Another algorithm for maximizing l(theta)

  1. Newton’s method to find a zero for a func:
    theta := theta - f(theta) / f'(theta) (intuition: repeatedly follow the tangent-line approximation to its zero)
  2. Apply to l(theta): theta := theta - l'(theta) / l''(theta)
  3. Considering high-D: (Newton-Raphson method)
    theta := theta - H^-1 * grad_theta(l(theta))
    where H_ij = pd^2(l(theta)) / (pd(theta_i) * pd(theta_j)) (the Hessian matrix)

  4. Can be understood as gradient descent/ascent with a changing (matrix-valued) learning rate alpha
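A sketch of Newton's method for the same logistic log-likelihood. For logistic regression the Hessian is H = -X^T S X with S = diag(h*(1-h)), so theta := theta - H^-1 * grad becomes the update below (data and names are illustrative):

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                  # gradient of l(theta)
        S = h * (1 - h)                       # diagonal weights
        H = -(X.T * S) @ X                    # Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)     # theta := theta - H^-1 grad
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.standard_normal((300, 2))])
y = (sigmoid(X @ np.array([-0.5, 1.5, -2.0])) > rng.uniform(size=300)).astype(float)
print(logistic_newton(X, y))   # typically converges in a handful of iterations
```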



Generalized Linear Models(GLMs)

The exponential family

  1. The exponential family distributions:
    p(y; yita) = b(y) * exp(yita^T * T(y) - a(yita))
    where:
    • yita — natural parameter(also called canonical parameter);
    • T(y) — sufficient statistic;
    • a(yita) — log partition function( e^(-a(yita)) essentially plays the role of a normalization constant)

  2. Example 1: For Bernoulli(phi):
    • T(y) = y;
    • a(yita) = - log(1 - phi) = log(1 + e^yita);
    • b(y) = 1
  3. Example 2: For Gaussian N(miu, sigma^2 = 1): (Because sigma has no effect on the final choice of theta and h)
    • T(y) = y;
    • a(yita) = miu^2 / 2 = yita^2 / 2;
    • b(y) = exp(- y^2 / 2) / sqr_root(2 * pi)
  4. Other Examples: multinomial, Poisson, gamma, exponential, beta, Dirichlet, etc.

Constructing GLMs

  1. Assumption:
    • y | x; theta ~ ExponentialFamily(yita) — the distribution of y given x and theta follows some exponential family distribution with parameter yita
    • Given x, our goal is to predict the expected value of T(y) given x
      In most of our examples, we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy h(x) = E[y|x] (the conditional expectation)
    • The natural parameter yita and the inputs x are related linearly:
      yita_i = theta_i^T * x
  2. Some special name:
    • y — Response variable
    • Canonical response function g(yita): gives the distribution's mean as a function of the natural parameter (so h(x) = g(yita)); Canonical link function g^-1: maps the mean back to yita

  3. Ordinary Least Squares
    • We model the conditional distribution of y given x as a Gaussian function above
    • So we have h_theta(x) = E[y | x; theta] = miu = yita = theta^T * x
  4. Logistic regression
    • Bernoulli
    • h_theta(x) = E[y | x; theta] = phi = 1 / (1 + e^-yita) = 1 / (1 + e^-(theta^T * x))
  5. Softmax regression (multinomial distribution)
    • For a multinomial over k possible outcomes, we’ll use k - 1 independent parameters to parameterize it phi_1, phi_2, ..., phi_(k - 1)
    • phi_i = p(y = i; phi) and p(y = k; phi) = 1 - sum(phi_i) (let it = phi_k, where phi_k is not a parameter but just a notation)
    • A notation to introduce: 1{·} as "the argument is true" ? 1 : 0
    • To express the multinomial as an exponential family distribution:
      • we define that (T(y))_i = 1{y = i} (for i = 1, 2, …, k-1)
        That is: T(i) = [0, ..., 1(i_th), ..., 0((k-1)_th)]^T (for i = 1,2,…,k - 1); T(k) = [0,...,0]^T
        Moreover, E[(T(y))_i] = P(y = i) = phi_i
      • yita = [log(phi_1 / phi_k), ..., log(phi_(k-1) / phi_k)]^T
      • a(yita) = - log(phi_k)
      • b(y) = 1
    • The softmax function:
      • The link function: yita_i = log(phi_i / phi_k); we also define yita_k = 0 for convenience
      • The response function: phi_i = e^yita_i / sum(e^yita_j)
    • So that p(y = i | x; theta) = phi_i = e^(theta_i^T * x) / sum(e^(theta_j^T * x))
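A tiny sketch of the softmax response function above; the max-subtraction trick is standard numerical practice and not part of the notes:

```
import numpy as np

def softmax_probs(Theta, x):
    """phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x) for class i."""
    logits = Theta @ x                 # one row theta_i per class
    logits -= logits.max()             # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

Theta = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]])   # 3 classes, 2 features
print(softmax_probs(Theta, np.array([1.0, 2.0])))          # probabilities summing to 1
```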



Generative Learning Algorithms

    1. Discriminative learning algorithms: try to learn p(y|x) directly, or learn mappings directly from the space of inputs to the labels;
    2. Generative learning algorithm: instead, try to model p(y) (called class priors) and p(x|y), then calc p(y|x) by Bayes rule:
      p(y|x) = p(x|y) * p(y) / p(x)
      where p(x) = sum(p(x|y = i) * p(y = i))
      Moreover, arg max_y(p(y|x)) = arg max_y(p(x|y) * p(y) / p(x)) = arg max_y(p(x|y) * p(y))
      So likelihood function would be derived from p(x|y) and p(y)

Gaussian Discriminant Analysis (GDA)

  1. Introduce: Multivariate normal distribution
    • Mean vector miu (n); Covariance matrix Sigma (n x n) (symmetric and positive semi-definite)
    • p(x; miu, Sigma) = exp(-(x - miu)^T * Sigma^-1 * (x - miu) / 2) / ((2 * pi)^(n / 2) * |Sigma|^0.5)
    • E[X] = integral_x(x * p(x; miu, Sigma))dx = miu
      Cov(Z) = E[(Z - E[Z])(Z - E[Z])^T] = E[ZZ^T] - E[Z]E[Z]^T = Sigma
    • Standard normal distribution
    • In Sigma, the off-diagonal entry Sigma_ij controls how the density stretches along the direction x_i = x_j: as it gets bigger, the density becomes more spread out along x_i = x_j and more compressed along x_i = -x_j (the diagonal entries, i = j, instead control the spread along each individual axis)

  2. The Gaussian Discriminant Analysis model: take as an example a binary classification problem in which the inputs x are continuous, real-valued vectors
    • y ~ Bernoulli(phi)
      x | y = 0 ~ N(miu0, Sigma)
      x | y = 1 ~ N(miu1, Sigma)
    • Parameters: phi, miu0, miu1, Sigma (Note that only one Sigma for two miu)
    • By maximizing l(phi, miu0, miu1, Sigma), we get (see the sketch after this list):
      phi = sum(1{yi = 1}) / m
      miu0 = sum(1{yi = 0} * xi) / sum(1{yi = 0})
      miu1 = sum(1{yi = 1} * xi) / sum(1{yi = 1})
      Sigma = sum((xi - miu_yi)(xi - miu_yi)^T) / m
  3. Discussion: GDA & logistic regression
    • Viewing p(y = 1 | x) as a function of x, the GDA assumptions imply a logistic form with some theta = theta(phi, Sigma, miu0, miu1); in that sense GDA reduces to logistic regression
    • GDA makes a stronger assumption: not only does p(y|x) follow a logistic function, but p(x|y) is multivariate Gaussian
    • When assumption correct, GDA is better, even asymptotically efficient
    • GDA is more data efficient (requires less training data to learn “well”)
    • Logistic regression more robust
  4. Asymptotically efficient: Means that in the limit of very large training sets (large m), there is no algorithm that is strictly better than GDA
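A sketch of the closed-form GDA parameter estimates from item 2 above (synthetic two-class data; names are illustrative):

```
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates for phi, miu0, miu1 and the shared Sigma."""
    phi = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)   # xi - miu_{yi}
    Sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, Sigma

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([2, 2], np.eye(2), 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
phi, mu0, mu1, Sigma = fit_gda(X, y)
print(phi, mu0, mu1)   # ~0.5, ~[0, 0], ~[2, 2]
```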

Naive Bayes

  1. Input vector x are discrete-valued
  2. Naive Bayes (NB) assumption: x_i’s are conditionally independent given y
    Then p(x_1, ..., x_50000 | y) = p(x_1 | y)p(x_2 | y, x_1)···p(x_50000 | y, x_1, ..., x_49999) = pi(p(x_j | y))
  3. Naive Bayes classifier:
    • Parameterize and Maximize likelihood:
      phi_(j|y=1) = p(x_j=1|y=1) = sum(1{xi_j=1 ^ yi=1}) / sum(1{yi=1})
      phi_(j|y=0) = p(x_j=1|y=0) = sum(1{xi_j=1 ^ yi=0}) / sum(1{yi=0})
      phi_y = p(y = 1) = sum(1{yi=1}) / m
    • Then p(y = 1 | x) = p(x | y = 1)p(y = 1) / p(x) where p(x) = p(x | y = 1) * p(y = 1) + p(x | y = 0) * p(y = 0)
    • x can be multinomial;
      For continuous x, if GDA etc. do not fit well, we can also discretize x and use Naive Bayes
  4. Laplace smoothing:
    • Problem with the previous version: for words that never appear in the training set, both phi estimates are 0, and the posterior becomes 0/0
    • Solution: Laplace smoothing
      Generally phi_j = (sum(1{zi = j}) + 1) / (m + k) where k is the number of possible j values to make sure the sum of all phi is 1
      Specifically, phi_(j|y=1) = (sum(1{xi_j=1 ^ yi=1}) + 1) / (sum(1{yi=1}) + 2), similarly for phi_(j|y=0)

  5. Multinomial event model
    x_i represents the identity of the i-th word of the document;
    We assume the distribution of x_i given y is the same for every position i
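A sketch of the Bernoulli-feature Naive Bayes classifier with Laplace smoothing from items 3-4 (X is a binary feature matrix; the data and names are illustrative, and log-probabilities are used to avoid underflow):

```
import numpy as np

def fit_naive_bayes(X, y):
    """X is m x n binary; returns smoothed phi_{j|y=1}, phi_{j|y=0}, phi_y."""
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)   # Laplace smoothing
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_j_y1, phi_j_y0, y.mean()

def predict(x, phi_j_y1, phi_j_y0, phi_y):
    # compare log p(x|y=1)p(y=1) with log p(x|y=0)p(y=0)
    log_p1 = np.sum(x * np.log(phi_j_y1) + (1 - x) * np.log(1 - phi_j_y1)) + np.log(phi_y)
    log_p0 = np.sum(x * np.log(phi_j_y0) + (1 - x) * np.log(1 - phi_j_y0)) + np.log(1 - phi_y)
    return int(log_p1 > log_p0)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
# feature j is more likely to appear when y = 1
X = (rng.uniform(size=(500, 20)) < np.where(y[:, None] == 1, 0.6, 0.2)).astype(int)
params = fit_naive_bayes(X, y)
print(np.mean([predict(x, *params) == yi for x, yi in zip(X, y)]))   # training accuracy
```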



Support Vector Machines

Margins

  1. Intuition: Margins, Confidence of the prediction, Separating Hyperplane
    Important assumption: Data is linearly separable
  2. For the following discussion, we let the two values that y can take be -1 and 1, and denote h_(w,b)(x) = g(w^T * x + b), where g(z) = z<0 ? -1 : 1

  3. Functional Margin: gammai = yi * (w^T * xi + b) and gamma = min(gammai)
    • > 0 — correct; Large — confident
    • There is a problem: simply scaling w and b (e.g. multiplying both by 2) makes the functional margin larger without changing anything meaningful
      So often we will impose some sort of normalization condition such as ||w||_2 = 1
  4. Geometric Margin: gammai = yi * ((w / ||w||)^T * xi + b / ||w||) and gamma = min(gammai)

The Optimal Margin Classifier

  1. The natural desideratum: Maximizes the (geometric) margin
    So the problem is: max_(gamma,w,b)(gamma / ||w||) s.t. yi * (w^T * xi + b) >= gamma, i = 1, ..., m
    Which, to make the calculation easier, is equivalent to: min_(w,b)(0.5 * ||w||^2) s.t. yi * (w^T * xi + b) >= 1, i = 1, ..., m
    This can be considered the loss function that the model (Optimal Margin Classifier) uses for optimization

  2. Digression: Lagrange Duality
    • Consider the problem: min_w(f(w)) s.t. h_i(w) = 0, i = 1, ..., l
      Constrained extrema, solved by the method of Lagrange multipliers:
      • Def Lagrangian: hL(w, beta) = f(w) + sum(beta_i * h_i(w)) where the beta_i's are called Lagrange multipliers
      • Then solve pdhL / pdw_i = 0; pdhL / pdbeta_i = 0 to get w and beta
    • Consider the problem (primal optimization problem): min_w(f(w)) s.t. g_i(w) <= 0, i = 1, ..., k; h_i(w) = 0, i = 1, ..., l
      • Def generalized Lagrangian: hL(w, alpha, beta) = f(w) + sum(alpha_i * g_i(w)) + sum(beta_i * h_i(w))
      • Consider the quantity: theta_P(w) = max_(alpha, beta; alpha_i>=0)(hL(w, alpha, beta))
        we can derive that theta_P(w) = (w satisfies the primal constraints) ? f(w) : infinity
        Then we have min(theta_P(w)) = min(max(hL(w, alpha, beta)))
        we call the solution p^* = min(theta_P(w)) the value of the primal problem
      • Consider another problem called the Dual optimization problem:
        First define theta_D(alpha, beta) = min_w(hL(w, alpha, beta))
        Then the problem is max_(alpha_i>=0)(theta_D(alpha, beta)) = max_(alpha_i>=0)(min_w(hL(w, alpha, beta)))
        we call the solution d^* = max_(alpha_i>=0)(theta_D(alpha, beta)) the value of the dual problem
      • Because maxmin <= minmax we get d^* <= p^*
      • Suppose f and the g_i's are convex, and the h_i's are affine. Suppose further that the constraints g_i are (strictly) feasible
        Then we have p^* = d^* = hL(w^*, alpha^*, beta^*)
        Moreover, they satisfy the Karush-Kuhn-Tucker (KKT) conditions below. And any w^*, alpha^*, beta^* that satisfy the KKT conditions are also a solution
        • pdhL(w^*, alphs^*, beta^*) / pdw_i = 0, i=1, ..., n
        • pdhL(w^*, alphs^*, beta^*) / pdbeta_i = 0, i=1, ..., l
        • alpha_i^* * g_i(w^*) = 0, i = 1, ..., k (called the KKT dual complementarity condition)
        • g_i(w^*) <= 0, i = 1, ..., k
        • alpha_i^* >= 0, i = 1, ..., k

  3. So to match the Lagrange Duality expression, we write the problem as: min_(w,b)(0.5 * ||w||^2) s.t. g_i(w) = - yi * (w^T * xi + b) + 1 <= 0, i = 1, ..., m
  4. From the KKT conditions we know that alpha_i != 0 is possible only when g_i(w) = 0
    That is to say, only when yi * (w^T * xi + b) attains its minimum (which is 1 because of the constraint)
    To understand it from a geometric view: for a given line (w, b chosen), translate it in both directions until it touches some points; those touched points are the ones with non-zero alpha
  5. Specifically, in the Optimal Margin Classifier, the points with non-zero alpha are called Support Vectors

  6. To use Lagrange Duality and the KKT conditions:
    • First we write down hL(w, b, alpha) = 0.5 * ||w||^2 - sum(alpha_i * (yi * (w^T * xi + b) - 1))
    • Then we let the partial derivatives of hL to w and b be 0 (KKT 1 and also minimizing hL over w, b to get theta_D), and we get:
      w = sum(alpha_i * yi * xi)
      sum(alpha_i * yi) = 0
    • Substituting back into the original function, we get: hL(w, b, alpha) = sum(alpha_i) - 0.5 * sum(yi * yj * alpha_i * alpha_j * xi^T * xj) = W(alpha)
    • Then we get the dual optimization problem: max_alpha(W(alpha)) s.t. alpha_i >= 0, i = 1, ..., m; sum(alpha_i * yi) = 0
    • why fit KKT 3 ???
    • If we can solve it, we then can get w^* through w = sum(alpha_i * yi * xi) and b^* through b^* = -0.5 * (max_(i:yi=-1)(w^(*T) * xi) + min_(i:yi=1)(w^(*T) * xi))
  7. When a test input x comes in, we need to calc w^T * x + b = sum(alpha_i * yi * xi)^T * x + b. Given the discussion above (only support vectors have non-zero alpha), what we really need to do is calc the inner product between the input x and the Support Vectors. — Support Vector Machine (SVM)

Kernels

  1. For inputs x (called input attributes below), we may consider performing training using features calculated from x (called input features below). We denote by phi the feature mapping (the function mapping attributes to the corresponding features)
  2. Because in the dual optimization problem, all we need is inner product between the input attributes x, we may instead want to use it (that is SVMs) to learn using these features phi(x)
  3. Def Kernel: K(x, z) = phi(x)^T * phi(z)
  4. Sometimes, computing K(x, z) directly costs much less than computing phi(x) first, because phi(x) is usually a high-dimensional vector
    Example: K(x, z) = (x^T * z + c)^d
  5. An understanding of Kernels: a measure of how similar phi(x) and phi(z) are, or of how similar x and z are
    • So we may directly write down the K(x, z) as what we think might be a reasonable measure of how similar x and z are
    • Moreover, to test whether the function we write can be a Kernel function — Theorem (Mercer)
      K is a valid (Mercer) kernel if and only if its corresponding matrix K (Kij = K(xi, xj)) is symmetric positive semi-definite for any m x’s
    • Note that the corresponding phi may be an infinite-dimensional vector
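A small numeric check of the example in item 4 for d = 2: K(x, z) = (x^T z + c)^2 equals phi(x)^T phi(z) for the explicit feature map below (this particular map is one standard choice):

```
import numpy as np

def phi(x, c):
    """Explicit feature map whose inner product equals (x^T z + c)^2."""
    pairwise = np.outer(x, x).ravel()           # all products x_i * x_j
    return np.concatenate([pairwise, np.sqrt(2 * c) * x, [c]])

rng = np.random.default_rng(0)
x, z, c = rng.standard_normal(5), rng.standard_normal(5), 1.0
print((x @ z + c) ** 2)          # kernel computed directly: O(n) time
print(phi(x, c) @ phi(z, c))     # same value via the O(n^2)-dimensional features
```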

Regularization and the non-separable case

  1. The assumption that the data is linearly separable may not hold
    Exactly separating the data may also not be what we want (e.g. because of outliers)
  2. In order to make the algorithm work for non-linearly separable datasets and be less sensitive to outliers: reformulate the optimization (using l_1 regularization)
    min_(xxi,w,b)(0.5 * ||w||^2 + C * sum(xxi_i)) s.t. yi * (w^T * xi + b) >= 1 - xxi_i, i = 1, ..., m; xxi_i >= 0, i = 1, ..., m
    Understanding: allow some points to have (functional) margin less than 1, reduced by xxi_i, but at a cost of C * xxi_i in the objective
    Then the dual optimization problem is max_alpha(W(alpha)) s.t. 0 <= alpha_i <= C, i = 1, ..., m; sum(alpha_i * yi) = 0 with the same W(alpha)

    Moreover, the KKT condition become:
    • alpha_i = 0 then yi * (w^T * xi + b) >= 1
    • alpha_i = C then yi * (w^T * xi + b) <= 1
    • 0 < alpha_i < C then yi * (w^T * xi + b) = 1

The SMO Algorithm: To solve dual problem

SMO: Sequential Minimal Optimization

  1. Previously introduced methods to solve a max/min problem: gradient ascent/descent, Newton's method
  2. Digression: Coordinate Ascent Algorithm to solve max problem (or min)
    Loop until convergence: {
        For i = 1, ..., m: {
            alpha_i := argmax_(alpha_i') W(alpha_1, ..., alpha_(i-1), alpha_i', alpha_(i+1), ..., alpha_m)
        }
    }
    
  3. SMO
    • Rewrite the problem: max_alpha(W(alpha)) s.t. 0 <= alpha_i <= C, i = 1, ..., m; sum(alpha_i * yi) = 0
    • Because of constraint 2 (sum(alpha_i * yi) = 0), any single alpha is determined by the others
    • So we need to update at least two of them simultaneously
      ```
      Repeat until convergence: {
      1. Select some pair alpha_i and alpha_j to update next (using a heuristic to pick the pair that lets us make the biggest progress towards the global maximum).
      2. Reoptimize W(alpha) with respect to alpha_i and alpha_j while holding all the other alpha_k's (k != i, j) fixed.
      }
      ```
    • To test for convergence of this algorithm, we can check whether the KKT conditions are satisfied to within some tolerance (typically set to 0.01 to 0.0001)
    • Performs well because each reoptimization step is very efficient
    • To explore: the heuristics for picking the pair, and how to update b
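A toy sketch of the coordinate ascent idea from item 2, maximizing a simple concave quadratic one coordinate at a time (the objective is made up purely for illustration; SMO itself must update two alphas at once because of the sum(alpha_i * yi) = 0 constraint):

```
import numpy as np

# maximize W(a) = -0.5 * a^T Q a + b^T a  (Q positive definite, so W is concave)
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
a = np.zeros(2)

for _ in range(50):                      # "repeat until convergence"
    for i in range(len(a)):
        # holding the other coordinates fixed, dW/da_i = 0 gives the exact 1-D maximizer
        a[i] = (b[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

print(a, np.linalg.solve(Q, b))          # coordinate ascent matches the exact maximizer
```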



Learning Theory

Bias/Variance Tradeoff

  1. Generalization error: its expected error on examples not necessarily in the training set
  2. Bias E[f_true(x) - f(x)] : the expected generalization error even if we were to fit it to a very (say, infinitely) large training set — Underfit if large
    Variance Var[f(x)] : — Overfit if large
  3. The total Mean Square Error (MSE): MSE = E[(y - f(x))^2] = sigma^2 + (Bias[f(x)])^2 + Var[f(x)] for y = f_true(x) + e where E[e] = 0; Var(e) = sigma^2

Learning Theory (Take binary classification as an example)

  1. Two Lemmas:
    • The union bound lemma;
    • The Hoeffding inequality (Chernoff bound) lemma
  2. PAC assumption 1: Training examples are drawn iid (independent and identically distributed) from some prob distribution hD
  3. Training error (also called Empirical risk or Empirical error): epsilon'(h) = (1 / m) * sum(1{h(xi) != yi})
    Generalization error: epsilon(h) = P_((x,y)~hD)(h(x) != y) (PAC assumption 2)
  4. Empirical risk minimization (ERM): a way to fit theta: theta' = argmin_theta(epsilon'(h_theta))
    Def the hypothesis class hH as the set of all classifiers under consideration (with some constraint), e.g. the set of linear classifiers. The argmin can then be taken over h in hH
  5. The case of finite hH
    • epsilon'(h) is a reliable estimate of epsilon(h) for all h in hH
      This implies an upper-bound on the generalization error of h'
    • Theorem. (If uniform convergence occurs) Let |hH| = k, and let any m, delta be fixed, then with prob at least 1 - delta, we have that
      epsilon(h') <= minepsilon(h) + 2 * sqr_root(log(2 * k / delta) / (2 * m))
    • The two parts of this bound correspond to the tradeoff between bias and variance
    • Sample complexity Corollary: Let |hH| = k, and let delta, gamma be fixed. Then for epsilon(h') <= minepsilon(h) + 2 * gamma to hold with prob at least 1 - delta, it suffices that
      m >= log(2 * k / delta) / (2 * gamma^2) = O(...)
  6. The case of infinite hH
    • we say that hH shatters S if hH can realize any labeling on S
    • Def Vapnik-Chervonenkis dimension of hH, written VC(hH), to be the size of the largest set that is shattered by H. (If hH can shatter arbitrarily large sets, then VC(hH) = infinity)
    • Vapnik Theorem (some believe it's the most important theorem): Let hH be given, and let d = VC(hH). Then with prob at least 1 - delta, we have that for all h belonging to hH:
      |epsilon(h) - epsilon'(h)| <= O(sqr_root(d * log(m / d) / m + log(1 / delta) / m)) also
      epsilon(h') <= epsilon(h^*) + O(sqr_root(d * log(m / d) / m + log(1 / delta) / m))
    • Corollary: For |epsilon(h) − epsilon'(h)| ≤ gamma to hold for all h ∈ H with probability at least 1 − delta, it suffices that m = O_(gamma, delta)(d)
    • Also, for most hypothesis classes, d is roughly linear in the number of parameters

Error Analysis

  1. How to determine which part of the pipeline has the biggest problem
    • Plug in the ground truth for each component (that is, do that component's work manually, or otherwise ensure it is done perfectly) and see how much the accuracy increases

Ablative Analysis

  1. How much did each of the improvements really help?
    • Start from the current level of performance, and remove the added features/components one at a time to see how much each removal affects performance

Mistakes Analysis

  1. Analyze what the mistakes composed of, and make improvement according to it



Regularization and Model Selection

Cross Validation — The method for Model Selection

  1. Hold-out (or Simple) Cross Validation:
    • Randomly split S into S_train and S_CV (called the hold-out cross validation set)
    • Train each model M_i on S_train only, to get some hypothesis h_i
    • Select and output the hypothesis h_i that had the smallest error epsilon'_(S_CV)(h_i) on the hold-out cross validation set
    • Optionally, step 3 in the algorithm may also be replaced with selecting the model M_i and then retraining M_i on the entire training set S
  2. k-Fold Cross Validation
    • Randomly split S into k disjoint subsets of m / k training examples each
    • Use one as validation set and others as training set each loop, and calc the mean epsilon'_(S_j)(h_i) for M_i
    • Pick the M_i corresponding to the smallest mean
    • Usually k = 10;
      Extreme case: k = m — Leave-One-Out Cross Validation
  3. These also can be used more simply to evaluate a single model or algorithm
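A sketch of k-fold cross validation used for model selection; here the "models" are polynomial fits of different degrees, an illustrative choice that is not from the notes:

```
import numpy as np

def kfold_mse(x, y, degree, k=10, seed=0):
    """Mean held-out squared error of a degree-`degree` polynomial fit, averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for j in range(k):
        val = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        coeffs = np.polyfit(x[train], y[train], degree)     # train on the other k-1 folds
        errors.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 1 + 2 * x - 3 * x ** 2 + 0.2 * rng.standard_normal(100)
for d in [1, 2, 5, 8]:
    print(d, kfold_mse(x, y, d))   # degree 2 should give (near) the lowest CV error
```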

Feature Selection — A special case of Model Selection

  1. For the case where the number of features is so large that overfitting easily occurs
  2. Total number of models is 2^n which is too large — need a kind of heuristic search procedure

  3. Forward Search (a little bit like greedy algorithm)
    • Initialize hF = ∅
    • Repeat till some pre-set threshold{
      • For i = 1, …, n: if i does not belong to hF, let hF_i = hF U {i}, and use some version of cross validation to evaluate the feature set hF_i
      • Set hF to be the best feature subset found on the above step
        }
    • Select and output the best feature subset that was evaluated during the entire search procedure
  4. Wrapper Model Feature Selection
  5. Backward Search

  6. Filter Feature Selection: much cheaper
    The idea here is to compute some simple score S(i) that measures how informative each feature x_i is about the class labels y. Then, we simply pick the k features with the largest scores S(i)
  7. One possible choice of the score would be to define S(i) as the correlation between x_i and y, as measured on the training data
  8. Mutual Information MI(x_i, y):
    • MI(x_i, y) = sum_(x_i)(sum_y(p(x_i, y) * log(p(x_i, y) / (p(x_i) * p(y)))))
    • p’s can all be estimated according to their empirical distributions on the training set
    • Expressed as a Kullback-Leibler (KL) divergence:
      MI(x_i, y) = KL(p(x_i, y) || p(x_i) * p(y))
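A sketch of the mutual information score for a binary feature, with the probabilities estimated from empirical counts; the add-one smoothing used to avoid log(0) is my own addition:

```
import numpy as np

def mutual_information(xi, y):
    """MI between a binary feature xi and binary label y, from (smoothed) empirical counts."""
    mi, m = 0.0, len(y)
    for a in (0, 1):
        for b in (0, 1):
            p_ab = (np.sum((xi == a) & (y == b)) + 1) / (m + 4)   # smoothed joint
            p_a = (np.sum(xi == a) + 1) / (m + 2)                 # smoothed marginals
            p_b = (np.sum(y == b) + 1) / (m + 2)
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
informative = (rng.uniform(size=1000) < np.where(y == 1, 0.9, 0.1)).astype(int)
noise = rng.integers(0, 2, 1000)
print(mutual_information(informative, y), mutual_information(noise, y))  # high vs ~0
```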

Bayesian Statistics and Regularization

  1. Frequentist: Not random just happen to be unknown;
    Bayesian: Random and unknown
  2. Prior Distribution: p(theta)
    p(theta | S) = p(S | theta) * p(theta) / p(S) = pi(p(yi | xi, theta)) * p(theta) / integral_theta(pi(p(yi | xi, theta)) * p(theta))dtheta
    Posterior Distribution: p(y | x, S)
    p(y | x, S) = integral_theta(p(y | x, theta) * p(theta | S))dtheta
  3. MAP (maximum a posteriori) estimate for theta:
    theta_MAP = argmax(pi(p(yi | xi, theta) * p(theta)))
  4. Similar to the ML (maximum likelihood) estimate for theta in the Frequentist view:
    theta_ML = argmax(pi(p(yi | xi, theta)))
  5. In practice, a common choice of p(theta) is to assume that theta ~ N(0, tao^2 * I)
    Using this, the MAP estimate has a smaller norm than the ML estimate and is less susceptible to overfitting (see the sketch below)
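For linear regression with Gaussian noise, the prior theta ~ N(0, tao^2 * I) turns the MAP estimate into ridge regression (with lambda = sigma^2 / tao^2); a small sketch, with an illustrative lambda:

```
import numpy as np

def theta_map(X, y, lam):
    """Ridge / MAP estimate: argmin ||X theta - y||^2 + lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def theta_ml(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))          # few examples, many features
y = X[:, 0] + 0.5 * rng.standard_normal(30)
print(np.linalg.norm(theta_ml(X, y)), np.linalg.norm(theta_map(X, y, lam=5.0)))
# the MAP estimate has the smaller norm, as claimed above
```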



Decision Tree & Ensembling Methods

Decision Trees: a non-linear, region-based algorithm

  1. Def of Non-linearity
    A simple example: predicting "can ski or not" from latitude & month requires selecting regions of the input space, and exact region selection is intractable
  2. Decision Tree: An approximate solution via greedy, top-down, recursive partitioning
    • Top-down: We start with the original input space and split it into two child regions by thresholding on a single feature
    • Recursive: We then take one of these child regions and can partition via a new threshold, and continue to train our model in a recursive manner
      (always selecting a leaf node, a feature, and a threshold to form a new split)
    • We can continue in such a manner until we meet a given stopping criterion, and then predict the majority class at each leaf node

  3. The Loss Function of Decision Tree
    • When the parent region R_p gets split into two child regions R_1, R_2, the decrease in loss is:
      L(R_p) - (|R_1| * L(R_1) + |R_2| * L(R_2)) / (|R_1| + |R_2|) (cardinality-weighted sum)
  4. A straight-forward L that we can think about: Misclassification Loss L_misclass
    • L_misclass(R) = 1 - max(p'_c)
    • p'_c is the proportion of examples in the region that belong to class c; if we predict the majority class for the whole region, the misclassification rate is 1 - max_c(p'_c)
    • But it is not sensitive enough, sometimes: not only are the losses of the two splits identical, but neither of the splits decrease the loss over that of the parent
  5. A more sensitive loss: Cross-Entropy L_cross
    • L_cross(R) = - sum_c(p'_c * log(p'_c))
    • From an information-theoretic perspective, cross-entropy measures the number of bits needed to specify the outcome (or class) given that the distribution is known
    • Furthermore, the reduction in loss from parent to child is known as information gain
  6. The regression setting for Decision Trees
    • To predict, instead of using the majority class of the y's in region R, we use their mean: y' = sum_(i in R)(y_i) / |R|
    • Then we can directly use the squared loss L_squared:
      L_squared(R) = sum_(i in R)(y_i - y')^2 / |R|

  7. A unique advantage: Can deal with categorical variables
  8. Regularization or Stopping Criteria
    • Minimum Leaf Size
    • Maximum Depth
    • Maximum Number of Nodes
    • A "minimum decrease in loss after split" criterion is not good, because it can miss higher-order interactions (cases where thresholding on multiple features is needed before any single split looks good)
    • Fully growing out the tree, and then pruning away nodes that minimally decrease misclassification or squared error, as measured on a validation set
  9. Runtime
  10. Disadvantages:
    • High variance (use ensembling, introduced next, to address it)
    • Splits are axis-aligned (feature-based), so the model cannot exploit rotations of the input; a diagonal decision boundary can only be approximated by many horizontal and vertical splits
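A sketch comparing misclassification loss and cross-entropy on two candidate splits of the same parent region, illustrating the sensitivity point in items 4-5 (the class counts are illustrative):

```
import numpy as np

def misclassification(counts):
    p = counts / counts.sum()
    return 1.0 - p.max()

def cross_entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]                       # 0 * log 0 = 0 by convention
    return -np.sum(p * np.log(p))

def split_loss(loss, children):
    # cardinality-weighted sum of the child losses
    sizes = np.array([c.sum() for c in children], dtype=float)
    return np.sum(sizes / sizes.sum() * np.array([loss(c) for c in children]))

parent = np.array([400, 100])                                  # 400 positives, 100 negatives
split_a = [np.array([300, 100]), np.array([100, 0])]           # two candidate splits
split_b = [np.array([200, 100]), np.array([200, 0])]
for loss in (misclassification, cross_entropy):
    print(loss.__name__, loss(parent),
          split_loss(loss, split_a), split_loss(loss, split_b))
# misclassification gives identical values for both splits (and no decrease over the parent);
# cross-entropy decreases, and distinguishes the two splits
```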

Ensembling Methods

  1. By bias/variance analysis
    • Ensembling will lead to a decrease in variance of the error
      Var(X_mean) = rou * sigma^2 + (1 - rou) * sigma^2 / n
      where rou is a measurement of correlation between each model
    • Both increasing the number of models used in the ensemble and decreasing the correlation between models cause an overall decrease in the variance of the ensemble's error
  2. Ways to generate de-correlated models
    • Using different algorithms
    • Using different training sets
    • Bagging
    • Boosting

  3. Bagging: ‘Bootstrap Aggregation’
    • Bootstrap:
      True population P, A training set S sampled from P ( S ~ P )
      we can generate many new bootstrap sets Z_i (i = 1, …, n) sampled with replacement from S ( Z ∼ S, |Z| = |S| ) ???
      We can then look at the variability of our estimate across these bootstrap sets to obtain a measure of error
    • Aggregation:
      we can take each Z_i and train a machine learning model G_i on each, and define a new aggregate predictor: G(X) = sum_i(G_i(x) / n)

    • This decreases variance by decreasing rou, compared to simply training every model on S itself
    • Note that rou is independent of n, so increasing n only decreases the variance

    • Out-of-bag estimation:
      • each bootstrapped sample only contains approximately 2/3 of S, and thus we can use the other 1/3 as an estimate of error, called out-of-bag error
      • In the limit, as n -> infinity, out-of-bag error gives an equivalent result to leave-one-out cross-validation

    • Use with Decision Trees:
      • Works well
      • Support for missing values
      • Lose interpretability
      • Variable importance measure
      • Random forests: only allow a subset of features to be used at each split, to prevent a very strong predictor (feature) from causing large rou

  4. Boosting: used for bias reduction
    • Get high-bias, low-variance models (weak learners) by only allowing decision trees to split once (Decision Stumps):
      The key idea is that after one round of training, we track which examples the classifier got wrong, increase their relative weight compared to the correctly classified examples, and then do the next round of training

    • Adaboost
      • Pseudocode (N training examples)
          w_i := 1 / N, for i = 1, 2, ..., N
          for m = 0 to M do
          Fit weak classifier G_m to training data weighted by w_i
          Compute weighted error err_m = sum_i(w_i * 1{y_i != G_m(x_i)}) / sum(w_i)
          Compute weight alpha_m = log((1-err_m) / err_m)
          w_i := w_i * exp(alpha_m * 1{y_i != G_m(x_i)})
          end
          f(x) = sign(sum_m(alpha_m * G_m(x)))
        
      • The G’s are not independent, meaning that increasing M leads to an increase in the risk of overfitting
    • Forward Stagewise Additive Modeling
      • Pseudocode
          Initialize f_0(x) = 0
          for m = 0 to M do
          Compute (beta_m, gamma_m) = argmin(sum_i(L(y_i, f_(m-1)(x_i) + beta * G(x_i; gamma))))
          Set f_m(x) = f_(m-1)(x) + beta_m * G(x; gamma_m)
          end
          f(x) = f_M(x)
        
      • Intuition: every step tries to find the weak learner that best fits what the current model still gets wrong (the residual) ???
      • Adaboost is a special case of it
    • Xgboost
    • Gradient Boosting: Use gradient descent to solve the min problem. But we are restricted to taking steps in our model class
      • We instead compute the gradient at each training point with respect to the current predictor
        g_i = pdL(y, f(x_i)) / pdf(x_i)
      • Then train a new regression predictor to match this gradient and use it as the gradient step
        In the Forward Stagewise Additive Modeling notation: gamma_m = argmin_gamma(sum_i((g_i - G(x_i; gamma))^2))
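A sketch of the AdaBoost pseudocode above, using simple one-feature threshold stumps as the weak learners; the stump search and the synthetic data are my own illustrative choices, not part of the notes:

```
import numpy as np

def fit_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] < t, -sign, sign)
                err = np.sum(w * (pred != y)) / np.sum(w)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, M=20):
    w = np.full(len(y), 1.0 / len(y))
    stumps, alphas = [], []
    for _ in range(M):
        err, j, t, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)
        pred = np.where(X[:, j] < t, -sign, sign)
        w *= np.exp(alpha * (pred != y))          # up-weight the mistakes
        stumps.append((j, t, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    score = sum(a * np.where(X[:, j] < t, -s, s) for a, (j, t, s) in zip(alphas, stumps))
    return np.sign(score)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)   # circular boundary: hard for one stump
stumps, alphas = adaboost(X, y)
print((predict(X, stumps, alphas) == y).mean())          # training accuracy well above a single stump
```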



Deep Learning

Neural Networks

  1. Introduction to building Neural Networks:
    • Introduction to ReLU
    • Introduction to feature generation
    • Introduction to stacking layers
    • End-to-End Learning
  2. Notations:
    • Inputs: x_i — the i-th input feature; x_i = a[0]_i
    • Hidden layer outputs: a[l]_j — the j-th unit in layer l
    • foo[l] means anything associated with layer l
    • w — weight vectors; W — weight matrix; W_i — the i-th row of W
  3. A neuron consists of a linear function followed by a non-linear activation
    z[l]_i = W[l]_i^T * a[l-1] + b[l]_i (where a0 = x)
    a[l]_i = g(z[l]_i)

Vectorization

  1. Calc with vectorization
    z[l] = W[l] * a[l-1] + b[l] and a[l] = g(z[l]) (with a[0] = x)
  2. The necessity of using non-linearity
  3. Moreover, stacking the m examples as columns: Z[l] = W[l] * A[l-1] + (broadcast)b[l], with A[0] = X (see the sketch below)
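A vectorized forward pass for a small 2-layer network, matching the Z[l] = W[l] * A[l-1] + b[l] convention above; the layer sizes and random parameters are placeholders:

```
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """X has one example per column (shape n x m), matching the batched notation Z = W A + b."""
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1          # b1 broadcasts across the m columns
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2
    return sigmoid(Z2)        # final layer: probability of the positive class

rng = np.random.default_rng(0)
n, hidden, m = 3, 4, 5                               # 3 features, 4 hidden units, 5 examples
params = (rng.standard_normal((hidden, n)), np.zeros((hidden, 1)),
          rng.standard_normal((1, hidden)), np.zeros((1, 1)))
print(forward(rng.standard_normal((n, m)), params))  # shape (1, m)
```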

Backpropagation

  1. Parameter Initialization
    • Problem: symmetry (if all weights start equal, every unit in a layer computes the same function and stays identical)
    • Solution:
      • Random Initialization
      • Xavier/He Initialization
        Denote by n[l] the number of neurons in layer l; then w[l] ~ hN(0, 2 / (n[l] + n[l + 1])), i.e. variance 2 / (n[l] + n[l+1])
  2. Optimization
    • Backpropagation
    • Mini-batch Gradient Descent
      Momentum
  3. Regularization
    • L2 Regularization
      J_L2 = J + lambda * ||W||^2 / 2
    • Parameter Sharing: Convolutional Neural Networks



Unsupervised Learning Problem

k-Means clustering algorithm

  1. we are given a training set and want to group the data into a few cohesive “clusters”. Which means no labels yi are given
  2. The k-mean clustering algorithm
    ```
    1. Initialize cluster centroids miu_i, i=1, …, k randomly
    2. Repeat until convergence: {
        For every i, set: ci := argmin_j(||xi - miu_j||^2)
        For every j, set: miu_j := sum_i(1{ci=j} * xi) / sum_i(1{ci=j})
    }
    ```
  3. Distortion Function
    • J(c,miu) = sum_i(||xi - miu_ci||^2)
    • What we are doing is repeatedly minimizing J with respect to c and with respect to miu, holding the other fixed
    • But J is a non-convex function, so the algorithm may get stuck in a local minimum
    • We can run it many times with different random initializations and keep the clustering that gives the lowest distortion (see the sketch below)
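A compact sketch of the k-means loop above (random-sample initialization; a production version would also guard against empty clusters and run several restarts):

```
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]          # initialize centroids from the data
    for _ in range(iters):
        # assignment step: c_i = argmin_j ||x_i - mu_j||^2
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
        # update step: mu_j = mean of the points assigned to cluster j
        mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
    return c, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, (100, 2)) for loc in ([0, 0], [3, 0], [0, 3])])
c, mu = kmeans(X, k=3)
print(np.round(mu, 2))   # centroids near [0,0], [3,0], [0,3] (in some order)
```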

Mixtures of Gaussians and the EM algorithm

  1. Mixture of Gaussians:
    What we want is to specify a joint distribution p(xi, zi) = p(xi | zi) * p(zi)
    Where xi’s is the input training set, zi ~ Multinomial(phi) is a latent variable and xi | zi = j ~ hN(miu_j, Sigma_j), let k denote the number of values that the zi’s can take on
  2. Likelihood:
    l(phi, miu, Sigma) = sum_i(log(p(xi; phi, miu, Sigma))) = sum_i(log(sum_zi(p(xi | zi; miu, Sigma) * p(zi; phi))))
  3. zi’s are unknown, so it’s difficult to solve. But if zi’s are known, then this become a classification problem, and easy to solve.
    • phi_j = sum_i(1{zi = j}) / m (m, the number of inputs)
    • miu_j = sum_i(1{zi = j} * xi) / sum_i(1{zi = j})
    • Sigma_j = sum_i(1{zi = j} * (xi - miu_j)(xi - miu_j)^T) / sum_i(1{zi = j})

  4. The EM Algorithm:
    • E-step: try to guess the values of the zi’s
      M-step: Update the parameters of our model based on our guesses
        Repeat until convergence: {
            (E) For each `i, j`, set
                wi_j := p(zi = j | xi; phi, miu, Sigma)
            (M) Update the parameters by 
                phi_j := sum_i(wi_j) / m
                miu_j := sum_i(wi_j * xi) / sum_i(wi_j)
                Sigma_j := sum_i(wi_j * (xi - miu_j)(xi - miu_j)^T) / sum_i(wi_j)
        }
      

      Where p(zi = j | xi; phi, miu, Sigma) = p(xi | zi = j; miu, Sigma) * p(zi = j; phi) / sum_l(p(xi | zi = l; miu, Sigma) * p(zi = l; phi))

    • Compared with k-mean,
      • k-means use the “hard” cluster assignments ci
      • EM use the “soft” assignments wi_j
      • Similar to K-means, it is also susceptible to local optima, so reinitializing at several different initial parameters may be a good idea
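A sketch of the E/M updates above for a mixture of Gaussians; the initialization is a simple random choice, and (as noted above) a real run would restart from several initial parameters:

```
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    d = X.shape[1]
    diff = X - mu
    inv, det = np.linalg.inv(Sigma), np.linalg.det(Sigma)
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / np.sqrt((2 * np.pi) ** d * det)

def gmm_em(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, k, replace=False)]
    Sigma = np.array([np.cov(X.T) for _ in range(k)])
    for _ in range(iters):
        # E-step: w_ij = p(z_i = j | x_i; phi, mu, Sigma)
        w = np.column_stack([phi[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: update phi, mu, Sigma using the soft assignments
        phi = w.mean(axis=0)
        mu = (w.T @ X) / w.sum(axis=0)[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / w[:, j].sum()
    return phi, mu, Sigma

rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], 0.2 * np.eye(2), 150),
               rng.multivariate_normal([3, 3], 0.5 * np.eye(2), 150)])
phi, mu, Sigma = gmm_em(X, k=2)
print(np.round(phi, 2), np.round(mu, 2))   # weights ~[0.5, 0.5], means near the true centers
```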



The EM Algorithm: A More General View

Jensen’s Inequality

  1. If f''(x) > 0 for all x (in the vector-valued case, the corresponding statement is that H must be positive definite, written H > 0), then we say f is strictly convex

  2. Jensen’s Inequality Theorem:
    let f be a convex function, and let X be a random variable. Then:
    E[f(X)] >= f(E[X])
    Moreover, if f is strictly convex, then = holds true if and only if X = E[X] with probability 1 (i.e., if X is a constant)

  3. Concave version of Jensen’s Inequality

The EM Algorithm

  1. Given a training set {x1, ..., xm}, we wish to fit the parameters of a model p(x, z) to the data, where the likelihood is given by:
    l(theta) = sum_i(log(p(xi; theta))) = sum_i(log(sum_zi(p(xi, zi; theta))))
    Where zi’s are the latent random variables

  2. The EM algorithm:
    • Idea: As explicitly maximizing l(theta) might be difficult, we instead repeatedly construct a lower-bound on l (E-step), and then optimize that lower-bound (M-step)
    • For each i, let Q_i be some distribution over the z’s (sum_z(Q_i(z)) = 1, Q_i(z) >= 0). Then
      sum_i(log(p(xi; theta))) =
      sum_i(log(sum_zi(p(xi, zi; theta)))) =
      sum_i(log(sum_zi(Q_i(zi) * p(xi, zi; theta) / Q_i(zi)))) >=
      sum_i(sum_zi(Q_i(zi) * log(p(xi, zi; theta) / Q_i(zi))))
      Which give a lower bound for any distribution Q_i
    • For any fixed theta, we’ll try to make the bound tight, which is to say, make the equality hold:
      p(xi, zi; theta) / Q_i(zi) = c
      Also sum_z(Q_i(z)) = 1
      We have Q_i(zi) = p(xi, zi; theta) / sum_z(p(xi, z; theta)) = p(xi, zi; theta) / p(xi; theta) = p(zi | xi; theta)
    • Above gives us the EM algorithm
        Repeat until convergence: {
            (E) For each i, set
                Q_i(zi) := p(zi | xi; theta)
            (M) Set
                theta := argmax(sum_i(sum_zi(Q_i(zi) * log(p(xi, zi; theta) / Q_i(zi)))))
        }
      
    • One can prove that the algorithm converges; a reasonable convergence test would be to check whether the increase in l(theta) between successive iterations is smaller than some tolerance parameter, and to declare convergence if EM is improving l(theta) too slowly

Specified with the Example Before: Mixture of Gaussians



Factor Analysis

  1. Unless m exceeds n by some reasonable amount, the maximum likelihood estimates of the mean and covariance may be quite poor

Restrictions of Sigma

  1. If we do not have sufficient data to fit a full covariance matrix, we may place some restrictions on the space of matrices Σ that we will consider

  2. Fit a covariance matrix Σ that is diagonal: (m is the number of x)
    Sigma_jj = sum_i((xi_j - miu_j)^2) / m
    A diagonal Sigma corresponds to a Gaussian where the major axes of these ellipses are axis-aligned
  3. Sometimes, we may place a further restriction on the covariance matrix: not only must it be diagonal, but its diagonal entries must all be equal (m is the number of x's, n is the number of dimensions)
    Sigma = sigma^2 * I; sigma^2 = sum_j(sum_i((xi_j - miu_j)^2)) / (m * n)
    This model corresponds to using Gaussians whose densities have contours that are hyperspheres

  4. Restricting Σ to be diagonal also means modeling the different coordinates x_i, x_j of the data as being uncorrelated and independent
    Therefore, it will fail to capture any correlations in the data

Marginals and Conditionals of Gaussians

  1. Partitioned (block) multivariate Gaussian distributions and their marginals:
    • x=(x_1, ..., x_n)^T , miu=(miu_1,..., miu_n)^T
      Sigma = (Sigma_ij)_n*n, where Sigma_ij = E[(x_i - miu_i) * (x_j - miu_j)^T]
    • Marginal distributions of Gaussians are themselves Gaussian: x_i ~ hN(miu_i, Sigma_ii)
    • The conditional distribution of x_i given x_j: x_i | x_j ~hN(miu_i|j, Sigma_i|j) , where miu_i|j = miu_i + Sigma_ij * (Sigma_jj)^-1 * (x_j - miu_j) ; Sigma_i|j = Sigma_ii - Sigma_ij * (Sigma_jj)^-1 * (Sigma_ji)

The Factor Analysis Model

  1. We posit a joint distribution on (x, z) as follows, where z (k-D) is a latent random variable: (x is n-D, and usually k < n)
    z ~ hN(0, I) ; x|z ~ hN(miu + Lamda * z, Psi) (Psi is a diagonal matrix)
  2. Or equivalently:
    z ~ hN(0, I) ; epsilon ~ hN(0, Psi) ; x = miu + Lamda * z + epsilon

  3. Let’s work out exactly what distribution our model defines:
    • First, there’s a joint distribution:
      (z, x)^T ~ hN(miu_zx, Sigma)
    • Find miu_zx:
      We have: E[z] = E[epsilon] = 0
      Then E[x] = miu + Lamda * E[z] + E[epsilon] = miu
      So miu_zx = (0(k-D), miu)^T
    • Find Sigma:
      Sigma_zz = I
      Sigma_zx = E[(z - E[z]) * (x - E[x])^T] = E[z * (miu + Lamda * z + epsilon - miu)^T] = E[z * z^T] * Lamda^T + E[z * epsilon^T] = Lamda^T
      Sigma_xz = Lamda
      Sigma_xx = Lamda * Lamda^T + Psi
    • Maximize likelihood: x ~ hN(miu, Lamda * Lamda^T + Psi)

EM for Factor Analysis

  1. Using the former equations to calc miu_zi|xi; Sigma_zi|xi
    Then generate the expression of Q_i(zi)
  2. Then we can get how the optimization is done in M-step
  3. Note that E[z * z^T] and E[z] * E[z^T] are different



Principal Components Analysis (PCA)

  1. For a dataset xi, i = 1, ..., m, where each xi is n-dimensional, suppose the data really lies (approximately) on a subspace of much lower dimension k << n, unknown to us.
    This means that the n coordinates we observe for x are largely redundant, i.e. some of the features can (approximately) be derived from the others
    Equivalently, the inputs approximately lie on some hyperplane / direction / diagonal axis in the high-dimensional space

  2. First we do the centering (zero-meaning) and normalizing of the input xi’s
    Centering can be omitted if the input is already zero mean; Normalizing can be omitted if the different attributes are all on the same scale
  3. Then to find the ‘axis’, one way to do so is as finding the unit vector u ( |u| = 1 ) so that when the data is projected onto the direction corresponding to u, the variance of the projected data is maximized
    u = argmax(sum_i((xi^T * u)^2) / m) = argmax(u^T * (sum_i(xi * xi^T) / m) * u)
    • Where as we zero-meaned the input x , we have Sigma = sum_i(xi * xi^T) / m is just the covariance matrix of the data
    • We easily recognize that what we want to do here is to find the principal eigenvector of Sigma
  4. More generally, if we wish to project our data into a k-dimensional subspace ( k < n ), we should choose u_1, ..., u_k to be the top k eigenvectors of Sigma
    And the u_i’s now form a new, orthogonal basis for the data
  5. Then to represent xi in this basis, we need only compute the corresponding vector yi :
    yi = (u_1^T * xi, ..., u_k^T * xi)^T

  6. PCA is also called a dimensionality reduction algorithm.
    The vectors u_i’s are called the first k principal components of the data

  7. Application:
    • Compression
    • Plot and view
    • Reduce dimension before training
    • Noise reduction algorithm
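A sketch of PCA via the eigendecomposition of the empirical covariance matrix, following steps 2-5 above (equivalently one could use the SVD of the centered data matrix):

```
import numpy as np

def pca(X, k):
    """Return the top-k principal components and the projected data."""
    X = X - X.mean(axis=0)                         # center (zero-mean) the data
    X = X / X.std(axis=0)                          # normalize each attribute's scale
    Sigma = X.T @ X / len(X)                       # empirical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigh: Sigma is symmetric
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors (principal components)
    return U, X @ U                                # y_i = (u_1^T x_i, ..., u_k^T x_i)

rng = np.random.default_rng(0)
t = rng.standard_normal(200)
X = np.column_stack([t, 2 * t + 0.05 * rng.standard_normal(200),
                     0.5 * rng.standard_normal(200)])   # first two features are nearly redundant
U, Y = pca(X, k=1)
print(U[:, 0])   # the first principal component loads mainly on the two correlated features
```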

Independent Components Analysis (ICA)

  1. Some data s (n dimension) that is generated via n independent sources
    What we observe is x = A * s where A is an unknown square matrix called the mixing matrix
    The dataset is { xi; i = 1,...,m }, and our goal is to recover the sources si that generated our data ( xi = A * si )
    Let W = A^-1 be the unmixing matrix, our goal is to find W (and let w_i^T denote the i-th row of W)

ICA Ambiguities

  1. There are some inherent ambiguities in A that are impossible to recover, given only the xi’s
    • Permutation
    • Scaling
    • For Gaussian sources s, because of the rotational symmetry of the Gaussian distribution, there will be an arbitrary rotational component in the mixing matrix that cannot be determined from the data
  2. As long as the data is not Gaussian, it is possible, given enough data, to recover the n independent sources

Densities and linear transformations

  1. Let p_s(s) be the density of s, and p_x(x) be the density of x, with other notation inherited from above. Then we have:
    p_x(x) = p_s(W * x) * |W|
    Where |W| is the determinant of the matrix W

ICA Algorithm

  1. The derivation of ICA:
    • We suppose that the distribution of each source s_i is given by a density p_s, and that the joint distribution of the sources s is given by
      p(s) = pi_i(p_s(s_i))
      Note that by modeling the joint distribution as a product of the marginal, we capture the assumption that the sources are independent
    • Then we get:
      p(x) = pi_i(p_s(w_i^T * x)) * |W|
      And all we need to do is to specify a density for the individual sources p_s
    • To do so, all we need to do is to specify some cumulative distribution function (cdf) for it
      As we cannot choose the cdf of the Gaussian distribution, we choose the sigmoid function
      Then p_s(s) = g'(s)
    • The likelihood:
      hl(W) = sum_i(sum_j(log(g'(w_j^T * xi))) + log(|W|))
    • Knowing that grad_W(|W|) = |W| * (W^-1)^T
      We can derive that W := W + alpha * ([1 - 2 * g(w_i^T * xi)]_n*1 * xi^T + (W^T)^-1)
      Where alpha is the learning rate



Reinforcement Learning and Control

Markov Decision Process (MDP)

  1. A MDP is a tuple (S, A, {P_sa}, gamma, R) where
    • S is a set of states
    • A is a set of actions
    • P_sa are the state transition probabilities, gives the distribution over what states we will transition to if we take action a in state s
    • gamma belongs to [0,1) is called discount factor
    • R: S (x A) >-> R is the reward function

  2. The dynamics of an MDP proceeds as follows:
    We start in some state s_0, and get to choose some action a_0 ∈ A to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state s_1, drawn according to s_1 ∼ P_s0a0. Then, we get to pick another action a_1 ……
  3. Upon visiting the sequence of states s_0, s_1, ... with actions a_0, a_1, ... , our total payoff is given by
    R(s_0, a_0) + gamma * R(s_1, a_1) + gamma^2 * R(s_2, a_2) + ...
    Or R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ... (take this as the example below for simplicity)
  4. Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff
    E[R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ...]

  5. Def: Policy
    A policy is any function pi: S >-> A mapping from the states to the actions
  6. Def: Value function
    The value function for a policy pi is
    Vpi(s) = E[R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ... | s_0 = s, pi]
  7. Def: Optimal Value function
    V^*(s) = max_pi(Vpi(s))

  8. Bellman equations: (The value function satisfies)
    • Vpi(s) = R(s) + gamma * sum_s'(P_(spi(s))(s') * Vpi(s')) = R(s) + gamma * E_(s'~P_spi(s))[Vpi(s')]
    • Note: Bellman’s equations can be used to efficiently solve for Vpi. Specifically, in a finite-state MDP, we can write down one such equation for Vpi(s) for every state s. This gives us a set of |S| linear equations in |S| variables which can be efficiently solved
    • Bellman equation for the optimal value function:
      V^*(s) = R(s) + max_a(gamma * sum_s'(P_sa(s') * V^*(s')))
  9. Def: pi^* (Equation (3))
    pi^*(s) = argmax_a(sum_s'(P_sa(s') * V^*(s')))
  10. Then we get for every state s and every policy pi :
    V^*(s) = Vpi^*(s) >= Vpi(s)
    Note that pi^* has the interesting property that it is the optimal policy for all states s

Value iteration and policy iteration

For now, we will consider only MDPs with finite state and action spaces

  1. The Value Iteration
    • the process
      ```
      1. For each state s, initialize V(s) := 0.
      2. Repeat until convergence: {
          For every state, update V(s) := R(s) + max_a(gamma * sum_s'(P_sa(s') * V(s')))
      }
      ```
    • Synchronous and Asynchronous updates
    • Having found V^* , we can then use Equation (3) to find the optimal policy

  2. The Policy Iteration
    • The process
      ```
      1. Initialize pi randomly.
      2. Repeat until convergence: {
          (a) Let V := Vpi
          (b) For each state s, let pi(s) := argmax_a(sum_s'(P_sa(s') * V(s')))
      }
      ```
    • Note that step (a) can be done via solving Bellman’s equations as described earlier
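A sketch of synchronous value iteration on a tiny made-up MDP (two states, two actions; the transition probabilities and rewards are arbitrary illustrative numbers):

```
import numpy as np

# P[a][s, s'] = P_sa(s'): transition probabilities for a 2-state, 2-action MDP
P = np.array([[[0.9, 0.1],     # action 0, from state 0 / state 1
               [0.8, 0.2]],
              [[0.2, 0.8],     # action 1, from state 0 / state 1
               [0.1, 0.9]]])
R = np.array([0.0, 1.0])       # reward depends only on the state, R(s)
gamma = 0.9

V = np.zeros(2)
for _ in range(200):           # repeat until (approximate) convergence
    # V(s) := R(s) + max_a gamma * sum_s' P_sa(s') V(s')   (synchronous update)
    V = R + gamma * np.max(P @ V, axis=0)

pi = np.argmax(P @ V, axis=0)  # Equation (3): greedy policy w.r.t. V*
print(np.round(V, 3), pi)      # optimal values and the optimal action per state
```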

Learning a model for an MDP

  1. In many realistic problems, we are not given state transition probabilities and rewards explicitly, but must instead estimate them from data

  2. Given that the MDP consisting of a number of trials, we can then easily derive the maximum likelihood estimates for the state transition probabilities
    P_sa(s') = No. of times we took action a in state s and got to s' / No. of times we took action a in state s
    Or if the ratio above is 0 / 0, we might simply estimate P_sa(s') to be 1 / |S|
  3. if R is unknown, we can also pick our estimate of the expected immediate reward R(s) in state s to be the average reward observed in state s

Continuous state MDPs

We now discuss algorithms for MDPs that may have an infinite number of states

  1. Discretization
    Downsides:
    • it assumes that the value function takes a constant value over each of the discretization intervals
    • curse of dimensionality

  2. Value function approximation: Using a model or simulator
    • a simulator is a black-box that takes as input any (continuous-valued) state s_t and action a_t, and outputs a next-state s_(t+1) sampled according to the state transition probabilities P_stat
    • Several ways to get such model:
      • Physics simulation
      • Get a model is to learn one from data collected in the MDP
        Deterministic / Stochastic model
  3. Value function approximation: Fitted value iteration algorithm
    • Assume that the problem has a continuous state space S of n-D, but that the action space A is small and discrete
    • Recall that in value iteration
      Vpi(s) = R(s) + gamma * integral_s'(P_(spi(s))(s') * Vpi(s'))ds' = R(s) + gamma * E_(s'~P_spi(s))[Vpi(s')]
    • The main idea:
      • Approximately carry out this step, over a finite sample of states si using machine learning
      • Specifically, we will use a supervised learning algorithm—linear regression in our description below—to approximate the value function as a linear or non-linear function of the states:
        V(s) = theta^T * phi(s) , where phi is some appropriate feature mapping of the states
    • For each state s in our finite sample of m states, fitted value iteration will first compute a quantity yi , which will be our approximation to R(s) + gamma * max_a(E_(s'∼P_sa)[V(s')]) . Then, it will apply a supervised learning algorithm to try to get V(si) = theta^T * phi(si) close to yi
    • Cannot be proved to always converge
    • When we need to choose an action for some state s, the process for computing/approximating it is similar to the inner loop of fitted value iteration
    • The method for choosing k above (the number of next-states sampled from P_sa to approximate each expectation)



LQR, DDP and LQG

Linear Quadratic Regulation, Differential Dynamic Programming and Linear Quadratic Gaussian

Finite-horizon MDPs

  1. To place ourselves in a more general setting:
    • We rewrite the sum or integral in the expectation form: E_(s'~P_sa)[Vpi^*(s')]
    • Assume the reward also depends on the action, R(s, a); then the reward function goes inside the max or argmax
    • Finite horizon: we set a finite ending time T
      Therefore, our definition of payoff is going to be slightly different: ( gamma become not necessary)
      R(s_0, a_0) + R(s_1, a_1) + ...+ R(s_T, a_T)

  2. In this new setting, things behave quite differently:
    • First, the optimal policy pi^* might be non-stationary, meaning that it changes over time:
      pit : S -> A (e.g. a_0 = pi0(s_0) )
    • Time Dependent dynamics:
      s_(t+1) ~ Pt_stat
      Rt(s_t, a_t)
      Then (S, A, Pt_sa, T, Rt)
    • Def Value function and optimal value function as before

  3. Bellman’s equation is made for Dynamic Programming
  4. Solve for the optimal value function:
    ```
    1. Compute V^*_T(s) := max_a(RT(s, a))
    2. For t = T-1, ..., 0: compute V^*_t(s) := max_a(Rt(s, a) + E_(s'~Pt_sa)[V^*_(t+1)(s')])
    ```

Linear Quadratic Regulation (LQR)

  1. Assumptions:
    • S n-D, A d-D
    • Linear transitions with noise:
      s_(t+1) = A_t * s_t + B_t * a_t + w_t
      where Gaussian noise w_t ~ hN(0, Sigma_t)
    • Quadratic Rewards:
      Rt(s_t, a_t) = -s_t^T * U_t * s_t - a_t^T * W_t * a_t
      where U_t, W_t are both positive definite matrices
      Note that the quadratic formulation of the reward is equivalent to saying that we want to take smooth action to make our state be close to the origin
  2. Steps:
    • step 1: If we need to estimate A, B, Sigma, first, collect transitions from an arbitrary policy. Then, use linear regression to find argmin_(A, B)(sum_i(sum_t(||si_(t+1) - (A * si_t + B * ai_t)||^2))) . Finally, use a technique seen in Gaussian Discriminant Analysis to learn Sigma
    • Step 2: Derive the optimal policy using dynamic programming
      • Fact 1: It can be shown that if V^*_(t+1) is a quadratic function in s_t , then V^*_t is also a quadratic function:
        If V^*_(t+1)(s_(t+1)) = s_(t+1)^T * phi_(t+1) * s_(t+1) + psi_(t+1)
        Then V^*_t(s_t) = s_t^T * phi_t * s_t + psi_t
        Where for t = T , we had phi_t = -U_T, psi_t = 0
      • Fact 2: We can show that the optimal policy is just a linear function of the state
        Discrete Riccati equation:
        phi_t = A_t^T * (phi_(t+1) - phi_(t+1) * B_t * (B_t^T * phi_(t+1) * B_t - W_t)^-1 * B_t^T * phi_(t+1)) * A_t - U_t
        psi_t = -tr(Sigma_t * phi_(t+1)) + psi_(t+1)
      • Fact 3: We notice that phi_t depends on neither psi nor the noise Sigma. It implies that the optimal policy also does not depend on the noise
        (But psi_t does depend on Sigma_t, which implies that V^*_t depends on Sigma)
    • As the optimal policy does not depend on psi_t, and the update of phi_t only depends on phi_(t+1), it is sufficient to update only phi_t

From non-linear dynamics to LQR

  1. It turns out that a lot of problems can be reduced to LQR, even if dynamics are non-linear
  2. Taylor expansion: works well for cases where the goal is to stay around some state s^*
  3. Differential Dynamic Programming (DDP)
    We’ll cover a method that applies when our system has to follow some trajectory (think about a rocket). This method is going to discretize the trajectory into discrete time steps, and create intermediary goals around which we will be able to use the previous technique
  4. Steps:
    • Step 1: come up with a nominal trajectory using a naive controller that approximates the trajectory we want to follow. In other words, our controller is able to approximate the goal trajectory with:
      s^*_0, a^*_0 -> s^*_1, a^*_1 -> ...
    • Step 2: linearize the dynamics around each trajectory point s^*_t using Taylor expansion
      Note We can apply a similar derivation for the reward Rt , with a second-order Taylor expansion
    • Step 3: Now, the problem is strictly re-written in the LQR framework. Let’s just use LQR to find the optimal policy pi_t . As a result, our new controller will (hopefully) be better
    • Step 4: We use the new controller (our new policy pi_t) to produce a new trajectory
      Note that when we generate this new trajectory, we use the real F and not its linear approximation to compute transitions
    • Then, go back to step 2 and repeat until some stopping criterion

Linear Quadratic Gaussian (LQG)

  1. So far, we assumed that the state was available. As this might not hold true for most of the real-world problems, we need a new tool to model this situation:
    Partially Observable MDPs (POMDP)
  2. A POMDP is an MDP with an extra observation layer.
    In other words, we introduce a new variable o_t , that follows some conditional distribution given the current state s_t
    o_t | s_t ~ O(o | s)
  3. Formally, a finite-horizon POMDP is given by a tuple:
    (S, O, A, P_sa, T, R)
    Within this framework, the general strategy is to maintain a belief state (distribution over states) based on the observation o_1, ..., o_t . Then, a policy in a POMDP maps belief states to actions

  4. Suppose that, we observe y_t (m-D) with m < n such that:
    y_t = C * s_t + v_t
    s_(t+1) = A * s_t + B * a_t + w_t
    Where v_t is the sensor noise (also gaussian, like w_t )
  5. Note that the reward function Rt is left unchanged, as a function of the state (not the observation) and action

  6. Steps:
    • Step 1: first, compute the distribution on the possible states ( s_t|t, Sigma_t|t ), based on the observations we have:
      s_t | y_1, ..., y_t ~ hN(s_t|t, Sigma_t|t)
      Using Kalman Filter algorithm
    • Step 2: We’ll use the mean s_t|t as the best approximation for s_t
    • Step 3: then set the action a_t := L_t * s_t|t where L_t comes from the regular LQR algorithm
  7. Kalman filter algorithm
    • Predict step: compute s_(t+1) | y_1, ..., y_t
      Suppose we know: s_t | y_1, ..., y_t ~ hN(s_t|t, Sigma_t|t) . Then s_(t+1) | y_1, ..., y_t ~ hN(s_t+1|t, Sigma_t+1|t) , where s_t+1|t = A * s_t|t; Sigma_t+1|t = A * Sigma_t|t * A^T + Sigma_s ( Sigma_s is the covariance of the transition noise w_t )
    • Update step: compute s_(t+1) | y_1, ..., y_(t+1)
      As we know above, we can prove that s_(t+1) | y_1, ..., y_(t+1) ~ hN(s_t+1|t+1, Sigma_t+1|t+1) , where s_t+1|t+1 = s_t+1|t + K_t * (y_(t+1) - C * s_t+1|t); Sigma_t+1|t+1 = Sigma_t+1|t - K_t * C * Sigma_t+1|t with K_t := Sigma_t+1|t * C^T * (C * Sigma_t+1|t * C^T + Sigma_y)^-1 ( Sigma_y is the covariance of the sensor noise v_t , and K_t is called the Kalman gain)
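A sketch of repeated predict/update cycles of the Kalman filter equations above, tracking a toy linear system with scalar observations; the matrices A, B, C and the noise covariances are made-up values:

```
import numpy as np

def kalman_step(s, Sig, a, y_next, A, B, C, Sigma_s, Sigma_y):
    """One predict + update step: (s_{t|t}, Sigma_{t|t}) -> (s_{t+1|t+1}, Sigma_{t+1|t+1})."""
    # predict: s_{t+1|t} = A s_{t|t} + B a_t,  Sigma_{t+1|t} = A Sigma_{t|t} A^T + Sigma_s
    s_pred = A @ s + B @ a
    Sig_pred = A @ Sig @ A.T + Sigma_s
    # update with the new observation y_{t+1}, using the Kalman gain K_t
    K = Sig_pred @ C.T @ np.linalg.inv(C @ Sig_pred @ C.T + Sigma_y)
    s_new = s_pred + K @ (y_next - C @ s_pred)
    Sig_new = Sig_pred - K @ C @ Sig_pred
    return s_new, Sig_new

# 2-D state (position, velocity), 1-D observation of position, no control for simplicity
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.zeros((2, 1)); C = np.array([[1.0, 0.0]])
Sigma_s, Sigma_y = 0.01 * np.eye(2), np.array([[0.25]])
s, Sig = np.zeros(2), np.eye(2)
rng = np.random.default_rng(0)
true_s = np.array([0.0, 1.0])
for t in range(20):
    true_s = A @ true_s                                   # true dynamics (noise omitted here)
    y = C @ true_s + 0.5 * rng.standard_normal(1)         # noisy position measurement
    s, Sig = kalman_step(s, Sig, np.zeros(1), y, A, B, C, Sigma_s, Sigma_y)
print(np.round(s, 2), np.round(true_s, 2))                # filtered estimate vs. true state
```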