CS229 Study Notes


Introduction

  1. What is machine learning?
    A program is said to learn from experience E with respect to task T and performance measure P if its performance on T, as measured by P, improves with experience E.
  2. Types:
    • Supervised learning & Unsupervised learning
    • Reinforcement learning & recommender systems
  3. How to apply learning algorithm
  4. Basic knowledge of linear algebra
    • Matrix: rows x columns
    • Vector: An n x 1 matrix; 1-indexed or 0-indexed (the first element of vector y is called y1 or y0, respectively)
    • Scalar means real number
    • PredictionVector = DataMatrix x ParametersVector
    • PredictionMatrix = DataMatrix x Multi_Pairs_ParametersMatrix
    • Matrix multiplication: not commutative, but associative; Inverse (a matrix with no inverse is called singular or degenerate); Transpose

Supervised learning

  1. Def: We give the computer a dataset with the correct answer for each example
  2. Regression: Predict continuous value output
  3. Classification: Discrete valued output
  4. Deal with infinite features: we have algorithm to do that

Unsupervised learning

  1. Def: We just give the computer a dataset, and let the computer find the structure of the data
    • Type 1: Split the data into different groups
    • Type 2: Audio separation



Linear Regression

  1. Training set -> Learning algorithm -> h: h(x) = y (hypothesis)
    • x input, y output, m num of training examples
    • h_theta(x) = theta0 + theta1 * x (called univariate linear regression)
    • h_theta(xi) = (xi)^T * theta

LMS(Least Mean Square) Algorithm

  1. Def Loss Function — how to choose thetas
    • The idea: minimize the separation between the h(x)'s and the y's over the training set
    • In this example minimize(sum((h(xi) - yi) ^ 2) / (2m)) (what we minimize is also called Square Error Cost Function)
    • Def cost function (J) as the function we try to minimize
  2. Understanding of Hypothesis and Cost Function
    • J is a function of parameters; For fixed parameters, h is a function of x
    • For each value of parameters, J([parameters]) corresponds to a specific hypothesis
    • The high-D understanding of h & J
  3. Gradient Descent — How to automatically find the parameters that minimize J
    • Generally, we have a function J
      • Start with some parameters
      • Keep changing them to reduce J, until hopefully we end up at a minimum
      • Gradient: the direction of steepest change of a function
      • thetaj := thetaj - alpha * pd(J) / pd(thetaj) , in which alpha is called the learning rate; too small — slow; too large — may overshoot the minimum, fail to converge, or even diverge
      • BTW: To simultaneously update all parameters correctly, we need to save all the new values temporarily and write them back together after all the calculations are done. Otherwise, parameters updated later will not be computed from the right (old) values
    • Specified to cost func J
      • That’s why there is a ‘2’ in the denominator of the cost function (it cancels when differentiating)
      • theta_j := theta_j + alpha * (yi - h(xi)) * xi_j
      • Convex function — bowl shape
      • Batch Gradient Descent: Each step of gradient descent uses all the training examples (the sum covers the whole training set); there are other variants, discussed next
      • Stochastic (Incremental) Gradient Descent: Each step uses a single training example (or a small sample of examples); see the sketch below
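A minimal NumPy sketch of the batch gradient descent update above, on a toy univariate problem (the data and hyperparameters are illustrative, not from the notes):

```
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, iters=1000):
    """Linear regression h(x) = theta^T x, minimizing J = sum((h - y)^2) / (2m)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / m   # dJ/dtheta, averaged over all m examples
        theta -= alpha * grad              # simultaneous update of all theta_j
    return theta

# toy data: y ~ 2 + 3x plus noise; a first column of ones gives the intercept theta0
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
X = np.column_stack([np.ones(50), x])
y = 2 + 3 * x + 0.1 * rng.standard_normal(50)
print(batch_gradient_descent(X, y))        # approximately [2, 3]
```

The stochastic variant replaces the averaged gradient with the gradient on one example at a time.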

The normal equations

  1. Some more Matrix calc
    • Derivative: Df(A)=[pd(f) / pd(Aij)]
    • Trace: trA; trAB = trBA (invariant under cyclic permutation)
    • Some equation:
      • D(_A)trAB = B^T
      • D(_A^T)f(A) = (D(_A)f(A))^T
      • D(_A)trABA^TC = CAB + C^TAB^T
      • D(_A)|A| = |A| * (A^-1)^T
  2. The Normal Equation: In order to let DJ(theta) = 0
    X^T * X * theta = X^T * y

  3. So by theta = (X^T * X)^-1 * X^T * y, we can minimize J without resorting to an iterative algorithm
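A small sketch of this closed-form solution (using `np.linalg.solve` on the normal equation rather than forming an explicit inverse; the data setup is hypothetical):

```
import numpy as np

def normal_equation(X, y):
    # Solve (X^T X) theta = X^T y directly instead of inverting X^T X
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.standard_normal((100, 2))])
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.standard_normal(100)
print(normal_equation(X, y))   # close to [1, -2, 0.5]
```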

Probabilistic interpretation

  1. The justification of LMS: minimizing J = maximizing the likelihood or log-likelihood
    • L(theta) = pi(p(yi | xi; theta))
    • l(theta) = log(L(theta))
    • In the example above, y | x; theta ~ N(theta^T * x, sigma^2) (N denotes the Gaussian distribution)

Locally weighted linear regression(LWR)

  1. Underfitting & Overfitting

  2. Suppose we have enough data that the choice of features is less critical
  3. In LWR, instead of minimizing sum((yi - theta^T * xi)^2), we minimize sum(wi * (yi - theta^T * xi)^2), where the wi's are non-negative weights (see the sketch after this list)
  4. Fairly standard choice of w
    wi = e^(- (xi - x)^2 / (2 * tao^2))
    where x is the query point, tao is called the bandwidth parameter

  5. Non-parametric: the amount of stuff we have to keep around grows with the training set size (we need the whole training set to make a prediction)
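A sketch of one LWR prediction at a single query point, solving the weighted least-squares problem via the weighted normal equation (the function and data names are illustrative):

```
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict at x_query by fitting theta that minimizes sum(w_i * (y_i - theta^T x_i)^2)."""
    # Gaussian bump weights centered at the query point; tau is the bandwidth
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equation: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
X = np.column_stack([np.ones_like(x), x])
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
print(lwr_predict(X, y, np.array([1.0, 1.0])))   # roughly sin(1) ~ 0.84
```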



Classification and logistic regression

For now, we’ll focus on the binary classification problem: 0 — negative class (‘-‘); 1 — positive class (‘+’). Given x(i), the corresponding y(i) is also called the label for the training example

Logistic regression

  1. Treated as continuous values prediction
  2. Logistic function(also called Sigmoid function)
    g(z) = 1 / (1 + e^-z)
    Also g'(z) = g(z) * (1 - g(z))
  3. Hypothesis: h_theta(x) = g(theta^T * x)
  4. Probabilistic assumption:
    • P(y = 1 | x; theta) = h_theta(x)
    • P(y = 0 | x; theta) = 1 - h_theta(x)
    • Generally: p(y | x; theta) = (h_theta(x))^y * (1 - h_theta(x))^(1 - y) (the Bernoulli distribution Bernoulli(phi = h_theta(x)))

  5. Maximizing the log likelihood l(theta) by gradient ascent, we derive the same update formula:
    theta_j := theta_j + alpha * (yi - h_theta(xi)) * xi_j
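A minimal sketch of logistic regression trained by batch gradient ascent on the log-likelihood; note the update has the same form as in linear regression (synthetic data, illustrative names):

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, iters=2000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * X.T @ (y - h) / len(y)   # gradient ascent on l(theta)
    return theta

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.standard_normal((200, 2))])
y = (X @ np.array([0.5, 2.0, -1.0]) + 0.5 * rng.standard_normal(200) > 0).astype(float)
theta = logistic_regression(X, y)
print(((sigmoid(X @ theta) > 0.5) == y).mean())   # training accuracy, close to 1
```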

Digression: The perceptron learning algorithm

  • All the same as above but g(z) = z<0?0:1

Another algorithm for maximizing l(theta)

  1. Newton’s method to find a zero for a func:
    theta := theta - f(theta) / f'(theta) (intuition: repeatedly follow the tangent-line approximation to its zero)
  2. Apply to l(theta): theta := theta - l'(theta) / l''(theta)
  3. Considering high-D: (Newton-Raphson method)
    theta := theta - H^-1 * grad_theta(l(theta))
    where H_ij = pd^2(l(theta)) / (pd(theta_i) * pd(theta_j)) (the Hessian matrix)

  4. Can be understood as gradient descent/ascent with a changing (matrix-valued) learning rate alpha
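A sketch of Newton's method for the same logistic log-likelihood. For logistic regression the Hessian is H = -X^T S X with S = diag(h*(1-h)), so theta := theta - H^-1 * grad becomes the update below (data and names are illustrative):

```
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=10):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                  # gradient of l(theta)
        S = h * (1 - h)                       # diagonal weights
        H = -(X.T * S) @ X                    # Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)     # theta := theta - H^-1 grad
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(300), rng.standard_normal((300, 2))])
y = (sigmoid(X @ np.array([-0.5, 1.5, -2.0])) > rng.uniform(size=300)).astype(float)
print(logistic_newton(X, y))   # typically converges in a handful of iterations
```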



Generalized Linear Models(GLMs)

The exponential family

  1. The exponential family distributions:
    p(y; yita) = b(y) * exp(yita^T * T(y) - a(yita))
    where:
    • yita — natural parameter(also called canonical parameter);
    • T(y) — sufficient statistic;
    • a(yita) — log partition function( e^(-a(yita)) essentially plays the role of a normalization constant)

  2. Example 1: For Bernoulli(phi):
    • T(y) = y;
    • a(yita) = - log(1 - phi) = log(1 + e^yita);
    • b(y) = 1
  3. Example 2: For Gaussian N(miu, sigma^2 = 1): (Because sigma has no effect on the final choice of theta and h)
    • T(y) = y;
    • a(yita) = miu^2 / 2 = yita^2 / 2;
    • b(y) = exp(- y^2 / 2) / sqr_root(2 * pi)
  4. Other Examples: multinomial, Poisson, gamma, exponential, beta, Dirichlet, etc.

Constructing GLMs

  1. Assumption:
    • y | x; theta ~ ExponentialFamily(yita) — the distribution of y given x and theta follows some exponential family distribution with parameter yita
    • Given x, our goal is to predict the expected value of T(y) given x
      In most of our examples, we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy h(x) = E[y|x] (the conditional expectation)
    • The natural parameter yita and the inputs x are related linearly:
      yita_i = theta_i^T * x
  2. Some special name:
    • y — Response variable
    • Canonical response function g(yita): gives the distribution's mean as a function of the natural parameter (so h(x) = g(yita)); Canonical link function g^-1: maps the mean back to yita

  3. Ordinary Least Squares
    • We model the conditional distribution of y given x as a Gaussian function above
    • So we have h_theta(x) = E[y | x; theta] = miu = yita = theta^T * x
  4. Logistic regression
    • Bernoulli
    • h_theta(x) = E[y | x; theta] = phi = 1 / (1 + e^-yita) = 1 / (1 + e^-(theta^T * x))
  5. Softmax regression (multinomial distribution)
    • For a multinomial over k possible outcomes, we’ll use k - 1 independent parameters to parameterize it phi_1, phi_2, ..., phi_(k - 1)
    • phi_i = p(y = i; phi) and p(y = k; phi) = 1 - sum(phi_i) (let it = phi_k, where phi_k is not a parameter but just a notation)
    • A notation to introduce: 1{·} as "the argument is true" ? 1 : 0
    • To express the multinomial as an exponential family distribution:
      • we define that (T(y))_i = 1{y = i} (for i = 1, 2, …, k-1)
        That is: T(i) = [0, ..., 1(i_th), ..., 0((k-1)_th)]^T (for i = 1,2,…,k - 1); T(k) = [0,...,0]^T
        Moreover, E[(T(y))_i] = P(y = i) = phi_i
      • yita = [log(phi_1 / phi_k), ..., log(phi_(k-1) / phi_k)]^T
      • a(yita) = - log(phi_k)
      • b(y) = 1
    • The softmax function:
      • The link function: yita_i = log(phi_i / phi_k); we also define yita_k = 0 for convenience
      • The response function: phi_i = e^yita_i / sum(e^yita_j)
    • So that p(y = i | x; theta) = phi_i = e^(theta_i^T * x) / sum(e^(theta_j^T * x))
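A tiny sketch of the softmax response function above; the max-subtraction trick is standard numerical practice and not part of the notes:

```
import numpy as np

def softmax_probs(Theta, x):
    """phi_i = exp(theta_i^T x) / sum_j exp(theta_j^T x) for class i."""
    logits = Theta @ x                 # one row theta_i per class
    logits -= logits.max()             # subtract the max for numerical stability
    e = np.exp(logits)
    return e / e.sum()

Theta = np.array([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]])   # 3 classes, 2 features
print(softmax_probs(Theta, np.array([1.0, 2.0])))          # probabilities summing to 1
```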



Generative Learning Algorithms

    1. Discriminative learning algorithms: try to learn p(y|x) directly, or learn mappings directly from the space of inputs to the labels;
    2. Generative learning algorithm: instead, try to model p(y) (called class priors) and p(x|y), then calc p(y|x) by Bayes rule:
      p(y|x) = p(x|y) * p(y) / p(x)
      where p(x) = sum(p(x|y = i) * p(y = i))
      Moreover, arg max_y(p(y|x)) = arg max_y(p(x|y) * p(y) / p(x)) = arg max_y(p(x|y) * p(y))
      So likelihood function would be derived from p(x|y) and p(y)

Gaussian Discriminant Analysis (GDA)

  1. Introduce: Multivariate normal distribution
    • Mean vector miu (n); Covariance matrix Sigma (n x n) (symmetric and positive semi-definite)
    • p(x; miu, Sigma) = exp(-(x - miu)^T * Sigma^-1 * (x - miu) / 2) / ((2 * pi)^(n / 2) * |Sigma|^0.5)
    • E[X] = integral_x(x * p(x; miu, Sigma))dx = miu
      Cov(Z) = E[(Z - E[Z])(Z - E[Z])^T] = E[ZZ^T] - E[Z]E[Z]^T = Sigma
    • Standard normal distribution
    • In Sigma, the off-diagonal entry Sigma_ij controls how the density stretches along the direction x_i = x_j: as it gets bigger, the density becomes more spread out along x_i = x_j and more compressed along x_i = -x_j (the diagonal entries, i = j, instead control the spread along each individual axis)

  2. The Gaussian Discriminant Analysis model: take as an example a binary classification problem in which the inputs x are continuous, real-valued vectors
    • y ~ Bernoulli(phi)
      x | y = 0 ~ N(miu0, Sigma)
      x | y = 1 ~ N(miu1, Sigma)
    • Parameters: phi, miu0, miu1, Sigma (Note that only one Sigma for two miu)
    • By maximizing l(phi, miu0, miu1, Sigma), we get (see the sketch after this list):
      phi = sum(1{yi = 1}) / m
      miu0 = sum(1{yi = 0} * xi) / sum(1{yi = 0})
      miu1 = sum(1{yi = 1} * xi) / sum(1{yi = 1})
      Sigma = sum((xi - miu_yi)(xi - miu_yi)^T) / m
  3. Discussion: GDA & logistic regression
    • Viewing p(y = 1 | x) as a function of x, the GDA assumptions imply a logistic form with some theta = theta(phi, Sigma, miu0, miu1); in that sense GDA reduces to logistic regression
    • GDA makes a stronger assumption: not only does p(y|x) follow a logistic function, but p(x|y) is multivariate Gaussian
    • When assumption correct, GDA is better, even asymptotically efficient
    • GDA is more data efficient (requires less training data to learn “well”)
    • Logistic regression more robust
  4. Asymptotically efficient: Means that in the limit of very large training sets (large m), there is no algorithm that is strictly better than GDA
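A sketch of the closed-form GDA parameter estimates from item 2 above (synthetic two-class data; names are illustrative):

```
import numpy as np

def fit_gda(X, y):
    """Maximum-likelihood estimates for phi, miu0, miu1 and the shared Sigma."""
    phi = y.mean()
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    centered = X - np.where(y[:, None] == 1, mu1, mu0)   # xi - miu_{yi}
    Sigma = centered.T @ centered / len(y)
    return phi, mu0, mu1, Sigma

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([2, 2], np.eye(2), 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])
phi, mu0, mu1, Sigma = fit_gda(X, y)
print(phi, mu0, mu1)   # ~0.5, ~[0, 0], ~[2, 2]
```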

Naive Bayes

  1. Input vector x are discrete-valued
  2. Naive Bayes (NB) assumption: x_i’s are conditionally independent given y
    Then p(x_1, ..., x_50000 | y) = p(x_1 | y)p(x_2 | y, x_1)···p(x_50000 | y, x_1, ..., x_49999) = pi(p(x_j | y))
  3. Naive Bayes classifier:
    • Parameterize and Maximize likelihood:
      phi_(j|y=1) = p(x_j=1|y=1) = sum(1{xi_j=1 ^ yi=1}) / sum(1{yi=1})
      phi_(j|y=0) = p(x_j=1|y=0) = sum(1{xi_j=1 ^ yi=0}) / sum(1{yi=0})
      phi_y = p(y = 1) = sum(1{yi=1}) / m
    • Then p(y = 1 | x) = p(x | y = 1)p(y = 1) / p(x) where p(x) = p(x | y = 1) * p(y = 1) + p(x | y = 0) * p(y = 0)
    • x can be multinomial;
      For continuous x, if GDA etc. do not fit well, we can also discretize x and use Naive Bayes
  4. Laplace smoothing:
    • Problem with the previous version: for words that never appear in the training set, both phi estimates are 0, and the posterior becomes 0/0
    • Solution: Laplace smoothing
      Generally phi_j = (sum(1{zi = j}) + 1) / (m + k) where k is the number of possible j values to make sure the sum of all phi is 1
      Specifically, phi_(j|y=1) = (sum(1{xi_j=1 ^ yi=1}) + 1) / (sum(1{yi=1}) + 2), similarly for phi_(j|y=0)

  5. Multinomial event model
    x_i represents the identity of the i-th word of the document;
    We assume the distribution of x_i given y is the same for every position i
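A sketch of the Bernoulli-feature Naive Bayes classifier with Laplace smoothing from items 3-4 (X is a binary feature matrix; the data and names are illustrative, and log-probabilities are used to avoid underflow):

```
import numpy as np

def fit_naive_bayes(X, y):
    """X is m x n binary; returns smoothed phi_{j|y=1}, phi_{j|y=0}, phi_y."""
    phi_j_y1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)   # Laplace smoothing
    phi_j_y0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_j_y1, phi_j_y0, y.mean()

def predict(x, phi_j_y1, phi_j_y0, phi_y):
    # compare log p(x|y=1)p(y=1) with log p(x|y=0)p(y=0)
    log_p1 = np.sum(x * np.log(phi_j_y1) + (1 - x) * np.log(1 - phi_j_y1)) + np.log(phi_y)
    log_p0 = np.sum(x * np.log(phi_j_y0) + (1 - x) * np.log(1 - phi_j_y0)) + np.log(1 - phi_y)
    return int(log_p1 > log_p0)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
# feature j is more likely to appear when y = 1
X = (rng.uniform(size=(500, 20)) < np.where(y[:, None] == 1, 0.6, 0.2)).astype(int)
params = fit_naive_bayes(X, y)
print(np.mean([predict(x, *params) == yi for x, yi in zip(X, y)]))   # training accuracy
```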



Support Vector Machines

Margins

  1. Intuition: Margins, Confidence of the prediction, Separating Hyperplane
    Important assumption: Data is linearly separable
  2. For the following discussion, we let the two values that y can take be -1 and 1, and denote h_(w,b)(x) = g(w^T * x + b), where g(z) = z<0 ? -1 : 1

  3. Functional Margin: gammai = yi * (w^T * xi + b) and gamma = min(gammai)
    • > 0 — correct; Large — confident
    • There is a problem: simply scaling w and b (e.g. multiplying both by 2) makes the functional margin larger without changing anything meaningful
      So often we will impose some sort of normalization condition such as ||w||_2 = 1
  4. Geometric Margin: gammai = yi * ((w / ||w||)^T * xi + b / ||w||) and gamma = min(gammai)

The Optimal Margin Classifier

  1. The natural desideratum: Maximizes the (geometric) margin
    So the problem is: max_(gamma,w,b)(gamma / ||w||) s.t. yi * (w^T * xi + b) >= gamma, i = 1, ..., m
    Which, to make the calculation easier, is equivalent to: min_(w,b)(0.5 * ||w||^2) s.t. yi * (w^T * xi + b) >= 1, i = 1, ..., m
    This can be considered the loss function that the model (Optimal Margin Classifier) uses for optimization

  2. Digression: Lagrange Duality
    • Consider the problem: min_w(f(w)) s.t. h_i(w) = 0, i = 1, ..., l
      Constrained extrema, solved by the method of Lagrange multipliers:
      • Def Lagrangian: hL(w, beta) = f(w) + sum(beta_i * h_i(w)) where the beta_i's are called Lagrange multipliers
      • Then solve pdhL / pdw_i = 0; pdhL / pdbeta_i = 0 to get w and beta
    • Consider the problem (primal optimization problem): min_w(f(w)) s.t. g_i(w) <= 0, i = 1, ..., k; h_i(w) = 0, i = 1, ..., l
      • Def generalized Lagrangian: hL(w, alpha, beta) = f(w) + sum(alpha_i * g_i(w)) + sum(beta_i * h_i(w))
      • Consider the quantity: theta_P(w) = max_(alpha, beta; alpha_i>=0)(hL(w, alpha, beta))
        we can derive that theta_P(w) = (w satisfies the primal constraints) ? f(w) : infinity
        Then we have min(theta_P(w)) = min(max(hL(w, alpha, beta)))
        we call the solution p^* = min(theta_P(w)) the value of the primal problem
      • Consider another problem called the Dual optimization problem:
        First define theta_D(alpha, beta) = min_w(hL(w, alpha, beta))
        Then the problem is max_(alpha_i>=0)(theta_D(alpha, beta)) = max_(alpha_i>=0)(min_w(hL(w, alpha, beta)))
        we call the solution d^* = max_(alpha_i>=0)(theta_D(alpha, beta)) the value of the dual problem
      • Because maxmin <= minmax we get d^* <= p^*
      • Suppose f and the g_i's are convex, and the h_i's are affine. Suppose further that the constraints g_i are (strictly) feasible
        Then we have p^* = d^* = hL(w^*, alpha^*, beta^*)
        Moreover, they satisfy the Karush-Kuhn-Tucker (KKT) conditions below. And any w^*, alpha^*, beta^* that satisfy the KKT conditions are also a solution
        • pdhL(w^*, alphs^*, beta^*) / pdw_i = 0, i=1, ..., n
        • pdhL(w^*, alphs^*, beta^*) / pdbeta_i = 0, i=1, ..., l
        • alpha_i^* * g_i(w^*) = 0, i = 1, ..., k (called the KKT dual complementarity condition)
        • g_i(w^*) <= 0, i = 1, ..., k
        • alpha_i^* >= 0, i = 1, ..., k

  3. So to match the Lagrange Duality expression, we write the problem as: min_(w,b)(0.5 * ||w||^2) s.t. g_i(w) = - yi * (w^T * xi + b) + 1 <= 0, i = 1, ..., m
  4. From the KKT conditions we know that alpha_i != 0 is possible only when g_i(w) = 0
    That is to say, only when yi * (w^T * xi + b) attains its minimum (which is 1 because of the constraint)
    To understand it from a geometric view: for a given line (w, b chosen), translate it in both directions until it touches some points; those touched points are the ones with non-zero alpha
  5. Specifically, in the Optimal Margin Classifier, the points with non-zero alpha are called Support Vectors

  6. To use Lagrange Duality and the KKT conditions:
    • First we write down hL(w, b, alpha) = 0.5 * ||w||^2 - sum(alpha_i * (yi * (w^T * xi + b) - 1))
    • Then we let the partial derivatives of hL to w and b be 0 (KKT 1 and also minimizing hL over w, b to get theta_D), and we get:
      w = sum(alpha_i * yi * xi)
      sum(alpha_i * yi) = 0
    • Substituting back into the original function, we get: hL(w, b, alpha) = sum(alpha_i) - 0.5 * sum(yi * yj * alpha_i * alpha_j * xi^T * xj) = W(alpha)
    • Then we get the dual optimization problem: max_alpha(W(alpha)) s.t. alpha_i >= 0, i = 1, ..., m; sum(alpha_i * yi) = 0
    • why fit KKT 3 ???
    • If we can solve it, we then can get w^* through w = sum(alpha_i * yi * xi) and b^* through b^* = -0.5 * (max_(i:yi=-1)(w^(*T) * xi) + min_(i:yi=1)(w^(*T) * xi))
  7. When a test input x comes in, we need to calc w^T * x + b = sum(alpha_i * yi * xi)^T * x + b. Given the discussion above (only support vectors have non-zero alpha), what we really need to do is calc the inner product between the input x and the Support Vectors. — Support Vector Machine (SVM)

Kernels

  1. For inputs x (called input attributes below), we may consider performing training using features calculated from x (called input features below). We denote by phi the feature mapping (the function mapping attributes to the corresponding features)
  2. Because in the dual optimization problem, all we need is inner product between the input attributes x, we may instead want to use it (that is SVMs) to learn using these features phi(x)
  3. Def Kernel: K(x, z) = phi(x)^T * phi(z)
  4. Sometimes, computing K(x, z) directly costs much less than computing phi(x) first, because phi(x) is usually a high-dimensional vector
    Example: K(x, z) = (x^T * z + c)^d
  5. An understanding of Kernels: a measure of how similar phi(x) and phi(z) are, or of how similar x and z are
    • So we may directly write down the K(x, z) as what we think might be a reasonable measure of how similar x and z are
    • Moreover, to test whether the function we write can be a Kernel function — Theorem (Mercer)
      K is a valid (Mercer) kernel if and only if its corresponding matrix K (Kij = K(xi, xj)) is symmetric positive semi-definite for any m x’s
    • Note that the corresponding phi may be an infinite-dimensional vector
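A small numeric check of the example in item 4 for d = 2: K(x, z) = (x^T z + c)^2 equals phi(x)^T phi(z) for the explicit feature map below (this particular map is one standard choice):

```
import numpy as np

def phi(x, c):
    """Explicit feature map whose inner product equals (x^T z + c)^2."""
    pairwise = np.outer(x, x).ravel()           # all products x_i * x_j
    return np.concatenate([pairwise, np.sqrt(2 * c) * x, [c]])

rng = np.random.default_rng(0)
x, z, c = rng.standard_normal(5), rng.standard_normal(5), 1.0
print((x @ z + c) ** 2)          # kernel computed directly: O(n) time
print(phi(x, c) @ phi(z, c))     # same value via the O(n^2)-dimensional features
```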

Regularization and the non-separable case

  1. The assumption that the data is linearly separable may not hold
    Exactly separating the data may also not be what we want (e.g. because of outliers)
  2. In order to make the algorithm work for non-linearly separable datasets and be less sensitive to outliers: reformulate the optimization (using l_1 regularization)
    min_(xxi,w,b)(0.5 * ||w||^2 + C * sum(xxi_i)) s.t. yi * (w^T * xi + b) >= 1 - xxi_i, i = 1, ..., m; xxi_i >= 0, i = 1, ..., m
    Understanding: allow some points to have (functional) margin less than 1, reduced by xxi_i, but at a cost of C * xxi_i in the objective
    Then the dual optimization problem is max_alpha(W(alpha)) s.t. 0 <= alpha_i <= C, i = 1, ..., m; sum(alpha_i * yi) = 0 with the same W(alpha)

    Moreover, the KKT condition become:
    • alpha_i = 0 then yi * (w^T * xi + b) >= 1
    • alpha_i = C then yi * (w^T * xi + b) <= 1
    • 0 < alpha_i < C then yi * (w^T * xi + b) = 1

The SMO Algorithm: To solve dual problem

SMO: Sequential Minimal Optimization

  1. Previously introduced methods to solve a max/min problem: gradient ascent/descent, Newton's method
  2. Digression: Coordinate Ascent Algorithm to solve max problem (or min)
    Loop until convergence: {
        For i = 1, ..., m: {
            alpha_i := argmax_(alpha_i') W(alpha_1, ..., alpha_(i-1), alpha_i', alpha_(i+1), ..., alpha_m)
        }
    }
    
  3. SMO
    • Rewrite the problem: max_alpha(W(alpha)) s.t. 0 <= alpha_i <= C, i = 1, ..., m; sum(alpha_i * yi) = 0
    • Because of constraint 2 (sum(alpha_i * yi) = 0), any single alpha is determined by the others
    • So we need to update at least two of them simultaneously
      ```
      Repeat until convergence: {
      1. Select some pair alpha_i and alpha_j to update next (using a heuristic to pick the pair that lets us make the biggest progress towards the global maximum).
      2. Reoptimize W(alpha) with respect to alpha_i and alpha_j while holding all the other alpha_k's (k != i, j) fixed.
      }
      ```
    • To test for convergence of this algorithm, we can check whether the KKT conditions are satisfied to within some tolerance (typically set to 0.01 to 0.0001)
    • Performs well because each reoptimization step is very efficient
    • To explore: the heuristics for picking the pair, and how to update b
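A toy sketch of the coordinate ascent idea from item 2, maximizing a simple concave quadratic one coordinate at a time (the objective is made up purely for illustration; SMO itself must update two alphas at once because of the sum(alpha_i * yi) = 0 constraint):

```
import numpy as np

# maximize W(a) = -0.5 * a^T Q a + b^T a  (Q positive definite, so W is concave)
Q = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
a = np.zeros(2)

for _ in range(50):                      # "repeat until convergence"
    for i in range(len(a)):
        # holding the other coordinates fixed, dW/da_i = 0 gives the exact 1-D maximizer
        a[i] = (b[i] - Q[i] @ a + Q[i, i] * a[i]) / Q[i, i]

print(a, np.linalg.solve(Q, b))          # coordinate ascent matches the exact maximizer
```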



Learning Theory

Bias/Variance Tradeoff

  1. Generalization error: its expected error on examples not necessarily in the training set
  2. Bias E[f_true(x) - f(x)] : the expected generalization error even if we were to fit it to a very (say, infinitely) large training set — Underfit if large
    Variance Var[f(x)] : — Overfit if large
  3. The total Mean Square Error (MSE): MSE = E[(y - f(x))^2] = sigma^2 + (Bias[f(x)])^2 + Var[f(x)] for y = f_true(x) + e where E[e] = 0; Var(e) = sigma^2

Learning Theory (Take binary classification as an example)

  1. Two Lemmas:
    • The union bound lemma;
    • The Hoeffding inequality (Chernoff bound) lemma
  2. PAC assumption 1: Training examples are drawn iid (independent and identically distributed) from some prob distribution hD
  3. Training error (also called Empirical risk or Empirical error): epsilon'(h) = (1 / m) * sum(1{h(xi) != yi})
    Generalization error: epsilon(h) = P_((x,y)~hD)(h(x) != y) (PAC assumption 2)
  4. Empirical risk minimization (ERM): a way to fit theta: theta' = argmin_theta(epsilon'(h_theta))
    Def the hypothesis class hH as the set of all classifiers under consideration (with some constraint), e.g. the set of linear classifiers. The argmin can then be taken over h in hH
  5. The case of finite hH
    • epsilon'(h) is a reliable estimate of epsilon(h) for all h in hH
      This implies an upper-bound on the generalization error of h'
    • Theorem. (If uniform convergence occurs) Let |hH| = k, and let any m, delta be fixed, then with prob at least 1 - delta, we have that
      epsilon(h') <= minepsilon(h) + 2 * sqr_root(log(2 * k / delta) / (2 * m))
    • The two parts of this bound correspond to the tradeoff between bias and variance
    • Sample complexity Corollary: Let |hH| = k, and let delta, gamma be fixed. Then for epsilon(h') <= minepsilon(h) + 2 * gamma to hold with prob at least 1 - delta, it suffices that
      m >= log(2 * k / delta) / (2 * gamma^2) = O(...)
  6. The case of infinite hH
    • we say that hH shatters S if hH can realize any labeling on S
    • Def Vapnik-Chervonenkis dimension of hH, written VC(hH), to be the size of the largest set that is shattered by H. (If hH can shatter arbitrarily large sets, then VC(hH) = infinity)
    • Vapnik Theorem (some believe it's the most important theorem): Let hH be given, and let d = VC(hH). Then with prob at least 1 - delta, we have that for all h belonging to hH:
      |epsilon(h) - epsilon'(h)| <= O(sqr_root(d * log(m / d) / m + log(1 / delta) / m)) also
      epsilon(h') <= epsilon(h^*) + O(sqr_root(d * log(m / d) / m + log(1 / delta) / m))
    • Corollary: For |epsilon(h) − epsilon'(h)| ≤ gamma to hold for all h ∈ H with probability at least 1 − delta, it suffices that m = O_(gamma, delta)(d)
    • Also, for most hypothesis classes, d is roughly linear in the number of parameters

Error Analysis

  1. How to determine which part of the pipeline has the biggest problem
    • Plug in the ground truth for each component (that is, do that component's work manually, or otherwise ensure it is done perfectly) and see how much the accuracy increases

Ablative Analysis

  1. How much did each of the improvements really help?
    • Start from the current level of performance, and remove the added features/components one at a time to see how much each removal affects performance

Mistakes Analysis

  1. Analyze what the mistakes composed of, and make improvement according to it



Regularization and Model Selection

Cross Validation — The method for Model Selection

  1. Hold-out (or Simple) Cross Validation:
    • Randomly split S into S_train and S_CV (called the hold-out cross validation set)
    • Train each model M_i on S_train only, to get some hypothesis h_i
    • Select and output the hypothesis h_i that had the smallest error epsilon'_(S_CV)(h_i) on the hold-out cross validation set
    • Optionally, step 3 in the algorithm may also be replaced with selecting the model M_i and then retraining M_i on the entire training set S
  2. k-Fold Cross Validation
    • Randomly split S into k disjoint subsets of m / k training examples each
    • Use one as validation set and others as training set each loop, and calc the mean epsilon'_(S_j)(h_i) for M_i
    • Pick the M_i corresponding to the smallest mean
    • Usually k = 10;
      Extreme case: k = m — Leave-One-Out Cross Validation
  3. These also can be used more simply to evaluate a single model or algorithm
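A sketch of k-fold cross validation used for model selection; here the "models" are polynomial fits of different degrees, an illustrative choice that is not from the notes:

```
import numpy as np

def kfold_mse(x, y, degree, k=10, seed=0):
    """Mean held-out squared error of a degree-`degree` polynomial fit, averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for j in range(k):
        val = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        coeffs = np.polyfit(x[train], y[train], degree)     # train on the other k-1 folds
        errors.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 100)
y = 1 + 2 * x - 3 * x ** 2 + 0.2 * rng.standard_normal(100)
for d in [1, 2, 5, 8]:
    print(d, kfold_mse(x, y, d))   # degree 2 should give (near) the lowest CV error
```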

Feature Selection — A special case of Model Selection

  1. For the case where the number of features is so large that overfitting easily occurs
  2. Total number of models is 2^n which is too large — need a kind of heuristic search procedure

  3. Forward Search (a little bit like greedy algorithm)
    • Initialize hF = ∅
    • Repeat till some pre-set threshold{
      • For i = 1, …, n: if i does not belong to hF, let hF_i = hF U {i}, and use some version of cross validation to evaluate the feature set hF_i
      • Set hF to be the best feature subset found on the above step
        }
    • Select and output the best feature subset that was evaluated during the entire search procedure
  4. Wrapper Model Feature Selection
  5. Backward Search

  6. Filter Feature Selection: much cheaper
    The idea here is to compute some simple score S(i) that measures how informative each feature x_i is about the class labels y. Then, we simply pick the k features with the largest scores S(i)
  7. One possible choice of the score would be to define S(i) as the correlation between x_i and y, as measured on the training data
  8. Mutual Information MI(x_i, y):
    • MI(x_i, y) = sum_(x_i)(sum_y(p(x_i, y) * log(p(x_i, y) / (p(x_i) * p(y)))))
    • p’s can all be estimated according to their empirical distributions on the training set
    • Expressed as a Kullback-Leibler (KL) divergence:
      MI(x_i, y) = KL(p(x_i, y) || p(x_i) * p(y))
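A sketch of the mutual information score for a binary feature, with the probabilities estimated from empirical counts; the add-one smoothing used to avoid log(0) is my own addition:

```
import numpy as np

def mutual_information(xi, y):
    """MI between a binary feature xi and binary label y, from (smoothed) empirical counts."""
    mi, m = 0.0, len(y)
    for a in (0, 1):
        for b in (0, 1):
            p_ab = (np.sum((xi == a) & (y == b)) + 1) / (m + 4)   # smoothed joint
            p_a = (np.sum(xi == a) + 1) / (m + 2)                 # smoothed marginals
            p_b = (np.sum(y == b) + 1) / (m + 2)
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
informative = (rng.uniform(size=1000) < np.where(y == 1, 0.9, 0.1)).astype(int)
noise = rng.integers(0, 2, 1000)
print(mutual_information(informative, y), mutual_information(noise, y))  # high vs ~0
```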

Bayesian Statistics and Regularization

  1. Frequentist: Not random just happen to be unknown;
    Bayesian: Random and unknown
  2. Prior Distribution: p(theta)
    p(theta | S) = p(S | theta) * p(theta) / p(S) = pi(p(yi | xi, theta)) * p(theta) / integral_theta(pi(p(yi | xi, theta)) * p(theta))dtheta
    Posterior Distribution: p(y | x, S)
    p(y | x, S) = integral_theta(p(y | x, theta) * p(theta | S))dtheta
  3. MAP (maximum a posteriori) estimate for theta:
    theta_MAP = argmax(pi(p(yi | xi, theta) * p(theta)))
  4. Similar to the ML (maximum likelihood) estimate for theta in the Frequentist view:
    theta_ML = argmax(pi(p(yi | xi, theta)))
  5. In practice, a common choice of p(theta) is to assume that theta ~ N(0, tao^2 * I)
    Using this, the MAP estimate has a smaller norm than the ML estimate and is less susceptible to overfitting (see the sketch below)
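For linear regression with Gaussian noise, the prior theta ~ N(0, tao^2 * I) turns the MAP estimate into ridge regression (with lambda = sigma^2 / tao^2); a small sketch, with an illustrative lambda:

```
import numpy as np

def theta_map(X, y, lam):
    """Ridge / MAP estimate: argmin ||X theta - y||^2 + lam * ||theta||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def theta_ml(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 10))          # few examples, many features
y = X[:, 0] + 0.5 * rng.standard_normal(30)
print(np.linalg.norm(theta_ml(X, y)), np.linalg.norm(theta_map(X, y, lam=5.0)))
# the MAP estimate has the smaller norm, as claimed above
```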



Decision Tree & Ensembling Methods

Decision Trees: a non-linear, region-based algorithm

  1. Def of Non-linearity
    A simple example: predicting "can ski or not" from latitude & month requires selecting regions of the input space, and exact region selection is intractable
  2. Decision Tree: An approximate solution via greedy, top-down, recursive partitioning
    • Top-down: We start with the original input space and split it into two child regions by thresholding on a single feature
    • Recursive: We then take one of these child regions and can partition via a new threshold, and continue to train our model in a recursive manner
      (always selecting a leaf node, a feature, and a threshold to form a new split)
    • We can continue in such a manner until we meet a given stopping criterion, and then predict the majority class at each leaf node

  3. The Loss Function of Decision Tree
    • When the parent region R_p gets split into two child regions R_1, R_2, the decrease in loss is:
      L(R_p) - (|R_1| * L(R_1) + |R_2| * L(R_2)) / (|R_1| + |R_2|) (cardinality-weighted sum)
  4. A straight-forward L that we can think about: Misclassification Loss L_misclass
    • L_misclass(R) = 1 - max(p'_c)
    • p'_c is the proportion of examples in the region that belong to class c; if we predict the majority class for the whole region, the misclassification rate is 1 - max_c(p'_c)
    • But it is not sensitive enough, sometimes: not only are the losses of the two splits identical, but neither of the splits decrease the loss over that of the parent
  5. A more sensitive loss: Cross-Entropy L_cross
    • L_cross(R) = - sum_c(p'_c * log(p'_c))
    • From an information-theoretic perspective, cross-entropy measures the number of bits needed to specify the outcome (or class) given that the distribution is known
    • Furthermore, the reduction in loss from parent to child is known as information gain
  6. The regression setting for Decision Trees
    • To predict, instead of using the majority class of the y's in region R, we use their mean: y' = sum_(i in R)(y_i) / |R|
    • Then we can directly use the squared loss L_squared:
      L_squared(R) = sum_(i in R)(y_i - y')^2 / |R|

  7. A unique advantage: Can deal with categorical variables
  8. Regularization or Stopping Criteria
    • Minimum Leaf Size
    • Maximum Depth
    • Maximum Number of Nodes
    • A "minimum decrease in loss after split" criterion is not good, because it can miss higher-order interactions (cases where thresholding on multiple features is needed before any single split looks good)
    • Fully growing out the tree, and then pruning away nodes that minimally decrease misclassification or squared error, as measured on a validation set
  9. Runtime
  10. Disadvantages:
    • High variance (use ensembling, introduced next, to address it)
    • Splits are axis-aligned (feature-based), so the model cannot exploit rotations of the input; a diagonal decision boundary can only be approximated by many horizontal and vertical splits
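A sketch comparing misclassification loss and cross-entropy on two candidate splits of the same parent region, illustrating the sensitivity point in items 4-5 (the class counts are illustrative):

```
import numpy as np

def misclassification(counts):
    p = counts / counts.sum()
    return 1.0 - p.max()

def cross_entropy(counts):
    p = counts / counts.sum()
    p = p[p > 0]                       # 0 * log 0 = 0 by convention
    return -np.sum(p * np.log(p))

def split_loss(loss, children):
    # cardinality-weighted sum of the child losses
    sizes = np.array([c.sum() for c in children], dtype=float)
    return np.sum(sizes / sizes.sum() * np.array([loss(c) for c in children]))

parent = np.array([400, 100])                                  # 400 positives, 100 negatives
split_a = [np.array([300, 100]), np.array([100, 0])]           # two candidate splits
split_b = [np.array([200, 100]), np.array([200, 0])]
for loss in (misclassification, cross_entropy):
    print(loss.__name__, loss(parent),
          split_loss(loss, split_a), split_loss(loss, split_b))
# misclassification gives identical values for both splits (and no decrease over the parent);
# cross-entropy decreases, and distinguishes the two splits
```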

Ensembling Methods

  1. By bias/variance analysis
    • Ensembling will lead to a decrease in variance of the error
      Var(X_mean) = rou * sigma^2 + (1 - rou) * sigma^2 / n
      where rou is a measurement of correlation between each model
    • Both increasing the number of models used in the ensemble and decreasing the correlation between models cause an overall decrease in the variance of the ensemble's error
  2. Ways to generate de-correlated models
    • Using different algorithms
    • Using different training sets
    • Bagging
    • Boosting

  3. Bagging: ‘Bootstrap Aggregation’
    • Bootstrap:
      True population P, A training set S sampled from P ( S ~ P )
      we can generate many new bootstrap sets Z_i (i = 1, …, n) sampled with replacement from S ( Z ∼ S, |Z| = |S| ) ???
      We can then look at the variability of our estimate across these bootstrap sets to obtain a measure of error
    • Aggregation:
      we can take each Z_i and train a machine learning model G_i on each, and define a new aggregate predictor: G(X) = sum_i(G_i(x) / n)

    • This decreases variance by decreasing rou, compared to simply training every model on S itself
    • Note that rou is independent of n, so increasing n only decreases the variance

    • Out-of-bag estimation:
      • each bootstrapped sample only contains approximately 2/3 of S, and thus we can use the other 1/3 as an estimate of error, called out-of-bag error
      • In the limit, as n -> infinity, out-of-bag error gives an equivalent result to leave-one-out cross-validation

    • Use with Decision Trees:
      • Works well
      • Support for missing values
      • Lose interpretability
      • Variable importance measure
      • Random forests: only allow a subset of features to be used at each split, to prevent a very strong predictor (feature) from causing large rou

  4. Boosting: used for bias reduction
    • Get high-bias, low-variance models (weak learners) by only allowing decision trees to split once (Decision Stumps):
      The key idea is that after one round of training, we track which examples the classifier got wrong, increase their relative weight compared to the correctly classified examples, and then do the next round of training

    • Adaboost
      • Pseudocode (N training examples)
          w_i := 1 / N, for i = 1, 2, ..., N
          for m = 0 to M do
          Fit weak classifier G_m to training data weighted by w_i
          Compute weighted error err_m = sum_i(w_i * 1{y_i != G_m(x_i)}) / sum(w_i)
          Compute weight alpha_m = log((1-err_m) / err_m)
          w_i := w_i * exp(alpha_m * 1{y_i != G_m(x_i)})
          end
          f(x) = sign(sum_m(alpha_m * G_m(x)))
        
      • The G’s are not independent, meaning that increasing M leads to an increase in the risk of overfitting
    • Forward Stagewise Additive Modeling
      • Pseudocode
          Initialize f_0(x) = 0
          for m = 0 to M do
          Compute (beta_m, gamma_m) = argmin(sum_i(L(y_i, f_(m-1)(x_i) + beta * G(x_i; gamma))))
          Set f_m(x) = f_(m-1)(x) + beta_m * G(x; gamma_m)
          end
          f(x) = f_M(x)
        
      • Intuition: every step tries to find the weak learner that best fits what the current model still gets wrong (the residual) ???
      • Adaboost is a special case of it
    • Xgboost
    • Gradient Boosting: Use gradient descent to solve the min problem. But we are restricted to taking steps in our model class
      • We instead compute the gradient at each training point with respect to the current predictor
        g_i = pdL(y, f(x_i)) / pdf(x_i)
      • Then train a new regression predictor to match this gradient and use it as the gradient step
        In the Forward Stagewise Additive Modeling notation: gamma_m = argmin_gamma(sum_i((g_i - G(x_i; gamma))^2))
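A sketch of the AdaBoost pseudocode above, using simple one-feature threshold stumps as the weak learners; the stump search and the synthetic data are my own illustrative choices, not part of the notes:

```
import numpy as np

def fit_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] < t, -sign, sign)
                err = np.sum(w * (pred != y)) / np.sum(w)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, M=20):
    w = np.full(len(y), 1.0 / len(y))
    stumps, alphas = [], []
    for _ in range(M):
        err, j, t, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)
        pred = np.where(X[:, j] < t, -sign, sign)
        w *= np.exp(alpha * (pred != y))          # up-weight the mistakes
        stumps.append((j, t, sign))
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    score = sum(a * np.where(X[:, j] < t, -s, s) for a, (j, t, s) in zip(alphas, stumps))
    return np.sign(score)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (300, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5, 1, -1)   # circular boundary: hard for one stump
stumps, alphas = adaboost(X, y)
print((predict(X, stumps, alphas) == y).mean())          # training accuracy well above a single stump
```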



Deep Learning

Neural Networks

  1. Introduction to building Neural Networks:
    • Introduction to ReLU
    • Introduction to feature generation
    • Introduction to stacking layers
    • End-to-End Learning
  2. Notations:
    • Inputs: x_i — the i-th input feature; x_i = a[0]_i
    • Hidden layer outputs: a[l]_j — the j-th unit in layer l
    • foo[l] means anything associated with layer l
    • w — weight vectors; W — weight matrix; W_i — the i-th row of W
  3. A neuron consists of a linear function followed by a non-linear activation
    z[l]_i = W[l]_i^T * a[l-1] + b[l]_i (where a0 = x)
    a[l]_i = g(z[l]_i)

Vectorization

  1. Calc with vectorization
    z[l] = W[l] * a[l-1] + b[l] and a[l] = g(z[l]) (with a[0] = x)
  2. The necessity of using non-linearity
  3. Moreover, stacking the m examples as columns: Z[l] = W[l] * A[l-1] + (broadcast)b[l], with A[0] = X (see the sketch below)
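A vectorized forward pass for a small 2-layer network, matching the Z[l] = W[l] * A[l-1] + b[l] convention above; the layer sizes and random parameters are placeholders:

```
import numpy as np

def relu(z):
    return np.maximum(z, 0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """X has one example per column (shape n x m), matching the batched notation Z = W A + b."""
    W1, b1, W2, b2 = params
    Z1 = W1 @ X + b1          # b1 broadcasts across the m columns
    A1 = relu(Z1)
    Z2 = W2 @ A1 + b2
    return sigmoid(Z2)        # final layer: probability of the positive class

rng = np.random.default_rng(0)
n, hidden, m = 3, 4, 5                               # 3 features, 4 hidden units, 5 examples
params = (rng.standard_normal((hidden, n)), np.zeros((hidden, 1)),
          rng.standard_normal((1, hidden)), np.zeros((1, 1)))
print(forward(rng.standard_normal((n, m)), params))  # shape (1, m)
```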

Backpropagation

  1. Parameter Initialization
    • Problem: symmetry (if all weights start equal, every unit in a layer computes the same function and stays identical)
    • Solution:
      • Random Initialization
      • Xavier/He Initialization
        Denote by n[l] the number of neurons in layer l; then w[l] ~ hN(0, 2 / (n[l] + n[l + 1])), i.e. variance 2 / (n[l] + n[l+1])
  2. Optimization
    • Backpropagation
    • Mini-batch Gradient Descent
      Momentum
  3. Regularization
    • L2 Regularization
      J_L2 = J + lambda * ||W||^2 / 2
    • Parameter Sharing: Convolutional Neural Networks



Unsupervised Learning Problem

k-Means clustering algorithm

  1. we are given a training set and want to group the data into a few cohesive “clusters”. Which means no labels yi are given
  2. The k-mean clustering algorithm
    ```
    1. Initialize cluster centroids miu_i, i=1, …, k randomly
    2. Repeat until convergence: {
        For every i, set: ci := argmin_j(||xi - miu_j||^2)
        For every j, set: miu_j := sum_i(1{ci=j} * xi) / sum_i(1{ci=j})
    }
    ```
  3. Distortion Function
    • J(c,miu) = sum_i(||xi - miu_ci||^2)
    • What we are doing is repeatedly minimizing J with respect to c and with respect to miu, holding the other fixed
    • But J is a non-convex function, so the algorithm may get stuck in a local minimum
    • We can run it many times with different random initializations and keep the clustering that gives the lowest distortion (see the sketch below)
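A compact sketch of the k-means loop above (random-sample initialization; a production version would also guard against empty clusters and run several restarts):

```
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), k, replace=False)]          # initialize centroids from the data
    for _ in range(iters):
        # assignment step: c_i = argmin_j ||x_i - mu_j||^2
        c = np.argmin(((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2), axis=1)
        # update step: mu_j = mean of the points assigned to cluster j
        mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
    return c, mu

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc, 0.3, (100, 2)) for loc in ([0, 0], [3, 0], [0, 3])])
c, mu = kmeans(X, k=3)
print(np.round(mu, 2))   # centroids near [0,0], [3,0], [0,3] (in some order)
```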

Mixtures of Gaussians and the EM algorithm

  1. Mixture of Gaussians:
    What we want is to specify a joint distribution p(xi, zi) = p(xi | zi) * p(zi)
    Where xi’s is the input training set, zi ~ Multinomial(phi) is a latent variable and xi | zi = j ~ hN(miu_j, Sigma_j), let k denote the number of values that the zi’s can take on
  2. Likelihood:
    l(phi, miu, Sigma) = sum_i(log(p(xi; phi, miu, Sigma))) = sum_i(log(sum_zi(p(xi | zi; miu, Sigma) * p(zi; phi))))
  3. zi’s are unknown, so it’s difficult to solve. But if zi’s are known, then this become a classification problem, and easy to solve.
    • phi_j = sum_i(1{zi = j}) / m (m, the number of inputs)
    • miu_j = sum_i(1{zi = j} * xi) / sum_i(1{zi = j})
    • Sigma_j = sum_i(1{zi = j} * (xi - miu_j)(xi - miu_j)^T) / sum_i(1{zi = j})

  4. The EM Algorithm:
    • E-step: try to guess the values of the zi’s
      M-step: Update the parameters of our model based on our guesses
        Repeat until convergence: {
            (E) For each `i, j`, set
                wi_j := p(zi = j | xi; phi, miu, Sigma)
            (M) Update the parameters by 
                phi_j := sum_i(wi_j) / m
                miu_j := sum_i(wi_j * xi) / sum_i(wi_j)
                Sigma_j := sum_i(wi_j * (xi - miu_j)(xi - miu_j)^T) / sum_i(wi_j)
        }
      

      Where p(zi = j | xi; phi, miu, Sigma) = p(xi | zi = j; miu, Sigma) * p(zi = j; phi) / sum_l(p(xi | zi = l; miu, Sigma) * p(zi = l; phi))

    • Compared with k-mean,
      • k-means use the “hard” cluster assignments ci
      • EM use the “soft” assignments wi_j
      • Similar to K-means, it is also susceptible to local optima, so reinitializing at several different initial parameters may be a good idea
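A sketch of the E/M updates above for a mixture of Gaussians; the initialization is a simple random choice, and (as noted above) a real run would restart from several initial parameters:

```
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    d = X.shape[1]
    diff = X - mu
    inv, det = np.linalg.inv(Sigma), np.linalg.det(Sigma)
    return np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1)) / np.sqrt((2 * np.pi) ** d * det)

def gmm_em(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    phi = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, k, replace=False)]
    Sigma = np.array([np.cov(X.T) for _ in range(k)])
    for _ in range(iters):
        # E-step: w_ij = p(z_i = j | x_i; phi, mu, Sigma)
        w = np.column_stack([phi[j] * gaussian_pdf(X, mu[j], Sigma[j]) for j in range(k)])
        w /= w.sum(axis=1, keepdims=True)
        # M-step: update phi, mu, Sigma using the soft assignments
        phi = w.mean(axis=0)
        mu = (w.T @ X) / w.sum(axis=0)[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (w[:, j, None] * diff).T @ diff / w[:, j].sum()
    return phi, mu, Sigma

rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], 0.2 * np.eye(2), 150),
               rng.multivariate_normal([3, 3], 0.5 * np.eye(2), 150)])
phi, mu, Sigma = gmm_em(X, k=2)
print(np.round(phi, 2), np.round(mu, 2))   # weights ~[0.5, 0.5], means near the true centers
```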



The EM Algorithm: A More General View

Jensen’s Inequality

  1. If f''(x) > 0 for all x (in the vector-valued case, the corresponding statement is that H must be positive definite, written H > 0), then we say f is strictly convex

  2. Jensen’s Inequality Theorem:
    let f be a convex function, and let X be a random variable. Then:
    E[f(X)] >= f(E[X])
    Moreover, if f is strictly convex, then = holds true if and only if X = E[X] with probability 1 (i.e., if X is a constant)

  3. Concave version of Jensen’s Inequality

The EM Algorithm

  1. Given a training set {x1, ..., xm}, we wish to fit the parameters of a model p(x, z) to the data, where the likelihood is given by:
    l(theta) = sum_i(log(p(xi; theta))) = sum_i(log(sum_zi(p(xi, zi; theta))))
    Where zi’s are the latent random variables

  2. The EM algorithm:
    • Idea: As explicitly maximizing l(theta) might be difficult, we instead repeatedly construct a lower-bound on l (E-step), and then optimize that lower-bound (M-step)
    • For each i, let Q_i be some distribution over the z’s (sum_z(Q_i(z)) = 1, Q_i(z) >= 0). Then
      sum_i(log(p(xi; theta))) =
      sum_i(log(sum_zi(p(xi, zi; theta)))) =
      sum_i(log(sum_zi(Q_i(zi) * p(xi, zi; theta) / Q_i(zi)))) >=
      sum_i(sum_zi(Q_i(zi) * log(p(xi, zi; theta) / Q_i(zi))))
      Which give a lower bound for any distribution Q_i
    • For any fixed theta, we’ll try to make the bound tight, which is to say, make the equality hold:
      p(xi, zi; theta) / Q_i(zi) = c
      Also sum_z(Q_i(z)) = 1
      We have Q_i(zi) = p(xi, zi; theta) / sum_z(p(xi, z; theta)) = p(xi, zi; theta) / p(xi; theta) = p(zi | xi; theta)
    • Above gives us the EM algorithm
        Repeat until convergence: {
            (E) For each i, set
                Q_i(zi) := p(zi | xi; theta)
            (M) Set
                theta := argmax(sum_i(sum_zi(Q_i(zi) * log(p(xi, zi; theta) / Q_i(zi)))))
        }
      
    • One can prove that the algorithm converges; a reasonable convergence test would be to check whether the increase in l(theta) between successive iterations is smaller than some tolerance parameter, and to declare convergence if EM is improving l(theta) too slowly

Specified with the Example Before: Mixture of Gaussians



Factor Analysis

  1. Unless m exceeds n by some reasonable amount, the maximum likelihood estimates of the mean and covariance may be quite poor

Restrictions of Sigma

  1. If we do not have sufficient data to fit a full covariance matrix, we may place some restrictions on the space of matrices Σ that we will consider

  2. Fit a covariance matrix Σ that is diagonal: (m is the number of x)
    Sigma_jj = sum_i((xi_j - miu_j)^2) / m
    A diagonal Sigma corresponds to a Gaussian where the major axes of these ellipses are axis-aligned
  3. Sometimes, we may place a further restriction on the covariance matrix: not only must it be diagonal, but its diagonal entries must all be equal (m is the number of x's, n is the number of dimensions)
    Sigma = sigma^2 * I; sigma^2 = sum_j(sum_i((xi_j - miu_j)^2)) / (m * n)
    This model corresponds to using Gaussians whose densities have contours that are hyperspheres

  4. Restricting Σ to be diagonal also means modeling the different coordinates x_i, x_j of the data as being uncorrelated and independent
    Therefore, it will fail to capture any correlations in the data

Marginals and Conditionals of Gaussians

  1. Partitioned (block) multivariate Gaussian distributions and their marginals:
    • x=(x_1, ..., x_n)^T , miu=(miu_1,..., miu_n)^T
      Sigma = (Sigma_ij)_n*n, where Sigma_ij = E[(x_i - miu_i) * (x_j - miu_j)^T]
    • Marginal distributions of Gaussians are themselves Gaussian: x_i ~ hN(miu_i, Sigma_ii)
    • The conditional distribution of x_i given x_j: x_i | x_j ~hN(miu_i|j, Sigma_i|j) , where miu_i|j = miu_i + Sigma_ij * (Sigma_jj)^-1 * (x_j - miu_j) ; Sigma_i|j = Sigma_ii - Sigma_ij * (Sigma_jj)^-1 * (Sigma_ji)

The Factor Analysis Model

  1. We posit a joint distribution on (x, z) as follows, where z (k-D) is a latent random variable: (x is n-D, and usually k < n)
    z ~ hN(0, I) ; x|z ~ hN(miu + Lamda * z, Psi) (Psi is a diagonal matrix)
  2. Or equivalently:
    z ~ hN(0, I) ; epsilon ~ hN(0, Psi) ; x = miu + Lamda * z + epsilon

  3. Let’s work out exactly what distribution our model defines:
    • First, there’s a joint distribution:
      (z, x)^T ~ hN(miu_zx, Sigma)
    • Find miu_zx:
      We have: E[z] = E[epsilon] = 0
      Then E[x] = miu + Lamda * E[z] + E[epsilon] = miu
      So miu_zx = (0(k-D), miu)^T
    • Find Sigma:
      Sigma_zz = I
      Sigma_zx = E[(z - E[z]) * (x - E[x])^T] = E[z * (miu + Lamda * z + epsilon - miu)^T] = E[z * z^T] * Lamda^T + E[z * epsilon^T] = Lamda^T
      Sigma_xz = Lamda
      Sigma_xx = Lamda * Lamda^T + Psi
    • Maximize likelihood: x ~ hN(miu, Lamda * Lamda^T + Psi)

EM for Factor Analysis

  1. Using the former equations to calc miu_zi|xi; Sigma_zi|xi
    Then generate the expression of Q_i(zi)
  2. Then we can get how the optimization is done in M-step
  3. Note that E[z * z^T] and E[z] * E[z^T] are different



Principal Components Analysis (PCA)

  1. For a dataset xi, i = 1, ..., m, where each xi is n-dimensional, suppose the data really lies (approximately) on a subspace of much lower dimension k << n, unknown to us.
    This means that the n coordinates we observe for x are largely redundant, i.e. some of the features can (approximately) be derived from the others
    Equivalently, the inputs approximately lie on some hyperplane / direction / diagonal axis in the high-dimensional space

  2. First we do the centering (zero-meaning) and normalizing of the input xi’s
    Centering can be omitted if the input is already zero mean; Normalizing can be omitted if the different attributes are all on the same scale
  3. Then to find the ‘axis’, one way to do so is as finding the unit vector u ( |u| = 1 ) so that when the data is projected onto the direction corresponding to u, the variance of the projected data is maximized
    u = argmax(sum_i((xi^T * u)^2) / m) = argmax(u^T * (sum_i(xi * xi^T) / m) * u)
    • Where as we zero-meaned the input x , we have Sigma = sum_i(xi * xi^T) / m is just the covariance matrix of the data
    • We easily recognize that what we want to do here is to find the principal eigenvector of Sigma
  4. More generally, if we wish to project our data into a k-dimensional subspace ( k < n ), we should choose u_1, ..., u_k to be the top k eigenvectors of Sigma
    And the u_i’s now form a new, orthogonal basis for the data
  5. Then to represent xi in this basis, we need only compute the corresponding vector yi :
    yi = (u_1^T * xi, ..., u_k^T * xi)^T

  6. PCA is also called a dimensionality reduction algorithm.
    The vectors u_i’s are called the first k principal components of the data

  7. Application:
    • Compression
    • Plot and view
    • Reduce dimension before training
    • Noise reduction algorithm
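A sketch of PCA via the eigendecomposition of the empirical covariance matrix, following steps 2-5 above (equivalently one could use the SVD of the centered data matrix):

```
import numpy as np

def pca(X, k):
    """Return the top-k principal components and the projected data."""
    X = X - X.mean(axis=0)                         # center (zero-mean) the data
    X = X / X.std(axis=0)                          # normalize each attribute's scale
    Sigma = X.T @ X / len(X)                       # empirical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigh: Sigma is symmetric
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # top-k eigenvectors (principal components)
    return U, X @ U                                # y_i = (u_1^T x_i, ..., u_k^T x_i)

rng = np.random.default_rng(0)
t = rng.standard_normal(200)
X = np.column_stack([t, 2 * t + 0.05 * rng.standard_normal(200),
                     0.5 * rng.standard_normal(200)])   # first two features are nearly redundant
U, Y = pca(X, k=1)
print(U[:, 0])   # the first principal component loads mainly on the two correlated features
```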

Independent Components Analysis (ICA)

  1. Some data s (n dimension) that is generated via n independent sources
    What we observe is x = A * s where A is an unknown square matrix called the mixing matrix
    The dataset is { xi; i = 1,...,m }, and our goal is to recover the sources si that generated our data ( xi = A * si )
    Let W = A^-1 be the unmixing matrix, our goal is to find W (and let w_i^T denote the i-th row of W)

ICA Ambiguities

  1. There are some inherent ambiguities in A that are impossible to recover, given only the xi’s
    • Permutation
    • Scaling
    • For Gaussian sources s, because of the rotational symmetry of the Gaussian distribution, there will be an arbitrary rotational component in the mixing matrix that cannot be determined from the data
  2. As long as the data is not Gaussian, it is possible, given enough data, to recover the n independent sources

Densities and linear transformations

  1. Let p_s(s) be the density of s, and p_x(x) be the density of x, with other notation inherited from above. Then we have:
    p_x(x) = p_s(W * x) * |W|
    Where |W| is the determinant of the matrix W

ICA Algorithm

  1. The derivation of ICA:
    • We suppose that the distribution of each source s_i is given by a density p_s, and that the joint distribution of the sources s is given by
      p(s) = pi_i(p_s(s_i))
      Note that by modeling the joint distribution as a product of the marginal, we capture the assumption that the sources are independent
    • Then we get:
      p(x) = pi_i(p_s(w_i^T * x)) * |W|
      And all we need to do is to specify a density for the individual sources p_s
    • To do so, all we need to do is to specify some cumulative distribution function (cdf) for it
      As we cannot choose the cdf of the Gaussian distribution, we choose the sigmoid function
      Then p_s(s) = g'(s)
    • The likelihood:
      hl(W) = sum_i(sum_j(log(g'(w_j^T * xi))) + log(|W|))
    • Knowing that grad_W(|W|) = |W| * (W^-1)^T
      We can derive that W := W + alpha * ([1 - 2 * g(w_i^T * xi)]_n*1 * xi^T + (W^T)^-1)
      Where alpha is the learning rate



Reinforcement Learning and Control

Markov Decision Process (MDP)

  1. A MDP is a tuple (S, A, {P_sa}, gamma, R) where
    • S is a set of states
    • A is a set of actions
    • P_sa are the state transition probabilities, gives the distribution over what states we will transition to if we take action a in state s
    • gamma belongs to [0,1) is called discount factor
    • R: S (x A) >-> R is the reward function

  2. The dynamics of an MDP proceeds as follows:
    We start in some state s_0, and get to choose some action a_0 ∈ A to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor state s_1, drawn according to s_1 ∼ P_s0a0. Then, we get to pick another action a_1 ……
  3. Upon visiting the sequence of states s_0, s_1, ... with actions a_0, a_1, ... , our total payoff is given by
    R(s_0, a_0) + gamma * R(s_1, a_1) + gamma^2 * R(s_2, a_2) + ...
    Or R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ... (take this as the example below for simplicity)
  4. Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff
    E[R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ...]

  5. Def: Policy
    A policy is any function pi: S >-> A mapping from the states to the actions
  6. Def: Value function
    The value function for a policy pi is
    Vpi(s) = E[R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ... | s_0 = s, pi]
  7. Def: Optimal Value function
    V^*(s) = max_pi(Vpi(s))

  8. Bellman equations: (The value function satisfies)
    • Vpi(s) = R(s) + gamma * sum_s'(P_(spi(s))(s') * Vpi(s')) = R(s) + gamma * E_(s'~P_spi(s))[Vpi(s')]
    • Note: Bellman’s equations can be used to efficiently solve for Vpi. Specifically, in a finite-state MDP, we can write down one such equation for Vpi(s) for every state s. This gives us a set of |S| linear equations in |S| variables which can be efficiently solved
    • Bellman equation for the optimal value function:
      V^*(s) = R(s) + max_a(gamma * sum_s'(P_sa(s') * V^*(s')))
  9. Def: pi^* (Equation (3))
    pi^*(s) = argmax_a(sum_s'(P_sa(s') * V^*(s')))
  10. Then we get for every state s and every policy pi :
    V^*(s) = Vpi^*(s) >= Vpi(s)
    Note that pi^* has the interesting property that it is the optimal policy for all states s

Value iteration and policy iteration

For now, we will consider only MDPs with finite state and action spaces

  1. The Value Iteration
    • the process
      ```
      1. For each state s, initialize V(s) := 0.
      2. Repeat until convergence: {
          For every state, update V(s) := R(s) + max_a(gamma * sum_s'(P_sa(s') * V(s')))
      }
      ```
    • Synchronous and Asynchronous updates
    • Having found V^* , we can then use Equation (3) to find the optimal policy

  2. The Policy Iteration
    • The process
      ```
      1. Initialize pi randomly.
      2. Repeat until convergence: {
          (a) Let V := Vpi
          (b) For each state s, let pi(s) := argmax_a(sum_s'(P_sa(s') * V(s')))
      }
      ```
    • Note that step (a) can be done via solving Bellman’s equations as described earlier
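A sketch of synchronous value iteration on a tiny made-up MDP (two states, two actions; the transition probabilities and rewards are arbitrary illustrative numbers):

```
import numpy as np

# P[a][s, s'] = P_sa(s'): transition probabilities for a 2-state, 2-action MDP
P = np.array([[[0.9, 0.1],     # action 0, from state 0 / state 1
               [0.8, 0.2]],
              [[0.2, 0.8],     # action 1, from state 0 / state 1
               [0.1, 0.9]]])
R = np.array([0.0, 1.0])       # reward depends only on the state, R(s)
gamma = 0.9

V = np.zeros(2)
for _ in range(200):           # repeat until (approximate) convergence
    # V(s) := R(s) + max_a gamma * sum_s' P_sa(s') V(s')   (synchronous update)
    V = R + gamma * np.max(P @ V, axis=0)

pi = np.argmax(P @ V, axis=0)  # Equation (3): greedy policy w.r.t. V*
print(np.round(V, 3), pi)      # optimal values and the optimal action per state
```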

Learning a model for an MDP

  1. In many realistic problems, we are not given state transition probabilities and rewards explicitly, but must instead estimate them from data

  2. Given that the MDP consisting of a number of trials, we can then easily derive the maximum likelihood estimates for the state transition probabilities
    P_sa(s') = No. of times we took action a in state s and got to s' / No. of times we took action a in state s
    Or if the ratio above is 0 / 0, we might simply estimate P_sa(s') to be 1 / |S|
  3. if R is unknown, we can also pick our estimate of the expected immediate reward R(s) in state s to be the average reward observed in state s

Continuous state MDPs

We now discuss algorithms for MDPs that may have an infinite number of states

  1. Discretization
    Downsides:
    • it assumes that the value function takes a constant value over each of the discretization intervals
    • curse of dimensionality

  2. Value function approximation: Using a model or simulator
    • a simulator is a black-box that takes as input any (continuous-valued) state s_t and action a_t, and outputs a next-state s_(t+1) sampled according to the state transition probabilities P_stat
    • Several ways to get such model:
      • Physics simulation
      • Get a model is to learn one from data collected in the MDP
        Deterministic / Stochastic model
  3. Value function approximation: Fitted value iteration algorithm
    • Assume that the problem has a continuous state space S of n-D, but that the action space A is small and discrete
    • Recall that in value iteration
      Vpi(s) = R(s) + gamma * integral_s'(P_(spi(s))(s') * Vpi(s'))ds' = R(s) + gamma * E_(s'~P_spi(s))[Vpi(s')]
    • The main idea:
      • Approximately carry out this step, over a finite sample of states si using machine learning
      • Specifically, we will use a supervised learning algorithm—linear regression in our description below—to approximate the value function as a linear or non-linear function of the states:
        V(s) = theta^T * phi(s) , where phi is some appropriate feature mapping of the states
    • For each state s in our finite sample of m states, fitted value iteration will first compute a quantity yi , which will be our approximation to R(s) + gamma * max_a(E_(s'∼P_sa)[V(s')]) . Then, it will apply a supervised learning algorithm to try to get V(si) = theta^T * phi(si) close to yi
    • Cannot be proved to always converge
    • When we need to choose an action for some state s, the process for computing/approximating it is similar to the inner loop of fitted value iteration
    • The method for choosing k above (the number of next-states sampled from P_sa to approximate each expectation)



LQR, DDP and LQG

Linear Quadratic Regulation, Differential Dynamic Programming and Linear Quadratic Gaussian

Finite-horizon MDPs

  1. To place ourselves in a more general setting:
    • We rewrite the sum or integral in the expectation form: E_(s'~P_sa)[Vpi^*(s')]
    • Assume the reward also depends on the action, R(s, a); then the reward function goes inside the max or argmax
    • Finite horizon: we set a finite ending time T
      Therefore, our definition of payoff is going to be slightly different: ( gamma become not necessary)
      R(s_0, a_0) + R(s_1, a_1) + ...+ R(s_T, a_T)

  2. In this new setting, things behave quite differently:
    • First, the optimal policy pi^* might be non-stationary, meaning that it changes over time:
      pit : S -> A (e.g. a_0 = pi0(s_0) )
    • Time Dependent dynamics:
      s_(t+1) ~ Pt_stat
      Rt(s_t, a_t)
      Then (S, A, Pt_sa, T, Rt)
    • Def Value function and optimal value function as before

  3. Bellman’s equation is made for Dynamic Programming
  4. Solve for the optimal value function:
    ```
    1. Compute V^*_T(s) := max_a(RT(s, a))
    2. For t = T-1, ..., 0: compute V^*_t(s) := max_a(Rt(s, a) + E_(s'~Pt_sa)[V^*_(t+1)(s')])
    ```

Linear Quadratic Regulation (LQR)

  1. Assumptions:
    • S n-D, A d-D
    • Linear transitions with noise:
      s_(t+1) = A_t * s_t + B_t * a_t + w_t
      where Gaussian noise w_t ~ hN(0, Sigma_t)
    • Quadratic Rewards:
      Rt(s_t, a_t) = -s_t^T * U_t * s_t - a_t^T * W_t * a_t
      where U_t, W_t are both positive definite matrices
      Note that the quadratic formulation of the reward is equivalent to saying that we want to take smooth action to make our state be close to the origin
  2. Steps:
    • step 1: If we need to estimate A, B, Sigma, first, collect transitions from an arbitrary policy. Then, use linear regression to find argmin_(A, B)(sum_i(sum_t(||si_(t+1) - (A * si_t + B * ai_t)||^2))) . Finally, use a technique seen in Gaussian Discriminant Analysis to learn Sigma
    • Step 2: Derive the optimal policy using dynamic programming
      • Fact 1: It can be shown that if V^*_(t+1) is a quadratic function in s_t , then V^*_t is also a quadratic function:
        If V^*_(t+1)(s_(t+1)) = s_(t+1)^T * phi_(t+1) * s_(t+1) + psi_(t+1)
        Then V^*_t(s_t) = s_t^T * phi_t * s_t + psi_t
        Where for t = T , we had phi_t = -U_T, psi_t = 0
      • Fact 2: We can show that the optimal policy is just a linear function of the state
        Discrete Riccati equation:
        phi_t = A_t^T * (phi_(t+1) - phi_(t+1) * B_t * (B_t^T * phi_(t+1) * B_t - W_t)^-1 * B_t^T * phi_(t+1)) * A_t - U_t
        psi_t = -tr(Sigma_t * phi_(t+1)) + psi_(t+1)
      • Fact 3: We notice that phi_t depends on neither psi nor the noise Sigma. It implies that the optimal policy also does not depend on the noise
        (But psi_t does depend on Sigma_t, which implies that V^*_t depends on Sigma)
    • As the optimal policy does not depend on psi_t, and the update of phi_t only depends on phi_(t+1), it is sufficient to update only phi_t

From non-linear dynamics to LQR

  1. It turns out that a lot of problems can be reduced to LQR, even if dynamics are non-linear
  2. Taylor expansion: works well for cases where the goal is to stay around some state s^*
  3. Differential Dynamic Programming (DDP)
    We’ll cover a method that applies when our system has to follow some trajectory (think about a rocket). This method is going to discretize the trajectory into discrete time steps, and create intermediary goals around which we will be able to use the previous technique
  4. Steps:
    • Step 1: come up with a nominal trajectory using a naive controller that approximates the trajectory we want to follow. In other words, our controller is able to approximate the goal trajectory with:
      s^*_0, a^*_0 -> s^*_1, a^*_1 -> ...
    • Step 2: linearize the dynamics around each trajectory point s^*_t using Taylor expansion
      Note We can apply a similar derivation for the reward Rt , with a second-order Taylor expansion
    • Step 3: Now, the problem is strictly re-written in the LQR framework. Let’s just use LQR to find the optimal policy pi_t . As a result, our new controller will (hopefully) be better
    • Step 4: We use the new controller (our new policy pi_t) to produce a new trajectory
      Note that when we generate this new trajectory, we use the real F and not its linear approximation to compute transitions
    • Then, go back to step 2 and repeat until some stopping criterion

Linear Quadratic Gaussian (LQG)

  1. So far, we assumed that the state was available. As this might not hold true for most of the real-world problems, we need a new tool to model this situation:
    Partially Observable MDPs (POMDP)
  2. A POMDP is an MDP with an extra observation layer.
    In other words, we introduce a new variable o_t , that follows some conditional distribution given the current state s_t
    o_t | s_t ~ O(o | s)
  3. Formally, a finite-horizon POMDP is given by a tuple:
    (S, O, A, P_sa, T, R)
    Within this framework, the general strategy is to maintain a belief state (distribution over states) based on the observation o_1, ..., o_t . Then, a policy in a POMDP maps belief states to actions

  4. Suppose that, we observe y_t (m-D) with m < n such that:
    y_t = C * s_t + v_t
    s_(t+1) = A * s_t + B * a_t + w_t
    Where v_t is the sensor noise (also gaussian, like w_t )
  5. Note that the reward function Rt is left unchanged, as a function of the state (not the observation) and action

  6. Steps:
    • Step 1: first, compute the distribution on the possible states ( s_t|t, Sigma_t|t ), based on the observations we have:
      s_t | y_1, ..., y_t ~ hN(s_t|t, Sigma_t|t)
      Using Kalman Filter algorithm
    • Step 2: We’ll use the mean s_t|t as the best approximation for s_t
    • Step 3: then set the action a_t := L_t * s_t|t where L_t comes from the regular LQR algorithm
  7. Kalman filter algorithm
    • Predict step: compute s_(t+1) | y_1, ..., y_t
      Suppose we know: s_t | y_1, ..., y_t ~ hN(s_t|t, Sigma_t|t) . Then s_(t+1) | y_1, ..., y_t ~ hN(s_t+1|t, Sigma_t+1|t) , where s_t+1|t = A * s_t|t; Sigma_t+1|t = A * Sigma_t|t * A^T + Sigma_s ( Sigma_s is the covariance of the transition noise w_t )
    • Update step: compute s_(t+1) | y_1, ..., y_(t+1)
      As we know above, we can prove that s_(t+1) | y_1, ..., y_(t+1) ~ hN(s_t+1|t+1, Sigma_t+1|t+1) , where s_t+1|t+1 = s_t+1|t + K_t * (y_(t+1) - C * s_t+1|t); Sigma_t+1|t+1 = Sigma_t+1|t - K_t * C * Sigma_t+1|t with K_t := Sigma_t+1|t * C^T * (C * Sigma_t+1|t * C^T + Sigma_y)^-1 ( Sigma_y is the covariance of the sensor noise v_t , and K_t is called the Kalman gain)
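A sketch of repeated predict/update cycles of the Kalman filter equations above, tracking a toy linear system with scalar observations; the matrices A, B, C and the noise covariances are made-up values:

```
import numpy as np

def kalman_step(s, Sig, a, y_next, A, B, C, Sigma_s, Sigma_y):
    """One predict + update step: (s_{t|t}, Sigma_{t|t}) -> (s_{t+1|t+1}, Sigma_{t+1|t+1})."""
    # predict: s_{t+1|t} = A s_{t|t} + B a_t,  Sigma_{t+1|t} = A Sigma_{t|t} A^T + Sigma_s
    s_pred = A @ s + B @ a
    Sig_pred = A @ Sig @ A.T + Sigma_s
    # update with the new observation y_{t+1}, using the Kalman gain K_t
    K = Sig_pred @ C.T @ np.linalg.inv(C @ Sig_pred @ C.T + Sigma_y)
    s_new = s_pred + K @ (y_next - C @ s_pred)
    Sig_new = Sig_pred - K @ C @ Sig_pred
    return s_new, Sig_new

# 2-D state (position, velocity), 1-D observation of position, no control for simplicity
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.zeros((2, 1)); C = np.array([[1.0, 0.0]])
Sigma_s, Sigma_y = 0.01 * np.eye(2), np.array([[0.25]])
s, Sig = np.zeros(2), np.eye(2)
rng = np.random.default_rng(0)
true_s = np.array([0.0, 1.0])
for t in range(20):
    true_s = A @ true_s                                   # true dynamics (noise omitted here)
    y = C @ true_s + 0.5 * rng.standard_normal(1)         # noisy position measurement
    s, Sig = kalman_step(s, Sig, np.zeros(1), y, A, B, C, Sigma_s, Sigma_y)
print(np.round(s, 2), np.round(true_s, 2))                # filtered estimate vs. true state
```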