- Introduction
- Linear Regression
- Classification and logistic regression
- Generalized Linear Models(GLMs)
- Generative Learning Algorithms
- Support Vector Machines
- Learning Theory
- Regularization and Model Selection
- Decision Tree & Ensembling Methods
- Deep Learning
- Unsupervised Learning Problem
- The EM Algorithm: A More General View
- Factor Analysis
- Principal Components Analysis (PCA)
- Independent Components Analysis (ICA)
- Reinforcement Learning and Control
- LQR, DDP and LQG
Introduction
- What is machine learning?
A program is said to learn from experience E with respect to task T and performance measure P if its performance on T, as measured by P, improves with experience E. - Types:
- Supervised learning & Unsupervised learning
- Reinforcement learning & recommender systems
- How to apply learning algorithm
- Basic knowledge of linear algebra
- Matrix: rows x columns
- Vector: An n x 1 matrix; 1-indexed and 0-indexed(The first element of vector y is called y1 or y0)
- Scalar means real number
- PredictionVector = DataMatrix x ParametersVector
- PredictionMatrix = DataMatrix x Multi_Pairs_ParametersMatrix
- Matrix multiplication: not commutative, but associative; Inverse (a matrix without an inverse is called singular or degenerate); Transpose
Supervised learning
- Def: We give the computer a dataset with the correct answer for each example
- Regression: Predict continuous value output
- Classification: Discrete valued output
- Dealing with infinitely many features: there are algorithms that can handle this
Unsupervised learning
- Def: We just give the computer a dataset, and let the computer find the structure of the data
- Type 1: Split the data into different groups
- Type 2: Audio separation
Linear Regression
- Training set -> Learning algorithm -> h: h(x) = y (hypothesis)
- x input, y output, m num of training examples
- h_theta(x) = theta0 + theta1 * x (called univariate linear regression)
h_theta(xi) = (xi)^T * theta
LMS(Least Mean Square) Algorithm
- Def Loss Function — how to choose thetas
- The idea: minimize the separation of the h(x)'s from the y's in the training set
- In this example
minimize(sum((h(xi) - yi) ^ 2) / (2m))
(what we minimize is also called the Squared Error Cost Function) - Def cost function (J) as the function we try to minimize
- Understanding of Hypothesis and Cost Function
- J is a function of parameters; For fixed parameters, h is a function of x
- For each value of parameters, J([parameters]) corresponds to a specific hypothesis
- The high-D understanding of h & J
- Gradient Descent — How to automatically find the parameters that minimize J
- Generally we have a function J
- Start with some parameters
- keep changing them to reduce J, until hopefully we end up at a minimum
- Gradient: the direction of fastest change of a function
thetaj := thetaj - alpha * pd(J) / pd(thetaj)
, in which alpha is called the learning rate; too small — slow; too large — may overshoot the minimum, fail to converge, or even diverge - BTW: To simultaneously update all parameters correctly, we need to save all the new values in temporary variables and update them together after all the calculations are done. Otherwise, the parameters calculated later will not use the right values
- Specified to cost func J
- That’s why there is a ‘2’ in the cost function (it cancels when taking the derivative)
theta_j := theta_j + alpha * (yi - h(xi)) * xi_j
- Convex function — bowl shape
- Batch Gradient Descent: Each step of gradient descent uses all the training examples (the sum covers all of them); there are other ways we’ll talk about later.
- Stochastic (Incremental) Gradient Descent: Each step uses a single example (or a small sample of examples)
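A minimal NumPy sketch of the two update rules above for linear regression; the data layout (X with a leading column of 1s, labels y) and the hyperparameter values are illustrative assumptions, not part of the notes.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch GD for J(theta) = sum((X @ theta - y)^2) / (2m); each step uses all m examples."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m
        theta -= alpha * grad
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10):
    """Stochastic GD: theta_j := theta_j + alpha * (yi - h(xi)) * xi_j, one example at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):
            theta += alpha * (y[i] - X[i] @ theta) * X[i]
    return theta
```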
The normal equations
- Some more Matrix calc
- Derivative: Df(A)=[pd(f) / pd(Aij)]
- Trace: trA(commutative)
- Some equation:
- D(_A)trAB = B^T
- D(_A^T)f(A) = (D(_A)f(A))^T
- D(_A)trABA^TC = CAB + C^TAB^T
- D(_A)|A| = |A| * (A^-1)^T
- The Normal Equation: In order to let DJ(theta) = 0
X^T * X * theta = X^T * y
- So by
theta = (X^T * X)^-1 * X^T * y
, we can minimize J without resorting to an iterative algorithm
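A one-line sketch of the normal equation, assuming X already contains the intercept column; using the pseudo-inverse instead of an explicit inverse is a personal numerical-stability choice.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form theta = (X^T X)^-1 X^T y (pinv guards against a singular X^T X)."""
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```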
Probabilistic interpretation
- The justification of LMS: Minimizing J = Maximizing the likelihood or log likelihood
L(theta) = pi(p(yi | xi; theta))
l(theta) = log(L(theta))
- In the example above,
y|x; theta ~ N(miu, sigma^2)
(N as Gaussian distribution)
Locally weighted linear regression(LWR)
- Underfitting & Overfitting
- Suppose we have enough data that the choice of features is less critical
- In LWR instead of minimize
sum(yi - theta^T * xi)^2
, we minimize sum(wi * (yi - theta^T * xi)^2)
, where wi’s are non-negative valued weights - Fairly standard choice of w
wi = e^(- (xi - x)^2 / (2 * tao^2))
where x is the query point, tao is called the bandwidth parameter - Non-parametric
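A small sketch of an LWR prediction at one query point under the Gaussian weighting above; the closed-form weighted normal equation is an assumed implementation detail (the notes only define the objective).

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Locally weighted linear regression at x_query.

    Weights: w_i = exp(-||x_i - x_query||^2 / (2 tau^2));
    theta solves the weighted normal equation (X^T W X) theta = X^T W y.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    theta = np.linalg.pinv(X.T @ W @ X) @ X.T @ W @ y
    return x_query @ theta
```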
Classification and logistic regression
For now, we’ll focus on the binary classification problem: 0 — negative class (‘-’); 1 — positive class (‘+’). Given xi, the corresponding yi is also called the label for the training example
Logistic regression
- Treated as continuous values prediction
- Logistic function(also called Sigmoid function)
g(z) = 1 / (1 + e^-z)
Also g'(z) = g(z) * (1 - g(z))
- Hypothesis:
h_theta(x) = g(theta^T * x)
- Probabilistic assumption:
- P(y = 1 | x; theta) = h_theta(x)
- P(y = 0 | x; theta) = 1 - h_theta(x)
- Generally: p(y | x; theta) = (h_theta(x))^y * (1 - h_theta(x))^(1 - y) (called the Bernoulli distribution, Bernoulli(phi = h_theta(x)))
- To maximize log likelihood
l(theta)
, we derive the same formula:
theta_j := theta_j + alpha * (yi - h_theta(xi)) * xi_j
Digression: The perceptron learning algorithm
- All the same as above but
g(z) = z<0?0:1
Another algorithm for maximizing l(theta)
- Newton’s method to find a zero for a func:
theta := theta - f(theta) / f'(theta)
(interpretation: repeatedly approximate f by its tangent line) - Apply to l(theta):
theta := theta - l'(theta) / l''(theta)
- Considering high-D: (Newton-Raphson method)
theta := theta - H^-1 * Delta_theta(l(theta))
where H is the Hessian: H_ij = pd^2(l(theta)) / (pd(theta_i) * pd(theta_j))
- Can understand as Gradient Descenting with a changing learning rate
alpha
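A short sketch of the Newton-Raphson update for logistic regression; the gradient/Hessian expressions are the standard ones implied by the Bernoulli likelihood above, and the fixed iteration count is an illustrative choice.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_newton(X, y, n_iters=10):
    """Newton's method for maximizing the logistic-regression log likelihood l(theta).

    Gradient: X^T (y - h); Hessian: -X^T S X with S = diag(h * (1 - h)),
    so theta := theta - H^-1 * grad becomes theta += (X^T S X)^-1 X^T (y - h).
    """
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)
        H = X.T @ np.diag(h * (1 - h)) @ X
        theta += np.linalg.solve(H, grad)
    return theta
```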
Generalized Linear Models(GLMs)
The exponential family
- The exponential family distributions:
p(y; yita) = b(y) * exp(yita^T * T(y) - a(yita))
where:
- yita — natural parameter(also called canonical parameter);
- T(y) — sufficient statistic;
- a(yita) — log partition function(
e^(-a(yita))
essentially plays the role of a normalization constant)
- Example 1: For Bernoulli(phi):
- T(y) = y;
- a(yita) = - log(1 - phi) = log(1 + e^yita);
- b(y) = 1
- Example 2: For Gaussian N(miu, sigma^2 = 1): (Because sigma has no effect on the final choice of theta and h)
- T(y) = y;
- a(yita) = miu^2 / 2 = yita^2 / 2;
- b(y) = exp(- y^2 / 2) / sqr_root(2 * pi)
- Other Examples: multinomial, Poisson, gamma, exponential, beta, Dirichlet, etc.
Constructing GLMs
- Assumption:
- y | x; theta ~ ExponentialFamily(yita) — the distribution of y given x and theta follows some exponential family distribution with parameter yita
- Given x, our goal is to predict the expected value of T(y) given x
In most of our examples, we will have T(y) = y, so this means we would like the prediction h(x) output by our learned hypothesis h to satisfy h(x) = E[y|x] (the conditional expectation)
- The natural parameter yita and the inputs x are related linearly: yita = theta^T * x (or, if yita is vector-valued, yita_i = theta_i^T * x)
- Some special name:
- y — Response variable
- Canonical response function g(yita) = E[T(y); yita] (i.e., h(x), which maps the natural parameter to the distribution's mean) & Canonical link function g^-1 (which maps the output mean back to yita)
- Ordinary Least Squares
- We model the conditional distribution of y given x as a Gaussian function above
- So we have
h_theta(x) = E[y | x; theta] = miu = yita = theta^T * x
- Logistic regression
- Bernoulli
h_theta(x) = E[y | x; theta] = phi = 1 / (1 + e^-yita) = 1 / (1 + e^-(theta^T * x))
- Softmax regression (multinomial distribution)
- For a multinomial over k possible outcomes, we’ll use k - 1 independent parameters to parameterize it
phi_1, phi_2, ..., phi_(k - 1)
phi_i = p(y = i; phi)
and p(y = k; phi) = 1 - sum(phi_i)
(let it = phi_k, where phi_k is not a parameter but just a notation) - A notation to introduce: 1{·} as
"The argument is true" ? 1 : 0
- To express the multinomial as an exponential family distribution:
- we define that
(T(y))_i = 1{y = i}
(for i = 1, 2, …, k-1)
That is: T(i) = [0, ..., 1 (i-th), ..., 0 ((k-1)-th)]^T
(for i = 1, 2, ..., k - 1); T(k) = [0, ..., 0]^T
Moreover, E[(T(y))_i] = P(y = i) = phi_i
yita = [log(phi_1 / phi_k), ..., log(phi_(k-1) / phi_k)]^T
a(yita) = - log(phi_k)
b(y) = 1
- The softmax function:
- The link function:
yita_i = log(phi_i / phi_k)
; we also define yita_k = 0
for convenience - The response function:
phi_i = e^yita_i / sum(e^yita_j)
- So that
p(y = i | x; theta) = phi_i = e^(theta_i^T * x) / sum(e^(theta_j^T * x))
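A tiny sketch of the softmax response function phi_i = e^yita_i / sum(e^yita_j) above; Theta-as-a-matrix-of-class-rows and the max-subtraction trick are illustrative choices.

```python
import numpy as np

def softmax_probs(Theta, x):
    """p(y = i | x; theta) = exp(theta_i^T x) / sum_j exp(theta_j^T x); one row of Theta per class."""
    eta = Theta @ x
    eta -= eta.max()          # subtracting the max keeps exp() numerically stable
    e = np.exp(eta)
    return e / e.sum()
```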
Generative Learning Algorithms
- Discriminative learning algorithms: try to learn p(y|x) directly, or learn mappings directly from the space of inputs to the labels
- Generative learning algorithms: instead, try to model p(y) (called the class priors) and p(x|y), then calc p(y|x) by Bayes rule:
p(y|x) = p(x|y) * p(y) / p(x)
where p(x) = sum(p(x|y = i) * p(y = i))
Moreover, arg max_y(p(y|x)) = arg max_y(p(x|y) * p(y) / p(x)) = arg max_y(p(x|y) * p(y))
So the likelihood function would be derived from p(x|y) and p(y)
Gaussian Discriminant Analysis (GDA)
- Introduction: Multivariate normal distribution
- Mean vector miu (n); Covariance matrix Sigma (n x n) (>= 0, symmetric and positive semi-definite)
p(x; miu, Sigma) = exp(-(x - miu)^T * Sigma^-1 * (x - miu) / 2) / ((2 * pi)^(n / 2) * |Sigma|^0.5)
E[X] = ∫_x (x * p(x; miu, Sigma)) dx = miu
Cov(Z) = E[(Z - E[Z])(Z - E[Z])^T] = E[ZZ^T] - E[Z]E[Z]^T = Sigma
- Standard normal distribution
- In Sigma, the off-diagonal entry Sigma_ij represents how the density is stretched towards the direction xi = xj (as it gets bigger, the density becomes more spread out along the direction xi = xj, and more compressed along the direction xi = -xj; this reading applies when i != j)
- The Gaussian Discriminant Analysis model: Take as an example a binary-classification problem in which the input vectors x are continuous, real-valued vectors
- y ~ Bernoulli(phi)
x | y = 0 ~ N(miu0, Sigma)
x | y = 1 ~ N(miu1, Sigma) - Parameters: phi, miu0, miu1, Sigma (Note that there is only one Sigma shared by the two means)
- By maximize
l(phi, miu0, miu1, Sigma)
, we get:
phi = sum(1{yi = 1}) / m
miu0 = sum(1{yi = 0} * xi) / sum(1{yi = 0})
miu1 = sum(1{yi = 1} * xi) / sum(1{yi = 1})
Sigma = sum((xi - miu_yi)(xi - miu_yi)^T) / m
- Discussion: GDA & logistic regression
- Let
theta = theta(phi, Sigma, miu0, miu1)
then GDA becomes logistic regression
- GDA makes a stronger assumption: not only does p(y|x) follow a logistic function, but also p(x|y) follows a multivariate Gaussian
- When the assumption is correct, GDA is better, even asymptotically efficient
- GDA is more data efficient (requires less training data to learn “well”)
- Logistic regression is more robust (to incorrect modeling assumptions)
- Asymptotically efficient: Means that in the limit of very large training sets (large m), there is no algorithm that is strictly better than GDA
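A small sketch of the maximum-likelihood estimates listed above; the boolean-mask implementation and function name are illustrative.

```python
import numpy as np

def gda_fit(X, y):
    """MLEs for GDA with a shared covariance matrix (y in {0, 1}, X is m x n)."""
    m = len(y)
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    D = X - np.where((y == 1)[:, None], mu1, mu0)   # each row is xi - miu_{yi}
    Sigma = D.T @ D / m
    return phi, mu0, mu1, Sigma
```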
Naive Bayes
- Input vector x are discrete-valued
- Naive Bayes (NB) assumption: x_i’s are conditionally independent given y
Then p(x_1, ..., x_50000 | y) = p(x_1 | y) * p(x_2 | y, x_1) ··· p(x_50000 | y, x_1, ..., x_49999) = pi(p(x_j | y))
- Naive Bayes classifier:
- Parameterize and Maximize likelihood:
phi_(j|y=1) = p(x_j=1|y=1) = sum(1{xi_j=1 ^ yi=1}) / sum(1{yi=1})
phi_(j|y=0) = p(x_j=1|y=0) = sum(1{xi_j=1 ^ yi=0}) / sum(1{yi=0})
phi_y = p(y = 1) = sum(1{yi=1}) / m
- Then
p(y = 1 | x) = p(x | y = 1)p(y = 1) / p(x)
where p(x) = p(x | y = 1) * p(y = 1) + p(x | y = 0) * p(y = 0)
- x can be multinomial;
For continuous x, if GDA etc. do not fit very well, we can also consider discretizing x and using Naive Bayes
- Laplace smoothing:
- Problem of the previous version: For words that never appear in the training set, both phi estimates would be 0, and the output would become 0/0
- Solution: Laplace smoothing
Generally phi_j = (sum(1{zi = j}) + 1) / (m + k)
where k is the number of possible values of j (this makes sure the phi's sum to 1)
Specifically, phi_(j|y=1) = (sum(1{xi_j=1 ^ yi=1}) + 1) / (sum(1{yi=1}) + 2), and similarly for phi_(j|y=0)
- Multinomial event model
x_i represents the identity of the i-th word of the passage;
We assume the distribution p(x_j | y) is the same for every position j
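A compact sketch of the Bernoulli Naive Bayes classifier with the Laplace smoothing above; the log-space prediction is an assumed numerical-stability choice.

```python
import numpy as np

def nb_fit(X, y):
    """Laplace-smoothed parameter estimates; X is an m x n binary matrix, y in {0, 1}."""
    phi_y = np.mean(y == 1)
    phi_j1 = (X[y == 1].sum(axis=0) + 1) / ((y == 1).sum() + 2)
    phi_j0 = (X[y == 0].sum(axis=0) + 1) / ((y == 0).sum() + 2)
    return phi_y, phi_j0, phi_j1

def nb_predict_proba(x, phi_y, phi_j0, phi_j1):
    """p(y = 1 | x) via Bayes rule, computed with log probabilities."""
    log_p1 = np.log(phi_y) + np.sum(x * np.log(phi_j1) + (1 - x) * np.log(1 - phi_j1))
    log_p0 = np.log(1 - phi_y) + np.sum(x * np.log(phi_j0) + (1 - x) * np.log(1 - phi_j0))
    return 1.0 / (1.0 + np.exp(log_p0 - log_p1))
```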
Support Vector Machines
Margins
- Intuition: Margins, Confidence of the prediction, Separating Hyperplane
Important assumption: the data is linearly separable - For the following discussion, we let the two values that y can take be -1 and 1, and denote
h_(w,b)(x) = g(w^T * x + b)
, where g(z) = z<0 ? -1 : 1
- Functional Margin:
gammai = yi * (w^T * xi + b)
and gamma = min(gammai)
- > 0 — correct; Large — confident
- There is a problem: simply multiplying w and b by 2 makes the functional margin bigger without changing anything meaningful
So often we will impose some sort of normalization condition such as ||w||_2 = 1
- Geometric Margin:
gammai = yi * ((w / ||w||)^T * xi + b / ||w||)
and gamma = min(gammai)
The Optimal Margin Classifier
- The natural desideratum: Maximizes the (geometric) margin
So the problem is: max_(gamma,w,b)(gamma / ||w||) s.t. yi * (w^T * xi + b) >= gamma, i = 1, ..., m
Which, to make the calculation easier, is transformed into: min_(w,b)(0.5 * ||w||^2) s.t. yi * (w^T * xi + b) >= 1, i = 1, ..., m
Can be considered as the loss function that the model (Optimal Margin Classifier) use to do the optimization
- Digression: Lagrange Duality
- Consider the problem:
min_w(f(w)) s.t. h_i(w) = 0, i = 1, ..., l
(constrained extrema, solved by the method of Lagrange multipliers):
- Def Lagrangian:
hL(w, beta) = f(w) + sum(beta_i * h_i(w))
where the beta_i's are called Lagrange multipliers - Then solve
pdhL / pdw_i = 0; pdhL / pdbeta_i = 0
to get w and beta
- Consider the problem (primal optimization problem):
min_w(f(w)) s.t. g_i(w) <= 0, i = 1, ..., k; h_i(w) = 0, i = 1, ..., l
- Def generalized Lagrangian:
hL(w, alpha, beta) = f(w) + sum(alpha_i * g_i(w)) + sum(beta_i * h_i(w))
- Consider the quantity:
theta_P(w) = max_(alpha, beta; alpha_i>=0)(hL(w, alpha, beta))
we can derive that theta_P(w) = (w satisfies the primal constraints) ? f(w) : infinity
Then we have min(theta_P(w)) = min(max(hL(w, alpha, beta)))
we call the solution p^* = min(theta_P(w))
the value of the primal problem - Consider another problem, called the Dual optimization problem:
First define theta_D(alpha, beta) = min_w(hL(w, alpha, beta))
Then the problem is max_(alpha_i>=0)(theta_D(alpha, beta)) = max_(alpha_i>=0)(min_w(hL(w, alpha, beta)))
we call the solution d^* = max_(alpha_i>=0)(theta_D(alpha, beta))
the value of the dual problem - Because
maxmin <= minmax
we get d^* <= p^*
- Suppose f and the g_i's are convex, and the h_i's are affine. Suppose further that the constraints g_i are (strictly) feasible
Then we have p^* = d^* = hL(w^*, alpha^*, beta^*)
Moreover, they satisfy the Karush-Kuhn-Tucker (KKT) conditions. And the w^*, alpha^*, beta^* that satisfy the KKT conditions are also a solution
pdhL(w^*, alpha^*, beta^*) / pdw_i = 0, i=1, ..., n
pdhL(w^*, alpha^*, beta^*) / pdbeta_i = 0, i=1, ..., l
alpha_i^* * g_i(w^*) = 0, i = 1, ..., k
(called the KKT dual complementarity condition)g_i(w^*) <= 0, i = 1, ..., k
alpha_i^* >= 0, i = 1, ..., k
- So to match the Lagrange Duality expression, we write the problem as:
min_(w,b)(0.5 * ||w||^2) s.t. g_i(w) = - yi * (w^T * xi + b) + 1 <= 0, i = 1, ..., m
- From the KKT conditions we know that only when
g_i(w) = 0
can alpha_i != 0
That is to say, only when yi * (w^T * xi + b)
attains its minimum (which is 1 because of the constraint)
To understand from a geometric view: for a certain line (w, b chosen), we can translate it towards both directions until it touches some points; these points it touches have non-zero alpha
- Specifically in Optimal Margin Classifier, the points with non-zero
alpha
are called Support Vectors
- To use Lagrange Duality and KKT:
- First we write down
hL(w, b, alpha) = 0.5 * ||w||^2 - sum(alpha_i * (yi * (w^T * xi + b) - 1))
- Then we let the partial derivatives of hL with respect to w and b be 0
(KKT condition 1, and also minimizing hL
over w, b
to get theta_D
), and we get:
w = sum(alpha_i * yi * xi)
sum(alpha_i * yi) = 0
- So plugging back into the original function, we get:
hL(w, b, alpha) = sum(alpha_i) - 0.5 * sum(yi * yj * alpha_i * alpha_j * xi^T * xj) = W(alpha)
- Then we get the dual optimization problem:
max_alpha(W(alpha)) s.t. alpha_i >= 0, i = 1, ..., m; sum(alpha_i * yi) = 0
- why fit KKT 3 ???
- If we can solve it, we then can get
w^*
through w = sum(alpha_i * yi * xi)
and b^*
through b^* = -0.5 * (max_(i:yi=-1)(w^(*T) * xi) + min_(i:yi=1)(w^(*T) * xi))
- When the test
x
input, we need to calc w^T * x + b = sum(alpha_i * yi * xi)^T * x + b
. Considering what we discussed in No.8, what we really need to do is to calc the inner product between the input x
and these Support Vectors. — Support Vector Machine (SVM)
Kernels
- For inputs x (called input attributes below), we may consider performing training using features calculated from x (called input features below). We denote by
phi
the feature mapping (the function mapping attributes to the corresponding features) - Because in the dual optimization problem all we need are inner products between the input attributes x, we may instead want to use it (that is, SVMs) to learn using these features
phi(x)
- Def Kernel:
K(x, z) = phi(x)^T * phi(z)
- Sometimes, calculating
K(x, z)
directly costs much less time than calculating phi(x)
first, because phi(x)
is usually a high-dimensional vector
Example: K(x, z) = (x^T * z + c)^d
- An understanding of Kernels: A measurement of how similar are
phi(x)
andphi(z)
, or of how similar are x and z
- So we may directly write down the
K(x, z)
as what we think might be a reasonable measure of how similar x and z are - Moreover, to test whether the function we write can be a Kernel function — Theorem (Mercer)
K is a valid (Mercer) kernel if and only if its corresponding matrix K (Kij = K(xi, xj)) is symmetric positive semi-definite for any m x’s - Note that, the corresponding
phi
may be an infinite-dimensional vector
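A tiny sketch of the polynomial kernel example above and of the kernel (Gram) matrix whose positive semi-definiteness the Mercer theorem refers to; the helper names are illustrative.

```python
import numpy as np

def poly_kernel(x, z, c=1.0, d=3):
    """K(x, z) = (x^T z + c)^d, computed without ever forming phi(x) explicitly."""
    return (x @ z + c) ** d

def kernel_matrix(X, kernel=poly_kernel):
    """Gram matrix K_ij = K(x_i, x_j); for a valid Mercer kernel its eigenvalues are >= 0."""
    m = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
```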
Regularization and the non-separable case
- The assumption that the data is linearly separable may not always hold true
Moreover, exactly separating the data may not be what we want to do - In order to make the algorithm work for non-linearly separable datasets and be less sensitive to outliers: reformulate the optimization (using l_1 regularization)
min_(xxi,w,b)(0.5 * ||w||^2 + C * sum(xxi_i)) s.t. yi * (w^T * xi + b) >= 1 - xxi_i, i = 1, ..., m; xxi_i >= 0, i = 1, ..., m
Understanding: allow some points to have margin less than 1
by xxi_i
, but we pay the cost C * xxi_i for doing so
Then the dual optimization problem is max_alpha(W(alpha)) s.t. 0 <= alpha_i <= C, i = 1, ..., m; sum(alpha_i * yi) = 0
with the same W(alpha)
Moreover, the KKT conditions become:
alpha_i = 0 => yi * (w^T * xi + b) >= 1
alpha_i = C => yi * (w^T * xi + b) <= 1
0 < alpha_i < C => yi * (w^T * xi + b) = 1
The SMO Algorithm: To solve dual problem
SMO: Sequential Minimal Optimization
- Previously introduced methods to solve a max/min problem: Gradient ascent/descent, Newton’s method
- Digression: Coordinate Ascent Algorithm to solve max problem (or min)
Loop until convergence: {
  For i = 1, ..., m: {
    alpha_i := argmax_(alpha_i) W(alpha_1, ..., alpha_i, ..., alpha_m)
  }
}
- SMO
- Rewrite the problem:
max_alpha(W(alpha)) s.t. 0 <= alpha_i <= C, i = 1, ..., m; sum(alpha_i * yi) = 0
- Because of constraint 2, any single
alpha
is determined by the others - So we need to update at least two of them simultaneously
```
Repeat until convergence: {
  1. Select some pair alpha_i and alpha_j to update next (using a heuristic that lets us make the biggest progress towards the global maximum).
  2. Reoptimize W(alpha) with respect to alpha_i and alpha_j, while holding all the other alpha_k's (k != i, j) fixed.
}
```
- To test for convergence of this algorithm, we can check whether the KKT conditions are satisfied to within some tolerance (typically set to 0.01 to 0.0001)
- Works well because each reoptimization step is highly efficient
- To explore: heuristics for picking the pair, and how to update b
Learning Theory
Bias/Variance Tradeoff
- Generalization error: its expected error on examples not necessarily in the training set
- Bias
E[f_true(x) - f(x)]
: the expected generalization error even if we were to fit it to a very (say, infinitely) large training set — Underfit if large
Variance Var[f(x)]
: — Overfit if large - The total Mean Square Error (MSE):
MSE = E[(y - f(x))^2] = sigma^2 + (Bias[f(x)])^2 + Var[f(x)]
for y = f_true(x) + e
where E[e] = 0; Var(e) = sigma^2
Learning Theory (Take binary classification as an example)
- Two Lemmas:
- The union bound lemma;
- The Hoeffding inequality (Chernoff bound) lemma
- PAC assumption 1: Training examples are drawn iid (independent and identically distributed) from some prob distribution
hD
- Training error (also called Empirical risk or Empirical error):
epsilon'(h) = (1 / m) * sum(1{h(xi) != yi})
Generalization error:epsilon(h) = P_((x,y)~hD)(h(x) != y)
(PAC assumption 2) - Empirical risk minimization (ERM): a way to fit
theta
: theta' = argmin_theta(epsilon'(h_theta))
Def the Hypothesis class hH
as the set of all classifiers with some constraint, such as the set of all linear classifiers hH
. And then the argmin domain can be defined by hH
- The case of finite
hH
epsilon'(h)
is a reliable estimate of epsilon(h)
for all h in hH
This implies an upper bound on the generalization error of h'
- Theorem. (If uniform convergence occurs) Let
|hH| = k
, and let any m, delta
be fixed, then with prob at least1 - delta
, we have that
epsilon(h') <= min_(h in hH)(epsilon(h)) + 2 * sqr_root(log(2 * k / delta) / (2 * m))
- The two parts of this bound correspond to the tradeoff between bias and variance
- Sample complexity Corollary: Let
|hH| = k
, and let delta, gamma
be fixed. Then for epsilon(h') <= min_(h in hH)(epsilon(h)) + 2 * gamma
to hold with prob at least 1 - delta
, it suffices that
m >= log(2 * k / delta) / (2 * gamma^2) = O(...)
- The case of infinite
hH
- we say that
hH
shatters S
if hH
can realize any labeling on S
- Def Vapnik-Chervonenkis dimension of
hH
, written VC(hH)
, to be the size of the largest set that is shattered by hH. (If hH
can shatter arbitrarily large sets, thenVC(hH) = infinity
- Vapnik Theorem (some believe it's the most important theorem): Let
hH
be given, and letd = VC(hH)
. Then with prob at least1 - delta
, we have that for all h in hH
:
|epsilon(h) - epsilon'(h)| <= O(sqr_root(d * log(m / d) / m + log(1 / delta) / m))
also
epsilon(h') <= epsilon(h^*) + O(sqr_root(d * log(m / d) / m + log(1 / delta) / m))
- Corollary: For
|epsilon(h) − epsilon'(h)| ≤ gamma
to hold for allh ∈ H
with probability at least1 − delta
, it suffices thatm = O_(gamma, delta)(d)
- Also,
d
is roughly linear in the number of parameters (for many hypothesis classes)
Error Analysis
- How to determine which part of the pipeline has the biggest problem
- Plug in the ground-truth (that is manually do the work or by some ways to make sure that the work is done perfectly) for each component and see the accuracy increase
Ablative Analysis
- How much did each of the improvements really help?
- start from the current level of performance, and slowly take away all of these features to see how it affects performance
Mistakes Analysis
- Analyze what the mistakes composed of, and make improvement according to it
Regularization and Model Selection
Cross Validation — The method for Model Selection
- Hold-out (also called Simple) Cross Validation:
- Randomly split S into S_train and S_CV (called the hold-out cross validation set)
- Train each model M_i on S_train only, to get some hypothesis h_i
- Select and output the hypothesis h_i that had the smallest error
epsilon'_(S_CV)(h_i)
on the hold out cross validation set - Optionally, step 3 in the algorithm may also be replaced with selecting the model M_i and then retraining M_i on the entire training set S
- k-Fold Cross Validation
- Randomly split S into k disjoint subsets of
m / k
training examples each - Use one as validation set and others as training set each loop, and calc the mean
epsilon'_(S_j)(h_i)
for M_i - Pick the M_i corresponding to the smallest mean
- Usually
k = 10
;
Extreme situation: k = m
— Leave-One-Out Cross Validation
- These also can be used more simply to evaluate a single model or algorithm
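A small sketch of k-fold cross validation for evaluating one model; train_fn and error_fn are placeholder callables standing in for whatever learning algorithm and error measure are being compared.

```python
import numpy as np

def k_fold_cv_error(X, y, train_fn, error_fn, k=10, seed=0):
    """Mean validation error of a model over k folds."""
    idx = np.random.RandomState(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for j in range(k):
        val = folds[j]
        train = np.concatenate([folds[i] for i in range(k) if i != j])
        h = train_fn(X[train], y[train])          # fit on k-1 folds
        errors.append(error_fn(h, X[val], y[val]))  # evaluate on the held-out fold
    return float(np.mean(errors))
```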
Feature Selection — A special case of Model Selection
- For the case where the number of features is so large that it will easily cause overfitting
- The total number of possible models (feature subsets) is
2^n
, which is too large — we need some kind of heuristic search procedure
- Forward Search (a little bit like greedy algorithm)
- Initialize
hF = ∅
- Repeat till some pre-set threshold{
- For i = 1, …, n if
i does not belong to hF
, let hF_i = hF U {i}
, and use some version of cross validation to evaluate the features hF_i
- Set
hF
to be the best feature subset found on the above step
}
- Select and output the best feature subset that was evaluated during the entire search procedure
- Wrapper Model Feature Selection
- Backward Search
- Filter Feature Selection: much cheaper
The idea here is to compute some simple score S(i) that measures how informative each feature x_i is about the class labels y. Then, we simply pick the k features with the largest scores S(i) - One possible choice of the score would be to define S(i) as the correlation between x_i and y, as measured on the training data
- Mutual Information
MI(x_i, y)
:
MI(x_i, y) = sum_(x_i)(sum_y(p(x_i, y) * log(p(x_i, y) / (p(x_i) * p(y)))))
p
’s can all be estimated according to their empirical distributions on the training set - Expressed as a Kullback-Leibler (KL) divergence:
MI(x_i, y) = KL(p(x_i, y) || p(x_i) * p(y))
Bayesian Statistics and Regularization
- Frequentist: theta is not random, it just happens to be unknown;
Bayesian: theta is random and unknown - Prior Distribution:
p(theta)
Posterior distribution: p(theta | S) = p(S | theta) * p(theta) / p(S) = pi(p(yi | xi, theta)) * p(theta) / ∫_theta(pi(p(yi | xi, theta)) * p(theta))dtheta
Predictive distribution: p(y | x, S)
p(y | x, S) = ∫_theta(p(y | x, theta) * p(theta | S))dtheta
- MAP (maximum a posteriori) estimate for theta:
theta_MAP = argmax(pi(p(yi | xi, theta) * p(theta)))
- Similar to the ML (maximum likelihood) estimate for theta in the Frequentist view:
theta_ML = argmax(pi(p(yi | xi, theta)))
- In practice, a common choice for
p(theta)
is to assume that theta ~ N(0, tao^2 * I)
Using this, MAP gives a theta
with smaller norm than ML and is less susceptible to overfitting than ML
Decision Tree & Ensembling Methods
Decision Tree: a non-linear, region-based algorithm
- Def of Non-linearity
A simple example of it is that Latitude & Month => Can Ski or not
— selecting the exact regions is intractable - Decision Tree: An approximate solution via greedy, top-down, recursive partitioning
- Top-down: We start with the original input space
X
and split it into two child regions by thresholding on a single feature - Recursive: We then take one of these child regions and can partition via a new threshold, and continue to train our model in a recursive manner
(always selecting a leaf node, a feature, and a threshold to form a new split) - We can continue in such a manner until we meet a given stopping criterion, and then predict the majority class at each leaf node
- The Loss Function of Decision Tree
- When the parent region
R_p
gets split into two child regions R_1, R_2
, then the decrease in loss is:
L(R_p) - (|R_1| * L(R_1) + |R_2| * L(R_2)) / (|R_1| + |R_2|)
(cardinality-weighted sum)
- A straight-forward
L
that we can think about: Misclassification Loss L_misclass
L_misclass(R) = 1 - max_c(p'_c)
where p'_c
is the proportion of class c in the region; 1 - max_c(p'_c) is the probability of misclassification if we label every element in the region with the majority class c
- But it is not sensitive enough; sometimes not only are the losses of the two splits identical, but neither of the splits decreases the loss over that of the parent
- A more sensitive loss: Cross-Entropy
L_cross
L_cross(R) = - sum_c(p'_c * log(p'_c))
- From an information-theoretic perspective, cross-entropy measures the number of bits needed to specify the outcome (or class) given that the distribution is known
- Furthermore, the reduction in loss from parent to child is known as information gain (a small sketch of these losses appears at the end of this Decision Tree section)
- The regression setting for Decision Trees
- To predict, instead of using the majority value of
y
’s in the regionR
, we use the mean: y' = sum_(i in R)(y_i) / |R|
- Then we can directly use the squared loss
L_squared
:
L_squared(R) = sum_(i in R)(y_i - y')^2 / |R|
- A unique advantage: Can deal with categorical variables
- Regularization or Stopping Criteria
- Minimum Leaf Size
- Maximum Depth
- Maximum Number of Nodes
- Minimum decrease in loss after split is not good, because it means missing higher order interactions (like needing thresholding on multiple features to achieve a good split)
- Fully growing out the tree, and then pruning away nodes that minimally decrease misclassification or squared error, as measured on a validation set
- Runtime
- Disadvantages:
- High variance (use ensembling, introduced next, to address it)
- Splits are axis-aligned (feature-based), so the tree cannot handle rotated data; it can only use horizontal and vertical splits to approximate a diagonal boundary
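A minimal sketch of the misclassification and cross-entropy losses over a region and the cardinality-weighted decrease in loss (information gain) described above; the function names are illustrative.

```python
import numpy as np

def misclass_loss(y_region):
    """L_misclass(R) = 1 - max_c p'_c."""
    _, counts = np.unique(y_region, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

def cross_entropy_loss(y_region):
    """L_cross(R) = -sum_c p'_c * log(p'_c)."""
    _, counts = np.unique(y_region, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def split_gain(y_parent, y_left, y_right, loss=cross_entropy_loss):
    """L(R_p) minus the cardinality-weighted loss of the two child regions."""
    n_l, n_r = len(y_left), len(y_right)
    weighted_child = (n_l * loss(y_left) + n_r * loss(y_right)) / (n_l + n_r)
    return loss(y_parent) - weighted_child
```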
Ensembling Methods
- By bias/variance analysis
- Ensembling will lead to a decrease in variance of the error
Var(X_mean) = rou * sigma^2 + (1 - rou) * sigma^2 / n
where rou
is a measure of the correlation between the models - Both increasing the number of models used in the ensemble and decreasing the correlation between models will cause an overall decrease in the variance of the error of the ensemble
- Ways to generate de-correlated models
- Using different algorithms
- Using different training sets
- Bagging
- Boosting
- Bagging: ‘Bootstrap Aggregation’
- Bootstrap:
True population P, A training set S sampled from P (S ~ P
)
we can generate many new bootstrap sets Z_i (i = 1, …, n) sampled with replacement from S (Z ∼ S, |Z| = |S|
) ???
We can then look at the variability of our estimate across these bootstrap sets to obtain a measure of error - Aggregation:
we can take each Z_i and train a machine learning model G_i on each, and define a new aggregate predictor:G(X) = sum_i(G_i(x) / n)
- Decreasing variance by decreasing
rou
compared to having all models simply trained on S - Note that
rou
is independent of n
; by increasing n
, the variance only decreases
- Out-of-bag estimation:
- each bootstrapped sample only contains approximately 2/3 of S, and thus we can use the other 1/3 as an estimate of error, called out-of-bag error
- In the limit, as
n -> infinity
, out-of-bag error gives an equivalent result to leave-one-out cross-validation
- Applied to Decision Trees:
- Works well
- Support for missing values
- Loses interpretability
- Variable importance measures
- Random forests: only allow a subset of features to be used at each split, to prevent a very strong predictor (feature) causing large
rou
- Boosting: used for bias reduction
- Get high bias, low variance models (weak learners) by only allowing Decision Trees to split once (Decision Stumps):
The key idea is that after one round of training, we track which examples the classifier got wrong and increase their relative weight compared to the correctly classified examples, and then do the next round of training
- Adaboost
- Pseudocode (N sample data)
w_i := 1 / N, for i = 1, 2, ..., N
for m = 0 to M do
    Fit weak classifier G_m to training data weighted by w_i
    Compute weighted error err_m = sum_i(w_i * 1{y_i != G_m(x_i)}) / sum(w_i)
    Compute weight alpha_m = log((1-err_m) / err_m)
    w_i := w_i * exp(alpha_m * 1{y_i != G_m(x_i)})
end
f(x) = sign(sum_m(alpha_m * G_m(x)))
(a runnable numpy sketch of this loop appears at the end of this section)
- The G’s are not independent, meaning that increasing M leads to an increase in the risk of overfitting
- Forward Stagewise Additive Modeling
- Pseudocode
Initialize f_0(x) = 0
for m = 0 to M do
    Compute (beta_m, gamma_m) = argmin_(beta, gamma)(sum_i(L(y_i, f_(m-1)(x_i) + beta * G(x_i; gamma))))
    Set f_m(x) = f_(m-1)(x) + beta_m * G(x; gamma_m)
end
f(x) = f_M(x)
- Intuition: Every step tries to find the weak learner that best fits what remains to be explained (the residual) ???
- Adaboost is a special case of it
- Xgboost
- Gradient Boosting: Use gradient descent to solve the min problem. But we are restricted to taking steps in our model class
- We instead compute the gradient at each training point with respect to the current predictor
g_i = pdL(y, f(x_i)) / pdf(x_i)
- Then train a new regression predictor to match this gradient and use it as the gradient step
In Forward Stagewise Additive Modeling, gamma_i = argmin_gamma(sum_i((g_i - G(x_i; gamma))^2))
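A compact NumPy sketch of the AdaBoost pseudocode above, using decision stumps as the weak learners; the exhaustive stump search, the tie-breaking epsilon, and y in {-1, +1} are illustrative assumptions.

```python
import numpy as np

def adaboost_stumps(X, y, M=50):
    """AdaBoost with decision stumps; returns a predict(Xnew) function."""
    def fit_stump(X, y, w):
        best = None
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for s in (1, -1):
                    pred = s * np.sign(X[:, j] - t + 1e-12)
                    err = np.sum(w * (pred != y)) / np.sum(w)
                    if best is None or err < best[0]:
                        best = (err, j, t, s)
        return best

    m = len(y)
    w = np.full(m, 1.0 / m)                       # uniform initial weights
    stumps = []
    for _ in range(M):
        err, j, t, s = fit_stump(X, y, w)
        alpha = np.log((1 - err) / (err + 1e-12))
        pred = s * np.sign(X[:, j] - t + 1e-12)
        w *= np.exp(alpha * (pred != y))          # up-weight misclassified examples
        stumps.append((alpha, j, t, s))

    def predict(Xnew):
        score = sum(a * s * np.sign(Xnew[:, j] - t + 1e-12) for a, j, t, s in stumps)
        return np.sign(score)
    return predict
```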
Deep Learning
Neural Networks
- Introduce to building of Neural Networks:
- Introduce to ReLU
- Introduce to feature generating
- Introduce to the stacking
- End-to-End Learning
- Notations:
- Inputs: xi — i-th input; x_i = a0_i;
- Hidden layer outputs: a[l]_j — j-th unit in layer l
- foo[i] meanings anything associated with layer i
- w — weight vectors; W — weight matrix; W_i — the i-th row of W
- A neuron consists of a linear transformation and a non-linear activation
z[l]_i = W[l]_i^T * a[l-1] + b[l]_i
(where a0 = x)
a[l]_i = g(z[l]_i)
Vectorization
- Calc with vectorization
z[l] = W[l] * a[l-1] + b[l]
and a[l] = g(z[l])
- The necessity of using non-linearity
- Moreover
Z[l] = W[l] * X + (broadcast)b[l]
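A short sketch of this vectorized forward pass; the (W, b) list layout, the ReLU choice, and keeping the last layer linear are illustrative assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def forward(X, params):
    """Vectorized forward pass: Z[l] = W[l] A[l-1] + b[l], A[l] = g(Z[l]).

    X is n x m (one example per column); params is a list of (W, b) pairs.
    """
    A = X
    for i, (W, b) in enumerate(params):
        Z = W @ A + b                                    # b broadcasts across the m columns
        A = Z if i == len(params) - 1 else relu(Z)       # last layer left linear here
    return A
```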
Backpropagation
- Parameter Initialization
- Problem: symmetry (if all weights are initialized identically, every unit in a layer computes the same function)
- Solution:
- Random Initialization
- Xavier/He Initialization
Denote n[l]
as the number of neurons in layer l
, then w[l] ~ hN(0, sqr_root(2 / (n[l] + n[l + 1])))
- Optimization
- Backpropagation
- Mini-batch Gradient Descent
Momentum
- Regularization
- L2 Regularization
J_L2 = J + lambda * ||W||^2 / 2
- Parameter Sharing: Convolutional Neural Networks
Unsupervised Learning Problem
k-Means clustering algorithm
- We are given a training set and want to group the data into a few cohesive “clusters”, which means no labels yi are given
- The k-mean clustering algorithm
```
1. Initialize cluster centroids miu_j, j = 1, ..., k randomly
2. Repeat until convergence: {
     For every i, set: ci := argmin_j(||xi - miu_j||^2)
     For every j, set: miu_j := sum_i(1{ci=j} * xi) / sum_i(1{ci=j})
   }
```
- Distortion Function
J(c,miu) = sum_i(||xi - miu_ci||^2)
- What we are doing is repeatedly minimizing J with respect to
miu
and c
while holding the other fixed - But
J
is a non-convex function, so we may get stuck in a local minimum - We can run it many times with different initializations and use the one that gives the lowest distortion
Mixtures of Gaussians and the EM algorithm
- Mixture of Gaussians:
What we want is to specify a joint distribution p(xi, zi) = p(xi | zi) * p(zi)
Where the xi
’s are the input training set, zi ~ Multinomial(phi)
is a latent variable and xi | zi = j ~ hN(miu_j, Sigma_j)
; let k
denote the number of values that the zi
’s can take on - Likelihood:
l(phi, miu, Sigma) = sum_i(log(p(xi; phi, miu, Sigma))) = sum_i(log(sum_zi(p(xi | zi; miu, Sigma) * p(zi; phi))))
The zi
’s are unknown, so it’s difficult to solve. But if the zi
’s were known, then this would become a classification-like problem, easy to solve:
phi_j = sum_i(1{zi = j}) / m
(m is the number of inputs)
miu_j = sum_i(1{zi = j} * xi) / sum_i(1{zi = j})
Sigma_j = sum_i(1{zi = j} * (xi - miu_j) * (xi - miu_j)^T) / sum_i(1{zi = j})
- The EM Algorithm:
- E-step: try to guess the values of the
zi
’s
M-step: Update the parameters of our model based on our guesses
Repeat until convergence: {
  (E) For each i, j, set wi_j := p(zi = j | xi; phi, miu, Sigma)
  (M) Update the parameters:
      phi_j := sum_i(wi_j) / m
      miu_j := sum_i(wi_j * xi) / sum_i(wi_j)
      Sigma_j := sum_i(wi_j * (xi - miu_j) * (xi - miu_j)^T) / sum_i(wi_j)
}
Where
p(zi = j | xi; phi, miu, Sigma) = p(xi | zi = j; miu, Sigma) * p(zi = j; phi) / sum_l(p(xi | zi = l; miu, Sigma) * p(zi = l; phi))
- Compared with k-means,
- k-means use the “hard” cluster assignments
ci
- EM use the “soft” assignments
wi_j
- Similar to K-means, it is also susceptible to local optima, so reinitializing at several different initial parameters may be a good idea
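A sketch of one EM iteration for the mixture of Gaussians above; it assumes SciPy is available for the Gaussian density, and the list-of-covariances layout is illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm_step(X, phi, mu, Sigma):
    """One EM iteration (E-step responsibilities, then M-step updates); X is m x n."""
    k = len(phi)
    # E-step: w[i, j] = p(zi = j | xi; phi, mu, Sigma)
    w = np.column_stack([phi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j]) for j in range(k)])
    w /= w.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the soft assignments
    phi = w.mean(axis=0)
    mu = (w.T @ X) / w.sum(axis=0)[:, None]
    Sigma = [(w[:, j, None] * (X - mu[j])).T @ (X - mu[j]) / w[:, j].sum() for j in range(k)]
    return phi, mu, Sigma
```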
The EM Algorithm: A More General View
Jensen’s Inequality
- If
f''(x) > 0
for all x (in the vector-valued case, the corresponding statement is thatH
must be positive definite, writtenH > 0
), then we say f is strictly convex
- Jensen’s Inequality Theorem:
letf
be a convex function, and let X be a random variable. Then:
E[f(X)] >= f(E[X])
Moreover, iff
is strictly convex, then=
holds true if and only ifX = E[X]
with probability 1 (i.e., if X is a constant)
- Concave version of Jensen’s Inequality
The EM Algorithm
- Given a training set
{x1, ..., xm}
, we wish to fit the parameters of a model p(x, z)
to the data, where the likelihood is given by:
l(theta) = sum_i(log(p(xi; theta))) = sum_i(log(sum_zi(p(xi, zi; theta))))
Where the zi
’s are the latent random variables
- The EM algorithm:
- Idea: As explicitly maximizing
l(theta)
might be difficult, we instead repeatedly construct a lower-bound onl
(E-step), and then optimize that lower-bound (M-step) - For each i, let
Q_i
be some distribution over thez
’s (sum_z(Q_i(z)) = 1, Q_i(z) >= 0
). Then
sum_i(log(p(xi; theta))) =
sum_i(log(sum_zi(p(xi, zi; theta)))) =
sum_i(log(sum_zi(Q_i(zi) * p(xi, zi; theta) / Q_i(zi)))) >=
sum_i(sum_zi(Q_i(zi) * log(p(xi, zi; theta) / Q_i(zi))))
Which gives a lower bound for any distribution Q_i
- For any fixed theta, we’ll try to make the bound tight, which is to say: let the equality hold:
p(xi, zi; theta) / Q_i(zi) = c
Also sum_z(Q_i(z)) = 1
We have Q_i(zi) = p(xi, zi; theta) / sum_z(p(xi, z; theta)) = p(xi, zi; theta) / p(xi; theta) = p(zi | xi; theta)
- Above gives us the EM algorithm
Repeat until convergence: {
  (E) For each i, set Q_i(zi) := p(zi | xi; theta)
  (M) Set theta := argmax_theta(sum_i(sum_zi(Q_i(zi) * log(p(xi, zi; theta) / Q_i(zi)))))
}
- One can prove that the algorithm converges, and one reasonable convergence test is to check whether the increase in
l(θ)
between successive iterations is smaller than some tolerance parameter, and to declare convergence if EM is improving l(θ)
too slowly
Specified with the Example Before: Mixture of Gaussians
Factor Analysis
- Unless m exceeds n by some reasonable amount, the maximum likelihood estimates of the mean and covariance may be quite poor
Restrictions of Sigma
- If we do not have sufficient data to fit a full covariance matrix, we may place some restrictions on the space of matrices Σ that we will consider
- Fit a covariance matrix Σ that is diagonal: (
m
is the number of x)
Sigma_jj = sum_i((xi_j - miu_j)^2) / m
A diagonal Sigma corresponds to a Gaussian where the major axes of the density's ellipses are axis-aligned - Sometimes, we may place a further restriction on the covariance matrix: not only must it be diagonal, but its diagonal entries must all be equal (
m
is the number of x, n
is the number of dimensions)
Sigma = sigma^2 * I; sigma^2 = sum_j(sum_i((xi_j - miu_j)^2)) / (m * n)
This model corresponds to using Gaussians whose densities have contours that are hyperspheres
- Restricting Σ to be diagonal also means modeling the different coordinates x_i, x_j of the data as being uncorrelated and independent
Therefore, it will fail to capture any correlations in the data
Marginals and Conditionals of Gaussians
- Partitioned (block) multivariate Gaussian distribution:
x=(x_1, ..., x_n)^T
,miu=(miu_1,..., miu_n)^T
Sigma = (Sigma_ij)_n*n
, whereSigma_ij = E[(x_i - miu_i) * (x_j - miu_j)^T]
- Marginal distributions of Gaussians are themselves Gaussian:
x_i ~ hN(miu_i, Sigma_ii)
- The conditional distribution of x_i given x_j:
x_i | x_j ~hN(miu_i|j, Sigma_i|j)
, wheremiu_i|j = miu_i + Sigma_ij * (Sigma_jj)^-1 * (x_j - miu_j)
;Sigma_i|j = Sigma_ii - Sigma_ij * (Sigma_jj)^-1 * (Sigma_ji)
The Factor Analysis Model
- We posit a joint distribution on (x, z) as follows, where z (k-D) is a latent random variable: (x is n-D, and usually k < n)
z ~ hN(0, I)
;x|z ~ hN(miu + Lamda * z, Psi)
(Psi
is a diagonal matrix) - Or equivalently:
z ~ hN(0, I)
;epsilon ~ hN(0, Psi)
;x = miu + Lamda * z + epsilon
- Let’s work out exactly what distribution our model defines:
- First, there’s a joint distribution:
(z, x)^T ~ hN(miu_zx, Sigma)
- Find
miu_zx
:
We have: E[z] = E[epsilon] = 0
Then E[x] = miu + Lamda * E[z] + E[epsilon] = miu
So miu_zx = (0 (k-D), miu)^T
- Find
Sigma
:
Sigma_zz = I
Sigma_zx = E[(z - E[z]) * (x - E[x])^T] = E[z * (miu + Lamda * z + epsilon - miu)^T] = E[z * z^T] * Lamda^T + E[z * epsilon^T] = Lamda^T
Sigma_xz = Lamda
Sigma_xx = Lamda * Lamda^T + Psi
- Maximize likelihood:
x ~ hN(miu, Lamda * Lamda^T + Psi)
EM for Factor Analysis
- Using the former equations, calc
miu_zi|xi; Sigma_zi|xi
Then generate the expression of Q_i(zi)
- Then we can get how the optimization is done in M-step
- Note that
E[z * z^T]
andE[z] * E[z^T]
are different
Principal Components Analysis (PCA)
- For a dataset xi, i = 1, ..., m, where xi is n-dimensional (n << m): unknown to us, the dimensions of x may have a lot of redundancy, i.e., some of the features can be approximately derived from the others.
This can be seen as the input approximately lying on some hyperplane, direction, or diagonal axis in the high-dimensional space
- First we do the centering (zero-meaning) and normalizing of the input
xi
’s
Centering can be omitted if the input is already zero mean; normalizing can be omitted if the different attributes are all on the same scale - Then, to find the ‘axis’, one way is to find the unit vector
u
(|u| = 1
) so that when the data is projected onto the direction corresponding tou
, the variance of the projected data is maximized
u = argmax(sum_i((xi^T * u)^2) / m) = argmax(u^T * (sum_i(xi * xi^T) / m) * u)
- Since we zero-meaned the input
x
, we have that Sigma = sum_i(xi * xi^T) / m
is just the covariance matrix of the data - We easily recognize that what we want to do here is to find the principal eigenvector of
Sigma
- More generally, if we wish to project our data into a k-dimensional subspace (
k < n
), we should chooseu_1, ..., u_k
to be the topk
eigenvectors ofSigma
And theu_i
’s now form a new, orthogonal basis for the data - Then to represent
xi
in this basis, we need only compute the corresponding vectoryi
:
yi = (u_1^T * xi, ..., u_k^T * xi)^T
- PCA is also called a Dimensionality Reduction algorithm.
The vectorsu_i
’s are called the firstk
principal components of the data
- Application:
- Compression
- Plot and view
- Reduce demension before training
- Noise reduction algorithm
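A small sketch of PCA via the eigendecomposition of the empirical covariance, as derived above; normalization of the attributes is omitted here and the function name is illustrative.

```python
import numpy as np

def pca(X, k):
    """Project X onto its top-k principal components; returns (projected data, basis U)."""
    X = X - X.mean(axis=0)                          # centering (zero-meaning)
    Sigma = X.T @ X / len(X)                        # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)        # eigh: Sigma is symmetric
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # columns = top-k eigenvectors u_1..u_k
    return X @ U, U                                 # yi = (u_1^T xi, ..., u_k^T xi)^T
```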
Independent Components Analysis (ICA)
- Some data s (n dimension) that is generated via n independent sources
What we observe is x = A * s
where A is an unknown square matrix called the mixing matrix
The dataset {xi
;i = 1,...,m
}, and our goal is to recover the sources si
that generated our data (xi = A * si
)
Let W = A^-1
be the unmixing matrix, our goal is to find W
(and let w_i^T
denote the i-th row ofW
)
ICA Ambiguities
- There are some inherent ambiguities in
A
that are impossible to recover, given only thexi
’s
- Permutation
- Scaling
- For a Gaussian distribution
s
, because of the circular shape of the Gaussian density, there will be an arbitrary rotational component in the mixing matrix that cannot be determined from the data
- As long as the data is not Gaussian, it is possible, given enough data, to recover the n independent sources
Densities and linear transformations
- Let
p_s(s)
be the density ofs
, andp_x(x)
be the density ofx
, and other notation is inherited from above. Then we have:
p_x(x) = p_s(W * x) * |W|
Where|W|
is the determinant of the matrixW
ICA Algorithm
- The derivation of ICA:
- We suppose that the distribution of each source
s_i
is given by a densityp_s
, and that the joint distribution of the sourcess
is given by
p(s) = pi_i(p_s(s_i))
Note that by modeling the joint distribution as a product of the marginal, we capture the assumption that the sources are independent - Then we get:
p(x) = pi_i(p_s(w_i^T * x)) * |W|
And all we need to do is to specify a density for the individual sources p_s - To do so, all we need to do is to specify some cumulative distribution function (cdf) for it
As we cannot choose the cdf of Gaussian distribution, we choose sigmoid function
Then p_s(s) = g'(s)
- The likelihood:
hl(W) = sum_i(sum_j(log(g'(w_j^T * xi))) + log(|W|))
- Knowing that
Delta_W(|W|) = |W| * (W^-1)^T
We can derive thatW := W + alpha * ([1 - 2 * g(w_i^T * xi)]_n*1 * xi^T + (W^T)^-1)
Wherealpha
is the learning rate
Reinforcement Learning and Control
Markov Decision Process (MDP)
- A MDP is a tuple
(S, A, {P_sa}, gamma, R)
where
S
is a set of statesA
is a set of actionsP_sa
are the state transition probabilities, gives the distribution over what states we will transition to if we take actiona
in states
gamma
belongs to[0,1)
is called discount factorR: S (x A) >-> R
is the reward function
- The dynamics of an MDP proceeds as follows:
We start in some states_0
, and get to choose some actiona_0 ∈ A
to take in the MDP. As a result of our choice, the state of the MDP randomly transitions to some successor states_1
, drawn according tos_1 ∼ P_s0a0
. Then, we get to pick another actiona_1
…… - Upon visiting the sequence of states
s_0, s_1, ...
with actionsa_0, a_1, ...
, our total payoff is given by
R(s_0, a_0) + gamma * R(s_1, a_1) + gamma^2 * R(s_2, a_2) + ...
Or R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ...
(take this as the example below for simplicity) - Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff
E[R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ...]
- Def: Policy
A policy is any functionpi: S >-> A
mapping from the states to the actions - Def: Value function
The value function for a policypi
is
Vpi(s) = E[R(s_0) + gamma * R(s_1) + gamma^2 * R(s_2) + ... | s_0 = s, pi]
- Def: Optimal Value function
V^*(s) = max_pi(Vpi(s))
- Bellman equations: (The value function satisfies)
Vpi(s) = R(s) + gamma * sum_s'(P_(spi(s))(s') * Vpi(s')) = R(s) + gamma * E_(s'~P_spi(s))[Vpi(s')]
- Note: Bellman’s equations can be used to efficiently solve for
Vpi
. Specifically, in a finite-state MDP, we can write down one such equation forVpi(s)
for every states
. This gives us a set of|S|
linear equations in|S|
variables which can be efficiently solved - Bellman equation for the optimal value function:
V^*(s) = R(s) + max_a(gamma * sum_s'(P_sa(s') * V^*(s')))
- Def:
pi^*
(Equation (3))
pi^*(s) = argmax_a(sum_s'(P_sa(s') * V^*(s')))
- Then we get for every state
s
and every policypi
:
V^*(s) = Vpi^*(s) >= Vpi(s)
Note that pi^*
has the interesting property that it is the optimal policy for all states s
Value iteration and policy iteration
For now, we will consider only MDPs with finite state and action spaces
- The Value Iteration
- the process
```
1. For each state s, initialize V(s) := 0.
2. Repeat until convergence: {
     For every state, update V(s) := R(s) + max_a(gamma * sum_s'(P_sa(s') * V(s')))
   }
```
- Synchronous and Asynchronous updates
- Having found
V^*
, we can then use Equation (3) to find the optimal policy
- The Policy Iteration
- The process
```
1. Initialize pi randomly.
2. Repeat until convergence: {
     (a) Let V := Vpi
     (b) For each state s, let pi(s) := argmax_a(sum_s'(P_sa(s') * V(s')))
   }
```
- Note that step (a) can be done via solving Bellman’s equations as described earlier
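A minimal sketch of value iteration (with the greedy policy from Equation (3) recovered at the end) for a finite MDP; the tabular data layout (one transition matrix per action, a reward vector over states) is an illustrative assumption.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Value iteration; P[a] is the |S| x |S| transition matrix for action a, R the reward vector."""
    V = np.zeros(len(R))
    while True:
        Q = np.array([R + gamma * P[a] @ V for a in range(len(P))])  # one row per action
        V_new = Q.max(axis=0)                                        # synchronous update
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi = Q.argmax(axis=0)      # greedy policy, as in Equation (3)
    return V, pi
```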
Learning a model for an MDP
- In many realistic problems, we are not given state transition probabilities and rewards explicitly, but must instead estimate them from data
- Given a number of trials (episodes) in the MDP, we can then easily derive the maximum likelihood estimates for the state transition probabilities
P_sa(s') = (No. of times we took action a in state s and got to s') / (No. of times we took action a in state s)
Or, if the ratio above is 0 / 0
, we might simply estimate P_sa(s')
to be 1 / |S|
- if
R
is unknown, we can also pick our estimate of the expected immediate rewardR(s)
in states
to be the average reward observed in states
Continuous state MDPs
We now discuss algorithms for MDPs that may have an infinite number of states
- Discretization
Downsides:
- it assumes that the value function takes a constant value over each of the discretization intervals
- curse of dimensionality
- Value function approximation: Using a model or simulator
- a simulator is a black-box that takes as input any (continuous-valued) state
s_t
and actiona_t
, and outputs a next-states_(t+1)
sampled according to the state transition probabilitiesP_stat
- Several ways to get such model:
- Physics simulation
- Learn a model from data collected in the MDP
(deterministic or stochastic model)
- a simulator is a black-box that takes as input any (continuous-valued) state
- Value function approximation: Fitted value iteration algorithm
- Assume that the problem has a continuous state space
S
ofn-D
, but that the action spaceA
is small and discrete - Recall that in value iteration
Vpi(s) = R(s) + gamma * ∫_s'(P_(spi(s))(s') * Vpi(s'))ds' = R(s) + gamma * E_(s'~P_spi(s))[Vpi(s')]
- The main idea:
- Approximately carry out this step, over a finite sample of states
si
using machine learning - Specifically, we will use a supervised learning algorithm—linear regression in our description below—to approximate the value function as a linear or non-linear function of the states:
V(s) = theta^T * phi(s)
, wherephi
is some appropriate feature mapping of the states
- For each state
s
in our finite sample ofm
states, fitted value iteration will first compute a quantityyi
, which will be our approximation toR(s) + gamma * max_a(E_(s'∼P_sa)[V(s')])
. Then, it will apply a supervised learning algorithm to try to getV(si) = theta^T * phi(si)
close toyi
- Cannot be proved to always converge
- When we need to choose an action for some state
s
, The process for computing/approximating this is similar to the inner-loop of fitted value iteration - The method to choose the
k
above
LQR, DDP and LQG
Linear Quadratic Regulation, Differential Dynamic Programming and Linear Quadratic Gaussian
Finite-horizon MDPs
- To place ourselves in a more general setting:
- We rewrite the sum or integral in the expectation form:
E_(s'~P_sa)[Vpi^*(s')]
- Assume that the reward is a function of both state and action, R(s, a)
; then the reward function goes inside the max or argmax - Finite horizon: we set a finite ending time
T
Therefore, our definition of payoff is going to be slightly different: (gamma
becomes unnecessary)
R(s_0, a_0) + R(s_1, a_1) + ...+ R(s_T, a_T)
- We rewrite the sum or integal into the expectation form:
- In this new setting, things behave quite differently:
- First, the optimal policy
pi^*
might be non-stationary, meaning that it changes over time:
pit : S -> A
(e.g.a_0 = pi0(s_0)
) - Time Dependent dynamics:
s_(t+1) ~ Pt_stat
Rt(s_t, a_t)
Then(S, A, Pt_sa, T, Rt)
- Def Value function and optimal value function as before
- First, the optimal policy
- Bellman’s equation is made for Dynamic Programming
- Solve for the optimal value function:
```
1. Compute V^*_T(s) := max_a(RT(s, a))
2. For t = T-1, ..., 0:
     compute V^*_t(s) := max_a(Rt(s, a) + E_(s'~Pt_sa)[V^*_(t+1)(s')])
```
Linear Quadratic Regulation (LQR)
- Assumptions:
- S n-D, A d-D
- Linear transitions with noise:
s_(t+1) = A_t * s_t + B_t * a_t + w_t
where Gaussian noisew_t ~ hN(0, Sigma_t)
- Quadratic Rewards:
Rt(s_t, a_t) = -s_t^T * U_t * s_t - a_t^T * W_t * a_t
whereU_t, W_t
are both positive definite matrices
Note that the quadratic formulation of the reward is equivalent to saying that we want to take smooth actions that keep our state close to the origin
- Steps:
- step 1: If we need to estimate
A, B, Sigma
, first, collect transitions from an arbitrary policy. Then, use linear regression to findargmin_(A, B)(sum_i(sum_t(||si_(t+1) - (A * si_t + B * ai_t)||^2)))
. Finally, use a technique seen in Gaussian Discriminant Analysis to learn Sigma - Step 2: Derive the optimal policy using dynamic programming
- Fact 1: It can be shown that if
V^*_(t+1)
is a quadratic function ins_t
, thenV^*_t
is also a quadratic function:
IfV^*_(t+1)(s_(t+1)) = s_(t+1)^T * phi_(t+1) * s_(t+1) + psi_(t+1)
ThenV^*_t(s_t) = s_t^T * phi_t * s_t + psi_t
Where fort = T
, we hadphi_t = -U_T, psi_t = 0
- Fact 2: We can show that the optimal policy is just a linear function of the state
Discrete Riccati equation:
phi_t = A_t^T * (phi_(t+1) - phi_(t+1) * B_t * (B_t^T * phi_(t+1) * B_t - W_t)^-1 * B_t * phi_(t+1)) * A_t - U_t
psi_t = -tr(Sigma_t * phi_(t+1)) + psi_(t+1)
- Fact 3: We notice that
phi_t
depends on neitherpsi
nor the noiseSigma
. It implies that the optimal policy also does not depend on the noise
((Butpsi_t
does depend onSigma_t
, which implies thatV^*_t
depends onSigma
)
- As the optimal policy does not depend on
psi_t
, and the update of phi_t
only depends on phi_(t+1)
, it is sufficient to update only phi_t
From non-linear dynamics to LQR
- It turns out that a lot of problems can be reduced to LQR, even if dynamics are non-linear
- Taylor expansion: works well for cases where the goal is to stay around some state
s^*
- Differential Dynamic Programming (DDP)
We’ll cover a method that applies when our system has to follow some trajectory (think about a rocket). This method is going to discretize the trajectory into discrete time steps, and create intermediary goals around which we will be able to use the previous technique - Steps:
- Step 1: come up with a nominal trajectory using a naive controller that approximates the trajectory we want to follow. In other words, our controller is able to approximate the goal trajectory with:
s^*_0, a^*_0 -> s^*_1, a^*_1 -> ...
- Step 2: linearize the dynamics around each trajectory point
s^*_t
using Taylor expansion
Note We can apply a similar derivation for the rewardRt
, with a second-order Taylor expansion - Step 3: Now, the problem is strictly re-written in the LQR framework. Let’s just use LQR to find the optimal policy
pi_t
. As a result, our new controller will (hopefully) be better - Step 4: We use the new controller (our new policy
pi_t
) to produce a new trajectory
Note that when we generate this new trajectory, we use the realF
and not its linear approximation to compute transitions - Then, go back to step 2 and repeat until some stopping criterion
Linear Quadratic Gaussian (LQG)
- So far, we assumed that the state was available. As this might not hold true for most of the real-world problems, we need a new tool to model this situation:
Partially Observable MDPs (POMDP) - A POMDP is an MDP with an extra observation layer.
In other words, we introduce a new variableo_t
, that follows some conditional distribution given the current states_t
o_t | s_t ~ O(o | s)
- Formally, a finite-horizon POMDP is given by a tuple:
(S, O, A, P_sa, T, R)
Within this framework, the general strategy is to maintain a belief state (distribution over states) based on the observationo_1, ..., o_t
. Then, a policy in a POMDP maps belief states to actions
- Suppose that, we observe
y_t (m-D)
withm < n
such that:
y_t = C * s_t + v_t
s_(t+1) = A * s_t + B * a_t + w_t
Wherev_t
is the sensor noise (also gaussian, likew_t
) - Note that the reward function
Rt
is left unchanged, as a function of the state (not the observation) and action
- Steps:
- Step 1: first, compute the distribution on the possible states (
s_t|t, Sigma_t|t
), based on the observations we have:
s_t | y_1, ..., y_t ~ hN(s_t|t, Sigma_t|t)
Using Kalman Filter algorithm - Step 2: We’ll use the mean
s_t|t
as the best approximation fors_t
- Step 3: then set the action
a_t := L_t * s_t|t
whereL_t
comes from the regular LQR algorithm
- Kalman filter algorithm
- Predict step: compute
s_(t+1) | y_1, ..., y_t
Suppose we know:s_t | y_1, ..., y_t ~ hN(s_t|t, Sigma_t|t)
. Thens_(t+1) | y_1, ..., y_t ~ hN(s_t+1|t, Sigma_t+1|t)
, wheres_t+1|t = A * s_t|t; Sigma_t+1|t = A * Sigma_t|t * A^T + Sigma_s
(Sigma_s
is the variance ofs
) - Update step: compute
s_(t+1) | y_1, ..., y_(t+1)
As we know above, we can prove thats_(t+1) | y_1, ..., y_(t+1) ~ hN(s_t+1|t+1, Sigma_t+1|t+1)
, where s_t+1|t+1 = s_t+1|t + K_t * (y_(t+1) - C * s_t+1|t); Sigma_t+1|t+1 = Sigma_t+1|t - K_t * C * Sigma_t+1|t
withK_t := Sigma_t+1|t * C^T * (C * Sigma_t+1|t * C^T + Sigma_y)^-1
(Sigma_y
is the variance ofy
, andK_t
is called the Kalman gain)
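A small sketch of one predict + update step of the Kalman filter above; the action term B * a_t is omitted for brevity and the argument names (Sigma_s for the state-noise covariance, Sigma_y for the sensor-noise covariance) are illustrative.

```python
import numpy as np

def kalman_step(s_hat, Sigma_hat, y_next, A, C, Sigma_s, Sigma_y):
    """One predict/update step: (s_t|t, Sigma_t|t) and y_{t+1} -> (s_{t+1|t+1}, Sigma_{t+1|t+1})."""
    # Predict step: s_{t+1|t}, Sigma_{t+1|t}
    s_pred = A @ s_hat
    Sigma_pred = A @ Sigma_hat @ A.T + Sigma_s
    # Update step with the new observation y_{t+1}
    K = Sigma_pred @ C.T @ np.linalg.inv(C @ Sigma_pred @ C.T + Sigma_y)   # Kalman gain
    s_new = s_pred + K @ (y_next - C @ s_pred)
    Sigma_new = Sigma_pred - K @ C @ Sigma_pred
    return s_new, Sigma_new
```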