- History & Brief introduction
- Image classification pipeline
- Loss Function and Optimization
- Backpropagation and Neural Networks
- Convolutional Neural Networks (ConvNet)
- Training Neural Networks
- Deep Learning Hardware and Software
- CNN Architectures
- Recurrent Neural Networks
- Detection, Segmentation and Localization
- Visualizing and Understanding
- Generative Models
- Deep Reinforcement Learning
- Efficient Methods and Hardware for Deep Learning
- Adversarial Examples and Adversarial Training
History & Brief introduction
- Animal vision development
- The invention of cameras
- Hubel & Wiesel, 1959: how the visual processing mechanism of mammals works
- CV start
- David Marr: proposing a mechanism
- Primal Sketch: edges, bars, ends, virtual lines, etc.
- 2.5-D Sketch: piece together the surfaces, depth, etc.
- 3-D Model Representation
- Generalized Cylinder & Pictorial Structure:
- Every object can be composed of simple geometric configurations
- David Lowe:
- Using lines and edges to represent objects
- Vision too hard — Start from image segmentation
- Narrowing to a specific task & using machine learning — real-time face detection
- Feature-based object recognition: "SIFT"
- Histogram of Gradients & Deformable Part Model
- PASCAL Visual Object Challenge — test the progress we have made in object recognition
- Overfit problem
- ImageNet; ImageNet Large Scale Visual Recognition Challenge: whether the first five predicted labels contain the correct one
- 2012: huge drop of the error rate — the use of CNNs
- Other challenges: object detection, action classification, image captioning, etc.
- The original CNN first appeared in 1998
- Why it only became popular in recent years:
- Moore’s law & Increasing computation
- Data explosion
- Other challenges beyond plain object recognition:
- Movement recognition
- 3-D reconstruction
- Recognition of every single pixel of the image
- Understand the story behind the image, which means truly understanding picture
Image classification pipeline
- The problem: the semantic gap:
  the huge gap between the semantic idea of a cat (or anything else) and the huge matrix of pixel values that the computer is actually seeing.
- Challenges:
- View point
- Illumination
- Deformation
- Occlusion
- Background clutter
- Intraclass variation
- Attempts: find edges, corners, etc.
  Very hard & hardly scalable
- Some points of view:
  - original pictures
  - images as points in a high-dimensional space
kNN: An example of a simple classification method
- Data-driven approach: Split the API into two functions, one train, one predict
- Collect a dataset of images and labels
- Use Machine Learning to train a classifier
- Evaluate the classifier on new images
- Exercise: Nearest Neighbor (CIFAR10 dataset)
  - Train step: memorize all pictures
  - Predict step: compare and find the training picture most similar to the test picture
  - What is "similar"? L1 (Manhattan) distance: $d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$
```python
import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        # X is N x D where each row is an example; y is 1-D of size N containing the labels
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        num_test = X.shape[0]
        Y_predict = np.zeros(num_test, dtype=self.y_train.dtype)
        for i in range(num_test):
            # L1 distance of the i-th test image to every training image
            distances = np.sum(np.abs(self.X_train - X[i, :]), axis=1)
            min_index = np.argmin(distances)       # closest training example
            Y_predict[i] = self.y_train[min_index]
        return Y_predict
```
- Problem: Train O(1), Predict O(N) — the opposite of what we want: slow training is acceptable, but prediction must be fast
- K-Nearest Neighbors (a small improvement over the above):
  - take a majority vote among the K closest points (see the sketch after this list)
  - use L2 (Euclidean) distance instead: $d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$
  - BTW: L1 depends on the choice of coordinate system, so it may fit better when the individual entries of your vector have specific meanings for your task
  - BTW: by just changing the distance metric, the method can be applied to many kinds of tasks
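A minimal numpy sketch of the k-NN predict step for one test point (a standalone helper I am assuming here, not part of the lecture code):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # L2 distances from the test point to every training example
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    labels = y_train[nearest]
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]          # majority vote
```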
- Hyperparameters (like the K value and the distance metric above): very problem-dependent and mostly impossible to learn from data, so most people just try values out.
  - Idea 1: choose the ones that work best on the training data.
    Terrible (in the example above, K = 1 is always perfect on the training data); never do it.
  - Idea 2: split the data into train and test parts; choose the ones that work best on the test data.
    Terrible, never do it: we would have no idea how the algorithm will perform on new data.
  - Idea 3: split the data into three parts: train, validation, test. Train on the train set, choose hyperparameters on the validation set, then run on the test set only once at the very end.
    Better.
  - Idea 4: cross-validation. Split the data into folds, try each fold as the validation set and average the results (accuracy), then proceed as in Idea 3.
    Even better, but not used often in practice.
Linear classification: Another example
- Parametric approach: input image x, weight parameters W; we compose a function f(x, W) that outputs 10 numbers giving class scores (as above with CIFAR10, so 10 categories in total)
- Linear Classifier: $f(x, W) = W \times x + b$
- CIFAR10: x is 32 x 32 x 3, reshaped to 3072 x 1
- W 10 x 3072
- b: 10 x 1. To understand b: if the training data has more cats than dogs, it is reasonable to expect b_cat to be larger than b_dog
- Can be considered as a template matching method. Each row of W is a template of the category
- Limitation: only one template per class; when the images within a class vary a lot, the classifier is forced to average them all into that single template
- Hard cases for a linear classifier: Consider f as a line in high-D space
- parity problem: e.g. quadrants 1 & 3 vs. quadrants 2 & 4; such as classifying an odd vs. even number of people or animals
- multimodal: islands of one class inside other classes
- Score function: $s = f(x; W) = Wx$
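A minimal numpy sketch of the linear score function with the CIFAR10 shapes above (random weights, purely for illustration):

```python
import numpy as np

x = np.random.randn(3072)               # one 32 x 32 x 3 image, flattened
W = 0.0001 * np.random.randn(10, 3072)  # one template per row
b = np.zeros(10)

s = W.dot(x) + b                        # 10 class scores
predicted_class = np.argmax(s)
```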
Loss Function and Optimization
- Todo from last time:
- Define a loss function that quantifies our unhappiness with the scores across the training data
- Come up with a way of efficiently finding the parameters that minimize the loss function — optimization
Loss Function
- Definition: tells how good our current classifier is.
- Ways of understanding:
- To set a loss func, what you actually do is to tell the computer what kinds of errors you care about more and what kinds of errors you can kind of accept — the linear and square func is an example
- Example 1: Multiclass SVM (Support Vector Machine) loss (hinge loss)
  - Formula: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
    (where $\Delta$ is a sort of safety margin we set)
  - Min 0; max infinity
  - Trick: at the beginning we choose W randomly, so all scores should be ≈ 0 and approximately equal, hence $L_i \approx ClassNum - 1$
  - If we square the term inside the sum, we get a different loss function and hopefully a different W — a trick we can use when designing our own loss functions
  - If we find a zero-loss W, it is definitely not unique — 2W also gives zero loss
  - That's because we told the classifier to find a W that fits the training data. But in practice we don't really care about that; we really care about the classifier's performance on test data!
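A minimal numpy sketch of the hinge loss above for a single example (with Δ = 1; names are illustrative):

```python
import numpy as np

def svm_loss_i(s, y_i, delta=1.0):
    # s: vector of class scores, y_i: index of the correct class
    margins = np.maximum(0, s - s[y_i] + delta)
    margins[y_i] = 0          # do not count the correct class itself
    return np.sum(margins)
```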
- Regularization: encourage the model to choose a "simple" W — to solve the problem above:
  - $L = \frac{1}{N}\sum_i L_i + \lambda R(W)$, i.e. data loss + lambda * regularization loss
  - $\lambda$ is a very important hyperparameter
  - Examples: L2, L1, Elastic net ($\beta$ L2 + L1)
    - Important difference: what each one considers "complex"
    - L2 prefers to spread the influence across all pixels, while L1 prefers to concentrate it
- Example 2: Softmax loss (multinomial logistic regression)
  - Softmax function: normalized probabilities of the classes: $P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$
  - Softmax loss: the true class's probability should be close to 1: $L_i = -\log P(Y = y_i \mid X = x_i)$
  - Min 0; max infinity.
    But to reach 0 we would need an infinite score on the correct class and minus-infinite scores on the others, so we never get exactly zero loss
  - Trick: when all scores are ≈ 0 and approximately equal, $L_i \approx -\log(1/ClassNum) = \log(ClassNum)$
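A minimal numpy sketch of the softmax loss above for a single example, with the usual max-shift for numerical stability (an assumption of this sketch, not discussed in the notes):

```python
import numpy as np

def softmax_loss_i(s, y_i):
    s = s - np.max(s)                   # shift so the largest score is 0
    p = np.exp(s) / np.sum(np.exp(s))   # normalized class probabilities
    return -np.log(p[y_i])              # -log probability of the true class
```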
- Contrast between the two loss functions:
  - How we choose to interpret the scores to measure the badness
  - What they want to do:
    - SVM just wants the correct class's score to be much bigger than the others';
    - Softmax wants it to go towards infinity and the others towards minus infinity
Optimization
- Strategy: follow the slope — the gradient, which points in the direction of greatest increase
- Numerical & analytic gradients: the numerical one to debug, the analytic one to train
- Gradient Descent:
- Step size (also called Learning rate): the single most important hyperparameter!
- Update rules
- Stochastic Gradient Descent (SGD): because N is super large, instead of using all N inputs we use a minibatch of samples (see the sketch below)
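A minimal sketch of the minibatch SGD loop, assuming a hypothetical `loss_and_grad(W, X_batch, y_batch)` that returns the loss and its gradient:

```python
import numpy as np

def sgd(W, X, y, loss_and_grad, learning_rate=1e-3, batch_size=256, steps=1000):
    for _ in range(steps):
        idx = np.random.choice(X.shape[0], batch_size)  # sample a minibatch
        loss, dW = loss_and_grad(W, X[idx], y[idx])
        W -= learning_rate * dW                         # step down the gradient
    return W
```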
- Image features and the motivation: in practice, we prefer not to fit the raw pixels but the "features" we extract from them
- Example 1: Color Histogram
- Example 2: Histogram of Oriented Gradients(HoG, Edges direction and distribution)
- Example 3: Bag of Words(Define visual words)
Backpropagation and Neural Networks
Backpropagation
- Computational graph: like a flowchart; each node represents an original variable or an operation
- A simple example:
f(x,y,z) = (x + y) * z
- Draw the graph:
```
x --\
     (+)--q--\
y --/         (*)--> f
z ------------/
```
- Assign every intermediate variable a name: q = x + y, so f = q * z
- To get the gradients of f, we start at the end and go back; at each step we compute $\partial f / \partial v$ (v an intermediate or original variable), using the chain rule — see the worked example below
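A worked numeric version of this graph in plain Python (using the lecture's usual values x = -2, y = 5, z = -4):

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass, starting from df/df = 1 and applying the chain rule
df_dz = q            # * gate: gradient w.r.t. z is the other input -> 3
df_dq = z            # -> -4
df_dx = df_dq * 1.0  # + gate distributes the upstream gradient -> -4
df_dy = df_dq * 1.0  # -> -4
```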
- Backpropagation:
  - Local node: some inputs `x, y`, a simple operation `f`, some outputs `z`.
    The node is only aware of its immediate surroundings, i.e. the things above.
  - Local gradient: for a local node we can get the gradients between the inputs and the output, such as $\partial z/\partial x$, $\partial z/\partial y$.
    Multiply them by the gradient arriving from behind the node ($\partial f_{final}/\partial z$, called the upstream gradient) to get the gradient of the final output with respect to this node's inputs.
    If there is more than one output, i.e. more than one upstream gradient, add them up as the node's total upstream gradient.
  - Going back from the end of the computational graph to the beginning, we obtain the full gradient.
- We can group several simple operations `f` together into a slightly more complex node, as long as we can directly write down its local gradient
  - Such as the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, whose local gradient is $(1 - \sigma(x))\,\sigma(x)$
  - Can be viewed as a trade-off between how much math we derive by hand and how much the graph computes numerically
- Patterns in backward flow:
- Add gate: gradient distributor;
- Max gate: gradient router;
- Mul (multiply) gate: gradient switcher
- Gradients for vectorized code (which means that x, y, z are vectors instead of numbers):
- All the local gradients and upstream gradients are Jacobian matrices; basically we just apply the chain rule
- Can compute element-wise or write it out in vectorized (matrix) multiply form
- If the final output is a number, not a vector, each full gradient has the same shape as its corresponding input
- What if the output were a vector? The gradient with respect to a matrix would then be a 3-D tensor; but gradient descent takes the gradient of the loss function, not of the hypothesis, and the loss outputs a number, never a vector
Neural Networks
- First see it as a function, without the brain stuff:
- Before: Linear score function:
f = W * x
- Now: 2-layer Neural Network:
f = W_2 * max(0, W_1 * x)
- The non-linearity that the max function brings is very important and necessary
- Understanding it from the template point of view:
  `W_1` is a collection of many templates, and `W_2` recombines the templates into class scores
- We can make it even deeper by repeating "add a non-linearity, then another linear layer" (see the sketch below)
- Inspiration from neural system
- Non-linearity: called activation function
Such as sigmoid, Leaky ReLU, tanh, Maxout, ReLU, ELU
- The architectures: "3-layer Neural Network" or "2-hidden-layer Neural Network"
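A minimal numpy sketch of the 2-layer network above (shapes chosen for CIFAR10 with a 100-unit hidden layer; illustrative only):

```python
import numpy as np

def two_layer_net(x, W1, W2):
    h = np.maximum(0, W1.dot(x))   # ReLU non-linearity between the linear layers
    return W2.dot(h)               # class scores

x = np.random.randn(3072)
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)
scores = two_layer_net(x, W1, W2)
```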
Convolutional Neural Networks (ConvNet)
- A bit of history
From a function perspective (without the brain stuff)
- Recall: Fully Connected Layer
- Convolution Layer
- Instead of stretching the image to a 1-D vector, we keep its structure — preserve spatial structure
- Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".
  Note that the filters always extend the full depth of the input volume
- To get the dot product, we stretch the filter into a 1-D vector `w` and the overlapping chunk of the image into a 1-D vector `x`, then get a number output via `w^T x + b`.
  Or consider it as overlapping, then element-wise multiplication and summation
- To do the "slide": center the filter on top of every pixel of the input volume, do the dot product, and fill the resulting number into the corresponding position of the output matrix (2-D for a 3-D input), called an activation map
- Apply several filters (of the same shape) to the same input, and we get the same number of activation maps out
- A ConvNet is a sequence of convolutional layers, interspersed with activation functions
- To understand a ConvNet: going deeper along the network (usually the depth of the volumes also grows), the layers can be seen as looking for features from simple to complex,
  e.g. from local features to lines, then to corners and blobs, then to assemblies that resemble objects
- To understand a filter, we can also see it as a template. By visualizing the filter we can find out what pattern it is looking for, i.e. what kind of input produces the largest output
- Stride: the step size with which the filter slides over the input.
  Output size: (N - F + 2·pad) / stride + 1 for input size N and filter size F
- Zero padding: to make it possible to center the filter at the corners of the input, we pad with zeros.
  With a stride of 1 we can then maintain the output size, preventing it from shrinking
- 1 x 1 convolution layers make perfect sense: they mix information along the depth dimension (see the sketch below)
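A minimal numpy sketch of the output-size arithmetic and a naive single-filter convolution (loop-based, no padding, for illustration; real implementations vectorize this):

```python
import numpy as np

def conv_output_size(N, F, stride=1, pad=0):
    return (N - F + 2 * pad) // stride + 1   # (N - F + 2P) / S + 1

def naive_conv(x, w, b, stride=1):
    """x: H x W x D input volume; w: F x F x D filter; b: scalar bias.
    Returns one 2-D activation map."""
    H, W_in, _ = x.shape
    F = w.shape[0]
    out = np.zeros(((H - F) // stride + 1, (W_in - F) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            chunk = x[i * stride:i * stride + F, j * stride:j * stride + F, :]
            out[i, j] = np.sum(chunk * w) + b   # dot product over the full depth
    return out
```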
The brain view of a convolution layer
- The local input region a neuron looks at (the filter's extent) is also called its receptive field
Other layers in the ConvNet
- Pooling layers (downsampling)
  - make the representations smaller and more manageable
  - operate over each activation map independently
  - act only spatially; do nothing to the depth
  - Example: max pooling
  - Instead of pooling, we can also use strided convolutions to do the same thing — downsampling
- ReLU or Other non-linearities: the same as Neural Networks
- Fully Connected Layer (FC layer)
Training Neural Networks
Activation Functions
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Squashes numbers to range [0, 1]
- Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron
- Three problems:
  - Saturated neurons "kill" the gradients:
    when inputs are very far from 0, the local gradient becomes 0, which is bad during backprop
  - Outputs are not zero-centered:
    the gradients on w will then all be positive or all negative, forcing zig-zag updates and making descent less efficient
  - exp() is a bit compute-expensive (less important)
- tanh:
  - Squashes numbers to range [-1, 1]
  - Zero-centered
  - Still, saturated neurons "kill" the gradients
- Rectified Linear Unit (ReLU):
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
- Actually more biologically plausible than sigmoid
- Problems:
  - Not zero-centered output
  - Saturates in the negative region ("dead ReLU"):
    can happen with a badly initialized weight matrix or a learning rate that is too large.
    To reduce the chance of dead ReLUs, people like to initialize with slightly positive biases (e.g. 0.01)
- Leaky ReLU:
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
- Will not “die”
- PReLU: $f(x) = \max(\alpha x, x)$, with $\alpha$ as a parameter learned during training
- Exponential Linear Units (ELU):
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime compared with Leaky ReLU adds some robustness to noise
- But requires computing exp()
- Maxout "neuron": $\max(w_1^T x + b_1, w_2^T x + b_2)$
  - Does not saturate, does not die
  - Doubles the number of parameters
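A minimal numpy sketch of the activation functions above:

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))    # squashes to [0, 1]
def tanh(x):       return np.tanh(x)                   # squashes to [-1, 1]
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.where(x > 0, x, 0.01 * x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))
```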
Data Preprocessing
- Zero-mean & Normalize
- Subtract the mean image (e.g. AlexNet) or the per-channel mean (e.g. VGGNet).
  Important because if the data is not zero-centered, the loss function is too sensitive to small perturbations
- We don't really normalize further in image processing (pixel values are already on a comparable scale)
- PCA & whitening (more complex; usually not done in image processing)
- Whatever preprocessing we apply to the training set, we must apply the exact same computation, with the exact same statistics, to the test set
Weight initialization
- First idea: small random numbers:
  `w = 0.01 * np.random.randn(D, H)`
  - Works well in small networks, but causes problems as networks get deeper: the activations shrink toward zero layer by layer, and the gradients vanish with them
- Xavier initialization:
  `W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)`
  - Works well with tanh
  - To fit ReLU (since it kills half of the inputs), one possible modification is
    `W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)`
- Still a research area;
  MSRA (He) initialization is another option you can use
Batch Normalization
- Idea: we want to keep activations in a gaussian range
- To make each dimension of a batch of activations at some layer unit-gaussian, apply:
  $\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
- Usually inserted after FC layers (per dimension) or convolutional layers (jointly, per activation map) and before the non-linearity
- We then allow the network to squash and shift the range if it wants to:
  $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$.
  Notice the network can learn $\gamma^{(k)} = \sqrt{Var[x^{(k)}]}$, $\beta^{(k)} = E[x^{(k)}]$ to recover the identity mapping
- Benefits:
- Improves gradient flow through the network
- Allows higher learning rates
- Kind of more robust
- Reduces the strong dependence on initialization
- Act as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
- Note: at test time the BatchNorm layer functions differently. The mean/std are not computed from the batch; instead, a single fixed empirical mean/std of activations from training is used (e.g. estimated during training with running averages). See the sketch below
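A minimal numpy sketch of the training-time BatchNorm forward pass above (at test time, `mu` and `var` would be replaced by the running averages collected during training):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) batch; normalize each dimension to be unit gaussian
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta    # learnable squash (gamma) and shift (beta)
```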
Babysitting the Learning Process
- The starting steps when applying a ConvNet:
  - Step 1: Preprocess the data
  - Step 2: Choose the architecture.
    Notable: double-check that the initial loss is reasonable (the sanity-check trick from the loss-function lecture; also the loss should go up when regularization is enabled)
  - Step 3: Try to train
    - Tip: first make sure you can overfit a very small portion of the training data very well (very small loss, accuracy 1.00)
    - Start with small regularization and find a learning rate that makes the loss go down (not going down: too low; exploding: too high)
    TODO
Hyperparameter Optimization
- Cross-validation strategy (in stages):
  - First stage: only a few epochs, to get a rough idea of what parameters work
  - Second stage: longer running time, finer search
  - Repeat as necessary
  - Tip for spotting what doesn't work: if the cost ever exceeds 3 × the original cost, break out early
  - Note it's best to optimize in log space (`10**uniform(s, e)`)
  - Compared to grid search, random search is actually better, because it gives a better sense of what the function looks like (see the sketch at the end of this section)
- Some special situation:
- loss flat for a while, then going down: maybe bad initialization
- big gap between train and validation accuracy: maybe overfitting
- no gap between them: consider increase capacity
- Track the ratio of weight updates / weight magnitudes: around 0.001 or so is good
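A minimal sketch of random search with log-space sampling, assuming a hypothetical `train_and_eval(lr, reg)` that returns validation accuracy:

```python
import numpy as np

def random_search(train_and_eval, trials=100):
    best_params, best_acc = None, -1.0
    for _ in range(trials):
        lr = 10 ** np.random.uniform(-6, -1)    # sample in log space
        reg = 10 ** np.random.uniform(-5, 1)
        acc = train_and_eval(lr, reg)
        if acc > best_acc:
            best_params, best_acc = (lr, reg), acc
    return best_params, best_acc
```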
Fancier Optimization
- Problems with SGD:
  - If the loss function changes quickly in one direction and slowly in another (high condition number: the ratio of largest to smallest singular value of the Hessian),
    SGD makes very slow progress along the shallow dimension and jitters in a zig-zag along the steep direction.
    In high dimensions this happens more often
  - May get stuck in a local minimum or at a saddle point.
    In high dimensions the problem is more about saddle points than local minima — and not exactly at the saddle point but near it, where the gradient is tiny
  - Our gradients come from minibatches, so they can be noisy
- Minor improvement: SGD + Momentum
  - Build up "velocity" as a running mean of gradients, and move in the direction of the velocity
  - `rho` gives the "friction", i.e. how much of the old velocity contributes to the next one; typically rho = 0.9 or 0.99
  - The formula: $v_{t+1} = \rho v_t + \nabla f(x_t)$; $x_{t+1} = x_t - \alpha v_{t+1}$
  - Solves all three problems!
  - Nesterov Momentum:
    instead of taking the gradient at the point we are standing on, we first step in the direction of the current velocity, take the gradient there, then come back and compute the new velocity.
    Another form: $v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)$; $x_{t+1} = x_t + v_{t+1}$
- AdaGrad:
  - Keep a running sum of the squares of the gradients, and divide the step by its square root:
    `grad_squared += dx * dx; x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)`
  - As training goes on, `grad_squared` keeps growing, so the effective learning rate keeps shrinking — good for convex problems; otherwise we may still get stuck near a saddle point
- Improvement: RMSProp
  - Instead of `grad_squared += dx * dx`, we do
    `grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx`,
    where `decay_rate` is commonly 0.9 or 0.99
- Adam algorithm
  - Roughly RMSProp + Momentum:
    `first_moment = beta1 * first_moment + (1 - beta1) * dx`
    `second_moment = beta2 * second_moment + (1 - beta2) * dx * dx`
    `x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)`
  - But if we initialize `second_moment = 0`, and because `beta2` is typically 0.9 or 0.99 (near 1), the second moment stays small for the early steps, and dividing by it produces a huge step. That's a problem
  - To solve it: add bias-correction terms (t is the current time step):
    `first_unbias = first_moment / (1 - beta1 ** t)`
    `second_unbias = second_moment / (1 - beta2 ** t)`
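Putting the pieces together, a minimal numpy sketch of one bias-corrected Adam step:

```python
import numpy as np

def adam_step(x, dx, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * dx         # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dx * dx    # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```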
Problem with all of these: they are axis-aligned. That is to say, if we rotate the picture, the algorithm cannot really un-rotate it
- Learning rate decay: start with a relatively large learning rate and then decay it during training
- Step decay: decay the learning rate by a factor every few epochs.
  Exponential decay: $\alpha = \alpha_0 e^{-kt}$.
  1/t decay: $\alpha = \alpha_0 / (1 + kt)$
- Common with SGD + Momentum, less common with Adam
- Learning rate decay is a second-order hyperparameter: tune the initial rate first
- All of the above are first-order optimization.
  Second-order optimization: Newton's method. But $H^{-1}$ is too big, so there are methods that approximate the Hessian
  (they don't actually handle the stochastic case very nicely, and tend not to work very well on non-convex problems):
  - Quasi-Newton methods
  - L-BFGS
Regularization
- Everything above is about reducing training error, but we don't really care about that; what we really care about is reducing the gap between performance on the training set and on unseen samples (the test set)
- Model ensembles: tend to improve results by ~2%
  - Train multiple independent models
  - At test time, average their results
  - Tricks:
    - Instead of training independent models, use multiple snapshots of a single model taken during training
    - Polyak averaging: instead of using the final parameter vector, keep a moving average of the parameter vector over training and use that at test time
- Regularizations: To improve single-model performance
- Dropout: in each forward pass, randomly set some neurons to zero; every forward pass zeroes a different random set of neurons
  - The dropout probability is a hyperparameter; 0.5 is common
  - Often applied to FC layers;
    sometimes also to convolutional layers, where instead of dropping single neurons we might drop entire feature maps at random
  - Why this works:
    - Helps prevent co-adaptation of features, i.e. prevents the network from spreading a decision across too many "overlapping" features
    - Can also be viewed as training a large ensemble of models
- At test time, we want to wipe out the randomness:
  - Consider a single neuron with inputs `x, y` and output `a`.
    During training with dropout probability 1/2 we have
    $E[a] = \frac{1}{4}(w_1 x + w_2 y) + \frac{1}{4}(w_1 x) + \frac{1}{4}(w_2 y) + \frac{1}{4}\cdot 0 = \frac{1}{2}(w_1 x + w_2 y)$
  - So at test time, we just run the model without dropout and then multiply the output by the dropout probability
  - Inverted dropout: instead of multiplying by the probability at test time, we divide by it during training (to make test time more efficient); see the sketch below
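A minimal numpy sketch of inverted dropout, dividing by p at training time so test time needs no change:

```python
import numpy as np

p = 0.5  # probability of keeping a neuron

def dropout_train(h):
    mask = (np.random.rand(*h.shape) < p) / p   # scale here instead of at test
    return h * mask

def dropout_test(h):
    return h                                    # nothing to do at test time
```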
- Common pattern of Regularization:
- Training: add some kind of randomness
- Testing: average out randomness (or approximatly)
- More examples:
- Batch Normalization
- Data augmentation: train on random transformations of the original pictures (flips, crops, scales, color jitter, rotations, shearing, etc.). Consider how we can transform the data without changing its label
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
Transfer Learning
- Another point of view on overfitting: you don't have enough data
- Steps of transfer learning:
  - Train on a large enough, similar dataset
  - Freeze the hidden layers, then reinitialize and train the last FC layers on the small dataset you want to apply the model to
  - When you have a larger dataset, maybe consider updating larger parts of the network
  - Note that you probably want to drop the learning rate when you do the transfer
- Pretraining and fine-tuning are very commonly used in many areas
Deep Learning Hardware and Software
CPU vs. GPU
- GPU: (DL prefers NVIDIA)
  - much larger number of cores (more than 100×), but each core is much slower (lower clock speed) and "dumber"
  - has its own fairly large RAM and its own caching system
  - great for highly parallel tasks with high similarity between work items
- Example: Matrix Multiplication
- Programming GPUs
- CUDA (NVIDIA only)
- OpenCL (runs on anything, but usually slower)
- CUDA (NVIDIA only)
- Data Transfer problem (from disk to GPUs). Solutions:
- Read all data into RAM (if your data is relatively small)
- Use SSD instead of HDD
- Use multiple CPU threads to prefetch data
Deep Learning Frameworks
TensorFlow
- Rough Process:
- First define the computational graph;
- Then run the graph many times
- Detailed process:
  - Placeholders (step 'P'): `x = tf.placeholder(tf.float32, shape=(rx, lx))`; `y, w1, w2` similar
  - Define the forward pass: compute the prediction for y and the loss — `h, y_pred, loss` (no computation happens; this just builds the graph)
  - Tell TF to compute the gradients of the loss (step 'T'): `grad_w1, grad_w2` (still just building the graph)
  - Enter a session so we can actually run the graph (step 'E'): `with tf.Session() as sess:`
  - Create numpy arrays that will fill in the placeholders above: `values = {x: ..., w1: ..., ...}`
  - Run the graph (step 'R'): feed in the numpy arrays; get the outputs we want:
    `out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)`
    `loss_val, grad_w1_val, grad_w2_val = out`
  - Use the gradients to update the weights and repeat step 'R'
- Problem: copying weights between CPU/GPU at each step is expensive! Solution:
  - Instead of placeholders, use Variables for `w1, w2` at step 'P' and tell TF how we want them initialized (still not actually initialized yet):
    `w1 = tf.Variable(tf.random_normal((rw, lw)))`
  - Before step 'E' and after step 'T' (step 'B'), add assign operations that update `w1, w2` as part of the computational graph:
    `new_w1 = w1.assign(w1 - learning_rate * grad_w1)`
  - After step 'E':
```python
sess.run(tf.global_variables_initializer())
values = {x: np.random.randn(rx, lx), y: np.random.randn(ry, ly)}
for t in range(TotStep):
    loss_val, = sess.run([loss], feed_dict=values)
```
- But this will not update the weights, because we only told TF to compute the loss, not to run the updates. Solution:
  - A funny way of doing so:
    - Add a dummy node (which returns `None`) at step 'B': `updates = tf.group(new_w1, new_w2)`
    - Change the last line to `loss_val, _ = sess.run([loss, updates], feed_dict=values)`
  - The way TF provides:
    - Instead of steps 'T' and 'B', just use an optimizer to compute the gradients and update the weights:
      `optimizer = tf.train.GradientDescentOptimizer(1e-5)` (1e-5 is the learning rate)
      `updates = optimizer.minimize(loss)`
    - Change the last line to `loss_val, _ = sess.run([loss, updates], feed_dict=values)`
- TF also packs layers for us: after setting up `x, y`, we can just call the following to define layers without declaring the `w`'s and `b`'s ourselves:
```python
init = tf.contrib.layers.xavier_initializer()
h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu, kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=lx, kernel_initializer=init)
```
- Keras: High-Level Wrapper
- Other High-Level Wrappers: TFLearn, TensorLayer, Sonnet, etc.
- Pretrained models, TensorBoard, distributed version
- Like Theano
PyTorch
- Three Levels of Abstraction
- Tensor: Imperative ndarray, but runs on GPU (like the Numpy array for TF)
- Variable: Node in a computational graph; stores data and gradient (like Tensor, Variable, Placeholder for TF)
- Module: A neural network layer; may store state or learnable weight (like tf.layers, or TFSlim, or TFLearn, or Sonnet, or… for TF)
- To run Tensors on the GPU, we need a special datatype:
  `dtype = torch.cuda.FloatTensor`, then `x = torch.randn(rx, lx).type(dtype)` for Tensors.
  For Variables: `x = torch.autograd.Variable(torch.randn(rx, lx), requires_grad=False)`, which then contains `x.data` (the Tensor) and `x.grad` (the gradient).
  The Tensor and Variable APIs are the same;
  but with Variables we can simply call `loss.backward()` to compute the gradients (don't forget to zero the gradients first, since `.backward()` accumulates)
- The nn package, the optim package, and DataLoaders
- We can define our own new autograd functions and new modules in nn
- Pretrained Models, Visdom
- Update from Torch
- Static vs Dynamic Graphs
- Def:
- Static: TF — we first build the graph, then run it
- Dynamic: PyTorch — we just define Tensors/Variables, and the graph is built on the fly inside the loop as the code runs
- Optimization: Static ones can allow the computer to optimize the graph for you before it runs
- Serialization: Static ones once built, can serialize it and run it without the code that built the graph (which allow us to change coding environment)
- Conditional operations: dynamic graphs end up much cleaner; for static graphs you need special TF control flow (like `tf.cond`)
- Loops: like `y_t = (y_(t-1) + x_t) * w`, where sequences may even differ in length. Again, dynamic graphs are cleaner; for static graphs you need special TF control flow (in this case `tf.foldl`)
- TensorFlow Fold: Dynamic graphs in TF
Caffe & Caffe2
CNN Architectures
- Review: LeNet 5
- AlexNet:
- Architectures:
CONV1 (96 x 11 x 11 (x 3); stride 4 pad 0 => 35K parameters);
MAX POOL1 (3 x 3 (x 1); stride 2 => No parameter);
NORM1;
CONV2 (256 x 5 x 5 (x 48); stride 1 pad 2);
MAX POOL2 (3 x 3 (x 1); stride 2 => No parameter);
NORM2;
CONV3 (384 x 3 x 3 (x 128); stride 1 pad 1);
CONV4 (384 x 3 x 3 (x 192); stride 1 pad 1);
CONV5 (256 x 3 x 3 (x 192); stride 1 pad 1);
MAX POOL3 (3 x 3 (x 1); stride 2 => No parameter);
FC6 (4096 neurons);
FC7 (4096 neurons);
FC8 (1000 neurons);
Softmax
- First use of ReLU;
  heavy data augmentation;
  dropout; momentum; batches; decay (weight & learning rate); ensembles; etc.
- ZFNet: Change of hyperparameters
- VGGNet: small filters, deeper networks (VGG16, VGG19)
  - only 3 x 3 CONV stride 1 pad 1 & 2 x 2 MAXPOOL stride 2.
    That leads to deeper networks with more non-linearities and fewer parameters (a stack of three 3 x 3 convs sees the same region as one 7 x 7 conv)
  - No NORM layers
  - Most memory is in the early CONV layers;
    most parameters are in the late FC layers (but not the last one)
- GoogLeNet: deeper network, with computational efficiency
  - No FC layers => only 5M parameters
  - Efficient "Inception" module:
    apply parallel filter operations (CONVs of different sizes, and pooling) to the same input from the previous layer; concatenate all filter outputs together depth-wise.
    Usually CONV 1x1, 3x3, 5x5 and POOL 3x3
  - Problem: expensive to compute, and the depth always grows because the pooling branch preserves it. Solution:
    use 1x1 convs (bottleneck layers) to project the depth to a lower dimension (before the 3x3 and 5x5 CONVs, and after the POOL)
  - Architecture:
    - Stem network (Conv-Pool-2x Conv-Pool) + stacked Inception modules + classifier output (without the expensive FC stack: only one FC layer to produce the class scores)
    - Auxiliary classification outputs inject additional gradient into the lower layers (AvgPool-1x1Conv-FC-FC-Softmax)
- ResNet: very deep networks using residual connections
  - Deeper plain networks sometimes perform worse than shallow ones on both the training and test sets (so it is not caused by overfitting);
    they are simply much more difficult to train and optimize
  - But a deeper network should be able to do at least as well as a shallow one, by copying the shallow network and adding identity mappings — the basic idea of ResNet
  - Solution: use the layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping:
    `X (input) -> CONV -> ReLU -> CONV -> F(X)`,
    and instead of feeding F(X) to the next layer, feed F(X) + X. This way F(X) only models the residual, the delta from X (see the sketch below)
  - Architecture:
    - Stack residual blocks
    - Periodically downsample using stride 2
    - An additional CONV layer at the beginning and only one FC layer at the end
    - Also uses bottleneck layers to improve efficiency
  - Others: Batch Norm, Xavier/2 initialization, momentum, decays (weight, learning rate), no dropout
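A minimal numpy sketch of a residual block, with dense layers standing in for the real block's 3x3 convolutions (an illustrative simplification, not the actual ResNet layer):

```python
import numpy as np

def residual_block(x, W1, W2):
    f = np.maximum(0, W1.dot(x))   # first layer + ReLU
    f = W2.dot(f)                  # second layer produces the residual F(x)
    return np.maximum(0, f + x)    # identity shortcut, then ReLU
```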
- Other Architecture:
- Network in Network (NiN)
- Identity Mappings in Deep Residual Networks
- Wide Residual Networks
- Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)
- Deep Networks with Stochastic Depth
- FractalNet: Ultra-Deep Neural Networks without Residuals
- Densely Connected Convolutional Networks
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
Recurrent Neural Networks
- "Vanilla" Neural Networks: one to one
- Recurrent Neural Networks:
  one to one; one to many; many to one; many to many
- Recurrence formula: $h_t = f_W(h_{t-1}, x_t)$
```
x_t --> [RNN] --> y_t
          ^  |
          |__|   (the hidden state feeds back into itself)
```
  Note W is the same for every $h_t$
- Backpropagation through time
- Problem: if the input sequence is very long, backprop through the whole thing gets super slow
- Improvement: truncated backpropagation through time:
  run forward and backward through chunks of the sequence instead of the whole sequence;
  carry the hidden states forward in time forever, but only backpropagate for some smaller number of steps
- Example 1: (Vanilla) Recurrent Neural Network
  - $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, $y_t = W_{hy} h_t$ (see the step sketch below)
  - The unrolled form of the computational graph: the $x_i$'s, $y_i$'s, $h_i$'s, the shared $W$, the per-step losses $L_i$, and the total loss $L$
  - The one-to-many, many-to-one, many-to-many cases;
    also many-to-many can be seen as many-to-one + one-to-many (kind of like encoding and decoding)
  - Example: character-level language model
    - Train on a sequence of characters you want to model
    - Sampling: feed in a character, get an output distribution over the next character, pick one (e.g. by argmax of the probabilities), and feed it back into the model
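A minimal numpy sketch of one vanilla RNN step from the formula above:

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, W_hy):
    h = np.tanh(W_hh.dot(h_prev) + W_xh.dot(x))   # new hidden state
    y = W_hy.dot(h)                               # output at this step
    return h, y
```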
- Example 2: Image Captioning
  - Convolutional NN + Recurrent NN
  - For the CNN, we use the 4096-d output just before the final FC layers as the image input `v` to the RNN
  - For the RNN:
    - Special <START> and <END> tokens
    - The image vector enters the recurrence as an extra term
    - Word-level vocabulary
  - To train it, we can train the CNN and RNN together by passing the gradient down into the CNN
  - Improvement: with attention (soft & hard):
    the RNN focuses its attention on a different spatial location when generating each word (so the inputs to the RNN from t-1 and from the ConvNet differ at each step)
- Example 3: Visual Question Answering
- Multilayer RNN: stack several recurrent layers; the hidden-state sequence of one layer is the input sequence of the next
- LSTM
  - Exploding & vanishing gradients problem: during backprop through a vanilla RNN, we multiply by the same `W` over and over again
    - For exploding gradients, a hacky solution is to set a threshold and scale the gradient if its norm exceeds it (gradient clipping)
    - For vanishing gradients, change the RNN architecture ==> LSTM (which can help with both problems)
  - Formula (where $\sigma$ is the sigmoid function, and `W` is `4h x 2h` if the heights of `h, x` are both `h`):
    $\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}$,
    $c_t = f \odot c_{t-1} + i \odot g$, $h_t = o \odot \tanh(c_t)$,
    where $\odot$ is element-wise multiplication
  - Input gate (i): whether to write to the cell
    Forget gate (f): whether to erase the cell
    Gate gate (g): how much to write to the cell
    Output gate (o): how much to reveal the cell
    (where "cell" means c)
  - Backprop:
    consider the path through the cells (the c's): the gradient only gets element-wise multiplied by different forget gates (f), never repeatedly by W — and that's a lot nicer.
    And f is usually initialized near 1 to avoid vanishing at the beginning of training (see the step sketch below)
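A minimal numpy sketch of one LSTM step from the formula above (`W` is 4h x 2h and acts on the stacked `[h_prev; x]`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    n = h_prev.shape[0]
    a = W.dot(np.concatenate([h_prev, x]))  # all four gate pre-activations
    i = sigmoid(a[0:n])         # input gate: whether to write to the cell
    f = sigmoid(a[n:2*n])       # forget gate: whether to erase the cell
    o = sigmoid(a[2*n:3*n])     # output gate: how much to reveal the cell
    g = np.tanh(a[3*n:4*n])     # gate gate: how much to write to the cell
    c = f * c_prev + i * g      # the element-wise "gradient highway"
    h = o * np.tanh(c)
    return h, c
```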
- Highway Networks
- Other RNN variants
  - GRU
Detection, Segmentation and Localization
Semantic Segmentation
Want to produce a category label for each pixel of the input image
- Labels only pixels, with no notion of object instances: it does not distinguish two cows, for example — it just labels the cow pixels
- Idea 1: sliding window.
  Cut the image into crops and run a CNN on each crop to decide the central pixel's label.
  Very inefficient!
- Idea 2: fully convolutional.
  - Without FC layers, the final conv layer's output has depth equal to the number of classes and is viewed directly as per-pixel class scores; train on that.
    Super computationally expensive
  - Improvement:
    design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network
- In-network upsampling:
  - Unpooling
    - Nearest neighbor:
      for 2x2 stride 2, copy each element into its whole 2x2 block:
      `1 -> [[1, 1], [1, 1]]`
    - "Bed of nails" upsampling:
      for 2x2 stride 2, put each element in one corner of its 2x2 block and fill the rest with 0:
      `1 -> [[1, 0], [0, 0]]`
    - Max unpooling:
      bed-of-nails upsampling, but place each value at the position its max came from in the corresponding (symmetric) downsampling layer:
      `[[1, 2], [3, 5]] -> 5 -> ... -> 1 -> [[0, 0], [0, 1]]`
      (see the sketch below)
  - Transpose convolution:
    take one pixel of the input, multiply the filter by it, and add the result into the output matrix.
    So the stride is the filter's step length on the output matrix, and the padding is also applied to the output matrix
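A minimal numpy sketch of the two simplest unpooling schemes above on a 2x2 input:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# nearest-neighbor upsampling: copy each element into its whole 2x2 block
nn_up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# "bed of nails": each element goes in one corner of its block, zeros elsewhere
nails = np.zeros((4, 4), dtype=x.dtype)
nails[::2, ::2] = x
```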
Classification + Localization
- Treat localization as a regression problem, sharing the same second-to-last FC layer with classification.
  While one FC head generates the class scores, another FC head generates the box coordinates (x, y, w, h).
  So we have two loss functions: a softmax loss (or cross-entropy loss) and a regression loss (L2, L1, or others); we take their weighted sum (the weight is a hyperparameter) and use it for SGD.
  Fancier: generate boxes for every class and only use the true class's box in the regression loss
- Aside: the same "regress to a fixed set of locations" idea applies elsewhere:
  - Human pose estimation
Object Detection
- There might be a varying number of outputs per image
- Idea 1: as classification: sliding window.
  Apply a CNN to many different crops of the image; the CNN classifies each crop as an object class or background.
  Problem: we would need to apply the CNN to a huge number of locations and scales — very computationally expensive
- Idea 2: region proposals.
  Give regions where objects are likely to be found.
  This is not deep learning but traditional machine learning
- R-CNN: combine the two ideas
  - Steps:
    - Use region proposals to generate around 2K Regions of Interest (RoI)
    - Warp them into the fixed size the CNN wants
    - Run each of them through the same CNN
    - Do classification and localization using the CNN.
      Note that we can train the CNN to localize the object a little outside the RoI, e.g. when the RoI misses a person's head
  - Problem:
    - Still super slow, for both training and testing
- Fast R-CNN
  - Steps:
    - Forward the whole image through some conv layers to get a high-resolution convolutional feature map
    - Apply the region proposals to the convolutional feature map to get RoIs
    - An RoI pooling layer warps them into the fixed size the following layers want
    - Run them directly through some FC layers to do classification and localization
  - So fast that the region proposals become the speed bottleneck
- Faster R-CNN
  - Idea: make the CNN do the proposals. Insert a Region Proposal Network (RPN) into the CNN to predict proposals from the features
  - The RPN also does a kind of classification (over two classes: object or not) and bounding-box regression, so it has two loss functions too.
    In total we have 4 loss functions that we use at the same time (weighted sum) to train the network
- Detection without proposals: YOLO / SSD (You Only Look Once & Single Shot Detection)
  - Idea:
    - Divide the image into grid cells (R x L)
    - For each grid cell, draw several (say `B`) preset base boxes centered on it
    - Within each grid cell:
      - Regress from each of the B base boxes to a final box with 5 numbers: `(dx, dy, dh, dw, confidence)`
      - Predict scores for each of C classes (including background as a class)
    - Output: a tensor of shape R x L x (5 * B + C)
- Aside: Object Detection + Captioning = Dense Captioning
Instance Segmentation
Kind of like semantic segmentation + object detection
- Mask R-CNN:
  - Image -> CNN -> RoI Align -> two branches: one small conv network doing a mini semantic segmentation of the RoI; one doing classification and localization (and jointly, keypoint localization for the next point)
- Can also do pose estimation at the same time
Visualizing and Understanding
- First Layer (Generally Conv. layer) : Visualize Filters: Oriented edges and Opposing colors (Scale to 0~255)
- For intermediate layers: directly interpreting the weight matrices as gray-scale images does not give much understanding; we need fancier methods
- For the last hidden layer:
  - Nearest neighbors in feature space
  - Dimensionality reduction: reduce the high-D feature vectors to 2-D coordinates
    - PCA
    - t-SNE (t-distributed stochastic neighbor embedding)
- For intermediate layers: while visualizing the weight matrices is less useful, visualizing the activation maps of those layers is somewhat interpretable
- Maximally Activating Patches: Visualizing what types of patches from input images cause maximal activation in different neurons
- Occlusion Experiment: Mask part of the image before feeding to CNN, draw heatmap of probability at each mask location
- Saliency Maps ===> Segmentation without supervision
- (Guided) Backpropagation
- Gradient ascent
  - Use the gradient to change the pixels so as to maximize the output score of some neuron
  - Regularization:
    - L2 norm
    - Gaussian-blur the image (periodically)
    - Clip pixels with small values to 0 (periodically)
    - Clip pixels with small gradients to 0 (periodically)
  - Adding "multi-faceted" visualization (with more careful regularization, center bias)
- Fooling Images / Adversarial Examples (A whole lecture later)
- Start from an arbitrary image
- Pick an arbitrary class
- Modify the image to maximize the class
- Repeat until network is fooled
- DeepDream: Amplify existing features
- Forward: compute activations at chosen layer
- Set gradient of chosen layer equal to its activation
- Backward: compute gradient on image
- Update image
- Feature inversion: given a CNN feature vector for an image, find a new image that
  - matches the given feature vector
  - (starting from noise) "looks natural" (by using regularization)
- Total Variation Regularizer
- Texture Synthesis
- Nearest Neighbor (without neural network)
- Neural Texture Synthesis: Gram Matrix
- Neural Artwork Synthesis
- Neural style transfer: feature inversion + Gram matrix
  - Content image + style image
  - Play with the hyperparameters
  - Super slow.
    Solution: train another neural network to perform the style transfer for us
- Applying multi-scale processing to some of the above
Generative Models
- Unsupervised learning:
  learn some underlying hidden structure of the data (without labels).
  Examples: clustering, dimensionality reduction, feature learning, density estimation
- Generative models: given training data, generate new samples from the same distribution
  - Two main classes: explicit density & implicit density
PixelRNN and PixelCNN
- Explicit density model.
  Use the chain rule to decompose the likelihood of an image x into a product of 1-d distributions:
  $p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$,
  then maximize the likelihood of the training data
- The distribution over pixel values on the right of the equation is complex ==> express it with a neural network.
  We'll also need to define an ordering of the "previous pixels"
- PixelRNN:
  - Generate image pixels starting from a corner, moving right and down step by step
  - Dependency on previous pixels is modeled with an RNN (LSTM)
  - Drawback: sequential generation is slow
- PixelCNN:
  - Still generates image pixels starting from a corner, moving right and down
  - Dependency on previous pixels is now modeled with a CNN over a context region
  - Training: maximize the likelihood of the training images
  - Training is faster, but generation must still proceed sequentially, so it is still slow
Variational Autoencoders (VAE)
- Define an intractable density function with a latent variable z:
  $p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz$.
  We cannot optimize this directly; instead we derive and optimize a lower bound on the likelihood
- Some background: autoencoders
  - Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data (x -> z, with dimensionality reduction)
  - Encoder and decoder: linear + non-linearity, deep fully-connected, or ReLU CNN
  - Train such that the features can be used to reconstruct the original data —
    "autoencoding": encoding itself.
    Loss function: e.g. L2
  - After training, we can throw away the decoder layers and use z to train a supervised model (a classifier etc.)
- Variational autoencoders
  - Assume the training data is generated from an underlying unobserved (latent) representation `z`
  - Sample from a true prior $p_{\theta^*}(z)$, then sample from the true conditional $p_{\theta^*}(x \mid z_i)$.
    We want to estimate the true parameters $\theta^*$ of this generative model
  - How should we represent this model?
    Choose the prior p(z) to be simple, e.g. a Gaussian.
    The conditional p(x|z) is complex => represent it with a neural network (called the decoder network)
  - How to train the model?
    - Learn the model parameters to maximize the likelihood ==> the integral $\int$ is intractable
    - Solution: in addition to the decoder network modeling $p_\theta(x \mid z)$, define an additional encoder network $q_\phi(z \mid x)$ that approximates $p_\theta(z \mid x)$:
      decoder p: $z \mapsto \mu_{x|z}, \Sigma_{x|z}$, with $x \mid z \sim \mathcal{N}(\mu_{x|z}, \Sigma_{x|z})$
      encoder q: $x \mapsto \mu_{z|x}, \Sigma_{z|x}$, with $z \mid x \sim \mathcal{N}(\mu_{z|x}, \Sigma_{z|x})$
    - Then, for the likelihood, we get a lower bound on $\log p_\theta(x_i)$:
      $\mathcal{L}(x_i, \theta, \phi) = E_z[\log p_\theta(x_i \mid z)] - D_{KL}(q_\phi(z \mid x_i)\,\|\,p_\theta(z))$.
      When training, we maximize the lower bound: $\theta^*, \phi^* = \arg\max \sum_i \mathcal{L}(x_i, \theta, \phi)$
    - For a forward pass:
      - Input a minibatch of x
      - The encoder network $q_\phi(z \mid x)$ gives $\mu_{z|x}, \Sigma_{z|x}$
      - Sample `z` from $z \mid x \sim \mathcal{N}(\mu_{z|x}, \Sigma_{z|x})$
      - The decoder network $p_\theta(x \mid z)$ gives $\mu_{x|z}, \Sigma_{x|z}$
      - Sample `x|z` from $x \mid z \sim \mathcal{N}(\mu_{x|z}, \Sigma_{x|z})$
      - Compute $\mathcal{L}$ and maximize it (backprop); see the sketch below
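A minimal numpy sketch of the sampling step and the KL term of the lower bound, assuming a diagonal-Gaussian encoder that outputs a mean and a log-variance (the standard closed form, not spelled out in the notes):

```python
import numpy as np

def sample_z(mu, log_var):
    # reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
    eps = np.random.randn(*mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # closed-form D_KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
```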
-
  - For generating data:
    - Use only the decoder network
    - Now sample `z` from the prior ($z \sim \mathcal{N}(0, I)$)
  - Cons:
    - Maximizes a lower bound of the likelihood: okay, but not as good as the true likelihood
    - Samples are blurrier and lower quality
GANs (Generative Adversarial Networks)
- Don't work with an explicit distribution.
  We want to sample from a complex, high-dimensional training distribution.
  There is no direct way to do this ==> we want a transformation that maps a simple distribution onto it ==> a neural network.
  What we do is send random noise (a sample from a simple distribution) to a neural network and let it output a sample from the training distribution
- Two-player game:
  - Generator network: tries to fool the discriminator by generating real-looking images (from random noise)
  - Discriminator network: tries to distinguish between real and fake images
- Train jointly in a minimax game:
  $\min_{\theta_g} \max_{\theta_d} \left[ E_{x \sim p_{data}} \log D_{\theta_d}(x) + E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z))) \right]$,
  where $D_{\theta_d}$ is the discriminator output and $G_{\theta_g}(z)$ is the generated fake data
- Training GANs: alternate between
  - Gradient ascent on the discriminator:
    $\max_{\theta_d} \left[ E_{x \sim p_{data}} \log D_{\theta_d}(x) + E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z))) \right]$
  - Gradient descent on the generator: $\min_{\theta_g} E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z)))$.
    But this objective's slope is bad: training is slow when the generator is bad and fast when it is already good.
    So use gradient ascent on the generator instead: $\max_{\theta_g} E_{z \sim p(z)} \log D_{\theta_d}(G_{\theta_g}(z))$
- Jointly training two networks is challenging and can be unstable. An active research area
- Improving: convolutional architectures
- Interpolating in the latent space, and interpretability
- Cons:
  - Trickier / more unstable to train
  - Can't answer inference queries such as p(x), p(z|x)
Deep Reinforcement Learning
- Problems involving an agent interacting with an environment, which provides numeric reward signals. The goal is to learn how to take actions that maximize the reward.
  The environment gives a state `s_t` to the agent, the agent sends back an action `a_t`, then the environment returns a reward `r_t` and the next state `s_(t+1)`. Repeat until the environment emits a terminal state
- Examples:
- Cart-Pole Problem
- Robot Locomotion
- Atari Games
- Go
Markov Decision Process (MDP)
- The mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world
- An MDP is defined by $(\mathcal{S}, \mathcal{A}, \mathcal{R}, P, \gamma)$:
  - $\mathcal{S}$: set of possible states
  - $\mathcal{A}$: set of possible actions
  - $\mathcal{R}$: distribution of reward given a (state, action) pair
  - $P$: transition probability, i.e. distribution over the next state given a (state, action) pair
  - $\gamma$: discount factor (how much we value rewards coming up soon versus later on)
- The MDP loop:
  - At time step t = 0, the environment samples an initial state $s_0 \sim p(s_0)$
  - Then, for t = 0 until done:
    - The agent selects an action $a_t$
    - The environment samples a reward $r_t \sim \mathcal{R}(\cdot \mid s_t, a_t)$
    - The environment samples the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
    - The agent receives reward $r_t$ and next state $s_{t+1}$
  - A policy $\pi$ is a function from $\mathcal{S}$ to $\mathcal{A}$ that specifies what action to take in each state
  - Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$
- Example: Grid World
- How do we handle the randomness (initial state, transition probabilities, ...)?
  Solution: maximize the expected sum of rewards,
  with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$
Q-Learning
- Definitions: value function and Q-value function
  - The value function at state s is the expected cumulative reward from following the policy from state s:
    $V^\pi(s) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
  - The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
    $Q^\pi(s, a) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
- Bellman equation: the optimal Q-value function satisfies
  $Q^*(s, a) = E_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
- Solving for the optimal policy: value iteration (use the Bellman equation as an iterative update):
  $Q_{i+1}(s, a) = E\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$.
  $Q_i$ converges to $Q^*$ as $i \to \infty$
- Problem: not scalable. We must compute Q(s, a) for every state-action pair — computationally infeasible.
  Solution: use a function approximator to estimate Q(s, a), e.g. a neural network (see the tabular sketch below for the update itself)
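For intuition, a minimal numpy sketch of the tabular version of this iterative update (the scalable deep version replaces the table with a network):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Q: (num_states, num_actions) table; move Q(s, a) toward the Bellman target
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```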
- Deep Q-Learning
  - Q-learning: use a function approximator to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$.
    Deep Q-learning: the function approximator is a deep neural network
  - Forward pass —
    loss function: $L_i(\theta_i) = E_{s,a}\left[(y_i - Q(s, a; \theta_i))^2\right]$,
    where $y_i = E_{s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$
  - Backward pass —
    gradient update (with respect to the Q-function parameters $\theta$):
    $\nabla_{\theta_i} L_i = E\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$
- Example: Playing Atari Games
- Experience replay:
  - Learning from batches of consecutive samples is problematic:
    - samples are correlated => inefficient learning
    - the current Q-network parameters determine the next training samples => can lead to bad feedback loops
  - Address these problems with experience replay:
    - Continually update a replay memory table of transitions $(s_t, a_t, r_t, s_{t+1})$ as game (experience) episodes are played
    - Train the Q-network on random minibatches of transitions from the replay memory instead of consecutive samples
    - Each transition can also contribute to multiple weight updates => greater data efficiency
Policy Gradients
- If the state dimension becomes really huge, the Q-function can get very complicated
- Policy gradients:
  - Formally, define a class of parametrized policies $\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\}$
  - For each policy, define its value: $J(\theta) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\right]$
  - We want to find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$
    ==> gradient ascent on the policy parameters
- REINFORCE algorithm
  - Mathematically: $J(\theta) = E_{\tau \sim p(\tau; \theta)}[r(\tau)] = \int r(\tau)\, p(\tau; \theta)\, d\tau$,
    where $r(\tau)$ is the reward of a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots)$
  - Differentiated directly, $\nabla_\theta J(\theta)$ is intractable. However, we can use a nice trick:
    $\nabla_\theta p(\tau; \theta) = p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)$.
    Then $\nabla_\theta J(\theta) = E_{\tau}\left[r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right]$, which we can estimate by sampling trajectories
  - Can we compute those quantities without knowing the transition probabilities?
    We have: $p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$.
    Thus: $\log p(\tau; \theta) = \sum_{t \ge 0} \left[\log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right]$.
    And when differentiating, the transition terms drop out:
    $\nabla_\theta \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
- However, this also suffers from high variance because credit assignment is really hard
- Variance reduction:
  - First idea: push up the probability of an action only by the cumulative future reward from that state (see the helper sketch below)
  - Second idea: use a discount factor $\gamma$ to ignore delayed effects
  - Baseline: introduce a baseline function dependent on the state, and rescale the rewards relative to it.
    Choices of baseline:
    - a constant moving average of the rewards experienced so far across all trajectories
    - whether this action was better than the expected value of what we should get from that state —
      the Q-function and value function!
      Intuitively, we are happy with an action $a_t$ in a state $s_t$ if $Q^\pi(s_t, a_t) - V^\pi(s_t)$ is large.
      Define the advantage function $A(s, a) = Q^\pi(s, a) - V^\pi(s)$
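A minimal sketch of the "cumulative discounted future reward" used by the first two ideas above:

```python
def discounted_returns(rewards, gamma=0.99):
    # returns[t] = sum over k >= t of gamma**(k - t) * rewards[k]
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]
```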
- ===> Actor-Critic: combine Q-learning and policy gradients by training both an actor (the policy) and a critic (the Q-function)
- Examples:
- Recurrent Attention Model (RAM)
- fine-grained image recognition
- image captioning
- visual question-answering
- AlphaGo
Efficient Methods and Hardware for Deep Learning
Algorithms for Efficient Inference
- Pruning
- Weight Sharing
- Quantization
- Low Rank Approximation
- Binary / Ternary Net
- Winograd Transformation
Hardware for Efficient Inference
- TPU
- EIE
Algorithms for Efficient Training
- Parallelization
- Mixed Precision with FP16 and FP32
- Model Distillation
- DSD: Dense-Sparse-Dense Training
Hardware for Efficient Training
Adversarial Examples and Adversarial Training
- Adversarial Examples:
- Fooling image as we mentioned before
- Not just for neural networks
- Understand why it happens:
- First idea: caused by overfitting — wrong,
  because the effect is not random; it is systematic across pictures and across different trained models
- Second idea: caused by underfitting —
  excessive linearity forces points far from the decision boundary to be classified with high confidence, even where we actually don't have any inputs at all
- The Fast Gradient Sign Method
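A minimal numpy sketch of the fast gradient sign step; `grad_x` (the gradient of the loss with respect to the input pixels) is assumed to come from the model's backward pass:

```python
import numpy as np

def fgsm(x, grad_x, epsilon=0.007):
    # nudge every pixel in the direction that increases the loss
    x_adv = x + epsilon * np.sign(grad_x)
    return np.clip(x_adv, 0.0, 1.0)   # keep pixels in a valid range
```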
- The Clever Hans Theory
- RBFs: resist the problem, but they are very shallow and become extremely difficult to train when made deeper to increase their accuracy
- Transferability Attack
- Enhancing With Ensembles
- The adversarial examples for humans (optical illusions) and for current neural networks are very different, with almost no overlap,
  which means the mechanisms of the two are hugely different
- Adversarial Training
- Can improve a bit, but still long way to go
- Semi-supervised learning