- History & Brief introduction
- Image classification pipeline
- Loss Function and Optimization
- Backpropagation and Neural Networks
- Convolutional Neural Networks (ConvNet)
- Training Neural Networks
- Deep Learning Hardware and Software
- CNN Architectures
- Recurrent Neural Networks
- Detection, Segmentation and Localization
- Visualizing and Understanding
- Generative Models
- Deep Reinforcement Learning
- Efficient Methods and Hardware for Deep Learning
- Adversarial Examples and Adversarial Training
History & Brief introduction
- Animal vision development
- The invention of cameras
- Hubel & Wiesel, 1959: how the visual processing mechanism of mammals works
- CV start
- David Marr: proposing a mechanism
- Primal Sketch: edges, bars, ends, virtual lines, etc.
- 2.5-D Sketch: piece together the surfaces, depth, etc.
- 3-D Model Representation
- Generalized Cylinder & Pictorial Structure:
- Every object can be composed of simple geometric configurations
- David Lowe:
- Using lines and edges to represent objects
- Vision too hard — Start from image segmentation
- Narrowing to a specific task & using machine learning — real-time face detection
- Feature-based object recognition: "SIFT"
- Histogram of Gradients & Deformable Part Model
- PASCAL Visual Object Challenge — test the progress we have made in object recognition
- Overfit problem
- ImageNet; ImageNet Large Scale Visual Recognition Challenge: whether the first five predicted labels contain the correct one
- 2012: huge drop of the error rate — the use of CNNs
- Other challenges: object detection, action classification, image captioning, etc.
- The original CNN first appeared in 1998
- Why it only became popular in recent years:
- Moore’s law & Increasing computation
- Data explosion
- Other challenges beyond plain object recognition:
- Movement recognition
- 3-D reconstruction
- Recognition of every single pixel of the image
- Understand the story behind the image, which means truly understanding picture
Image classification pipeline
- The problem: the semantic gap:
  the huge gap between the semantic idea of a cat (or anything else) and the huge matrix of pixel values that the computer is actually seeing.
- Challenges:
- View point
- Illumination
- Deformation
- Occlusion
- Background clutter
- Intraclass variation
- Attempts: find edges, corners, etc.
  Very hard & hardly scalable
- Some points of view:
  - original pictures
  - images as points in a high-dimensional space
kNN: An example of a simple classification method
- Data-driven approach: Split the API into two functions, one train, one predict
- Collect a dataset of images and labels
- Use Machine Learning to train a classifier
- Evaluate the classifier on new images
- Exercise: Nearest Neighbor (CIFAR10 dataset)
  - Train step: memorize all pictures
  - Predict step: compare and find the training picture most similar to the test picture
  - What is "similar"? L1 (Manhattan) distance: $d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$
```python
import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    def train(self, X, y):
        # X is N x D where each row is an example; y is 1-D of size N containing the labels
        self.X_train = X
        self.y_train = y

    def predict(self, X):
        num_test = X.shape[0]
        Y_predict = np.zeros(num_test, dtype=self.y_train.dtype)
        for i in range(num_test):
            # L1 distance of the i-th test image to every training image
            distances = np.sum(np.abs(self.X_train - X[i, :]), axis=1)
            min_index = np.argmin(distances)       # closest training example
            Y_predict[i] = self.y_train[min_index]
        return Y_predict
```
- Problem: Train O(1), Predict O(N) — the opposite of what we want: slow training is acceptable, but prediction must be fast
- K-Nearest Neighbors (a small improvement over the above):
  - take a majority vote among the K closest points (see the sketch after this list)
  - use L2 (Euclidean) distance instead: $d_2(I_1, I_2) = \sqrt{\sum_p (I_1^p - I_2^p)^2}$
  - BTW: L1 depends on the choice of coordinate system, so it may fit better when the individual entries of your vector have specific meanings for your task
  - BTW: by just changing the distance metric, the method can be applied to many kinds of tasks
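A minimal numpy sketch of the k-NN predict step for one test point (a standalone helper I am assuming here, not part of the lecture code):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # L2 distances from the test point to every training example
    dists = np.sqrt(np.sum((X_train - x) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k closest points
    labels = y_train[nearest]
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]          # majority vote
```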
- Hyperparameters (like the K value and the distance metric above): very problem-dependent and mostly impossible to learn from data, so most people just try values out.
  - Idea 1: choose the ones that work best on the training data.
    Terrible (in the example above, K = 1 is always perfect on the training data); never do it.
  - Idea 2: split the data into train and test parts; choose the ones that work best on the test data.
    Terrible, never do it: we would have no idea how the algorithm will perform on new data.
  - Idea 3: split the data into three parts: train, validation, test. Train on the train set, choose hyperparameters on the validation set, then run on the test set only once at the very end.
    Better.
  - Idea 4: cross-validation. Split the data into folds, try each fold as the validation set and average the results (accuracy), then proceed as in Idea 3.
    Even better, but not used often in practice.
Linear classification: Another example
- Parametric approach: input image x, weight parameters W; we compose a function f(x, W) that outputs 10 numbers giving class scores (as above with CIFAR10, so 10 categories in total)
- Linear Classifier: $f(x, W) = W \times x + b$
- CIFAR10: x is 32 x 32 x 3, reshaped to 3072 x 1
- W 10 x 3072
- b: 10 x 1. To understand b: if the training data has more cats than dogs, it is reasonable to expect b_cat to be larger than b_dog
- Can be considered as a template matching method. Each row of W is a template of the category
- Limitation: only one template per class; when the images within a class vary a lot, the classifier is forced to average them all into that single template
- Hard cases for a linear classifier: Consider f as a line in high-D space
- parity problem: e.g. quadrants 1 & 3 vs. quadrants 2 & 4; such as classifying an odd vs. even number of people or animals
- multimodal: islands of one class inside other classes
- Score function: $s = f(x; W) = Wx$
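A minimal numpy sketch of the linear score function with the CIFAR10 shapes above (random weights, purely for illustration):

```python
import numpy as np

x = np.random.randn(3072)               # one 32 x 32 x 3 image, flattened
W = 0.0001 * np.random.randn(10, 3072)  # one template per row
b = np.zeros(10)

s = W.dot(x) + b                        # 10 class scores
predicted_class = np.argmax(s)
```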
Loss Function and Optimization
- Todo from last time:
- Define a loss function that quantifies our unhappiness with the scores across the training data
- Come up with a way of efficiently finding the parameters that minimize the loss function — optimization
Loss Function
- Definition: tells how good our current classifier is.
- Ways of understanding:
- To set a loss func, what you actually do is to tell the computer what kinds of errors you care about more and what kinds of errors you can kind of accept — the linear and square func is an example
- Example 1: Multiclass SVM (Support Vector Machine) loss (hinge loss)
  - Formula: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
    (where $\Delta$ is a sort of safety margin we set)
  - Min 0; max infinity
  - Trick: at the beginning we choose W randomly, so all scores should be ≈ 0 and approximately equal, hence $L_i \approx ClassNum - 1$
  - If we square the term inside the sum, we get a different loss function and hopefully a different W — a trick we can use when designing our own loss functions
  - If we find a zero-loss W, it is definitely not unique — 2W also gives zero loss
  - That's because we told the classifier to find a W that fits the training data. But in practice we don't really care about that; we really care about the classifier's performance on test data!
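A minimal numpy sketch of the hinge loss above for a single example (with Δ = 1; names are illustrative):

```python
import numpy as np

def svm_loss_i(s, y_i, delta=1.0):
    # s: vector of class scores, y_i: index of the correct class
    margins = np.maximum(0, s - s[y_i] + delta)
    margins[y_i] = 0          # do not count the correct class itself
    return np.sum(margins)
```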
- Regularization: encourage the model to choose a "simple" W — to solve the problem above:
  - $L = \frac{1}{N}\sum_i L_i + \lambda R(W)$, i.e. data loss + lambda * regularization loss
  - $\lambda$ is a very important hyperparameter
  - Examples: L2, L1, Elastic net ($\beta$ L2 + L1)
    - Important difference: what each one considers "complex"
    - L2 prefers to spread the influence across all pixels, while L1 prefers to concentrate it
- Example 2: Softmax loss (multinomial logistic regression)
  - Softmax function: normalized probabilities of the classes: $P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$
  - Softmax loss: the true class's probability should be close to 1: $L_i = -\log P(Y = y_i \mid X = x_i)$
  - Min 0; max infinity.
    But to reach 0 we would need an infinite score on the correct class and minus-infinite scores on the others, so we never get exactly zero loss
  - Trick: when all scores are ≈ 0 and approximately equal, $L_i \approx -\log(1/ClassNum) = \log(ClassNum)$
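A minimal numpy sketch of the softmax loss above for a single example, with the usual max-shift for numerical stability (an assumption of this sketch, not discussed in the notes):

```python
import numpy as np

def softmax_loss_i(s, y_i):
    s = s - np.max(s)                   # shift so the largest score is 0
    p = np.exp(s) / np.sum(np.exp(s))   # normalized class probabilities
    return -np.log(p[y_i])              # -log probability of the true class
```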
- Contrast between the two loss functions:
  - How we choose to interpret the scores to measure the badness
  - What they want to do:
    - SVM just wants the correct class's score to be much bigger than the others';
    - Softmax wants it to go towards infinity and the others towards minus infinity
Optimization
- Strategy: follow the slope — the gradient, which points in the direction of greatest increase
- Numerical & analytic gradients: the numerical one to debug, the analytic one to train
- Gradient Descent:
- Step size (also called Learning rate): the single most important hyperparameter!
- Update rules
- Stochastic Gradient Descent (SGD): because N is super large, instead of using all N inputs we use a minibatch of samples (see the sketch below)
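A minimal sketch of the minibatch SGD loop, assuming a hypothetical `loss_and_grad(W, X_batch, y_batch)` that returns the loss and its gradient:

```python
import numpy as np

def sgd(W, X, y, loss_and_grad, learning_rate=1e-3, batch_size=256, steps=1000):
    for _ in range(steps):
        idx = np.random.choice(X.shape[0], batch_size)  # sample a minibatch
        loss, dW = loss_and_grad(W, X[idx], y[idx])
        W -= learning_rate * dW                         # step down the gradient
    return W
```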
- Image features and the motivation: in practice, we prefer not to fit the raw pixels but the "features" we extract from them
- Example 1: Color Histogram
- Example 2: Histogram of Oriented Gradients(HoG, Edges direction and distribution)
- Example 3: Bag of Words(Define visual words)
Backpropagation and Neural Networks
Backpropagation
- Computational graph: like a flowchart; each node represents an original variable or an operation
- A simple example:
f(x,y,z) = (x + y) * z
- Draw the graph:
```
x --\
     (+)--q--\
y --/         (*)--> f
z ------------/
```
- Assign every intermediate variable a name: q = x + y, so f = q * z
- To get the gradients of f, we start at the end and go back; at each step we compute $\partial f / \partial v$ (v an intermediate or original variable), using the chain rule — see the worked example below
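A worked numeric version of this graph in plain Python (using the lecture's usual values x = -2, y = 5, z = -4):

```python
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y            # q = 3
f = q * z            # f = -12

# backward pass, starting from df/df = 1 and applying the chain rule
df_dz = q            # * gate: gradient w.r.t. z is the other input -> 3
df_dq = z            # -> -4
df_dx = df_dq * 1.0  # + gate distributes the upstream gradient -> -4
df_dy = df_dq * 1.0  # -> -4
```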
- Backpropagation:
  - Local node: some inputs `x, y`, a simple operation `f`, some outputs `z`.
    The node is only aware of its immediate surroundings, i.e. the things above.
  - Local gradient: for a local node we can get the gradients between the inputs and the output, such as $\partial z/\partial x$, $\partial z/\partial y$.
    Multiply them by the gradient arriving from behind the node ($\partial f_{final}/\partial z$, called the upstream gradient) to get the gradient of the final output with respect to this node's inputs.
    If there is more than one output, i.e. more than one upstream gradient, add them up as the node's total upstream gradient.
  - Going back from the end of the computational graph to the beginning, we obtain the full gradient.
- We can group several simple operations `f` together into a slightly more complex node, as long as we can directly write down its local gradient
  - Such as the sigmoid function $\sigma(x) = \frac{1}{1 + e^{-x}}$, whose local gradient is $(1 - \sigma(x))\,\sigma(x)$
  - Can be viewed as a trade-off between how much math we derive by hand and how much the graph computes numerically
- Patterns in backward flow:
- Add gate: gradient distributor;
- Max gate: gradient router;
- Mul (multiply) gate: gradient switcher
- Gradients for vectorized code (which means that x, y, z are vectors instead of numbers):
- All the local gradients and upstream gradients are Jacobian matrices; basically we just apply the chain rule
- Can compute element-wise or write it out in vectorized (matrix) multiply form
- If the final output is a number, not a vector, each full gradient has the same shape as its corresponding input
- What if the output were a vector? The gradient with respect to a matrix would then be a 3-D tensor; but gradient descent takes the gradient of the loss function, not of the hypothesis, and the loss outputs a number, never a vector
Neural Networks
- First see it as a function, without the brain stuff:
- Before: Linear score function:
f = W * x
- Now: 2-layer Neural Network:
f = W_2 * max(0, W_1 * x)
- The non-linearity that the max function brings is very important and necessary
- Understanding it from the template point of view:
  `W_1` is a collection of many templates, and `W_2` recombines the templates into class scores
- We can make it even deeper by repeating "add a non-linearity, then another linear layer" (see the sketch below)
- Inspiration from neural system
- Non-linearity: called activation function
Such as sigmoid, Leaky ReLU, tanh, Maxout, ReLU, ELU
- The architectures: "3-layer Neural Network" or "2-hidden-layer Neural Network"
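A minimal numpy sketch of the 2-layer network above (shapes chosen for CIFAR10 with a 100-unit hidden layer; illustrative only):

```python
import numpy as np

def two_layer_net(x, W1, W2):
    h = np.maximum(0, W1.dot(x))   # ReLU non-linearity between the linear layers
    return W2.dot(h)               # class scores

x = np.random.randn(3072)
W1 = 0.01 * np.random.randn(100, 3072)
W2 = 0.01 * np.random.randn(10, 100)
scores = two_layer_net(x, W1, W2)
```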
Convolutional Neural Networks (ConvNet)
- A bit of history
From a function perspective (without the brain stuff)
- Recall: Fully Connected Layer
- Convolution Layer
- Instead of stretching the image to a 1-D vector, we keep its structure — preserve spatial structure
- Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".
  Note that the filters always extend the full depth of the input volume
- To get the dot product, we stretch the filter into a 1-D vector `w` and the overlapping chunk of the image into a 1-D vector `x`, then get a number output via `w^T x + b`.
  Or consider it as overlapping, then element-wise multiplication and summation
- To do the "slide": center the filter on top of every pixel of the input volume, do the dot product, and fill the resulting number into the corresponding position of the output matrix (2-D for a 3-D input), called an activation map
- Apply several filters (of the same shape) to the same input, and we get the same number of activation maps out
- A ConvNet is a sequence of convolutional layers, interspersed with activation functions
- To understand a ConvNet: going deeper along the network (usually the depth of the volumes also grows), the layers can be seen as looking for features from simple to complex,
  e.g. from local features to lines, then to corners and blobs, then to assemblies that resemble objects
- To understand a filter, we can also see it as a template. By visualizing the filter we can find out what pattern it is looking for, i.e. what kind of input produces the largest output
- Stride: the step size with which the filter slides over the input.
  Output size: (N - F + 2·pad) / stride + 1 for input size N and filter size F
- Zero padding: to make it possible to center the filter at the corners of the input, we pad with zeros.
  With a stride of 1 we can then maintain the output size, preventing it from shrinking
- 1 x 1 convolution layers make perfect sense: they mix information along the depth dimension (see the sketch below)
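A minimal numpy sketch of the output-size arithmetic and a naive single-filter convolution (loop-based, no padding, for illustration; real implementations vectorize this):

```python
import numpy as np

def conv_output_size(N, F, stride=1, pad=0):
    return (N - F + 2 * pad) // stride + 1   # (N - F + 2P) / S + 1

def naive_conv(x, w, b, stride=1):
    """x: H x W x D input volume; w: F x F x D filter; b: scalar bias.
    Returns one 2-D activation map."""
    H, W_in, _ = x.shape
    F = w.shape[0]
    out = np.zeros(((H - F) // stride + 1, (W_in - F) // stride + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            chunk = x[i * stride:i * stride + F, j * stride:j * stride + F, :]
            out[i, j] = np.sum(chunk * w) + b   # dot product over the full depth
    return out
```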
The brain view of a convolution layer
- The local input region a neuron looks at (the filter's extent) is also called its receptive field
Other layers in the ConvNet
- Pooling layers (downsampling)
  - make the representations smaller and more manageable
  - operate over each activation map independently
  - act only spatially; do nothing to the depth
  - Example: max pooling
  - Instead of pooling, we can also use strided convolutions to do the same thing — downsampling
- ReLU or Other non-linearities: the same as Neural Networks
- Fully Connected Layer (FC layer)
Training Neural Networks
Activation Functions
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- Squashes numbers to range [0, 1]
- Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron
- Three problems:
  - Saturated neurons "kill" the gradients:
    when inputs are very far from 0, the local gradient becomes 0, which is bad during backprop
  - Outputs are not zero-centered:
    the gradients on w will then all be positive or all negative, forcing zig-zag updates and making descent less efficient
  - exp() is a bit compute-expensive (less important)
- tanh:
  - Squashes numbers to range [-1, 1]
  - Zero-centered
  - Still, saturated neurons "kill" the gradients
- Rectified Linear Unit (ReLU):
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
- Actually more biologically plausible than sigmoid
- Problems:
  - Not zero-centered output
  - Saturates in the negative region ("dead ReLU"):
    can happen with a badly initialized weight matrix or a learning rate that is too large.
    To reduce the chance of dead ReLUs, people like to initialize with slightly positive biases (e.g. 0.01)
- Leaky ReLU:
- Does not saturate (in + region)
- Very computationally efficient
- Converges much faster than sigmoid/tanh in practice
- Will not “die”
- PReLU: $f(x) = \max(\alpha x, x)$, with $\alpha$ as a parameter learned during training
- Exponential Linear Units (ELU):
- All benefits of ReLU
- Closer to zero mean outputs
- Negative saturation regime compared with Leaky ReLU adds some robustness to noise
- But requires computing exp()
- Maxout "neuron": $\max(w_1^T x + b_1, w_2^T x + b_2)$
  - Does not saturate, does not die
  - Doubles the number of parameters
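A minimal numpy sketch of the activation functions above:

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))    # squashes to [0, 1]
def tanh(x):       return np.tanh(x)                   # squashes to [-1, 1]
def relu(x):       return np.maximum(0, x)
def leaky_relu(x): return np.where(x > 0, x, 0.01 * x)
def elu(x, a=1.0): return np.where(x > 0, x, a * (np.exp(x) - 1))
```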
Data Preprocessing
- Zero-mean & Normalize
- Subtract the mean image (e.g. AlexNet) or the per-channel mean (e.g. VGGNet).
  Important because if the data is not zero-centered, the loss function is too sensitive to small perturbations
- We don't really normalize further in image processing (pixel values are already on a comparable scale)
- PCA & whitening (more complex; usually not done in image processing)
- Whatever preprocessing we apply to the training set, we must apply the exact same computation, with the exact same statistics, to the test set
Weight initialization
- First idea: small random numbers:
  `w = 0.01 * np.random.randn(D, H)`
  - Works well in small networks, but causes problems as networks get deeper: the activations shrink toward zero layer by layer, and the gradients vanish with them
- Xavier initialization:
  `W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)`
  - Works well with tanh
  - To fit ReLU (since it kills half of the inputs), one possible modification is
    `W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)`
- Still a research area;
  MSRA (He) initialization is another option you can use
Batch Normalization
- Idea: we want to keep activations in a gaussian range
- To make each dimension of a batch of activations at some layer unit-gaussian, apply:
  $\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
- Usually inserted after FC layers (per dimension) or convolutional layers (jointly, per activation map) and before the non-linearity
- We then allow the network to squash and shift the range if it wants to:
  $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$.
  Notice the network can learn $\gamma^{(k)} = \sqrt{Var[x^{(k)}]}$, $\beta^{(k)} = E[x^{(k)}]$ to recover the identity mapping
- Benefits:
- Improves gradient flow through the network
- Allows higher learning rates
- Kind of more robust
- Reduces the strong dependence on initialization
- Act as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
- Note: at test time the BatchNorm layer functions differently. The mean/std are not computed from the batch; instead, a single fixed empirical mean/std of activations from training is used (e.g. estimated during training with running averages). See the sketch below
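A minimal numpy sketch of the training-time BatchNorm forward pass above (at test time, `mu` and `var` would be replaced by the running averages collected during training):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: (N, D) batch; normalize each dimension to be unit gaussian
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta    # learnable squash (gamma) and shift (beta)
```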
Babysitting the Learning Process
- The starting steps when applying a ConvNet:
  - Step 1: Preprocess the data
  - Step 2: Choose the architecture.
    Notable: double-check that the initial loss is reasonable (the sanity-check trick from the loss-function lecture; also the loss should go up when regularization is enabled)
  - Step 3: Try to train
    - Tip: first make sure you can overfit a very small portion of the training data very well (very small loss, accuracy 1.00)
    - Start with small regularization and find a learning rate that makes the loss go down (not going down: too low; exploding: too high)
    TODO
Hyperparameter Optimization
- Cross-validation strategy (in stages):
  - First stage: only a few epochs, to get a rough idea of what parameters work
  - Second stage: longer running time, finer search
  - Repeat as necessary
  - Tip for spotting what doesn't work: if the cost ever exceeds 3 × the original cost, break out early
  - Note it's best to optimize in log space (`10**uniform(s, e)`)
  - Compared to grid search, random search is actually better, because it gives a better sense of what the function looks like (see the sketch at the end of this section)
- Some special situation:
- loss flat for a while, then going down: maybe bad initialization
- big gap between train and validation accuracy: maybe overfitting
- no gap between them: consider increase capacity
- Track the ratio of weight updates / weight magnitudes: around 0.001 or so is good
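A minimal sketch of random search with log-space sampling, assuming a hypothetical `train_and_eval(lr, reg)` that returns validation accuracy:

```python
import numpy as np

def random_search(train_and_eval, trials=100):
    best_params, best_acc = None, -1.0
    for _ in range(trials):
        lr = 10 ** np.random.uniform(-6, -1)    # sample in log space
        reg = 10 ** np.random.uniform(-5, 1)
        acc = train_and_eval(lr, reg)
        if acc > best_acc:
            best_params, best_acc = (lr, reg), acc
    return best_params, best_acc
```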
Fancier Optimization
- Problems with SGD:
  - If the loss function changes quickly in one direction and slowly in another (high condition number: the ratio of largest to smallest singular value of the Hessian),
    SGD makes very slow progress along the shallow dimension and jitters in a zig-zag along the steep direction.
    In high dimensions this happens more often
  - May get stuck in a local minimum or at a saddle point.
    In high dimensions the problem is more about saddle points than local minima — and not exactly at the saddle point but near it, where the gradient is tiny
  - Our gradients come from minibatches, so they can be noisy
- Minor improvement: SGD + Momentum
  - Build up "velocity" as a running mean of gradients, and move in the direction of the velocity
  - `rho` gives the "friction", i.e. how much of the old velocity contributes to the next one; typically rho = 0.9 or 0.99
  - The formula: $v_{t+1} = \rho v_t + \nabla f(x_t)$; $x_{t+1} = x_t - \alpha v_{t+1}$
  - Solves all three problems!
  - Nesterov Momentum:
    instead of taking the gradient at the point we are standing on, we first step in the direction of the current velocity, take the gradient there, then come back and compute the new velocity.
    Another form: $v_{t+1} = \rho v_t - \alpha \nabla f(x_t + \rho v_t)$; $x_{t+1} = x_t + v_{t+1}$
- AdaGrad:
  - Keep a running sum of the squares of the gradients, and divide the step by its square root:
    `grad_squared += dx * dx; x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)`
  - As training goes on, `grad_squared` keeps growing, so the effective learning rate keeps shrinking — good for convex problems; otherwise we may still get stuck near a saddle point
- Improvement: RMSProp
  - Instead of `grad_squared += dx * dx`, we do
    `grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx`,
    where `decay_rate` is commonly 0.9 or 0.99
- Adam algorithm
  - Roughly RMSProp + Momentum:
    `first_moment = beta1 * first_moment + (1 - beta1) * dx`
    `second_moment = beta2 * second_moment + (1 - beta2) * dx * dx`
    `x -= learning_rate * first_moment / (np.sqrt(second_moment) + 1e-7)`
  - But if we initialize `second_moment = 0`, and because `beta2` is typically 0.9 or 0.99 (near 1), the second moment stays small for the early steps, and dividing by it produces a huge step. That's a problem
  - To solve it: add bias-correction terms (t is the current time step):
    `first_unbias = first_moment / (1 - beta1 ** t)`
    `second_unbias = second_moment / (1 - beta2 ** t)`
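Putting the pieces together, a minimal numpy sketch of one bias-corrected Adam step:

```python
import numpy as np

def adam_step(x, dx, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * dx         # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * dx * dx    # second moment (RMSProp-like)
    m_hat = m / (1 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v
```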
Problem with all of these: they are axis-aligned. That is to say, if we rotate the picture, the algorithm cannot really un-rotate it
- Learning rate decay: start with a relatively large learning rate and then decay it during training
- Step decay: decay the learning rate by a factor every few epochs.
  Exponential decay: $\alpha = \alpha_0 e^{-kt}$.
  1/t decay: $\alpha = \alpha_0 / (1 + kt)$
- Common with SGD + Momentum, less common with Adam
- Learning rate decay is a second-order hyperparameter: tune the initial rate first
- All of the above are first-order optimization.
  Second-order optimization: Newton's method. But $H^{-1}$ is too big, so there are methods that approximate the Hessian
  (they don't actually handle the stochastic case very nicely, and tend not to work very well on non-convex problems):
  - Quasi-Newton methods
  - L-BFGS
Regularization
- Everything above is about reducing training error, but we don't really care about that; what we really care about is reducing the gap between performance on the training set and on unseen samples (the test set)
- Model ensembles: tend to improve results by ~2%
  - Train multiple independent models
  - At test time, average their results
  - Tricks:
    - Instead of training independent models, use multiple snapshots of a single model taken during training
    - Polyak averaging: instead of using the final parameter vector, keep a moving average of the parameter vector over training and use that at test time
- Regularizations: To improve single-model performance
- Dropout: in each forward pass, randomly set some neurons to zero; every forward pass zeroes a different random set of neurons
  - The dropout probability is a hyperparameter; 0.5 is common
  - Often applied to FC layers;
    sometimes also to convolutional layers, where instead of dropping single neurons we might drop entire feature maps at random
  - Why this works:
    - Helps prevent co-adaptation of features, i.e. prevents the network from spreading a decision across too many "overlapping" features
    - Can also be viewed as training a large ensemble of models
- At test time, we want to wipe out the randomness:
  - Consider a single neuron with inputs `x, y` and output `a`.
    During training with dropout probability 1/2 we have
    $E[a] = \frac{1}{4}(w_1 x + w_2 y) + \frac{1}{4}(w_1 x) + \frac{1}{4}(w_2 y) + \frac{1}{4}\cdot 0 = \frac{1}{2}(w_1 x + w_2 y)$
  - So at test time, we just run the model without dropout and then multiply the output by the dropout probability
  - Inverted dropout: instead of multiplying by the probability at test time, we divide by it during training (to make test time more efficient); see the sketch below
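A minimal numpy sketch of inverted dropout, dividing by p at training time so test time needs no change:

```python
import numpy as np

p = 0.5  # probability of keeping a neuron

def dropout_train(h):
    mask = (np.random.rand(*h.shape) < p) / p   # scale here instead of at test
    return h * mask

def dropout_test(h):
    return h                                    # nothing to do at test time
```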
- Common pattern of Regularization:
- Training: add some kind of randomness
- Testing: average out randomness (or approximatly)
- More examples:
- Batch Normalization
- Data augmentation: train on random transformations of the original pictures (flips, crops, scales, color jitter, rotations, shearing, etc.). Consider how we can transform the data without changing its label
- DropConnect
- Fractional Max Pooling
- Stochastic Depth
Transfer Learning
- Another point of view on overfitting: you don't have enough data
- Steps of transfer learning:
  - Train on a large enough, similar dataset
  - Freeze the hidden layers, then reinitialize and train the last FC layers on the small dataset you want to apply the model to
  - When you have a larger dataset, maybe consider updating larger parts of the network
  - Note that you probably want to drop the learning rate when you do the transfer
- Pretraining and fine-tuning are very commonly used in many areas
Deep Learning Hardware and Software
CPU vs. GPU
- GPU: (DL prefers NVIDIA)
  - much larger number of cores (more than 100×), but each core is much slower (lower clock speed) and "dumber"
  - has its own fairly large RAM and its own caching system
  - great for highly parallel tasks with high similarity between work items
- Example: Matrix Multiplication
- Programming GPUs
- CUDA (NVIDIA only)
- OpenCL (runs on anything, but usually slower)
- CUDA (NVIDIA only)
- Data Transfer problem (from disk to GPUs). Solutions:
- Read all data into RAM (if your data is relatively small)
- Use SSD instead of HDD
- Use multiple CPU threads to prefetch data
Deep Learning Frameworks
TensorFlow
- Rough Process:
- First define the computational graph;
- Then run the graph many times
- Detailed process:
  - Placeholders (step 'P'): `x = tf.placeholder(tf.float32, shape=(rx, lx))`; `y, w1, w2` similar
  - Define the forward pass: compute the prediction for y and the loss — `h, y_pred, loss` (no computation happens; this just builds the graph)
  - Tell TF to compute the gradients of the loss (step 'T'): `grad_w1, grad_w2` (still just building the graph)
  - Enter a session so we can actually run the graph (step 'E'): `with tf.Session() as sess:`
  - Create numpy arrays that will fill in the placeholders above: `values = {x: ..., w1: ..., ...}`
  - Run the graph (step 'R'): feed in the numpy arrays; get the outputs we want:
    `out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)`
    `loss_val, grad_w1_val, grad_w2_val = out`
  - Use the gradients to update the weights and repeat step 'R'
- Problem: copying weights between CPU/GPU at each step is expensive! Solution:
  - Instead of placeholders, use Variables for `w1, w2` at step 'P' and tell TF how we want them initialized (still not actually initialized yet):
    `w1 = tf.Variable(tf.random_normal((rw, lw)))`
  - Before step 'E' and after step 'T' (step 'B'), add assign operations that update `w1, w2` as part of the computational graph:
    `new_w1 = w1.assign(w1 - learning_rate * grad_w1)`
  - After step 'E':
```python
sess.run(tf.global_variables_initializer())
values = {x: np.random.randn(rx, lx), y: np.random.randn(ry, ly)}
for t in range(TotStep):
    loss_val, = sess.run([loss], feed_dict=values)
```
- But this will not update the weights, because we only told TF to compute the loss, not to run the updates. Solution:
  - A funny way of doing so:
    - Add a dummy node (which returns `None`) at step 'B': `updates = tf.group(new_w1, new_w2)`
    - Change the last line to `loss_val, _ = sess.run([loss, updates], feed_dict=values)`
  - The way TF provides:
    - Instead of steps 'T' and 'B', just use an optimizer to compute the gradients and update the weights:
      `optimizer = tf.train.GradientDescentOptimizer(1e-5)` (1e-5 is the learning rate)
      `updates = optimizer.minimize(loss)`
    - Change the last line to `loss_val, _ = sess.run([loss, updates], feed_dict=values)`
- TF also packs layers for us: after setting up `x, y`, we can just call the following to define layers without declaring the `w`'s and `b`'s ourselves:
```python
init = tf.contrib.layers.xavier_initializer()
h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu, kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=lx, kernel_initializer=init)
```
- Keras: High-Level Wrapper
- Other High-Level Wrappers: TFLearn, TensorLayer, Sonnet, etc.
- Pretrained models, TensorBoard, distributed version
- Like Theano
PyTorch
- Three Levels of Abstraction
- Tensor: Imperative ndarray, but runs on GPU (like the Numpy array for TF)
- Variable: Node in a computational graph; stores data and gradient (like Tensor, Variable, Placeholder for TF)
- Module: A neural network layer; may store state or learnable weight (like tf.layers, or TFSlim, or TFLearn, or Sonnet, or… for TF)
- To run Tensors on the GPU, we need a special datatype:
  `dtype = torch.cuda.FloatTensor`, then `x = torch.randn(rx, lx).type(dtype)` for Tensors.
  For Variables: `x = torch.autograd.Variable(torch.randn(rx, lx), requires_grad=False)`, which then contains `x.data` (the Tensor) and `x.grad` (the gradient).
  The Tensor and Variable APIs are the same;
  but with Variables we can simply call `loss.backward()` to compute the gradients (don't forget to zero the gradients first, since `.backward()` accumulates)
- The nn package, the optim package, and DataLoaders
- We can define our own new autograd functions and new modules in nn
- Pretrained Models, Visdom
- Update from Torch
- Static vs Dynamic Graphs
- Def:
- Static: TF — we first build the graph, then run it
- Dynamic: PyTorch — we just define Tensors/Variables, and the graph is built on the fly inside the loop as the code runs
- Optimization: Static ones can allow the computer to optimize the graph for you before it runs
- Serialization: Static ones once built, can serialize it and run it without the code that built the graph (which allow us to change coding environment)
- Conditional operations: dynamic graphs end up much cleaner; for static graphs you need special TF control flow (like `tf.cond`)
- Loops: like `y_t = (y_(t-1) + x_t) * w`, where sequences may even differ in length. Again, dynamic graphs are cleaner; for static graphs you need special TF control flow (in this case `tf.foldl`)
- TensorFlow Fold: Dynamic graphs in TF
Caffe & Caffe2
CNN Architectures
- Review: LeNet 5
- AlexNet:
- Architectures:
CONV1 (96 x 11 x 11 (x 3); stride 4 pad 0 => 35K parameters);
MAX POOL1 (3 x 3 (x 1); stride 2 => No parameter);
NORM1;
CONV2 (256 x 5 x 5 (x 48); stride 1 pad 2);
MAX POOL2 (3 x 3 (x 1); stride 2 => No parameter);
NORM2;
CONV3 (384 x 3 x 3 (x 128); stride 1 pad 1);
CONV4 (384 x 3 x 3 (x 192); stride 1 pad 1);
CONV5 (256 x 3 x 3 (x 192); stride 1 pad 1);
MAX POOL3 (3 x 3 (x 1); stride 2 => No parameter);
FC6 (4096 neurons);
FC7 (4096 neurons);
FC8 (1000 neurons);
Softmax
- First use of ReLU;
  heavy data augmentation;
  dropout; momentum; batches; decay (weight & learning rate); ensembles; etc.
- ZFNet: Change of hyperparameters
- VGGNet: small filters, deeper networks (VGG16, VGG19)
  - only 3 x 3 CONV stride 1 pad 1 & 2 x 2 MAXPOOL stride 2.
    That leads to deeper networks with more non-linearities and fewer parameters (a stack of three 3 x 3 convs sees the same region as one 7 x 7 conv)
  - No NORM layers
  - Most memory is in the early CONV layers;
    most parameters are in the late FC layers (but not the last one)
- GoogLeNet: deeper network, with computational efficiency
  - No FC layers => only 5M parameters
  - Efficient "Inception" module:
    apply parallel filter operations (CONVs of different sizes, and pooling) to the same input from the previous layer; concatenate all filter outputs together depth-wise.
    Usually CONV 1x1, 3x3, 5x5 and POOL 3x3
  - Problem: expensive to compute, and the depth always grows because the pooling branch preserves it. Solution:
    use 1x1 convs (bottleneck layers) to project the depth to a lower dimension (before the 3x3 and 5x5 CONVs, and after the POOL)
  - Architecture:
    - Stem network (Conv-Pool-2x Conv-Pool) + stacked Inception modules + classifier output (without the expensive FC stack: only one FC layer to produce the class scores)
    - Auxiliary classification outputs inject additional gradient into the lower layers (AvgPool-1x1Conv-FC-FC-Softmax)
- ResNet: very deep networks using residual connections
  - Deeper plain networks sometimes perform worse than shallow ones on both the training and test sets (so it is not caused by overfitting);
    they are simply much more difficult to train and optimize
  - But a deeper network should be able to do at least as well as a shallow one, by copying the shallow network and adding identity mappings — the basic idea of ResNet
  - Solution: use the layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping:
    `X (input) -> CONV -> ReLU -> CONV -> F(X)`,
    and instead of feeding F(X) to the next layer, feed F(X) + X. This way F(X) only models the residual, the delta from X (see the sketch below)
  - Architecture:
    - Stack residual blocks
    - Periodically downsample using stride 2
    - An additional CONV layer at the beginning and only one FC layer at the end
    - Also uses bottleneck layers to improve efficiency
  - Others: Batch Norm, Xavier/2 initialization, momentum, decays (weight, learning rate), no dropout
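A minimal numpy sketch of a residual block, with dense layers standing in for the real block's 3x3 convolutions (an illustrative simplification, not the actual ResNet layer):

```python
import numpy as np

def residual_block(x, W1, W2):
    f = np.maximum(0, W1.dot(x))   # first layer + ReLU
    f = W2.dot(f)                  # second layer produces the residual F(x)
    return np.maximum(0, f + x)    # identity shortcut, then ReLU
```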
- Other Architecture:
- Network in Network (NiN)
- Identity Mappings in Deep Residual Networks
- Wide Residual Networks
- Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)
- Deep Networks with Stochastic Depth
- FractalNet: Ultra-Deep Neural Networks without Residuals
- Densely Connected Convolutional Networks
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size
Recurrent Neural Networks
- "Vanilla" Neural Networks: one to one
- Recurrent Neural Networks:
  one to one; one to many; many to one; many to many
- Recurrence formula: $h_t = f_W(h_{t-1}, x_t)$
```
x_t --> [RNN] --> y_t
          ^  |
          |__|   (the hidden state feeds back into itself)
```
  Note W is the same for every $h_t$
- Backpropagation through time
- Problem: if the input sequence is very long, backprop through the whole thing gets super slow
- Improvement: truncated backpropagation through time:
  run forward and backward through chunks of the sequence instead of the whole sequence;
  carry the hidden states forward in time forever, but only backpropagate for some smaller number of steps
- Example 1: (Vanilla) Recurrent Neural Network
  - $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, $y_t = W_{hy} h_t$ (see the step sketch below)
  - The unrolled form of the computational graph: the $x_i$'s, $y_i$'s, $h_i$'s, the shared $W$, the per-step losses $L_i$, and the total loss $L$
  - The one-to-many, many-to-one, many-to-many cases;
    also many-to-many can be seen as many-to-one + one-to-many (kind of like encoding and decoding)
  - Example: character-level language model
    - Train on a sequence of characters you want to model
    - Sampling: feed in a character, get an output distribution over the next character, pick one (e.g. by argmax of the probabilities), and feed it back into the model
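A minimal numpy sketch of one vanilla RNN step from the formula above:

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, W_hy):
    h = np.tanh(W_hh.dot(h_prev) + W_xh.dot(x))   # new hidden state
    y = W_hy.dot(h)                               # output at this step
    return h, y
```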
- Example 2: Image Captioning
  - Convolutional NN + Recurrent NN
  - For the CNN, we use the 4096-d output just before the final FC layers as the image input `v` to the RNN
  - For the RNN:
    - Special <START> and <END> tokens
    - The image vector enters the recurrence as an extra term
    - Word-level vocabulary
  - To train it, we can train the CNN and RNN together by passing the gradient down into the CNN
  - Improvement: with attention (soft & hard):
    the RNN focuses its attention on a different spatial location when generating each word (so the inputs to the RNN from t-1 and from the ConvNet differ at each step)
- Example 3: Visual Question Answering
- Multilayer RNN: stack several recurrent layers; the hidden-state sequence of one layer is the input sequence of the next
- LSTM
  - Exploding & vanishing gradients problem: during backprop through a vanilla RNN, we multiply by the same `W` over and over again
    - For exploding gradients, a hacky solution is to set a threshold and scale the gradient if its norm exceeds it (gradient clipping)
    - For vanishing gradients, change the RNN architecture ==> LSTM (which can help with both problems)
  - Formula (where $\sigma$ is the sigmoid function, and `W` is `4h x 2h` if the heights of `h, x` are both `h`):
    $\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix}$,
    $c_t = f \odot c_{t-1} + i \odot g$, $h_t = o \odot \tanh(c_t)$,
    where $\odot$ is element-wise multiplication
  - Input gate (i): whether to write to the cell
    Forget gate (f): whether to erase the cell
    Gate gate (g): how much to write to the cell
    Output gate (o): how much to reveal the cell
    (where "cell" means c)
  - Backprop:
    consider the path through the cells (the c's): the gradient only gets element-wise multiplied by different forget gates (f), never repeatedly by W — and that's a lot nicer.
    And f is usually initialized near 1 to avoid vanishing at the beginning of training (see the step sketch below)
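A minimal numpy sketch of one LSTM step from the formula above (`W` is 4h x 2h and acts on the stacked `[h_prev; x]`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    n = h_prev.shape[0]
    a = W.dot(np.concatenate([h_prev, x]))  # all four gate pre-activations
    i = sigmoid(a[0:n])         # input gate: whether to write to the cell
    f = sigmoid(a[n:2*n])       # forget gate: whether to erase the cell
    o = sigmoid(a[2*n:3*n])     # output gate: how much to reveal the cell
    g = np.tanh(a[3*n:4*n])     # gate gate: how much to write to the cell
    c = f * c_prev + i * g      # the element-wise "gradient highway"
    h = o * np.tanh(c)
    return h, c
```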
- Highway Networks
- Other RNN variants
  - GRU
Detection, Segmentation and Localization
Semantic Segmentation
Want to produce a category label for each pixel of the input image
- Labels only pixels, with no notion of object instances: it does not distinguish two cows, for example — it just labels the cow pixels
- Idea 1: sliding window.
  Cut the image into crops and run a CNN on each crop to decide the central pixel's label.
  Very inefficient!
- Idea 2: fully convolutional.
  - Without FC layers, the final conv layer's output has depth equal to the number of classes and is viewed directly as per-pixel class scores; train on that.
    Super computationally expensive
  - Improvement:
    design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network
- In-network upsampling:
  - Unpooling
    - Nearest neighbor:
      for 2x2 stride 2, copy each element into its whole 2x2 block:
      `1 -> [[1, 1], [1, 1]]`
    - "Bed of nails" upsampling:
      for 2x2 stride 2, put each element in one corner of its 2x2 block and fill the rest with 0:
      `1 -> [[1, 0], [0, 0]]`
    - Max unpooling:
      bed-of-nails upsampling, but place each value at the position its max came from in the corresponding (symmetric) downsampling layer:
      `[[1, 2], [3, 5]] -> 5 -> ... -> 1 -> [[0, 0], [0, 1]]`
      (see the sketch below)
  - Transpose convolution:
    take one pixel of the input, multiply the filter by it, and add the result into the output matrix.
    So the stride is the filter's step length on the output matrix, and the padding is also applied to the output matrix
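A minimal numpy sketch of the two simplest unpooling schemes above on a 2x2 input:

```python
import numpy as np

x = np.array([[1, 2],
              [3, 4]])

# nearest-neighbor upsampling: copy each element into its whole 2x2 block
nn_up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# "bed of nails": each element goes in one corner of its block, zeros elsewhere
nails = np.zeros((4, 4), dtype=x.dtype)
nails[::2, ::2] = x
```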
Classification + Localization
- Treat localization as a regression problem, sharing the same second-to-last FC layer with classification.
  While one FC head generates the class scores, another FC head generates the box coordinates (x, y, w, h).
  So we have two loss functions: a softmax loss (or cross-entropy loss) and a regression loss (L2, L1, or others); we take their weighted sum (the weight is a hyperparameter) and use it for SGD.
  Fancier: generate boxes for every class and only use the true class's box in the regression loss
- Aside: the same "regress to a fixed set of locations" idea applies elsewhere:
  - Human pose estimation
Object Detection
- There might be a varying number of outputs per image
- Idea 1: as classification: sliding window.
  Apply a CNN to many different crops of the image; the CNN classifies each crop as an object class or background.
  Problem: we would need to apply the CNN to a huge number of locations and scales — very computationally expensive
- Idea 2: region proposals.
  Give regions where objects are likely to be found.
  This is not deep learning but traditional machine learning
- R-CNN: combine the two ideas
  - Steps:
    - Use region proposals to generate around 2K Regions of Interest (RoI)
    - Warp them into the fixed size the CNN wants
    - Run each of them through the same CNN
    - Do classification and localization using the CNN.
      Note that we can train the CNN to localize the object a little outside the RoI, e.g. when the RoI misses a person's head
  - Problem:
    - Still super slow, for both training and testing
- Fast R-CNN
  - Steps:
    - Forward the whole image through some conv layers to get a high-resolution convolutional feature map
    - Apply the region proposals to the convolutional feature map to get RoIs
    - An RoI pooling layer warps them into the fixed size the following layers want
    - Run them directly through some FC layers to do classification and localization
  - So fast that the region proposals become the speed bottleneck
- Faster R-CNN
  - Idea: make the CNN do the proposals. Insert a Region Proposal Network (RPN) into the CNN to predict proposals from the features
  - The RPN also does a kind of classification (over two classes: object or not) and bounding-box regression, so it has two loss functions too.
    In total we have 4 loss functions that we use at the same time (weighted sum) to train the network
- Detection without proposals: YOLO / SSD (You Only Look Once & Single Shot Detection)
  - Idea:
    - Divide the image into grid cells (R x L)
    - For each grid cell, draw several (say `B`) preset base boxes centered on it
    - Within each grid cell:
      - Regress from each of the B base boxes to a final box with 5 numbers: `(dx, dy, dh, dw, confidence)`
      - Predict scores for each of C classes (including background as a class)
    - Output: a tensor of shape R x L x (5 * B + C)
- Aside: Object Detection + Captioning = Dense Captioning
Instance Segmentation
Kind of like semantic segmentation + object detection
- Mask R-CNN:
  - Image -> CNN -> RoI Align -> two branches: one small conv network doing a mini semantic segmentation of the RoI; one doing classification and localization (and jointly, keypoint localization for the next point)
- Can also do pose estimation at the same time
Visualizing and Understanding
- First Layer (Generally Conv. layer) : Visualize Filters: Oriented edges and Opposing colors (Scale to 0~255)
- For intermediate layers: directly interpreting the weight matrices as gray-scale images does not give much understanding; we need fancier methods
- For the last hidden layer:
  - Nearest neighbors in feature space
  - Dimensionality reduction: reduce the high-D feature vectors to 2-D coordinates
    - PCA
    - t-SNE (t-distributed stochastic neighbor embedding)
- For intermediate layers: while visualizing the weight matrices is less useful, visualizing the activation maps of those layers is somewhat interpretable
- Maximally Activating Patches: Visualizing what types of patches from input images cause maximal activation in different neurons
- Occlusion Experiment: Mask part of the image before feeding to CNN, draw heatmap of probability at each mask location
- Saliency Maps ===> Segmentation without supervision
- (Guided) Backpropagation
- Gradient ascent
  - Use the gradient to change the pixels so as to maximize the output score of some neuron
  - Regularization:
    - L2 norm
    - Gaussian-blur the image (periodically)
    - Clip pixels with small values to 0 (periodically)
    - Clip pixels with small gradients to 0 (periodically)
  - Adding "multi-faceted" visualization (with more careful regularization, center bias)
- Fooling Images / Adversarial Examples (A whole lecture later)
- Start from an arbitrary image
- Pick an arbitrary class
- Modify the image to maximize the class
- Repeat until network is fooled
- DeepDream: Amplify existing features
- Forward: compute activations at chosen layer
- Set gradient of chosen layer equal to its activation
- Backward: compute gradient on image
- Update image
- Feature inversion: given a CNN feature vector for an image, find a new image that
  - matches the given feature vector
  - (starting from noise) "looks natural" (by using regularization)
- Total Variation Regularizer
- Texture Synthesis
- Nearest Neighbor (without neural network)
- Neural Texture Synthesis: Gram Matrix
- Neural Artwork Synthesis
- Neural style transfer: feature inversion + Gram matrix
  - Content image + style image
  - Play with the hyperparameters
  - Super slow.
    Solution: train another neural network to perform the style transfer for us
- Applying multi-scale processing to some of the above
Generative Models
- Unsupervised learning:
  learn some underlying hidden structure of the data (without labels).
  Examples: clustering, dimensionality reduction, feature learning, density estimation
- Generative models: given training data, generate new samples from the same distribution
  - Two main classes: explicit density & implicit density
PixelRNN and PixelCNN
- Explicit density model.
  Use the chain rule to decompose the likelihood of an image x into a product of 1-d distributions:
  $p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1})$,
  then maximize the likelihood of the training data
- The distribution over pixel values on the right of the equation is complex ==> express it with a neural network.
  We'll also need to define an ordering of the "previous pixels"
- PixelRNN:
  - Generate image pixels starting from a corner, moving right and down step by step
  - Dependency on previous pixels is modeled with an RNN (LSTM)
  - Drawback: sequential generation is slow
- PixelCNN:
  - Still generates image pixels starting from a corner, moving right and down
  - Dependency on previous pixels is now modeled with a CNN over a context region
  - Training: maximize the likelihood of the training images
  - Training is faster, but generation must still proceed sequentially, so it is still slow
Variational Autoencoders (VAE)
- Define an intractable density function with a latent variable z:
  $p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz$.
  We cannot optimize this directly; instead we derive and optimize a lower bound on the likelihood
- Some background: autoencoders
  - Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data (x -> z, with dimensionality reduction)
  - Encoder and decoder: linear + non-linearity, deep fully-connected, or ReLU CNN
  - Train such that the features can be used to reconstruct the original data —
    "autoencoding": encoding itself.
    Loss function: e.g. L2
  - After training, we can throw away the decoder layers and use z to train a supervised model (a classifier etc.)
- Variational autoencoders
  - Assume the training data is generated from an underlying unobserved (latent) representation `z`
  - Sample from a true prior $p_{\theta^*}(z)$, then sample from the true conditional $p_{\theta^*}(x \mid z_i)$.
    We want to estimate the true parameters $\theta^*$ of this generative model
  - How should we represent this model?
    Choose the prior p(z) to be simple, e.g. a Gaussian.
    The conditional p(x|z) is complex => represent it with a neural network (called the decoder network)
  - How to train the model?
    - Learn the model parameters to maximize the likelihood ==> the integral $\int$ is intractable
    - Solution: in addition to the decoder network modeling $p_\theta(x \mid z)$, define an additional encoder network $q_\phi(z \mid x)$ that approximates $p_\theta(z \mid x)$:
      decoder p: $z \mapsto \mu_{x|z}, \Sigma_{x|z}$, with $x \mid z \sim \mathcal{N}(\mu_{x|z}, \Sigma_{x|z})$
      encoder q: $x \mapsto \mu_{z|x}, \Sigma_{z|x}$, with $z \mid x \sim \mathcal{N}(\mu_{z|x}, \Sigma_{z|x})$
    - Then, for the likelihood, we get a lower bound on $\log p_\theta(x_i)$:
      $\mathcal{L}(x_i, \theta, \phi) = E_z[\log p_\theta(x_i \mid z)] - D_{KL}(q_\phi(z \mid x_i)\,\|\,p_\theta(z))$.
      When training, we maximize the lower bound: $\theta^*, \phi^* = \arg\max \sum_i \mathcal{L}(x_i, \theta, \phi)$
    - For a forward pass:
      - Input a minibatch of x
      - The encoder network $q_\phi(z \mid x)$ gives $\mu_{z|x}, \Sigma_{z|x}$
      - Sample `z` from $z \mid x \sim \mathcal{N}(\mu_{z|x}, \Sigma_{z|x})$
      - The decoder network $p_\theta(x \mid z)$ gives $\mu_{x|z}, \Sigma_{x|z}$
      - Sample `x|z` from $x \mid z \sim \mathcal{N}(\mu_{x|z}, \Sigma_{x|z})$
      - Compute $\mathcal{L}$ and maximize it (backprop); see the sketch below
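A minimal numpy sketch of the sampling step and the KL term of the lower bound, assuming a diagonal-Gaussian encoder that outputs a mean and a log-variance (the standard closed form, not spelled out in the notes):

```python
import numpy as np

def sample_z(mu, log_var):
    # reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)
    eps = np.random.randn(*mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    # closed-form D_KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian
    return -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
```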
-
  - For generating data:
    - Use only the decoder network
    - Now sample `z` from the prior ($z \sim \mathcal{N}(0, I)$)
  - Cons:
    - Maximizes a lower bound of the likelihood: okay, but not as good as the true likelihood
    - Samples are blurrier and lower quality
GANs (Generative Adversarial Networks)
- Don't work with an explicit distribution.
  We want to sample from a complex, high-dimensional training distribution.
  There is no direct way to do this ==> we want a transformation that maps a simple distribution onto it ==> a neural network.
  What we do is send random noise (a sample from a simple distribution) to a neural network and let it output a sample from the training distribution
- Two-player game:
  - Generator network: tries to fool the discriminator by generating real-looking images (from random noise)
  - Discriminator network: tries to distinguish between real and fake images
- Train jointly in a minimax game:
  $\min_{\theta_g} \max_{\theta_d} \left[ E_{x \sim p_{data}} \log D_{\theta_d}(x) + E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z))) \right]$,
  where $D_{\theta_d}$ is the discriminator output and $G_{\theta_g}(z)$ is the generated fake data
- Training GANs: alternate between
  - Gradient ascent on the discriminator:
    $\max_{\theta_d} \left[ E_{x \sim p_{data}} \log D_{\theta_d}(x) + E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z))) \right]$
  - Gradient descent on the generator: $\min_{\theta_g} E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z)))$.
    But this objective's slope is bad: training is slow when the generator is bad and fast when it is already good.
    So use gradient ascent on the generator instead: $\max_{\theta_g} E_{z \sim p(z)} \log D_{\theta_d}(G_{\theta_g}(z))$
- Jointly training two networks is challenging and can be unstable. An active research area
- Improving: convolutional architectures
- Interpolating in the latent space, and interpretability
- Cons:
  - Trickier / more unstable to train
  - Can't answer inference queries such as p(x), p(z|x)
Deep Reinforcement Learning
- Problems involving an agent interacting with an environment, which provides numeric reward signals. The goal is to learn how to take actions that maximize the reward.
  The environment gives a state `s_t` to the agent, the agent sends back an action `a_t`, then the environment returns a reward `r_t` and the next state `s_(t+1)`. Repeat until the environment emits a terminal state
- Examples:
- Cart-Pole Problem
- Robot Locomotion
- Atari Games
- Go
Markov Decision Process (MDP)
- The mathematical formulation of the RL problem
- Markov property: the current state completely characterises the state of the world
- An MDP is defined by $(\mathcal{S}, \mathcal{A}, \mathcal{R}, P, \gamma)$:
  - $\mathcal{S}$: set of possible states
  - $\mathcal{A}$: set of possible actions
  - $\mathcal{R}$: distribution of reward given a (state, action) pair
  - $P$: transition probability, i.e. distribution over the next state given a (state, action) pair
  - $\gamma$: discount factor (how much we value rewards coming up soon versus later on)
- The MDP loop:
  - At time step t = 0, the environment samples an initial state $s_0 \sim p(s_0)$
  - Then, for t = 0 until done:
    - The agent selects an action $a_t$
    - The environment samples a reward $r_t \sim \mathcal{R}(\cdot \mid s_t, a_t)$
    - The environment samples the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$
    - The agent receives reward $r_t$ and next state $s_{t+1}$
  - A policy $\pi$ is a function from $\mathcal{S}$ to $\mathcal{A}$ that specifies what action to take in each state
  - Objective: find the policy $\pi^*$ that maximizes the cumulative discounted reward $\sum_t \gamma^t r_t$
- Example: Grid World
- How do we handle the randomness (initial state, transition probabilities, ...)?
  Solution: maximize the expected sum of rewards,
  with $s_0 \sim p(s_0)$, $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim p(\cdot \mid s_t, a_t)$
Q-Learning
- Definitions: value function and Q-value function
  - The value function at state s is the expected cumulative reward from following the policy from state s:
    $V^\pi(s) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$
  - The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
    $Q^\pi(s, a) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$
- Bellman equation: the optimal Q-value function satisfies
  $Q^*(s, a) = E_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$
- Solving for the optimal policy: value iteration (use the Bellman equation as an iterative update):
  $Q_{i+1}(s, a) = E\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$.
  $Q_i$ converges to $Q^*$ as $i \to \infty$
- Problem: not scalable. We must compute Q(s, a) for every state-action pair — computationally infeasible.
  Solution: use a function approximator to estimate Q(s, a), e.g. a neural network (see the tabular sketch below for the update itself)
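For intuition, a minimal numpy sketch of the tabular version of this iterative update (the scalable deep version replaces the table with a network):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Q: (num_states, num_actions) table; move Q(s, a) toward the Bellman target
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```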
- Deep Q-Learning
  - Q-learning: use a function approximator to estimate the action-value function, $Q(s, a; \theta) \approx Q^*(s, a)$.
    Deep Q-learning: the function approximator is a deep neural network
  - Forward pass —
    loss function: $L_i(\theta_i) = E_{s,a}\left[(y_i - Q(s, a; \theta_i))^2\right]$,
    where $y_i = E_{s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$
  - Backward pass —
    gradient update (with respect to the Q-function parameters $\theta$):
    $\nabla_{\theta_i} L_i = E\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$
- Example: Playing Atari Games
- Experience replay:
  - Learning from batches of consecutive samples is problematic:
    - samples are correlated => inefficient learning
    - the current Q-network parameters determine the next training samples => can lead to bad feedback loops
  - Address these problems with experience replay:
    - Continually update a replay memory table of transitions $(s_t, a_t, r_t, s_{t+1})$ as game (experience) episodes are played
    - Train the Q-network on random minibatches of transitions from the replay memory instead of consecutive samples
    - Each transition can also contribute to multiple weight updates => greater data efficiency
Policy Gradients
- If the state dimension becomes really huge, the Q-function can get very complicated
- Policy gradients:
  - Formally, define a class of parametrized policies $\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\}$
  - For each policy, define its value: $J(\theta) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\right]$
  - We want to find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$
    ==> gradient ascent on the policy parameters
- REINFORCE algorithm
  - Mathematically: $J(\theta) = E_{\tau \sim p(\tau; \theta)}[r(\tau)] = \int r(\tau)\, p(\tau; \theta)\, d\tau$,
    where $r(\tau)$ is the reward of a trajectory $\tau = (s_0, a_0, r_0, s_1, \ldots)$
  - Differentiated directly, $\nabla_\theta J(\theta)$ is intractable. However, we can use a nice trick:
    $\nabla_\theta p(\tau; \theta) = p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)$.
    Then $\nabla_\theta J(\theta) = E_{\tau}\left[r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right]$, which we can estimate by sampling trajectories
  - Can we compute those quantities without knowing the transition probabilities?
    We have: $p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$.
    Thus: $\log p(\tau; \theta) = \sum_{t \ge 0} \left[\log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right]$.
    And when differentiating, the transition terms drop out:
    $\nabla_\theta \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
- However, this also suffers from high variance because credit assignment is really hard
- Variance reduction:
  - First idea: push up the probability of an action only by the cumulative future reward from that state (see the helper sketch below)
  - Second idea: use a discount factor $\gamma$ to ignore delayed effects
  - Baseline: introduce a baseline function dependent on the state, and rescale the rewards relative to it.
    Choices of baseline:
    - a constant moving average of the rewards experienced so far across all trajectories
    - whether this action was better than the expected value of what we should get from that state —
      the Q-function and value function!
      Intuitively, we are happy with an action $a_t$ in a state $s_t$ if $Q^\pi(s_t, a_t) - V^\pi(s_t)$ is large.
      Define the advantage function $A(s, a) = Q^\pi(s, a) - V^\pi(s)$
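A minimal sketch of the "cumulative discounted future reward" used by the first two ideas above:

```python
def discounted_returns(rewards, gamma=0.99):
    # returns[t] = sum over k >= t of gamma**(k - t) * rewards[k]
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return out[::-1]
```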
- ===> Actor-Critic: combine Q-learning and policy gradients by training both an actor (the policy) and a critic (the Q-function)
- Examples:
- Recurrent Attention Model (RAM)
- fine-grained image recognition
- image captioning
- visual question-answering
- AlphaGo
Efficient Methods and Hardware for Deep Learning
Algorithms for Efficient Inference
- Pruning
- Weight Sharing
- Quantization
- Low Rank Approximation
- Binary / Ternary Net
- Winograd Transformation
Hardware for Efficient Inference
- TPU
- EIE
Algorithms for Efficient Training
- Parallelization
- Mixed Precision with FP16 and FP32
- Model Distillation
- DSD: Dense-Sparse-Dense Training
Hardware for Efficient Training
Adversarial Examples and Adversarial Training
- Adversarial Examples:
- Fooling image as we mentioned before
- Not just for neural networks
- Understand why it happens:
- First idea: caused by overfitting — wrong,
  because the effect is not random; it is systematic across pictures and across different trained models
- Second idea: caused by underfitting —
  excessive linearity forces points far from the decision boundary to be classified with high confidence, even where we actually don't have any inputs at all
- The Fast Gradient Sign Method
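A minimal numpy sketch of the fast gradient sign step; `grad_x` (the gradient of the loss with respect to the input pixels) is assumed to come from the model's backward pass:

```python
import numpy as np

def fgsm(x, grad_x, epsilon=0.007):
    # nudge every pixel in the direction that increases the loss
    x_adv = x + epsilon * np.sign(grad_x)
    return np.clip(x_adv, 0.0, 1.0)   # keep pixels in a valid range
```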
- The Clever Hans Theory
- RBFs: resist the problem, but they are very shallow and become extremely difficult to train when made deeper to increase their accuracy
- Transferability Attack
- Enhancing With Ensembles
- The adversarial examples for humans (optical illusions) and for current neural networks are very different, with almost no overlap,
  which means the mechanisms of the two are hugely different
- Adversarial Training
- Can improve a bit, but still long way to go
- Semi-supervised learning