CS231n Learning Notes


History & Brief introduction

  1. Animal vision development
  2. The invention of cameras
  3. Hubel & Wiesel, 1959: what the mammalian visual processing mechanism looks like
  4. The start of computer vision

  5. David Marr: proposing a mechanism
    • Primal Sketch: edges, bars, ends, virtual lines, etc.
    • 2.5-D Sketch: piece together the surfaces, depth, etc.
    • 3-D Model Representation
  6. Generalized Cylinder & Pictorial Structure:
    • Every object can be composed of simple geometric configurations
  7. David Lowe:
    • Using lines and edges to represent objects

  8. Vision too hard — start from image segmentation
  9. Specialized & using machine learning — real-time face detection
  10. Feature-based object recognition: ‘SIFT’
  11. Histogram of Gradients & Deformable Part Model

  12. PASCAL Visual Object Challenge — to test the progress we have made in object recognition
  13. Overfit problem
  14. ImageNet; ImageNet Large Scale Visual Recognition Challenge: whether the first five predicted labels contain the correct one
  15. 2012: huge drop in error rate — the use of CNNs
  16. Other challenges: object detection, action classification, image captioning, etc.

  17. The first appearance of the original CNN — 1998
  18. Why it only became popular in recent years
    • Moore’s law & Increasing computation
    • Data explosion

  19. Other challenges beyond just object recognition:
    • Motion recognition
    • 3-D reconstruction
    • Recognition of every single pixel of the image (segmentation)
    • Understanding the story behind the image, which means truly understanding the picture


Image classification pipeline

  1. The problem: Semantic gap:
    The huge gap between the semantic idea of a cat (or anything else) and the huge matrix of pixel values that the computer actually sees.
  2. Challenges:
    • Viewpoint
    • Illumination
    • Deformation
    • Occlusion
    • Background clutter
    • Intraclass variation
  3. Attempts: find edges, corners, etc.
    Very hard & not scalable

  4. Some points of view:
    • original pictures
    • images as points in a high-dimensional space

KNN: An example of a simple classification method

  1. Data-driven approach: Split the API into two functions, one train, one predict
    • Collect a dataset of images and labels
    • Use Machine Learning to train a classifier
    • Evaluate the classifier on new images
  2. Exercise: Nearest Neighbors(CIFAR10 dataset)
    • Train step: memorize all the pictures

    • Predict step: compare and find the most similar one to the test picture

    • What is ‘similar’? L1(Manhattan) distance:

        import numpy as np
              
        class NearestNeighbor:
            def __init__(self):
                pass
              
            def train(self, X, y):
                # X is N x D where each row is an example.
                # y is a 1-D array of size N containing the labels
                self.X_train = X
                self.y_train = y
                  
            def predict(self, X):
                num_test = X.shape[0]
                Y_predict = np.zeros(num_test, dtype=self.y_train.dtype)
              
                for i in range(num_test):
                    # L1 distance to every training example
                    distances = np.sum(np.abs(self.X_train - X[i, :]), axis=1)
                    min_index = np.argmin(distances)
                    Y_predict[i] = self.y_train[min_index]
              
                return Y_predict
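
      A minimal usage sketch (hypothetical small random arrays; for real CIFAR10, each row would be a flattened 32 x 32 x 3 = 3072-value image):

        X_train = np.random.randn(500, 3072)          # 500 hypothetical training images
        y_train = np.random.randint(0, 10, size=500)  # their labels
        X_test = np.random.randn(10, 3072)            # 10 hypothetical test images

        nn = NearestNeighbor()
        nn.train(X_train, y_train)    # just memorizes the data
        y_pred = nn.predict(X_test)   # one predicted label per test image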
      
    • Problem: Train O(1), Predict O(N). But what we want is slow training and fast prediction

  3. K-Nearest Neighbors: (a small improvement over the nearest-neighbor example above)
    • taking a majority vote from the K closest points
    • Using L2(Euclidean) distance instead:

      • BTW: L1 depends on the choice of coordinate system, so it may be a better fit if the individual entries of your vector have specific meanings for your task
      • BTW: by just changing the distance metric, the method can be applied to many kinds of tasks

  4. Hyperparameters: (like the K value and the distance metric in the example above) Very problem-dependent and generally cannot be learned from the data. Therefore, most people just try them out.
    • Idea 1: choose the ones that work best on the training data.
      Terrible (in the example above, K = 1 always works perfectly on the training data); never do it.
    • Idea 2: split the data into train and test parts; choose the ones that work best on the test data.
      Terrible, never do it, because we'd have no idea how the algorithm will perform on new data.
    • Idea 3: split the data into 3 parts: train, validation, test. Train on the train set, choose hyperparameters on the validation set, then run on the test set only once at the very end, to check whether it is good.
      Better
    • Idea 4: Cross-validation. Split the data into folds, try each fold as the validation set, and average the results (accuracy); then do the same as Idea 3 (a code sketch follows)
      Even better, but not used often in practice
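
      A sketch of what Idea 4 looks like in code, assuming a hypothetical KNearestNeighbor class with the same train/predict API as the class above plus a k parameter:

        num_folds = 5
        X_folds = np.array_split(X_train, num_folds)
        y_folds = np.array_split(y_train, num_folds)

        for k in [1, 3, 5, 10, 20]:
            accuracies = []
            for i in range(num_folds):
                # fold i is validation; concatenate the rest as training data
                X_tr = np.concatenate(X_folds[:i] + X_folds[i+1:])
                y_tr = np.concatenate(y_folds[:i] + y_folds[i+1:])
                classifier = KNearestNeighbor(k)   # hypothetical class
                classifier.train(X_tr, y_tr)
                acc = np.mean(classifier.predict(X_folds[i]) == y_folds[i])
                accuracies.append(acc)
            print(k, np.mean(accuracies))   # pick the k with the best average accuracy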

Linear classification: Another example

  1. Parametric Approach: input image x, weight parameters W; we build a function f(x, W) that outputs 10 numbers giving the class scores (as above, using CIFAR10, so 10 categories in total)
  2. Linear Classifier: $f(x, W) = W \times x + b$
    • CIFAR10 image: 32 x 32 x 3, reshaped to 3072 x 1
    • W 10 x 3072
    • b: 10 x 1. To understand b: if the training data has more cats than dogs, it is reasonable to expect b_cat to be larger than b_dog
    • Can be considered as a template matching method. Each row of W is a template of the category
    • Limitation: only one template per class; when there is large variation within a class, it has to average over all the pictures to get that single template
  3. Hard cases for a linear classifier: Consider f as a line in high-D space
    • parity problem: e.g., separating quadrants 1 & 3 from quadrants 2 & 4; or an odd vs. even number of people or animals
    • multimodal: a class forming islands inside other classes
  4. Score function: $s = f(x; W) = Wx$
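
  A minimal NumPy sketch of the score function with the CIFAR10 shapes above (random W and b stand in for learned parameters):

    import numpy as np

    W = np.random.randn(10, 3072) * 0.0001   # one template per row
    b = np.zeros(10)

    def scores(x):
        # x: a flattened 3072-vector image; returns 10 class scores
        return W.dot(x) + b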



Loss Function and Optimization

  1. Todo from last time:
    • Def a loss function that quantifies our unhappiness with the scores across the training data
    • Come up with a way of efficiently finding the parameters that minimize the loss function — optimization

Loss Function

  1. Def: tell how good our current classifier is.


  2. Ways of understanding:
    • To set a loss function, what you actually do is tell the computer which kinds of errors you care about more and which kinds you can more or less accept — the linear vs. squared hinge term below is an example

  3. Example 1: Multi-class SVM (Support Vector Machine) loss (Hinge loss):
    • Formula: $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$
      (where $\Delta$ is a sort of safety margin we set)

    • Min 0; Max infinity

    • Trick: at the beginning we choose W randomly, so all s should be ≈ 0 and all approximately equal; then $L_i \approx ClassNum - 1$

    • If we square the values in the sum, we get a different loss function, and hopefully a different W — a trick we can use when designing our own loss function

    • If we find a W with zero loss, that W is definitely not unique — 2W also gives zero loss

  4. That's because we told the classifier to find a W that fits the training data. But in practice we don't really care about that; we really care about the performance of the classifier on test data!

  5. Regularization: encourage the model to choose a “simple” W — to solve the above problem:


    • L = Data loss + lambda * Regularization loss

    • lambda is a very important hyperparameter

    • Examples: L2, L1, Elastic(beta*L2+L1)

      • Important difference: what each one considers “complex”
      • L2 prefers to spread the influence across all pixels, while L1 prefers to concentrate it on a few

  6. Example 2: Softmax loss (Multinomial Logistic Regression)
    • Softmax function: normalized probabilities of the classes: $P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$

    • Softmax loss function: the true class's probability should be close to 1: $L_i = -\log P(Y = y_i \mid X = x_i)$

    • Min 0; Max infinity
      But to get 0, we would need an infinite score at the correct class and minus-infinity scores at the others, so we will never actually get zero loss

    • Trick: when all s ≈ 0 and all approximately equal, there should be $L_i \approx -\log(1/ClassNum) = \log(ClassNum)$

  7. Contrast the two loss func:
    • How we choose to interpret the scores to measure the badness
    • What they want to do:
      • SVM just wants the correct class's score to be a margin bigger than the others';
      • Softmax wants the correct class's probability to go towards 1, i.e., its score towards infinity and the others towards minus infinity (a NumPy sketch of both losses follows)
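
  A sketch of both losses for a single example in NumPy, following the formulas above (s: score vector, y: index of the correct class, Delta = 1):

    import numpy as np

    def svm_loss(s, y, delta=1.0):
        margins = np.maximum(0, s - s[y] + delta)
        margins[y] = 0                        # don't count the correct class
        return np.sum(margins)

    def softmax_loss(s, y):
        s = s - np.max(s)                     # shift scores for numeric stability
        p = np.exp(s) / np.sum(np.exp(s))     # normalized class probabilities
        return -np.log(p[y])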

Optimization

  1. Strategy: Follow the slope — Gradient, pointing to the greatest increasing direction
  2. Numerical & analytic gradients: use the numerical one to debug (gradient check) and the analytic one to train
  3. Gradient Descent:
    • Step size (also called Learning rate): the single most important hyperparameter!
    • Update rules
    • Stochastic Gradient Descent (SGD): because N is super large, instead of using all N inputs we use a minibatch of samples (toy sketch below)
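
      A self-contained toy sketch of minibatch SGD (here on a linear least-squares problem, just to show the loop structure; all names are illustrative):

        import numpy as np

        X = np.random.randn(1000, 50)          # toy data
        y = X.dot(np.random.randn(50))         # toy targets
        w = np.zeros(50)
        step_size = 1e-2

        for step in range(200):
            idx = np.random.choice(len(X), 64)               # sample a minibatch of 64
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T.dot(Xb.dot(w) - yb) / len(idx)   # gradient of mean squared error
            w += -step_size * grad                           # parameter update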

  4. Image Features and the Motivation: in practice, we often prefer not to fit the raw pixels, but the “features” we extract from them
    • Example 1: Color Histogram
    • Example 2: Histogram of Oriented Gradients(HoG, Edges direction and distribution)
    • Example 3: Bag of Words(Define visual words)



Backpropagation and Neural Networks

Backpropagation

  1. Computational graph: similar to a flowchart; each node represents an original variable or an operation
  2. A simple example: f(x,y,z) = (x + y) * z
    • Draw the graph:
        x ---v (q) 
        y -> + ----v (f)
        z -------> * -->
      
    • Assign every intermediate variable a name:




    • To get the gradients of f: we start at the end and go backwards; at each step we compute the gradient of f with respect to v
      (v is an intermediate or original variable), using the chain rule (worked sketch below)
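
      A worked sketch of this example in Python, with arbitrary example values, computing the forward pass and then the gradients by the chain rule:

        x, y, z = -2.0, 5.0, -4.0

        # forward pass
        q = x + y             # q = 3
        f = q * z             # f = -12

        # backward pass, starting from df/df = 1
        df_dz = q             # d(q*z)/dz = q
        df_dq = z             # d(q*z)/dq = z
        df_dx = df_dq * 1.0   # chain rule: dq/dx = 1
        df_dy = df_dq * 1.0   # chain rule: dq/dy = 1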

  3. Backpropagation:
    • Local node:
      Some inputs x, y, a simple operation f, an output z.
      The node is only aware of its immediate surroundings, i.e., the things above.
    • Local gradient:
      For a local node we can compute the gradients between the inputs and the output, e.g. ∂z/∂x and ∂z/∂y.
      Multiplying them by the gradient coming from behind the node (∂final/∂z, called the upstream gradient), we get the gradient between the final output and the inputs of this node.
      If a node has more than one output, i.e., more than one upstream gradient, add them up to get the node's total upstream gradient.
    • Going back from the end of the computational graph to the beginning this way, we get the full gradient.
  4. We can group several simple operations f together into a slightly more complex node, as long as we can still directly write down its local gradient
    • For example the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$, with local gradient $\frac{d\sigma(x)}{dx} = (1 - \sigma(x))\,\sigma(x)$
    • Can be viewed as a trade-off: more math worked out analytically per node vs. more (simpler) nodes in the graph

  5. Patterns in backward flow:
    • Add gate: gradient distributor;
    • Max gate: gradient router;
    • Mul (multiply) gate: gradient switcher — each input's gradient is the upstream gradient scaled by the other input

  6. Gradients for vectorized code (which means that x, y, z are vectors instead of numbers):
    • All the local gradients and the upstream gradients are Jacobian matrices; basically just apply the chain rule
    • Can compute elementwise, or write it out in vectorized (matrix) multiply form (a sketch below)
    • If the final output is a number, not a vector, each gradient has the same shape as its corresponding input
    • What if the output is a vector? The gradient with respect to a matrix would then be a 3-D tensor; but gradient descent uses the gradient of the loss function, not of the hypothesis, and the loss output is a number, never a vector
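
    A sketch of the most common vectorized case, a matrix-multiply node out = X.dot(W); the two local-gradient formulas below are the standard ones:

      import numpy as np

      X = np.random.randn(64, 3072)         # minibatch of 64 inputs
      W = np.random.randn(3072, 10)
      out = X.dot(W)                        # forward pass, shape (64, 10)

      d_out = np.random.randn(*out.shape)   # upstream gradient, same shape as out
      dX = d_out.dot(W.T)                   # gradient w.r.t. X, shape (64, 3072)
      dW = X.T.dot(d_out)                   # gradient w.r.t. W, shape (3072, 10)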

Neural Networks

  1. First see it as a function, without the brain stuff:
    • Before: Linear score function: f = W * x
    • Now: 2-layer Neural Network: f = W_2 * max(0, W_1 * x)
    • The non-linearity that the max function brings is very important and necessary
    • Understanding from the template view:
      W_1 is a collection of many templates, and W_2 weights the templates into class scores
    • We can make it even deeper by repeating “add a non-linearity, then another linear layer” (a sketch of the 2-layer case follows)
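
      A sketch of the 2-layer network as a plain function (hypothetical sizes: 3072-dim input, 100 hidden units, 10 classes):

        import numpy as np

        W1 = np.random.randn(100, 3072) * 0.01
        W2 = np.random.randn(10, 100) * 0.01

        def two_layer_net(x):
            h = np.maximum(0, W1.dot(x))   # max(0, .) is the ReLU non-linearity
            return W2.dot(h)               # 10 class scores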
  2. Inspiration from neural system
  3. Non-linearity: called activation function
    Such as Sigmoid, Leaky ReLU, tanh, Maxout, ReLU, ELU
  4. The Architectures: “3-layer Neural Network”, or equivalently “2-hidden-layer Neural Network”



Convolutional Neural Networks (ConvNets)

  1. A bit of history

From a function perspective (without the brain stuff)

  1. Recall: Fully Connected Layer

  2. Convolution Layer
    • Instead of stretching to a 1-D vector, we keep the structure of the image — Preserve spatial structure
    • Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”
      Note that the filters always extend the full depth of the input volume
    • To get the dot product, we stretch the filter into a 1-D vector w and the overlapping chunk of the image into a 1-D vector x, then get a single number w^T * x + b
      Or think of it as overlaying the filter and doing element-wise multiplication, then summing
    • To do the “slide”: center the filter on top of every pixel of the input volume, compute the dot product, and fill the resulting number into the corresponding position of the output matrix (2-D for a 3-D input), called an activation map
    • Apply several filters (all the same shape) to the same input, and we get the same number of activation maps out
    • ConvNet is a sequence of Convolutional Layers, interspersed with activation functions
  3. To understand the ConvNet: along its depth (where the depth of the activation volumes usually also grows), the layers can be seen as looking for features from simple to complex,
    such as from local edges, to lines, then to corners and blobs, and on to assemblies of blobs

  4. To understand a filter, we can also see it as a template. By visualizing the filter we can find out what pattern it is looking for, that is: what kind of input will produce the largest output

  5. Stride: the step size with which the filter slides over the input matrix




  6. Zero padding: to make it possible to center the filter at the corners of the input, we pad with zeros.
    With a stride of 1 (and the right amount of padding) we can maintain the input size in the output, preventing it from shrinking (see the sketch below)
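
    The output size follows a simple formula; a small sketch:

      def conv_output_size(N, F, stride, pad):
          # N: input width/height, F: filter size
          return (N - F + 2 * pad) // stride + 1

      conv_output_size(32, 5, stride=1, pad=2)   # -> 32: padding 2 preserves the size
      conv_output_size(7, 3, stride=2, pad=0)    # -> 3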

  7. 1 x 1 convolution layers make perfect sense: they mix the information along the depth

The brain view of convolution layer

  1. The filter is also called the neuron's receptive field

Other layers in the ConvNet

  1. Pooling layers (Downsampling)
    • makes the representations smaller and more manageable
    • operates over each activation map independently
    • Operates only spatially; does nothing to the depth
    • Example: Max Pooling
    • Instead of pooling, we can also just use a larger stride in the conv layers to do the same thing — downsampling
  2. ReLU or Other non-linearities: the same as Neural Networks
  3. Fully Connected Layer (FC layer)



Training Neural Networks

Activation Functions

  1. Sigmoid:

    • Squashes numbers to range [0, 1]
    • Historically popular since they have nice interpretation as a saturating “firing rate” of a neuron
    • Three problems:
      • Saturated neurons “kill” the gradients:
        when the inputs are very far from 0, the local gradient becomes 0, which is bad for backprop
      • Outputs are not zero-centered:
        the gradients on w will then always be all positive or all negative, which makes gradient descent less efficient (zig-zag paths) ???
      • exp() is a bit compute expensive (less important)


  2. tanh(x):
    • Squashes numbers to range [-1, 1]
    • Zero centered
    • Still, saturated neurons “kill” the gradients
  3. Rectified Linear Unit (ReLU):

    • Does not saturate (in + region)
    • Very computationally efficient
    • Converges much faster than sigmoid/tanh in practice
    • Actually more biologically plausible than sigmoid
    • Problems:
      • Not zero-centered output
      • Saturates in the - region (“dead ReLUs”):
        happens with a bad weight initialization, or when the learning rate is too large.
        To reduce the chance of dead ReLUs, people like to initialize with slightly positive biases (e.g. 0.01)
  4. Leaky ReLU:

    • Does not saturate (in + region)
    • Very computationally efficient
    • Converges much faster than sigmoid/tanh in practice
    • Will not “die”
  5. PReLU:
    with $\alpha$ as a parameter learned during training

  6. Exponential Linear Units (ELU):

    • All benefits of ReLU
    • Closer to zero mean outputs
    • Negative saturation regime compared with Leaky ReLU adds some robustness to noise
    • But computation of exp()
  7. Maxout “Neuron”:

    • Will not saturate, will not die
    • Doubles the number of parameters

Data Preprocessing

  1. Zero-mean & Normalize
    • Subtract the mean image (e.g. AlexNet) or the per-channel mean (e.g. VGGNet)
      Important because if the data is not zero-centered, the loss function becomes overly sensitive to small perturbations of the weights
    • We don't really normalize further in image processing
  2. PCA & Whitening (more complex; usually not done in image processing)

  3. Whatever preprocessing we compute on the training set, we must apply the exact same computation with the exact same values to the test set

Weight initialization

  1. First idea: Small random numbers
    • w = 0.01 * np.random.randn(D, H)
    • Works fine in small networks, but in deeper networks the activations collapse towards zero and so do the gradients (see the lecture plots) ???
  2. Xavier initialization: W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
    • Do good when use tanh
    • To fit ReLU (because it kills half of the inputs), one possible modification is W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
  3. Still an active research area
    MSRA (He) initialization is another option you can use

Batch Normalization

  1. Idea: we want to keep the activations in a Gaussian range
  2. To make each dimension of a batch of activations at some layer unit Gaussian, apply: $\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
  3. Usually inserted after FC layers (per dimension) or convolutional layers (jointly per activation map), and before the non-linearity
  4. Usually we then allow the network to squash/shift the range if it wants: $y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$. Notice the network can learn $\gamma^{(k)} = \sqrt{Var[x^{(k)}]}$ and $\beta^{(k)} = E[x^{(k)}]$ to recover the identity mapping (see the sketch after this list)
  5. Benefit:
    • Improves gradient flow through the network
    • Allows higher learning rates
    • Kind of more robust
    • Reduces the strong dependence on initialization
    • Act as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
  6. Note that: At test time the BatchNorm layer functions differently. The mean/std not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used. (e.g. can be estimated during training with running averages)
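
  A training-time forward-pass sketch for one batch (x of shape (N, D); gamma and beta are the learnable per-dimension parameters; the running averages are the fixed statistics used at test time, as in the note above):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                          momentum=0.9, eps=1e-5):
        mean = x.mean(axis=0)                    # per-dimension mean over the batch
        var = x.var(axis=0)                      # per-dimension variance
        x_hat = (x - mean) / np.sqrt(var + eps)  # unit-gaussian normalization
        # update the running statistics used at test time
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
        return gamma * x_hat + beta, running_mean, running_var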

Babysitting the Learning Process

  1. The starting steps when apply ConvNet:
    • Step 1: Preprocess the data
    • Step 2: Choose the architecture
      Notable: double-check that the initial loss is reasonable (the trick learned in the loss-function section; also, the loss should go up when regularization is enabled)
    • Step 3: Try to train
      • Tip: first make sure you can overfit a very small portion of the training data very well (very small loss, accuracy 1.00)
      • Start with small regularization and find a learning rate that makes the loss go down (loss not going down: too low; loss exploding: too high)

        TODO

Hyperparameter Optimization

  1. Cross-validation Strategy (in stages)
    • First stage: only a few epochs, to get a rough idea of what parameters work
    • Second stage: longer running time, finer search
    • Repeat as necessary

    • Tip for detecting what doesn't work: if the cost ever exceeds 3x the original cost, we can break out early
    • Note it’s best to optimize in log space ( 10**uniform(s,e) )
    • And compared to Grid Search, Random Search is actually better, because we get a better sense of what the function looks like along each hyperparameter
  2. Some special situation:
    • loss flat for some time, then goes down: maybe bad initialization
    • big gap between train and validation accuracy: maybe overfitting
    • no gap between them: consider increasing the model capacity
  3. Track the ratio of weight updates / weight magnitudes: around 0.001 or so is good

Fancier Optimization

  1. Problem with SGD:
    • If the loss function changes quickly in one direction and slowly in another (a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix),
      then SGD makes very slow progress along the shallow dimension and zig-zag jitters along the steep direction.
      In high dimensions this happens much more often.
    • May get stuck in a local minimum or at a saddle point.
      In high dimensions the problem is more about saddle points and less about local minima — and not only exactly at the saddle point but also near it.
    • Our gradients come from minibatches, so they can be noisy.

  2. Minor improvement: SGD + Momentum
    • Build up “velocity” as a running mean of the gradients, and step in the direction of the velocity

    • rho gives the “friction”: how much of the old velocity contributes to the next velocity; typically rho = 0.9 or 0.99

    • The formula: $v_{t+1} = \rho v_t + \nabla f(x_t)$, $\; x_{t+1} = x_t - \alpha v_{t+1}$
    • Solve all three problems!

  3. Nesterov Momentum:
    Instead of computing the gradient at the point we are standing on, we first step along the current velocity, compute the gradient there, and use that gradient (together with the friction) to form the new velocity


    Another form


  4. AdaGrad:
    • Keep a running sum of the squared gradients, and divide the step by its square root:
      grad_squared += dx * dx
      x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

    • As training goes on, grad_squared keeps growing, so the effective learning rate keeps shrinking — good for convex problems; otherwise, we may still get stuck near a saddle point

  5. Improve: RMSProp
    • Instead of grad_squared += dx * dx, we do grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx, where decay_rate is commonly 0.9 or 0.99

  6. Adam algorithm: roughly momentum (a first moment of the gradient) combined with RMSProp (a second moment)

    • But if we initialize second_moment = 0, and because beta2 is typically close to 1 (e.g. 0.999), the second moment stays small for the early steps; dividing by it then produces huge steps. And that's a problem

    • To solve it: add bias-correction terms (t is the current time step), as in the sketch below
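
      A self-contained sketch of the full Adam update with bias correction, on a toy quadratic loss (typical values: beta1 = 0.9, beta2 = 0.999):

        import numpy as np

        x = np.random.randn(10)            # toy parameters
        learning_rate = 1e-1
        beta1, beta2 = 0.9, 0.999
        first_moment = np.zeros_like(x)
        second_moment = np.zeros_like(x)

        for t in range(1, 1001):                 # t starts at 1
            dx = 2 * x                           # gradient of the toy loss sum(x**2)
            first_moment = beta1 * first_moment + (1 - beta1) * dx         # momentum part
            second_moment = beta2 * second_moment + (1 - beta2) * dx * dx  # RMSProp part
            first_unbias = first_moment / (1 - beta1 ** t)                 # bias correction
            second_unbias = second_moment / (1 - beta2 ** t)
            x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)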



  7. A problem with all of these: they are axis-dependent. That is to say, if the problem is rotated, the per-axis adaptation cannot really un-rotate it

  8. Learning rate decay: start with a relatively large learning rate and then decay it over training
    • Step decay: decay the learning rate by a factor every few epochs
      Exponential decay: $\alpha = \alpha_0 e^{-kt}$
      1/t decay: $\alpha = \alpha_0 / (1 + kt)$

    • Common with SGDMomentum, less common with Adam

    • Learning rate decay is a second-order hyperparameter: get the initial learning rate right first, then worry about the decay

  9. All the algorithms above are First-Order Optimization.
    Second-Order Optimization: Newton's Method. But $H^{-1}$ is too big to form, so there are improvements that approximate the Hessian.
    (But they don't actually handle the stochastic case very nicely, and tend not to work very well with non-convex problems)
    • Quasi-Newton methods
    • L-BFGS

Regularization

  1. All the above is about reducing our training error, but we don't really care about that; what we really care about is reducing the gap between the training set and unseen samples (the test set)

  2. Model Ensembles: Tend to improve by 2%
    • Train multiple independent models
    • At test time average their results
    • Tricks:
      • Instead of training independent models, use multiple snapshots of a single model during training
      • Polyak averaging: instead of using the final parameter vector, use a running average of the parameter vector over training ???

  3. Regularizations: To improve single-model performance
  4. Dropout: In each forward pass, randomly set some neurons to zero, and every time we do a forward pass, we’ll set a different set of the neurons (randomly) to zero
    • The probability of dropout is a hyperparameter; 0.5 is common
    • Often do it on FC layers;
      Sometimes use it also on convolutional layers;
      And on convolutional layers, instead of dropping individual neurons we sometimes drop entire feature maps randomly
    • Understand why this works
      • Helps prevent co-adaptation of features, which is to say it prevents the network from distributing a decision across too many “overlapping” features
      • Can also view as training a large ensemble of models
    • At test time, to wipe out randomness
      • Consider a single neuron with inputs x, y and output a: without dropout, $E[a] = w_1 x + w_2 y$
      • During training with dropout (p = 0.5) we have:
        $E[a] = \frac{1}{4}(w_1 x + w_2 y) + \frac{1}{4}(w_1 x + 0) + \frac{1}{4}(0 + w_2 y) + \frac{1}{4}(0 + 0) = \frac{1}{2}(w_1 x + w_2 y)$
      • So at test time, we just run the model without dropout and then multiply the output by the dropout probability

      • Inverted dropout: instead of multiplying by p at test time, we divide by p during training time (to keep testing efficient); see the sketch below
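
        A sketch of inverted dropout for one layer (here p is the probability of keeping a unit; dividing by p at train time keeps the expected activation unchanged):

          import numpy as np

          p = 0.5   # probability of keeping a unit

          def train_layer(x, W):
              h = np.maximum(0, W.dot(x))
              mask = (np.random.rand(*h.shape) < p) / p   # drop and rescale
              return h * mask

          def test_layer(x, W):
              return np.maximum(0, W.dot(x))              # no scaling needed at test time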

  5. Common pattern of Regularization:
    • Training: add some kind of randomness
    • Testing: average out the randomness (exactly or approximately)
  6. Another examples:
    • Batch Normalization
    • Data Augmentation: train on random transformations of the original images (flips, crops, scales, color jitter, rotations, shearing, etc.). Think about how we can transform the data without changing its label
    • DropConnect
    • Fractional Max Pooling
    • Stochastic Depth

Transfer Learning

  1. Another point of view on overfitting: you don't have enough data
  2. Step of Transfer Learning:
    • Train on a Large enough and similar Dataset
    • Freeze the earlier layers, then reinitialize and train the last FC layer(s) on the small dataset you want to apply it to
    • When you have a larger dataset, then maybe consider updating larger parts of the network
    • Note that you probably want to lower the learning rate when you do the transfer
  3. Pretraining and fine-tuning are very commonly used in many areas



Deep Learning Hardware and Software

CPU vs. GPU

  1. GPU: (DL prefer NVIDIA)
    • much larger number of cores (more than 100x), but each core is much slower (lower clock speed) and “dumber”
    • have their own pretty large RAMs and own caching system
    • Great for parallel tasks with pretty high similarity
  2. Example: Matrix Multiplication
  3. Programming GPUs
    • CUDA (NVIDIA only)
    • OpenCL (runs on anything, but usually slower)
  4. Data Transfer problem (from disk to GPUs). Solutions:
    • Read all data into RAM (if your data is relatively small)
    • Use SSD instead of HDD
    • Use multiple CPU threads to prefetch data

Deep Learning Frameworks

TensorFlow

  1. Rough Process:
    • First define computational graph;
    • Then run the graph many times
  2. Detail Process:
    • Placeholders: x = tf.placeholder(tf.float32, shape=(rx,lx)); y, w1, w2 similar
    • Def forward pass: compute prediction for y and loss (No computation happens, just building model) h, y_pred, loss
    • Tell TF to compute loss of gradient (No computation happens, just building model) grad_w1, grad_w2
    • Enter a session so we can actually run the graph with tf.Session() as sess:
    • Create numpy arrays that will fill in the placeholders above: values = {x: ..., w1: ..., ...}
    • Run the graph: feed in the numpy arrays; get the output we want
      out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
      loss_val, grad_w1_val, grad_w2_val = out
    • Use the outputs to update the weights, then repeat the step ‘R’ (Run)
  3. Problem: copying weights between CPU/GPU each step is expensive! Solution:
    • Instead of placeholders, use Variables for w1, w2 at step ‘P’ and tell TF how we want them to be initialized (still not actually initialized yet): w1 = tf.Variable(tf.random_normal((rw, lw)))
    • Before the step ‘E’ and after the step ‘T’, add assign operations to update w1, w2 as part of the computational graph new_w1 = w1.assign(w1 - learning_rate * grad_w1)
    • After the step ‘E’
            sess.run(tf.global_variables_initializer())
            values = {x: np.random.randn(rx, lx), y: np.random.randn(ry, ly)}
            for t in range(TotStep):
                loss_val, = sess.run([loss], feed_dict=values)
      
  4. But this will not update the weights, because we never told TF to run the updates; we only told it to compute the loss. Solutions:
    • A funny way of doing so:
      • We add a dummy node which will return ‘None’ at the step ‘B’: updates = tf.group(new_w1, new_w2)
      • Last line change to loss_val, _ = sess.run([loss, updates], feed_dict=values)
    • A way that TF provides:
      • Instead of steps ‘T’ and ‘B’, we can just use an optimizer to compute the gradients and update the weights:
        optimizer = tf.train.GradientDescentOptimizer(1e-5) (1e-5 is the learning rate)
        updates = optimizer.minimize(loss)
      • Last line change to loss_val, _ = sess.run([loss, updates], feed_dict=values)
  5. TF also packs layers for us: after setting up x and y, we can just call the following code to set up layers without declaring the w's and b's ourselves:
     init = tf.contrib.layers.xavier_initializer()
     h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu, kernel_initializer=init)
     y_pred = tf.layers.dense(inputs=h, units=ly, kernel_initializer=init)
    


  6. Keras: High-Level Wrapper
  7. Other High-Level Wrappers: TFLearn, TensorLayer, Sonnet, etc.
  8. Pretrained Models, Tensorboard, Distributed Version

  9. Theano is similar

PyTorch

  1. Three Levels of Abstraction
    • Tensor: Imperative ndarray, but runs on GPU (like the Numpy array for TF)
    • Variable: Node in a computational graph; stores data and gradient (like Tensor, Variable, Placeholder for TF)
    • Module: A neural network layer; may store state or learnable weight (like tf.layers, or TFSlim, or TFLearn, or Sonnet, or… for TF)
  2. To let Tensor run on GPU, we need to use special datatype: dtype = torch.cuda.FloatTensor
  3. x = torch.randn(rx, lx).type(dtype) to use Tensors;
    x = torch.autograd.Variable(torch.randn(rx, lx), requires_grad=False) to use Variables; a Variable then contains x.data (a Tensor) and x.grad (a Variable holding the gradient), with requires_grad as the boolean flag
    Their APIs are otherwise the same
    But with Variables we can simply call loss.backward() to compute the gradients (don't forget to zero the gradients first ???); a minimal sketch follows
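
    A minimal training-step sketch with Variables (era-appropriate PyTorch API as described above; shapes are hypothetical):

      import torch
      from torch.autograd import Variable

      x = Variable(torch.randn(64, 1000), requires_grad=False)
      y = Variable(torch.randn(64, 10), requires_grad=False)
      w1 = Variable(torch.randn(1000, 100) * 0.01, requires_grad=True)
      w2 = Variable(torch.randn(100, 10) * 0.01, requires_grad=True)

      learning_rate = 1e-4
      for t in range(500):
          y_pred = x.mm(w1).clamp(min=0).mm(w2)   # forward pass: 2-layer net
          loss = (y_pred - y).pow(2).sum()
          if w1.grad is not None:                 # zero the gradients first
              w1.grad.data.zero_()
              w2.grad.data.zero_()
          loss.backward()                         # autograd fills in .grad
          w1.data -= learning_rate * w1.grad.data
          w2.data -= learning_rate * w2.grad.data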
  4. nn package, optim package and DataLoaders function
  5. We can define our own new autograd functions and new modules in nn
  6. Pretrained Models, Visdom

  7. Update from Torch

  8. Static vs Dynamic Graphs
    • Def:
      • Static: TF, we first build then run
      • Dynamic: PT; we just define Tensors or Variables, and the graph is built within the loop as it runs
    • Optimization: Static ones can allow the computer to optimize the graph for you before it runs
    • Serialization: Static ones once built, can serialize it and run it without the code that built the graph (which allow us to change coding environment)
    • Conditional Operation: Dynamic ones will end up more clean. For Static you need special TF control flow to do so (like tf.cond)
    • Loops: like y_t = (y_(t-1) + x_t) * w, maybe even with varying sizes. Again, dynamic graphs end up cleaner; for static graphs you need special TF control flow (in this case tf.foldl)
  9. TensorFlow Fold: Dynamic graphs in TF

Caffe & Caffe2



CNN Architectures

  1. Review: LeNet 5

  2. AlexNet:
    • Architectures:
      CONV1 (96 x 11 x 11 (x 3); stride 4 pad 0 => 35K parameters);
      MAX POOL1 (3 x 3 (x 1); stride 2 => No parameter);
      NORM1;
      CONV2 (256 x 5 x 5 (x 48); stride 1 pad 2);
      MAX POOL2 (3 x 3 (x 1); stride 2 => No parameter);
      NORM2;
      CONV3 (384 x 3 x 3 (x 128); stride 1 pad 1);
      CONV4 (384 x 3 x 3 (x 192); stride 1 pad 1);
      CONV5 (256 x 3 x 3 (x 192); stride 1 pad 1);
      MAX POOL3 (3 x 3 (x 1); stride 2 => No parameter);
      FC6 (4096 neurons);
      FC7 (4096 neurons);
      FC8 (1000 neurons);
      Softmax
    • First use of ReLU;
      Heavy data augmentation;
      Dropout; Momentum; Batch; Decay (Weight, Learning rate); Ensembles; etc.
  3. ZFNet: Change of hyperparameters

  4. VGGNet: Small filters, Deeper networks (VGG16, VGG19)
    • only 3 x 3 CONV stride 1 pad 1 & 2 x 2 MAXPOOL stride 2
      That leads to a deeper network, more non-linearities, and fewer parameters
    • Doesn't use NORM layers
    • Most memory is in the early CONV layers;
      most parameters are in the late FC layers (but not the last one)

  5. GoogLeNet: deeper network, with computational efficiency
    • No FC layers => Only 5M parameters
    • Efficient “Inception” module:
      Apply parallel filter operations (different size of CONV and Pooling) on the same input from previous layer; Concatenate all filter outputs together depth-wise
      Usually CONV 1x1, 3x3, 5x5 and POOL 3x3
    • Problem: computationally expensive, and the depth keeps growing because the pooling branch preserves it. Solution:
      use 1x1 convs (“bottleneck” layers) to project the depth to a lower dimension (before the 3x3 and 5x5 CONVs, and after the 3x3 POOL)

    • Architecture:
      • Stem Network (Conv-Pool-2x Conv-Pool) + Stacked Inception module + Classifier output (Without the expensive FC, which means there’s only one FC to generate the score of each class)
      • Auxiliary classification outputs to inject additional gradient at the lower layers (AvgPool-1x1Conv-FC-FC-Softmax)

  6. ResNet: Very deep networks using residual connections
    • Deeper plain networks sometimes perform worse than shallow ones on both the training set and the test set (so it's not caused by overfitting),
      because they are much more difficult to train and optimize

    • But a deeper model should perform at least as well as a shallow one, since it could copy the shallow model and add identity mappings — the basic idea of ResNet
    • Solution: use the network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping:
      X (input) -> CONV -> ReLU -> CONV -> F(X), and instead of using F(X) as the input of the next layer, we use F(X) + X. This way F(X) is the residual, i.e., the delta from X (see the sketch after this list)

    • Architecture:
      • Stack residual blocks
      • Periodically, downsampling by using stride 2
      • Additional CONV layer at the beginning and only one FC layer at the end
      • Also use bottleneck layer to improve efficiency
    • Others: Batch Norm, Xavier/2, Momentum, Decayings (Weight, Learning rate), No dropout
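
    A sketch of a basic residual block in PyTorch (channel count preserved so the shortcut can be a plain identity; real ResNet blocks also include batch norm):

      import torch.nn as nn

      class ResidualBlock(nn.Module):
          def __init__(self, channels):
              super(ResidualBlock, self).__init__()
              self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
              self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
              self.relu = nn.ReLU()

          def forward(self, x):
              out = self.conv2(self.relu(self.conv1(x)))  # F(x), the residual
              return self.relu(out + x)                   # F(x) + x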

  7. Other Architecture:
    • Network in Network (NiN)
    • Identity Mappings in Deep Residual Networks
    • Wide Residual Networks
    • Aggregated Residual Transformations for Deep Neural Networks (ResNeXt)
    • Deep Networks with Stochastic Depth
    • FractalNet: Ultra-Deep Neural Networks without Residuals
    • Densely Connected Convolutional Networks
    • SqueezeNet: AlexNet-level Accuracy With 50x Fewer Parameters and <0.5MB Model Size



Recurrent Neural Networks

  1. “Vanilla” Neural Networks: one to one

  2. Recurrent Neural Networks:
    One to one; One to many; Many to one; Many to many

  3. Recurrence formula: $h_t = f_W(h_{t-1}, x_t)$

    x -> RNN -> y
          | ^
          |_|
    Note W is the same for all the h_t

  4. Backpropagation through time
    • Problem: if the sequence is very long, backprop through the whole thing gets super slow
    • Improvement: Truncated Backpropagation through Time:
      Run forward and backward through chunks of the sequence instead of whole sequence;
      Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps

  5. Example 1: (Vanilla) Recurrent Neural Network
    $h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t)$, $\quad y_t = W_{hy} h_t$

    • The unrolled form of the computational graph:
      x_i's, y_i's, h_i's, W, L_i's, L

    • The case of o-m, m-o, m-m;
      Also m-m can be seen as m-o + o-m (kind of like encoding and decoding)

    • Example: Character-level Language Model

      • Train on a sequence of characters you want
      • Sample a character, get an output distribution, generate the next character by sampling from it (or taking the argmax), and feed it back into the model; one recurrence step is sketched below
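
        A sketch of one step of the vanilla recurrence used above (NumPy; x is the one-hot input character, h_prev the previous hidden state):

          import numpy as np

          def rnn_step(x, h_prev, Wxh, Whh, Why):
              h = np.tanh(Whh.dot(h_prev) + Wxh.dot(x))   # new hidden state
              y = Why.dot(h)                              # scores over the vocabulary
              return h, y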
  6. Example 2: Image Captioning
    • Convolutional NN + Recurrent NN
    • For the CNN, we use the 4096-dim activations just before the final FC layer as the image input v to the RNN
    • For RNN:
      • Special <START> token and <END> token



      • Word-level

    • To train it, we can train the CNN and RNN together by passing the gradient down into the CNN
    • Improvement: With Attention (Soft & Hard)
      RNN focuses its attention at a different spatial location when generating each word (and the input from t-1 and ConvNet to the RNN will be different)
  7. Example 3: Visual Question Answering

  8. Multilayer RNN:


  9. LSTM
    • Exploding & Vanishing Gradients Problem: because during backprop we multiply by (factors of) the same W over and over again
      • For exploding gradients, a hacky solution is gradient clipping: set a threshold and scale the gradient if its norm is greater than it
      • For vanishing gradients, change the RNN architecture ==> LSTM (which can help with both problems)
    • Formula (where $\sigma$ is the sigmoid function, and W is 4h x 2h if h and x both have height h):

      $\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} \left( W \begin{pmatrix} h_{t-1} \\ x_t \end{pmatrix} \right)$

      $c_t = f \odot c_{t-1} + i \odot g$
      $h_t = o \odot \tanh(c_t)$

      where $\odot$ is element-wise multiplication

    • Input gate (i): whether to write to cell
      Forget gate (f): whether to erase cell
      Gate gate (g): how much to write to cell
      Output gate (o): how much to reveal the cell
      (where c denotes the cell state)

    • Backprop:
      Consider the path through the cell states (the c's): the gradient only gets element-wise multiplied by the (different) forget gates f, with no repeated multiplication by W, and that's a lot nicer
      ???
      And f is usually initialized near 1 to avoid vanishing at the beginning of training
  10. Highway Networks

  11. Other RNN Variants

    • GRU



Detection, Segmentation and Localization

Semantic Segmentation

Want to produce a category label for each pixel of the input image

  1. Pixel labels carry no notion of object instances, so this task doesn't distinguish two cows, for example; their pixels just get the label “cow”

  2. Idea 1: Sliding Window
    Split the image into crops, and run a CNN on each crop to decide the label of its central pixel.
    Very inefficient!
  3. Idea 2: Fully Convolutional
    • Without FC layers: the final conv layer outputs a volume whose depth equals the number of classes; view it as per-pixel class scores and train on that
      Super computationally expensive at full resolution
    • Improvement:
      design the network as a bunch of convolutional layers, with downsampling and upsampling inside the network

  4. In-Network upsampling:
    • Unpooling
      • Nearest Neighbor
        For 2x2, stride 2: copy each element and fill its 2x2 output block with that value
          1 -> 1 1
               1 1
      • “Bed of Nails” upsampling
        For 2x2, stride 2: fill the 2x2 block with 0 except for one element
          1 -> 1 0
               0 0
      • Max Unpooling
        Do the bed-of-nails upsampling, but put the value back at the position remembered from the corresponding max-pooling (downsampling) layer:
          1 2                      0 0
          3 5  -> 5 -> ... -> 1 -> 0 1
    • Transpose Convolution
      Take one pixel of the input, multiply it with the whole filter, and add the result into the output matrix.
      So the stride is the step length of the filter on the output matrix, and the padding is also applied to the output matrix (see the 1-D sketch below)
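
      A 1-D sketch of transpose convolution: each input value scales the whole filter, which is added into the output at a stride-spaced position (overlaps are summed):

        import numpy as np

        def transpose_conv1d(x, filt, stride):
            out = np.zeros(stride * (len(x) - 1) + len(filt))
            for i, v in enumerate(x):
                # paste a scaled copy of the filter; overlapping regions sum up
                out[i * stride : i * stride + len(filt)] += v * filt
            return out

        transpose_conv1d(np.array([1., 2.]), np.array([1., 1., 1.]), stride=2)
        # -> [1., 1., 3., 2., 2.]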

Classification + Localization

  1. Treat localization as a regression problem, sharing the same network up to the second-to-last FC layer with classification.
    While one FC head generates the class scores, another FC head generates the box coordinates (x, y, w, h).
    We then have two loss functions: softmax loss (or cross-entropy loss) and a regression loss (L2, L1, or others); we take their weighted sum (the weight is a hyperparameter) and use it to do SGD.
    Fancier: generate a box for every class and only use the one of the true class in the regression loss.

  2. Aside: the same “regress to fixed outputs” idea is used elsewhere:
    • Human pose Estimation

Object Detection

  1. There might be a varying number of outputs per image

  2. Idea 1: as classification: Sliding Window
    Apply a CNN to many different crops of the image; the CNN classifies each crop as an object class or background.
    Problem: we'd need to apply the CNN to a huge number of locations and scales — very computationally expensive
  3. Idea 2: Region Proposals
    Give regions where objects are likely to be found.
    This is not deep learning but traditional image processing (e.g. Selective Search)

  4. R-CNN: combine the two ideas
    • Steps:
      • Use region proposals to generate around 2K Regions of Interest (RoIs)
      • Warp them to the fixed size the CNN expects
      • Run each of them through the same CNN
      • Do classification and localization using the CNN
        Note that we can train the CNN to correct the box slightly beyond the RoI, e.g. when the RoI misses a person's head
    • Problems:
      • Still super slow, for both training and testing
  5. Fast R-CNN
    • Step:
      • Forward whole image through some conv. layers to get a high resolution convolutional feature map
      • Use Region Proposals on the convolutional feature map to generate RoIs
      • An RoI pooling layer warps them to the fixed size the following layers expect
      • Run them directly through some FC layers to do classification and localization
    • So fast that the region proposal step becomes the speed bottleneck
  6. Faster R-CNN
    • Idea: Make CNN do proposals. Insert Region Proposal Network (RPN) in the CNN to predict proposals from features
    • The RPN also does a classification (over two classes: object or not) and a bounding-box regression, so it has two loss functions too.
      Therefore, in total, we have 4 loss functions that we train with at the same time (weighted sum)

  7. Detection without proposals: YOLO / SSD (You Only Look Once & Single Shot Detector)
    • Idea:
      • Divide the image into grid cells (R * L)
      • For each grid cell, place several (denote as B) base boxes centered on it
      • Within each grid cell
        • Regress from each of the B base boxes to a final box with 5 numbers: (dx, dy, dh, dw, confidence)
        • Predict scores for each of C classes (including background as a class)
      • Output: Tensor R * L * (5 * B + C)
  8. Aside: Object Detection + Captioning = Dense Captioning

Instance Segmentation

Kind of like Semantic Segmentation + Object Detection

  1. Mask R-CNN:
    • Image -> CNN -> RoI Align -> two branches: one small conv network predicting a per-RoI segmentation mask (mini semantic segmentation); the other doing classification and box regression (plus joint keypoint localization for the next item)
  2. It can also do pose estimation at the same time



Visualizing and Understanding

  1. First Layer (Generally Conv. layer) : Visualize Filters: Oriented edges and Opposing colors (Scale to 0~255)
  2. For intermediate layers: directly interpreting the weight matrices as gray-scale images does not give much understanding; we need fancier methods
  3. For last hidden layer:
    • Nearest Neighbors
    • Dimensionality Reduction: reduce the high-D feature vectors to 2-D coordinates
      • PCA
      • t-SNE (t-distributed stochastic neighbor embeddings)
  4. For intermediate layers: While visualizing weight matrix is less useful, visualizing the activation maps of those intermediate layers is kind of interpretable
  5. Maximally Activating Patches: Visualizing what types of patches from input images cause maximal activation in different neurons
  6. Occlusion Experiment: Mask part of the image before feeding to CNN, draw heatmap of probability at each mask location
  7. Saliency Maps ===> Segmentation without supervision
  8. (Guided) Backpropagation
  9. Gradient Ascent
    • Using gradient to change pixels to maximize the output score of some neuron
    • Regularization:
      • L2 norm
      • Gaussian blur image (Periodical)
      • Clip pixels with small values to 0 (Periodical)
      • Clip pixels with small gradients to 0 (Periodical)
    • Add “multi-faceted” visualization (with more careful regularization and a center bias)

  10. Fooling Images / Adversarial Examples (A whole lecture later)
    • Start from an arbitrary image
    • Pick an arbitrary class
    • Modify the image to maximize the class
    • Repeat until network is fooled

  11. DeepDream: Amplify existing features
    • Forward: compute activations at chosen layer
    • Set gradient of chosen layer equal to its activation
    • Backward: compute gradient on image
    • Update image
  12. Feature Inversion: given a CNN feature vector for an image, find a new image that
    • Matches the given feature vector
    • (Start with noise) “looks natural” (by using regularization)

  13. Total Variation Regularizer

  14. Texture Synthesis
    • Nearest Neighbor (without neural network)
    • Neural Texture Synthesis: Gram Matrix
  15. Neural Artwork Synthesis
  16. Neural Style Transfer: Feature Inversion + Gram Matrix
    • Content Image + Style Image
    • Play with the hyperparameters
    • Super slow
    • Solution: Train another neural network to perform style transfer for us

  17. The apply of multi-scale processing on some of the above



Generative Models

  1. Unsupervised learning
    Learn some underlying hidden structure of the data (Without label)
    Examples: clustering, dimensionality reduction, feature learning, density estimation
  2. Generative Models: Given training data, generate new samples from the same distribution
    • Two main classes: Explicit density; Implicit density

PixelRNN and PixelCNN

  1. Explicit density model
    Use the chain rule to decompose the likelihood of an image x into a product of 1-D distributions:
    $p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, ..., x_{i-1})$
    Then maximize the likelihood of the training data

  2. Complex distribution over pixel values on the right of the equation ==> Express using a neural network
    We’ll need to define ordering of “previous pixels”

  3. PixelRNN:
    • Generate image pixels starting from the corner, moving right and down step by step
    • Dependency on previous pixels modeled using an RNN (LSTM)
    • Drawback: Sequential, generation is slow
  4. PixelCNN: ???
    • Still generates image pixels starting from the corner, moving right and down
    • Dependency on previous pixels now modeled using a CNN over context region
    • Training: maximize the likelihood of training images
    • Training is faster. But generation must still proceed sequentially, so still slow

Variational Autoencoders (VAE) ??? from github

  1. Define an intractable density function with latent z:
    $p_\theta(x) = \int p_\theta(z)\, p_\theta(x \mid z)\, dz$
    We cannot optimize this directly; instead derive and optimize a lower bound on the likelihood

  2. Some background: Autoencoders
    • Unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data (x -> z, with dimensionality reduction)
    • Encoder and Decoder: historically linear + nonlinearity, then deep fully-connected networks, then ReLU CNNs
    • Train such that features can be used to reconstruct original data
      “Autoencoding” - encoding itself
      Loss function: e.g. L2
    • After training, we can throw away the decoder layers, and use z to train a supervised model (Classifier etc.)

  3. Variational Autoencoders
    • Assume training data is generated from underlying unobserved (latent) representation z
    • Sample from the true prior p_{\theta^*}(z), then sample from the true conditional p_{\theta^*}(x | z_i).
      We want to estimate the true parameters theta^* of this generative model
    • How should we represent this model?
      Choose the prior p(z) to be simple, e.g. Gaussian.
      The conditional p(x|z) is complex => represent it with a neural network (called the Decoder network)
    • How to train the model:
      • Learn the model parameters to maximize the likelihood ==> but the integral over z is intractable

      • Solution: In addition to decoder network modeling p_theta(x|z), define additional encoder network q_phi(z|x) that approximates p_theta(z|x)
        p => mu_{x|z}, Sigma_{x|z} => x|z ~ hN(mu_{x|z}, Sigma_{x|z})   (decoder)
        q => mu_{z|x}, Sigma_{z|x} => z|x ~ hN(mu_{z|x}, Sigma_{z|x})   (encoder)

      • Then for the log-likelihood we can write:
        $\log p_\theta(x_i) = E_z[\log p_\theta(x_i \mid z)] - D_{KL}(q_\phi(z \mid x_i) \| p_\theta(z)) + D_{KL}(q_\phi(z \mid x_i) \| p_\theta(z \mid x_i))$
        The first two terms are tractable; the last KL term is intractable but always >= 0

      • So we get a lower bound on $\log p_\theta(x_i)$, namely $\mathcal{L}(x_i, \theta, \phi) = E_z[\log p_\theta(x_i \mid z)] - D_{KL}(q_\phi(z \mid x_i) \| p_\theta(z))$
        And when training, we maximize the lower bound: $\theta^*, \phi^* = \arg\max_{\theta, \phi} \sum_i \mathcal{L}(x_i, \theta, \phi)$

      • For a forward pass:

        • input minibatch of x
        • q_phi(z|x) Encoder network
        • Sample z from z|x ~ hN(mu_{z|x}, Sigma_{z|x})
        • p_theta(x|z) Decoder network
        • Sample x from x|z ~ hN(mu_{x|z}, Sigma_{x|z})
        • Calc hL and maximize it (backprop)
    • For generating data:
      • Use decoder network
      • Now sample z from prior ( z ~ hN(0, I) )
    • Cons:
      • Maximizes only a lower bound on the likelihood, okay but not as good
      • Samples are blurrier and lower quality

GANs (Generative Adversarial Networks)

  1. GANs don't work with an explicit density.
    We want to sample from a complex, high-dimensional training distribution; there's no direct way to do this.
    ==> Instead, sample from a simple distribution and learn a transformation into the training distribution ==> a neural network.
    So what we do is feed random noise (a sample from a simple distribution) into a neural network, and ask it to output a sample from the training distribution.

  2. Two-player game:
    • Generator network: try to fool the discriminator by generating real-looking images (from random noise)
    • Discriminator network: try to distinguish between real and fake images
  3. Train jointly in a minimax game:
    $\min_{\theta_g} \max_{\theta_d} \left[ E_{x \sim p_{data}} \log D_{\theta_d}(x) + E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z))) \right]$
    where $D_{\theta_d}(x)$ is the discriminator output, and $G_{\theta_g}(z)$ is the generated fake data

  4. Training GANs: alternate between
    • Gradient ascent on the discriminator:
      $\max_{\theta_d} \left[ E_{x \sim p_{data}} \log D_{\theta_d}(x) + E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z))) \right]$

    • Gradient descent on the generator:
      $\min_{\theta_g} E_{z \sim p(z)} \log(1 - D_{\theta_d}(G_{\theta_g}(z)))$

      But this function's slope is bad: we train slowly when the generator is bad and quickly when it's already good.
      So use gradient ascent on the generator instead:
      $\max_{\theta_g} E_{z \sim p(z)} \log D_{\theta_d}(G_{\theta_g}(z))$
  5. Jointly training two networks is challenging and can be unstable. Active research area

  6. Improving: Convolutional Architectures

  7. Interpolating and Interpretablility

  8. Cons:
    • Trickier / more unstable to train
    • Can’t solve inference queries such as p(x) , p(z|x)



Deep Reinforcement Learning

  1. Problems involving an agent interacting with an environment that provides numeric reward signals. The goal is to learn how to take actions in order to maximize reward.
    The environment gives a state s_t to the agent; the agent sends back an action a_t; the environment then gives a reward r_t and the next state s_(t+1). Repeat until the environment gives a terminal state.

  2. Examples:
    • Cart-Pole Problem
    • Robot Locomotion
    • Atari Games
    • Go (the board game)

Markov Decision Process (MDP)

  1. It is the mathematical formulation of the RL problem

  2. Markov Property: Current state completely characterises the state of the world

  3. An MDP is defined by: (hS, hA, hR, P, gamma)
    • hS : set of possible states
    • hA : set of possible actions
    • hR : distribution of reward given (state, action) pair
    • P : transition probability i.e. distribution over next state given (state, action) pair
    • gamma : discount factor (how much we value rewards coming up soon versus later on)
  4. The MDP:
    • At time step t = 0, environment samples initial state s_0 ~ p(s_0)
    • Then, for t = 0 until done:
      • Agent selects action a_t
      • Environment samples reward r_t ~ hR(. | s_t, a_t)
      • Environment samples next state s_(t+1) ~ P(. | s_t, a_t)
      • Agent receives reward r_t and next state s_(t+1)

    • A policy pi is a function from hS to hA that specifies what action to take in each state
    • Objective: find policy pi^* that maximizes cumulative discounted reward: sum_t(gamma^t * r_t)
  5. Example: Grid World

  6. How do we handle the randomness (initial state, transition probabilities, ...)?
    Solution: maximize the expected sum of rewards:
    $\pi^* = \arg\max_\pi E\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi\right]$
    with s_0 ~ p(s_0), a_t ~ pi(. | s_t), s_(t+1) ~ p(. | s_t, a_t)

Q-Learning

  1. Def: value function and Q-value function
    • The value function at state s is the expected cumulative reward from following the policy from state s:
      $V^\pi(s) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, \pi\right]$

    • The Q-value function at state s and action a is the expected cumulative reward from taking action a in state s and then following the policy:
      $Q^\pi(s, a) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$

    • The optimal Q-value function: $Q^*(s, a) = \max_\pi E\left[\sum_{t \ge 0} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi\right]$

  2. Bellman equation:
    $Q^*(s, a) = E_{s'}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]$

  3. Solving for the optimal policy: value iteration algorithm (use the Bellman equation as an iterative update):
    $Q_{i+1}(s, a) = E\left[r + \gamma \max_{a'} Q_i(s', a') \mid s, a\right]$
    Q_i will converge to Q^* as i -> infinity

  4. Problem: Not scalable. Must compute Q(s, a) for every state-action pair. Computationally infeasible
    Solution: Use a function approximator to estimate Q(s, a). E.g. a neural network

  5. Deep Q-Learning
    • Q-Learning: use a function approximator to estimate the action-value function: $Q(s, a; \theta) \approx Q^*(s, a)$
      Deep Q-Learning: When the function approximator is a deep neural network

    • Forward Pass
      Loss function: $L_i(\theta_i) = E_{s,a}\left[(y_i - Q(s, a; \theta_i))^2\right]$
      where $y_i = E_{s'}\left[r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \mid s, a\right]$

    • Backward Pass
      Gradient update (with respect to the Q-function parameters theta):
      $\nabla_{\theta_i} L_i(\theta_i) = E_{s,a,s'}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i)\right) \nabla_{\theta_i} Q(s, a; \theta_i)\right]$

  6. Example: Playing Atari Games

  7. Experience Replay:
    • Learning from batches of consecutive samples is problematic:
      • Samples are correlated => inefficient learning
      • Current Q-network parameters determines next training samples => can lead to bad feedback loops
    • Address these problems using experience replay:
      • Continually update a replay memory table of transitions (s_t, a_t, r_t, s_(t+1)) as game (experience) episodes are played
      • Train Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples
      • As each of the transitions can also contribute to multiple weight updates => greater data efficiency
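
    A sketch of a replay memory (fixed capacity; uniform random minibatch sampling, as described above):

      import random
      from collections import deque

      class ReplayMemory:
          def __init__(self, capacity):
              self.memory = deque(maxlen=capacity)   # oldest transitions fall off

          def push(self, s, a, r, s_next):
              self.memory.append((s, a, r, s_next))

          def sample(self, batch_size):
              return random.sample(list(self.memory), batch_size)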

Policy Gradients

  1. If the state dimensionality gets really huge, the Q-function can be very complicated, while the policy itself can be much simpler

  2. Policy Gradients:
    • Formally, let's def a class of parametrized policies: $\Pi = \{\pi_\theta,\ \theta \in \mathbb{R}^m\}$
    • For each policy, def its value: $J(\theta) = E\left[\sum_{t \ge 0} \gamma^t r_t \mid \pi_\theta\right]$
    • We want to find the optimal policy $\theta^* = \arg\max_\theta J(\theta)$
      ==> Gradient Ascent on the policy parameters

  3. REINFORCE algorithm
    • Mathematically:
      $J(\theta) = E_{\tau \sim p(\tau; \theta)}[r(\tau)] = \int_\tau r(\tau)\, p(\tau; \theta)\, d\tau$
      where $r(\tau)$ is the reward of a trajectory $\tau = (s_0, a_0, r_0, s_1, ...)$

    • Differentiating this directly is intractable. However, we can use a nice trick:
      $\nabla_\theta p(\tau; \theta) = p(\tau; \theta)\, \nabla_\theta \log p(\tau; \theta)$
      Then:
      $\nabla_\theta J(\theta) = E_{\tau \sim p(\tau; \theta)}\left[r(\tau)\, \nabla_\theta \log p(\tau; \theta)\right]$

    • Can we compute those quantities without knowing the transition probabilities?
      We have:
      $p(\tau; \theta) = \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$
      Thus:
      $\log p(\tau; \theta) = \sum_{t \ge 0} \left[\log p(s_{t+1} \mid s_t, a_t) + \log \pi_\theta(a_t \mid s_t)\right]$
      And when differentiating, the transition probabilities drop out:
      $\nabla_\theta \log p(\tau; \theta) = \sum_{t \ge 0} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$
  4. However, this also suffers from high variance because credit assignment is really hard
  5. Variance reduction:
    • First idea: push up probabilities of an action only by the cumulative future reward from that state
    • Second idea: Use discount factor gamma to ignore delayed effects
    • Baseline: Introduce a baseline function dependent on the state, and rescale the rewards
      Choosing the baseline:
      • a constant moving average of rewards experienced so far across all trajectories
      • push up the probability of an action if it was better than the expected value of what we should get from that state:
        that's exactly the Q-function and the value function!
        Intuitively, we are happy with an action a_t in a state s_t if $Q^\pi(s_t, a_t) - V^\pi(s_t)$ is large
        Def advantage function $A(s, a) = Q^\pi(s, a) - V^\pi(s)$
  6. ===> Actor-Critic: combine Q-Learning and Policy Gradients by training both an actor (the policy) and a critic (the Q-function)
  7. Examples:
    • Recurrent Attention Model (RAM)
    • fine-grained image recognition
    • image captioning
    • visual question-answering
    • AlphaGo



Efficient Methods and Hardware for Deep Learning

Algorithms for Efficient Inference

  1. Pruning
  2. Weight Sharing
  3. Quantization
  4. Low Rank Approximation
  5. Binary / Ternary Net
  6. Winograd Transformation

Hardware for Efficient Inference

  1. TPU
  2. EIE

Algorithms for Efficient Training

  1. Parallelization
  2. Mixed Precision with FP16 and FP32
  3. Model Distillation
  4. DSD: Dense-Sparse-Dense Training

Hardware for Efficient Training



Adversarial Examples and Adversarial Training

  1. Adversarial Examples:
    • Fooling image as we mentioned before
    • Not just the neural network
  2. Understand why it happens:
    • First idea: it is caused by overfitting — wrong,
      because it's not random; it is systematic across pictures and across different trained models
    • Second idea: it is caused by underfitting:
      excessive linearity forces points far away from the decision boundary to be classified confidently, even where we don't actually have any input data

  3. The Fast Gradient Sign Method

  4. The Clever Hans Theory

  5. RBFs resist the problem, but they are very shallow and extremely difficult to train if made deeper to increase accuracy

  6. Transferability Attack
    • Enhancing With Ensembles

  7. The adversarial examples for humans (optical illusions) and for current neural networks are very different and have almost no overlap,
    which means there's a huge difference between the mechanisms of the two

  8. Adversarial Training
    • Can improve a bit, but still long way to go
    • Semi-supervised learning