Week 3

Class: C1W3
Materials: https://www.coursera.org/learn/neural-networks-deep-learning/home/week/4
Type: Section

Shallow Neural Network

Neural Network Representation

Below is an image of a typical neural network (referred to as a 2-layer network, because we do not count the input layer as an official layer). It consists of 3 layers, namely: the input layer, the hidden layer, and the output layer.

Computing a Neural Network's Output

In the previous lessons we were introduced to the computational graph. The circle in the image above represents two steps of computation: you first compute z, then compute the activation a as the sigmoid function of z.
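
As a minimal numpy sketch of those two steps for a single unit (the vectors x and w and the bias b below are made-up illustrative values, not from the course):

    import numpy as np

    def sigmoid(z):
        # logistic function: maps any real number into (0, 1)
        return 1 / (1 + np.exp(-z))

    # illustrative single unit: x is the input vector, w its weights, b its bias
    x = np.array([[1.0], [2.0], [3.0]])   # shape (3, 1)
    w = np.array([[0.1], [0.2], [0.3]])   # shape (3, 1)
    b = 0.5

    z = np.dot(w.T, x) + b   # step 1: linear combination, shape (1, 1)
    a = sigmoid(z)           # step 2: activation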

Neural networks are like logistic regression, just repeated a number of times before you get an output, as illustrated in the image below: we carry out the logistic regression computation at each node of the network to get the value of its activation.

Simplified equations; note that the w weights are transposed (each column vector w is transposed into a row vector, and these rows are stacked to form the matrix W).
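
As a hedged sketch of the vectorized version for one training example (the layer sizes and variable names here are assumptions, not the course notebook's code), the rows of W1 are the transposed weight vectors of the individual hidden units:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    n_x, n_h, n_y = 3, 4, 1            # assumed layer sizes: input, hidden, output

    x  = np.random.randn(n_x, 1)       # one training example, shape (n_x, 1)
    W1 = np.random.randn(n_h, n_x)     # each row is one hidden unit's w transposed
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h)
    b2 = np.zeros((n_y, 1))

    z1 = W1 @ x + b1                   # shape (n_h, 1)
    a1 = sigmoid(z1)                   # hidden-layer activations
    z2 = W2 @ a1 + b2                  # shape (n_y, 1)
    a2 = sigmoid(z2)                   # network output y_hat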

Activation functions

When you build a NN, one of the choices you'll be faced with is which activation function to use for your hidden layers and output layer.

In the previous lessons we used the sigmoid function as the activation function (remember: it outputs a value between 0 and 1, which can be interpreted as a probability).

It turns out that the hyperbolic tangent (tanh) activation function almost always works better than the sigmoid function for hidden units. tanh is a shifted/rescaled version of the sigmoid: the sigmoid goes from 0 → 1, whereas tanh goes from -1 → +1, which means the mean of its outputs is closer to 0.

This makes learning easier for the next layer, as the data is centred around 0 instead of 0.5 as with the sigmoid function.

The hyperbolic tangent activation function is slightly superior to the sigmoid function, with the exception of the output layer, where your output should be a probability (0 → 1) for binary classification, as shown in the image below.

Cons: for both sigmoid and tanh, when z is very large or very small the slope of the function becomes very close to zero, which can slow down gradient descent.

Enter the ReLU (Rectified Linear Unit), which is shown in the image below.

The formula for the ReLU function is max(0, z), such that the derivative is 1 as long as z is positive and 0 when z is negative (technically undefined at z = 0, but in practice you can use either value).
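
A small numpy sketch of ReLU and its derivative (mapping z = 0 to a derivative of 0 here is an arbitrary but common convention):

    import numpy as np

    def relu(z):
        # max(0, z), applied element-wise
        return np.maximum(0, z)

    def relu_derivative(z):
        # 1 where z > 0, 0 where z <= 0
        return (z > 0).astype(float)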

Best practices when choosing an activation function:

• For the output layer of a binary classifier, use sigmoid (you need an output between 0 and 1).
• For hidden layers, ReLU is the default choice; tanh almost always works better than sigmoid.
• If you are not sure which one works best for your problem, try them and evaluate on a hold-out/validation set.

Why do you need non-linear activation functions?

If you were to use linear activation functions (no ReLU, sigmoid, or tanh) then the neural network would just compute a linear function of its input, which means there is no need for hidden layers: a composition of linear functions is itself a linear function.

A linear activation function is only useful in the output layer of a regression problem (for example estimating house prices, where the output is a real number); even then, the hidden layers should still use ReLU or another non-linear activation function.
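
A quick numerical illustration of the point above, that stacking linear layers buys nothing (the sizes below are arbitrary): two linear layers collapse into a single equivalent linear layer.

    import numpy as np

    np.random.seed(0)
    x = np.random.randn(3, 1)
    W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
    W2, b2 = np.random.randn(1, 4), np.random.randn(1, 1)

    # two layers with identity (linear) activation
    two_layers = W2 @ (W1 @ x + b1) + b2

    # the same function expressed as a single linear layer
    W_eq = W2 @ W1
    b_eq = W2 @ b1 + b2
    one_layer = W_eq @ x + b_eq

    print(np.allclose(two_layers, one_layer))   # True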

Derivatives of activation functions

When implementing back propagation for a NN you need to compute the slope/derivative of the activation function. Below are the activation functions you can choose from, together with their derivatives.

Sigmoid activation function

g'(z) = dg(z)/dz = a(1 − a), where a = g(z)

Hyperbolic Tangent activation function

g'(z) = dg(z)/dz = 1 − (tanh(z))^2 = 1 − a^2, where a = tanh(z)

ReLU activation function

g'(z) = 0 if z < 0, and 1 if z > 0 (undefined at z = 0; in practice use either 0 or 1)
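
A sketch that implements these three derivatives and sanity-checks the sigmoid and tanh formulas against a finite-difference approximation (the test points and step size are arbitrary choices):

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_prime(z):
        a = sigmoid(z)
        return a * (1 - a)

    def tanh_prime(z):
        a = np.tanh(z)
        return 1 - a ** 2

    def relu_prime(z):
        return (z > 0).astype(float)

    z = np.linspace(-3, 3, 7)
    eps = 1e-6
    # central finite differences should match the closed-form derivatives
    print(np.allclose(sigmoid_prime(z), (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)))  # True
    print(np.allclose(tanh_prime(z), (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)))     # True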

Gradient descent for Neural Networks

Equations for forward/back propagation

Forward propagation for a neural network computational graph

Back propagation for a neural network computational graph

Cleaner version

Note: the dZ^{[1]} equation uses element-wise multiplication.

How to get element-wise matrix multiplication (Hadamard product) in numpy?
For elementwise multiplication of matrix objects, you can use numpy.multiply:

    import numpy as np
    a = np.array([[1, 2], [3, 4]])
    b = np.array([[5, 6], [7, 8]])
    np.multiply(a, b)

Result:

    array([[ 5, 12],
           [21, 32]])

However, you should really use array instead of matrix. matrix objects have all sorts of horrible incompatibilities with regular ndarrays.
https://stackoverflow.com/a/40035266
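
Putting the equations together, here is a hedged numpy sketch of one forward/backward pass for a 2-layer network with a tanh hidden layer and a sigmoid output (the layer sizes, random data, and variable names are illustrative assumptions, not the course's notebook code); note the element-wise * in the dZ1 line:

    import numpy as np

    def sigmoid(z):
        return 1 / (1 + np.exp(-z))

    m = 5                                    # assumed number of training examples
    n_x, n_h, n_y = 3, 4, 1                  # assumed layer sizes

    X = np.random.randn(n_x, m)              # columns are training examples
    Y = (np.random.rand(1, m) > 0.5).astype(float)

    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))

    # forward propagation
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)

    # back propagation
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)       # element-wise product with tanh'(Z1)
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m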

Random Initialization

When training a NN it is important to initialise the weights randomly. Suppose you initialise your weights to zero:

When you compute forward prop, a^{[1]}_1 = a^{[1]}_2 (the hidden units compute the same activation), and in back prop dz^{[1]}_1 = dz^{[1]}_2.

Thus no matter how many times you train your network, the hidden-layer units will compute exactly the same function, i.e. there is no point in having multiple hidden units. This is the failure to break symmetry (the symmetry breaking problem).

However, the case is different when you initialise with random weights. It is recommended that you initialise with small weight values (e.g. scaled by 0.01), because if the weight values are very large, the activations land in the flat, saturated regions of tanh/sigmoid and learning becomes slow.
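
A minimal sketch of this initialisation (the 0.01 scale follows the lecture's rule of thumb for shallow networks; the layer sizes are assumed):

    import numpy as np

    n_x, n_h, n_y = 3, 4, 1                    # assumed layer sizes

    W1 = np.random.randn(n_h, n_x) * 0.01      # small random values break symmetry
    b1 = np.zeros((n_h, 1))                    # biases can safely start at zero
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))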

Q&A

  1. Which of the following are true? (Check all that apply.)
    • X is a matrix in which each column is one training example.
    • a^{[2](12)} denotes the activation vector of the 2nd layer for the 12th training example.
    • a^{[2]} denotes the activation vector of the 2nd layer.
    • a^{[2]}_4 is the activation output by the 4th neuron of the 2nd layer.

    Correct

  2. The tanh() activation usually works better than the sigmoid activation function for hidden units because the mean of its output is closer to zero, and so it centers the data better for the next layer. True
    • Yes. As seen in lecture the output of the tanh is between -1 and 1, it thus centers the data which makes the learning simpler for the next layer.
  3. Which of these is a correct vectorized implementation of forward propagation for layer l, where 1 ≤ l ≤ L?
    • Z^{[l]} = W^{[l]} A^{[l−1]} + b^{[l]}
    • A^{[l]} = g^{[l]}(Z^{[l]})
  4. You are building a binary classifier for recognizing cucumbers (y=1) vs. watermelons (y=0). Which one of these activation functions would you recommend using for the output layer? sigmoid
    • Justification: Yes. Sigmoid outputs a value between 0 and 1 which makes it a very good choice for binary classification. You can classify as 0 if the output is less than 0.5 and classify as 1 if the output is more than 0.5. It can be done with tanh as well but it is less convenient as the output is between -1 and 1.

    5. Consider the following code:

    A = np.random.randn(4,3)
    B = np.sum(A, axis = 1, keepdims = True)

    What will be B.shape? (4, 1)

    • Yes, we use keepdims = True to make sure that B.shape is (4, 1) and not (4,). It makes our code more rigorous.

    6. Suppose you have built a neural network. You decide to initialize the weights and biases to be zero. Which of the following statements is true?

    • Each neuron in the first hidden layer will perform the same computation. So even after multiple iterations of gradient descent each neuron in the layer will be computing the same thing as other neurons.

    7. Logistic regression’s weights w should be initialized randomly rather than to all zeros, because if you initialize to all zeros, then logistic regression will fail to learn a useful decision boundary because it will fail to “break symmetry”. False

    • Justification: Logistic regression doesn't have a hidden layer. If you initialize the weights to zeros, the first example x fed into logistic regression will output zero, but the derivatives of logistic regression depend on the input x (because there's no hidden layer), which is not zero. So at the second iteration, the weight values follow x's distribution and are different from each other if x is not a constant vector.

8. You have built a network using the tanh activation for all the hidden units. You initialize the weights to relatively large values, using np.random.randn(..,..)*1000. What will happen?

• This will cause the inputs of the tanh to also be very large, so the activations saturate and the gradients become close to zero; the optimization algorithm will thus be slow.

9. Consider the following 1 hidden layer neural network:

Which of the following statements are True? (Check all that apply).

10. In the same network as the previous question, what are the dimensions of Z^{[1]} and A^{[1]}?