Week 4

Class: C1W4
Materials: https://www.coursera.org/learn/neural-networks-deep-learning/home/week/4

Deep Neural Network

Deep L-layer neural network

What is a deep neural network?

The image below shows several types of neural networks: logistic regression is a shallow (one-layer) network, while a network with several hidden layers (e.g., 5 or more) is a deep neural network.

Deep neural network notation

Below is an image illustrating the notation we use: $L$ denotes the number of layers, and $n^{[l]}$ denotes the number of units in layer $l$.

The neural network below has 4 layers in total, 3 of them hidden, with 3 inputs and 1 output.

Forward Propagation in a Deep Network

The general forward prop calculation is denoted as:

$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$

$a^{[l]} = g^{[l]}(z^{[l]})$

where $a^{[l]}$ is the activation obtained by applying layer $l$'s activation function $g^{[l]}$ to $z^{[l]}$, and at the output layer

$a^{[L]} = \hat{y}$

Vectorization

For $l = 1, \ldots, 4$:

$X = A^{[0]}$

$Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}$

$A^{[1]} = g^{[1]}(Z^{[1]})$

$\ldots$

$\hat{Y} = g^{[4]}(Z^{[4]}) = A^{[4]}$

Our notation allows us to replace the lowercase $a$ and $z$ with $A$ and $Z$ to obtain the vectorized version. Even with vectorization, you still need an explicit for loop over the layers; currently there is no way around it. A sketch of this loop is shown below.
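
As a concrete illustration, here is a minimal sketch of this vectorized forward pass with the explicit loop over layers. The function name forward_prop, the layout of the parameters dictionary, and the ReLU/sigmoid choices are my own assumptions for the sketch, not code given in the course.

    import numpy as np

    def forward_prop(X, parameters, L):
        """Vectorized forward pass with an explicit for loop over the layers l = 1..L."""
        A = X                                      # A^[0] = X
        caches = []
        for l in range(1, L + 1):
            W = parameters['W' + str(l)]
            b = parameters['b' + str(l)]
            Z = W @ A + b                          # Z^[l] = W^[l] A^[l-1] + b^[l]
            caches.append((A, W, b, Z))            # cache values needed later by backprop
            if l == L:
                A = 1 / (1 + np.exp(-Z))           # sigmoid on the output layer: A^[L] = Y_hat
            else:
                A = np.maximum(0, Z)               # ReLU on the hidden layers: A^[l] = g^[l](Z^[l])
        return A, caches

With a parameters dictionary built like the initialization loop in question 5 of the Q&A below, calling forward_prop(X, parameters, L) returns $\hat{Y}$ with shape $(1, m)$.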

When working with deep neural networks, always take note of the shapes of the matrices you are working with.

Getting your matrix dimensions right

The general rule to check when you're implementing layer $l$ is that the parameter matrices have the following dimensions:

$W^{[l]}: \ (n^{[l]}, n^{[l-1]})$

$b^{[l]}: \ (n^{[l]}, 1)$

Therefore:

$a^{[l]} = g^{[l]}(z^{[l]})$, with $a^{[L]} = \hat{y}$ at the output layer

Note that "a" and "z have dimensions (n[l],1)(n^{[l]},1)

In general, the number of neurons in the previous layer gives us the number of columns of the weight matrix, and the number of neurons in the current layer gives us the number of rows in the weight matrix.

Vectorized implementation

In the vectorized implementation, $Z^{[l]}$ and $A^{[l]}$ have dimensions $(n^{[l]}, m)$, where $m$ is the number of examples. Python broadcasting stretches $b^{[l]}$ from $(n^{[l]}, 1)$ to $(n^{[l]}, m)$ when it is added to $W^{[l]} A^{[l-1]}$; $W^{[l]}$ keeps its $(n^{[l]}, n^{[l-1]})$ shape and $X = A^{[0]}$ has shape $(n^{[0]}, m)$, so the product plus bias gives $Z^{[l]}$ of shape $(n^{[l]}, m)$.
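
To make the shape bookkeeping concrete, here is a small sketch of the vectorized dimensions; the layer sizes and the number of examples m are made-up values for illustration.

    import numpy as np

    m = 10                                    # number of examples (assumed for the sketch)
    layer_dims = [3, 5, 4, 1]                 # n^[0], n^[1], n^[2], n^[3] (assumed sizes)

    A = np.random.randn(layer_dims[0], m)     # A^[0] = X has shape (n^[0], m)
    for l in range(1, len(layer_dims)):
        W = np.random.randn(layer_dims[l], layer_dims[l - 1])   # (n^[l], n^[l-1])
        b = np.zeros((layer_dims[l], 1))                        # (n^[l], 1)
        Z = W @ A + b                         # broadcasting stretches b across the m columns
        assert Z.shape == (layer_dims[l], m)  # Z^[l] and A^[l] are (n^[l], m)
        A = np.maximum(0, Z)                  # ReLU keeps the (n^[l], m) shape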

Why deep representations?

Why do deep neural networks work well compared to shallower alternatives? Intuitively, the earlier layers learn simple features (e.g., edges in an image) and the deeper layers compose them into increasingly complex features (e.g., parts of faces, then whole faces). Circuit theory also suggests that some functions can be computed by a small deep network but would require exponentially more units in a shallow one.

Building blocks of deep neural networks

A key takeaway: the basic building block for implementing a deep neural network is, for each layer, a forward propagation step and a corresponding backward propagation step, plus a cache (containing $z^{[l]}$, among other values) that passes information from the forward step to the matching backward step.

Forward and Backward Propagation

One thing to note is that the forward function for layer $l$ takes $a^{[l-1]}$ as input, outputs $a^{[l]}$, and caches $(z^{[l]}, W^{[l]}, b^{[l]})$ for later use.

The backward function for layer $l$ takes $da^{[l]}$ (and the cache) as input, and outputs $da^{[l-1]}$, $dW^{[l]}$ and $db^{[l]}$.

Note: $dW^{[l]} = dz^{[l]} \, a^{[l-1]T}$
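
Below is a minimal sketch of the backward step for the linear part of one layer, vectorized over m examples. The function name is my own, and the 1/m averaging is the vectorized counterpart of the single-example formula above.

    import numpy as np

    def linear_backward(dZ, A_prev, W):
        """Backward step for layer l's linear part, vectorized over m examples."""
        m = A_prev.shape[1]
        dW = (dZ @ A_prev.T) / m                     # dW^[l] = (1/m) dZ^[l] A^[l-1]T
        db = np.sum(dZ, axis=1, keepdims=True) / m   # db^[l] = (1/m) row-wise sum of dZ^[l]
        dA_prev = W.T @ dZ                           # dA^[l-1] = W^[l]T dZ^[l]
        return dA_prev, dW, db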

Summary

Suppose you have a 3-layer neural network whose parameters are initialised with random values. Forward propagation takes the input $X$, computes $\hat{Y}$ layer by layer, and caches the values of $Z^{[l]}$ for each layer along the way; you then compute the loss $L(\hat{Y}, Y)$.

Note that backward propagation is also initialised with a value, namely $da^{[L]}$, the derivative of the loss with respect to the final activation. There is no need to compute $da^{[0]}$, since that would be the derivative with respect to the input $X$, which we are not trying to learn; this is why it is crossed off in the image.
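
To tie the summary together, here is a toy end-to-end sketch of my own (one hidden layer rather than three, with invented sizes and data): forward propagation with caching, the loss, backward propagation starting from the output, and a gradient-descent update.

    import numpy as np

    np.random.seed(1)
    m, n_x, n_1 = 8, 3, 4                                  # examples, input units, hidden units (assumed)
    X = np.random.randn(n_x, m)
    Y = (np.random.rand(1, m) > 0.5).astype(float)

    W1 = np.random.randn(n_1, n_x) * 0.01; b1 = np.zeros((n_1, 1))
    W2 = np.random.randn(1, n_1) * 0.01;   b2 = np.zeros((1, 1))
    alpha = 0.1                                            # learning rate (a hyperparameter)

    for i in range(500):
        # forward propagation, caching Z1 and A1 for the backward pass
        Z1 = W1 @ X + b1;  A1 = np.maximum(0, Z1)          # ReLU hidden layer
        Z2 = W2 @ A1 + b2; A2 = 1 / (1 + np.exp(-Z2))      # sigmoid output, A2 = Y_hat
        loss = -np.mean(Y * np.log(A2 + 1e-8) + (1 - Y) * np.log(1 - A2 + 1e-8))
        # backward propagation, starting from dZ2 = A2 - Y at the output layer
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m; db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (Z1 > 0)                      # ReLU derivative
        dW1 = dZ1 @ X.T / m;  db1 = dZ1.sum(axis=1, keepdims=True) / m
        # gradient descent update of the parameters W and b
        W1 -= alpha * dW1; b1 -= alpha * db1
        W2 -= alpha * dW2; b2 -= alpha * db2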

Parameters vs Hyperparameters

Hyperparameters control how your learning algorithm behaves, for example: the learning rate $\alpha$, the number of iterations, the number of hidden layers $L$, the number of hidden units $n^{[l]}$, and the choice of activation function.

These hyperparameters determine the final values of the parameters $W$ and $b$.

ALWAYS REMEMBER:

The difference between np.random.rand and np.random.randn

See the explanation here: https://stackoverflow.com/a/47241066 and a graphical explanation here: https://stackoverflow.com/a/56829859
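
A quick sketch of the difference (these are real NumPy functions; the shapes here are just an example):

    import numpy as np

    u = np.random.rand(3, 2)     # uniform samples from [0, 1): always non-negative
    g = np.random.randn(3, 2)    # standard normal samples: mean ~0, std ~1, can be negative

    print(u.min() >= 0 and u.max() < 1)   # True: rand never produces negative values
    print((g < 0).any())                  # usually True: randn also produces negative values

This matters for initialization: np.random.randn gives small positive and negative values centred on zero, which is why it appears in the parameter-initialization loop in question 5 below.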

What does this have to do with the brain?

Q & A

  1. What is the "cache" used for in our implementation of forward propagation and backward propagation?
    • We use it to pass variables computed during forward propagation to the corresponding backward propagation step. It contains useful values for backward propagation to compute derivatives.

      Justification: Correct, the "cache" records values from the forward propagation units and sends them to the backward propagation units, where they are needed to compute the chain-rule derivatives.

  2. Among the following, which ones are "hyperparameters"?
    • number of layers LL in the neural network
    • number of iterations
    • learning rate α\alpha
    • size of the hidden layers n[l]n^{[l]}
  3. Which of the following statements is true?
    • The deeper layers of a neural network are typically computing more complex features of the input than the earlier layers.
  4. Vectorization allows you to compute forward propagation in an L-layer neural network without an explicit for-loop (or any other explicit iterative loop) over the layers l = 1, 2, …, L. False
  5. Assume we store the values for $n^{[l]}$ in an array called layer_dims, as follows: layer_dims = [n_x, 4, 3, 2, 1]. So layer 1 has four hidden units, layer 2 has 3 hidden units, and so on. Which of the following for-loops will allow you to initialize the parameters for the model?
    for i in range(1, len(layer_dims)):
        parameter['W' + str(i)] = np.random.randn(layer_dims[i], layer_dims[i-1]) * 0.01
        parameter['b' + str(i)] = np.random.randn(layer_dims[i], 1) * 0.01

6. Consider the following neural network.

How many layers does this network have?

7. During forward propagation, in the forward function for a layer $l$ you need to know what the activation function in that layer is (sigmoid, tanh, ReLU, etc.). During backpropagation, the corresponding backward function also needs to know what the activation function for layer $l$ is, since the gradient depends on it. True

Justification: Yes, as you saw in week 3, each activation function has a different derivative. Thus, during backpropagation you need to know which activation was used in the forward propagation to be able to compute the correct derivative.

8. There are certain functions with the following properties: (i) to compute the function using a shallow network circuit, you will need a large network (where we measure size by the number of logic gates in the network), but (ii) to compute it using a deep network circuit, you need only an exponentially smaller network. True

9. Consider the following 2 hidden layer neural network:

Which of the following statements are True? (Check all that apply).

10. Whereas the previous question used a specific network, in the general case, what is the dimension of $W^{[l]}$, the weight matrix associated with layer $l$?
    • $W^{[l]}$ has dimension $(n^{[l]}, n^{[l-1]})$