Week 1
| Class | C3W1 |
| --- | --- |
| Created | |
| Materials | |
| Property | https://www.coursera.org/learn/neural-networks-deep-learning/ |
| Reviewed | |
| Type | Section |
Introduction to ML Strategy
Why ML Strategy

When you are trying to improve a deep learning system, there are usually many things you could try and tweak. The risk is picking a direction that turns out not to help, so that completing your DL system takes much longer than it should; that is why you need an ML strategy.
Orthogonalization
According to Andrew Ng, orthogonalization is an important idea in ML. Orthogonalization means splitting a problem or system into distinct, independent components, so that each "knob" affects only one aspect of performance. This makes it possible to tune and verify each aspect independently of the others, which reduces development and testing time.
For a supervised learning system to do well, you usually need to tune the hyperparameters of the system so that it is efficiently optimized. Below is the chain of assumptions when building an ML system (a sketch that encodes this chain follows the list):
- First, fit the training set well on the cost function. If the model doesn't fit the training set well:
    - Use a bigger neural network
    - Use a different optimization algorithm
- Second, fit the dev (development) set well on the cost function. If the model doesn't do well here:
    - Add regularization
    - Get a bigger training set
- Third, fit the test set well on the cost function. If the model doesn't do well here:
    - Get a bigger dev set, since you have probably over-tuned to your dev set
    - Check that the dev set distribution matches the test set, or revisit the cost function
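A minimal sketch of this diagnostic chain in Python, assuming you have already measured the three error rates; the 2% target, the 1% gaps and the remedy strings are illustrative placeholders, not values from the course:

```python
def next_knob(train_error, dev_error, test_error, target=0.02):
    """Walk the chain of assumptions and suggest which knob to turn next.
    Thresholds and remedy strings are illustrative, not from the course."""
    if train_error > target:                 # training set not fit well
        return ["bigger neural network", "different optimization algorithm"]
    if dev_error - train_error > 0.01:       # does not generalize to the dev set
        return ["add regularization", "get a bigger training set"]
    if test_error - dev_error > 0.01:        # over-tuned to the dev set
        return ["get a bigger dev set"]
    return ["revisit the dev set distribution or the cost function"]

print(next_knob(train_error=0.015, dev_error=0.03, test_error=0.032))
# ['add regularization', 'get a bigger training set']
```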
When Andrew Ng trains a neural network, he tends not to use early stopping, because he finds it hard to treat as a knob that affects a single aspect of the system: it simultaneously affects how well you fit the training set and how well you do on the dev/test set. So early stopping works against orthogonalization, and he prefers other regularization knobs instead.
Setting up your goal
Single number evaluation metric
Whether you're tuning hyperparameters or trying out different ideas for learning algorithms, you'll find that your progress will be much faster if you have a single real number evaluation metric that lets you know quickly if your technique works or not.
One reasonable way to evaluate the performance of your classifier is to look at its precision and recall.
- Precision: of the images your classifier labeled as positive, what percentage actually are positive?
- Recall: of all the images that actually are positive, what percentage did your classifier correctly label as positive?
With the two evaluation metrics mentioned above, you can combine precision (P) and recall (R) into a single metric called the F1 score, defined as their harmonic mean:
F1 = 2 / (1/P + 1/R) = 2PR / (P + R)
A single-number evaluation metric, together with a well-defined dev set, lets you tell quickly whether classifier A or classifier B is better, which speeds up the iterative process of improving your learning algorithm.
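As a quick illustration, here is a minimal pure-Python sketch that computes precision, recall and F1 from binary labels (the toy arrays are made up; scikit-learn's `f1_score` would give the same result):

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and their harmonic mean (F1) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: classifier B would be compared using the same single number.
print(precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.667, 0.667, 0.667)
```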
Satisficing and Optimizing metric

Suppose each classifier has some running time and some accuracy. Accuracy would be the optimizing metric, the one you want to make as good as possible, while running time is a satisficing metric: it only has to be below some acceptable threshold.
So, if you care about N metrics, pick 1 to be the optimizing metric and treat the other N-1 as satisficing metrics.
To summarize: when there are multiple things you care about, choose one as the optimizing metric, which you want to do as well as possible on, and treat the rest as satisficing metrics, which only need to beat some threshold. This gives you an almost automatic way of looking at multiple classifiers and picking the best one. These metrics must be evaluated on a training set, a dev (development) set, or a test set, so part of the job is also setting up those sets properly.
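A minimal sketch of that selection rule, assuming each candidate model is described by an accuracy (the optimizing metric) and a running time (the satisficing metric); the 100 ms threshold and the numbers are illustrative:

```python
# Each candidate: (name, accuracy, running_time_ms). Values are made up.
candidates = [
    ("A", 0.90, 80),
    ("B", 0.92, 95),
    ("C", 0.95, 1500),   # most accurate, but fails the satisficing constraint
]

MAX_RUNTIME_MS = 100  # satisficing threshold (illustrative)

feasible = [c for c in candidates if c[2] <= MAX_RUNTIME_MS]  # satisfice first
best = max(feasible, key=lambda c: c[1])                      # then optimize accuracy
print(best)  # ('B', 0.92, 95)
```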
Train/Dev/Test Distributions
- It is recommended that your dev & test sets come from the same distribution.
- A great solution is to randomly select your dev & test set from the distribution.
- Choose a dev & test set to reflect data that you expect to get in the future which you consider important to do well on.
Size of the dev and test sets
The rule of thumb is to make the dev set just big enough for its purpose, which is to help you evaluate different ideas and tell whether A or B is better. The purpose of the test set is to give you a confident estimate of your final system's performance, so it only needs to be big enough for that, which for large datasets can be much less than 30% of the data.
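A minimal sketch of such a split, assuming one pooled dataset: shuffling before splitting keeps the dev and test sets on the same distribution, and with a large dataset the dev/test fractions can be far smaller than the traditional 20-30% (the 98/1/1 split below is illustrative):

```python
import random

def split(examples, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle, then carve off dev and test sets from the same distribution."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_dev = int(len(examples) * dev_frac)
    n_test = int(len(examples) * test_frac)
    dev = examples[:n_dev]
    test = examples[n_dev:n_dev + n_test]
    train = examples[n_dev + n_test:]
    return train, dev, test

train, dev, test = split(range(1_000_000))
print(len(train), len(dev), len(test))  # 980000 10000 10000
```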
Comparing to human-level performance
Why human-level performance?
In the last few years, a lot more machine learning teams have been talking about comparing the machine learning systems to human level performance. There are two main reasons:
- First, because of advances in deep learning, machine learning algorithms are suddenly working much better, so it has become much more feasible in a lot of application areas for machine learning algorithms to actually become competitive with human-level performance.
- Second, it turns out that the workflow of designing and building a machine learning system is much more efficient when you're trying to do something that humans can also do.

Progress on a lot of machine learning tasks tends to be relatively rapid as you approach human-level performance. But once the algorithm surpasses human-level performance, progress in accuracy slows down.
Over time, as you keep training the algorithm, the performance approaches but never surpasses some theoretical limit, called the Bayes optimal error (think of this as the best possible error).
According to stackoverflow,
Bayes error is the lowest possible prediction error that can be achieved and is the same as irreducible error. If one would know exactly what process generates the data, then errors will still be made if the process is random. This is also what is meant by "y is inherently stochastic".
For example, when flipping a fair coin, we know exactly what process generates the outcome (a binomial distribution). However, if we were to predict the outcome of a series of coin flips, we would still make errors, because the process is inherently random (i.e. stochastic).
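A small simulation of the coin-flip example: even a predictor that knows the true process and always guesses the more likely outcome is wrong about half the time on a fair coin, and that 50% is the irreducible (Bayes) error:

```python
import random

random.seed(0)
flips = [random.randint(0, 1) for _ in range(100_000)]  # fair coin: p(heads) = 0.5

# The Bayes-optimal predictor knows the true process and always predicts
# the most probable outcome; for a fair coin any constant guess is optimal.
prediction = 1
bayes_error = sum(f != prediction for f in flips) / len(flips)
print(bayes_error)  # ≈ 0.5, the irreducible error for this process
```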
Also read:
Why compare ML to human-level performance

Avoidable Bias
Avoidable bias is the difference between the training error and Bayes error (or your approximation of Bayes error). You want to keep improving training performance until you get down close to Bayes error, but not beyond it: doing better than Bayes error on the training set means the model is overfitting.
Understanding human-level performance
What is "human-level" error?

The optimal human-level error (achieved by a team of doctors in the lecture's medical-imaging example) is thus 0.5% or lower, so we know that Bayes error is ≤ 0.5%.
Improving your model performance
Getting a supervised learning algorithm to work well means fundamentally two things:
- You can fit the training set well (achieve low avoidable bias)
- The training set performance generalizes well to the dev and test set (variance is not too bad)
Reducing avoidable bias and variance (a diagnostic sketch follows this list)
- Reduce avoidable bias (human-level error → training error):
    - Train a bigger model
    - Train longer / use a better optimization algorithm
    - NN architecture / hyperparameter search (CNN, RNN)
- Reduce variance (training error → dev error):
    - Get more data
    - Regularization (L2, dropout, data augmentation, early stopping)
    - NN architecture / hyperparameter search
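A minimal sketch that turns these rules of thumb into a diagnostic, using human-level error as the proxy for Bayes error; the example numbers are the ones that appear in the quiz below (human 0.1%, training 2%, dev 2.1%):

```python
def diagnose(human_level_error, train_error, dev_error):
    """Estimate avoidable bias and variance, using human-level error as a Bayes proxy."""
    avoidable_bias = train_error - human_level_error
    variance = dev_error - train_error
    focus = "reduce avoidable bias" if avoidable_bias > variance else "reduce variance"
    return avoidable_bias, variance, focus

# Numbers from the quiz below: human 0.1%, training 2%, dev 2.1%
print(diagnose(0.001, 0.02, 0.021))  # ≈ (0.019, 0.001, 'reduce avoidable bias')
```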
Q & A
Bird recognition in the city of Peacetopia (case study)
Problem Statement
- This example is adapted from a real production application, but with details disguised to protect confidentiality.

You are a famous researcher in the City of Peacetopia. The people of Peacetopia have a common characteristic: they are afraid of birds. To save them, you have to build an algorithm that will detect any bird flying over Peacetopia and alert the population.
The City Council gives you a dataset of 10,000,000 images of the sky above Peacetopia, taken from the city’s security cameras. They are labelled:
- y = 0: There is no bird on the image
- y = 1: There is a bird on the image
Your goal is to build an algorithm able to classify new images taken by security cameras from Peacetopia.
There are a lot of decisions to make:
- What is the evaluation metric?
- How do you structure your data into train/dev/test sets?
Metric of success
The City Council tells you that they want an algorithm that
- Has high accuracy
- Runs quickly and takes only a short time to classify a new image.
- Can fit in a small amount of memory, so that it can run in a small processor that the city will attach to many different security cameras.
1. Having three evaluation metrics makes it harder for you to quickly choose between two different algorithms, and will slow down the speed with which your team can iterate. True/False?
- True
2. After further discussions, the city narrows down its criteria to:
- "We need an algorithm that can let us know a bird is flying over Peacetopia as accurately as possible."
- "We want the trained model to take no more than 10sec to classify a new image.”
- “We want the model to fit in 10MB of memory.”
If you had the three following models, which one would you choose?
Test Accuracy: 98%
Runtime: 9 sec
Memory size: 9MB
Justification: Correct! As soon as the runtime is less than 10 seconds you're good. So, you may simply maximize the test accuracy after you made sure the runtime is <10sec.
3. Based on the city’s requests, which of the following would you say is true?
- Accuracy is an optimizing metric; running time and memory size are satisficing metrics.
4. Structuring your data
Before implementing your algorithm, you need to split your data into train/dev/test sets. Which of these do you think is the best choice?
Train: 9,500,000
Dev: 250,000
Test: 250,000
5. After setting up your train/dev/test sets, the City Council comes across another 1,000,000 images, called the “citizens’ data”. Apparently the citizens of Peacetopia are so scared of birds that they volunteered to take pictures of the sky and label them, thus contributing these additional 1,000,000 images. These images are different from the distribution of images the City Council had originally given you, but you think it could help your algorithm.
Notice that adding this additional data to the training set will make the distribution of the training set different from the distributions of the dev and test sets.
Is the following statement true or false?
"You should not add the citizens' data to the training set, because if the training distribution is different from the dev and test sets, then this will not allow the model to perform well on the test set."
- False
Justification: Sometimes we'll need to train the model on the data that is available, and its distribution may not be the same as the data that will occur in production. Also, adding training data that differs from the dev set may still help the model improve performance on the dev set. What matters is that the dev and test set have the same distribution.
6. One member of the City Council knows a little about machine learning, and thinks you should add the 1,000,000 citizens’ data images to the test set. You object because:
- The test set no longer reflects the distribution of data (security cameras) you most care about.
- This would cause the dev and test set distributions to become different. This is a bad idea because you’re not aiming where you want to hit.
7. You train a system, and its errors are as follows (error = 100%-Accuracy):
Dev set error: 4.5%
Training set error: 4%
This suggests that one good avenue for improving performance is to train a bigger network so as to drive down the 4.0% training error. Do you agree?
- No, because there is insufficient information to tell.
8. You ask a few people to label the dataset so as to find out what is human-level performance. You find the following levels of accuracy:
Bird watching expert #1: 0.3% error
Bird watching expert #2: 0.5% error
Normal person #1 (not a bird watching expert): 1.0% error
Normal person #2 (not a bird watching expert): 1.2% error
If your goal is to have “human-level performance” be a proxy (or estimate) for Bayes error, how would you define “human-level performance”?
- 0.3% (accuracy of expert #1)
9. Which of the following statements do you agree with?
- A learning algorithm’s performance can be better than human-level performance but it can never be better than Bayes error.
10. You find that a team of ornithologists debating and discussing an image gets an even better 0.1% performance, so you define that as “human-level performance.” After working further on your algorithm, you end up with the following:
Human-level performance: 0.1%
Training set error: 2%
Dev set error: 2.1%
Based on the evidence you have, which two of the following four options seem the most promising to try? (Check two options.)
- Try decreasing regularization.
- Train a bigger model to try to do better on the training set.
11. You also evaluate your model on the test set, and find the following:
Human-level performance: 0.1%
Training set error: 2%
Dev set error: 2.1%
Test set error: 7%
What does this mean? (Check the two best options.)
- You should try to get a bigger dev set.
- You have overfit to the dev set.
12. After working on this project for a year, you finally achieve:
Human-level performance: 0.10%
Training set error: 0.05%
Dev set error: 0.05%
What can you conclude? (Check all that apply.)
- If the test set is big enough for the 0.05% error estimate to be accurate, this implies Bayes error is ≤ 0.05%.
- It is now harder to measure avoidable bias, thus progress will be slower going forward.
13. It turns out Peacetopia has hired one of your competitors to build a system as well. You and your competitor both deliver systems with about the same running time and memory size. However, your system has higher accuracy! However, when Peacetopia tries out your and your competitor’s systems, they conclude they actually like your competitor’s system better, because even though you have higher overall accuracy, you have more false negatives (failing to raise an alarm when a bird is in the air). What should you do?
- Rethink the appropriate metric for this task, and ask your team to tune to the new metric.
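One way to implement that answer is a weighted error that penalizes false negatives (missed birds) more heavily than false positives; this is a hedged sketch, and the weight of 10 is an arbitrary illustration rather than a value from the course:

```python
def weighted_error(y_true, y_pred, fn_weight=10.0, fp_weight=1.0):
    """Error metric that punishes missed birds (false negatives) more than false alarms."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return (fn_weight * fn + fp_weight * fp) / len(y_true)

# One missed bird costs far more than one false alarm under the new metric.
print(weighted_error([1, 1, 0, 0], [0, 1, 0, 0]))  # 2.5  (one false negative)
print(weighted_error([1, 1, 0, 0], [1, 1, 1, 0]))  # 0.25 (one false positive)
```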
14. You’ve handily beaten your competitor, and your system is now deployed in Peacetopia and is protecting the citizens from birds! But over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because your system is being tested on a new type of data.

You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?
- Use the data you have to define a new evaluation metric (using a new dev/test set) taking into account the new species, and use that to drive further progress for your team.
15. The City Council thinks that having more Cats in the city would help scare off birds. They are so happy with your work on the Bird detector that they also hire you to build a Cat detector. (Wow Cat detectors are just incredibly useful aren’t they.) Because of years of working on Cat detectors, you have such a huge dataset of 100,000,000 cat images that training on this data takes about two weeks. Which of the statements do you agree with? (Check all that apply.)
- Needing two weeks to train will limit the speed at which you can iterate.
- Buying faster computers could speed up your team’s iteration speed and thus your team’s productivity.
- If 100,000,000 examples is enough to build a good enough Cat detector, you might be better off training with just 10,000,000 examples to gain a ≈10x improvement in how quickly you can run experiments, even if each model performs a bit worse because it’s trained on less data.