Week 1

Class: C3W1
Materials: https://www.coursera.org/learn/neural-networks-deep-learning/

Introduction to ML Strategy

Why ML Strategy

When trying to improve a deep learning system, there are usually many things you can try and tweak. The problem is that a poorly chosen direction can cost you a lot of time before your DL system actually improves.

Orthogonalization

According to Andrew Ng, orthogonalization is an important aspect of ML. Orthogonalization is the process of converting a set of functions into separate independent functions, i.e. splitting a problem/system into its distinct components. This makes it possible to tune and verify each part of an algorithm independently of the others, which reduces development and testing time.

For a supervised learning system to do well, one usually needs to tune the hyperparameters of the system to ensure that it is efficiently optimized. The chain of assumptions when building an ML system is:

1. Fit the training set well on the cost function.
2. Fit the dev set well on the cost function.
3. Fit the test set well on the cost function.
4. Perform well in the real world.

When Andrew Ng trains a neural network, he tends not to use early stopping, as he finds it difficult to tune it so that it affects only a single aspect of the system: it simultaneously affects how well you fit the training set and how well you do on the dev/test set. Early stopping is therefore a less orthogonal knob, and he prefers controls that each affect one thing.

Setting up your goal

Single number evaluation metric

Whether you're tuning hyperparameters or trying out different ideas for learning algorithms, you'll find that your progress will be much faster if you have a single real number evaluation metric that lets you know quickly if your technique works or not.

One reasonable way to evaluate the performance of your classifier is to look at its precision and recall.

With the two evaluation metrics mentioned above, you could instead combine precision and recall into a single metric called the F1 score, defined by the formula $F_1 = \frac{2}{\frac{1}{P} + \frac{1}{R}}$, the harmonic mean of precision $P$ and recall $R$.
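As a rough illustration (not from the course materials), precision, recall, and the F1 score could be computed from binary predictions like this; the function name and toy data are made up:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive class)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    true_pos = np.sum((y_pred == 1) & (y_true == 1))
    false_pos = np.sum((y_pred == 1) & (y_true == 0))
    false_neg = np.sum((y_pred == 0) & (y_true == 1))

    precision = true_pos / (true_pos + false_pos)  # of predicted positives, how many are right
    recall = true_pos / (true_pos + false_neg)     # of actual positives, how many were found
    f1 = 2 / (1 / precision + 1 / recall)          # harmonic mean of precision and recall
    return precision, recall, f1

# Toy example: 5 predictions against ground-truth labels
p, r, f1 = precision_recall_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(p, r, f1)  # 0.667, 0.667, 0.667
```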

An evaluation metric lets you tell quickly whether classifier A or classifier B is better, so having a dev set plus a single-number evaluation metric speeds up the iterative process of improving your learning algorithm.

Satisficing and Optimizing metric

Suppose each candidate classifier has a certain running time and a certain accuracy. Accuracy can be the optimizing metric, the one you treat as the priority and try to maximize, while running time is a satisficing metric: it only needs to be below some acceptable threshold.

So, if you have $N$ metrics, 1 should be optimizing and $N-1$ satisficing.

To summarize, if there are multiple things you care about, set one as the optimizing metric that you want to do as well as possible on, and one or more as satisficing metrics where you are satisfied as long as they do better than some threshold. This gives you an almost automatic way of quickly comparing multiple classifiers and picking the best one. These evaluation metrics must be evaluated or calculated on a training set, a development set, or maybe the test set, so one of the things you also need to do is set up training, dev (development), and test sets.
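A minimal sketch of the "1 optimizing, N-1 satisficing" rule, assuming accuracy is the optimizing metric and runtime the satisficing one; the candidate models and the threshold below are made up for illustration:

```python
# Pick the classifier with the best accuracy (optimizing metric),
# subject to the satisficing constraint: runtime <= 100 ms.
candidates = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},  # most accurate, but too slow
]

MAX_RUNTIME_MS = 100  # satisficing threshold

feasible = [c for c in candidates if c["runtime_ms"] <= MAX_RUNTIME_MS]
best = max(feasible, key=lambda c: c["accuracy"])
print(best["name"])  # -> "B"
```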

Train/Dev/Test Distributions

Size of the dev and test sets

The rule of thumb is to make the dev set big enough for its purpose, which is to help you evaluate different ideas and tell whether idea A or idea B is better. The purpose of the test set is to help you evaluate your final system with high confidence. You just have to set your test set big enough for that purpose, and that could be much less than 30% of the data.
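For a very large dataset, the dev and test sets can be a small fraction of the data. Here is a sketch of such a split with NumPy, assuming the data is already in arrays X and y; the 1%/1% fractions are just an example:

```python
import numpy as np

def split_train_dev_test(X, y, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle the data and carve out small dev/test sets from a large dataset."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))

    n_dev = int(len(X) * dev_frac)
    n_test = int(len(X) * test_frac)

    dev_idx = idx[:n_dev]
    test_idx = idx[n_dev:n_dev + n_test]
    train_idx = idx[n_dev + n_test:]
    return (X[train_idx], y[train_idx]), (X[dev_idx], y[dev_idx]), (X[test_idx], y[test_idx])

# With 1,000,000 examples this gives roughly 980,000 train / 10,000 dev / 10,000 test.
```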

Comparing to human-level performance

Why human-level performance?

In the last few years, a lot more machine learning teams have been talking about comparing machine learning systems to human-level performance. There are two main reasons: ML algorithms have become good enough to be competitive with humans on many tasks, and the workflow of building an ML system turns out to be much more efficient when the task is one that humans can also do.

For a lot of machine learning tasks, progress tends to be relatively rapid as you approach human-level performance. But after a while the algorithm surpasses human-level performance, and progress in accuracy slows down.

Over time, as you keep training the algorithm, the performance approaches but never surpasses some theoretical limit, which is called the Bayes optimal error (think of this as the best possible error).

According to a Stack Overflow answer:

Bayes error is the lowest possible prediction error that can be achieved and is the same as irreducible error. If one would know exactly what process generates the data, then errors will still be made if the process is random. This is also what is meant by "y is inherently stochastic".
For example, when flipping a fair coin, we know exactly what process generates the outcome (a binomial distribution). However, if we were to predict the outcome of a series of coin flips, we would still make errors, because the process is inherently random (i.e. stochastic).
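As a small illustration of irreducible error (not from the notes), a quick simulation shows that even the best possible prediction strategy for a fair coin is wrong about half the time:

```python
import numpy as np

rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=100_000)  # fair coin: 0 = tails, 1 = heads

# The best any predictor can do is always guess the more likely outcome.
# For a fair coin both outcomes are equally likely, so the Bayes error is 0.5.
best_guess = 1
error_rate = np.mean(flips != best_guess)
print(error_rate)  # ~0.5 -- no model can do better on average
```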

Also read:

Understanding Bayes Error: How a low cost machine learning strategy could have a big impact
Summary: Machine learning typically outperforms humans where the problem involves structured data rather than perception. Knowing this, failing to implement a strategy could prove hugely expensive to any business.
https://www.linkedin.com/pulse/understanding-bayes-error-how-low-cost-machine-learning-malcolm-mason?articleId=6468081107404492800

Why compare ML to human-level performance?

Avoidable Bias

Avoidable bias is the difference between the training error and the Bayes error (or an approximation of Bayes error). You want to keep improving your training performance until you get down to Bayes error, but not beyond it: the only way to do better than Bayes error on the training set is to overfit.

Understanding human-level performance

What is "human-level" error?

In the medical image classification example from the lecture, a typical human gets 3% error, a typical doctor 1%, an experienced doctor 0.7%, and a team of experienced doctors 0.5%. The human optimal error (the team of doctors) is thus 0.5% or lower, so we know that the Bayes error is $\le 0.5\%$.

Improving your model performance

Getting a supervised learning algorithm to work well means fundamentally two things:

Reducing avoidable bias (fit the training set well) and reducing variance (make the training set performance generalize well to the dev/test set).
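A small sketch of this diagnosis, using the best human error as a proxy for Bayes error; the helper function is made up, and the example numbers are taken from the quiz below:

```python
def diagnose(human_errors, train_error, dev_error):
    """Use human-level performance as a Bayes-error proxy to decide what to work on."""
    bayes_proxy = min(human_errors)        # best human performance approximates Bayes error
    avoidable_bias = train_error - bayes_proxy
    variance = dev_error - train_error

    if avoidable_bias > variance:
        focus = "reduce bias: bigger network, train longer, better architecture"
    else:
        focus = "reduce variance: more data, regularization, dropout"
    return avoidable_bias, variance, focus

# Bird-watching experts at 0.3% and 0.5% error, training error 2%, dev error 2.1%
print(diagnose([0.003, 0.005], 0.02, 0.021))
# -> avoidable bias ~1.7%, variance ~0.1%, so focus on reducing bias
```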

Q & A

Bird recognition in the city of Peacetopia (case study)

Problem Statement

  1. This example is adapted from a real production application, but with details disguised to protect confidentiality.

You are a famous researcher in the City of Peacetopia. The people of Peacetopia have a common characteristic: they are afraid of birds. To save them, you have to build an algorithm that will detect any bird flying over Peacetopia and alert the population.

The City Council gives you a dataset of 10,000,000 images of the sky above Peacetopia, taken from the city’s security cameras. They are labelled y = 1 if the image contains a bird and y = 0 if it does not.

Your goal is to build an algorithm able to classify new images taken by security cameras from Peacetopia.

There are a lot of decisions to make:

Metric of success

The City Council tells you that they want an algorithm that

  1. Has high accuracy
  1. Runs quickly and takes only a short time to classify a new image.
  1. Can fit in a small amount of memory, so that it can run in a small processor that the city will attach to many different security cameras.

Note: Having three evaluation metrics makes it harder for you to quickly choose between two different algorithms, and will slow down the speed with which your team can iterate. True/False?

2. After further discussions, the city narrows down its criteria to:

If you had the three following models, which one would you choose?

Test Accuracy: 98%

Runtime: 9 sec

Memory size: 9MB

Justification: Correct! As soon as the runtime is less than 10 seconds you're good. So, you may simply maximize the test accuracy after you made sure the runtime is <10sec.

3. Based on the city’s requests, which of the following would you say is true?

4. Structuring your data

Before implementing your algorithm, you need to split your data into train/dev/test sets. Which of these do you think is the best choice?

Train: 9,500,000

Dev: 250,000

Test: 250,000

5. After setting up your train/dev/test sets, the City Council comes across another 1,000,000 images, called the “citizens’ data”. Apparently the citizens of Peacetopia are so scared of birds that they volunteered to take pictures of the sky and label them, thus contributing these additional 1,000,000 images. These images are different from the distribution of images the City Council had originally given you, but you think it could help your algorithm.

Notice that adding this additional data to the training set will make the distribution of the training set different from the distributions of the dev and test sets.

Is the following statement true or false?

"You should not add the citizens' data to the training set, because if the training distribution is different from the dev and test sets, then this will not allow the model to perform well on the test set."

6. One member of the City Council knows a little about machine learning, and thinks you should add the 1,000,000 citizens’ data images to the test set. You object because:

7. You train a system, and its errors are as follows (error = 100%-Accuracy):

Dev set error: 4.5%

Training set error: 4%

This suggests that one good avenue for improving performance is to train a bigger network so as to drive down the 4.0% training error. Do you agree?

8. You ask a few people to label the dataset so as to find out what is human-level performance. You find the following levels of accuracy:

Bird watching expert #1: 0.3% error

Bird watching expert #2: 0.5% error

Normal person #1 (not a bird watching expert): 1.0% error

Normal person #2 (not a bird watching expert): 1.2% error

If your goal is to have “human-level performance” be a proxy (or estimate) for Bayes error, how would you define “human-level performance”?

9. Which of the following statements do you agree with?

10. You find that a team of ornithologists debating and discussing an image gets an even better 0.1% performance, so you define that as “human-level performance.” After working further on your algorithm, you end up with the following:

Human-level performance: 0.1%

Training set error: 2%

Dev set error: 2.1%

Based on the evidence you have, which two of the following four options seem the most promising to try? (Check two options.)

11. You also evaluate your model on the test set, and find the following:

Human-level performance: 0.1%

Training set error: 2%

Dev set error: 2.1%

Test set error: 7%

What does this mean? (Check the two best options.)

12. After working on this project for a year, you finally achieve:

Human-level performance: 0.10%

Training set error: 0.05%

Dev set error: 0.05%

What can you conclude? (Check all that apply.)

13. It turns out Peacetopia has hired one of your competitors to build a system as well. Your system and your competitor both deliver systems with about the same running time and memory size. However, your system has higher accuracy! However, when Peacetopia tries out your and your competitor’s systems, they conclude they actually like your competitor’s system better, because even though you have higher overall accuracy, you have more false negatives (failing to raise an alarm when a bird is in the air). What should you do?

14. You’ve handily beaten your competitor, and your system is now deployed in Peacetopia and is protecting the citizens from birds! But over the last few months, a new species of bird has been slowly migrating into the area, so the performance of your system slowly degrades because your system is being tested on a new type of data.

You have only 1,000 images of the new species of bird. The city expects a better system from you within the next 3 months. Which of these should you do first?

15. The City Council thinks that having more Cats in the city would help scare off birds. They are so happy with your work on the Bird detector that they also hire you to build a Cat detector. (Wow Cat detectors are just incredibly useful aren’t they.) Because of years of working on Cat detectors, you have such a huge dataset of 100,000,000 cat images that training on this data takes about two weeks. Which of the statements do you agree with? (Check all that agree.)