Finding Waldo with Hyperparameterization

Photo by Markus Winkler on Unsplash

Back to the middle and around again… I’m going to be there to the end… 100%…. accuracy is the goal. (With apologies to Crystal Waters…) In essence though, we are all trying to find Waldo with our neural nets — training them for some outcome. Waldo has to be a defined function here though — what exactly is the outcome we are looking for? How do we want to measure success?

Generally we want to maximize some measure of correct inferences for a particular classification neural net. (Other types of neural nets have different measures.) Accuracy is the proportion of inferences that a neural network gets correct. A neural network that gets 100% of the training data correct, but only 10% correct on data it has never seen, is pretty much worthless. It is overfitted to the training data. (Or perhaps it learned on bad data, but we’ll go into that in another therapy session.)

So, breaking it down, every neural network has the same basic characteristics:

  • Inputs — these are numbers. Not an image, not data, just numbers. The neural net is nothing but a giant number crunching machine. You ever stop and think what Charles Babbage would be thinking if he saw this?
  • Weighted Sum — some sort of weighted sum of these inputs
  • Activation Function — a function that is applied to the weighted sum
  • Output — an output from the Activation Function. For some types of neural nets, this is… a Yes/No-ish answer. For others, it is reassembled into a picture or other data. And for other neural nets, it’s… just a number, fed to something else, or used directly.
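The four pieces above can be sketched in a few lines of plain Python — a toy single neuron with made-up weights and inputs, just to make the “giant number crunching machine” idea concrete:

```python
import math

def neuron(inputs, weights, bias):
    # Weighted Sum: multiply each input by its weight and add them up
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation Function: squash the sum into (0, 1) with a sigmoid
    return 1 / (1 + math.exp(-weighted_sum))

# Inputs are just numbers -- a 3-"pixel" example with made-up weights
output = neuron([0.5, 0.1, 0.9], [0.4, -0.2, 0.7], bias=0.1)
print(output)  # a single number between 0 and 1 -- a Yes/No-ish answer
```

A real network is just many of these stacked together, but every layer is doing exactly this: weighted sum, then activation.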

Inside the neural net though is a set of parameters — these are chosen by the network itself. During the backpropagation phase, the neural network optimizes these parameters to bring the output of the activation function more into line with the Ground Truth (also known as the training data, labeled by a human to provide an actual Answer).

On the other hand, we have the hyperparameters. These are like parameters, except the network never learns them itself — you set them before training ever starts, and they control the training process from above, hence the “hyper”. Don’t worry though — these hyperparameters have amazing properties that can control the neural net as a whole.

Types of hyperparameters:

  • Learning Rate — This controls how quickly the neural net adjusts weights. (Or, put another way, controls how much adjustment to each weight at each backprop phase.) Too big, and you overshoot the loss minimum; too small, and you never… ever… get there.
  • Number of hidden layers — The number of magical hidden layers in a neural net. These layers, when misused, can lead to overspecialization of the neural net to the training data.
  • Momentum — Like learning rate, this controls how the neural net adjusts weights. Unlike learning rate, it is not related to the value of adjustment, but the direction of the adjustment. Think of it as a type of memory that says — if we got closer on the last adjustment, let’s keep trying that direction. Don’t randomly reverse the direction of adjustment.
  • Batch Size — This is how you split up the training data. For very large training data sets, this keeps your neural net from getting overloaded and basically freaking out.
  • Epochs — Once you’ve done a single pass through your data, you’ve completed 1 epoch. A single pass can be defined as a certain number of iterations made up of X batches, etc. Using multiple epochs can provide you with a way to get a certain metric, such as accuracy, closer to 100%. Or it can provide you with a way to overspecialize your neural net.
  • There are other hyperparameters —
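Learning rate and momentum are easiest to feel in code. Here’s a minimal sketch (a toy 1-D loss, not a real neural net) showing how the two hyperparameters interact in a plain SGD-with-momentum update:

```python
# Toy gradient descent on loss(w) = (w - 3)^2, whose minimum is at w = 3.
# learning_rate scales each adjustment; momentum keeps pushing in the
# direction that worked last time, instead of randomly reversing course.
learning_rate = 0.1
momentum = 0.9

w, velocity = 0.0, 0.0
for _ in range(200):
    grad = 2 * (w - 3)                               # gradient of (w - 3)^2
    velocity = momentum * velocity - learning_rate * grad
    w += velocity                                    # take the step

print(w)  # converges toward 3.0
```

Crank learning_rate up too far and w overshoots back and forth forever; crank it down too far and 200 steps won’t be nearly enough — exactly the trade-off described above.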

In order for us to have a good way to determine whether we are getting better or worse, we need to have a way to measure the output of our neural net. Like parameters, hyperparameters, and even types of neural nets, there are dozens of these.

  • Accuracy — this is the number of cases the neural net gets the right answer on vs the total number of cases. To arrive at this information, you must have labeled data held out to check the results against.
  • Loss — this is essentially a measure of how far away your activation function is from the expected result. If, for instance, you were building a binary classification system of hamsters and snakes, and your neural net gave you an 85% chance the cute image of the dwarf hamster was a hamster, then (by a simple error measure) your loss is .15, or 15%. (Real loss functions, like cross-entropy, are a bit more involved, but the idea is the same.)
  • Precision — Precision is how many True Positives we get, expressed as a percentage of how many total positives (True + False) the neural net said there would be. This might be a better indicator of neural net performance, especially if the cost of a false positive was high (for instance, putting someone in prison because the model said to do so).
  • There are many, many others, given by type of neural network. Don’t believe me, believe Microsoft! Or any of the other very smart people on the web…
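Accuracy and precision fall straight out of counting, which a short sketch makes obvious (toy labels and predictions for the hamster-vs-snake classifier above):

```python
# Toy results for a binary "hamster (1) vs. snake (0)" classifier
labels      = [1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 0, 1, 1, 0, 1, 0]

# Accuracy: correct inferences / total inferences
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Precision: true positives / everything the net CALLED positive
true_pos  = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
false_pos = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
precision = true_pos / (true_pos + false_pos)

print(accuracy, precision)  # 0.75 0.75
```

Note the one false positive in there — if a false positive is expensive (prison!), precision is the number to watch, not accuracy.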

Now that we know how to measure basic neural network performance, let’s get started on finding Waldo.

The code samples from here on out can be found at:

For our first trick, we need to define metrics to run our neural net against. Otherwise, we’ll just be bouncing around, trying to guess if we’re doing better or not.

We will be using the Azure environment here, so we’ll also need to be working with those APIs.

The first thing we need to do is to define a metric. Since we are working with MNIST digit classification, let’s just start there.

After that, we’ll show how to log those metrics in Azure ML environment.

And lastly, we’ll setup hyperparameters to try and find better classification.

We’ll do this in both local and Azure — so you choose where to run this. Remember, running locally is a great way to speed up development, but depending on how big your data set is, or how large the search space is, you may want to do your final training on Azure. With MNIST, we should be okay — but hey, let’s give it a shot and find out! (In a future article, I REALLY want to combine Nvidia Jetson with Azure Local Targets!)

Before we can get anywhere with hyperparameters, we need to define some metrics.

To do so, we’ll use the syntax ‘run.log()’. This will allow us to log metrics with labels to the Azure world.

run = Run.get_context()
run.log("loss", loss)

The first line gets the context of the current run. It can be used from the training script to log into the current run context. This represents 1 run of an experiment. The second line logs a metric, ‘loss’, into that run.

Note that with allow_offline set to True (the default in Run.get_context), we are able to run this file locally, and we will see additional logging output.

Attempted to log scalar metric loss:

You can also run this on the Azure ML cluster we created earlier. To do so, we need to add azureml libraries to our ‘pytorch-env.yaml’ file:

- pip:
  - azureml-sdk
  - azureml-defaults

Now, once we run the file, when your run is complete, click on the Metrics tab in the Experiment URL provided by your driver script.

You will see something like this:

When we log metrics using the ‘Run’ module, we now get automatic charts! Cool, right! Even more cool is what will happen later, when we see hyperparameter runs happen.

At the same time, we want to make sure we are all set to run our driver script locally, as our driver script is what is going to do our hyperparameter search. Note — to do inference on a local computer, you need to have Docker installed. More on that in a different article. For now, we’ll be doing training on a local computer.

So, for the purposes of this tutorial, I’ve created a new file which will show this. In actuality, a command line argument and common code, or a variable and common code, is the right way to go.

First, we set compute_target='local' in our ScriptRunConfig setup. This tells the Azure environment to run things locally.

Then, we run the file locally. That should be it. But… while this file works from Powershell, it does not work from WSL. The job is started and registered, but never completed. I’m working on that — but for now, let’s move on. :(

(If you’d like to follow my saga:


To run from powershell:

  • Install Python 3.8 from the Windows store
  • Open Powershell
  • Run pip install azureml-sdk torch (just like in WSL!)
  • Use Explorer to Map a network drive to \\wsl$\<your distro>
  • Navigate to your files using Powershell
  • python

Once a local job is complete, all logs are available in the Azure portal — just like as if you had run them in the cloud. This makes using local jobs amazing — you can save money and still track your runs!

Now that we can run it locally and in the cloud — let’s get back to finding an optimal solution using hyperparameters!

First, we need to analyze our model.

Looking at our file, we have a simple convolutional neural net. For the purposes of this experiment, let’s just modify 2 simple hyperparameters:

  • learning rate — this is how quickly the neural network adjusts the weights.
  • momentum — this keeps a neural net adjusting weights in the same direction it was previously.

Entire reams of paper could be written about hyperparameters — so let’s just keep it simple.

First, we’ll need to parameterize the training file — this is so that the driver script can pass various different parameters to search different spaces.

Let’s add some code to accept input for the momentum and the learning rate:

parser = argparse.ArgumentParser()
parser.add_argument('--learning_rate', type=float, default=0.001, help='Learning rate for SGD')
parser.add_argument('--momentum', type=float, default=0.9, help='Momentum for SGD')
args = parser.parse_args()

We can set download to False in our call to torchvision.datasets.CIFAR10 — we’ve already got the storage allocated and available. We’ll go into other ways to add data to our storage in the future, but for now, we’ll just use the default datastore.

Lastly, we need to change our optimizer to accept arguments:

optimizer = optim.SGD(model.parameters(), lr=args.learning_rate, momentum=args.momentum)  # 'model' here is the network defined earlier in the training script

This will allow us to pass through the arguments from our driver script to the training script.

One important call here is ‘run.log’ — this is used to log metrics, as we discussed earlier. In this run, it will log the loss, which allows our optimization script to find the best run by the loss metric.

If we wanted to search for other parameters, we could easily do so — but we would have to log them here, or the hyperparameter search function won’t be able to utilize them.
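The logging pattern itself is worth seeing in isolation. Below is a minimal sketch using a stand-in for the AzureML Run object (in the real training script, Run.get_context() provides it; the FakeRun class is purely illustrative) — the point is that logging the same metric name repeatedly builds up a series of values per run:

```python
# Stand-in for the AzureML Run object -- just enough to show the pattern.
# In the real training script, `run = Run.get_context()` provides this.
class FakeRun:
    def __init__(self):
        self.metrics = {}

    def log(self, name, value):
        # Logging the same name repeatedly builds up a series of values
        self.metrics.setdefault(name, []).append(value)

run = FakeRun()
for epoch in range(3):
    loss = 2.0 / (epoch + 1)   # pretend the loss shrinks each epoch
    run.log("loss", loss)      # the same call our training script makes

print(run.metrics["loss"])     # three loss values, one per epoch
```

This is why, later on, the best run reports a whole list of loss values rather than a single number.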

Second, we’ll need to create a new driver file to drive the training script. This file will call the training script with different parameters, searching over the search space to find the “best” parameters.

In order to define “best”, we need to do some thinking — are we interested in say, loss? Or would F1 or accuracy be a better measure? What about regression analysis?

Since our current MNIST is setup to use loss, for now, we’ll just stick to that. (Although a valid argument could be made for doing something like… accuracy for number identification.)

We interrupt this article for an important word about Estimators. Many examples that you will find online use Estimators. However, on August 25th, 2020, a Microsoft employee noted that Estimators are being deprecated, and ScriptRunConfig has taken over.

I don’t really care either way, but I did pull my hair out for a while, trying to map the 2 onto each other, as various examples had differing usages. Hopefully this helps you out a bit.

At a high level, we need to do the following:

  • Define the parameters that our search will loop over and the search space. There are multiple ways to define a search space — we’ll see examples later.
  • Define a policy for the engine to terminate when things go really bad — if we aren’t even close to where we want to be, just eject and get out now.
  • Define a config which will be used as the base configuration — this is the run that each search run will kick off with new parameters.
  • Define a primary metric (and any additional metrics that are important) for our search to use to measure the runs against.
  • Define how many runs, how many in parallel, and other run specific metrics.

With that, let’s get started!

First, let’s define those parameters!

In order to define the parameters, we have to know what types of parameters they are:

  • Discrete — these are parameters that are a choice. Does our neural net like chocolate or vanilla? We’re not looking to mix things, just have one or another. In this category, we can have various discrete distributions, ranges of integers, or even an arbitrary list of objects.
  • Continuous — these hyperparameters are continuous — i.e., they could be anything on a range from, say, 1–2. Or, it could be chocolate, vanilla, or any mix of them, with varying parts of chocolate and vanilla. In this category, we have various types of functions that return a value in the continuous search space.
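In Azure ML terms (assuming the azureml-sdk hyperdrive module), a discrete space is typically declared with choice() and a continuous one with uniform() — a quick sketch, with hyperparameter names chosen for illustration:

```python
from azureml.train.hyperdrive import choice, uniform

search_space = {
    # Discrete: pick exactly one flavor -- chocolate or vanilla
    "batch_size": choice(16, 32, 64),
    # Continuous: any value in the range, mixes allowed
    "learning_rate": uniform(0.001, 0.1),
}
```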

Additionally, we need to tell the driving script how to search parameters. In a continuous space, it would get very expensive to try every value from, say, 0–1. There are infinitely many of them. Go look it up if you don’t believe me.

Azure supports the following types of search functions:

  • Random — this is exactly what it sounds like. A toddler bangs on the keyboard, randomly picking data and handing it to the training script. This can be used with both discrete and continuous hyperparameters.
  • Grid — This only supports discrete parameters. It makes a table of all possible combinations of the search space, and tries… all of them. It could get rather expensive.
  • Bayesian — This uses a Bayesian optimization algorithm to search your parameters. The more runs that have gone before, the better the latest run will perform. (Theoretically.) It does take a lot of runs (20 or more) — essentially you’re using math AI to find the best Neural AI. Fun, right? (Another reason Bayesian can be more expensive is because you cannot terminate runs early — they HAVE to run to completion.)

So, let’s start by adding this code snippet to a new driver script:

param_sampling = RandomParameterSampling({
    "learning_rate": uniform(.001, .1),
    "momentum": uniform(.5, 1)
})

This sets up a random sampler, which will search 2 uniform distributions to try and find a value that works very well.

Next, let’s define an early termination policy. An early termination policy is pretty much like getting fired from your job because you just didn’t work out… on your second day.

Azure ML supports the following types of termination policies:

  • Bandit — the Bandit policy compares runs to the best performing run. Every “N time intervals”, the system checks the primary metric. If the primary metric isn’t within the allowable range, as compared to the best run, the run is terminated.
  • Median — the Median policy uses a running average of the other runs’ primary metrics. Any time the current run’s primary metric drops below the running average of all training runs, the current run gets terminated.
  • Truncation — the Truncation policy samples all current runs every “N time intervals”. The lowest X% are terminated.
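For reference, all three policies come from the same hyperdrive module (assuming the azureml-sdk names; the slack and truncation values here are illustrative, not recommendations):

```python
from azureml.train.hyperdrive import (
    BanditPolicy, MedianStoppingPolicy, TruncationSelectionPolicy)

# Bandit: terminate any run outside a slack range of the best run so far
bandit = BanditPolicy(slack_factor=0.1, evaluation_interval=1)

# Median: terminate any run that drops below the running average of all runs
median = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

# Truncation: terminate the worst X% of runs at each evaluation interval
truncation = TruncationSelectionPolicy(truncation_percentage=20,
                                       evaluation_interval=1)
```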

Microsoft has some data from their studies, and they claim that using a Median termination policy with an evaluation interval of 1 and a delay evaluation of 5 provides 25–35% cost savings with no loss of training accuracy.

Defining an early termination policy is pretty easy:

early_termination_policy = MedianStoppingPolicy(evaluation_interval=1, delay_evaluation=5)

And done — we’ll just stick with what Microsoft recommends. For once. :)

Next, we need to define the primary metric. The system needs to know WHAT it is measuring our search space for.

If you remember, previously, we added this line to our script:

run.log('loss', loss)

This logs metrics for us to use — in this case, it logs the loss.

So our training script is all set!

But we do need to tell the driver script exactly what metric we care about and what we want to do with it:


These will go in our HyperDriveConfig later — essentially we can tell the system to maximize or minimize something. For instance, we might want to maximize accuracy, or minimize hidden nodes.
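In HyperDriveConfig terms (assuming the azureml-sdk names), that boils down to two values — a metric name matching whatever the training script logs, plus a goal:

```python
from azureml.train.hyperdrive import PrimaryMetricGoal

# The name must match the training script's run.log('loss', loss) call
primary_metric_name = "loss"
primary_metric_goal = PrimaryMetricGoal.MINIMIZE  # or MAXIMIZE, e.g. for accuracy
```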

The last thing we have to decide on is how many total runs, and how many concurrent runs.

Many of the Early termination policies measure against other runs, either previous runs, or current runs. We also don’t want to spin up hundreds of resources in the cloud.

Since we have our compute target defined as a 4 node cluster, we’ll set:

max_concurrent_runs = 4

You need enough resources in your compute target to run that many jobs at the same time, or it won’t work — but you can always run fewer at a time than you have resources for.

And, since we don’t want to run forever, we’ll set:

max_total_runs = 16
With that done, we need to tell the driver script how to run the training script, by setting up a ScriptRunConfig:

config = ScriptRunConfig(source_directory='src', script='train.py', compute_target='local')  # script name here is assumed -- use your training script
hd_config = HyperDriveConfig(run_config=config, hyperparameter_sampling=param_sampling, policy=early_termination_policy, primary_metric_name='loss', primary_metric_goal=PrimaryMetricGoal.MINIMIZE, max_total_runs=16, max_concurrent_runs=max_concurrent_runs)

Essentially, the HyperDriveConfig sets up the search space, and then calls the individual run using the ScriptRunConfig.

Okay! We’re almost there!

One small change from before: instead of submitting ‘config’ — the individual run — we need to submit the entire HyperDriveConfig to run:

run = experiment.submit(hd_config)

Lastly, we probably want to print out some metrics. Azure includes a lot of widgets and things like that to do stuff like this, we’ll go over some of that in other sessions — but for now, here we are:

print("Done with RUN, Best Run:")
best_run = run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
print('Best Run Id: ', best_run.id)
print('\n Loss:', best_run_metrics['loss'])

The above will get the best run, as the hyperparameter search determined it, and print out its metrics.

With that done, let’s run it!

Go ahead and run it any way you like — I’ll use the terminal, but you can use VS Code, right click the file, and hit “Run File in Terminal”.

Once you get the URL, you’ll notice… it’s not exiting. That’s because we asked it to wait till the entire run was complete. The run won’t be complete till all the child runs are done. A child run is a single run that the HyperDriveConfig has kicked off.

Anyway, head to the URL, and you’ll notice a tab now has information!

What you’re seeing here is a bunch of data. Data is easy, information is hard — and right now, there’s not a lot of information. Let’s go through what you’re seeing.

  1. At the bottom is the run list — all the child runs for this batch. We set it up to do 4 at a time. When all of these runs are complete, the optimization ends and HyperDrive picks the best run for you.
  2. The graph on the left shows the primary metric — in our case, the LOSS.
  3. The graph on the right shows the search space — including the loss. Over time, this will pick up more runs.

Now you can see some cool things!

  1. Canceled runs! Their loss fell below the running average of the other runs — remember our Median Early Termination Policy?
  2. A graph on the right of the learning rate and momentum vs. the loss. Data visualization is fun — and Azure has a lot of it!
  3. Lastly, the run is getting a “better and better” loss. You can see the individual loss go down and down and down.

If you want to play with things, you can do a lot here:

  • Try changing the chart type in the upper right — scatter plots galore! This is useful if you are dealing with a complex search space.
  • On the top left, you can add a filter by status — this is useful to only see runs of a certain type, for instance, completed.
  • Click on a child run ID, and view the logs — just like we used to do for a regular run.

Eventually, we’ll reach a point, where things are done:

[2020-11-12T21:44:40.720608][GENERATOR][INFO]Max number of jobs '16' reached for experiment.

You’ll see this printed out on the terminal screen:

Done with RUN, Best Run:
Best Run Id: HD_f58ef62b-7ae6-47cc-b3eb-b06cb5c74091_5
Loss: [2.264762686252594, 2.255858953118324, 2.0913370432257654, 2.031358771383762, 2.013495835483074, 2.0463720375299452, 1.9966653327941895, 2.038774788945913, 2.0014820760190486, 2.0269744057655332, 2.040849109262228, 2.033858101397753]

Why so many Loss values? That’s the loss measured at every interval. This becomes clearer when we drill down into the child runs and click metrics:

Notice how the loss kind of bounces around…?

So how does this compare to our single run a long time ago?

Notice how this one, with values lr=.001, and momentum=.9 is still better?

I know, I know… you’re asking why…

Take a look at our learning rate and momentum:

You can see that they are starting to get closer and closer to our static values. MNIST is pretty well understood — and so it’s hard to beat. Also, I didn’t want to spend a lot of time/cash on this. (One of the values of local runs is that you can use HyperDrive to run on your own computer — if your computer can handle the data that you’re dealing with…)

Over time, and searching the search space, we would have got there.

So that’s that!

We found Waldo using HyperParameters!





Years of technology experience have given me a unique perspective on many things, including parenting, climate change, etc. Or maybe I’m just opinionated.
