GANs on the Azure (ML) Sea — Part 2


Okay, our initial result in part 1 of this article wasn’t great. Honestly, it could probably use more training. But our point here is not just to train the heck out of this thing; it’s to move it to Azure.

This is a continuation of https://allangraves.medium.com/gans-on-the-azure-ml-sea-part-1-e3af65061900.

So, let’s get started.

For this, we’ll need a driver script, and a new training script. We could easily use our old training script, but since these files are used for learning, I’d like to separate out the new pieces we’ll add.

The last thing we’ll do is upload our data set and prevent Azure from copying it from our local directory.

The new training script is called ‘train-dcgan_azure.py’, and the new driver script is called 06-dcgan_azure.py.

The new driver script is pretty simple, so we won’t go into it.

Right now, we’re running on the CPU cluster. Later, we’ll run on the GPU cluster, just to see if there’s any difference in time taken.

Flip back to our train-dcgan_azure.py file. Here are our goals to “azurify” it:

  • Log loss and other metrics from training runs
  • Log images from training runs
  • Read the training data from Azure storage

These will teach us the basic components we need to build more and more fantastic beasts.

Let’s start with the loss and metrics. If you recall from: https://allan-graves.medium.com/finding-waldo-with-hyperparameterization-29b6dbfb1888, we can get a run context, and add metrics easily using that.

We will need to import Run and add a line to get the run context:

from azureml.core import Run
run = Run.get_context()

Now that we have that, at the end of every run, we can easily log and see where things are going.

Our original code printed this:

[0/5][0/487]    Loss_D: 1.7856  Loss_G: 3.7816  D(x): 0.4119    D(G(z)): 0.4626 / 0.0339

Every 50 batches, we printed out some stats.

Now, we’re also going to log those stats.

Note that this could easily be used in hyperparameter optimization!

run.log("loss_d", errD.item())
run.log("loss_g", errG.item())
run.log("out_d_real", D_x)
run.log("out_d_fake", D_G_z1)

That’s it — all it takes to make Azure track this!
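To show how these calls might sit inside the training loop, here is a minimal sketch. The `log_gan_metrics` helper and the `RecordingRun` class are hypothetical names, not part of the original script or the Azure SDK. `RecordingRun` only mimics `run.log(name, value)` so the example runs without a workspace; in the real script you would pass the object returned by `Run.get_context()`.

```python
class RecordingRun:
    """Hypothetical stand-in for an Azure ML Run; it only mimics run.log()."""

    def __init__(self):
        self.metrics = {}

    def log(self, name, value):
        # Repeated values under one name build up a series, which is
        # what lets Azure ML draw a chart for that metric.
        self.metrics.setdefault(name, []).append(value)


def log_gan_metrics(run, batch_index, loss_d, loss_g, d_x, d_g_z1, every=50):
    """Log the four stats on every `every`-th batch; return True if logged."""
    if batch_index % every != 0:
        return False
    run.log("loss_d", loss_d)
    run.log("loss_g", loss_g)
    run.log("out_d_real", d_x)
    run.log("out_d_fake", d_g_z1)
    return True


run = RecordingRun()
for i in range(120):
    # Dummy numbers stand in for errD.item(), errG.item(), D_x and D_G_z1.
    log_gan_metrics(run, i, 1.0 / (i + 1), 2.0, 0.5, 0.4)

print(len(run.metrics["loss_d"]))  # one entry each for batches 0, 50 and 100
```

Logging on the same schedule as the print statement keeps the charts and the text log in step with each other.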

Next, let’s discuss logging images — wouldn’t it be cool to see our images morph over time?

What we’re looking for here is the log_image method of the Azure Run SDK. Just like ‘log’, log_image lets us store images with the run so we can view them later.

log_image takes in a path to an image. We’ll add the following log_image call to our script:

vutils.save_image(img_list[-1], f"sample{i}.png", normalize=True)
run.log_image(f"sample{i}", path=f"sample{i}.png")

This allows us to see the images in the run log and compare them from run to run.
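One thing to watch: assuming `i` is the batch index within an epoch (as in the progress line earlier), the same file name comes around again in every epoch. The sketch below is one possible way to keep names unique by including the epoch. `sample_name`, `log_sample`, `RecordingRun` and `write_placeholder` are all hypothetical names for illustration; `write_placeholder` stands in for `vutils.save_image`, and `RecordingRun` only mimics `run.log_image(name, path=...)`.

```python
import os
import tempfile


def sample_name(epoch, batch):
    """Build a name that stays unique across epochs, e.g. 'sample_e0_b500'."""
    return f"sample_e{epoch}_b{batch}"


def log_sample(run, save_fn, epoch, batch, out_dir):
    """Save one sample grid with save_fn, then hand the file to run.log_image."""
    name = sample_name(epoch, batch)
    path = os.path.join(out_dir, name + ".png")
    save_fn(path)                   # in the real script: vutils.save_image(...)
    run.log_image(name, path=path)  # log_image takes a name and a file path
    return path


class RecordingRun:
    """Hypothetical stand-in for an Azure ML Run; it only mimics log_image()."""

    def __init__(self):
        self.images = []

    def log_image(self, name, path=None):
        self.images.append((name, path))


def write_placeholder(path):
    # Stand-in for vutils.save_image so the sketch runs without torchvision.
    with open(path, "wb") as handle:
        handle.write(b"not a real PNG")


run = RecordingRun()
out_dir = tempfile.mkdtemp()
saved = [log_sample(run, write_placeholder, epoch, 500, out_dir)
         for epoch in range(2)]
```

With the epoch in the name, the image from batch 500 of epoch 0 and the one from batch 500 of epoch 1 end up as two separate files and two separate log entries.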

Lastly, we need to get our training data uploaded. For that, we’ll use Azure Storage Explorer: https://azure.microsoft.com/en-us/features/storage-explorer/

Go ahead, download and install it.

Select “Add an Azure Account”, and, unless you’re on something special, just select Azure.

Sign in. Select the subscriptions you want to show resources from. Hit ‘Apply’.

Once you’ve finalized your resources, expand the blob container, till you see something like this:

That’s the default azureml-blobstore or datastore. In this case, I’ve not created any additional directories or other pathnames.

In our driver script, we need to add in a call to access this:

datastore = ws.get_default_datastore()
dataset = Dataset.File.from_files(path=(datastore, 'data'))

The first call there gets our default datastore. A data store is just a place to store data. By default, your setup includes a file store and a blob store. The blob store is faster, according to the documentation. The file store is for unstructured data. You can allocate other file and blob stores in the future — group them by experiment, resource group, etc.

A dataset is a factory that provides common interfaces for accessing data. Data sets can be registered, so you can access them by name (for instance, if a whole class or an entire department is using the same data set), or built directly from files, as we do here.

And then a call to pass it through to the training script:

arguments=[
'--data_path', dataset.as_named_input('input').as_mount(),
]

This code merely sets up a mount point so that the container itself can see the data in this blob.
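For orientation, here is a rough sketch of how these pieces might fit together in a driver script. It is not the author's 06-dcgan_azure.py. The function name `submit_dcgan_run`, the cluster name "cpu-cluster", the experiment name "dcgan-azure", the environment name and the source directory are all assumptions; substitute whatever your workspace uses. It relies on the azureml-core (SDK v1) package and a config.json for your workspace.

```python
def submit_dcgan_run(compute_target="cpu-cluster", experiment_name="dcgan-azure"):
    """Submit train-dcgan_azure.py with the 'data' folder mounted as input."""
    # Imported here so this module still loads where azureml-core is missing.
    from azureml.core import (Dataset, Environment, Experiment,
                              ScriptRunConfig, Workspace)

    ws = Workspace.from_config()  # reads the workspace details from config.json
    datastore = ws.get_default_datastore()
    dataset = Dataset.File.from_files(path=(datastore, "data"))

    # Assumption: pytorch-env.yaml sits next to the driver script.
    env = Environment.from_conda_specification(
        name="pytorch-env", file_path="pytorch-env.yaml")

    config = ScriptRunConfig(
        source_directory=".",            # assumption: training script lives here
        script="train-dcgan_azure.py",
        compute_target=compute_target,   # assumption: name of your cluster
        environment=env,
        arguments=["--data_path",
                   dataset.as_named_input("input").as_mount()],
    )
    run = Experiment(workspace=ws, name=experiment_name).submit(config)
    print(run.get_portal_url())          # link to the run in Azure ML studio
    return run
```

Calling `submit_dcgan_run()` from the directory that holds config.json would submit the job; passing a different `compute_target` is one way to switch to a GPU cluster later.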

Over in the training script, we’ll add in the following code to list the data path and accept it as an argument to the script:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--data_path', type=str, help='Path to the training data')
args = parser.parse_args()
print("===== DATA =====")
print("DATA PATH: " + args.data_path)
print("================")

And one last change:

dataroot = args.data_path + "/celeba"

This points our dataroot at the celeba folder inside the passed-in path.
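A slightly more defensive variant of the argument handling is sketched below. The helper names `parse_args` and `resolve_dataroot` are hypothetical, and a temporary directory stands in for the mounted datastore so the example runs locally. Failing early when the folder is missing gives a clearer message than an error deep inside the data loader.

```python
import argparse
import os
import tempfile


def parse_args(argv=None):
    """Parse --data_path; argv=None means 'read sys.argv' as usual."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", type=str, required=True,
                        help="Path to the training data")
    return parser.parse_args(argv)


def resolve_dataroot(data_path, subdir="celeba"):
    """Join the mount point and the image folder; fail early if it is missing."""
    dataroot = os.path.join(data_path, subdir)
    if not os.path.isdir(dataroot):
        raise FileNotFoundError(f"No '{subdir}' folder under {data_path}")
    return dataroot


# Simulate the mounted datastore with a temporary directory.
mount_point = tempfile.mkdtemp()
os.makedirs(os.path.join(mount_point, "celeba"))

args = parse_args(["--data_path", mount_point])
dataroot = resolve_dataroot(args.data_path)
print("DATA PATH:", dataroot)
```

In the training script itself you would call `parse_args()` with no arguments so that it reads the `--data_path` value the driver passes in.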

Now, we just need to upload our data!

In Azure Storage Explorer, navigate to the data directory. It should be in the root of your blobstore.

If it’s not there, click “New Folder” and enter ‘data’ as the name. You’ll find yourself in a blank screen after you hit ‘OK’; this is the new folder. Right now it’s virtual, and it won’t be pushed to the cloud until you actually put something in it.

Click the “Upload” button, hitting the drop down arrow on it, and select “Upload Folder”. Hit the ‘…’ and select the celeba data directory we were using for Part 1 of this tutorial.

Hit Upload, and wait!

When it’s all done, this is what you should see:

If you see an error like this when you try and run your code:

Traceback (most recent call last):
  File "/home/agraves/.local/lib/python3.8/site-packages/dotnetcore2/runtime.py", line 263, in attempt_get_deps
    blob_deps_to_file()
  File "/home/agraves/.local/lib/python3.8/site-packages/dotnetcore2/runtime.py", line 255, in blob_deps_to_file
    blob = request.urlopen(deps_url, context=ssl_context)
  File "/usr/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/usr/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.8/urllib/request.py", line 569, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.8/urllib/request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 404: Not Found

It means you’re trying to open the wrong datastore, or you haven’t actually uploaded the data. Double-check the ‘data’ path in the driver script and the ‘/celeba’ suffix in the training script: do they match the location where you uploaded the files?

Note: storage costs money. So when you’re done with your experiments, clean it up! Put your toys away.

The last thing we want to do is to tell Azure ML to NOT transfer our local data directory — we already have our stuff up in the cloud.

To do so, we’ll use a .amlignore file, which we will put in the top-level directory. This will tell Azure to ignore the aux scripts and the data directory.

data/
aux/

For more, see https://docs.microsoft.com/en-us/azure/machine-learning/concept-azure-machine-learning-architecture#snapshots.

Now, we’re ready to run!

At this point, I feel the need to point out an error I began receiving on scripts that were working. I’m using WSL2 with Ubuntu 20.04.

NotImplementedError: Unsupported Linux distribution ubuntu 20.04

This was strange. I couldn’t figure out whether it was my environment or something else. However, going back and running older scripts that had once worked showed that they were broken now too.

Welcome to the cloud.

So, I moved to Ubuntu 18.04, and things are working again. It sometimes runs on the cloud!


Anyway, back to our setup. Taking a look at the resulting URL from running on the command line, we notice that the job failed!!!

Head over to ‘Outputs + logs’, and check 70_driver_log.txt.

You’ll notice this:

Finishing unmounting /tmp/tmpl_0ixjzi.
Exit __exit__ of DatasetContextManager
Traceback (most recent call last):
  File "train-dcgan_azure.py", line 16, in <module>
    import matplotlib.pyplot as plt
ModuleNotFoundError: No module named 'matplotlib'

Whoops! We need to add that to our environment YAML file!

Pull up ‘pytorch-env.yaml’, and add the following under dependencies:

- matplotlib

Save the file, and try again.

As you look at the ‘70_driver_log.txt’ output, you’ll see this line:

After variable expansion, calling script [ train-dcgan_azure.py ] with arguments: ['--data_path', '/tmp/tmp0ehpqhoe']

This line calls our training script and passes in the data path we just mounted. The path is mounted into the Docker container, so our script can now access any files in it!

Another thing you’ll notice is this!

We have image output!

That’s the first sample for our run.

We got this through the run.log_image calls we put in our training script!

And, in our ‘70_driver_log.txt’:

Starting Training Loop...
[0/5][0/545] Loss_D: 2.0239 Loss_G: 3.4270 D(x): 0.3267 D(G(z)): 0.4405 / 0.0469

The first output of our training loop.

Please see: https://allan-graves.medium.com/gans-on-the-azure-ml-sea-part-1-e3af65061900 for more on what these images mean.

In this shortened run, we never quite reached our final destination, and the output images aren’t that great.

But under ‘logs’, you’ll see all the images and other text logs. Under ‘Metrics’, you’ll see the charts, automatically generated from the values we logged during the run.

Further exploration could use hyperparameter tuning to search for better neural net layouts or better values for the other hyperparameters.

