Ax Service API with RayTune on PyTorch CNN

Ax integrates easily with different scheduling frameworks and distributed training frameworks. In this example, Ax-driven optimization is executed in a distributed fashion using RayTune.

RayTune is a scalable framework for hyperparameter tuning that provides many state-of-the-art hyperparameter tuning algorithms and scales from a laptop to a distributed cluster with fault tolerance. RayTune leverages Ray's Actor API to provide asynchronous parallel and distributed execution.

Ray 'Actors' are a simple and clean abstraction for replicating your Python classes across multiple workers and nodes. Each hyperparameter evaluation is executed asynchronously on a separate Ray actor and reports intermediate training progress back to RayTune. RayTune then uses this information to perform actions such as early termination, re-prioritization, or checkpointing.
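To make early termination concrete, the following sketch applies a simplified median stopping rule to a few hypothetical learning curves. It is only an illustration under stated assumptions: the curves, the `median_stopping` helper and its `grace_epochs`/`min_others` parameters are made up here, and RayTune's own schedulers (such as its `MedianStoppingRule`) differ in the details.

```python
from statistics import median

# Hypothetical learning curves: trial -> accuracy reported after each epoch.
# In RayTune, each curve would be streamed asynchronously from its own actor.
curves = {
    "trial_a": [0.60, 0.80, 0.90, 0.95],
    "trial_b": [0.20, 0.25, 0.30, 0.32],
    "trial_c": [0.55, 0.75, 0.88, 0.93],
    "trial_d": [0.25, 0.40, 0.45, 0.50],
}


def median_stopping(curves, grace_epochs=1, min_others=3):
    """Simplified median stopping rule (hypothetical, not Ray's implementation).

    From `grace_epochs` on, a running trial is stopped when its result at the
    current epoch is below the median of the other running trials' results at
    that epoch. A comparison needs at least `min_others` other running trials.
    Returns a dict mapping each trial to the epoch it was stopped at, or None.
    """
    stopped = {name: None for name in curves}
    num_epochs = max(len(c) for c in curves.values())
    for epoch in range(grace_epochs, num_epochs):
        # Snapshot the running trials so decisions within one epoch are independent.
        running = [n for n, c in curves.items() if stopped[n] is None and epoch < len(c)]
        for name in running:
            others = [curves[o][epoch] for o in running if o != name]
            if len(others) >= min_others and curves[name][epoch] < median(others):
                stopped[name] = epoch
    return stopped


print(median_stopping(curves))
# {'trial_a': None, 'trial_b': 1, 'trial_c': None, 'trial_d': 1}
```

The two weak trials are terminated after their second report, which frees their workers for new parameterizations; the two promising trials run to completion.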

In [1]:

import logging

from ray import tune
from ray.tune import report
from ray.tune.suggest.ax import AxSearch

logger = logging.getLogger(tune.__name__)
logger.setLevel(
    level=logging.CRITICAL
)  # Reduce the number of Ray warnings that are not relevant here.


In [2]:

import numpy as np
import torch
from ax.plot.contour import plot_contour
from ax.plot.trace import optimization_trace_single_method
from ax.service.ax_client import AxClient
from ax.utils.notebook.plotting import init_notebook_plotting, render
from ax.utils.tutorials.cnn_utils import CNN, evaluate, load_mnist, train

init_notebook_plotting()

[INFO 06-30 21:34:59] ax.utils.notebook.plotting: Injecting Plotly library into cell. Do not overwrite or delete cell.

1. Initialize client

We set enforce_sequential_optimization to False because Ray runs many trials in parallel. With sequential optimization enforced, AxClient would expect the first few trials to be completed with data before generating more trials.

When high parallelism is not required, it is best to enforce sequential optimization, as it typically reaches good results in fewer (but sequential) trials. In cases where parallelism is important, such as distributed training with Ray, we forgo minimizing resource utilization and run more trials in parallel.
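The difference between the two modes can be sketched with a toy scheduler. Everything in it is hypothetical (`ToyGenerationStrategy`, `DataRequiredError`, and the rule that the model-based phase needs all initial trials to have data); it only mimics a "Sobol for 5 trials, GPEI for subsequent trials" schedule like the one AxClient logs in step 2, assuming that without enforcement new trials keep coming from the quasi-random phase until enough data arrives.

```python
class DataRequiredError(Exception):
    """Raised when a new trial is requested before enough trials have data."""


class ToyGenerationStrategy:
    """Hypothetical, simplified model of a 'Sobol for N trials, then GPEI' schedule.

    This is not Ax's implementation. Here the model-based (GPEI) phase needs
    `num_sobol` completed trials; Ax's real threshold may differ.
    """

    def __init__(self, num_sobol=5, enforce_sequential=True):
        self.num_sobol = num_sobol
        self.enforce_sequential = enforce_sequential
        self.generated = 0
        self.completed = 0

    def next_trial(self):
        """Return the phase ("sobol" or "gpei") that generates the next trial."""
        if self.generated < self.num_sobol:
            phase = "sobol"
        elif self.completed < self.num_sobol:
            if self.enforce_sequential:
                # Sequential mode: wait for data before any model-based trial.
                raise DataRequiredError(
                    f"only {self.completed}/{self.num_sobol} initial trials have data"
                )
            # Parallel mode: keep exploring quasi-randomly instead of blocking.
            phase = "sobol"
        else:
            phase = "gpei"
        self.generated += 1
        return phase

    def complete_trial(self):
        """Record that one running trial finished and reported data."""
        self.completed += 1


# Sequential: the 6th request is rejected until the first trials report data.
strict = ToyGenerationStrategy(enforce_sequential=True)
phases = [strict.next_trial() for _ in range(5)]
blocked = False
try:
    strict.next_trial()
except DataRequiredError as err:
    blocked = True
    print("blocked:", err)

# Parallel (as with Ray): requests are never rejected for lack of data.
relaxed = ToyGenerationStrategy(enforce_sequential=False)
relaxed_phases = [relaxed.next_trial() for _ in range(7)]
for _ in range(5):
    relaxed.complete_trial()
after_data = relaxed.next_trial()
print(relaxed_phases, after_data)
```

In the sequential case the caller has to wait; in the parallel case workers stay busy, at the cost of some trials being chosen without the benefit of a fitted model.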

In [3]:

ax = AxClient(enforce_sequential_optimization=False)

[INFO 06-30 21:34:59] ax.service.ax_client: Starting optimization with verbose logging. To disable logging, set the `verbose_logging` argument to `False`. Note that float values in the logs are rounded to 2 decimal points.

2. Set up experiment

Here we set up the search space and specify the objective; refer to the Ax API tutorials for more detail.

In [4]:

MINIMIZE = False  # Whether we should be minimizing or maximizing the objective

In [5]:

ax.create_experiment(
    name="mnist_experiment",
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-6, 0.4], "log_scale": True},
        {"name": "momentum", "type": "range", "bounds": [0.0, 1.0]},
    ],
    objective_name="mean_accuracy",
    minimize=MINIMIZE,
)

[INFO 06-30 21:34:59] ax.service.utils.instantiation: Inferred value type of ParameterType.FLOAT for parameter lr. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 06-30 21:34:59] ax.service.utils.instantiation: Inferred value type of ParameterType.FLOAT for parameter momentum. If that is not the expected value type, you can explicity specify 'value_type' ('int', 'float', 'bool' or 'str') in parameter dict.
[INFO 06-30 21:34:59] ax.modelbridge.dispatch_utils: Using GPEI (Bayesian optimization) since there are more continuous parameters than there are categories for the unordered categorical parameters.
[INFO 06-30 21:34:59] ax.modelbridge.dispatch_utils: Using Bayesian Optimization generation strategy: GenerationStrategy(name='Sobol+GPEI', steps=[Sobol for 5 trials, GPEI for subsequent trials]). Iterations after 5 will take longer to generate due to  model-fitting.
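Setting "log_scale": True for lr means the range [1e-6, 0.4] is explored on a logarithmic scale, so small learning rates receive as much attention as large ones. The sketch below shows the idea with plain log-uniform sampling; it illustrates the concept only and is not Ax's Sobol generator, and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value uniformly in log space between `low` and `high` (both > 0)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-6, 0.4, rng) for _ in range(10_000)]

# Expected fraction below 1e-3 is log(1e-3/1e-6) / log(0.4/1e-6), about 0.54,
# whereas uniform sampling on [1e-6, 0.4] would give only about 0.0025.
frac_small = sum(s < 1e-3 for s in samples) / len(samples)
print(round(frac_small, 2))
```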

In [6]:

ax.experiment.optimization_config.objective.minimize

Out[6]:

False

In [7]:

load_mnist(data_path="~/.data")  # Pre-load the dataset before the initial evaluations are executed.


Out[7]:

(<torch.utils.data.dataloader.DataLoader at 0x7f0afa577bd0>,
 <torch.utils.data.dataloader.DataLoader at 0x7f0af84bec90>,
 <torch.utils.data.dataloader.DataLoader at 0x7f0afa564ad0>)

3. Define how to evaluate trials

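Since we use the Ax Service API here, we evaluate the parameterizations that Ax suggests using RayTune. The evaluation function follows the usual pattern of taking in a parameterization, but instead of returning the objective value it reports it to RayTune via report(mean_accuracy=...); the reported key must match the objective_name passed to create_experiment. For details on evaluation functions, see Trial Evaluation.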

In [8]:

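def train_evaluate(parameterization):
    # Use a GPU if one is visible to this trial's worker.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Each trial runs in its own Ray worker, so it loads the (pre-downloaded) dataset itself.
    train_loader, valid_loader, test_loader = load_mnist(data_path="~/.data")
    # Train a fresh CNN with the `lr` and `momentum` values suggested by Ax.
    net = train(
        net=CNN(),
        train_loader=train_loader,
        parameters=parameterization,
        dtype=torch.float,
        device=device,
    )
    # Report validation accuracy to RayTune; the key must match the Ax objective name.
    report(
        mean_accuracy=evaluate(
            net=net,
            data_loader=valid_loader,
            dtype=torch.float,
            device=device,
        )
    )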
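The contract that train_evaluate follows can be tried without Ray or PyTorch. In the sketch below, `toy_train_evaluate`, its accuracy formula, and the `collect` callback are invented for illustration; `collect` stands in for ray.tune.report. The point is that the function reports a dictionary keyed by mean_accuracy (the objective_name given to create_experiment) rather than returning a value.

```python
import math


def toy_train_evaluate(parameterization, report):
    """Hypothetical stand-in for train_evaluate: no training, just a made-up
    accuracy surface over `lr` and `momentum`."""
    lr = parameterization["lr"]
    momentum = parameterization["momentum"]
    # Invented surface that peaks at lr=1e-2 and momentum=0.
    accuracy = 0.97 - 0.02 * (math.log10(lr) + 2) ** 2 - 0.1 * momentum ** 2
    # Report instead of return, keyed by the Ax objective name.
    report(mean_accuracy=max(accuracy, 0.0))


reported = []


def collect(**metrics):
    """Stands in for ray.tune.report: just records what was reported."""
    reported.append(metrics)


toy_train_evaluate({"lr": 1e-2, "momentum": 0.0}, report=collect)
toy_train_evaluate({"lr": 1e-5, "momentum": 0.9}, report=collect)
print(reported)
```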

4. Run optimization

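Execute the Ax optimization and trial evaluation in RayTune using the AxSearch search algorithm. Because the experiment, including its objective and minimize setting, was already created with AxClient, AxSearch must not also be given mode or metric: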

In [9]:

tune.run(
    train_evaluate,
    num_samples=30,
    search_alg=AxSearch(
        ax_client=ax,
        max_concurrent=3,
        # Don't pass `mode` or `metric`: the experiment was created through `AxClient`,
        # so AxSearch takes the objective and its direction (`minimize=MINIMIZE`) from it
        # and raises a ValueError if these arguments are also given.
    ),
    verbose=0,  # Set this level to 1 to see status updates and to 2 to also see trial results.
    # To use GPU, specify: resources_per_trial={"gpu": 1}.
)
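Conceptually, AxSearch with max_concurrent=3 and num_samples=30 runs an ask/tell loop: keep at most three trials in flight, and each time one finishes, record its result and ask for another parameterization (with the Service API directly, this corresponds to AxClient.get_next_trial and AxClient.complete_trial). The stdlib sketch below mimics that loop with threads; `suggest` is a hypothetical random-search stand-in for Ax's Bayesian optimization, and `toy_evaluate` uses a made-up objective.

```python
import math
import random
import threading
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

NUM_SAMPLES = 30    # mirrors num_samples=30 in tune.run
MAX_CONCURRENT = 3  # mirrors max_concurrent=3 in AxSearch

rng = random.Random(0)
lock = threading.Lock()
running = 0  # trials currently being evaluated
peak = 0     # highest number of simultaneous trials observed


def suggest():
    """Hypothetical stand-in for Ax: random search over the same search space."""
    lr = 10 ** rng.uniform(math.log10(1e-6), math.log10(0.4))
    return {"lr": lr, "momentum": rng.uniform(0.0, 1.0)}


def toy_evaluate(params):
    """Hypothetical stand-in for train_evaluate with a made-up objective."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    time.sleep(0.01)  # pretend to train
    score = 0.9 - 0.1 * params["momentum"]
    with lock:
        running -= 1
    return params, score


results = []
with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
    pending = set()
    submitted = 0
    while submitted < NUM_SAMPLES or pending:
        # "Ask": top up to MAX_CONCURRENT trials in flight.
        while submitted < NUM_SAMPLES and len(pending) < MAX_CONCURRENT:
            pending.add(pool.submit(toy_evaluate, suggest()))
            submitted += 1
        # "Tell": as soon as any trial finishes, record its result.
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        results.extend(f.result() for f in done)

best_params, best_score = max(results, key=lambda r: r[1])
print(len(results), peak, round(best_score, 3))
```

Submitting a new trial as soon as any one finishes, rather than waiting for a whole batch, is what keeps all workers busy when trial durations vary.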

5. Retrieve the optimization results

In [10]:

best_parameters, values = ax.get_best_parameters()
best_parameters

Out[10]:

{'lr': 0.0022512732432975525, 'momentum': 3.990447334350426e-15}

In [11]:

means, covariances = values
means

Out[11]:

{'mean_accuracy': 0.9661663659714574}

6. Plot the response surface and optimization trace

In [12]:

render(
    plot_contour(
        model=ax.generation_strategy.model,
        param_x="lr",
        param_y="momentum",
        metric_name="mean_accuracy",
    )
)

In [13]:

# `optimization_trace_single_method` expects a 2-d array of means, because it averages means
# across multiple optimization runs, so we wrap our array of per-trial objectives in another array.
best_objectives = np.array(
    [[trial.objective_mean * 100 for trial in ax.experiment.trials.values()]]
)
best_objective_plot = optimization_trace_single_method(
    y=np.maximum.accumulate(best_objectives, axis=1),
    title="Model performance vs. # of iterations",
    ylabel="Accuracy",
)
render(best_objective_plot)
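The trace above uses np.maximum.accumulate to turn per-trial accuracies into a best-so-far curve. For a single run, the same operation can be written with itertools.accumulate; the accuracy values below are made up for illustration.

```python
from itertools import accumulate

# Hypothetical per-trial accuracies, in the order the trials were run.
accuracies = [0.50, 0.90, 0.70, 0.95, 0.93]

# Best-so-far trace: same idea as np.maximum.accumulate(..., axis=1) on one run.
best_so_far = list(accumulate(accuracies, max))
print(best_so_far)  # [0.5, 0.9, 0.9, 0.95, 0.95]
```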
