Ax Service API with RayTune on PyTorch CNN¶

Ax integrates easily with a range of scheduling and distributed training frameworks. In this example, Ax-driven optimization is executed in a distributed fashion using RayTune.

RayTune is a scalable framework for hyperparameter tuning that provides many state-of-the-art hyperparameter tuning algorithms and scales from a laptop to a distributed cluster with fault tolerance. RayTune leverages Ray's Actor API to provide asynchronous parallel and distributed execution.

Ray 'Actors' are a simple and clean abstraction for replicating your Python classes across multiple workers and nodes. Each hyperparameter evaluation is asynchronously executed on a separate Ray actor and reports intermediate training progress back to RayTune. Upon reporting, RayTune uses this information to perform actions such as early termination, re-prioritization, or checkpointing.
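To make the asynchronous pattern concrete, here is a minimal sketch that uses the standard library's thread pool as a stand-in for Ray actors. It is not Ray's API: `evaluate`, `candidates`, and the toy score are hypothetical, and the sleep only simulates training time. The point it illustrates is that results are collected as each evaluation finishes, so completion order need not match submission order.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def evaluate(trial_index, lr):
    """Hypothetical stand-in for a training run; sleeps instead of training."""
    # Sleep durations differ per trial, so completion order need not match submission order.
    time.sleep(0.01 * (3 - trial_index))
    return trial_index, 1.0 - abs(lr - 0.01)  # toy "accuracy", invented for illustration


candidates = {0: 0.1, 1: 0.01, 2: 0.001}  # trial index -> learning rate

completed = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(evaluate, i, lr) for i, lr in candidates.items()]
    # Handle each result as soon as its worker reports back.
    for future in as_completed(futures):
        trial_index, score = future.result()
        completed[trial_index] = score

best_trial = max(completed, key=completed.get)
```

In the real setup, RayTune plays the role of the pool and can additionally act on intermediate reports, which this sketch does not model.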

In [1]:

import logging
from ray import tune
from ray.tune import track
from ray.tune.suggest.ax import AxSearch
logger = logging.getLogger(tune.__name__)  
logger.setLevel(level=logging.CRITICAL)  # Reduce the number of Ray warnings that are not relevant here.

In [2]:

import torch
import numpy as np

from ax.plot.contour import plot_contour
from ax.plot.trace import optimization_trace_single_method
from ax.service.ax_client import AxClient
from ax.utils.notebook.plotting import render, init_notebook_plotting
from ax.utils.tutorials.cnn_utils import CNN, load_mnist, train, evaluate


init_notebook_plotting()

[INFO 05-14 21:26:36] ipy_plotting: Injecting Plotly library into cell. Do not overwrite or delete cell.

1. Initialize client¶

We specify enforce_sequential_optimization as False, because Ray runs many trials in parallel. With the sequential optimization enforcement, AxClient would expect the first few trials to be completed with data before generating more trials.

When high parallelism is not required, it is best to enforce sequential optimization, as it allows for achieving optimal results in fewer (but sequential) trials. In cases where parallelism is important, such as with distributed training using Ray, we choose to forego minimizing resource utilization and run more trials in parallel.

In [3]:

ax = AxClient(enforce_sequential_optimization=False)

[INFO 05-14 21:26:36] ax.service.ax_client: Starting optimization with verbose logging. To disable logging, set the `verbose_logging` argument to `False`. Note that float values in the logs are rounded to 2 decimal points.

2. Set up experiment¶

Here we set up the search space and specify the objective; refer to the Ax API tutorials for more detail.

In [4]:

ax.create_experiment(
    name="mnist_experiment",
    parameters=[
        {"name": "lr", "type": "range", "bounds": [1e-6, 0.4], "log_scale": True},
        {"name": "momentum", "type": "range", "bounds": [0.0, 1.0]},
    ],
    objective_name="mean_accuracy",
)

[INFO 05-14 21:26:36] ax.modelbridge.dispatch_utils: Using Bayesian Optimization generation strategy: GenerationStrategy(name='Sobol+GPEI', steps=[Sobol for 5 trials, GPEI for subsequent trials]). Iterations after 5 will take longer to generate due to  model-fitting.
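As a rough illustration of what `"log_scale": True` implies for `lr`, the sketch below draws values uniformly in log10 space over the same bounds. This is an assumption-laden stand-in: Ax's initial trials come from a Sobol sequence, not from Python's `random` module, and `sample_log_uniform` is a hypothetical helper. It does show why many learning rates appear as `0.0` in logs that round floats to 2 decimal points.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose log10 is uniform between log10(low) and log10(high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10 ** exponent


rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_log_uniform(1e-6, 0.4, rng) for _ in range(1000)]

# Values below 0.005 are displayed as 0.0 once rounded to 2 decimal points.
share_below_0_005 = sum(s < 0.005 for s in samples) / len(samples)
```

On a log scale, the interval from 1e-6 to 0.005 covers roughly two thirds of the search range, so a similar share of suggested learning rates is very small.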

3. Define how to evaluate trials¶

Since we use the Ax Service API here, we evaluate the parameterizations that Ax suggests, using RayTune. The evaluation function follows its usual pattern, taking in a parameterization and outputting an objective value. For detail on evaluation functions, see Trial Evaluation.

In [5]:

def train_evaluate(parameterization):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    train_loader, valid_loader, test_loader = load_mnist(data_path='~/.data')
    net = train(net=CNN(), train_loader=train_loader, parameters=parameterization, dtype=torch.float, device=device)
    track.log(
        mean_accuracy=evaluate(
            net=net,
            data_loader=valid_loader,
            dtype=torch.float,
            device=device,
        )
    )
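For a quick smoke test of the optimization loop without downloading MNIST, a cheap stand-in with the same input shape can be useful. The sketch below is hypothetical: `toy_evaluate` and its response surface are invented for illustration and say nothing about real CNN accuracy. Unlike `train_evaluate`, it returns the metric instead of reporting it through `track.log`.

```python
import math


def toy_evaluate(parameterization):
    """Hypothetical stand-in objective taking the same dict as train_evaluate."""
    lr = parameterization["lr"]
    momentum = parameterization["momentum"]
    # Invented smooth surface peaking at lr=1e-3, momentum=0.7; not a model of MNIST accuracy.
    lr_term = (math.log10(lr) + 3.0) ** 2
    momentum_term = (momentum - 0.7) ** 2
    score = max(0.0, 1.0 - 0.05 * lr_term - 0.5 * momentum_term)
    return {"mean_accuracy": score}


at_peak = toy_evaluate({"lr": 1e-3, "momentum": 0.7})
off_peak = toy_evaluate({"lr": 0.3, "momentum": 0.1})
```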

4. Run optimization¶

Execute the Ax optimization and trial evaluation in RayTune using the AxSearch algorithm:

In [6]:

tune.run(
    train_evaluate, 
    num_samples=30, 
    search_alg=AxSearch(ax),  # Note that the argument here is the `AxClient`.
    verbose=0,  # Set this level to 1 to see status updates and to 2 to also see trial results.
    # To use GPU, specify: resources_per_trial={"gpu": 1}.
)

2020-05-14 21:26:36,342	INFO resource_spec.py:212 -- Starting Ray with 4.35 GiB memory available for workers and up to 2.19 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-05-14 21:26:36,691	INFO services.py:1170 -- View the Ray dashboard at localhost:8265
[INFO 05-14 21:26:36] ax.service.ax_client: Generated new trial 0 with parameters {'lr': 0.0, 'momentum': 0.7}.
[INFO 05-14 21:26:37] ax.service.ax_client: Generated new trial 1 with parameters {'lr': 0.0, 'momentum': 0.96}.
[INFO 05-14 21:26:37] ax.service.ax_client: Generated new trial 2 with parameters {'lr': 0.39, 'momentum': 0.88}.
[INFO 05-14 21:26:37] ax.service.ax_client: Generated new trial 3 with parameters {'lr': 0.0, 'momentum': 0.28}.
[INFO 05-14 21:26:37] ax.service.ax_client: Generated new trial 4 with parameters {'lr': 0.0, 'momentum': 0.64}.
[INFO 05-14 21:26:37] ax.service.ax_client: Generated new trial 5 with parameters {'lr': 0.26, 'momentum': 0.38}.
[INFO 05-14 21:26:37] ax.service.ax_client: Generated new trial 6 with parameters {'lr': 0.0, 'momentum': 0.25}.
[INFO 05-14 21:26:38] ax.service.ax_client: Generated new trial 7 with parameters {'lr': 0.0, 'momentum': 0.97}.
[INFO 05-14 21:26:38] ax.service.ax_client: Generated new trial 8 with parameters {'lr': 0.0, 'momentum': 0.69}.
[INFO 05-14 21:26:38] ax.service.ax_client: Generated new trial 9 with parameters {'lr': 0.0, 'momentum': 0.43}.
[INFO 05-14 21:26:38] ax.service.ax_client: Generated new trial 10 with parameters {'lr': 0.07, 'momentum': 0.04}.
[INFO 05-14 21:26:38] ax.service.ax_client: Generated new trial 11 with parameters {'lr': 0.0, 'momentum': 0.02}.
[INFO 05-14 21:26:38] ax.service.ax_client: Generated new trial 12 with parameters {'lr': 0.0, 'momentum': 0.49}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 13 with parameters {'lr': 0.0, 'momentum': 0.22}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 14 with parameters {'lr': 0.01, 'momentum': 0.52}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 15 with parameters {'lr': 0.0, 'momentum': 0.11}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 16 with parameters {'lr': 0.0, 'momentum': 0.03}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 17 with parameters {'lr': 0.16, 'momentum': 0.62}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 18 with parameters {'lr': 0.0, 'momentum': 0.63}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 19 with parameters {'lr': 0.0, 'momentum': 0.75}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 20 with parameters {'lr': 0.0, 'momentum': 0.49}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 21 with parameters {'lr': 0.0, 'momentum': 0.85}.
[INFO 05-14 21:26:39] ax.service.ax_client: Generated new trial 22 with parameters {'lr': 0.0, 'momentum': 0.94}.
[INFO 05-14 21:26:40] ax.service.ax_client: Generated new trial 23 with parameters {'lr': 0.01, 'momentum': 0.66}.
[INFO 05-14 21:26:40] ax.service.ax_client: Generated new trial 24 with parameters {'lr': 0.29, 'momentum': 0.13}.
[INFO 05-14 21:26:40] ax.service.ax_client: Generated new trial 25 with parameters {'lr': 0.07, 'momentum': 0.22}.
[INFO 05-14 21:26:40] ax.service.ax_client: Generated new trial 26 with parameters {'lr': 0.01, 'momentum': 0.33}.
[INFO 05-14 21:26:40] ax.service.ax_client: Generated new trial 27 with parameters {'lr': 0.0, 'momentum': 0.58}.
[INFO 05-14 21:26:40] ax.service.ax_client: Generated new trial 28 with parameters {'lr': 0.01, 'momentum': 0.37}.
[INFO 05-14 21:26:40] ax.service.ax_client: Generated new trial 29 with parameters {'lr': 0.0, 'momentum': 0.02}.

(pid=4100) 2020-05-14 21:26:42,871	INFO trainable.py:217 -- Getting current IP.
(pid=4100) Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /home/travis/.data/MNIST/raw/train-images-idx3-ubyte.gz
(pid=4099) 2020-05-14 21:26:42,897	INFO trainable.py:217 -- Getting current IP.
(pid=4099) Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to /home/travis/.data/MNIST/raw/train-images-idx3-ubyte.gz
(pid=4100) Extracting /home/travis/.data/MNIST/raw/train-images-idx3-ubyte.gz to /home/travis/.data/MNIST/raw
(pid=4099) Extracting /home/travis/.data/MNIST/raw/train-images-idx3-ubyte.gz to /home/travis/.data/MNIST/raw
(pid=4100) Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /home/travis/.data/MNIST/raw/train-labels-idx1-ubyte.gz
(pid=4099) Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to /home/travis/.data/MNIST/raw/train-labels-idx1-ubyte.gz
(pid=4100) Extracting /home/travis/.data/MNIST/raw/train-labels-idx1-ubyte.gz to /home/travis/.data/MNIST/raw
(pid=4100) Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /home/travis/.data/MNIST/raw/t10k-images-idx3-ubyte.gz
(pid=4099) Extracting /home/travis/.data/MNIST/raw/train-labels-idx1-ubyte.gz to /home/travis/.data/MNIST/raw
(pid=4099) Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to /home/travis/.data/MNIST/raw/t10k-images-idx3-ubyte.gz
(pid=4100) Extracting /home/travis/.data/MNIST/raw/t10k-images-idx3-ubyte.gz to /home/travis/.data/MNIST/raw
(pid=4099) Extracting /home/travis/.data/MNIST/raw/t10k-images-idx3-ubyte.gz to /home/travis/.data/MNIST/raw
(pid=4100) Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /home/travis/.data/MNIST/raw/t10k-labels-idx1-ubyte.gz
(pid=4099) Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to /home/travis/.data/MNIST/raw/t10k-labels-idx1-ubyte.gz
(pid=4100) Extracting /home/travis/.data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/travis/.data/MNIST/raw
(pid=4100) Processing...
(pid=4099) Extracting /home/travis/.data/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/travis/.data/MNIST/raw
(pid=4099) Processing...
(pid=4099) /pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning:
(pid=4099) 
(pid=4099) The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
(pid=4099) 
(pid=4100) /pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning:
(pid=4100) 
(pid=4100) The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
(pid=4100) 
(pid=4100) Done!
(pid=4099) Done!

[INFO 05-14 21:27:01] ax.service.ax_client: Completed trial 1 with data: {'mean_accuracy': (0.91, 0.0)}.
[INFO 05-14 21:27:01] ax.service.ax_client: Completed trial 0 with data: {'mean_accuracy': (0.91, 0.0)}.

(pid=4170) 2020-05-14 21:27:04,305	INFO trainable.py:217 -- Getting current IP.
(pid=4169) 2020-05-14 21:27:04,297	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:27:21] ax.service.ax_client: Completed trial 2 with data: {'mean_accuracy': (0.1, 0.0)}.
[INFO 05-14 21:27:21] ax.service.ax_client: Completed trial 3 with data: {'mean_accuracy': (0.87, 0.0)}.

(pid=4179) 2020-05-14 21:27:23,521	INFO trainable.py:217 -- Getting current IP.
(pid=4212) 2020-05-14 21:27:24,536	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:27:40] ax.service.ax_client: Completed trial 4 with data: {'mean_accuracy': (0.11, 0.0)}.
[INFO 05-14 21:27:42] ax.service.ax_client: Completed trial 5 with data: {'mean_accuracy': (0.09, 0.0)}.

(pid=4238) 2020-05-14 21:27:43,987	INFO trainable.py:217 -- Getting current IP.
(pid=4250) 2020-05-14 21:27:45,119	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:28:01] ax.service.ax_client: Completed trial 6 with data: {'mean_accuracy': (0.92, 0.0)}.
[INFO 05-14 21:28:02] ax.service.ax_client: Completed trial 7 with data: {'mean_accuracy': (0.88, 0.0)}.

(pid=4271) 2020-05-14 21:28:04,379	INFO trainable.py:217 -- Getting current IP.
(pid=4282) 2020-05-14 21:28:05,216	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:28:21] ax.service.ax_client: Completed trial 8 with data: {'mean_accuracy': (0.97, 0.0)}.
[INFO 05-14 21:28:22] ax.service.ax_client: Completed trial 9 with data: {'mean_accuracy': (0.36, 0.0)}.

(pid=4323) 2020-05-14 21:28:24,521	INFO trainable.py:217 -- Getting current IP.
(pid=4334) 2020-05-14 21:28:25,613	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:28:41] ax.service.ax_client: Completed trial 10 with data: {'mean_accuracy': (0.1, 0.0)}.
[INFO 05-14 21:28:43] ax.service.ax_client: Completed trial 11 with data: {'mean_accuracy': (0.89, 0.0)}.

(pid=4356) 2020-05-14 21:28:45,108	INFO trainable.py:217 -- Getting current IP.
(pid=4367) 2020-05-14 21:28:46,141	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:29:03] ax.service.ax_client: Completed trial 12 with data: {'mean_accuracy': (0.91, 0.0)}.
[INFO 05-14 21:29:03] ax.service.ax_client: Completed trial 13 with data: {'mean_accuracy': (0.31, 0.0)}.

(pid=4393) 2020-05-14 21:29:06,639	INFO trainable.py:217 -- Getting current IP.
(pid=4394) 2020-05-14 21:29:06,906	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:29:24] ax.service.ax_client: Completed trial 15 with data: {'mean_accuracy': (0.92, 0.0)}.
[INFO 05-14 21:29:24] ax.service.ax_client: Completed trial 14 with data: {'mean_accuracy': (0.11, 0.0)}.

(pid=4402) 2020-05-14 21:29:27,163	INFO trainable.py:217 -- Getting current IP.
(pid=4436) 2020-05-14 21:29:27,839	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:29:44] ax.service.ax_client: Completed trial 16 with data: {'mean_accuracy': (0.8, 0.0)}.
[INFO 05-14 21:29:45] ax.service.ax_client: Completed trial 17 with data: {'mean_accuracy': (0.1, 0.0)}.

(pid=4458) 2020-05-14 21:29:48,236	INFO trainable.py:217 -- Getting current IP.
(pid=4459) 2020-05-14 21:29:48,550	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:30:06] ax.service.ax_client: Completed trial 18 with data: {'mean_accuracy': (0.92, 0.0)}.
[INFO 05-14 21:30:06] ax.service.ax_client: Completed trial 19 with data: {'mean_accuracy': (0.87, 0.0)}.

(pid=4467) 2020-05-14 21:30:08,654	INFO trainable.py:217 -- Getting current IP.
(pid=4501) 2020-05-14 21:30:09,780	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:30:26] ax.service.ax_client: Completed trial 20 with data: {'mean_accuracy': (0.24, 0.0)}.
2020-05-14 21:30:27,499	WARNING worker.py:1090 -- The actor or task with ID ffffffffffffffffacdc00e00100 is pending and cannot currently be scheduled. It requires {CPU: 1.000000} for execution and {CPU: 1.000000} for placement, but this node only has remaining {CPU: 1.000000}, {node:10.30.0.181: 1.000000}, {memory: 4.345703 GiB}, {object_store_memory: 1.464844 GiB}. In total there are 0 pending tasks and 1 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.
[INFO 05-14 21:30:27] ax.service.ax_client: Completed trial 21 with data: {'mean_accuracy': (0.92, 0.0)}.

(pid=4527) 2020-05-14 21:30:29,676	INFO trainable.py:217 -- Getting current IP.
(pid=4537) 2020-05-14 21:30:30,456	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:30:47] ax.service.ax_client: Completed trial 22 with data: {'mean_accuracy': (0.94, 0.0)}.
[INFO 05-14 21:30:47] ax.service.ax_client: Completed trial 23 with data: {'mean_accuracy': (0.12, 0.0)}.

(pid=4560) 2020-05-14 21:30:50,543	INFO trainable.py:217 -- Getting current IP.
(pid=4561) 2020-05-14 21:30:50,625	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:31:08] ax.service.ax_client: Completed trial 25 with data: {'mean_accuracy': (0.1, 0.0)}.
[INFO 05-14 21:31:08] ax.service.ax_client: Completed trial 24 with data: {'mean_accuracy': (0.1, 0.0)}.

(pid=4570) 2020-05-14 21:31:10,537	INFO trainable.py:217 -- Getting current IP.
(pid=4604) 2020-05-14 21:31:11,281	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:31:28] ax.service.ax_client: Completed trial 26 with data: {'mean_accuracy': (0.11, 0.0)}.
[INFO 05-14 21:31:29] ax.service.ax_client: Completed trial 27 with data: {'mean_accuracy': (0.92, 0.0)}.

(pid=4625) 2020-05-14 21:31:31,879	INFO trainable.py:217 -- Getting current IP.
(pid=4626) 2020-05-14 21:31:32,408	INFO trainable.py:217 -- Getting current IP.

[INFO 05-14 21:31:49] ax.service.ax_client: Completed trial 28 with data: {'mean_accuracy': (0.11, 0.0)}.
[INFO 05-14 21:31:49] ax.service.ax_client: Completed trial 29 with data: {'mean_accuracy': (0.91, 0.0)}.

Out[6]:

<ray.tune.analysis.experiment_analysis.ExperimentAnalysis at 0x7f92642dc080>

5. Retrieve the optimization results¶

In [7]:

best_parameters, values = ax.get_best_parameters()
best_parameters

Out[7]:

{'lr': 0.0010468142116849283, 'momentum': 0.6876892084255815}

In [8]:

means, covariances = values
means

Out[8]:

{'mean_accuracy': 0.9715}
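The "Completed trial" log lines report each metric as a `(mean, SEM)` pair. As a small sketch of how the best observed trial can be read off such data, the snippet below copies four entries from the log above (rounded as logged) into a plain dict and picks the highest mean. This is only the raw-observation view and an assumption about what is relevant here; Ax may rely on model predictions instead when a predictive model is available.

```python
# Four completed trials copied from the log output above: metric -> (mean, SEM).
completed = {
    6: {"mean_accuracy": (0.92, 0.0)},
    7: {"mean_accuracy": (0.88, 0.0)},
    8: {"mean_accuracy": (0.97, 0.0)},
    9: {"mean_accuracy": (0.36, 0.0)},
}

# Index of the trial with the highest observed mean accuracy.
best_index = max(completed, key=lambda i: completed[i]["mean_accuracy"][0])
best_mean, best_sem = completed[best_index]["mean_accuracy"]
```

Trial 8 was generated with momentum 0.69, which is consistent with the best parameters returned above.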

6. Plot the response surface and optimization trace¶

In [9]:

render(
    plot_contour(
        model=ax.generation_strategy.model, param_x='lr', param_y='momentum', metric_name='mean_accuracy'
    )
)

/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/numpy/core/fromnumeric.py:3335: RuntimeWarning:

Mean of empty slice.

/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/numpy/core/_methods.py:161: RuntimeWarning:

invalid value encountered in double_scalars

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-9-1ad1935854d6> in <module>
      1 render(
      2     plot_contour(
----> 3         model=ax.generation_strategy.model, param_x='lr', param_y='momentum', metric_name='mean_accuracy'
      4     )
      5 )

~/build/facebook/Ax/ax/plot/contour.py in plot_contour(model, param_x, param_y, metric_name, generator_runs_dict, relative, density, slice_values, lower_is_better, fixed_features, trial_index)
    150         generator_runs_dict=generator_runs_dict,
    151         density=density,
--> 152         slice_values=slice_values,
    153     )
    154     config = {

~/build/facebook/Ax/ax/plot/contour.py in _get_contour_predictions(model, x_param_name, y_param_name, metric, generator_runs_dict, density, slice_values, fixed_features)
     95         param_grid_obsf.append(predf)
     96 
---> 97     mu, cov = model.predict(param_grid_obsf)
     98 
     99     f_plt = mu[metric]

~/build/facebook/Ax/ax/modelbridge/base.py in predict(self, observation_features)
    486         # Predict in single batch.
    487         try:
--> 488             observation_data = self._batch_predict(observation_features)
    489         # Predict one by one.
    490         except (TypeError, ValueError):

~/build/facebook/Ax/ax/modelbridge/base.py in _batch_predict(self, observation_features)
    416             )
    417         # Apply terminal transform and predict
--> 418         observation_data = self._predict(observation_features)
    419 
    420         # Apply reverse transforms, in reverse order

~/build/facebook/Ax/ax/modelbridge/random.py in _predict(self, observation_features)
    102         output.
    103         """
--> 104         raise NotImplementedError("RandomModelBridge does not support prediction.")
    105 
    106     def _cross_validate(

NotImplementedError: RandomModelBridge does not support prediction.

The contour plot fails in this run because `ax.generation_strategy.model` is still the Sobol generator (a `RandomModelBridge`), which cannot make predictions. The log above shows all 30 trials being generated before any trial completed, so the GPEI model was most likely never fitted.

In [10]:

# `optimization_trace_single_method` expects a 2-d array of means, because it expects to average means from multiple 
# optimization runs, so we wrap our best objectives array in another array.
best_objectives = np.array([[trial.objective_mean * 100 for trial in ax.experiment.trials.values()]])
best_objective_plot = optimization_trace_single_method(
    y=np.maximum.accumulate(best_objectives, axis=1),
    title="Model performance vs. # of iterations",
    ylabel="Accuracy",
)
render(best_objective_plot)
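The trace plots the best accuracy seen so far at each iteration, which is what `np.maximum.accumulate` computes. A minimal standard-library sketch of the same running maximum, using the first ten accuracies from the log above (rounded as logged) in trial-index order:

```python
from itertools import accumulate

# Accuracies of trials 0-9 as reported in the log output above.
accuracies = [0.91, 0.91, 0.10, 0.87, 0.11, 0.09, 0.92, 0.88, 0.97, 0.36]

# Convert to whole percentages, then keep the best value seen so far at each step.
percent = [round(a * 100) for a in accuracies]
running_best = list(accumulate(percent, max))
# running_best == [91, 91, 91, 91, 91, 91, 92, 92, 97, 97]
```

The curve only moves when a trial beats every earlier one, so poor trials such as trial 2 leave it flat.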
