Ax for AutoML
Automated machine learning (AutoML) encompasses a large class of problems related to automating time-consuming and labor-intensive aspects of developing ML models. Adaptive experimentation is a natural fit for solving many AutoML tasks, which are often iterative in nature and can involve many expensive trial evaluations.
In this tutorial we will use Ax for hyperparameter optimization (HPO), a common AutoML task in which a model's hyperparameters are adjusted to improve model performance. Hyperparameters are the parameters set prior to model training or fitting, as opposed to parameters learned from data. Traditionally, ML engineers use a combination of domain knowledge, intuition, and manual experimentation comparing many models with different hyperparameter configurations to determine good hyperparameters. As the number of hyperparameters grows and as models become more expensive to train and evaluate, sample-efficient approaches to experimentation like Bayesian optimization become increasingly valuable.
In this tutorial we will train an SGDClassifier from the popular scikit-learn library to recognize handwritten digits and tune the model's hyperparameters to improve its performance. You can read more about the SGDClassifier model in their example here, which this tutorial is largely based on. This tutorial will incorporate many advanced features in Ax to demonstrate how they can be applied to complex engineering challenges in a real-world setting.
Learning Objectives
- Understand how Ax can be used for HPO tasks
- Use complex optimization configurations like multiple objectives and outcome constraints to achieve nuanced real-world goals
- Use early stopping to save experimentation resources
- Analyze the results of the optimization
Prerequisites
- Familiarity with scikit-learn and basic machine learning concepts
- Understanding of adaptive experimentation and Bayesian optimization
- Ask-tell Optimization of Python Functions with early stopping
Step 1: Import Necessary Modules
First, ensure you have all the necessary imports:
import time
import matplotlib.pyplot as plt
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
from ax.api.client import Client
from ax.api.configs import (
    ChoiceParameterConfig,
    ExperimentConfig,
    GenerationMethod,
    GenerationStrategyConfig,
    ParameterScaling,
    ParameterType,
    RangeParameterConfig,
)
from pyre_extensions import assert_is_instance
Step 1.1: Understanding the baseline performance of SGDClassifier
Before we begin HPO, let's understand the task and the performance of SGDClassifier with its default hyperparameters. The following code is largely adapted from the example on scikit-learn's website here.
# Load the digits dataset and display the first 4 images to demonstrate
digits = sklearn.datasets.load_digits()
classes = list(set(digits.target))
_, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 3))
for ax, image, label in zip(axes, digits.images, digits.target):
    ax.set_axis_off()
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title("Training: %i" % label)
# Instantiate a SGDClassifier with default hyperparameters
clf = sklearn.linear_model.SGDClassifier()
# Split the data into a training set and a validation set
train_x, valid_x, train_y, valid_y = sklearn.model_selection.train_test_split(
    digits.data, digits.target, test_size=0.20, random_state=0
)
# Train the classifier on the training set using 10 batches
# Also time the training.
batch_size = len(train_x) // 10
start_time = time.time()
for i in range(10):
    start_idx = i * batch_size
    end_idx = (i + 1) * batch_size
    # Use partial fit to update the model on the current batch
    clf.partial_fit(
        train_x[start_idx:end_idx], train_y[start_idx:end_idx], classes=classes
    )
training_time = time.time() - start_time
# Evaluate the classifier on the validation set
score = clf.score(valid_x, valid_y)
score, training_time
(0.8583333333333333, 0.034377336502075195)
The model performs well, but let's see if we can improve performance by tuning the hyperparameters.
Step 2: Initialize the Client
As always, the first step in running our adaptive experiment with Ax is to create an instance of the Client to manage the state of our experiment.
client = Client()
Step 3: Configure the Experiment
The Client expects a series of Configs which define how the experiment will be run. We'll set this up the same way as we did in our previous tutorial.
Our current task is to tune the hyperparameters of scikit-learn's SGDClassifier. These parameters control aspects of the model's training process, and configuring them can have dramatic effects on the model's ability to correctly classify inputs. A full list of this model's hyperparameters and appropriate values is available in the library's documentation. In this tutorial we will tune the following hyperparameters:
- loss: The loss function to be used
- penalty: The penalty (aka regularization term) to be used
- learning_rate: The learning rate schedule
- alpha: Constant that multiplies the regularization term. The higher the value, the stronger the regularization
- eta0: The initial learning rate, used by the constant, invscaling, and adaptive learning rate schedules
- batch_size: A training parameter which controls how many examples are shown during a single epoch. We will use all samples in the dataset for each model training, so a smaller batch size will translate to more epochs and vice versa.
You will notice some hyperparameters are continuous ranges, some are discrete ranges, and some are categorical choices; Ax is able to handle all of these types of parameters via its RangeParameterConfig and ChoiceParameterConfig classes.
# Create an experiment configuration
experiment_config = ExperimentConfig(
    name="SGDClassifier_hpo",
    parameters=[
        ChoiceParameterConfig(
            name="loss",
            parameter_type=ParameterType.STRING,
            values=[
                "hinge",
                "log_loss",
                "squared_hinge",
                "modified_huber",
                "perceptron",
            ],
            is_ordered=False,
        ),
        ChoiceParameterConfig(
            name="penalty",
            parameter_type=ParameterType.STRING,
            values=["l1", "l2", "elasticnet"],
            is_ordered=False,
        ),
        ChoiceParameterConfig(
            name="learning_rate",
            parameter_type=ParameterType.STRING,
            values=["constant", "optimal", "invscaling", "adaptive"],
            is_ordered=False,
        ),
        RangeParameterConfig(
            name="alpha",
            bounds=(1e-8, 100),
            parameter_type=ParameterType.FLOAT,
            scaling=ParameterScaling.LOG,
        ),
        RangeParameterConfig(
            name="eta0",
            bounds=(1e-8, 1),
            parameter_type=ParameterType.FLOAT,
            scaling=ParameterScaling.LOG,
        ),
        RangeParameterConfig(
            name="batch_size",
            bounds=(5, 500),
            parameter_type=ParameterType.INT,
        ),
    ],
    # The following arguments are optional
    description="Optimization of SGDClassifier for digits dataset",
    owner="developer",
)
# Apply the experiment configuration to the client
client.configure_experiment(experiment_config=experiment_config)
client.configure_generation_strategy(
    GenerationStrategyConfig(method=GenerationMethod.FAST)
)
/home/runner/work/Ax/Ax/ax/api/utils/instantiation/from_config.py:89: AxParameterWarning: sort_values is not specified for ChoiceParameter "loss". Defaulting to False for parameters of ParameterType STRING. To override this behavior (or avoid this warning), specify sort_values during ChoiceParameter construction.
return ChoiceParameter(
/home/runner/work/Ax/Ax/ax/api/utils/instantiation/from_config.py:89: AxParameterWarning: sort_values is not specified for ChoiceParameter "penalty". Defaulting to False for parameters of ParameterType STRING. To override this behavior (or avoid this warning), specify sort_values during ChoiceParameter construction.
return ChoiceParameter(
/home/runner/work/Ax/Ax/ax/api/utils/instantiation/from_config.py:89: AxParameterWarning: sort_values is not specified for ChoiceParameter "learning_rate". Defaulting to False for parameters of ParameterType STRING. To override this behavior (or avoid this warning), specify sort_values during ChoiceParameter construction.
return ChoiceParameter(
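To see why alpha and eta0 are searched on a log scale, the following minimal sketch compares uniform draws in linear space with draws that are uniform in log space over the same bounds used for alpha. It uses only the Python standard library; the helper names sample_linear and sample_log are hypothetical and this is not how Ax implements ParameterScaling.LOG internally, only an illustration of the motivation.

```python
import math
import random


def sample_linear(low: float, high: float, rng: random.Random) -> float:
    """Draw a value uniformly between low and high."""
    return rng.uniform(low, high)


def sample_log(low: float, high: float, rng: random.Random) -> float:
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
low, high = 1e-8, 100.0  # Bounds used for alpha in the experiment above

linear_draws = [sample_linear(low, high, rng) for _ in range(10_000)]
log_draws = [sample_log(low, high, rng) for _ in range(10_000)]

# Fraction of draws below 1e-2, i.e. in the six smallest of the ten decades
frac_small_linear = sum(x < 1e-2 for x in linear_draws) / len(linear_draws)
frac_small_log = sum(x < 1e-2 for x in log_draws) / len(log_draws)

print(f"linear: {frac_small_linear:.4f}, log: {frac_small_log:.4f}")
```

With linear sampling almost every draw falls in the largest decade, so small regularization strengths would rarely be explored. Sampling in log space spreads the draws roughly evenly across all ten orders of magnitude.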
Step 4: Configure Optimization
Now, we must set up the optimization objective in Client, where objective is a string that specifies which metric we would like to optimize and the direction (higher or lower) that is considered optimal.
In our example we want to consider both performance and computational cost implications of hyperparameter modifications. scikit-learn models use a function called score to report the mean accuracy of the model, and in our optimization we should seek to maximize this value. Since model training can be a very expensive process, especially for large models, training time can represent a significant cost.
Let's configure Ax to maximize score while minimizing training time. We call this a multi-objective optimization, and rather than returning a single best parameterization we return a Pareto frontier of points which represent optimal tradeoffs between all metrics present. Multi-objective optimization is useful for competing metrics where a gain in one metric may represent a regression in the other.
In these settings we can also specify outcome constraints, which indicate that if a metric result falls outside of the specified threshold we are not interested in that result, regardless of the wins observed in any other metric. For a concrete example, imagine Ax finding a parameterization that trains in no time at all but has a score no better than if the model were guessing at random.
For this toy example let's configure Ax to maximize score and minimize training time, but avoid any hyperparameter configurations that result in a mean accuracy score of less than 85% or a training time greater than 1 second.
client.configure_optimization(
    objective="score, -training_time",
    outcome_constraints=["score >= 0.85", "training_time <= 1"],
)
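As a rough illustration of what the two outcome constraints mean, the sketch below checks a few candidate results against the thresholds passed to configure_optimization. The function is_feasible and the candidate values are hypothetical and exist only for this example; Ax evaluates constraints itself, so none of this is needed in the actual experiment.

```python
def is_feasible(metrics: dict[str, float]) -> bool:
    """Return True when both outcome constraints from the call above hold."""
    return metrics["score"] >= 0.85 and metrics["training_time"] <= 1.0


# Hypothetical results, chosen to show one feasible and two infeasible cases
candidates = {
    "baseline": {"score": 0.858, "training_time": 0.034},
    "fast_but_inaccurate": {"score": 0.10, "training_time": 0.001},
    "accurate_but_slow": {"score": 0.95, "training_time": 2.5},
}

feasible = [name for name, metrics in candidates.items() if is_feasible(metrics)]
print(feasible)
```

Only the first candidate satisfies both thresholds. The other two each win on one objective but violate a constraint, which is exactly the kind of result the constraints are meant to rule out.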
Step 5: Run Trials with early stopping
Before we begin our Bayesian optimization loop, we can attach the data we collected from training SGDClassifier with default hyperparameters. This will give our experiment a head start by providing a datapoint to our surrogate model. Because these are the default settings provided by scikit-learn, it's likely they will be pretty good and will provide the optimization with a promising start. It is generally advantageous to attach any existing data to an experiment to improve performance.
trial_index = client.attach_baseline(
    parameters={
        "loss": clf.loss,
        "penalty": clf.penalty,
        "alpha": clf.alpha,
        "learning_rate": clf.learning_rate,
        # Default eta0 is 0.0, so add a small value to stay within the log-scaled bounds (1e-8, 1)
        "eta0": clf.eta0 + 1e-8,
        "batch_size": batch_size,
    }
)
client.complete_trial(
    trial_index=trial_index,
    raw_data={"score": score, "training_time": training_time},
)
[INFO 04-18 05:09:53] ax.core.experiment: Attached custom parameterizations [{'loss': 'hinge', 'penalty': 'l2', 'alpha': 0.0001, 'learning_rate': 'optimal', 'eta0': 0.0, 'batch_size': 143}] as trial 0.
<enum 'TrialStatus'>.COMPLETED
After attaching the initial trial, we will begin the experimentation loop by writing a for loop to execute our full experimentation budget of 30 trials. In each iteration we will ask Ax for the next trials (in this case just one), then instantiate an SGDClassifier with the suggested hyperparameters. We will then split the data into train and test sets. Next we will define an inner loop to perform minibatch training, in which we divide the train set into a number of smaller batches and train one epoch of stochastic gradient descent at a time. After each epoch we will report the score and the time.
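The index arithmetic behind the minibatch loop can be sketched in isolation. The helper batch_bounds below is hypothetical and assumes 1,437 training samples, which is consistent with the baseline batch size of 143 computed earlier as one tenth of the training set. It shows how the batch size determines the number of updates and the progression value reported after the last one.

```python
def batch_bounds(n_samples: int, batch_size: int) -> list[tuple[int, int]]:
    """Return (start, end) slice indices for each full minibatch."""
    num_batches = n_samples // batch_size
    return [(i * batch_size, (i + 1) * batch_size) for i in range(num_batches)]


n_train = 1437  # Assumed size of the training split (80% of the digits dataset)

for batch_size in (143, 500):
    bounds = batch_bounds(n_train, batch_size)
    last_end = bounds[-1][1]
    print(
        f"batch_size={batch_size}: {len(bounds)} updates, "
        f"final progression={last_end}, unused samples={n_train - last_end}"
    )
```

Because the number of batches is computed with floor division, any remainder that does not fill a whole batch is not used in that trial. Larger batch sizes therefore mean fewer model updates and fewer opportunities for early stopping.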
Because training machine learning models is expensive, we will utilize Ax's early stopping functionality to kill trials unlikely to produce optimal results before they have been completed. After data has been attached we will ask the Client whether or not we should stop the trial, and if it advises us to do so we will report it early stopped and exit out of the training loop. By early stopping, we proactively save compute without regressing optimization performance.
for _ in range(30):
    trials = client.get_next_trials(maximum_trials=1)
    for trial_index, parameters in trials.items():
        clf = sklearn.linear_model.SGDClassifier(
            loss=parameters["loss"],
            penalty=parameters["penalty"],
            alpha=parameters["alpha"],
            learning_rate=parameters["learning_rate"],
            eta0=parameters["eta0"],
        )
        train_x, valid_x, train_y, valid_y = sklearn.model_selection.train_test_split(
            digits.data,
            digits.target,
            test_size=0.20,
        )
        batch_size = assert_is_instance(parameters["batch_size"], int)
        num_epochs = len(train_x) // batch_size
        start_time = time.time()
        for i in range(0, num_epochs):
            start_idx = i * batch_size
            end_idx = (i + 1) * batch_size
            # Use partial fit to update the model on the current batch
            clf.partial_fit(
                train_x[start_idx:end_idx], train_y[start_idx:end_idx], classes=classes
            )
            raw_data = {
                "score": clf.score(valid_x, valid_y),
                "training_time": time.time() - start_time,
            }
            # On the final epoch call complete_trial and break, else call attach_data
            if i == num_epochs - 1:
                client.complete_trial(
                    trial_index=trial_index,
                    raw_data=raw_data,
                    progression=end_idx,  # Use the index of the last example in the batch as the progression value
                )
                break
            client.attach_data(
                trial_index=trial_index,
                raw_data=raw_data,
                progression=end_idx,
            )
            # If the trial is underperforming, stop it
            if client.should_stop_trial_early(trial_index=trial_index):
                client.mark_trial_early_stopped(trial_index=trial_index)
                break
[INFO 04-18 05:10:04] ax.early_stopping.strategies.percentile: Early stoppinging trial 5: Trial objective value 0.4222222222222222 is worse than 50.0-th percentile (0.8594444444444445) across comparable trials..
[INFO 04-18 05:10:14] ax.early_stopping.strategies.percentile: Early stoppinging trial 6: Trial objective value 0.625 is worse than 50.0-th percentile (0.8452458445918284) across comparable trials..
[INFO 04-18 05:10:19] ax.early_stopping.strategies.percentile: Early stoppinging trial 7: Trial objective value 0.8472222222222222 is worse than 50.0-th percentile (0.8496829710144927) across comparable trials..
[INFO 04-18 05:10:26] ax.early_stopping.strategies.percentile: Early stoppinging trial 8: Trial objective value 0.9027777777777778 is worse than 50.0-th percentile (0.9036231884057971) across comparable trials..
[INFO 04-18 05:11:02] ax.early_stopping.strategies.percentile: Early stoppinging trial 11: Trial objective value 0.4166666666666667 is worse than 50.0-th percentile (0.8436985596707818) across comparable trials..
[INFO 04-18 05:11:16] ax.early_stopping.strategies.percentile: Early stoppinging trial 12: Trial objective value 0.32222222222222224 is worse than 50.0-th percentile (0.8416666666666667) across comparable trials..
[INFO 04-18 05:11:28] ax.early_stopping.strategies.percentile: Early stoppinging trial 13: Trial objective value 0.32222222222222224 is worse than 50.0-th percentile (0.8444631500187056) across comparable trials..
[INFO 04-18 05:11:40] ax.early_stopping.strategies.percentile: Early stoppinging trial 14: Trial objective value 0.4 is worse than 50.0-th percentile (0.843443696221474) across comparable trials..
[INFO 04-18 05:11:53] ax.early_stopping.strategies.percentile: Early stoppinging trial 15: Trial objective value 0.49444444444444446 is worse than 50.0-th percentile (0.7303319209039548) across comparable trials..
[INFO 04-18 05:12:07] ax.early_stopping.strategies.percentile: Early stoppinging trial 16: Trial objective value 0.4666666666666667 is worse than 50.0-th percentile (0.8414141414141415) across comparable trials..
[INFO 04-18 05:12:22] ax.early_stopping.strategies.percentile: Early stoppinging trial 17: Trial objective value 0.2972222222222222 is worse than 50.0-th percentile (0.618997175141243) across comparable trials..
[INFO 04-18 05:12:35] ax.early_stopping.strategies.percentile: Early stoppinging trial 18: Trial objective value 0.24722222222222223 is worse than 50.0-th percentile (0.72953475432289) across comparable trials..
[INFO 04-18 05:12:53] ax.early_stopping.strategies.percentile: Early stoppinging trial 19: Trial objective value 0.2361111111111111 is worse than 50.0-th percentile (0.6176553672316384) across comparable trials..
[INFO 04-18 05:13:07] ax.early_stopping.strategies.percentile: Early stoppinging trial 20: Trial objective value 0.5361111111111111 is worse than 50.0-th percentile (0.5768832391713747) across comparable trials..
[INFO 04-18 05:13:20] ax.early_stopping.strategies.percentile: Early stoppinging trial 21: Trial objective value 0.4222222222222222 is worse than 50.0-th percentile (0.5567208097928438) across comparable trials..
[INFO 04-18 05:13:35] ax.early_stopping.strategies.percentile: Early stoppinging trial 22: Trial objective value 0.3388888888888889 is worse than 50.0-th percentile (0.5361111111111111) across comparable trials..
[INFO 04-18 05:13:50] ax.early_stopping.strategies.percentile: Early stoppinging trial 23: Trial objective value 0.16944444444444445 is worse than 50.0-th percentile (0.49444444444444446) across comparable trials..
[INFO 04-18 05:14:02] ax.early_stopping.strategies.percentile: Early stoppinging trial 24: Trial objective value 0.13333333333333333 is worse than 50.0-th percentile (0.45833333333333337) across comparable trials..
[INFO 04-18 05:14:16] ax.early_stopping.strategies.percentile: Early stoppinging trial 25: Trial objective value 0.075 is worse than 50.0-th percentile (0.4222222222222222) across comparable trials..
[INFO 04-18 05:14:28] ax.early_stopping.strategies.percentile: Early stoppinging trial 26: Trial objective value 0.30833333333333335 is worse than 50.0-th percentile (0.8439534231200898) across comparable trials..
[INFO 04-18 05:14:43] ax.early_stopping.strategies.percentile: Early stoppinging trial 27: Trial objective value 0.375 is worse than 50.0-th percentile (0.5013888888888889) across comparable trials..
[INFO 04-18 05:14:57] ax.early_stopping.strategies.percentile: Early stoppinging trial 28: Trial objective value 0.5638888888888889 is worse than 50.0-th percentile (0.8708080808080809) across comparable trials..
[INFO 04-18 05:15:13] ax.early_stopping.strategies.percentile: Early stoppinging trial 29: Trial objective value 0.3416666666666667 is worse than 50.0-th percentile (0.4222222222222222) across comparable trials..
[INFO 04-18 05:15:24] ax.early_stopping.strategies.percentile: Early stoppinging trial 30: Trial objective value 0.36944444444444446 is worse than 50.0-th percentile (0.41944444444444445) across comparable trials..
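The log messages above compare each running trial against the 50th percentile of comparable trials. A simplified version of that kind of rule can be sketched as follows. The function should_stop_early, its min_trials parameter, and the example values are all hypothetical, and Ax's actual early stopping strategy may apply additional conditions that this sketch leaves out.

```python
import statistics


def should_stop_early(
    current_value: float,
    comparable_values: list[float],
    min_trials: int = 3,
) -> bool:
    """Stop when a maximized metric falls below the median of comparable trials.

    min_trials is a hypothetical safeguard: with too few comparable trials
    there is not enough evidence to stop anything.
    """
    if len(comparable_values) < min_trials:
        return False
    return current_value < statistics.median(comparable_values)


# Hypothetical scores of other trials at a comparable progression
others = [0.86, 0.88, 0.89, 0.93]

print(should_stop_early(0.42, others))  # Well below the median
print(should_stop_early(0.95, others))  # Above the median
print(should_stop_early(0.10, [0.90]))  # Too few comparable trials
```

A median threshold is aggressive: by construction roughly half of the trials that reach the comparison point are candidates for stopping, which is consistent with the large number of EARLY_STOPPED trials in this run.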
Step 6: Analyze Results
After running trials, you can analyze the results. Most commonly this means extracting the parameterization from the best performing trial you conducted.
Since we are optimizing multiple objectives, rather than a single best point we want to get the Pareto frontier -- the set of points that presents optimal tradeoffs between maximizing score and minimizing training time.
frontier = client.get_pareto_frontier()
# Frontier is a list of tuples, where each tuple contains the parameters, the metric readings, the trial index, and the arm name for a point on the Pareto frontier
for parameters, metrics, trial_index, arm_name in frontier:
    print(f"Trial {trial_index} with {parameters=} and {metrics=}\n")
Trial 3 with parameters={'alpha': 1.4159718661387047e-05, 'eta0': 0.00041286951368824283, 'batch_size': 193, 'loss': 'squared_hinge', 'penalty': 'elasticnet', 'learning_rate': 'invscaling'} and metrics={'score': (np.float64(0.9325881932299056), 5.497167657276985e-06), 'training_time': (np.float64(0.11364205816610574), 0.0007267456234127333)}
Trial 9 with parameters={'alpha': 1e-08, 'eta0': 0.01685929597184217, 'batch_size': 500, 'loss': 'squared_hinge', 'penalty': 'elasticnet', 'learning_rate': 'invscaling'} and metrics={'score': (np.float64(0.8918996467768461), 6.044725228194316e-06), 'training_time': (np.float64(0.037048372216394704), 0.000726827660478983)}
Trial 0 with parameters={'loss': 'hinge', 'penalty': 'l2', 'alpha': 0.0001, 'learning_rate': 'optimal', 'eta0': 1e-08, 'batch_size': 143} and metrics={'score': (np.float64(0.8584423781660401), 6.071006798132691e-06), 'training_time': (np.float64(0.035445128208759585), 0.000726822889555319)}
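To make the notion of a Pareto frontier concrete, here is a minimal sketch that recomputes it from the raw observed values of the completed trials in this run, as reported in the summary table. The helpers dominates and pareto_frontier are hypothetical and written for this example only. Ax's own get_pareto_frontier may work differently, for instance by using model predictions rather than raw observations, so treat this as an illustration of the concept.

```python
def dominates(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True if a is at least as good as b on both metrics and better on one.

    Each tuple is (score, training_time); score is maximized, time is minimized.
    """
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return at_least_as_good and strictly_better


def pareto_frontier(points: dict[int, tuple[float, float]]) -> list[int]:
    """Return the sorted trial indices that no other trial dominates."""
    return sorted(
        idx
        for idx, point in points.items()
        if not any(dominates(other, point) for j, other in points.items() if j != idx)
    )


# Observed (score, training_time) for the COMPLETED trials of this run
completed = {
    0: (0.858333, 0.034377),
    1: (0.883333, 0.058346),
    2: (0.894444, 0.203673),
    3: (0.933333, 0.112456),
    4: (0.855556, 0.035398),
    9: (0.891667, 0.036010),
    10: (0.930556, 0.937276),
}

print(pareto_frontier(completed))
```

On these observations the sketch selects trials 0, 3, and 9, the same trials Ax reports above. Trial 10, for example, is left out because trial 3 achieves a higher score in less time.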
Step 7: Compute Analyses
Ax can also produce a number of analyses to help interpret the results of the experiment via client.compute_analyses. Users can manually select which analyses to run, or can allow Ax to select which would be most relevant. In this case Ax selects the following:
- Scatter Plot shows a plane with each objective on its own axis and a point for each observation. In multi-objective optimizations like ours it also draws a line through the Pareto frontier, indicating which points represent optimal tradeoffs between our objectives.
- Interaction Analysis Plot shows which parameters have the largest effect on the function and plots the most important parameters as one- or two-dimensional surfaces
- Summary lists all trials generated along with their parameterizations, observations, and miscellaneous metadata
# display=True instructs Ax to sort then render the resulting analyses
cards = client.compute_analyses(display=True)
Modeled score vs. training_time
This plot displays the effects of each arm on the two selected metrics. It is useful for understanding the trade-off between the two metrics and for visualizing the Pareto frontier.
score by progression
The progression plot tracks the evolution of each metric over the course of the experiment. This visualization is typically used to monitor the improvement of metrics over Trial iterations, but can also be useful in informing decisions about early stopping for Trials.
training_time by progression
The progression plot tracks the evolution of each metric over the course of the experiment. This visualization is typically used to monitor the improvement of metrics over Trial iterations, but can also be useful in informing decisions about early stopping for Trials.
Summary for SGDClassifier_hpo
High-level summary of the Trials in this Experiment
| | trial_index | arm_name | trial_status | generation_node | score | training_time | loss | penalty | alpha | learning_rate | eta0 | batch_size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | baseline | COMPLETED | nan | 0.858333 | 0.034377 | hinge | l2 | 0.0001 | optimal | 1e-08 | 143 |
1 | 1 | 1_0 | COMPLETED | Sobol | 0.883333 | 0.058346 | squared_hinge | l1 | 1.72336e-08 | constant | 0.0828341 | 346 |
2 | 2 | 2_0 | COMPLETED | Sobol | 0.894444 | 0.203673 | perceptron | l2 | 0.849649 | adaptive | 4.04723e-06 | 118 |
3 | 3 | 3_0 | COMPLETED | Sobol | 0.933333 | 0.112456 | squared_hinge | elasticnet | 1.41597e-05 | invscaling | 0.00041287 | 193 |
4 | 4 | 4_0 | COMPLETED | Sobol | 0.855556 | 0.035398 | modified_huber | l2 | 0.0128116 | optimal | 8.48764e-08 | 460 |
5 | 5 | 5_0 | EARLY_STOPPED | MBM | 0.422222 | 0.031615 | squared_hinge | elasticnet | 1e-08 | invscaling | 5.08559e-07 | 322 |
6 | 6 | 6_0 | EARLY_STOPPED | MBM | 0.625 | 0.005857 | squared_hinge | elasticnet | 16.203 | invscaling | 0.0216218 | 500 |
7 | 7 | 7_0 | EARLY_STOPPED | MBM | 0.847222 | 0.136926 | squared_hinge | elasticnet | 6.58979 | invscaling | 0.000439125 | 121 |
8 | 8 | 8_0 | EARLY_STOPPED | MBM | 0.902778 | 0.528525 | modified_huber | elasticnet | 1.85015e-06 | invscaling | 0.00333132 | 55 |
9 | 9 | 9_0 | COMPLETED | MBM | 0.891667 | 0.03601 | squared_hinge | elasticnet | 1e-08 | invscaling | 0.0168593 | 500 |
10 | 10 | 10_0 | COMPLETED | MBM | 0.930556 | 0.937276 | squared_hinge | elasticnet | 8.03566e-05 | invscaling | 0.00034343 | 54 |
11 | 11 | 11_0 | EARLY_STOPPED | MBM | 0.416667 | 0.005088 | squared_hinge | elasticnet | 4.87935e-05 | invscaling | 4.13276e-06 | 275 |
12 | 12 | 12_0 | EARLY_STOPPED | MBM | 0.322222 | 0.005223 | squared_hinge | elasticnet | 4.94001e-05 | invscaling | 3.40195e-06 | 275 |
13 | 13 | 13_0 | EARLY_STOPPED | MBM | 0.322222 | 0.005166 | squared_hinge | elasticnet | 4.92153e-05 | invscaling | 2.66869e-06 | 278 |
14 | 14 | 14_0 | EARLY_STOPPED | MBM | 0.4 | 0.00524 | squared_hinge | elasticnet | 4.80722e-05 | invscaling | 3.0704e-06 | 274 |
15 | 15 | 15_0 | EARLY_STOPPED | MBM | 0.494444 | 0.005089 | squared_hinge | elasticnet | 5.09445e-05 | invscaling | 3.94464e-06 | 275 |
16 | 16 | 16_0 | EARLY_STOPPED | MBM | 0.466667 | 0.005215 | squared_hinge | elasticnet | 4.67331e-05 | invscaling | 3.54848e-06 | 274 |
17 | 17 | 17_0 | EARLY_STOPPED | MBM | 0.297222 | 0.005137 | squared_hinge | elasticnet | 4.65142e-05 | invscaling | 2.16058e-06 | 275 |
18 | 18 | 18_0 | EARLY_STOPPED | MBM | 0.247222 | 0.005208 | squared_hinge | elasticnet | 4.90206e-05 | invscaling | 3.91649e-06 | 274 |
19 | 19 | 19_0 | EARLY_STOPPED | MBM | 0.236111 | 0.005167 | squared_hinge | elasticnet | 5.13724e-05 | invscaling | 4.62322e-06 | 274 |
20 | 20 | 20_0 | EARLY_STOPPED | MBM | 0.536111 | 0.005214 | squared_hinge | elasticnet | 4.96727e-05 | invscaling | 3.97872e-06 | 274 |
21 | 21 | 21_0 | EARLY_STOPPED | MBM | 0.422222 | 0.005232 | squared_hinge | elasticnet | 4.54236e-05 | invscaling | 2.78685e-06 | 275 |
22 | 22 | 22_0 | EARLY_STOPPED | MBM | 0.338889 | 0.005184 | squared_hinge | elasticnet | 4.50176e-05 | invscaling | 4.25982e-06 | 274 |
23 | 23 | 23_0 | EARLY_STOPPED | MBM | 0.169444 | 0.005162 | squared_hinge | elasticnet | 4.77686e-05 | invscaling | 2.46731e-06 | 275 |
24 | 24 | 24_0 | EARLY_STOPPED | MBM | 0.133333 | 0.005115 | squared_hinge | elasticnet | 5.15691e-05 | invscaling | 3.14476e-06 | 275 |
25 | 25 | 25_0 | EARLY_STOPPED | MBM | 0.075 | 0.005197 | squared_hinge | elasticnet | 4.68887e-05 | invscaling | 1.98274e-06 | 275 |
26 | 26 | 26_0 | EARLY_STOPPED | MBM | 0.308333 | 0.005215 | squared_hinge | elasticnet | 4.95867e-05 | invscaling | 2.23493e-06 | 276 |
27 | 27 | 27_0 | EARLY_STOPPED | MBM | 0.375 | 0.005091 | squared_hinge | elasticnet | 4.87782e-05 | invscaling | 3.80524e-06 | 274 |
28 | 28 | 28_0 | EARLY_STOPPED | MBM | 0.563889 | 0.057223 | squared_hinge | elasticnet | 4.65636e-05 | invscaling | 2.16822e-06 | 275 |
29 | 29 | 29_0 | EARLY_STOPPED | MBM | 0.341667 | 0.005107 | squared_hinge | elasticnet | 4.78552e-05 | invscaling | 2.97253e-06 | 275 |
30 | 30 | 30_0 | EARLY_STOPPED | MBM | 0.369444 | 0.005157 | squared_hinge | elasticnet | 4.32422e-05 | invscaling | 3.70796e-06 | 275 |
Sensitivity Analysis for score
Understand how each parameter affects score according to a second-order sensitivity analysis.
Sensitivity Analysis for training_time
Understand how each parameter affects training_time according to a second-order sensitivity analysis.
alpha vs. score
The slice plot provides a one-dimensional view of predicted outcomes for score as a function of a single parameter, while keeping all other parameters fixed at their status_quo value (or mean value if status_quo is unavailable). This visualization helps in understanding the sensitivity and impact of changes in the selected parameter on the predicted metric outcomes.
eta0, batch_size vs. score
The contour plot visualizes the predicted outcomes for score across a two-dimensional parameter space, with other parameters held fixed at their status_quo value (or mean value if status_quo is unavailable). This plot helps in identifying regions of optimal performance and understanding how changes in the selected parameters influence the predicted outcomes. Contour lines represent levels of constant predicted values, providing insights into the gradient and potential optima within the parameter space.
eta0 vs. score
The slice plot provides a one-dimensional view of predicted outcomes for score as a function of a single parameter, while keeping all other parameters fixed at their status_quo value (or mean value if status_quo is unavailable). This visualization helps in understanding the sensitivity and impact of changes in the selected parameter on the predicted metric outcomes.
alpha, batch_size vs. score
The contour plot visualizes the predicted outcomes for score across a two-dimensional parameter space, with other parameters held fixed at their status_quo value (or mean value if status_quo is unavailable). This plot helps in identifying regions of optimal performance and understanding how changes in the selected parameters influence the predicted outcomes. Contour lines represent levels of constant predicted values, providing insights into the gradient and potential optima within the parameter space.
alpha, eta0 vs. score
The contour plot visualizes the predicted outcomes for score across a two-dimensional parameter space, with other parameters held fixed at their status_quo value (or mean value if status_quo is unavailable). This plot helps in identifying regions of optimal performance and understanding how changes in the selected parameters influence the predicted outcomes. Contour lines represent levels of constant predicted values, providing insights into the gradient and potential optima within the parameter space.
batch_size vs. training_time
The slice plot provides a one-dimensional view of predicted outcomes for training_time as a function of a single parameter, while keeping all other parameters fixed at their status_quo value (or mean value if status_quo is unavailable). This visualization helps in understanding the sensitivity and impact of changes in the selected parameter on the predicted metric outcomes.
eta0, batch_size vs. training_time
The contour plot visualizes the predicted outcomes for training_time across a two-dimensional parameter space, with other parameters held fixed at their status_quo value (or mean value if status_quo is unavailable). This plot helps in identifying regions of optimal performance and understanding how changes in the selected parameters influence the predicted outcomes. Contour lines represent levels of constant predicted values, providing insights into the gradient and potential optima within the parameter space.
alpha, eta0 vs. training_time
The contour plot visualizes the predicted outcomes for training_time across a two-dimensional parameter space, with other parameters held fixed at their status_quo value (or mean value if status_quo is unavailable). This plot helps in identifying regions of optimal performance and understanding how changes in the selected parameters influence the predicted outcomes. Contour lines represent levels of constant predicted values, providing insights into the gradient and potential optima within the parameter space.
alpha, batch_size vs. training_time
The contour plot visualizes the predicted outcomes for training_time across a two-dimensional parameter space, with other parameters held fixed at their status_quo value (or mean value if status_quo is unavailable). This plot helps in identifying regions of optimal performance and understanding how changes in the selected parameters influence the predicted outcomes. Contour lines represent levels of constant predicted values, providing insights into the gradient and potential optima within the parameter space.
alpha vs. training_time
The slice plot provides a one-dimensional view of predicted outcomes for training_time as a function of a single parameter, while keeping all other parameters fixed at their status_quo value (or mean value if status_quo is unavailable). This visualization helps in understanding the sensitivity and impact of changes in the selected parameter on the predicted metric outcomes.
Cross Validation for score
The cross-validation plot displays the model fit for each metric in the experiment. It employs a leave-one-out approach, where the model is trained on all data except one sample, which is used for validation. The plot shows the predicted outcome for the validation set on the y-axis against its actual value on the x-axis. Points that align closely with the dotted diagonal line indicate a strong model fit, signifying accurate predictions. Additionally, the plot includes 95% confidence intervals that provide insight into the noise in observations and the uncertainty in model predictions. A horizontal, flat line of predictions indicates that the model has not picked up on sufficient signal in the data, and instead is just predicting the mean.
Cross Validation for training_time
The cross-validation plot displays the model fit for each metric in the experiment. It employs a leave-one-out approach, where the model is trained on all data except one sample, which is used for validation. The plot shows the predicted outcome for the validation set on the y-axis against its actual value on the x-axis. Points that align closely with the dotted diagonal line indicate a strong model fit, signifying accurate predictions. Additionally, the plot includes 95% confidence intervals that provide insight into the noise in observations and the uncertainty in model predictions. A horizontal, flat line of predictions indicates that the model has not picked up on sufficient signal in the data, and instead is just predicting the mean.
Conclusion
This tutorial demonstrates Ax's ability to solve AutoML tasks in a resource-efficient manner. We configured a complex optimization which captures the nuanced goals of the experiment and utilized early stopping to save resources by killing training runs unlikely to produce optimal results.
While this tutorial shows how to use Ax for HPO on an SGDClassifier, the same techniques can be used for many different AutoML tasks such as feature selection, architecture search, and more.