19.5. Asynchronous Successive Halving

As we have seen in Section 19.3, we can accelerate HPO by distributing the evaluation of hyperparameter configurations across either multiple instances or multiple CPUs / GPUs on a single instance. However, compared to random search, it is not straightforward to run successive halving (SH) asynchronously in a distributed setting. Before we can decide which configuration to run next, we first have to collect all observations at the current rung level. This requires synchronizing the workers at each rung level. For example, for the lowest rung level \(r_{\text{min}}\), we first have to evaluate all \(N = \eta^K\) configurations before we can promote the top \(\frac{1}{\eta}\) fraction of them to the next rung level.
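To make this synchronization requirement concrete, the following small sketch (with made-up values \(\eta = 2\) and \(K = 4\), not those of the experiment below) prints how many configurations a synchronous SH round evaluates on each rung; every transition from one rung to the next is a synchronization point.

eta, K = 2, 4  # illustrative values only
r_min = 1  # minimum resource (e.g., epochs) per configuration

num_configs = eta ** K  # N = eta^K configurations enter the lowest rung
resource = r_min
while num_configs >= 1:
    print(f"rung at r = {resource}: evaluate {num_configs} configuration(s)")
    num_configs //= eta  # only the top 1/eta fraction is promoted
    resource *= eta  # promoted configurations train eta times longer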

In any distributed system, synchronization typically implies idle time for workers. First, we often observe high variation in training time across hyperparameter configurations. For example, if the number of filters per layer is a hyperparameter, networks with fewer filters finish training faster than networks with more filters, which implies idle worker time due to stragglers. Moreover, the number of slots in a rung level is not always a multiple of the number of workers, in which case some workers may even sit idle for a full batch.

Fig. 19.5.1 shows the scheduling of synchronous SH with \(\eta=2\) for four different trials with two workers. We start by evaluating Trial-0 and Trial-1 for one epoch and immediately continue with the next two trials once they are finished. Before we can promote the best two trials, i.e., Trial-0 and Trial-3, to the next rung level, we first have to wait until Trial-2 finishes, which takes substantially longer than the other trials. This causes idle time for Worker-1. Then, we continue with rung 1. Here, too, Trial-3 takes longer than Trial-0, which leads to additional idle time for Worker-0. Once we reach rung 2, only the best trial, Trial-0, remains, and it occupies only one worker. To avoid Worker-1 idling during that time, most implementations of SH already continue with the next round and start evaluating new trials (e.g., Trial-4) on the first rung.

../_images/sync_sh.svg

Fig. 19.5.1 Synchronous successive halving with two workers.

Asynchronous successive halving (ASHA) (Li et al., 2018) adapts SH to the asynchronous parallel scenario. The main idea of ASHA is to promote a configuration to the next rung level as soon as we have collected at least \(\eta\) observations on the current rung level. This decision rule may lead to suboptimal promotions: configurations can be promoted to the next rung level which, in hindsight, do not compare favourably against most others at the same rung level. On the other hand, we get rid of all synchronization points this way. In practice, such suboptimal initial promotions have only a modest impact on performance, not only because the ranking of hyperparameter configurations is often fairly consistent across rung levels, but also because rungs grow over time and reflect the distribution of metric values at this level better and better. If a worker is free, but no configuration can be promoted, we start a new configuration with \(r = r_{\text{min}}\), i.e., the first rung level.
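To make the promotion rule explicit, here is a minimal sketch of the decision an ASHA-style scheduler makes whenever a worker becomes free. The function name and data layout are illustrative and not part of Syne Tune.

def asha_decision(rungs, promoted, eta):
    """Decide what a free worker should do next (illustrative sketch).

    rungs: dict mapping rung level r to a list of (trial_id, metric) results
        recorded at that level (lower metric is better).
    promoted: set of (trial_id, r) pairs that have already been promoted from r.
    """
    for r in sorted(rungs, reverse=True):  # prefer the highest rung
        results = sorted(rungs[r], key=lambda x: x[1])
        if len(results) < eta:
            continue  # fewer than eta observations: no promotion from this rung
        top = results[: len(results) // eta]  # top 1/eta fraction at rung r
        for trial_id, _ in top:
            if (trial_id, r) not in promoted:
                return "promote", trial_id, r  # continue this trial towards r * eta
    return ("new_trial",)  # nothing promotable: start a new configuration at r_min

For example, with rungs = {1: [("Trial-0", 0.3), ("Trial-1", 0.9)]}, promoted = set() and eta = 2, the rule promotes Trial-0, matching the first promotion in Fig. 19.5.2 (the metric values are made up).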

Fig. 19.5.2 shows the scheduling of the same configurations for ASHA. Once Trial-1 finishes, we collect the results of two trials (i.e., Trial-0 and Trial-1) and immediately promote the better of them (Trial-0) to the next rung level. After Trial-0 finishes on rung 1, there are too few trials there to support a further promotion. Hence, we continue with rung 0 and evaluate Trial-3. Once Trial-3 finishes, Trial-2 is still pending. At this point we have three trials evaluated on rung 0 and one trial already evaluated on rung 1. Since Trial-3 performs worse than Trial-0 at rung 0, and \(\eta=2\), we cannot promote any new trial yet, and Worker-1 starts Trial-4 from scratch instead. However, once Trial-2 finishes and scores worse than Trial-3, the latter is promoted to rung 1. Afterwards, we have collected two evaluations on rung 1, which means we can now promote Trial-0 to rung 2. At the same time, Worker-1 continues with evaluating new trials (i.e., Trial-5) on rung 0.

../_images/asha.svg

Fig. 19.5.2 Asynchronous successive halving (ASHA) with two workers.

import logging
from d2l import torch as d2l

logging.basicConfig(level=logging.INFO)
import matplotlib.pyplot as plt
from syne_tune import StoppingCriterion, Tuner
from syne_tune.backend.python_backend import PythonBackend
from syne_tune.config_space import loguniform, randint
from syne_tune.experiments import load_experiment
from syne_tune.optimizer.baselines import ASHA
INFO:root:SageMakerBackend is not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[extra]'
AWS dependencies are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[aws]'
or (for everything)
   pip install 'syne-tune[extra]'
INFO:root:Ray Tune schedulers and searchers are not imported since dependencies are missing. You can install them with
   pip install 'syne-tune[raytune]'
or (for everything)
   pip install 'syne-tune[extra]'

19.5.1. Objective Function

We will use Syne Tune with the same objective function as in Section 19.3.

def hpo_objective_lenet_synetune(learning_rate, batch_size, max_epochs):
    from syne_tune import Reporter
    from d2l import torch as d2l

    model = d2l.LeNet(lr=learning_rate, num_classes=10)
    trainer = d2l.HPOTrainer(max_epochs=1, num_gpus=1)
    data = d2l.FashionMNIST(batch_size=batch_size)
    model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
    report = Reporter()
    for epoch in range(1, max_epochs + 1):
        if epoch == 1:
            # Initialize the state of Trainer
            trainer.fit(model=model, data=data)
        else:
            trainer.fit_epoch()
        validation_error = trainer.validation_error().cpu().detach().numpy()
        report(epoch=epoch, validation_error=float(validation_error))

We will also use the same configuration space as before:

min_number_of_epochs = 2
max_number_of_epochs = 10
eta = 2

config_space = {
    "learning_rate": loguniform(1e-2, 1),
    "batch_size": randint(32, 256),
    "max_epochs": max_number_of_epochs,
}
initial_config = {
    "learning_rate": 0.1,
    "batch_size": 128,
}
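With these settings (\(r_{\text{min}} = 2\), \(r_{\text{max}} = 10\), \(\eta = 2\)), the scheduler we set up below compares trials at rung levels of 2, 4, and 8 epochs. The following sketch is purely illustrative and independent of Syne Tune; it just enumerates these rung levels.

# Illustrative only: enumerate the rung levels implied by the settings above
rung_levels = []
r = min_number_of_epochs
while r < max_number_of_epochs:
    rung_levels.append(r)
    r *= eta
print(rung_levels)  # [2, 4, 8]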

19.5.2. Asynchronous Scheduler

First, we define the number of workers that evaluate trials concurrently. As for random search, we also need to specify for how long we want to run the experiment, by defining an upper limit on the total wall-clock time.

n_workers = 2  # Needs to be <= the number of available GPUs
max_wallclock_time = 12 * 60  # 12 minutes

The code for running ASHA is a simple variation of what we did for asynchronous random search.

mode = "min"
metric = "validation_error"
resource_attr = "epoch"

scheduler = ASHA(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
    max_resource_attr="max_epochs",
    resource_attr=resource_attr,
    grace_period=min_number_of_epochs,
    reduction_factor=eta,
)
INFO:syne_tune.optimizer.schedulers.fifo:max_resource_level = 10, as inferred from config_space
INFO:syne_tune.optimizer.schedulers.fifo:Master random_seed = 1420030479

Here, metric and resource_attr specify the key names used with the report callback, and max_resource_attr denotes which input to the objective function corresponds to \(r_{\text{max}}\). Moreover, grace_period provides \(r_{\text{min}}\), and reduction_factor is \(\eta\). We can run Syne Tune as before (this will take about 12 minutes):

trial_backend = PythonBackend(
    tune_function=hpo_objective_lenet_synetune,
    config_space=config_space,
)

stop_criterion = StoppingCriterion(max_wallclock_time=max_wallclock_time)
tuner = Tuner(
    trial_backend=trial_backend,
    scheduler=scheduler,
    stop_criterion=stop_criterion,
    n_workers=n_workers,
    print_update_interval=int(max_wallclock_time * 0.6),
)
tuner.run()
INFO:syne_tune.tuner:results of trials will be saved on /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470
INFO:root:Detected 3 GPUs
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1 --batch_size 128 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/0/checkpoints
INFO:syne_tune.tuner:(trial 0) - scheduled config {'learning_rate': 0.1, 'batch_size': 128, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.5985177262140595 --batch_size 249 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/1/checkpoints
INFO:syne_tune.tuner:(trial 1) - scheduled config {'learning_rate': 0.5985177262140595, 'batch_size': 249, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.7930685343937359 --batch_size 167 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/2/checkpoints
INFO:syne_tune.tuner:(trial 2) - scheduled config {'learning_rate': 0.7930685343937359, 'batch_size': 167, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 1 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.01332236841633562 --batch_size 83 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/3/checkpoints
INFO:syne_tune.tuner:(trial 3) - scheduled config {'learning_rate': 0.01332236841633562, 'batch_size': 83, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.06395616061075522 --batch_size 157 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/4/checkpoints
INFO:syne_tune.tuner:(trial 4) - scheduled config {'learning_rate': 0.06395616061075522, 'batch_size': 157, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.03894552388726989 --batch_size 246 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/5/checkpoints
INFO:syne_tune.tuner:(trial 5) - scheduled config {'learning_rate': 0.03894552388726989, 'batch_size': 246, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.21589884448181618 --batch_size 171 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/6/checkpoints
INFO:syne_tune.tuner:(trial 6) - scheduled config {'learning_rate': 0.21589884448181618, 'batch_size': 171, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.02134388373929251 --batch_size 112 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/7/checkpoints
INFO:syne_tune.tuner:(trial 7) - scheduled config {'learning_rate': 0.02134388373929251, 'batch_size': 112, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.4281530822285822 --batch_size 236 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/8/checkpoints
INFO:syne_tune.tuner:(trial 8) - scheduled config {'learning_rate': 0.4281530822285822, 'batch_size': 236, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.14597685354921533 --batch_size 46 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/9/checkpoints
INFO:syne_tune.tuner:(trial 9) - scheduled config {'learning_rate': 0.14597685354921533, 'batch_size': 46, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.023606359106159654 --batch_size 141 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/10/checkpoints
INFO:syne_tune.tuner:(trial 10) - scheduled config {'learning_rate': 0.023606359106159654, 'batch_size': 141, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.020475406507570872 --batch_size 137 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/11/checkpoints
INFO:syne_tune.tuner:(trial 11) - scheduled config {'learning_rate': 0.020475406507570872, 'batch_size': 137, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.03072386439846424 --batch_size 52 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/12/checkpoints
INFO:syne_tune.tuner:(trial 12) - scheduled config {'learning_rate': 0.03072386439846424, 'batch_size': 52, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2884807064939987 --batch_size 52 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/13/checkpoints
INFO:syne_tune.tuner:(trial 13) - scheduled config {'learning_rate': 0.2884807064939987, 'batch_size': 52, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.49517571939878474 --batch_size 47 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/14/checkpoints
INFO:syne_tune.tuner:(trial 14) - scheduled config {'learning_rate': 0.49517571939878474, 'batch_size': 47, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 13 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.07069910046322526 --batch_size 79 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/15/checkpoints
INFO:syne_tune.tuner:(trial 15) - scheduled config {'learning_rate': 0.07069910046322526, 'batch_size': 79, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1393298222152186 --batch_size 92 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/16/checkpoints
INFO:syne_tune.tuner:(trial 16) - scheduled config {'learning_rate': 0.1393298222152186, 'batch_size': 92, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.01102850321799237 --batch_size 160 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/17/checkpoints
INFO:syne_tune.tuner:(trial 17) - scheduled config {'learning_rate': 0.01102850321799237, 'batch_size': 160, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.045553351306542623 --batch_size 163 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/18/checkpoints
INFO:syne_tune.tuner:(trial 18) - scheduled config {'learning_rate': 0.045553351306542623, 'batch_size': 163, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.18738829796624235 --batch_size 36 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/19/checkpoints
INFO:syne_tune.tuner:(trial 19) - scheduled config {'learning_rate': 0.18738829796624235, 'batch_size': 36, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.05077467757237889 --batch_size 228 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/20/checkpoints
INFO:syne_tune.tuner:(trial 20) - scheduled config {'learning_rate': 0.05077467757237889, 'batch_size': 228, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.2818254023025138 --batch_size 66 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/21/checkpoints
INFO:syne_tune.tuner:(trial 21) - scheduled config {'learning_rate': 0.2818254023025138, 'batch_size': 66, 'max_epochs': 10}
INFO:syne_tune.tuner:tuning status (last metric is reported)
 trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0    Stopped     2       0.100000         128          10    2.0          0.899031    16.655625
        1  Completed    10       0.598518         249          10   10.0          0.201665    63.855728
        2    Stopped    10       0.793069         167          10   10.0          0.178710    66.577238
        3    Stopped     2       0.013322          83          10    2.0          0.900000    17.155993
        4    Stopped     2       0.063956         157          10    2.0          0.900171    15.577330
        5    Stopped     2       0.038946         246          10    2.0          0.900107    14.921423
        6    Stopped     4       0.215899         171          10    4.0          0.332322    27.402298
        7    Stopped     2       0.021344         112          10    2.0          0.900546    16.316442
        8    Stopped     4       0.428153         236          10    4.0          0.323575    26.391071
        9    Stopped     8       0.145977          46          10    8.0          0.227629    79.750229
       10    Stopped     2       0.023606         141          10    2.0          0.900008    15.916068
       11    Stopped     2       0.020475         137          10    2.0          0.899999    16.027407
       12    Stopped     4       0.030724          52          10    4.0          0.899686    37.019265
       13  Completed    10       0.288481          52          10   10.0          0.149586    92.460085
       14    Stopped    10       0.495176          47          10   10.0          0.141497    97.445830
       15    Stopped     2       0.070699          79          10    2.0          0.899757    16.952509
       16    Stopped     4       0.139330          92          10    4.0          0.341133    30.270227
       17    Stopped     4       0.011029         160          10    4.0          0.899802    28.178677
       18    Stopped     2       0.045553         163          10    2.0          0.899761    14.328157
       19 InProgress     3       0.187388          36          10    3.0          0.229159    35.283644
       20    Stopped     2       0.050775         228          10    2.0          0.900010    14.251756
       21 InProgress     0       0.281825          66          10      -                 -            -
2 trials running, 20 finished (2 until the end), 437.04s wallclock-time

INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.13450694334348853 --batch_size 226 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/22/checkpoints
INFO:syne_tune.tuner:(trial 22) - scheduled config {'learning_rate': 0.13450694334348853, 'batch_size': 226, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.1383206315022528 --batch_size 76 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/23/checkpoints
INFO:syne_tune.tuner:(trial 23) - scheduled config {'learning_rate': 0.1383206315022528, 'batch_size': 76, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.13182291830737192 --batch_size 137 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/24/checkpoints
INFO:syne_tune.tuner:(trial 24) - scheduled config {'learning_rate': 0.13182291830737192, 'batch_size': 137, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.20451149999816387 --batch_size 178 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/25/checkpoints
INFO:syne_tune.tuner:(trial 25) - scheduled config {'learning_rate': 0.20451149999816387, 'batch_size': 178, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.044544495073044366 --batch_size 214 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/26/checkpoints
INFO:syne_tune.tuner:(trial 26) - scheduled config {'learning_rate': 0.044544495073044366, 'batch_size': 214, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.026195132773807073 --batch_size 106 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/27/checkpoints
INFO:syne_tune.tuner:(trial 27) - scheduled config {'learning_rate': 0.026195132773807073, 'batch_size': 106, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.15508038457501633 --batch_size 120 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/28/checkpoints
INFO:syne_tune.tuner:(trial 28) - scheduled config {'learning_rate': 0.15508038457501633, 'batch_size': 120, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.7306582456725493 --batch_size 79 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/29/checkpoints
INFO:syne_tune.tuner:(trial 29) - scheduled config {'learning_rate': 0.7306582456725493, 'batch_size': 79, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.08473828437431256 --batch_size 130 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/30/checkpoints
INFO:syne_tune.tuner:(trial 30) - scheduled config {'learning_rate': 0.08473828437431256, 'batch_size': 130, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.10784996779411136 --batch_size 40 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/31/checkpoints
INFO:syne_tune.tuner:(trial 31) - scheduled config {'learning_rate': 0.10784996779411136, 'batch_size': 40, 'max_epochs': 10}
INFO:syne_tune.tuner:Trial trial_id 29 completed.
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.37677242671098565 --batch_size 205 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/32/checkpoints
INFO:syne_tune.tuner:(trial 32) - scheduled config {'learning_rate': 0.37677242671098565, 'batch_size': 205, 'max_epochs': 10}
INFO:root:running subprocess with command: /home/d2l-worker/miniconda3/envs/d2l-en-release-1/bin/python /home/d2l-worker/miniconda3/envs/d2l-en-release-1/lib/python3.9/site-packages/syne_tune/backend/python_backend/python_entrypoint.py --learning_rate 0.12320387549715911 --batch_size 41 --max_epochs 10 --tune_function_root /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/tune_function --tune_function_hash 40412c9643f2f0fda4865de73fad34e4 --st_checkpoint_dir /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470/33/checkpoints
INFO:syne_tune.tuner:(trial 33) - scheduled config {'learning_rate': 0.12320387549715911, 'batch_size': 41, 'max_epochs': 10}
INFO:syne_tune.stopping_criterion:reaching max wallclock time (720), stopping there.
INFO:syne_tune.tuner:Stopping trials that may still be running.
INFO:syne_tune.tuner:Tuning finished, results of trials can be found on /home/d2l-worker/syne-tune/python-entrypoint-2022-12-14-06-32-15-470
--------------------
Resource summary (last result is reported):
 trial_id     status  iter  learning_rate  batch_size  max_epochs  epoch  validation_error  worker-time
        0    Stopped     2       0.100000         128          10    2.0          0.899031    16.655625
        1  Completed    10       0.598518         249          10   10.0          0.201665    63.855728
        2    Stopped    10       0.793069         167          10   10.0          0.178710    66.577238
        3    Stopped     2       0.013322          83          10    2.0          0.900000    17.155993
        4    Stopped     2       0.063956         157          10    2.0          0.900171    15.577330
        5    Stopped     2       0.038946         246          10    2.0          0.900107    14.921423
        6    Stopped     4       0.215899         171          10    4.0          0.332322    27.402298
        7    Stopped     2       0.021344         112          10    2.0          0.900546    16.316442
        8    Stopped     4       0.428153         236          10    4.0          0.323575    26.391071
        9    Stopped     8       0.145977          46          10    8.0          0.227629    79.750229
       10    Stopped     2       0.023606         141          10    2.0          0.900008    15.916068
       11    Stopped     2       0.020475         137          10    2.0          0.899999    16.027407
       12    Stopped     4       0.030724          52          10    4.0          0.899686    37.019265
       13  Completed    10       0.288481          52          10   10.0          0.149586    92.460085
       14    Stopped    10       0.495176          47          10   10.0          0.141497    97.445830
       15    Stopped     2       0.070699          79          10    2.0          0.899757    16.952509
       16    Stopped     4       0.139330          92          10    4.0          0.341133    30.270227
       17    Stopped     4       0.011029         160          10    4.0          0.899802    28.178677
       18    Stopped     2       0.045553         163          10    2.0          0.899761    14.328157
       19    Stopped    10       0.187388          36          10   10.0          0.170378   112.051995
       20    Stopped     2       0.050775         228          10    2.0          0.900010    14.251756
       21    Stopped     4       0.281825          66          10    4.0          0.416098    43.777161
       22    Stopped     2       0.134507         226          10    2.0          0.900179    14.330927
       23    Stopped     8       0.138321          76          10    8.0          0.226629    82.387080
       24    Stopped     2       0.131823         137          10    2.0          0.900001    15.938905
       25    Stopped     2       0.204511         178          10    2.0          0.899640    15.654077
       26    Stopped     2       0.044544         214          10    2.0          0.899911    15.411022
       27    Stopped     2       0.026195         106          10    2.0          0.899923    16.008510
       28    Stopped     4       0.155080         120          10    4.0          0.339484    30.281955
       29  Completed    10       0.730658          79          10   10.0          0.147388    75.855369
       30    Stopped     2       0.084738         130          10    2.0          0.899983    16.240244
       31 InProgress     7       0.107850          40          10    7.0          0.203700    74.223180
       32    Stopped     8       0.376772         205          10    8.0          0.227063    50.852603
       33 InProgress     0       0.123204          41          10      -                 -            -
2 trials running, 32 finished (3 until the end), 723.68s wallclock-time

validation_error: best 0.1414971947669983 for trial-id 14
--------------------

Note that we are running a variant of ASHA where underperforming trials are stopped early. This is different from our implementation in Section 19.4.1, where each training job is started with a fixed max_epochs. In the latter case, a well-performing trial which reaches the full 10 epochs first needs to train for 1, then 2, then 4, then 8 epochs, each time starting from scratch. This type of pause-and-resume scheduling can be implemented efficiently by checkpointing the training state after each epoch, but we avoid this extra complexity here.
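Purely for illustration, the sketch below shows how one could ask for the pause-and-resume variant instead. It assumes that the installed Syne Tune version accepts a type argument for ASHA (check the documentation of your version), and it only pays off if the training script additionally saves and loads checkpoints via its checkpoint directory, which our objective function above does not do.

# Sketch only: promotion-based (pause-and-resume) ASHA. Paused trials are
# resumed later instead of being stopped. Without checkpointing in the
# training script, every resume would restart training from scratch.
scheduler_promotion = ASHA(
    config_space,
    metric=metric,
    mode=mode,
    points_to_evaluate=[initial_config],
    max_resource_attr="max_epochs",
    resource_attr=resource_attr,
    grace_period=min_number_of_epochs,
    reduction_factor=eta,
    type="promotion",  # assumption: supported by the installed version
)

After the experiment has finished, we can retrieve and plot results.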

d2l.set_figsize()
e = load_experiment(tuner.name)
e.plot()
WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../_images/output_sh-async_bb0ea6_13_1.svg

19.5.3. Visualize the Optimization Process

Once more, we visualize the learning curves of every trial (each color in the plot represents a trial). Compare this to asynchronous random search in Section 19.3. As we have seen for successive halving in Section 19.4, most of the trials are stopped at the lowest rung levels, after \(r_{\text{min}}\) or \(\eta \cdot r_{\text{min}}\) epochs (2 or 4 epochs in our setup). However, trials do not stop at the same point in time, because they require different amounts of time per epoch. If we ran standard successive halving instead of ASHA, we would have to synchronize our workers before we could promote configurations to the next rung level.

d2l.set_figsize([6, 2.5])
results = e.results
for trial_id in results.trial_id.unique():
    df = results[results["trial_id"] == trial_id]
    d2l.plt.plot(
        df["st_tuner_time"],
        df["validation_error"],
        marker="o"
    )
d2l.plt.xlabel("wall-clock time")
d2l.plt.ylabel("objective function")
Text(0, 0.5, 'objective function')
../_images/output_sh-async_bb0ea6_15_1.svg
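We can quantify this by counting how many epochs each trial was trained for. The snippet below uses the results dataframe loaded above; it assumes that the columns are named trial_id and epoch, matching the keys reported by our objective function.

# Number of epochs each trial reached before it was stopped or completed
max_epoch_per_trial = results.groupby("trial_id")["epoch"].max()
# How many trials ended at each rung level (2, 4, 8) or ran the full 10 epochs
print(max_epoch_per_trial.value_counts().sort_index())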

19.5.4. Summary

Compared to random search, successive halving is not quite as trivial to run in an asynchronous distributed setting. To avoid synchronization points, we promote configurations as quickly as possible to the next rung level, even if this means promoting some wrong ones. In practice, this usually does not hurt much, and the gains of asynchronous over synchronous scheduling usually far outweigh the cost of these suboptimal decisions.

Discussions