Introduction
Neural networks are well known for their strong performance, but that performance depends on a proper choice of hyperparameters: the configuration settings that specify how training should proceed. Common examples are the learning rate and regularisation strength, which control how a network trades off memorising the training data against generalising to new, unseen data. Hyperparameters can be tuned manually, by trial and error of the engineers developing a system, but manual tuning becomes infeasible when there are many of them. It is therefore important to develop approaches that search for hyperparameters automatically, by repeatedly training networks, measuring how well they perform, and updating the hyperparameters to improve that performance. In our work we focus on the case of thousands or millions of hyperparameters, where gradient-based meta-learning methods are needed instead of standard hyperparameter optimization techniques such as random search or Bayesian optimization.
Existing gradient-based methods optimize hyperparameters by computing gradients of the validation performance with respect to the hyperparameters. This allows precise tuning, but it is expensive in both time and memory, so much so that it is practical only for smaller neural networks, making gradient-based hyperparameter search on large networks nearly impossible. We develop a new method for hyperparameter search that is substantially more efficient, to the point that we can apply it to bigger and more powerful neural networks.
We present EvoGrad, a new approach to meta-learning that draws on evolutionary techniques to compute hypergradients more efficiently. EvoGrad estimates the hypergradient without calculating higher-order gradients, which leads to significant savings in time and memory and enables scaling meta-learning to bigger architectures, for example from ResNet10 to ResNet34, while matching the performance of standard meta-learning methods.
TL;DR: More efficient gradient-based meta-learning and hyperparameter
optimization inspired by evolutionary methods
Method
Setting
We aim to solve a bilevel optimization problem where our goal is to find
hyperparameters \(\lambda\) that minimize the validation loss \(\ell_V\) of the
model parametrized by \(\theta\) and trained with loss \(\ell_T\) and
\(\lambda\).
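One way to write this bilevel problem explicitly is
\[\lambda^\ast=\underset{\lambda}{\arg\min}\;\ell_V\left(\theta^\ast\left(\lambda\right)\right)\quad\text{s.t.}\quad\theta^\ast\left(\lambda\right)=\underset{\theta}{\arg\min}\;\ell_T\left(\theta,\lambda\right).\]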
In order to meta-learn the value of \(\lambda\) using gradient-based methods, we
need to calculate the hypergradient
\(\frac{\partial\ell_V}{\partial\lambda}\). However, its direct evaluation is
typically zero because the hyperparameter does
not directly influence the value of the validation loss – it influences the
validation loss via the impact on the model
weights \(\theta\). The model weights \(\theta\) are themselves trained using
gradient optimization as part of the inner loop,
which gives rise to higher-order derivatives.
We propose a variation in which the update of the model weights is inspired by evolutionary methods, allowing us to eliminate the need for higher-order derivatives. We consider the setting where the hypergradient of \(\lambda\) is estimated online, alongside the updates of the base model \(\theta\), as this is the setting most widely used in practical applications. The standard approach that adopts this online meta-learning setting and estimates the hypergradient by backpropagating through a gradient-based inner step is known as \(T_1-T_2\) [1].
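For reference, here is a minimal PyTorch sketch (not the authors' code; the toy losses and all names are purely illustrative) of how a \(T_1-T_2\)-style hypergradient is obtained by backpropagating the validation loss through a single gradient-based inner step, which is exactly what introduces second-order derivatives.

```python
import torch

# Toy model weights and a single hyperparameter (e.g. regularisation strength).
theta = torch.randn(10, requires_grad=True)
lam = torch.tensor(0.1, requires_grad=True)
lr = 0.01

def train_loss(theta, lam):
    # illustrative training loss with a lambda-weighted regulariser
    return ((theta - 1.0) ** 2).sum() + lam * (theta ** 2).sum()

def val_loss(theta):
    # illustrative validation loss
    return ((theta - 0.5) ** 2).sum()

# Inner step: create_graph=True keeps the graph of the update itself, so the
# validation loss can later be backpropagated through it (second-order terms).
grad_theta = torch.autograd.grad(train_loss(theta, lam), theta, create_graph=True)[0]
theta_new = theta - lr * grad_theta

# Hypergradient of the validation loss with respect to the hyperparameter.
hypergrad = torch.autograd.grad(val_loss(theta_new), lam)[0]
print(hypergrad)
```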
Evolutionary inner step
First, we sample random perturbations \(\epsilon\), and apply them to \(\theta\).
Sampling \(K\) perturbations, we can create a
population of \(K\) variants \(\{\theta_k\}_{k=1}^K\) of the current model as
\(\theta_k=\theta+\epsilon_k\). We can now compute the
training losses \(\{\ell_k\}_{k=1}^K\) for each of the \(K\) models,
\(\ell_k=f\left(\mathcal{D}_T\middle|\theta_k,\lambda\right)\)
using the current minibatch \(\mathcal{D}_T\) drawn from the training set. Given
these loss values, we can calculate the
weights of the population of candidate models as
\[w_1,w_2,\ldots,w_K=\text{softmax}\left(\left[-\ell_1,-\ell_2,\ldots,-\ell_K\right]/\tau\right),\]
where \(\tau\) is a temperature parameter that rescales the losses to control
the scale of weight variability.
Given the weights \(\{w_k\}_{k=1}^K\), we complete the current step of
evolutionary learning by updating the model
parameters via the affine combination
\[\theta^\ast=w_1\theta_1+w_2\theta_2+\ldots+w_K\theta_K.\]
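A minimal PyTorch sketch of this evolutionary inner step could look as follows (the function and argument names are illustrative, not taken from the official implementation); note that the hyperparameters have to enter the training loss so that the combined parameters \(\theta^\ast\) carry a dependence on \(\lambda\).

```python
import torch

def evo_inner_step(theta, lam, train_loss_fn, K=2, sigma=0.01, tau=0.05):
    """One evolutionary inner step (sketch).

    theta: flattened model parameters, shape (d,)
    lam: hyperparameter tensor with requires_grad=True
    train_loss_fn(theta_k, lam): training loss of one model copy on the
        current minibatch
    """
    # 1) sample K perturbed copies theta_k = theta + eps_k
    thetas = [theta + sigma * torch.randn_like(theta) for _ in range(K)]
    # 2) evaluate the training loss of every copy
    losses = torch.stack([train_loss_fn(t, lam) for t in thetas])
    # 3) softmax weights: lower loss -> larger weight, tau controls the spread
    w = torch.softmax(-losses / tau, dim=0)
    # 4) combine the copies with the weights
    return sum(w_k * t_k for w_k, t_k in zip(w, thetas))
```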
Computing the hypergradient
We now evaluate the updated model \(\theta^\ast\) on a minibatch from the validation set \(\mathcal{D}_V\) and take the gradient of
the validation loss \(\ell_V=f\left(\mathcal{D}_V\middle|\theta^\ast\right)\)
w.r.t. the hyperparameter:
\[\frac{\partial\ell_V}{\partial\lambda}=\frac{\partial
f\left(\mathcal{D}_V\middle|\theta^\ast\right)}{\partial\lambda}.\]
One can easily verify that the computation does not involve second-order
gradients as no first-order gradients were used
in the inner loop. We illustrate the EvoGrad update in Figure 1.
Figure 1: Graphical illustration of a single EvoGrad update using \(K=2\)
model
copies. Once the hypergradient
\(\frac{\partial\ell_V}{\partial\lambda}\) is calculated, it is used to
update the hyperparameters. We do a standard update
of the model afterwards.
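Putting the pieces together, one full EvoGrad update might look like the following self-contained sketch; the toy losses, learning rates and other names are our own illustrative choices rather than the authors' implementation.

```python
import torch

torch.manual_seed(0)
theta = torch.randn(10)                       # base model weights (toy)
lam = torch.tensor(0.1, requires_grad=True)   # hyperparameter
K, sigma, tau = 2, 0.01, 0.05
lr, hyper_lr = 0.1, 0.01

def train_loss(theta, lam):
    return ((theta - 1.0) ** 2).sum() + lam * (theta ** 2).sum()

def val_loss(theta):
    return ((theta - 0.5) ** 2).sum()

# Evolutionary inner step: perturb, weight by softmax of negative losses,
# combine. No gradient step is taken, so no second-order terms appear later.
thetas = [theta + sigma * torch.randn_like(theta) for _ in range(K)]
losses = torch.stack([train_loss(t, lam) for t in thetas])
w = torch.softmax(-losses / tau, dim=0)
theta_star = sum(w_k * t_k for w_k, t_k in zip(w, thetas))

# Hypergradient: first-order derivative of the validation loss w.r.t. lam.
hypergrad = torch.autograd.grad(val_loss(theta_star), lam)[0]
with torch.no_grad():
    lam -= hyper_lr * hypergrad               # hyperparameter update

# Standard training step of the base model afterwards.
theta.requires_grad_(True)
grad_theta = torch.autograd.grad(train_loss(theta, lam.detach()), theta)[0]
theta = (theta - lr * grad_theta).detach()
```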
Illustration
For illustration we consider a problem in which we minimize a validation loss
function
\(f_V\left(x\right)=\left(x-0.5\right)^2\) where parameter \(x\) is optimized
using SGD with training loss function
\(f_T\left(x\right)=\left(x-1\right)^2+\lambda\left\|x\right\|_2^2\) that includes a
meta-parameter \(\lambda\). A closed-form solution for
the hypergradient is available, which allows us to compare EvoGrad against the
ground-truth gradient. The results in
Figure 2 show that EvoGrad estimates have a similar trend to the ground-truth
gradient, even if the EvoGrad estimates
are noisy. The level of noise decreases with more models in the population, but
the correct trend is visible even if we
use only two models.
Figure 2: Comparison of the hypergradient \(\partial
f_V/\partial\lambda\)
estimated by EvoGrad vs the ground-truth.
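The toy problem is easy to reproduce; the sketch below compares a single EvoGrad estimate against a ground truth computed, under our assumption, as the hypergradient through one SGD inner step (the exact reference used for the figure may differ).

```python
import torch

torch.manual_seed(0)
x0, lam0, lr = 0.8, 0.3, 0.1          # illustrative starting point
K, sigma, tau = 100, 0.05, 0.05

f_T = lambda x, lam: (x - 1.0) ** 2 + lam * x ** 2
f_V = lambda x: (x - 0.5) ** 2

# Assumed ground truth: differentiate f_V(x') with x' = x - lr * df_T/dx.
lam = torch.tensor(lam0, requires_grad=True)
x = torch.tensor(x0)
x_prime = x - lr * (2 * (x - 1.0) + 2 * lam * x)
true_grad = torch.autograd.grad(f_V(x_prime), lam)[0]

# EvoGrad estimate: perturb x, weight by softmax of negative training losses,
# combine, and differentiate the validation loss w.r.t. lam.
lam = torch.tensor(lam0, requires_grad=True)
xs = x + sigma * torch.randn(K)
w = torch.softmax(-f_T(xs, lam) / tau, dim=0)
x_star = (w * xs).sum()
evo_grad = torch.autograd.grad(f_V(x_star), lam)[0]

print(true_grad.item(), evo_grad.item())  # same sign; the estimate is noisier
```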
Case Study
For the case study in this blog post we select the cross-domain few-shot learning (CD-FSL) problem, which showcases the benefits of EvoGrad. CD-FSL is an important and highly challenging problem at the forefront of computer vision.
The goal is to learn to solve a new classification task from a new unseen domain
using a very small number of examples.
For instance, we may want to learn to distinguish different types of birds after
seeing only one example of each – while
pre-training was done only on examples coming from other domains and classes.
CD-FSL tasks are structured as \(N\)-way
\(K\)-shot tasks where we try to classify among \(N\) new classes after seeing
\(K\) examples of each class. We illustrate CD-FSL in
Figure 3.
Figure 3: Illustration of cross-domain few-shot learning with 3-way 2-shot
tasks. We try to classify query set examples
into one of the three classes after seeing two examples of each class in the
support set. Test tasks are sampled from
domains not seen during training.
The state-of-the-art approach to CD-FSL uses learned feature-wise transformation (LFT) layers [2]. It meta-learns stochastic feature-wise transformation layers that regularise metric-based few-shot learners such as RelationNet, improving their generalisation in cross-domain conditions. For CD-FSL, as well as the other practical use cases we consider, it is enough to use EvoGrad with two model copies.
Table 1 shows the baseline performance of vanilla unregularised ResNet (-),
manually tuned FT layers (FT), FT layers
meta-learned by second-order gradient (LFT) and by EvoGrad. The results show
that EvoGrad matches the accuracy of the
original LFT approach, leading to clear accuracy improvements over training with
no feature-wise transformation or
training with fixed feature-wise parameters selected manually. At the same time
EvoGrad is significantly more efficient
in terms of the memory and time costs as shown in Figure 4. The memory
improvements from EvoGrad allow us to scale the
base feature extractor to ResNet34 within a standard 12GB GPU.
Table 1: Test accuracies (%) and 95% confidence intervals across test tasks on
various unseen datasets. EvoGrad can
clearly match the accuracies obtained by the original approach that uses
\(T_1-T_2\). LFT EvoGrad can scale to ResNet34 on
all tasks within 12GB GPU memory, while vanilla second-order LFT \(T_1-T_2\)
cannot. We also report the results of our own
rerun of the LFT approach using the official code – denoted as our run.
Figure 4: Cross-domain few-shot learning with LFT [2]: analysis of memory and
time efficiency of EvoGrad vs standard
second-order \(T_1-T_2\) approach. EvoGrad is significantly more efficient
in terms of both memory usage and time per epoch.
Mean and standard deviation reported across experiments with different test
datasets.
In our paper we have also evaluated EvoGrad on two other practical problems where
meta-learning has made an impact: 1)
learning with noisy labels using a sample weighting network [3] and 2)
low-resource cross-lingual learning with meta
representation transformation layers [4]. In both cases EvoGrad significantly
reduces the memory and time costs, yet
keeps the accuracy improvements brought by meta-learning.
Summary
We have proposed a new efficient method for meta-learning that allows us to scale
gradient-based meta-learning to bigger
models and problems. We have evaluated the method on a variety of problems, most
notably meta-learning feature-wise
transformation layers, training with noisy labels using a sample weighting model,
and meta-learning meta representation
transformation for low-resource cross-lingual learning. In all cases we have
shown significant time and memory
efficiency improvements, while achieving similar or better performance compared
to the existing meta-learning methods.
Publication
We have presented EvoGrad at the 35th Conference on Neural Information Processing
Systems (NeurIPS), 2021.
References
[1] Luketina, J., Berglund, M., Greff, K., and Raiko, T. (2016). Scalable
gradient-based tuning
of continuous regularization hyperparameters. In ICML.
[2] Tseng, H.-Y., Lee, H.-Y., Huang, J.-B., and Yang, M.-H. (2020). Cross-domain
few-shot
classification via learned feature-wise transformation. In ICLR.
[3] Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., and Meng, D. (2019).
Meta-Weight-Net:
learning an explicit mapping for sample weighting. In NeurIPS.
[4] Xia, M., Zheng, G., Mukherjee, S., Shokouhi, M., Neubig, G., and Awadallah,
A. H. (2021).
MetaXL: meta representation transformation for low-resource cross-lingual
learning. In NAACL.