ABSTRACT
We introduce kernels with random Fourier features into the meta-learning framework for few-shot learning. We propose meta variational random features (MetaVRF) to learn adaptive kernels for the base-learner, developed in a latent variable model that treats the random feature basis as the latent variable. We formulate the optimization of MetaVRF as a variational inference problem by deriving an evidence lower bound under the meta-learning framework. To incorporate shared knowledge from related tasks, we propose context inference of the posterior, established with an LSTM architecture. The LSTM-based inference network effectively integrates context information from previous tasks with task-specific information, generating informative and adaptive features. The learned MetaVRF can produce kernels of high representational power at a relatively low spectral sampling rate, and it enables fast adaptation to new tasks. Experimental results on a variety of few-shot regression and classification tasks demonstrate that MetaVRF delivers much better, or at least competitive, performance compared with existing meta-learning alternatives.
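To make the core mechanism concrete, the sketch below shows how random Fourier features approximate a shift-invariant kernel from sampled spectral bases. This is a minimal NumPy illustration, not the MetaVRF implementation: the names `random_fourier_features`, `omega`, `D`, and `sigma` are ours, and where this sketch draws bases from a fixed Gaussian prior (which recovers an RBF kernel), MetaVRF would instead sample them from the LSTM-conditioned variational posterior described above.

```python
import numpy as np

def random_fourier_features(X, omega, b):
    """Map inputs X of shape (n, d) to D random features so that
    phi(x) @ phi(y) approximates a shift-invariant kernel k(x - y)."""
    D = omega.shape[1]
    return np.sqrt(2.0 / D) * np.cos(X @ omega + b)

rng = np.random.default_rng(0)
n, d, D, sigma = 5, 16, 2048, 1.0  # D is the spectral sampling rate

X = rng.normal(size=(n, d))
# For a fixed RBF kernel, omega ~ N(0, I / sigma^2); in MetaVRF the
# bases would come from a learned, task-conditioned posterior instead.
omega = rng.normal(scale=1.0 / sigma, size=(d, D))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

Phi = random_fourier_features(X, omega, b)
K_approx = Phi @ Phi.T  # approximate Gram matrix for the base-learner
K_exact = np.exp(-0.5 * ((X[:, None] - X[None]) ** 2).sum(-1) / sigma**2)
print(np.abs(K_approx - K_exact).max())  # shrinks as D grows
```

The Monte Carlo approximation error of such features decays on the order of 1/sqrt(D), which is why learning informative, adaptive bases rather than sampling them blindly can yield good kernels at a relatively low sampling rate D.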