References
- Abadi et al., 2016
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … et al. (2016). TensorFlow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265–283).
- Abdel-Hamid et al., 2014
Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1533–1545.
- Ahmed et al., 2012
Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., & Smola, A. J. (2012). Scalable inference in latent variable models. Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (pp. 123–132).
- Akiba et al., 2019
Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: a next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
- Alayrac et al., 2022
Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … et al. (2022). Flamingo: a visual language model for few-shot learning. ArXiv:2204.14198.
- Alsallakh et al., 2020
Alsallakh, B., Kokhlikyan, N., Miglani, V., Yuan, J., & Reblitz-Richardson, O. (2020). Mind the PAD – CNNs can develop blind spots. ArXiv:2010.02178.
- Anil et al., 2023
Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., … et al. (2023). PaLM 2 Technical Report. ArXiv:2305.10403.
- Anil et al., 2020
Anil, R., Gupta, V., Koren, T., Regan, K., & Singer, Y. (2020). Scalable second-order optimization for deep learning. ArXiv:2002.09018.
- Aronszajn, 1950
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
- Ba et al., 2016
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. ArXiv:1607.06450.
- Baevski & Auli, 2018
Baevski, A., & Auli, M. (2018). Adaptive input representations for neural language modeling. International Conference on Learning Representations.
- Bahdanau et al., 2014
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. ArXiv:1409.0473.
- Bai et al., 2022
Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … et al. (2022). Constitutional AI: harmlessness from AI feedback. ArXiv:2212.08073.
- Baptista & Poloczek, 2018
Baptista, R., & Poloczek, M. (2018). Bayesian optimization of combinatorial structures. Proceedings of the 35th International Conference on Machine Learning.
- Bardenet et al., 2013
Bardenet, R., Brendel, M., Kégl, B., & Sebag, M. (2013). Collaborative hyperparameter tuning. Proceedings of the 30th International Conference on Machine Learning (ICML'13).
- Bay et al., 2006
Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. European Conference on Computer Vision (pp. 404–417).
- Bellman, 1966
Bellman, R. (1966). Dynamic programming. Science, 153, 34–37.
- Bellman, 1952
Bellman, R. (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38(8), 716–719.
- Bellman, 1957a
Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 679–684. URL: http://www.jstor.org/stable/24900506
- Bellman, 1957b
Bellman, R. (1957). Dynamic Programming. Dover Publications.
- Beltagy et al., 2020
Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: the long-document transformer. ArXiv:2004.05150.
- Bengio et al., 2003
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137–1155.
- Bengio et al., 1994
Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
- Bergstra et al., 2011
Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24.
- Bergstra et al., 2010
Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., … Bengio, Y. (2010). Theano: a CPU and GPU math compiler in Python. Proc. 9th Python in Science Conference (pp. 3–10).
- Beutel et al., 2014
Beutel, A., Murray, K., Faloutsos, C., & Smola, A. J. (2014). CoBaFi: collaborative Bayesian filtering. Proceedings of the 23rd International Conference on World Wide Web (pp. 97–108).
- Bishop, 1995
Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.
- Bishop, 2006
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Black & Scholes, 1973
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81, 637–654.
- Bodla et al., 2017
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS-improving object detection with one line of code. Proceedings of the IEEE International Conference on Computer Vision (pp. 5561–5569).
- Bojanowski et al., 2017
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
- Bollobas, 1999
Bollobás, B. (1999). Linear Analysis. Cambridge University Press.
- Bommasani et al., 2021
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … et al. (2021). On the opportunities and risks of foundation models. ArXiv:2108.07258.
- Bottou, 2010
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT'2010 (pp. 177–186). Springer.
- Bottou & Le Cun, 1988
Bottou, L., & Le Cun, Y. (1988). SN: a simulator for connectionist models. Proceedings of NeuroNimes 88 (pp. 371–382). Nîmes, France. URL: http://leon.bottou.org/papers/bottou-lecun-88
- Boucheron et al., 2005
Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.
- Bowman et al., 2015
Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. ArXiv:1508.05326.
- Boyd & Vandenberghe, 2004
Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge, England: Cambridge University Press.
- Bradley & Terry, 1952
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.
- Brown & Sandholm, 2017
Brown, N., & Sandholm, T. (2017). Libratus: the superhuman AI for no-limit poker. IJCAI (pp. 5226–5228).
- Brown et al., 1990
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., … Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79–85.
- Brown et al., 1988
Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Mercer, R. L., & Roossin, P. (1988). A statistical approach to language translation. COLING Budapest 1988 Volume 1: International Conference on Computational Linguistics.
- Brown et al., 2020
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
- Buslaev et al., 2020
Buslaev, A., Iglovikov, V. I., Khvedchenya, E., Parinov, A., Druzhinin, M., & Kalinin, A. A. (2020). Albumentations: Fast and flexible image augmentations. Information, 11(2), 125.
- Campbell et al., 2002
Campbell, M., Hoane Jr, A. J., & Hsu, F.-h. (2002). Deep Blue. Artificial Intelligence, 134(1-2), 57–83.
- Canny, 1987
Canny, J. (1987). A computational approach to edge detection. Readings in Computer Vision (pp. 184–203). Elsevier.
- Cer et al., 2017
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 Task 1: semantic textual similarity multilingual and crosslingual focused evaluation. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14).
- Chan et al., 2015
Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, attend and spell. ArXiv:1508.01211.
- Chen et al., 2021
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., … Mordatch, I. (2021). Decision transformer: reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 15084–15097.
- Chen et al., 2015
Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., … Zhang, Z. (2015). MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. ArXiv:1512.01274.
- Cheng et al., 2016
Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 551–561).
- Chetlur et al., 2014
Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). cuDNN: efficient primitives for deep learning. ArXiv:1410.0759.
- Cho et al., 2014a
Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. ArXiv:1409.1259.
- Cho et al., 2014b
Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. ArXiv:1406.1078.
- Chowdhery et al., 2022
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … et al. (2022). PaLM: scaling language modeling with pathways. ArXiv:2204.02311.
- Chung et al., 2014
Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv:1412.3555.
- Clark et al., 2020
Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: pre-training text encoders as discriminators rather than generators. International Conference on Learning Representations.
- Collobert et al., 2011
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.
- Cordonnier et al., 2020
Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020). On the relationship between self-attention and convolutional layers. International Conference on Learning Representations.
- Cover & Thomas, 1999
Cover, T., & Thomas, J. (1999). Elements of Information Theory. John Wiley & Sons.
- Csiszar, 2008
Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3), 261–273.
- Cybenko, 1989
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
- Dalal & Triggs, 2005
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (pp. 886–893).
- DeCock, 2011
De Cock, D. (2011). Ames, Iowa: alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3).
- Dean et al., 2012
Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … et al. (2012). Large scale distributed deep networks. Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1 (pp. 1223–1231).
- DeCandia et al., 2007
DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., … Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review (pp. 205–220).
- Deng et al., 2009
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
- DerKiureghian & Ditlevsen, 2009
Der Kiureghian, A., & Ditlevsen, O. (2009). Aleatory or epistemic? Does it matter? Structural Safety, 31(2), 105–112.
- Devlin et al., 2018
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv:1810.04805.
- Dinh et al., 2014
Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: non-linear independent components estimation. ArXiv:1410.8516.
- Dinh et al., 2017
Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using real NVP. International Conference on Learning Representations.
- Doersch et al., 2015
Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE International Conference on Computer Vision (pp. 1422–1430).
- Dosovitskiy et al., 2021
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … et al. (2021). An image is worth 16x16 words: transformers for image recognition at scale. International Conference on Learning Representations.
- Duchi et al., 2011
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
- Dumoulin & Visin, 2016
Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. ArXiv:1603.07285.
- Dwivedi & Bresson, 2020
Dwivedi, V. P., & Bresson, X. (2020). A generalization of transformer networks to graphs. ArXiv:2012.09699.
- Dwork et al., 2015
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. L. (2015). Preserving statistical validity in adaptive data analysis. Proceedings of the 47th Annual ACM Symposium on Theory of Computing (pp. 117–126).
- Elman, 1990
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
- Elsken et al., 2018
Elsken, T., Metzen, J. H., & Hutter, F. (2018). Neural architecture search: a survey. ArXiv:1808.05377 [stat.ML].
- Fechner, 1860
Fechner, G. T. (1860). Elemente der Psychophysik. Vol. 2. Breitkopf u. Härtel.
- Fedus et al., 2022
Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.
- Fernando, 2004
Fernando, R. (2004). GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. Addison-Wesley.
- Feurer & Hutter, 2018
Feurer, M., & Hutter, F. (2018). Hyperparameter optimization. Automatic Machine Learning: Methods, Systems, Challenges. Springer.
- Feurer et al., 2022
Feurer, M., Letham, B., Hutter, F., & Bakshy, E. (2022). Practical transfer learning for Bayesian optimization. ArXiv:1802.02219 [stat.ML].
- Field, 1987
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. JOSA A, 4(12), 2379–2394.
- Fisher, 1925
Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd.
- Flammarion & Bach, 2015
Flammarion, N., & Bach, F. (2015). From averaging to acceleration, there is only a step-size. Conference on Learning Theory (pp. 658–695).
- Forrester et al., 2007
Forrester, A. I., Sóbester, A., & Keane, A. J. (2007). Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088), 3251–3269.
- Franceschi et al., 2017
Franceschi, L., Donini, M., Frasconi, P., & Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. Proceedings of the 34th International Conference on Machine Learning (ICML'17).
- Frankle & Carbin, 2018
Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. ArXiv:1803.03635.
- Frazier, 2018
Frazier, P. I. (2018). A tutorial on Bayesian optimization. ArXiv:1807.02811.
- Freund & Schapire, 1996
Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Proceedings of the International Conference on Machine Learning (pp. 148–156).
- Friedman, 1987
Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82(397), 249–266.
- Frostig et al., 2018
Frostig, R., Johnson, M. J., & Leary, C. (2018). Compiling machine learning programs via high-level tracing. Proceedings of Systems for Machine Learning.
- Fukushima, 1982
Fukushima, K. (1982). Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. Competition and Cooperation in Neural Nets (pp. 267–285). Springer.
- Gardner et al., 2018
Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., & Wilson, A. G. (2018). GPyTorch: blackbox matrix–matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems.
- Garg et al., 2021
Garg, S., Balakrishnan, S., Kolter, Z., & Lipton, Z. (2021). RATT: leveraging unlabeled data to guarantee generalization. International Conference on Machine Learning (pp. 3598–3609).
- Gatys et al., 2016
Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2414–2423).
- Gauss, 1809
Gauss, C. F. (1809). Theoria motus corporum coelestium. Werke. Königlich Preussische Akademie der Wissenschaften.
- Gibbs, 1902
Gibbs, J. W. (1902). Elementary Principles of Statistical Mechanics. Scribner's.
- Ginibre, 1965
Ginibre, J. (1965). Statistical ensembles of complex, quaternion, and real matrices. Journal of Mathematical Physics, 6(3), 440–449.
- Girshick, 2015
Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).
- Girshick et al., 2014
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580–587).
- Glorot & Bengio, 2010
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 249–256).
- Goh, 2017
Goh, G. (2017). Why momentum really works. Distill. URL: http://distill.pub/2017/momentum
- Goldberg et al., 1992
Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61–71.
- Golub & VanLoan, 1996
Golub, G. H., & Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins University Press.
- Goodfellow et al., 2016
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
- Goodfellow et al., 2014
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems (pp. 2672–2680).
- Gotmare et al., 2018
Gotmare, A., Keskar, N. S., Xiong, C., & Socher, R. (2018). A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. ArXiv:1810.13243.
- Goyal et al., 2021
Goyal, A., Bochkovskiy, A., Deng, J., & Koltun, V. (2021). Non-deep networks. ArXiv:2110.07641.
- Graham, 2014
Graham, B. (2014). Fractional max-pooling. ArXiv:1412.6071.
- Graves, 2013
Graves, A. (2013). Generating sequences with recurrent neural networks. ArXiv:1308.0850.
- Graves et al., 2008
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2008). A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855–868.
- Graves & Schmidhuber, 2005
Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6), 602–610.
- Griewank, 1989
Griewank, A. (1989). On automatic differentiation. Mathematical Programming: Recent Developments and Applications (pp. 83–107). Kluwer.
- Gulati et al., 2020
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., … et al. (2020). Conformer: convolution-augmented transformer for speech recognition. Proc. Interspeech 2020 (pp. 5036–5040).
- Gunawardana & Shani, 2015
Gunawardana, A., & Shani, G. (2015). Evaluating recommender systems. Recommender Systems Handbook (pp. 265–308). Springer.
- Guo et al., 2017
Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). DeepFM: a factorization-machine based neural network for CTR prediction. Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1725–1731).
- Guyon et al., 2008
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2008). Feature Extraction: Foundations and Applications. Springer.
- Hadjis et al., 2016
Hadjis, S., Zhang, C., Mitliagkas, I., Iter, D., & Ré, C. (2016). Omnivore: an optimizer for multi-device deep learning on CPUs and GPUs. ArXiv:1606.04487.
- Hartley & Zisserman, 2000
Hartley, R., & Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.
- Hartley & Kahl, 2009
Hartley, R. I., & Kahl, F. (2009). Global optimization through rotation space search. International Journal of Computer Vision, 82(1), 64–79.
- He et al., 2022
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).
- He et al., 2017a
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (pp. 2961–2969).
- He et al., 2015
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (pp. 1026–1034).
- He et al., 2016a
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
- He et al., 2016b
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. European Conference on Computer Vision (pp. 630–645).
- He & Chua, 2017
He, X., & Chua, T.-S. (2017). Neural factorization machines for sparse predictive analytics. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 355–364).
- He et al., 2017b
He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering. Proceedings of the 26th International Conference on World Wide Web (pp. 173–182).
- Hebb, 1949
Hebb, D. O. (1949). The Organization of Behavior. Wiley.
- Hendrycks & Gimpel, 2016
Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). ArXiv:1606.08415.
- Hennessy & Patterson, 2011
Hennessy, J. L., & Patterson, D. A. (2011). Computer Architecture: A Quantitative Approach. Elsevier.
- Herlocker et al., 1999
Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. 22nd Annual International ACM Conference on Research and Development in Information Retrieval, SIGIR 1999 (pp. 230–237).
- Hidasi et al., 2015
Hidasi, B., Karatzoglou, A., Baltrunas, L., & Tikk, D. (2015). Session-based recommendations with recurrent neural networks. ArXiv:1511.06939.
- Ho et al., 2020
Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
- Hochreiter et al., 2001
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
- Hochreiter & Schmidhuber, 1997
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
- Hoffmann et al., 2022
Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … et al. (2022). Training compute-optimal large language models. ArXiv:2203.15556.
- Howard et al., 2019
Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., … Adam, H. (2019). Searching for MobileNetV3. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1314–1324).
- Hoyer et al., 2009
Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems (pp. 689–696).
- Hu et al., 2018
Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132–7141).
- Hu et al., 2008
Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. 2008 8th IEEE International Conference on Data Mining (pp. 263–272).
- Hu et al., 2022
Hu, Z., Lee, R. K.-W., Aggarwal, C. C., & Zhang, A. (2022). Text style transfer: a review and experimental evaluation. ACM SIGKDD Explorations Newsletter, 24(1). URL: https://doi.org/10.1145/3544903.3544906
- Huang et al., 2018
Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., … Eck, D. (2018). Music transformer: generating music with long-term structure. International Conference on Learning Representations.
- Huang et al., 2017
Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700–4708).
- Huang et al., 2015
Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM–CRF models for sequence tagging. ArXiv:1508.01991.
- Hubel & Wiesel, 1959
Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148(3), 574–591.
- Hubel & Wiesel, 1962
Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160(1), 106–154.
- Hubel & Wiesel, 1968
Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195(1), 215–243.
- Hutter et al., 2011
Hutter, F., Hoos, H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11).
- Hutter et al., 2019
Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.) (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer.
- Ioffe, 2017
Ioffe, S. (2017). Batch renormalization: towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing Systems (pp. 1945–1953).
- Ioffe & Szegedy, 2015
Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv:1502.03167.
- Izmailov et al., 2018
Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. ArXiv:1803.05407.
- Jacot et al., 2018
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: convergence and generalization in neural networks. Advances in Neural Information Processing Systems.
- Jaeger, 2002
Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the “echo state network” approach. GMD-Forschungszentrum Informationstechnik Bonn.
- Jamieson & Talwalkar, 2016
Jamieson, K., & Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. Proceedings of the 19th International Conference on Artificial Intelligence and Statistics.
- Jenatton et al., 2017
Jenatton, R., Archambeau, C., González, J., & Seeger, M. (2017). Bayesian optimization with tree-structured dependencies. Proceedings of the 34th International Conference on Machine Learning (ICML'17).
- Jia et al., 2018
Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., … et al. (2018). Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. ArXiv:1807.11205.
- Jia et al., 2014
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia (pp. 675–678).
- Joshi et al., 2020
Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). SpanBERT: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8, 64–77.
- Jouppi et al., 2017
Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., … et al. (2017). In-datacenter performance analysis of a tensor processing unit. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (pp. 1–12).
- Kalchbrenner et al., 2014
Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. ArXiv:1404.2188.
- Kalman & Kwasny, 1992
Kalman, B. L., & Kwasny, S. C. (1992). Why tanh: choosing a sigmoidal function. Proceedings of the International Joint Conference on Neural Networks (IJCNN) (pp. 578–581).
- Kaplan et al., 2020
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020). Scaling laws for neural language models. ArXiv:2001.08361.
- Karnin et al., 2013
Karnin, Z., Koren, T., & Somekh, O. (2013). Almost optimal exploration in multi-armed bandits. Proceedings of the 30th International Conference on Machine Learning (ICML'13).
- Karras et al., 2017
Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. ArXiv:1710.10196.
- Kim et al., 2017
Kim, J., El-Khamy, M., & Lee, J. (2017). Residual LSTM: design of a deep recurrent architecture for distant speech recognition. ArXiv:1701.03360.
- Kim, 2014
Kim, Y. (2014). Convolutional neural networks for sentence classification. ArXiv:1408.5882.
- Kimeldorf & Wahba, 1971
Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82–95.
- Kingma & Ba, 2014
Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. ArXiv:1412.6980.
- Kingma & Welling, 2014
Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR).
- Kipf & Welling, 2016
Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. ArXiv:1609.02907.
- Kojima et al., 2022
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. ArXiv:2205.11916.
- Koller & Friedman, 2009
Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
- Kolmogorov, 1933
Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell'Istituto Italiano degli Attuari, 4, 83–91.
- Kolter, 2008
Kolter, Z. (2008). Linear algebra review and reference. Available online: http://cs229.stanford.edu/section/cs229-linalg.pdf.
- Koren et al., 2009
Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
- Krizhevsky et al., 2012
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (pp. 1097–1105).
- Kung, 1988
Kung, S. Y. (1988). VLSI Array Processors. Prentice Hall.
- Kuzovkin et al., 2018
Kuzovkin, I., Vicente, R., Petton, M., Lachaux, J.-P., Baciu, M., Kahane, P., … Aru, J. (2018). Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications Biology, 1(1), 1–12.
- Lan et al., 2019
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: a lite BERT for self-supervised learning of language representations. ArXiv:1909.11942.
- Lavin & Gray, 2016
Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4013–4021).
- Le, 2013
Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 8595–8598).
- LeCun et al., 1995a
LeCun, Y., Bengio, Y., … et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks (p. 3361). MIT Press.
- LeCun et al., 1989
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
- LeCun et al., 1998a
LeCun, Y., Bottou, L., Orr, G., & Muller, K.-R. (1998). Efficient backprop. Neural Networks: Tricks of the Trade. Springer.
- LeCun et al., 1998b
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
- LeCun et al., 1995b
LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., … et al. (1995). Comparison of learning algorithms for handwritten digit recognition. International Conference on Artificial Neural Networks (pp. 53–60).
- Legendre, 1805
Legendre, A. M. (1805). Mémoire sur les Opérations Trigonométriques: dont les Résultats Dépendent de la Figure de la Terre. F. Didot.
- Lewis et al., 2019
Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … Zettlemoyer, L. (2019). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv:1910.13461.
- Lewkowycz et al., 2022
Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., … et al. (2022). Solving quantitative reasoning problems with language models. ArXiv:2206.14858.
- Li et al., 2018
Li, L., Jamieson, K., Rostamizadeh, A., Gonina, K., Hardt, M., Recht, B., & Talwalkar, A. (2018). Massively parallel hyperparameter tuning. ArXiv:1810.05934.
- Li, 2017
Li, M. (2017). Scaling Distributed Machine Learning with System and Algorithm Co-design (Doctoral dissertation). Carnegie Mellon University.
- Li et al., 2014a
Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., … Su, B.-Y. (2014). Scaling distributed machine learning with the parameter server. 11th Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 583–598).
- Li et al., 2014b
Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 661–670).
- Liaw et al., 2018
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J., & Stoica, I. (2018). Tune: a research platform for distributed model selection and training. ArXiv:1807.05118.
- Lin et al., 2013
Lin, M., Chen, Q., & Yan, S. (2013). Network in network. ArXiv:1312.4400.
- Lin et al., 2017a
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988).
- Lin et al., 2010
Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., … et al. (2010). ImageNet classification: fast descriptor coding and large-scale SVM training. Large Scale Visual Recognition Challenge.
- Lin et al., 2017b
Lin, Z., Feng, M., Santos, C. N. d., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A structured self-attentive sentence embedding. ArXiv:1703.03130.
- Lipton et al., 2015
Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. ArXiv:1506.00019.
- Lipton et al., 2016
Lipton, Z. C., Kale, D. C., Elkan, C., & Wetzel, R. (2016). Learning to diagnose with LSTM recurrent neural networks. International Conference on Learning Representations (ICLR).
- Lipton & Steinhardt, 2018
Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. ACM Queue, 17(1), 45–77.
- Liu & Nocedal, 1989
Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1), 503–528.
- Liu et al., 2018
Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: differentiable architecture search. ArXiv:1806.09055.
- Liu et al., 2016
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: single shot multibox detector. European Conference on Computer Vision (pp. 21–37).
- Liu et al., 2019
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). RoBERTa: a robustly optimized BERT pretraining approach. ArXiv:1907.11692.
- Liu et al., 2021
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … Guo, B. (2021). Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
- Liu et al., 2022
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. ArXiv:2201.03545.
- Long et al., 2015
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431–3440).
- Loshchilov & Hutter, 2016
Loshchilov, I., & Hutter, F. (2016). SGDR: stochastic gradient descent with warm restarts. ArXiv:1608.03983.
- Lowe, 2004
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
- Luo et al., 2018
Luo, P., Wang, X., Shao, W., & Peng, Z. (2018). Towards understanding regularization in batch normalization. ArXiv:1809.00846.
- Maas et al., 2011
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 142–150).
- Mack & Silverman, 1982
Mack, Y.-P., & Silverman, B. W. (1982). Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61(3), 405–415.
- MacKay, 2003
MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.
- Maclaurin et al., 2015
Maclaurin, D., Duvenaud, D., & Adams, R. (2015). Gradient-based hyperparameter optimization through reversible learning. Proceedings of the 32nd International Conference on Machine Learning (ICML'15).
- Mangasarian, 1965
Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Operations Research, 13, 444–452.
- Mangram, 2013
Mangram, M. E. (2013). A simplified perspective of the Markowitz portfolio theory. Global Journal of Business Research, 7(1), 59–70.
- Matthews et al., 2018
Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., & Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. ArXiv:1804.11271.
- McCann et al., 2017
McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in translation: Contextualized word vectors. Advances in Neural Information Processing Systems (pp. 6294–6305).
- McCulloch & Pitts, 1943
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.
- McMahan et al., 2013
McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … et al. (2013). Ad click prediction: a view from the trenches. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230).
- Mead, 1980
Mead, C. (1980). Introduction to VLSI systems. IEE Proceedings I-Solid-State and Electron Devices, 128(1), 18.
- Merity et al., 2016
Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. ArXiv:1609.07843.
- Micchelli, 1984
Micchelli, C. A. (1984). Interpolation of scattered data: distance matrices and conditionally positive definite functions. Approximation Theory and Spline Functions (pp. 143–145). Springer.
- Mikolov et al., 2013a
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. ArXiv:1301.3781.
- Mikolov et al., 2013b
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (pp. 3111–3119).
- Miller, 1995
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.
- Mirhoseini et al., 2017
Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., … Dean, J. (2017). Device placement optimization with reinforcement learning. Proceedings of the 34th International Conference on Machine Learning (pp. 2430–2439).
- Mnih et al., 2014
Mnih, V., Heess, N., Graves, A., … et al. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems (pp. 2204–2212).
- Mnih et al., 2013
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. ArXiv:1312.5602.
- Mnih et al., 2015
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
- Moon et al., 2010
Moon, T., Smola, A., Chang, Y., & Zheng, Z. (2010). IntervalRank: isotonic regression with listwise and pairwise constraints. Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (pp. 151–160).
- Morey et al., 2016
Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123.
- Morozov, 1984
Morozov, V. A. (1984). Methods for Solving Incorrectly Posed Problems. Springer.
- Nadaraya, 1964
Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & its Applications, 9(1), 141–142.
- Nair & Hinton, 2010
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807–814).
- Nakkiran et al., 2021
Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003.
- Naor & Reingold, 1999
Naor, M., & Reingold, O. (1999). On the construction of pseudorandom permutations: Luby–Rackoff revisited. Journal of Cryptology, 12(1), 29–66.
- Neal, 1996
Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
- Nesterov, 2018
Nesterov, Y. (2018). Lectures on Convex Optimization. Springer.
- Nesterov & Vial, 2000
Nesterov, Y., & Vial, J.-P. (2000). Confidence level solutions for stochastic programming. Automatica, 44(6), 1559–1568.
- Neyman, 1937
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767), 333–380.
- Norelli et al., 2022
Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., & Locatello, F. (2022). ASIF: coupled data turns unimodal models to multimodal without training. ArXiv:2210.01738.
- Novak et al., 2018
Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., … Sohl-Dickstein, J. (2018). Bayesian deep convolutional networks with many channels are Gaussian processes. ArXiv:1810.05148.
- Novikoff, 1962
Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata (pp. 615–622).
- Olshausen & Field, 1996
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
- Ong et al., 2005
Ong, C. S., Smola, A., & Williamson, R. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 1043–1071.
- OpenAI, 2023
OpenAI. (2023). GPT-4 Technical Report. ArXiv:2303.08774.
- Ouyang et al., 2022
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., … et al. (2022). Training language models to follow instructions with human feedback. ArXiv:2203.02155.
- Papineni et al., 2002
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318).
- Parikh et al., 2016
Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model for natural language inference. ArXiv:1606.01933.
- Park et al., 2019
Park, T., Liu, M.-Y., Wang, T.-C., & Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2337–2346).
- Parzen, 1957
Parzen, E. (1957). On consistent estimates of the spectrum of a stationary time series. Annals of Mathematical Statistics, 28, 329–348.
- Paszke et al., 2019
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … et al. (2019). PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 8026–8037.
- Paulus et al., 2017
Paulus, R., Xiong, C., & Socher, R. (2017). A deep reinforced model for abstractive summarization. ArXiv:1705.04304.
- Penedo et al., 2023
Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., … Launay, J. (2023). The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. ArXiv:2306.01116.
- Pennington et al., 2017
Pennington, J., Schoenholz, S., & Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in Neural Information Processing Systems (pp. 4785–4795).
- Pennington et al., 2014
Pennington, J., Socher, R., & Manning, C. (2014). GloVe: global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).
- Peters et al., 2017a
Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
- Peters et al., 2017b
Peters, M., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 1 (pp. 1756–1765).
- Peters et al., 2018
Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 2227–2237).
- Petersen & Pedersen, 2008
Petersen, K. B., & Pedersen, M. S. (2008). The Matrix Cookbook. Technical University of Denmark.
- Pleiss et al., 2017
Pleiss, G., Chen, D., Huang, G., Li, T., Van Der Maaten, L., & Weinberger, K. Q. (2017). Memory-efficient implementation of DenseNets. ArXiv:1707.06990.
- Polyak, 1964
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
- Prakash et al., 2016
Prakash, A., Hasan, S. A., Lee, K., Datla, V., Qadir, A., Liu, J., & Farri, O. (2016). Neural paraphrase generation with stacked residual LSTM networks. ArXiv:1610.03098.
- Qin et al., 2023
Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., & Yang, D. (2023). Is ChatGPT a general-purpose natural language processing task solver? ArXiv:2302.06476.
- Quadrana et al., 2018
Quadrana, M., Cremonesi, P., & Jannach, D. (2018). Sequence-aware recommender systems. ACM Computing Surveys, 51(4), 66.
- Quinlan, 1993
Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Elsevier.
- Rabiner & Juang, 1993
Rabiner, L., & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice-Hall.
- Radford et al., 2021
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … et al. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning (pp. 8748–8763).
- Radford et al., 2015
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. ArXiv:1511.06434.
- Radford et al., 2018
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.
- Radford et al., 2019
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
- Radosavovic et al., 2019
Radosavovic, I., Johnson, J., Xie, S., Lo, W.-Y., & Dollár, P. (2019). On network design spaces for visual recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1882–1890).
- Radosavovic et al., 2020
Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network design spaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).
- Rae et al., 2021
Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., … et al. (2021). Scaling language models: methods, analysis & insights from training Gopher. ArXiv:2112.11446.
- Raffel et al., 2020
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1–67.
- Rajpurkar et al., 2016
Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. ArXiv:1606.05250.
- Ramachandran et al., 2019
Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., & Shlens, J. (2019). Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 32.
- Ramachandran et al., 2017
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. ArXiv:1710.05941.
- Ramesh et al., 2022
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. ArXiv:2204.06125.
- Cajal & Azoulay, 1894
Ramón y Cajal, S., & Azoulay, L. (1894). Les Nouvelles Idées sur la Structure du Système Nerveux chez l'Homme et chez les Vertébrés. Paris: C. Reinwald & Cie.
- Ranzato et al., 2007
Ranzato, M.-A., Boureau, Y.-L., Chopra, S., & LeCun, Y. (2007). A unified energy-based framework for unsupervised learning. Artificial Intelligence and Statistics (pp. 371–379).
- Rasmussen & Williams, 2006
Rasmussen, C. E., & Williams, C. K. (2006). Gaussian Processes for Machine Learning. MIT Press.
- Reddi et al., 2019
Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of Adam and beyond. ArXiv:1904.09237.
- Redmon et al., 2016
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779–788).
- Redmon & Farhadi, 2018
Redmon, J., & Farhadi, A. (2018). YOLOv3: an incremental improvement. ArXiv:1804.02767.
- Reed & DeFreitas, 2015
Reed, S., & De Freitas, N. (2015). Neural programmer-interpreters. ArXiv:1511.06279.
- Reed et al., 2022
Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., … et al. (2022). A generalist agent. ArXiv:2205.06175.
- Ren et al., 2015
Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (pp. 91–99).
- Rendle, 2010
Rendle, S. (2010). Factorization machines. 2010 IEEE International Conference on Data Mining (pp. 995–1000).
- Rendle et al., 2009
Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). BPR: Bayesian personalized ranking from implicit feedback. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (pp. 452–461).
- Revels et al., 2016
Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-mode automatic differentiation in Julia. ArXiv:1607.07892.
- Rezende et al., 2014
Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (pp. 1278–1286).
- Riesenhuber & Poggio, 1999
Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.
- Rockafellar, 1970
Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press.
- Rolnick et al., 2017
Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. ArXiv:1705.10694.
- Rudin, 1973
Rudin, W. (1973). Functional Analysis. McGraw-Hill.
- Rumelhart et al., 1988
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5(3), 1.
- Russakovsky et al., 2013
Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013). Detecting avocados to zucchinis: what have we done, and where are we going? International Conference on Computer Vision (ICCV).
- Russakovsky et al., 2015
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
- Russell & Norvig, 2016
Russell, S. J., & Norvig, P. (2016). Artificial Intelligence: A Modern Approach. Pearson Education Limited.
- Saharia et al., 2022
Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., … et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. ArXiv:2205.11487.
- Salinas et al., 2022
Salinas, D., Seeger, M., Klein, A., Perrone, V., Wistuba, M., & Archambeau, C. (2022). Syne Tune: a library for large scale hyperparameter tuning and reproducible research. First Conference on Automated Machine Learning.
- Sanh et al., 2019
Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv:1910.01108.
- Sanh et al., 2021
Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., … et al. (2021). Multitask prompted training enables zero-shot task generalization. ArXiv:2110.08207.
- Santurkar et al., 2018
Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? Advances in Neural Information Processing Systems (pp. 2483–2493).
- Sarwar et al., 2001
Sarwar, B. M., Karypis, G., Konstan, J. A., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th International Conference on World Wide Web (pp. 285–295).
- Scao et al., 2022
Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., … et al. (2022). BLOOM: a 176B-parameter open-access multilingual language model. ArXiv:2211.05100.
- Schein et al., 2002
Schein, A. I., Popescul, A., Ungar, L. H., & Pennock, D. M. (2002). Methods and metrics for cold-start recommendations. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 253–260).
- Schuhmann et al., 2022
Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., … et al. (2022). LAION-5B: an open large-scale dataset for training next generation image-text models. ArXiv:2210.08402.
- Schuster & Paliwal, 1997
Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
- Scholkopf et al., 2001
Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In Helmbold, D. P., & Williamson, B. (Eds.), Proceedings of the Annual Conference on Computational Learning Theory (pp. 416–426). Springer-Verlag.
- Scholkopf et al., 1996
Schölkopf, B., Burges, C., & Vapnik, V. (1996). Incorporating invariances in support vector learning machines. International Conference on Artificial Neural Networks (pp. 47–52).
- Scholkopf & Smola, 2002
Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
- Sedhain et al., 2015
Sedhain, S., Menon, A. K., Sanner, S., & Xie, L. (2015). Autorec: autoencoders meet collaborative filtering. Proceedings of the 24th International Conference on World Wide Web (pp. 111–112).
- Sennrich et al., 2015
Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. ArXiv:1508.07909.
- Sergeev & Del Balso, 2018
Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in TensorFlow. ArXiv:1802.05799.
- Shannon, 1948
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.
- Shao et al., 2020
Shao, H., Yao, S., Sun, D., Zhang, A., Liu, S., Liu, D., … Abdelzaher, T. (2020). ControlVAE: controllable variational autoencoder. Proceedings of the 37th International Conference on Machine Learning.
- Shaw et al., 2018
Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations. ArXiv:1803.02155.
- Shoeybi et al., 2019
Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism. ArXiv:1909.08053.
- Silver et al., 2016
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
- Silverman, 1986
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.
- Simard et al., 1998
Simard, P. Y., LeCun, Y. A., Denker, J. S., & Victorri, B. (1998). Transformation invariance in pattern recognition – tangent distance and tangent propagation. Neural Networks: Tricks of the Trade (pp. 239–274). Springer.
- Simonyan & Zisserman, 2014
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. ArXiv:1409.1556.
- Sindhwani et al., 2015
Sindhwani, V., Sainath, T. N., & Kumar, S. (2015). Structured transforms for small-footprint deep learning. ArXiv:1510.01722.
- Sivic & Zisserman, 2003
Sivic, J., & Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. Proceedings of the IEEE International Conference on Computer Vision (pp. 1470–1477).
- Smith et al., 2022
Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., … et al. (2022). Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. ArXiv:2201.11990.
- Smola & Narayanamurthy, 2010
Smola, A., & Narayanamurthy, S. (2010). An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1–2), 703–710.
- Snoek et al., 2012
Snoek, J., Larochelle, H., & Adams, R. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems 25 (pp. 2951–2959).
- Sohl-Dickstein et al., 2015
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (pp. 2256–2265).
- Song & Ermon, 2019
Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.
- Song et al., 2021
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations.
- Speelpenning, 1980
Speelpenning, B. (1980). Compiling fast partial derivatives of functions given by algorithms (Doctoral dissertation). University of Illinois at Urbana-Champaign.
- Srivastava et al., 2022
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., … et al. (2022). Beyond the imitation game: quantifying and extrapolating the capabilities of language models. ArXiv:2206.04615.
- Srivastava et al., 2014
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
- Srivastava et al., 2015
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. ArXiv:1505.00387.
- Strang, 1993
Strang, G. (1993). Introduction to Linear Algebra. Wellesley–Cambridge Press.
- Su & Khoshgoftaar, 2009
Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009.
- Sukhbaatar et al., 2015
Sukhbaatar, S., Weston, J., & Fergus, R. (2015). End-to-end memory networks. Advances in Neural Information Processing Systems (pp. 2440–2448).
- Sutskever et al., 2013
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. International Conference on Machine Learning (pp. 1139–1147).
- Sutskever et al., 2014
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (pp. 3104–3112).
- Szegedy et al., 2017
Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. 31st AAAI Conference on Artificial Intelligence.
- Szegedy et al., 2015
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–9).
- Szegedy et al., 2016
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).
- Tallec & Ollivier, 2017
Tallec, C., & Ollivier, Y. (2017). Unbiasing truncated backpropagation through time. ArXiv:1705.08209.
- Tan & Le, 2019
Tan, M., & Le, Q. (2019). EfficientNet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning (pp. 6105–6114).
- Tang & Wang, 2018
Tang, J., & Wang, K. (2018). Personalized top-N sequential recommendation via convolutional sequence embedding. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 565–573).
- Taskar et al., 2004
Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. Advances in Neural Information Processing Systems, 16, 25.
- Tay et al., 2020
Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). Efficient transformers: a survey. ArXiv:2009.06732.
- Taylor et al., 2022
Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., … Stojnic, R. (2022). Galactica: a large language model for science. ArXiv:2211.09085.
- Teye et al., 2018
Teye, M., Azizpour, H., & Smith, K. (2018). Bayesian uncertainty estimation for batch normalized deep networks. ArXiv:1802.06455.
- Thomee et al., 2016
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., … Li, L.-J. (2016). YFCC100M: the new data in multimedia research. Communications of the ACM, 59(2), 64–73.
- Tieleman & Hinton, 2012
Tieleman, T., & Hinton, G. (2012). Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, Lecture 6.5: RMSProp.
- Tikhonov & Arsenin, 1977
Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. W.H. Winston.
- Tolstikhin et al., 2021
Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., … et al. (2021). MLP-Mixer: an all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34.
- Torralba et al., 2008
Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.
- Touvron et al., 2021
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning (pp. 10347–10357).
- Touvron et al., 2023a
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., … et al. (2023a). LLaMA: open and efficient foundation language models. ArXiv:2302.13971.
- Touvron et al., 2023b
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … et al. (2023b). LLaMA 2: open foundation and fine-tuned chat models. ArXiv:2307.09288.
- Tsoumakas & Katakis, 2007
Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: an overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.
- Turing, 1950
Turing, A. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460.
- Toscher et al., 2009
Töscher, A., Jahrer, M., & Bell, R. M. (2009). The BigChaos solution to the Netflix Grand Prize. Netflix Prize documentation.
- Uijlings et al., 2013
Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.
- Vapnik, 1995
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.
- Vapnik, 1998
Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley and Sons.
- Vapnik & Chervonenkis, 1964
Vapnik, V., & Chervonenkis, A. (1964). A note on one class of perceptrons. Automation and Remote Control, 25.
- Vapnik & Chervonenkis, 1968
Vapnik, V., & Chervonenkis, A. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181, 915–918.
- Vapnik & Chervonenkis, 1971
Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16(2), 264–281.
- Vapnik & Chervonenkis, 1981
Vapnik, V., & Chervonenkis, A. (1981). The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3), 543–564.
- Vapnik & Chervonenkis, 1991
Vapnik, V., & Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3), 283–305.
- Vapnik & Chervonenkis, 1974
Vapnik, V. N., & Chervonenkis, A. Y. (1974). Ordered risk minimization. Automation and Remote Control, 35, 1226–1235, 1403–1412.
- Vapnik, 1992
Vapnik, V. (1992). Principles of risk minimization for learning theory. Advances in Neural Information Processing Systems (pp. 831–838).
- Vapnik et al., 1994
Vapnik, V., Levin, E., & Le Cun, Y. (1994). Measuring the VC-dimension of a learning machine. Neural Computation, 6(5), 851–876.
- Vaswani et al., 2017
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (pp. 5998–6008).
- Wahba, 1990
Wahba, G. (1990). Spline Models for Observational Data. SIAM.
- Waibel et al., 1989
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339.
- Wang et al., 2022
Wang, H., Zhang, A., Zheng, S., Shi, X., Li, M., & Wang, Z. (2022). Removing batch normalization boosts adversarial training. International Conference on Machine Learning (pp. 23433–23445).
- Wang et al., 2018
Wang, L., Li, M., Liberty, E., & Smola, A. J. (2018). Optimal message scheduling for aggregation. Networks, 2(3), 2–3.
- Wang et al., 2019
Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., & Chao, L. S. (2019). Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 1810–1822).
- Wang et al., 2023
Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations.
- Wang et al., 2016
Wang, Y., Davidson, A., Pan, Y., Wu, Y., Riffel, A., & Owens, J. D. (2016). Gunrock: a high-performance graph processing library on the GPU. ACM SIGPLAN Notices (p. 11).
- Warstadt et al., 2019
Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7, 625–641.
- Wasserman, 2013
Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer.
- Watkins & Dayan, 1992
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
- Watson, 1964
Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, 26(4), 359–372.
- Wei et al., 2021
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., … Le, Q. V. (2021). Finetuned language models are zero-shot learners. ArXiv:2109.01652.
- Wei et al., 2022a
Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., … et al. (2022). Emergent abilities of large language models. ArXiv:2206.07682.
- Wei et al., 2022b
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. ArXiv:2201.11903.
- Welling & Teh, 2011
Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 681–688).
- Wengert, 1964
Wengert, R. E. (1964). A simple automatic derivative evaluation program. Communications of the ACM, 7(8), 463–464.
- Werbos, 1990
Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
- Wigner, 1958
Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of Mathematics, 67(2), 325–327.
- Wilson & Izmailov, 2020
Wilson, A. G., & Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 33, 4697–4708.
- Wistuba et al., 2019
Wistuba, M., Rawat, A., & Pedapati, T. (2019). A survey on neural architecture search. ArXiv:1905.01392.
- Wistuba et al., 2018
Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2018). Scalable Gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning, 108, 43–78.
- Wolpert & Macready, 1995
Wolpert, D. H., & Macready, W. G. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute.
- Wood et al., 2011
Wood, F., Gasthaus, J., Archambeau, C., James, L., & Teh, Y. W. (2011). The sequence memoizer. Communications of the ACM, 54(2), 91–98.
- Wu et al., 2018
Wu, B., Wan, A., Yue, X., Jin, P., Zhao, S., Golmant, N., … Keutzer, K. (2018). Shift: a zero FLOP, zero parameter alternative to spatial convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9127–9135).
- Wu et al., 2016
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. ArXiv:1609.08144.
- Xiao et al., 2017
Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv:1708.07747.
- Xiao et al., 2018
Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., & Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. International Conference on Machine Learning (pp. 5393–5402).
- Xie et al., 2017
Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1492–1500).
- Xiong et al., 2020
Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., … Liu, T. (2020). On layer normalization in the transformer architecture. International Conference on Machine Learning (pp. 10524–10533).
- Xiong et al., 2018
Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5934–5938).
- Yamaguchi et al., 1990
Yamaguchi, K., Sakamoto, K., Akabane, T., & Fujimoto, Y. (1990). A neural network for speaker-independent isolated word recognition. First International Conference on Spoken Language Processing.
- Yang et al., 2016
Yang, Z., Hu, Z., Deng, Y., Dyer, C., & Smola, A. (2016). Neural machine translation with recurrent attention modeling. ArXiv:1607.05108.
- Yang et al., 2015
Yang, Z., Moczulski, M., Denil, M., De Freitas, N., Smola, A., Song, L., & Wang, Z. (2015). Deep fried convnets. Proceedings of the IEEE International Conference on Computer Vision (pp. 1476–1483).
- Ye et al., 2011
Ye, M., Yin, P., Lee, W.-C., & Lee, D.-L. (2011). Exploiting geographical influence for collaborative point-of-interest recommendation. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 325–334).
- You et al., 2017
You, Y., Gitman, I., & Ginsburg, B. (2017). Large batch training of convolutional networks. ArXiv:1708.03888.
- Yu et al., 2022
Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., … Wu, Y. (2022). Scaling autoregressive models for content-rich text-to-image generation. ArXiv:2206.10789.
- Zaheer et al., 2018
Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Kumar, S. (2018). Adaptive methods for nonconvex optimization. Advances in Neural Information Processing Systems (pp. 9793–9803).
- Zeiler, 2012
Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. ArXiv:1212.5701.
- Zeiler & Fergus, 2013
Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. ArXiv:1301.3557.
- Zhang et al., 2021a
Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2021). Beyond fully-connected layers with quaternions: parameterization of hypercomplex multiplications with 1/n parameters. International Conference on Learning Representations.
- Zhang et al., 2021b
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.
- Zhang et al., 2019
Zhang, S., Yao, L., Sun, A., & Tay, Y. (2019). Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys, 52(1), 5.
- Zhang et al., 2022
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., … et al. (2022). OPT: open pre-trained transformer language models. ArXiv:2205.01068.
- Zhang et al., 1988
Zhang, W., Tanida, J., Itoh, K., & Ichioka, Y. (1988). Shift-invariant pattern recognition neural network and its optical architecture. Proceedings of the Annual Conference of the Japan Society of Applied Physics.
- Zhang et al., 2021c
Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., … Wang, X. (2021). ByteTrack: multi-object tracking by associating every detection box. ArXiv:2110.06864.
- Zhang et al., 2023a
Zhang, Z., Zhang, A., Li, M., & Smola, A. (2023). Automatic chain of thought prompting in large language models. International Conference on Learning Representations.
- Zhang et al., 2023b
Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., & Smola, A. (2023). Multimodal chain-of-thought reasoning in language models. ArXiv:2302.00923.
- Zhao et al., 2019
Zhao, Z.-Q., Zheng, P., Xu, S.-T., & Wu, X. (2019). Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212–3232.
- Zhou et al., 2023
Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., … Chi, E. (2023). Least-to-most prompting enables complex reasoning in large language models. International Conference on Learning Representations.
- Zhu et al., 2017
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision (pp. 2223–2232).
- Zhu et al., 2015
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE International Conference on Computer Vision (pp. 19–27).
- Zoph & Le, 2016
Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. ArXiv:1611.01578.