References

Abadi et al., 2016

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … et al. (2016). TensorFlow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265–283).

Abdel-Hamid et al., 2014

Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1533–1545.

Ahmed et al., 2012

Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., & Smola, A. J. (2012). Scalable inference in latent variable models. Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (pp. 123–132).

Akiba et al., 2019

Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: a next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.

Alayrac et al., 2022

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … et al. (2022). Flamingo: a visual language model for few-shot learning. ArXiv:2204.14198.

Alsallakh et al., 2020

Alsallakh, B., Kokhlikyan, N., Miglani, V., Yuan, J., & Reblitz-Richardson, O. (2020). Mind the PAD – CNNs can develop blind spots. ArXiv:2010.02178.

Anil et al., 2023

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., … et al. (2023). PaLM 2 Technical Report. ArXiv:2305.10403.

Anil et al., 2020

Anil, R., Gupta, V., Koren, T., Regan, K., & Singer, Y. (2020). Scalable second-order optimization for deep learning. ArXiv:2002.09018.

Aronszajn, 1950

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.

Ba et al., 2016

Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. ArXiv:1607.06450.

Baevski & Auli, 2018

Baevski, A., & Auli, M. (2018). Adaptive input representations for neural language modeling. International Conference on Learning Representations.

Bahdanau et al., 2014

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. ArXiv:1409.0473.

Bai et al., 2022

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … et al. (2022). Constitutional AI: harmlessness from AI feedback. ArXiv:2212.08073.

Baptista & Poloczek, 2018

Baptista, R., & Poloczek, M. (2018). Bayesian optimization of combinatorial structures. Proceedings of the 35th International Conference on Machine Learning.

Bardenet et al., 2013

Bardenet, R., Brendel, M., Kégl, B., & Sebag, M. (2013). Collaborative hyperparameter tuning. Proceedings of the 30th International Conference on Machine Learning (ICML'13).

Bay et al., 2006

Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. European Conference on Computer Vision (pp. 404–417).

Bellman, 1966

Bellman, R. (1966). Dynamic programming. Science, 153, 34–37.

Bellman, 1952

Bellman, R. (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38(8), 716–719.

Bellman, 1957a

Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 679–684. URL: http://www.jstor.org/stable/24900506

Bellman, 1957b

Bellman, R. (1957). Dynamic Programming. Dover Publications.

Beltagy et al., 2020

Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: the long-document transformer. ArXiv:2004.05150.

Bengio et al., 2003

Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137–1155.

Bengio et al., 1994

Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.

Bergstra et al., 2011

Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24.

Bergstra et al., 2010

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., … Bengio, Y. (2010). Theano: a CPU and GPU math compiler in Python. Proc. 9th Python in Science Conference (pp. 3–10).

Beutel et al., 2014

Beutel, A., Murray, K., Faloutsos, C., & Smola, A. J. (2014). CoBaFi: collaborative Bayesian filtering. Proceedings of the 23rd International Conference on World Wide Web (pp. 97–108).

Bishop, 1995

Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.

Bishop, 2006

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Black & Scholes, 1973

Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81, 637–654.

Bodla et al., 2017

Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS: improving object detection with one line of code. Proceedings of the IEEE International Conference on Computer Vision (pp. 5561–5569).

Bojanowski et al., 2017

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

Bollobas, 1999

Bollobás, B. (1999). Linear Analysis. Cambridge University Press.

Bommasani et al., 2021

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … et al. (2021). On the opportunities and risks of foundation models. ArXiv:2108.07258.

Bottou, 2010

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT'2010 (pp. 177–186). Springer.

Bottou & Le Cun, 1988

Bottou, L., & Le Cun, Y. (1988). SN: a simulator for connectionist models. Proceedings of NeuroNimes 88 (pp. 371–382). Nîmes, France. URL: http://leon.bottou.org/papers/bottou-lecun-88

Boucheron et al., 2005

Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.

Bowman et al., 2015

Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. ArXiv:1508.05326.

Boyd & Vandenberghe, 2004

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge, England: Cambridge University Press.

Bradley & Terry, 1952

Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.

Brown & Sandholm, 2017

Brown, N., & Sandholm, T. (2017). Libratus: the superhuman AI for no-limit poker. IJCAI (pp. 5226–5228).

Brown et al., 1990

Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., … Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79–85.

Brown et al., 1988

Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Mercer, R. L., & Roossin, P. (1988). A statistical approach to language translation. COLING Budapest 1988 Volume 1: International Conference on Computational Linguistics.

Brown et al., 2020

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.

Buslaev et al., 2020

Buslaev, A., Iglovikov, V. I., Khvedchenya, E., Parinov, A., Druzhinin, M., & Kalinin, A. A. (2020). Albumentations: Fast and flexible image augmentations. Information, 11(2), 125.

Campbell et al., 2002

Campbell, M., Hoane Jr, A. J., & Hsu, F.-h. (2002). Deep Blue. Artificial Intelligence, 134(1-2), 57–83.

Canny, 1987

Canny, J. (1987). A computational approach to edge detection. Readings in Computer Vision (pp. 184–203). Elsevier.

Cer et al., 2017

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 Task 1: semantic textual similarity multilingual and crosslingual focused evaluation. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14).

Chan et al., 2015

Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, attend and spell. ArXiv:1508.01211.

Chen et al., 2021

Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., … Mordatch, I. (2021). Decision transformer: reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 15084–15097.

Chen et al., 2015

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., … Zhang, Z. (2015). MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. ArXiv:1512.01274.

Cheng et al., 2016

Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 551–561).

Chetlur et al., 2014

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). cuDNN: efficient primitives for deep learning. ArXiv:1410.0759.

Cho et al., 2014a

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. ArXiv:1409.1259.

Cho et al., 2014b

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. ArXiv:1406.1078.

Chowdhery et al., 2022

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … et al. (2022). PaLM: scaling language modeling with pathways. ArXiv:2204.02311.

Chung et al., 2014

Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv:1412.3555.

Clark et al., 2020

Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: pre-training text encoders as discriminators rather than generators. International Conference on Learning Representations.

Collobert et al., 2011

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.

Cordonnier et al., 2020

Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020). On the relationship between self-attention and convolutional layers. International Conference on Learning Representations.

Cover & Thomas, 1999

Cover, T., & Thomas, J. (1999). Elements of Information Theory. John Wiley & Sons.

Csiszar, 2008

Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3), 261–273.

Cybenko, 1989

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.

Dalal & Triggs, 2005

Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (pp. 886–893).

DeCock, 2011

De Cock, D. (2011). Ames, Iowa: alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3).

Dean et al., 2012

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … et al. (2012). Large scale distributed deep networks. Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1 (pp. 1223–1231).

DeCandia et al., 2007

DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., … Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review (pp. 205–220).

Deng et al., 2009

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).

DerKiureghian & Ditlevsen, 2009

Der Kiureghian, A., & Ditlevsen, O. (2009). Aleatory or epistemic? does it matter? Structural Safety, 31(2), 105–112.

Devlin et al., 2018

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv:1810.04805.

Dinh et al., 2014

Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: non-linear independent components estimation. ArXiv:1410.8516.

Dinh et al., 2017

Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using real NVP. International Conference on Learning Representations.

Doersch et al., 2015

Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE International Conference on Computer Vision (pp. 1422–1430).

Dosovitskiy et al., 2021

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … et al. (2021). An image is worth 16 x 16 words: transformers for image recognition at scale. International Conference on Learning Representations.

Duchi et al., 2011

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.

Dumoulin & Visin, 2016

Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. ArXiv:1603.07285.

Dwivedi & Bresson, 2020

Dwivedi, V. P., & Bresson, X. (2020). A generalization of transformer networks to graphs. ArXiv:2012.09699.

Dwork et al., 2015

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. L. (2015). Preserving statistical validity in adaptive data analysis. Proceedings of the 47th Annual ACM Symposium on Theory of Computing (pp. 117–126).

Elman, 1990

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.

Elsken et al., 2018

Elsken, T., Metzen, J. H., & Hutter, F. (2018). Neural architecture search: a survey. ArXiv:1808.05377.

Fechner, 1860

Fechner, G. T. (1860). Elemente der Psychophysik. Vol. 2. Breitkopf u. Härtel.

Fedus et al., 2022

Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.

Fernando, 2004

Fernando, R. (2004). GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. Addison-Wesley.

Feurer & Hutter, 2018

Feurer, M., & Hutter, F. (2018). Hyperparameter optimization. Automatic Machine Learning: Methods, Systems, Challenges. Springer.

Feurer et al., 2022

Feurer, M., Letham, B., Hutter, F., & Bakshy, E. (2022). Practical transfer learning for Bayesian optimization. ArXiv:1802.02219.

Field, 1987

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. JOSA A, 4(12), 2379–2394.

Fisher, 1925

Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd.

Flammarion & Bach, 2015

Flammarion, N., & Bach, F. (2015). From averaging to acceleration, there is only a step-size. Conference on Learning Theory (pp. 658–695).

Forrester et al., 2007

Forrester, A. I., Sóbester, A., & Keane, A. J. (2007). Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088), 3251–3269.

Franceschi et al., 2017

Franceschi, L., Donini, M., Frasconi, P., & Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. Proceedings of the 34th International Conference on Machine Learning (ICML'17).

Frankle & Carbin, 2018

Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. ArXiv:1803.03635.

Frazier, 2018

Frazier, P. I. (2018). A tutorial on Bayesian optimization. ArXiv:1807.02811.

Freund & Schapire, 1996

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Proceedings of the International Conference on Machine Learning (pp. 148–156).

Friedman, 1987

Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82(397), 249–266.

Frostig et al., 2018

Frostig, R., Johnson, M. J., & Leary, C. (2018). Compiling machine learning programs via high-level tracing. Proceedings of Systems for Machine Learning.

Fukushima, 1982

Fukushima, K. (1982). Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. Competition and Cooperation in Neural Nets (pp. 267–285). Springer.

Gardner et al., 2018

Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., & Wilson, A. G. (2018). GPyTorch: blackbox matrix–matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems.

Garg et al., 2021

Garg, S., Balakrishnan, S., Kolter, Z., & Lipton, Z. (2021). RATT: leveraging unlabeled data to guarantee generalization. International Conference on Machine Learning (pp. 3598–3609).

Gatys et al., 2016

Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2414–2423).

Gauss, 1809

Gauss, C. F. (1809). Theoria motus corporum coelestium. Werke. Königlich Preussische Akademie der Wissenschaften.

Gibbs, 1902

Gibbs, J. W. (1902). Elementary Principles of Statistical Mechanics. Scribner's.

Ginibre, 1965

Ginibre, J. (1965). Statistical ensembles of complex, quaternion, and real matrices. Journal of Mathematical Physics, 6(3), 440–449.

Girshick, 2015

Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).

Girshick et al., 2014

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580–587).

Glorot & Bengio, 2010

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 249–256).

Goh, 2017

Goh, G. (2017). Why momentum really works. Distill. URL: http://distill.pub/2017/momentum

Goldberg et al., 1992

Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61–71.

Golub & VanLoan, 1996

Golub, G. H., & Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins University Press.

Goodfellow et al., 2016

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.

Goodfellow et al., 2014

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems (pp. 2672–2680).

Gotmare et al., 2018

Gotmare, A., Keskar, N. S., Xiong, C., & Socher, R. (2018). A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. ArXiv:1810.13243.

Goyal et al., 2021

Goyal, A., Bochkovskiy, A., Deng, J., & Koltun, V. (2021). Non-deep networks. ArXiv:2110.07641.

Graham, 2014

Graham, B. (2014). Fractional max-pooling. ArXiv:1412.6071.

Graves, 2013

Graves, A. (2013). Generating sequences with recurrent neural networks. ArXiv:1308.0850.

Graves et al., 2008

Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2008). A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855–868.

Graves & Schmidhuber, 2005

Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6), 602–610.

Griewank, 1989

Griewank, A. (1989). On automatic differentiation. Mathematical Programming: Recent Developments and Applications (pp. 83–107). Kluwer.

Gulati et al., 2020

Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., … et al. (2020). Conformer: convolution-augmented transformer for speech recognition. Proc. Interspeech 2020 (pp. 5036–5040).

Gunawardana & Shani, 2015

Gunawardana, A., & Shani, G. (2015). Evaluating recommender systems. Recommender Systems Handbook (pp. 265–308). Springer.

Guo et al., 2017

Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). DeepFM: a factorization-machine based neural network for CTR prediction. Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1725–1731).

Guyon et al., 2008

Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2008). Feature Extraction: Foundations and Applications. Springer.

Hadjis et al., 2016

Hadjis, S., Zhang, C., Mitliagkas, I., Iter, D., & Ré, C. (2016). Omnivore: an optimizer for multi-device deep learning on CPUs and GPUs. ArXiv:1606.04487.

Hartley & Zisserman, 2000

Hartley, R., & Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.

Hartley & Kahl, 2009

Hartley, R. I., & Kahl, F. (2009). Global optimization through rotation space search. International Journal of Computer Vision, 82(1), 64–79.

He et al., 2022

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).

He et al., 2017a

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (pp. 2961–2969).

He et al., 2015

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (pp. 1026–1034).

He et al., 2016a

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).

He et al., 2016b

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. European Conference on Computer Vision (pp. 630–645).

He & Chua, 2017

He, X., & Chua, T.-S. (2017). Neural factorization machines for sparse predictive analytics. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 355–364).

He et al., 2017b

He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering. Proceedings of the 26th International Conference on World Wide Web (pp. 173–182).

Hebb, 1949

Hebb, D. O. (1949). The Organization of Behavior. Wiley.

Hendrycks & Gimpel, 2016

Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). ArXiv:1606.08415.

Hennessy & Patterson, 2011

Hennessy, J. L., & Patterson, D. A. (2011). Computer Architecture: A Quantitative Approach. Elsevier.

Herlocker et al., 1999

Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. 22nd Annual International ACM Conference on Research and Development in Information Retrieval, SIGIR 1999 (pp. 230–237).

Hidasi et al., 2015

Hidasi, B., Karatzoglou, A., Baltrunas, L., & Tikk, D. (2015). Session-based recommendations with recurrent neural networks. ArXiv:1511.06939.

Ho et al., 2020

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.

Hochreiter et al., 2001

Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.

Hochreiter & Schmidhuber, 1997

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Hoffmann et al., 2022

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … et al. (2022). Training compute-optimal large language models. ArXiv:2203.15556.

Howard et al., 2019

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., … Adam, H. (2019). Searching for MobileNetV3. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1314–1324).

Hoyer et al., 2009

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems (pp. 689–696).

Hu et al., 2018

Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132–7141).

Hu et al., 2008

Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. 2008 8th IEEE International Conference on Data Mining (pp. 263–272).

Hu et al., 2022

Hu, Z., Lee, R. K.-W., Aggarwal, C. C., & Zhang, A. (2022). Text style transfer: a review and experimental evaluation. ACM SIGKDD Explorations Newsletter, 24(1). URL: https://doi.org/10.1145/3544903.3544906

Huang et al., 2018

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., … Eck, D. (2018). Music transformer: generating music with long-term structure. International Conference on Learning Representations.

Huang et al., 2017

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700–4708).

Huang et al., 2015

Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM–CRF models for sequence tagging. ArXiv:1508.01991.

Hubel & Wiesel, 1959

Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148(3), 574–591.

Hubel & Wiesel, 1962

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160(1), 106–154.

Hubel & Wiesel, 1968

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195(1), 215–243.

Hutter et al., 2011

Hutter, F., Hoos, H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11).

Hutter et al., 2019

Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.) (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer.

Ioffe, 2017

Ioffe, S. (2017). Batch renormalization: towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing Systems (pp. 1945–1953).

Ioffe & Szegedy, 2015

Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv:1502.03167.

Izmailov et al., 2018

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. ArXiv:1803.05407.

Jacot et al., 2018

Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: convergence and generalization in neural networks. Advances in Neural Information Processing Systems.

Jaeger, 2002

Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the “echo state network” approach. GMD-Forschungszentrum Informationstechnik Bonn.

Jamieson & Talwalkar, 2016

Jamieson, K., & Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics.

Jenatton et al., 2017

Jenatton, R., Archambeau, C., González, J., & Seeger, M. (2017). Bayesian optimization with tree-structured dependencies. Proceedings of the 34th International Conference on Machine Learning (ICML'17).

Jia et al., 2018

Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., … et al. (2018). Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. ArXiv:1807.11205.

Jia et al., 2014

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia (pp. 675–678).

Joshi et al., 2020

Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). SpanBERT: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8, 64–77.

Jouppi et al., 2017

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., … et al. (2017). In-datacenter performance analysis of a tensor processing unit. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (pp. 1–12).

Kalchbrenner et al., 2014

Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. ArXiv:1404.2188.

Kalman & Kwasny, 1992

Kalman, B. L., & Kwasny, S. C. (1992). Why tanh: choosing a sigmoidal function. Proceedings of the International Joint Conference on Neural Networks (IJCNN) (pp. 578–581).

Kaplan et al., 2020

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020). Scaling laws for neural language models. ArXiv:2001.08361.

Karnin et al., 2013

Karnin, Z., Koren, T., & Somekh, O. (2013). Almost optimal exploration in multi-armed bandits. Proceedings of the 30th International Conference on Machine Learning (ICML'13).

Karras et al., 2017

Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. ArXiv:1710.10196.

Kim et al., 2017

Kim, J., El-Khamy, M., & Lee, J. (2017). Residual LSTM: design of a deep recurrent architecture for distant speech recognition. ArXiv:1701.03360.

Kim, 2014

Kim, Y. (2014). Convolutional neural networks for sentence classification. ArXiv:1408.5882.

Kimeldorf & Wahba, 1971

Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33, 82–95.

Kingma & Ba, 2014

Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. ArXiv:1412.6980.

Kingma & Welling, 2014

Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR).

Kipf & Welling, 2016

Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. ArXiv:1609.02907.

Kojima et al., 2022

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. ArXiv:2205.11916.

Koller & Friedman, 2009

Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Kolmogorov, 1933

Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., 4, 83–91.

Kolter, 2008

Kolter, Z. (2008). Linear algebra review and reference. Available online: http://cs229.stanford.edu/section/cs229-linalg.pdf.

Koren et al., 2009

Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.

Krizhevsky et al., 2012

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (pp. 1097–1105).

Kung, 1988

Kung, S. Y. (1988). VLSI Array Processors. Prentice Hall.

Kuzovkin et al., 2018

Kuzovkin, I., Vicente, R., Petton, M., Lachaux, J.-P., Baciu, M., Kahane, P., … Aru, J. (2018). Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications Biology, 1(1), 1–12.

Lan et al., 2019

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: a lite BERT for self-supervised learning of language representations. ArXiv:1909.11942.

Lavin & Gray, 2016

Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4013–4021).

Le, 2013

Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 8595–8598).

LeCun et al., 1995a

LeCun, Y., Bengio, Y., & et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks (p. 3361). MIT Press.

LeCun et al., 1989

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.

LeCun et al., 1998a

LeCun, Y., Bottou, L., Orr, G., & Muller, K.-R. (1998). Efficient backprop. Neural Networks: Tricks of the Trade. Springer.

LeCun et al., 1998b

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

LeCun et al., 1995b

LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., … et al. (1995). Comparison of learning algorithms for handwritten digit recognition. International Conference on Artificial Neural Networks (pp. 53–60).

Legendre, 1805

Legendre, A. M. (1805). Mémoire sur les Opérations Trigonométriques: dont les Résultats Dépendent de la Figure de la Terre. F. Didot.

Lewis et al., 2019

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … Zettlemoyer, L. (2019). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv:1910.13461.

Lewkowycz et al., 2022

Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., … et al. (2022). Solving quantitative reasoning problems with language models. ArXiv:2206.14858.

Li et al., 2018

Li, L., Jamieson, K., Rostamizadeh, A., Gonina, K., Hardt, M., Recht, B., & Talwalkar, A. (2018). Massively parallel hyperparameter tuning. ArXiv:1810.05934.

Li, 2017

Li, M. (2017). Scaling Distributed Machine Learning with System and Algorithm Co-design (Doctoral dissertation, Carnegie Mellon University).

Li et al., 2014a

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., … Su, B.-Y. (2014). Scaling distributed machine learning with the parameter server. 11th Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 583–598).

Li et al., 2014b

Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 661–670).

Liaw et al., 2018

Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J., & Stoica, I. (2018). Tune: a research platform for distributed model selection and training. ArXiv:1807.05118.

Lin et al., 2013

Lin, M., Chen, Q., & Yan, S. (2013). Network in network. ArXiv:1312.4400.

Lin et al., 2017a

Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988).

Lin et al., 2010

Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., … et al. (2010). ImageNet classification: fast descriptor coding and large-scale SVM training. Large Scale Visual Recognition Challenge.

Lin et al., 2017b

Lin, Z., Feng, M., Santos, C. N. d., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A structured self-attentive sentence embedding. ArXiv:1703.03130.

Lipton et al., 2015

Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. ArXiv:1506.00019.

Lipton et al., 2016

Lipton, Z. C., Kale, D. C., Elkan, C., & Wetzel, R. (2016). Learning to diagnose with LSTM recurrent neural networks. International Conference on Learning Representations (ICLR).

Lipton & Steinhardt, 2018

Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. Communications of the ACM, 17, 45–77.

Liu & Nocedal, 1989

Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1), 503–528.

Liu et al., 2018

Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: differentiable architecture search. ArXiv:1806.09055.

Liu et al., 2016

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: single shot multibox detector. European Conference on Computer Vision (pp. 21–37).

Liu et al., 2019

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). RoBERTa: a robustly optimized BERT pretraining approach. ArXiv:1907.11692.

Liu et al., 2021

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … Guo, B. (2021). Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).

Liu et al., 2022

Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. ArXiv:2201.03545.

Long et al., 2015

Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431–3440).

Loshchilov & Hutter, 2016

Loshchilov, I., & Hutter, F. (2016). SGDR: stochastic gradient descent with warm restarts. ArXiv:1608.03983.

Lowe, 2004

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

Luo et al., 2018

Luo, P., Wang, X., Shao, W., & Peng, Z. (2018). Towards understanding regularization in batch normalization. ArXiv:1809.00846.

Maas et al., 2011

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 142–150).

Mack & Silverman, 1982

Mack, Y.-P., & Silverman, B. W. (1982). Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61(3), 405–415.

MacKay, 2003

MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Maclaurin et al., 2015

Maclaurin, D., Duvenaud, D., & Adams, R. (2015). Gradient-based hyperparameter optimization through reversible learning. Proceedings of the 32nd International Conference on Machine Learning (ICML'15).

Mangasarian, 1965

Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Operations Research, 13, 444–452.

Mangram, 2013

Mangram, M. E. (2013). A simplified perspective of the Markowitz portfolio theory. Global Journal of Business Research, 7(1), 59–70.

Matthews et al., 2018

Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., & Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. ArXiv:1804.11271.

McCann et al., 2017

McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in translation: Contextualized word vectors. Advances in Neural Information Processing Systems (pp. 6294–6305).

McCulloch & Pitts, 1943

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.

McMahan et al., 2013

McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … et al. (2013). Ad click prediction: a view from the trenches. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230).

Mead, 1980

Mead, C. (1980). Introduction to VLSI systems. IEE Proceedings I-Solid-State and Electron Devices, 128(1), 18.

Merity et al., 2016

Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. ArXiv:1609.07843.

Micchelli, 1984

Micchelli, C. A. (1984). Interpolation of scattered data: distance matrices and conditionally positive definite functions. Approximation Theory and Spline Functions (pp. 143–145). Springer.

Mikolov et al., 2013a

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. ArXiv:1301.3781.

Mikolov et al., 2013b

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (pp. 3111–3119).

Miller, 1995

Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.

Mirhoseini et al., 2017

Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., … Dean, J. (2017). Device placement optimization with reinforcement learning. Proceedings of the 34th International Conference on Machine Learning (pp. 2430–2439).

Mnih et al., 2014

Mnih, V., Heess, N., Graves, A., … et al. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems (pp. 2204–2212).

Mnih et al., 2013

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. ArXiv:1312.5602.

Mnih et al., 2015

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.

Moon et al., 2010

Moon, T., Smola, A., Chang, Y., & Zheng, Z. (2010). IntervalRank: isotonic regression with listwise and pairwise constraints. Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (pp. 151–160).

Morey et al., 2016

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123.

Morozov, 1984

Morozov, V. A. (1984). Methods for Solving Incorrectly Posed Problems. Springer.

Nadaraya, 1964

Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & its Applications, 9(1), 141–142.

Nair & Hinton, 2010

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. ICML.

Nakkiran et al., 2021

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003.

Naor & Reingold, 1999

Naor, M., & Reingold, O. (1999). On the construction of pseudorandom permutations: Luby–Rackoff revisited. Journal of Cryptology, 12(1), 29–66.

Neal, 1996

Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.

Nesterov, 2018

Nesterov, Y. (2018). Lectures on Convex Optimization. Springer.

Nesterov & Vial, 2000

Nesterov, Y., & Vial, J.-P. (2000). Confidence level solutions for stochastic programming. Automatica, 44(6), 1559–1568.

Neyman, 1937

Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767), 333–380.

Norelli et al., 2022

Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., & Locatello, F. (2022). ASIF: coupled data turns unimodal models to multimodal without training. ArXiv:2210.01738.

Novak et al., 2018

Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., … Sohl-Dickstein, J. (2018). Bayesian deep convolutional networks with many channels are Gaussian processes. ArXiv:1810.05148.

Novikoff, 1962

Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata (pp. 615–622).

Olshausen & Field, 1996

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.

Ong et al., 2005

Ong, C. S., Smola, A., & Williamson, R. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 1043–1071.

OpenAI, 2023

OpenAI. (2023). GPT-4 Technical Report. ArXiv:2303.08774.

Ouyang et al., 2022

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., … et al. (2022). Training language models to follow instructions with human feedback. ArXiv:2203.02155.

Papineni et al., 2002

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318).

Parikh et al., 2016

Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model for natural language inference. ArXiv:1606.01933.

Park et al., 2019

Park, T., Liu, M.-Y., Wang, T.-C., & Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2337–2346).

Parzen, 1957

Parzen, E. (1957). On consistent estimates of the spectrum of a stationary time series. Annals of Mathematical Statistics, 28, 329–348.

Paszke et al., 2019

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … et al. (2019). PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 8026–8037.

Paulus et al., 2017

Paulus, R., Xiong, C., & Socher, R. (2017). A deep reinforced model for abstractive summarization. ArXiv:1705.04304.

Penedo et al., 2023

Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., … Launay, J. (2023). The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. ArXiv:2306.01116.

Pennington et al., 2017

Pennington, J., Schoenholz, S., & Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in Neural Information Processing Systems (pp. 4785–4795).

Pennington et al., 2014

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).

Peters et al., 2017a

Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.

Peters et al., 2017b

Peters, M., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 1 (pp. 1756–1765).

Peters et al., 2018

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 2227–2237).

Petersen & Pedersen, 2008

Petersen, K. B., & Pedersen, M. S. (2008). The Matrix Cookbook. Technical University of Denmark.

Pleiss et al., 2017

Pleiss, G., Chen, D., Huang, G., Li, T., Van Der Maaten, L., & Weinberger, K. Q. (2017). Memory-efficient implementation of DenseNets. ArXiv:1707.06990.

Polyak, 1964

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.

Prakash et al., 2016

Prakash, A., Hasan, S. A., Lee, K., Datla, V., Qadir, A., Liu, J., & Farri, O. (2016). Neural paraphrase generation with stacked residual LSTM networks. ArXiv:1610.03098.

Qin et al., 2023

Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., & Yang, D. (2023). Is ChatGPT a general-purpose natural language processing task solver? ArXiv:2302.06476.

Quadrana et al., 2018

Quadrana, M., Cremonesi, P., & Jannach, D. (2018). Sequence-aware recommender systems. ACM Computing Surveys, 51(4), 66.

Quinlan, 1993

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Elsevier.

Rabiner & Juang, 1993

Rabiner, L., & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice-Hall.

Radford et al., 2021

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … et al. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning (pp. 8748–8763).

Radford et al., 2015

Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. ArXiv:1511.06434.

Radford et al., 2018

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.

Radford et al., 2019

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Radosavovic et al., 2019

Radosavovic, I., Johnson, J., Xie, S., Lo, W.-Y., & Dollár, P. (2019). On network design spaces for visual recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1882–1890).

Radosavovic et al., 2020

Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network design spaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).

Rae et al., 2021

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., … et al. (2021). Scaling language models: methods, analysis & insights from training gopher. ArXiv:2112.11446.

Raffel et al., 2020

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1–67.

Rajpurkar et al., 2016

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. ArXiv:1606.05250.

Ramachandran et al., 2019

Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., & Shlens, J. (2019). Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 32.

Ramachandran et al., 2017

Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. ArXiv:1710.05941.

Ramesh et al., 2022

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. ArXiv:2204.06125.

Cajal & Azoulay, 1894

Ramón y Cajal, S., & Azoulay, L. (1894). Les Nouvelles Idées sur la Structure du Système Nerveux chez l'Homme et chez les Vertébrés. Paris: C. Reinwald & Cie.

Ranzato et al., 2007

Ranzato, M.-A., Boureau, Y.-L., Chopra, S., & LeCun, Y. (2007). A unified energy-based framework for unsupervised learning. Artificial Intelligence and Statistics (pp. 371–379).

Rasmussen & Williams, 2006

Rasmussen, C. E., & Williams, C. K. (2006). Gaussian Processes for Machine Learning. MIT Press.

Reddi et al., 2019

Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of Adam and beyond. ArXiv:1904.09237.

Redmon et al., 2016

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779–788).

Redmon & Farhadi, 2018

Redmon, J., & Farhadi, A. (2018). YOLOv3: an incremental improvement. ArXiv:1804.02767.

Reed & DeFreitas, 2015

Reed, S., & De Freitas, N. (2015). Neural programmer-interpreters. ArXiv:1511.06279.

Reed et al., 2022

Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., … et al. (2022). A generalist agent. ArXiv:2205.06175.

Ren et al., 2015

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (pp. 91–99).

Rendle, 2010

Rendle, S. (2010). Factorization machines. 2010 IEEE International Conference on Data Mining (pp. 995–1000).

Rendle et al., 2009

Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). BPR: Bayesian personalized ranking from implicit feedback. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (pp. 452–461).

Revels et al., 2016

Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-mode automatic differentiation in Julia. ArXiv:1607.07892.

Rezende et al., 2014

Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (pp. 1278–1286).

Riesenhuber & Poggio, 1999

Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.

Rockafellar, 1970

Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press.

Rolnick et al., 2017

Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. ArXiv:1705.10694.

Rudin, 1973

Rudin, W. (1973). Functional Analysis. McGraw-Hill.

Rumelhart et al., 1988

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5(3), 1.

Russakovsky et al., 2013

Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013). Detecting avocados to zucchinis: what have we done, and where are we going? International Conference on Computer Vision (ICCV).

Russakovsky et al., 2015

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.

Russell & Norvig, 2016

Russell, S. J., & Norvig, P. (2016). Artificial Intelligence: A Modern Approach. Pearson Education Limited.

Saharia et al., 2022

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., … et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. ArXiv:2205.11487.

Salinas et al., 2022

Salinas, D., Seeger, M., Klein, A., Perrone, V., Wistuba, M., & Archambeau, C. (2022). Syne Tune: a library for large scale hyperparameter tuning and reproducible research. First Conference on Automated Machine Learning.

Sanh et al., 2019

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv:1910.01108.

Sanh et al., 2021

Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., … et al. (2021). Multitask prompted training enables zero-shot task generalization. ArXiv:2110.08207.

Santurkar et al., 2018

Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? Advances in Neural Information Processing Systems (pp. 2483–2493).

Sarwar et al., 2001

Sarwar, B. M., Karypis, G., Konstan, J. A., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of 10th International Conference on World Wide Web (pp. 285–295).

Scao et al., 2022

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., … et al. (2022). BLOOM: a 176B-parameter open-access multilingual language model. ArXiv:2211.05100.

Schein et al., 2002

Schein, A. I., Popescul, A., Ungar, L. H., & Pennock, D. M. (2002). Methods and metrics for cold-start recommendations. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 253–260).

Schuhmann et al., 2022

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., … et al. (2022). LAION-5B: an open large-scale dataset for training next generation image-text models. ArXiv:2210.08402.

Schuster & Paliwal, 1997

Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.

Scholkopf et al., 2001

Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In D. P. Helmbold & B. Williamson (Eds.), Proceedings of the Annual Conference on Computational Learning Theory (pp. 416–426). Springer-Verlag.

Scholkopf et al., 1996

Schölkopf, B., Burges, C., & Vapnik, V. (1996). Incorporating invariances in support vector learning machines. International Conference on Artificial Neural Networks (pp. 47–52).

Scholkopf & Smola, 2002

Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.

Sedhain et al., 2015

Sedhain, S., Menon, A. K., Sanner, S., & Xie, L. (2015). AutoRec: autoencoders meet collaborative filtering. Proceedings of the 24th International Conference on World Wide Web (pp. 111–112).

Sennrich et al., 2015

Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. ArXiv:1508.07909.

Sergeev & DelBalso, 2018

Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in TensorFlow. ArXiv:1802.05799.

Shannon, 1948

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.

Shao et al., 2020

Shao, H., Yao, S., Sun, D., Zhang, A., Liu, S., Liu, D., … Abdelzaher, T. (2020). ControlVAE: controllable variational autoencoder. Proceedings of the 37th International Conference on Machine Learning.

Shaw et al., 2018

Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations. ArXiv:1803.02155.

Shoeybi et al., 2019

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism. ArXiv:1909.08053.

Silver et al., 2016

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.

Silverman, 1986

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall.

Simard et al., 1998

Simard, P. Y., LeCun, Y. A., Denker, J. S., & Victorri, B. (1998). Transformation invariance in pattern recognition – tangent distance and tangent propagation. Neural Networks: Tricks of the Trade (pp. 239–274). Springer.

Simonyan & Zisserman, 2014

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. ArXiv:1409.1556.

Sindhwani et al., 2015

Sindhwani, V., Sainath, T. N., & Kumar, S. (2015). Structured transforms for small-footprint deep learning. ArXiv:1510.01722.

Sivic & Zisserman, 2003

Sivic, J., & Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. Proceedings of the IEEE International Conference on Computer Vision (pp. 1470–1477).

Smith et al., 2022

Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., … et al. (2022). Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. ArXiv:2201.11990.

Smola & Narayanamurthy, 2010

Smola, A., & Narayanamurthy, S. (2010). An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2), 703–710.

Snoek et al., 2012

Snoek, J., Larochelle, H., & Adams, R. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems 25 (pp. 2951–2959).

Sohl-Dickstein et al., 2015

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (pp. 2256–2265).

Song & Ermon, 2019

Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.

Song et al., 2021

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations.

Speelpenning, 1980

Speelpenning, B. (1980). Compiling fast partial derivatives of functions given by algorithms (Doctoral dissertation). University of Illinois at Urbana-Champaign.

Srivastava et al., 2022

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., … et al. (2022). Beyond the imitation game: quantifying and extrapolating the capabilities of language models. ArXiv:2206.04615.

Srivastava et al., 2014

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.

Srivastava et al., 2015

Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. ArXiv:1505.00387.

Strang, 1993

Strang, G. (1993). Introduction to Linear Algebra. Wellesley–Cambridge Press.

Su & Khoshgoftaar, 2009

Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009.

Sukhbaatar et al., 2015

Sukhbaatar, S., Weston, J., & Fergus, R. (2015). End-to-end memory networks. Advances in Neural Information Processing Systems (pp. 2440–2448).

Sutskever et al., 2013

Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. International Conference on Machine Learning (pp. 1139–1147).

Sutskever et al., 2014

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (pp. 3104–3112).

Szegedy et al., 2017

Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. 31st AAAI Conference on Artificial Intelligence.

Szegedy et al., 2015

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–9).

Szegedy et al., 2016

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).

Tallec & Ollivier, 2017

Tallec, C., & Ollivier, Y. (2017). Unbiasing truncated backpropagation through time. ArXiv:1705.08209.

Tan & Le, 2019

Tan, M., & Le, Q. (2019). EfficientNet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning (pp. 6105–6114).

Tang & Wang, 2018

Tang, J., & Wang, K. (2018). Personalized top-n sequential recommendation via convolutional sequence embedding. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 565–573).

Taskar et al., 2004

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. Advances in Neural Information Processing Systems, 16, 25.

Tay et al., 2020

Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). Efficient transformers: a survey. ArXiv:2009.06732.

Taylor et al., 2022

Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., … Stojnic, R. (2022). Galactica: a large language model for science. ArXiv:2211.09085.

Teye et al., 2018

Teye, M., Azizpour, H., & Smith, K. (2018). Bayesian uncertainty estimation for batch normalized deep networks. ArXiv:1802.06455.

Thomee et al., 2016

Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., … Li, L.-J. (2016). YFCC100M: the new data in multimedia research. Communications of the ACM, 59(2), 64–73.

Tieleman & Hinton, 2012

Tieleman, T., & Hinton, G. (2012). Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, Lecture 6.5-rmsprop.

Tikhonov & Arsenin, 1977

Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. W.H. Winston.

Tolstikhin et al., 2021

Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., … et al. (2021). MLP-mixer: an all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34.

Torralba et al., 2008

Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.

Touvron et al., 2021

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning (pp. 10347–10357).

Touvron et al., 2023a

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., … et al. (2023a). LLaMA: open and efficient foundation language models. ArXiv:2302.13971.

Touvron et al., 2023b

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … et al. (2023b). LLaMA 2: open foundation and fine-tuned chat models. ArXiv:2307.09288.

Tsoumakas & Katakis, 2007

Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: an overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.

Turing, 1950

Turing, A. (1950). Computing machinery and intelligence. Mind, 59(236), 433–460.

Toscher et al., 2009

Töscher, A., Jahrer, M., & Bell, R. M. (2009). The BigChaos solution to the Netflix grand prize.

Uijlings et al., 2013

Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.

Vapnik, 1995

Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.

Vapnik, 1998

Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley and Sons.

Vapnik & Chervonenkis, 1964

Vapnik, V., & Chervonenkis, A. (1964). A note on one class of perceptrons. Automation and Remote Control, 25.

Vapnik & Chervonenkis, 1968

Vapnik, V., & Chervonenkis, A. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181, 915–918.

Vapnik & Chervonenkis, 1971

Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16(2), 264–281.

Vapnik & Chervonenkis, 1981

Vapnik, V., & Chervonenkis, A. (1981). The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3), 543–564.

Vapnik & Chervonenkis, 1991

Vapnik, V., & Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3), 283–305.

Vapnik & Chervonenkis, 1974

Vapnik, V. N., & Chervonenkis, A. Y. (1974). Ordered risk minimization. Automation and Remote Control, 35, 1226–1235, 1403–1412.

Vapnik, 1992

Vapnik, V. (1992). Principles of risk minimization for learning theory. Advances in Neural Information Processing Systems (pp. 831–838).

Vapnik et al., 1994

Vapnik, V., Levin, E., & Le Cun, Y. (1994). Measuring the VC-dimension of a learning machine. Neural Computation, 6(5), 851–876.

Vaswani et al., 2017

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (pp. 5998–6008).

Wahba, 1990

Wahba, G. (1990). Spline Models for Observational Data. SIAM.

Waibel et al., 1989

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339.

Wang et al., 2022

Wang, H., Zhang, A., Zheng, S., Shi, X., Li, M., & Wang, Z. (2022). Removing batch normalization boosts adversarial training. International Conference on Machine Learning (pp. 23433–23445).

Wang et al., 2018

Wang, L., Li, M., Liberty, E., & Smola, A. J. (2018). Optimal message scheduling for aggregation. Networks, 2(3), 2–3.

Wang et al., 2019

Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., & Chao, L. S. (2019). Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 1810–1822).

Wang et al., 2023

Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations.

Wang et al., 2016

Wang, Y., Davidson, A., Pan, Y., Wu, Y., Riffel, A., & Owens, J. D. (2016). Gunrock: a high-performance graph processing library on the GPU. ACM SIGPLAN Notices (p. 11).

Warstadt et al., 2019

Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7, 625–641.

Wasserman, 2013

Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer.

Watkins & Dayan, 1992

Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.

Watson, 1964

Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pp. 359–372.

Wei et al., 2021

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., … Le, Q. V. (2021). Finetuned language models are zero-shot learners. ArXiv:2109.01652.

Wei et al., 2022a

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., … et al. (2022). Emergent abilities of large language models. ArXiv:2206.07682.

Wei et al., 2022b

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. ArXiv:2201.11903.

Welling & Teh, 2011

Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 681–688).

Wengert, 1964

Wengert, R. E. (1964). A simple automatic derivative evaluation program. Communications of the ACM, 7(8), 463–464.

Werbos, 1990

Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.

Wigner, 1958

Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Annals of Mathematics (pp. 325–327).

Wilson & Izmailov, 2020

Wilson, A. G., & Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 33, 4697–4708.

Wistuba et al., 2019

Wistuba, M., Rawat, A., & Pedapati, T. (2019). A survey on neural architecture search. ArXiv:1905.01392.

Wistuba et al., 2018

Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2018). Scalable Gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning, 108, 43–78.

Wolpert & Macready, 1995

Wolpert, D. H., & Macready, W. G. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute.

Wood et al., 2011

Wood, F., Gasthaus, J., Archambeau, C., James, L., & Teh, Y. W. (2011). The sequence memoizer. Communications of the ACM, 54(2), 91–98.

Wu et al., 2018

Wu, B., Wan, A., Yue, X., Jin, P., Zhao, S., Golmant, N., … Keutzer, K. (2018). Shift: a zero flop, zero parameter alternative to spatial convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9127–9135).

Wu et al., 2016

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. ArXiv:1609.08144.

Xiao et al., 2017

Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv:1708.07747.

Xiao et al., 2018

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., & Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. International Conference on Machine Learning (pp. 5393–5402).

Xie et al., 2017

Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1492–1500).

Xiong et al., 2020

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., … Liu, T. (2020). On layer normalization in the transformer architecture. International Conference on Machine Learning (pp. 10524–10533).

Xiong et al., 2018

Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5934–5938).

Yamaguchi et al., 1990

Yamaguchi, K., Sakamoto, K., Akabane, T., & Fujimoto, Y. (1990). A neural network for speaker-independent isolated word recognition. First International Conference on Spoken Language Processing.

Yang et al., 2016

Yang, Z., Hu, Z., Deng, Y., Dyer, C., & Smola, A. (2016). Neural machine translation with recurrent attention modeling. ArXiv:1607.05108.

Yang et al., 2015

Yang, Z., Moczulski, M., Denil, M., De Freitas, N., Smola, A., Song, L., & Wang, Z. (2015). Deep fried convnets. Proceedings of the IEEE International Conference on Computer Vision (pp. 1476–1483).

Ye et al., 2011

Ye, M., Yin, P., Lee, W.-C., & Lee, D.-L. (2011). Exploiting geographical influence for collaborative point-of-interest recommendation. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 325–334).

You et al., 2017

You, Y., Gitman, I., & Ginsburg, B. (2017). Large batch training of convolutional networks. ArXiv:1708.03888.

Yu et al., 2022

Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., … Wu, Y. (2022). Scaling autoregressive models for content-rich text-to-image generation. ArXiv:2206.10789.

Zaheer et al., 2018

Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Kumar, S. (2018). Adaptive methods for nonconvex optimization. Advances in Neural Information Processing Systems (pp. 9793–9803).

Zeiler, 2012

Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. ArXiv:1212.5701.

Zeiler & Fergus, 2013

Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. ArXiv:1301.3557.

Zhang et al., 2021a

Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2021). Beyond fully-connected layers with quaternions: parameterization of hypercomplex multiplications with 1/n parameters. International Conference on Learning Representations.

Zhang et al., 2021b

Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.

Zhang et al., 2019

Zhang, S., Yao, L., Sun, A., & Tay, Y. (2019). Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys, 52(1), 5.

Zhang et al., 2022

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., … et al. (2022). OPT: open pre-trained transformer language models. ArXiv:2205.01068.

Zhang et al., 1988

Zhang, W., Tanida, J., Itoh, K., & Ichioka, Y. (1988). Shift-invariant pattern recognition neural network and its optical architecture. Proceedings of Annual Conference of the Japan Society of Applied Physics.

Zhang et al., 2021c

Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., … Wang, X. (2021). ByteTrack: multi-object tracking by associating every detection box. ArXiv:2110.06864.

Zhang et al., 2023a

Zhang, Z., Zhang, A., Li, M., & Smola, A. (2023). Automatic chain of thought prompting in large language models. International Conference on Learning Representations.

Zhang et al., 2023b

Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., & Smola, A. (2023). Multimodal chain-of-thought reasoning in language models. ArXiv:2302.00923.

Zhao et al., 2019

Zhao, Z.-Q., Zheng, P., Xu, S.-t., & Wu, X. (2019). Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212–3232.

Zhou et al., 2023

Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., … Chi, E. (2023). Least-to-most prompting enables complex reasoning in large language models. International Conference on Learning Representations.

Zhu et al., 2017

Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision (pp. 2223–2232).

Zhu et al., 2015

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE International Conference on Computer Vision (pp. 19–27).

Zoph & Le, 2016

Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. ArXiv:1611.01578.