At the core of Artificial Intelligence (AI) lies a set of elaborate non-linear, data-driven or implicitly-defined machine learning methods and algorithms. The latter however largely rely on "small dimensional intuitions" and heuristics which have recently been shown to be *mostly inappropriate and to behave strikingly differently in large dimensions* (see for instance the case of kernel spectral clustering in Fig. 1, or semi-supervised learning in Fig. 2). Recent advances in tools from large dimensional statistics, random matrix theory and statistical physics have provided a series of answers to this curse of dimensionality, proposing a renewed understanding of elementary ML methods for big data and means of *striking improvements through novel algorithms* (in the context of community detection, graph semi-supervised learning, subspace clustering, etc.). Of particular interest is the *random matrix analysis of simple neural network structures*.

More importantly, while mostly relying on simple modelling (i.i.d. Gaussian, simple mixture models, etc.), these tools are adequate for and resilient to realistic datasets, as they provably demonstrate *universality features*. Precisely, leveraging a new approach to concentration of measure theory, these results *fully explain the behavior of realistic advanced ML algorithms, such as deep learners and GANs* (see Fig. 3).
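The concentration of measure phenomenon invoked here can be observed numerically. The minimal NumPy sketch below (our own illustration, not code from the chair) shows that the 1-Lipschitz map x ↦ ‖x‖ of a standard Gaussian vector has relative fluctuations shrinking as the dimension p grows, the basic mechanism behind these universality results:

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_fluctuations(p, n_draws=2000):
    """Standard deviation of the normalized norm ||x||/sqrt(p) for x ~ N(0, I_p)."""
    X = rng.standard_normal((n_draws, p))
    return np.std(np.linalg.norm(X, axis=1) / np.sqrt(p))

# The 1-Lipschitz functional x -> ||x|| concentrates around sqrt(p):
# relative fluctuations decay like O(1/sqrt(p)).
for p in (16, 256, 4096):
    print(p, norm_fluctuations(p))
```

The same qualitative behavior holds for any Lipschitz functional of a concentrated random vector, which is what makes such results robust to the modelling details.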

The GSTATS and MIAI LargeDATA chairs aim at gathering these findings into a coherent new *random matrix paradigm for big data machine learning*. In particular, the project relies on innovative key theoretical directions:

- (i) large dimensional statistics (random matrix theory) for the analysis and improvement of non-linear optimization, kernel methods, generalized linear mixed models, etc.,
- (ii) concentration of measure theory and universality for deep learning understanding,
- (iii) statistical physics methods for sparse graph mining, clustering, and neural network analysis,

**1) Random Matrix Theory for AI:**

- RMT analysis and improvement of ML methods in large dimensional regimes (kernel random matrices, spectral methods, random neural nets)
- Asymptotics of optimization problems in machine learning, generalized linear mixed models
- Large dimensional estimation and detection
- Statistical learning on large dimensional graphs

**2) Statistical Physics Approaches:**

- Statistical physics for large and sparse data and graphs
- Neural network asymptotics

**3) Universality Results: from Theory to Practice:**

- Universality through concentration of measure advances for ML
- Universal models and performance in applied areas (from electrical engineering to computer vision, statistical biology, finance, BCI, etc.).

**HUAWEI RMT4AI:** We collaborate with HUAWEI Labs within the scope of a 2-year project (2020-2022) on the fundamental limitations of AI.

- **PhD Thesis:** Asymptotics of large dimensional non-convex machine learning (Charles Séjournée, advisor: R. Couillet).
- **Postdoc:** Random tensors in large dimensions: spiked models and fundamental limits (Henrique Goulard, advisors: P. Comon, R. Couillet).

**STMicroelectronics Embedded AI:** The STM-LargeDATA collaboration aims at designing and studying cost-efficient methods for embedded machine learning.

- **PhD Thesis:** Practical considerations on embedded AI (XXX, advisor: Stéphane Mancini).

**CEA List:**

- **PhD Thesis:** Concentration of measure theory and random matrices for machine learning (Cosme Louart, advisors: R. Couillet, M. Tamaazousti).
- **PhD Thesis:** Random matrix theory for AI: from theory to practice (Mohammed El Amine Seddik, advisors: R. Couillet, M. Tamaazousti).

**Academic Theses:**

- **PhD Thesis (sponsored by MIAI) -- 2020-2023:** Information-theoretic bounds for large dimensional ML (Minh-Toan Nguyen, advisors: R. Couillet, P. Comon).
- **PhD Thesis (sponsored by MIAI) -- 2020-2023:** Structured random matrix models and the complexity-performance tradeoff (Tayeb Zarrouk, advisors: F. Chatelain, R. Couillet, N. LeBihan).
- **PhD Thesis -- 2020-2023:** Randomized linear algebra for large dimensional data (Yigit Pilavci, advisors: P. O. Amblard, S. Barthelme, N. Tremblay).
- **PhD Thesis -- 2019-2022:** Statistical physics methods for large dimensional sparse data processing (Lorenzo Dall'Amico, advisors: R. Couillet, N. Tremblay).
- **PhD Thesis (with CentraleSupélec) -- 2018-2021:** Advanced random matrix methods for machine learning (Malik Tiomoko, advisors: R. Couillet, F. Pascal).
- **PhD Thesis (with Sondra@CentraleSupélec) -- 2019-2022:** Large dimensional classification in array processing (Cyprien Doz, advisors: R. Couillet, J. P. Ovarlez, C. Ren).
- **PhD Thesis -- 2019-2024:** Large dimensional statistics for financial data (Bernard Nabet, advisor: R. Couillet).

**Internships:**

- **M2 Internship -- 2021:** Semi-supervised transfer learning in large data (Victor Léger, advisors: R. Couillet, M. Tiomoko).
- **M2 Internship -- 2021:** Kernel streaming in large dimensions (Hugo Lebeau, advisors: R. Couillet, F. Chatelain).
- **M2 Internship -- 2021:** Theoretical tools for semi-sparse clustering (Jianyuang Wang, advisor: R. Couillet).
- **M2 Internship -- 2021:** Statistical physics for semi-supervised learning (Filippo Zimmaro, advisors: R. Couillet, L. Dall'Amico).
- **M2 Internship (MIAI collaboration) -- 2020:** Concentration of measure and word embeddings methods in AI (Muhammad Imran, advisors: R. Couillet, E. Gaussier).
- **M2 Internship -- 2020:** Large-scale learning on graphs (Hashem Ghanem, advisors: R. Couillet, N. Keriven, N. Tremblay).
- **M2 Internship -- 2020:** Performance Optima of Large Dimensional Machine Learning: A Random Matrix and Information Theory Analysis (Hugues Souchard de Lavoreille, advisors: R. Couillet, S. Zozor).

- R. Couillet, M. Tiomoko, S. Zozor, E. Moisan, **"Random matrix-improved estimation of covariance matrix distances"**, Journal of Multivariate Analysis, vol. 174, 104531, 2019. [article]
- X. Mai, R. Couillet, **"A Random Matrix Analysis and Improvement of Semi-Supervised Learning for Large Dimensional Data"**, Journal of Machine Learning Research, vol. 19, no. 79, pp. 1-27, 2018. [article]
- Ch. Séjourné, R. Couillet, P. Comon, **"A large-dimensional analysis of symmetric SNE"**, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'21), Toronto, Canada, 2021. [article]
- M. Seddik, C. Louart, R. Couillet, M. Tamaazousti, **"The Unexpected Deterministic and Universal Behavior of Large Softmax Classifiers"**, Artificial Intelligence and Statistics (AISTATS'21), virtual conference, 2021. [article]
- M. Tiomoko, H. Tiomoko, R. Couillet, **"Deciphering and Optimizing Multi-Task and Transfer Learning: a Random Matrix Approach"**, International Conference on Learning Representations (ICLR'21), virtual conference, 2021. **Spotlight article**. [article]
- Z. Liao, R. Couillet, M. Mahoney, **"Sparse Quantized Spectral Clustering"**, International Conference on Learning Representations (ICLR'21), virtual conference, 2021. **Spotlight article**. [article]
- R. Couillet, Y. Cinar, E. Gaussier, M. Imran, **"Word Representations Concentrate and This is Good News!"**, SIGNLL Conference on Computational Natural Language Learning (CoNLL'20), virtual conference, 2020. [article]
- M. Seddik, R. Couillet, M. Tamaazousti, **"A Random Matrix Analysis of Learning with α-Dropout"**, International Conference on Machine Learning (ICML'20), Artemiss workshop, Graz, Austria, 2020. [article]
- L. Dall'Amico, R. Couillet, N. Tremblay, **"Community detection in sparse time-evolving graphs with a dynamical Bethe-Hessian"**, Conference on Neural Information Processing Systems (NeurIPS'20), Vancouver, Canada, 2020. [article]
- Z. Liao, R. Couillet, M. Mahoney, **"A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent"**, Conference on Neural Information Processing Systems (NeurIPS'20), Vancouver, Canada, 2020. [article]
- M. Seddik, R. Couillet, M. Tamaazousti, **"Random Matrix Theory Proves that Deep Learning Representations of GAN-data Behave as Gaussian Mixtures"**, International Conference on Machine Learning (ICML'20), Graz, Austria, 2020. [article]
- T. Zarrouk, R. Couillet, F. Chatelain, N. Le Bihan, **"Performance-Complexity Trade-Off in Large Dimensional Statistics"**, International Workshop on Machine Learning for Signal Processing (MLSP'20), Espoo, Finland, 2020. [article]
- L. Dall'Amico, R. Couillet, N. Tremblay, **"Optimal Laplacian Regularization for Sparse Spectral Community Detection"**, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'20), Barcelona, Spain, 2020. [article]
- M. Tiomoko, C. Louart, R. Couillet, **"Large Dimensional Asymptotics of Multi-Task Learning"**, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'20), Barcelona, Spain, 2020. [article]

**Kernel Methods don't behave the same in Large Dimensions.** A first key finding consists in demonstrating that, under a "non-trivial" Gaussian mixture model (that is, for not too easily separable mixtures), as the dimension of the data grows large, all pairwise distances between data points concentrate around a single common value, so that kernel matrices behave radically differently from their small dimensional counterparts.

**Standard Semi-Supervised Learning Methods are Suboptimal but can be Improved.** A consequence of the large dimensional "concentration" of distances lies in the inappropriateness of many classical machine learning methods which, initially developed to tackle finite (small) dimensional problems, operate suboptimally on large dimensional data; a large dimensional analysis however allows one to correct these methods and restore, or even improve, their performance.

**Gaussian Mixtures are Universal Models.** A main frustration of large dimensional statistics versus practice often lies in the inaccuracy of modelling real datasets through basic Gaussian mixture models. We showed that this state of fact is much less true in large dimensional data, which behave much more like Gaussian random variables than in small dimensions. We theoretically proved this statement as follows: (i) random matrix universality results occur in large dimensional data which, in particular, make the asymptotics of kernel and neural network classification depend only on the first and second order statistics of the data.
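The distance concentration underlying the first two findings above can be reproduced in a short simulation. The sketch below assumes an illustrative two-class Gaussian mixture N(±μ, I_p) with ‖μ‖ = O(1) (our own toy setting, not the chair's experimental setup): within-class and between-class distances become indistinguishable at first order, which is why small dimensional kernel intuitions break down.

```python
import numpy as np

rng = np.random.default_rng(0)

p, n = 1000, 200                       # dimension comparable with sample size
mu = 2.0 * np.ones(p) / np.sqrt(p)     # assumed class mean: ||mu|| = O(1), "non-trivial" regime

# Two-class Gaussian mixture: x_i ~ N(+mu, I_p) or N(-mu, I_p)
y = rng.integers(0, 2, n)
X = rng.standard_normal((n, p)) + (2 * y[:, None] - 1) * mu

# Normalized pairwise squared distances ||x_i - x_j||^2 / p (Gram-matrix trick)
sq = np.sum(X ** 2, axis=1)
D = (sq[:, None] + sq[None, :] - 2 * X @ X.T) / p

off = ~np.eye(n, dtype=bool)           # discard the zero diagonal
same = (y[:, None] == y[None, :]) & off
diff = y[:, None] != y[None, :]

# All distances concentrate around 2 (twice the noise variance), whatever the
# classes: the discriminating information hides in O(1/p) corrections.
print(D[same].mean(), D[diff].mean(), D[off].std())
```

In this setting the class information survives only in vanishing corrections to a common limit, which is precisely what random matrix analyses of kernel matrices exploit to correct and improve spectral clustering and semi-supervised learning.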