At the core of Artificial Intelligence (AI) lies a set of elaborate non-linear, data-driven or implicitly defined machine learning methods and algorithms. These, however, largely rely on "small-dimensional intuitions" and heuristics which have recently been shown to be mostly inappropriate, as they behave strikingly differently in large dimensions (see for instance the case of kernel spectral clustering in Fig. 1, or semi-supervised learning in Fig. 2). Recent advances in tools from large dimensional statistics, random matrix theory and statistical physics have provided a series of answers to this curse of dimensionality, by proposing a renewed understanding of elementary ML methods for big data, along with novel algorithms that strikingly improve them (in the context of community detection, graph-based semi-supervised learning, subspace clustering, etc.). Of particular interest is the
random matrix analysis of simple neural network structures.
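To make this breakdown of small-dimensional intuition concrete, here is a minimal self-contained sketch (a toy experiment under illustrative assumptions, not the actual setting of Fig. 1): for a two-class isotropic Gaussian mixture N(+/-mu, I_p) with ||mu|| held fixed as p grows, the normalized pairwise distances ||x_i - x_j||^2 / p all concentrate around the same value. The within-class versus between-class gap decays like 1/p while the random fluctuations only decay like 1/sqrt(p), so individual kernel entries become asymptotically uninformative about the classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_stats(p, n=200, mu_norm=2.0):
    """Two-class Gaussian mixture N(+/-mu, I_p), with ||mu|| held fixed as p grows."""
    mu = np.zeros(p)
    mu[0] = mu_norm
    labels = rng.integers(0, 2, size=n)
    X = rng.standard_normal((n, p)) + np.where(labels[:, None] == 0, mu, -mu)
    # Normalized squared pairwise distances ||x_i - x_j||^2 / p, via the Gram matrix.
    G = X @ X.T
    sq = (np.diag(G)[:, None] + np.diag(G)[None, :] - 2 * G) / p
    iu = np.triu_indices(n, k=1)
    same = (labels[:, None] == labels[None, :])[iu]
    d = sq[iu]
    return d[same].mean(), d[~same].mean(), d.std()

for p in (10, 100, 1000, 10000):
    w, b, s = distance_stats(p)
    print(f"p={p:6d}  within={w:.3f}  between={b:.3f}  "
          f"class gap={b - w:.4f}  fluctuation={s:.4f}")
```

Random matrix theory resolves the apparent paradox: although each kernel entry is asymptotically blind to the classes, the class information accumulates in the spectrum of the full n x n kernel matrix, which is precisely where the improved algorithms mentioned above operate.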
More importantly, while mostly relying on simple data models (i.i.d. Gaussian vectors, simple mixture models, etc.), these tools remain adequate for, and resilient to, realistic datasets, as they provably exhibit universality features. Precisely, leveraging a new approach to concentration of measure theory, these results fully explain how advanced ML algorithms, such as deep learners and GANs, behave on realistic data (see Fig. 3).
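A minimal sketch of the concentration-of-measure mechanism at play (an illustrative stand-in, not the chairs' actual framework): a GAN generator is by construction a Lipschitz map applied to Gaussian noise, and Lipschitz images of Gaussian vectors are "concentrated vectors". Below, a random spectrally normalized ReLU network (an assumed toy generator) replaces a trained one, and we check that a 1-Lipschitz observable, here the Euclidean norm, fluctuates O(1) whatever the dimension p:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_lipschitz_generator(p, depth=3):
    """Random ReLU network with spectrally normalized weights: each layer,
    and hence the whole map, is 1-Lipschitz."""
    Ws = [rng.standard_normal((p, p)) for _ in range(depth)]
    Ws = [W / np.linalg.norm(W, 2) for W in Ws]   # divide by operator norm
    def g(Z):                                     # Z: (n, p) batch of latent vectors
        for W in Ws:
            Z = np.maximum(Z @ W.T, 0.0)          # ReLU is itself 1-Lipschitz
        return Z
    return g

for p in (16, 64, 256, 1024):
    Z = rng.standard_normal((5000, p))            # latent Gaussian inputs
    out = random_lipschitz_generator(p)(Z)
    norm_in = np.linalg.norm(Z, axis=1)
    norm_out = np.linalg.norm(out, axis=1)
    # Gaussian concentration: 1-Lipschitz observables fluctuate O(1), whatever p.
    print(f"p={p:5d}  ||z||: mean={norm_in.mean():6.2f}, std={norm_in.std():.3f}   "
          f"||g(z)||: mean={norm_out.mean():6.2f}, std={norm_out.std():.3f}")
```

While the mean of the input norm grows like sqrt(p), the standard deviation of both observables stays of order one, independently of p. This is the universality mechanism by which Gaussian-based random matrix predictions carry over to such realistic, generator-produced data.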
The GSTATS and MIAI LargeDATA chairs aim to gather these findings into a coherent new random matrix paradigm for big data machine learning. In particular, the project relies on the following key innovative theoretical directions:
- (i) large dimensional statistics (random matrix theory) for the analysis and improvement of non-linear optimization, kernel methods, generalized linear mixed models, etc. (a minimal illustration of this regime is sketched after this list);
- (ii) concentration of measure theory and universality for the understanding of deep learning;
- (iii) statistical physics methods for sparse graph mining, clustering, and neural network analysis.
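As a minimal illustration of the large-dimensional regime underlying direction (i), the following textbook experiment (standard random matrix material, not project code) shows why classical intuition fails when the number of features p is commensurate with the number of samples n: even for data with identity population covariance, the sample covariance eigenvalues do not concentrate around 1 but spread over the Marchenko-Pastur support:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sample covariance of n i.i.d. observations of N(0, I_p): every population
# eigenvalue equals 1, yet as soon as c = p/n is non-negligible the sample
# eigenvalues spread over [(1 - sqrt(c))^2, (1 + sqrt(c))^2].
for n, p in ((4000, 40), (4000, 1000), (4000, 2000)):
    X = rng.standard_normal((n, p))
    sample_eigs = np.linalg.eigvalsh(X.T @ X / n)
    c = p / n
    lo, hi = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
    print(f"c = p/n = {c:4.2f}:  sample eigenvalues in "
          f"[{sample_eigs.min():.3f}, {sample_eigs.max():.3f}],  "
          f"MP support [{lo:.3f}, {hi:.3f}]")
```

This is exactly the regime where fixed-p asymptotics mislead, and where the random matrix tools of direction (i) take over.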