# Anne Morvan

Ph.D Student in Machine Learning at CEA/Université Paris-Dauphine, PSL Research University

anne [dot] morvan [at] cea [dot] fr

Graduate from engineering school EISTI (Data Science option) and Paris-Dauphine Université for the Master's degree in Computer Science Intelligent Systems (Machine Learning speciality), I am currently on my last year of PhD thesis in Machine Learning with the CEA (French Center for Atomic and alternative Energies) and Université Paris-Dauphine, PSL Research University.

I have the chance to be advised by Professor Jamal Atif (CNRS, LAMSADE) and Cédric Gouy-Pailler, Technical Lead of Streaming Data Analytics at CEA. This PhD thesis mainly focuses on summarizing massive data streams (structured projection, linear sketching, hashing/dictionary learning...) for Machine Learning applications. I worked also closely with Krzysztof Choromanski researcher at Google Brain Robotics NYC, Antoine Souloumiac researcher at CEA, Rafaël Pinot PhD student at CEA/Université Paris-Dauphine and Florian Yger, associate Professor at Université Paris-Dauphine.

Spending much of my free time in Switzerland, after the defense of my PhD (October 2018), I will be looking for a Data scientist / Machine Learning scientist position preferably in Geneva area.

## My PhD work

### Structured adaptive and random spinners for fast machine learning computations

Abstract: We consider an efficient computational framework for speeding up several machine learning algorithms with almost no loss of accuracy. The proposed framework relies on projections via structured matrices that we call Structured Spinners, which are formed as products of three structured matrix-blocks that incorporate rotations. The approach is highly generic, i.e. i) structured matrices under consideration can either be fully-randomized or learned, ii) our structured family contains as special cases all previously considered structured schemes, iii) the setting extends to the non-linear case where the projections are followed by non-linear functions, and iv) the method finds numerous applications including kernel approximations via random feature maps, dimensionality reduction algorithms,new fast cross-polytope LSH techniques, deep learning, convex optimization algorithms via Newton sketches, quantization with random projection trees, and more. The proposed framework comes with theoretical guarantees characterizing the capacity of the structured model in reference to its unstructured counterpart and is based on a general theoretical principle that we describe in the paper. As a consequence of our theoretical analysis, we provide the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the $HD_3 HD_2 HD_1$ structured matrix [Andoni et al., 2015]. The exhaustive experimental evaluation confirms the accuracy and efficiency of structured spinners for a variety of different applications.

Authors: Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Francois Fagan, Cédric Gouy-Pailler, Anne Morvan, Nouri Sakr, Tamas Sarlos, Jamal Atif. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS'17), 54, pp.1020-1029, Fort Lauderdale, FL, USA, 20-22 Apr.

I gave an oral presentation on preliminary work of this paper at Google Research NY seminar on July, 14th 2016.

### Streaming Binary Sketching based on Subspace Tracking and Diagonal Uniformization

Abstract: In this paper, we address the problem of learning compact similarity-preserving embeddings for massive high-dimensional streams of data in order to perform efficient similarity search. We present a new online method for computing binary compressed representations -sketches- of high-dimensional real feature vectors. Given an expected code length $c$ and high-dimensional input data points, our algorithm provides a $c$-bits binary code for preserving the distance between the points from the original high-dimensional space. Our algorithm does not require neither the storage of the whole dataset nor a chunk, thus it is fully adaptable to the streaming setting. It also provides low time complexity and convergence guarantees. We demonstrate the quality of our binary sketches through experiments on real data for the nearest neighbors search task in the online setting.

Authors: Anne Morvan, Antoine Souloumiac, Cédric Gouy-Pailler, Jamal Atif. To appear at the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, Alberta, Canada, 15-20 Apr.

[PDF] [Code]

### On the Needs for Rotations in Hypercubic Quantization Hashing

Abstract: The aim of this paper is to endow the well-known family of hypercubic quantization hashing methods with theoretical guarantees. In hypercubic quantization, applying a suitable (random or learned) rotation after dimensionality reduction has been experimentally shown to improve the results accuracy in the nearest neighbors search problem. We prove in this paper that the use of these rotations is optimal under some mild assumptions: getting optimal binary sketches is equivalent to applying a rotation uniformizing the diagonal of the covariance matrix between data points. Moreover, for two closed points, the probability to have dissimilar binary sketches is upper bounded by a factor of the initial distance between the data points. Relaxing these assumptions, we obtain a general concentration result for random matrices. We also provide some experiments illustrating these theoretical points and compare a set of algorithms in both the batch and online settings.

Authors: Anne Morvan, Antoine Souloumiac, Krzysztof Choromanski, Cédric Gouy-Pailler, Jamal Atif. 2018

[PDF] [Code]

### Graph sketching-based Space-efficient Data Clustering

Abstract: In this paper, we address the problem of recovering arbitrary-shaped data clusters from datasets while facing high space constraints, as this is for instance the case in many real-world applications when analysis algorithms are directly deployed on resources-limited mobile devices collecting the data. We present DBMSTClu a new space-efficient density-based non-parametric method working on a Minimum Spanning Tree (MST) recovered from a limited number of linear measurements i.e. a sketched version of the dissimilarity graph $\mathcal{G}$ between the $N$ objects to cluster. Unlike $k$-means, $k$-medians or $k$-medoids algorithms, it does not fail at distinguishing clusters with particular forms thanks to the property of the MST for expressing the underlying structure of a graph. No input parameter is needed contrarily to DBSCAN or the Spectral Clustering method. An approximate MST is retrieved by following the dynamic semi-streaming model in handling the dissimilarity graph $\mathcal{G}$ as a stream of edge weight updates which is sketched in one pass over the data into a compact structure requiring $O(N \operatorname{polylog}(N))$ space, far better than the theoretical memory cost $O(N^2)$ of $\mathcal{G}$. The recovered approximate MST $\mathcal{T}$ as input, DBMSTClu then successfully detects the right number of nonconvex clusters by performing relevant cuts on $\mathcal{T}$ in a time linear in $N$. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state-of-the-art on several datasets.

Authors: Anne Morvan, Krzysztof Choromanski, Cédric Gouy-Pailler, Jamal Atif. To appear at SIAM International Conference on DATA MINING (SDM'18), San Diego, CA, USA, 3-4 May.

I presented preliminary work of this paper during the poster session at the DS3 Data Science Summer School of Ecole Polytechnique on August, 30th 2017.

[PDF] [Code] [DS3 Poster]

### Graph-based Clustering under Differential Privacy

Abstract: In this paper, we present the first differentially private clustering method for arbitrary-shaped node clusters in a graph. This algorithm takes as input only an approximate Minimum Spanning Tree (MST) $\mathcal{T}$ released under weight differential privacy constraints from the graph. Then, the underlying nonconvex clustering partition is successfully recovered from cutting optimal cuts on $\mathcal{T}$. As opposed to existing methods, our algorithm is theoretically well-motivated. Experiments support our theoretical findings.

Authors: Rafael Pinot, Anne Morvan, Florian Yger, Cédric Gouy-Pailler, Jamal Atif. 2018

[PDF]

## My MSc work

### Deep Learning: Solving the detection problem

Abstract: Deep learning algorithms such as convolutional neural networks dramatically changed the computer vision landscape by outperforming other state-of-the-art models in many object recognition tasks. Most benchmarks so far take place in the classification context, where an image known to contain a relevant object has to be labeled using one of the known object classes. In comparison, the use of deep learning algorithms in the object detection task is much less investigated. During this internship, several aspects related to object detection have been examined with a particular focus on pedestrian detection. First, a state of the art is made on object and pedestrian detection. Then, a classifier model is proposed and the results for the pedestrian classification tasks are presented.

Author: Anne Morvan. Submitted in fulfilment of the requirements for the degree of Master of Science from Université Paris-Dauphine and Enginering degree from EISTI, 2015.

[PDF] [Oral slides]