Anne Morvan

Anne Morvan

Ph.D in Machine Learning, Machine Learning Scientist at Expedia Group

amorvan [at] expediagroup [dot] com

Currently Machine Learning Scientist at Expedia Group in Geneva, I hold a PhD in Machine Learning from Université Paris-Dauphine, Paris Sciences et Lettres (PSL) Research University.

During my PhD, I had the chance to be advised by Professor Jamal Atif (CNRS, LAMSADE) and Cédric Gouy-Pailler, Technical Lead of Streaming Data Analytics at CEA (French Center for Atomic and alternative Energies). This PhD thesis mainly focuses on summarizing massive data streams (structured projections, linear sketching, hashing/dictionary learning...) for Machine Learning applications. I worked also closely with Krzysztof Choromanski researcher at Google Brain Robotics NYC, Antoine Souloumiac researcher at CEA, Rafaël Pinot PhD student at CEA/Université Paris-Dauphine and Florian Yger, associate Professor at Université Paris-Dauphine.

Currently, my PhD work has led to $4$ publications in top-tier international conferences (AISTATS'17, ICASSP 2018, SIAM SDM'18, UAI 2018).

I defended my PhD thesis "Contributions to unsupervised learning from massive high-dimensional data streams: structuring, hashing and clustering" on Monday, 12th November 2018 at Université Paris-Dauphine and I have been laureate of the "Jeune Chercheur" (Young Researcher in French) Prize of the Dauphine Foundation for this work.

Previously, I also graduated from engineering school EISTI (Data Science option) and Université Paris-Dauphine for the Master's degree in Computer Science Intelligent Systems (Machine Learning speciality).

My Research work at Expedia Group

Hotel2Vec: Learning Hotel Embeddings from User Click Sessions with Side Information

Abstract: We propose a new neural network architecture for learning vector representations of items with attributes, specifically hotels. Unlike previous works, which typically only rely on modeling of user-item interactions for learning item embeddings, we propose a framework that combines several sources of data, including user clicks, hotel attributes (e.g., property type, star rating, average user rating), amenity information (e.g., if the hotel has free Wi-Fi or free breakfast), and geographic information that leverages an hexagonal geospatial system as well as spatial encoders. During model training, a joint embedding is learned from all of the above information. We show that including structured attributes about hotels enables us to make better predictions in a downstream task than when we rely exclusively on click data. We train our embedding model on more than 60 million user click sessions from a leading online travel platform, and learn embeddings for more than one million hotels. Our final learned embeddings integrate distinct sub-embeddings for user clicks, hotel attributes, and geographic information, providing a representation that can be used flexibly depending on the application. An important advantage of the proposed neural model is that it addresses the cold-start problem for hotels with insufficient historical click information by incorporating additional hotel attributes, which are available for all hotels. We show through the results of an online A/B test that our model generates high-quality representations that boost the performance of a hotel recommendation system on a large online travel platform.

Authors: Ioannis Partalas, Anne Morvan, Ali Sadeghian, Shervin Minaee, Xinxin Li, Brooke Cowan, Daisy Zhe Wang. Proceedings of the Workshop on (Recommenders in Tourism) co-located with the 15th ACM Conference on Recommender Systems (RecSys 2021), 2974, pp.69-84, Online, 26 Sep.

[PDF]

My PhD work

CEA DGA Université Paris-Dauphine LAMSADE CNRS PSL Université

Structured adaptive and random spinners for fast machine learning computations

Abstract: We consider an efficient computational framework for speeding up several machine learning algorithms with almost no loss of accuracy. The proposed framework relies on projections via structured matrices that we call Structured Spinners, which are formed as products of three structured matrix-blocks that incorporate rotations. The approach is highly generic, i.e. i) structured matrices under consideration can either be fully-randomized or learned, ii) our structured family contains as special cases all previously considered structured schemes, iii) the setting extends to the non-linear case where the projections are followed by non-linear functions, and iv) the method finds numerous applications including kernel approximations via random feature maps, dimensionality reduction algorithms,new fast cross-polytope LSH techniques, deep learning, convex optimization algorithms via Newton sketches, quantization with random projection trees, and more. The proposed framework comes with theoretical guarantees characterizing the capacity of the structured model in reference to its unstructured counterpart and is based on a general theoretical principle that we describe in the paper. As a consequence of our theoretical analysis, we provide the first theoretical guarantees for one of the most efficient existing LSH algorithms based on the $HD_3 HD_2 HD_1$ structured matrix [Andoni et al., 2015]. The exhaustive experimental evaluation confirms the accuracy and efficiency of structured spinners for a variety of different applications.

Authors: Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Francois Fagan, Cédric Gouy-Pailler, Anne Morvan, Nouri Sakr, Tamas Sarlos, Jamal Atif. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS'17), 54, pp.1020-1029, Fort Lauderdale, FL, USA, 20-22 Apr.

I gave an oral presentation on preliminary work of this paper at Google Research NY seminar on July, 14th 2016. I also presented the final work at Expedia Geneva on April, 6th 2018.

[PDF] [Code] [Google seminar slides] [AISTATS 2017 Poster] [Expedia seminar slides]

Streaming Binary Sketching based on Subspace Tracking and Diagonal Uniformization

Abstract: In this paper, we address the problem of learning compact similarity-preserving embeddings for massive high-dimensional streams of data in order to perform efficient similarity search. We present a new online method for computing binary compressed representations -sketches- of high-dimensional real feature vectors. Given an expected code length $c$ and high-dimensional input data points, our algorithm provides a $c$-bits binary code for preserving the distance between the points from the original high-dimensional space. Our algorithm does not require neither the storage of the whole dataset nor a chunk, thus it is fully adaptable to the streaming setting. It also provides low time complexity and convergence guarantees. We demonstrate the quality of our binary sketches through experiments on real data for the nearest neighbors search task in the online setting.

Authors: Anne Morvan, Antoine Souloumiac, Cédric Gouy-Pailler, Jamal Atif. Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Calgary, Alberta, Canada, 15-20 Apr.

I presented this work at ICASSP 2018 (Calgary, Canada) during poster session on Tuesday, April 17 2018.

[PDF] [Code] [ICASSP 2018 Poster]

On the Needs for Rotations in Hypercubic Quantization Hashing

Abstract: The aim of this paper is to endow the well-known family of hypercubic quantization hashing methods with theoretical guarantees. In hypercubic quantization, applying a suitable (random or learned) rotation after dimensionality reduction has been experimentally shown to improve the results accuracy in the nearest neighbors search problem. We prove in this paper that the use of these rotations is optimal under some mild assumptions: getting optimal binary sketches is equivalent to applying a rotation uniformizing the diagonal of the covariance matrix between data points. Moreover, for two closed points, the probability to have dissimilar binary sketches is upper bounded by a factor of the initial distance between the data points. Relaxing these assumptions, we obtain a general concentration result for random matrices. We also provide some experiments illustrating these theoretical points and compare a set of algorithms in both the batch and online settings.

Authors: Anne Morvan, Antoine Souloumiac, Krzysztof Choromanski, Cédric Gouy-Pailler, Jamal Atif. 2018

[PDF] [Code]

Graph sketching-based Space-efficient Data Clustering

Abstract: In this paper, we address the problem of recovering arbitrary-shaped data clusters from datasets while facing high space constraints, as this is for instance the case in many real-world applications when analysis algorithms are directly deployed on resources-limited mobile devices collecting the data. We present DBMSTClu a new space-efficient density-based non-parametric method working on a Minimum Spanning Tree (MST) recovered from a limited number of linear measurements i.e. a sketched version of the dissimilarity graph $\mathcal{G}$ between the $N$ objects to cluster. Unlike $k$-means, $k$-medians or $k$-medoids algorithms, it does not fail at distinguishing clusters with particular forms thanks to the property of the MST for expressing the underlying structure of a graph. No input parameter is needed contrarily to DBSCAN or the Spectral Clustering method. An approximate MST is retrieved by following the dynamic semi-streaming model in handling the dissimilarity graph $\mathcal{G}$ as a stream of edge weight updates which is sketched in one pass over the data into a compact structure requiring $O(N \operatorname{polylog}(N))$ space, far better than the theoretical memory cost $O(N^2)$ of $\mathcal{G}$. The recovered approximate MST $\mathcal{T}$ as input, DBMSTClu then successfully detects the right number of nonconvex clusters by performing relevant cuts on $\mathcal{T}$ in a time linear in $N$. We provide theoretical guarantees on the quality of the clustering partition and also demonstrate its advantage over the existing state-of-the-art on several datasets.

Authors: Anne Morvan, Krzysztof Choromanski, Cédric Gouy-Pailler, Jamal Atif. Proceedings of the 2018 SIAM International Conference on Data Mining (SDM'18), pp.10-18, San Diego, CA, USA, 3-4 May.

I presented preliminary work of this paper during the poster session at the DS3 Data Science Summer School of Ecole Polytechnique on August, 30th 2017.

This paper I presented this work at SDM'18 (San Diego, USA) during oral and poster sessions on Thursday, May 3 2018.

[PDF] [Code] [DS3 Poster] [SDM'18 Poster] [SDM'18 Slides]

Graph-based Clustering under Differential Privacy

Abstract: In this paper, we present the first differentially private clustering method for arbitrary-shaped node clusters in a graph. This algorithm takes as input only an approximate Minimum Spanning Tree (MST) $\mathcal{T}$ released under weight differential privacy constraints from the graph. Then, the underlying nonconvex clustering partition is successfully recovered from cutting optimal cuts on $\mathcal{T}$. As opposed to existing methods, our algorithm is theoretically well-motivated. Experiments support our theoretical findings.

Authors: Rafael Pinot, Anne Morvan, Florian Yger, Cédric Gouy-Pailler, Jamal Atif. Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI 2018), Monterey, CA, USA, 6-10 Aug.

[PDF] [Code DBMSTClu] [Code PAMST]

My MSc work

Deep Learning: Solving the detection problem

Abstract: Deep learning algorithms such as convolutional neural networks dramatically changed the computer vision landscape by outperforming other state-of-the-art models in many object recognition tasks. Most benchmarks so far take place in the classification context, where an image known to contain a relevant object has to be labeled using one of the known object classes. In comparison, the use of deep learning algorithms in the object detection task is much less investigated. During this internship, several aspects related to object detection have been examined with a particular focus on pedestrian detection. First, a state of the art is made on object and pedestrian detection. Then, a classifier model is proposed and the results for the pedestrian classification tasks are presented.

Author: Anne Morvan. Submitted in fulfilment of the requirements for the degree of Master of Science from Université Paris-Dauphine and Enginering degree from EISTI, 2015.

[PDF] [Oral slides]

Grants & Awards

2020 - 1rst Prize of 2020 EG Global Hackathon "The Future of Travel"

Project manager of a team of 4 data scientists and 1 software engineer.

Judges: Peter Kern (CEO), John Kim (President of Platform and Marketplaces) and Karen Bolda (CTO).

2015 - CEA Contrat formation par la recherche (CFR) grant

100% Funding of the PhD thesis.

2015 - Direction Générale de l'Armement (DGA) grant

50% Funding of the PhD thesis.

2015 - The Singapore International Pre-Graduate Award (SIPGA)

100% Funding of short-term research attachments for top international students at A*STAR.

Miscellaneous