Mihai Cucuringu

Associate Professor

Department of Statistics
Mathematical Institute
Oxford-Man Institute of Quantitative Finance
University of Oxford

Stipendiary Lecturer
Merton College

Turing Fellow
The Alan Turing Institute

[Homepage]      [Research]     [StatML in Finance]     [CV (2019)]      [Personal]     

See the Homepage for an up-to-date list of publications, and here for finance-related papers only.

[Last updated: 2021. Please see Google Scholar, arXiv, and SSRN for an up-to-date list]

 (37) Y. He, G. Reinert, M. Cucuringu, DIGRAC: Digraph Clustering with Flow Imbalance, (arXiv) [BibTeX] (2021)

Node clustering is a powerful tool in the analysis of networks. Here, we introduce a graph neural network framework with a novel scalable Directed Mixed Path Aggregation (DIMPA) scheme to obtain node embeddings for directed networks in a self-supervised manner, including a novel probabilistic imbalance loss. The method is end-to-end in combining embedding generation and clustering without an intermediate step. In contrast to standard approaches in the literature, in this paper, directionality is not treated as a nuisance, but rather contains the main signal. In particular, we leverage the recently introduced cut flow imbalance measure, which is tightly related to directionality; cut flow imbalance is optimized without resorting to spectral methods or cluster labels. Experimental results on synthetic data, in the form of directed stochastic block models and real-world data at different scales, demonstrate that our method attains state-of-the-art results on directed graph clustering, for a wide range of noise and sparsity levels and graph structures.

Keywords: graph neural networks, directed graphs, clustering, flow imbalance.

 (36) J. Albers, M. Cucuringu, S. Howison, A. Y. Shestopaloff, Fragmentation, Price Formation, and Cross-Impact in Bitcoin Markets, (arXiv) [BibTeX] (2021)

In light of micro-scale inefficiencies induced by the high degree of fragmentation of the Bitcoin trading landscape, we utilize a granular data set comprised of orderbook and trades data from the most liquid Bitcoin markets, in order to understand the price formation process at sub-1 second time scales. To achieve this goal, we construct a set of features that encapsulate relevant microstructural information over short lookback windows. These features are subsequently leveraged first to generate a leader-lagger network that quantifies how markets impact one another, and then to train linear models capable of explaining between 10% and 37% of total variation in 500ms future returns (depending on which market is the prediction target). The results are then compared with those of various PnL calculations that take trading realities, such as transaction costs, into account. The PnL calculations are based on natural taker strategies (meaning they employ market orders) that we associate to each model. Our findings emphasize the role of a market’s fee regime in determining its propensity to being a leader or a lagger, as well as the profitability of our taker strategy. Taking our analysis further, we also derive a natural maker strategy (i.e., one that uses only passive limit orders), which, due to the difficulties associated with backtesting maker strategies, we test in a real-world live trading experiment, in which we turned over 1.5 million USD in notional volume. Lending additional confidence to our models, and by extension to the features they are based on, the results indicate a significant improvement over a naive benchmark strategy, which we also deploy in a live trading environment with real capital, for the sake of comparison..

Keywords: Limit order books, market microstructure, liquidity, market fragmentation, price impact, trading volume, high frequency data, Bitcoin futures, derivatives, cryptocurrency.

 (35) M. Cucuringu, H. Tyagi, An extension of the angular synchronization problem to the heterogeneous setting, (arXiv) [BibTeX] (2020)

Given an undirected measurement graph G=([n], E), the classical angular synchronization problem consists of recovering unknown angles θ1,...,θn from a collection of noisy pairwise measurements of the form θi - θj mod 2 π, for each {i,j} ∈ E. This problem arises in a variety of applications, including computer vision, time synchronization of distributed networks, and ranking from preference relationships. In this paper, we consider a generalization to the setting where there exist k unknown groups of angles θl,1,..., θl,n, for l = 1,...,k. For each {i,j} ∈ E, we are given noisy pairwise measurements of the form θl,i - θl,j {l,j} for an unknown l ∈ {1,2,...,k}. This can be thought of as a natural extension of the angular synchronization problem to the heterogeneous setting of multiple groups of angles, where the measurement graph has an unknown edge-disjoint decomposition G = G1 ∪ G2 ∪ ... ∪ Gk, where the Gi's denote the subgraphs of edges corresponding to each group. We propose a probabilistic generative model for this problem, along with a spectral algorithm for which we provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise. The theoretical findings are complemented by a comprehensive set of numerical experiments, showcasing the efficacy of our algorithm under various parameter regimes. Finally, we consider an application of bi-synchronization to the graph realization problem, and provide along the way an iterative graph disentangling procedure that uncovers the subgraphs Gi, i=1,...,k, which is of independent interest, as it is shown to improve the final recovery accuracy across all the experiments considered.

Keywords: group synchronization, spectral algorithms, matrix perturbation theory, singular value decomposition, random matrix theory.

 (34) M. Cucuringu, A.V. Singh, D. Sulem, H. Tyagi, Regularized spectral methods for clustering signed networks, (arXiv) [BibTeX] (2020)

We study the problem of k-way clustering in signed graphs. Considerable attention in recent years has been devoted to analyzing and modeling signed graphs, where the affinity measure between nodes takes either positive or negative values. Recently, Cucuringu et al. [CDGT 2019] proposed a spectral method, namely SPONGE (Signed Positive over Negative Generalized Eigenproblem), which casts the clustering task as a generalized eigenvalue problem optimizing a suitably defined objective function. This approach is motivated by social balance theory, where the clustering task aims to decompose a given network into disjoint groups, such that individuals within the same group are connected by as many positive edges as possible, while individuals from different groups are mainly connected by negative edges. Through extensive numerical simulations, SPONGE was shown to achieve state-of-the-art empirical performance. On the theoretical front, [CDGT 2019] analyzed SPONGE, as well as the popular Signed Laplacian based spectral method under the setting of a Signed Stochastic Block Model, for k=2 equal-sized clusters, in the regime where the graph is moderately dense. In this work, we build on the results in [CDGT 2019] on two fronts for the normalized versions of SPONGE and the Signed Laplacian. Firstly, for both algorithms, we extend the theoretical analysis in [CDGT 2019] to the general setting of ≥ unequal-sized clusters in the moderately dense regime. Secondly, we introduce regularized versions of both methods to handle sparse graphs -- a regime where standard spectral methods are known to underperform -- and provide theoretical guarantees under the same setting of a Signed Stochastic Block Model. To the best of our knowledge, regularized spectral methods have so far not been considered in the setting of clustering signed graphs. We complement our theoretical results with an extensive set of numerical experiments on synthetic data.

Keywords: signed clustering, graph Laplacians, stochastic block models, spectral methods, regularization techniques, sparse graphs.

 (33) W. G. Underwood, A. Elliott, M. Cucuringu, Motif-Based Spectral Clustering of Weighted Directed Networks, Applied Network Science 5, 62 [BibTeX] (2020)

Clustering is an essential technique for network analysis, with applications in a diverse range of fields. Although spectral clustering is a popular and effective method, it fails to consider higher-order structure and can perform poorly on directed networks. One approach is to capture and cluster higher-order structures using motif adjacency matrices. However, current formulations fail to take edge weights into account, and thus are somewhat limited when weight is a key component of the network under study. We address these shortcomings by exploring motif-based weighted spectral clustering methods. We present new and computationally useful matrix formulae for motif adjacency matrices on weighted networks, which can be used to construct efficient algorithms for any anchored or non-anchored motif on three nodes. In a very sparse regime, our proposed method can handle graphs with a million nodes and tens of millions of edges. We further use our framework to construct a motif-based approach for clustering bipartite networks. We provide comprehensive experimental results, demonstrating (i) the scalability of our approach, (ii) advantages of higher-order clustering on synthetic examples, and (iii) the effectiveness of our techniques on a variety of real world data sets; and compare against several techniques from the literature. We conclude that motif-based spectral clustering is a valuable tool for analysis of directed and bipartite weighted networks, which is also scalable and easy to implement.

Keywords: motifs, spectral clustering, weighted networks, directed networks, community detection, graph Laplacian, bipartite networks.

 (32) A. Elliott, A. Chiu, M. Bazzi, G. Reinert, M. Cucuringu, Core-periphery structure in directed networks, Proceedings of the Royal Society A 476, no. 2241 [BibTeX] (2020)

Empirical networks often exhibit different meso-scale structures, such as community and core-periphery structures. Core-periphery structure typically consists of a well-connected core and a periphery that is well connected to the core but sparsely connected internally. Most core-periphery studies focus on undirected networks. We propose a generalization of core-periphery structure to directed networks. Our approach yields a family of core-periphery block model formulations in which, contrary to many existing approaches, core and periphery sets are edge-direction dependent. We focus on a particular structure consisting of two core sets and two periphery sets, which we motivate empirically. We propose two measures to assess the statistical significance and quality of our novel structure in empirical data, where one often has no ground truth. To detect core-periphery structure in directed networks, we propose three methods adapted from two approaches in the literature, each with a different trade-off between computational complexity and accuracy. We assess the methods on benchmark networks where our methods match or outperform standard methods from the literature, with a likelihood approach achieving the highest accuracy. Applying our methods to three empirical networks-faculty hiring, a world trade dataset and political blogs - illustrates that our proposed structure provides novel insights in empirical networks.

Keywords: core-periphery, spectral methods, low-rank approximation, directed networks.

 (31) O.M. Crook, M. Cucuringu, T. Hurst, C.B. Schonlieb, M. Thorpe, K.C. Zygalakis, A Linear Transportation Lp Distance for Pattern Recognition, (arXiv:2009.11262) [BibTeX] (2020)

The transportation Lp distance, denoted TLp , has been proposed as a generalisation of Wasserstein Wp distances motivated by the property that it can be applied directly to colour or multi-channelled images, as well as multivariate time-series without normalization or mass constraints. These distances, as with Wp, are powerful tools in modelling data with spatial or temporal perturbations. However, their computational cost can make them infeasible to apply to even moderate pattern recognition tasks. We propose linear versions of these distances and show that the linear TLp distance significantly improves over the linear Wp distance on signal processing tasks, whilst being several orders of magnitude faster to compute than the TLp distance.

Keywords: optimal transport, linear embedding, multi-channelled signals.

 (30) S. L. Chau, M. Cucuringu, D. Sejdinovic, Spectral Ranking with Covariates, (arXiv) [BibTeX] (2020)

We consider approaches to the classical problem of establishing a statistical ranking on a given set of items from incomplete and noisy pairwise comparisons, and propose spectral algorithms able to leverage available covariate information about the items. We give a comprehensive study of several ways such side information can be useful in spectral ranking. We establish connections of the resulting algorithms to reproducing kernel Hilbert spaces and associated dependence measures, along with an extension to fair ranking using statistical parity. We present an extensive set of numerical experiments showcasing the competitiveness of the proposed algorithms with state-of-the-art methods.
 (29) S. Chretien, M. Cucuringu, G. Lecue, L. Neirac, Learning with Semi-Definite Programming: new statistical bounds based on fixed point analysis and excess risk curvature, (arXiv:2004.01869) [BibTeX] (2020)

Many statistical learning problems have recently been shown to be amenable to Semi-Definite Programming (SDP), with community detection and clustering in Gaussian mixture models} as the most striking instances Javanmard et al. (2016). Given the growing range of applications of SDP-based techniques to machine learning problems, and the rapid progress in the design of efficient algorithms for solving SDPs, an intriguing question is to understand how the recent advances from empirical process theory and Statistical Learning Theory can be leveraged for providing a precise statistical analysis of SDP estimators. In the present paper, we borrow cutting edge techniques and concepts from the Learning Theory literature, such as fixed point equations and excess risk curvature arguments, which yield general estimation and prediction results for a wide class of SDP estimators. From this perspective, we revisit some classical results in community detection from Guedon and Vershynin (2016) and Fei and Chen (2019), and we obtain statistical guarantees for SDP estimators used in signed clustering, group synchronization (for both multiplicative and additive models) and MAX-CUT. Our theoretical findings are complemented by numerical experiments for each of the three problems considered, showcasing the competitiveness of the SDP estimators.

Keywords: Semi-Definite Programming, Statistical Learning, Group Synchronization, Signed Clustering.

 (28) M. Cucuringu, H. Li, H. Sun, L. Zanetti, "Hermitian matrices for clustering directed graphs: insights and applications", (arXiv), In International Conference on Artificial Intelligence and Statistics (AISTATS 2020) , pp. 983-992, PMLR [BibTeX] (2020)

Graph clustering is a basic technique in machine learning, and has widespread applications in different domains. While spectral techniques have been successfully applied for clustering undirected graphs, the performance of spectral clustering algorithms for directed graphs~(digraphs) is not in general satisfactory: these algorithms usually require symmetrising the matrix representing a digraph, and typical objective functions for undirected graph clustering do not capture cluster-structures in which the information given by the direction of the edges is crucial. To overcome these downsides, we propose a spectral clustering algorithm based on a complex-valued matrix representation of digraphs. We analyse its theoretical performance on a Stochastic Block Model for digraphs in which the cluster-structure is given not only by variations in edge densities, but also by the direction of the edges. The significance of our work is highlighted on a data set pertaining to internal migration in the United States: while previous spectral clustering algorithms for digraphs can only reveal that people are more likely to move between counties that are geographically close, our approach is able to cluster together counties with a similar socio-economical profile even when they are geographically distant, and illustrates how people tend to move from rural to more urbanised areas.

Keywords: clustering, directed graphs, stochastic block models, spectral methods, imbalanced flows in graphs.

 (27) M. Cucuringu, H. Tyagi, "Provably robust estimation of modulo 1 samples of a smooth function with applications to phase unwrapping", (arXiv), [code], Journal of Machine Learning Research (JMLR), 21(32):1-77, [BibTeX] 2020

Consider an unknown smooth function f : [0, 1] → R, and assume we are given n noisy mod 1 samples of f, i.e., yi = (f(xi ) + ηi) mod 1 for xi ∈ [0,1], where ηi denotes noise. Given the samples (xi,yi), i =1,...,n, our goal is to recover smooth, robust estimates of the clean samples f(xi) mod 1. We formulate a natural approach for solving this problem, which works with angular embeddings of the noisy mod 1 samples over the unit circle, inspired by the angular synchronization framework. This amounts to solving a smoothness regularized least-squares problem - a quadratically constrained quadratic program (QCQP) - where the variables are constrained to lie on the unit circle. Our proposed approach is based on solving its relaxation, which is a trust-region sub-problem and hence solvable efficiently. We provide theoretical guarantees demonstrating its robustness to noise for adversarial, as well as random Gaussian and Bernoulli noise models. To the best of our knowledge, these are the first such theoretical results for this problem. We demonstrate the robustness and efficiency of our proposed approach via extensive numerical simulations on synthetic data, along with a simple least-squares based solution for the unwrapping stage, that recovers the original samples of f (up to a global shift). It is shown to perform well at high levels of noise, when taking as input the denoised modulo 1 samples.

Finally, we also consider two other approaches for denoising the modulo 1 samples that leverage tools from Riemannian optimization on manifolds, including a Burer-Monteiro approach for a semidefinite programming relaxation of our formulation. For the two-dimensional version of the problem, which has applications in synthetic aperture radar interferometry (InSAR), we are able to solve instances of real-world data with a million sample points in under 10 seconds, on a personal laptop.

Keywords: quadratically constrained quadratic programming (QCQP), trust-region sub-problem, angular embedding, phase unwrapping, semidefinite programming, angular synchronization.

 (25) A. d'Aspremont, M. Cucuringu, H. Tyagi, Ranking and synchronization from pairwise measurements via SVD, Journal of Machine Learning Research (JMLR), 22(19):1-63 [BibTeX] (2021)

Given a measurement graph G = (V,E) and an unknown signal r ∈ Rn, we investigate algorithms for recovering r from pairwise measurements of the form ri - rj; {i,j} ∈ E. This problem arises in a variety of applications, such as ranking teams in sports data and time synchronization of distributed networks. Framed in the context of ranking, the task is to recover the ranking of n teams (induced by r) given a small subset of noisy pairwise rank offsets. We propose a simple SVD-based algorithmic pipeline for both the problem of time synchronization and ranking. We provide a detailed theoretical analysis in terms of robustness against both sampling sparsity and noise perturbations with outliers, using results from matrix perturbation and random matrix theory. Our theoretical findings are complemented by a detailed set of numerical experiments on both synthetic and real data, showcasing the competitiveness of our proposed algorithms with other state-of-the-art methods.

Keywords: ranking, group synchronization, spectral algorithms, matrix perturbation theory, singular value decomposition, low-rank matrix completion.

 (24) M. Cucuringu, A. Pizzoferrato, Y. van Gennip, An MBO scheme for clustering and semi-supervised clustering of signed networks, (arXiv) [BibTeX] (2019)

We introduce a principled method for the signed clustering problem, where the goal is to partition a graph whose edge weights take both positive and negative values, such that edges within the same cluster are mostly positive, while edges spanning across clusters are mostly negative. Our method relies on a graph-based diffuse interface model formulation utilizing the Ginzburg-Landau functional, based on an adaptation of the classic numerical Merriman-Bence-Osher (MBO) scheme for minimizing such graph-based functionals. The proposed objective function aims to minimize the total weight of inter-cluster positively-weighted edges, while maximizing the total weight of the inter-cluster negatively-weighted edges. Our method scales to large sparse networks, and can be easily adjusted to incorporate labelled data information, as is often the case in the context of semi-supervised learning. We tested our method on a number of both synthetic stochastic block models and real-world data sets (including financial correlation matrices), and obtained promising results that compare favourably against a number of state-of-the-art approaches from the recent literature.

Keywords: MBO, clustering, signed networks, graph Laplacians, spectral methods, time series, correlation clustering

 (23) A. Tsakalidis, M. Bazzi, M. Cucuringu, P. Basile, B. McGillivray, Mining the UK Web Archive for Semantic Change Detection In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019) (pp. 1212-1221) [BibTeX] (2019)

Semantic change detection (i.e., identifying words whose meaning has changed over time) started emerging as a growing area of research over the past decade, with important downstream applications in natural language processing, historical linguistics and computational social science. However, several obstacles make progress in the domain slow and difficult. These pertain primarily to the lack of well-established gold standard datasets, resources to study the problem at a fine-grained temporal resolution, and quantitative evaluation approaches. In this work, we aim to mitigate these issues by (a) releasing a new labelled dataset of more than 47K word vectors trained on the UK Web Archive over a short time-frame (2000-2013); (b) proposing a variant of Procrustes alignment to detect words that have undergone semantic shift; and (c) introducing a rank-based approach for evaluation purposes. Through extensive numerical experiments and validation, we illustrate the effectiveness of our approach against competitive baselines. Finally, we also make our resources publicly available to further enable research in the domain.
 (22) M. Cucuringu, P. Davies, A. Glielmo, H. Tyagi, "SPONGE: A generalized eigenproblem for clustering signed networks", AISTATS 2019 (code) [BibTeX] (2019)

We introduce a principled and theoretically sound spectral method for k-way clustering in signed graphs, where the affinity measure between nodes takes either positive or negative values. Our approach is motivated by social balance theory, where the task of clustering aims to decompose the network into disjoint groups such that individuals within the same group are connected by as many positive edges as possible, while individuals from different groups are connected by as many negative edges as possible. Our algorithm relies on a generalized eigenproblem formulation inspired by recent work on constrained clustering. We provide theoretical guarantees for our approach in the setting of a signed stochastic block model, by leveraging tools from matrix perturbation theory and random matrix theory. An extensive set of numerical experiments on both synthetic and real data shows that our approach compares favorably with state-of-the-art methods for signed clustering, especially for large number of clusters.

Keywords: signed clustering, graph Laplacians, stochastic block models, spectral methods, time series, clustering correlation matrices

 (21) A. Elliott, M. Cucuringu, M. M. Luaces, P. Reidy, and G. Reinert, Anomaly detection in networks with application to financial transaction networks, (arXiv) [BibTeX] (2018)

This paper is motivated by the task of detecting anomalies in networks of financial transactions, with accounts as nodes and a directed weighted edge between two nodes denoting a money transfer. The weight of the edge is the transaction amount. Examples of anomalies in networks include long paths of large transaction amounts, rings of large payments, and cliques of accounts. There are many methods available which detect such specific structures in networks. Here we introduce a method which is able to detect previously unspecified anomalies in networks. The method is based on a combination of features from network comparison and spectral analysis as well as local statistics, yielding 140 main features. We then use a simple feature sum method, as well as a random forest method, in order to classify nodes as normal or anomalous. We test the method first on synthetic networks which we generated, and second on a set of synthetic networks which were generated without the methods team having access to the ground truth. The first set of synthetic networks was split in a training set of 70 percent of the networks, and a test set of 30 percent of the networks. The resulting classifier was then applied to the second set of synthetic networks. We compare our method with Oddball, a widely used method for anomaly detection in networks, as well as to random classification. While Oddball outperforms random classification, both our feature sum method and our random forest method outperform Oddball. On the test set, the random forest outperforms feature sum, whereas on the second synthetic data set, initially feature sum tends to pick up more anomalies than random forest, with this behaviour reversing for lower-scoring anomalies. In all cases, the top 2 percent of flagged anomalies contained on average over 90 percent of the planted anomalies.

Keywords: anomaly detection, spectral methods, networkcomparison, planted subgraph

 (20) A. Tsokos, S. Narayanan, I. Kosmidis, G. Baio, M. Cucuringu, G. Whitaker and F. J. Király, "Modeling outcomes of soccer matches", (arXiv), Machine Learning 108, 77-95 (2019) [BibTeX] (2019)

Abstract: We compare various extensions of the Bradley-Terry model and a hierarchical Poisson log-linear model in terms of their performance in predicting the outcome of soccer matches (win, draw, or loss). The parameters of the Bradley-Terry extensions are estimated by maximizing the log-likelihood, or an appropriately penalized version of it, while the posterior densities of the parameters of the hierarchical Poisson log-linear model are approximated using integrated nested Laplace approximations. The prediction performance of the various modeling approaches is assessed using a novel, context-specific framework for temporal validation that is found to deliver accurate estimates of the test error. The direct modeling of outcomes via the various Bradley-Terry extensions and the modeling of match scores using the hierarchical Poisson log-linear model demonstrate similar behavior in terms of predictive performance.

Keywords: Bradley-Terry model; Poisson log-linear hierarchical model; Maximum penalized likelihood; Integrated Nested Laplace Approximation; Temporal validation

 (19) M. Cucuringu, H. Tyagi, "On denoising modulo 1 samples of a function", (arXiv), [code], AISTATS 2018 [BibTeX] (2018)

Consider an unknown smooth function f : [0, 1] → R, and say we are given n noisy mod 1 samples of f, i.e., yi = (f(xi ) + ηi) mod 1 for xi ∈ [0,1], where ηi denotes noise. Given the samples (xi,yi), i =1,...,n, our goal is to recover smooth, robust estimates of the clean samples f(xi) mod 1. We formulate a natural approach for solving this problem which works with angular embeddings of the noisy mod 1 samples over the unit complex circle, inspired by the angular synchronization framework. Our approach amounts to solving a quadratically constrained quadratic program (QCQP) which is NP-hard in its basic form, and therefore we consider its relaxation which is a trust region sub-problem and hence solvable efficiently. We demonstrate its robustness to noise via extensive numerical simulations on several synthetic examples, along with a detailed theoretical analysis. To the best of our knowledge, we provide the first algorithm for denoising mod 1 samples of a smooth function, which comes with robustness guarantees.

Keywords: Angular embedding, QCQP, phase unwrapping, angular synchronization.

 (18) M. Cucuringu, R. Erban "ADM-CLE approach for detecting slow variables in continuous time Markov chains and dynamic data", SIAM Journal on Scientific Computing, 39(1), B76-B101 [BibTeX] (2017)

In this paper we build on the anisotropic diffusion map (ADM) framework for detecting intrinsic slow variables in high-dimensional dynamic data. We develop and analyze a new ADM-Langevin approach, which has the potential of being more efficient than the existing ADM method for a large class of chemical systems. The contribution of our paper is three-fold. First, we are able to improve the computational efficiency of the existing ADM approach and avoid the very computationally intensive step of running local bursts simulations by using an approximation of the chemical Langevin equation. Second, we exploit the spectrum of each local covariance matrix, and locally build a sparse ellipsoid-like neighborhood graph at each point in the data set, as opposed to computing the similarity between all pairs of points in the domain, thus making our approach scalable to networks with thousands or even tens of thousands of nodes. As our third and main contribution, we propose a robust approximation method for the mapping from the observable space to the dynamically meaningful inaccessible space, which allows us to investigate a Markov-Chain based approach to estimating the stationary distribution of the detected slow variable, without any a-priori knowledge of it. Along the way, we derive analytically an expression for the conditional distribution of the fast variable given the slow variable, thus rendering our approach entirely deterministic, as opposed to existing approaches that rely on stochastic simulation algorithms.

Keywords: Anisotropic diffusion maps, random-walk Laplacian, eigenvectors, chemical networks, intrinsic variables, stochastic simulation algorithms, separation of scale, slow variables, stationary distribution, high dimensional data, manifold learning, Langevin equation, complex dynamical systems.

 (17) M. Cucuringu, C. Marshak, D. Montag, P. Rombach "Rank Aggregation for Course Sequence Discovery", Complex Networks [BibTeX] (2017)

Abstract: This work extends the rank aggregation framework for the setting of discovering optimal course sequences at the university level, and contributes to the literature on educational applications of network analysis. Each student provides a partial ranking of the courses taken throughout her or his undergraduate career. We build a network of courses by computing pairwise rank comparisons between courses based on the order students typically take them, and aggregate the results over the entire student population, to obtain a proxy for the rank offset between pairs of courses. We extract a global ranking of the courses via several state-of-the art al- gorithms for ranking with pairwise noisy information, including SerialRank, Rank Centrality, and the recent SyncRank based on the group synchronization problem. We test this application of rank aggregation on 15 years of student data from the Department of Mathematics at the University of California, Los Angeles (UCLA). Furthermore, we experiment with the above approach on different subsets of the stu- dent population conditioned on final GPA, and highlight several differences in the obtained rankings that uncover potential hidden pre-requisites in the Mathematics curriculum. Keywords: Temporal rank aggregation, spectral methods, educational data, ranking algorithms.
 (16) M. Cucuringu, "Sync-Rank: Robust Ranking, Constrained Ranking and Rank Aggregation via Eigenvector and SDP Synchronization", IEEE Transactions on Network Science and Engineering, 3 (1): 58-79, (2016). Compact version here. [BibTeX] (2016)

We consider the classic problem of establishing a statistical ranking of a set of n items given a set of inconsistent and incomplete pairwise comparisons between such items. Instantiations of this problem occur in numerous applications in data analysis (e.g., ranking teams in sports data), computer vision, and machine learning. We formulate the above problem of ranking with incomplete noisy information as an instance of the group synchronization problem over the group SO(2) of planar rotations, whose usefulness has been demonstrated in numerous applications in recent years in computer vision and graphics, sensor network localization and structural biology. Its least squares solution can be approximated by either a spectral or a semidefinite programming (SDP) relaxation, followed by a rounding procedure. We show extensive numerical simulations on both synthetic and real-world data sets (Premier League soccer games, a Halo 2 game tournament and NCAA College Basketball games), which show that our proposed method compares favorably to other ranking methods from the recent literature. Existing theoretical guarantees on the group synchronization problem imply lower bounds on the largest amount of noise permissible in the data while still achieving exact recovery of the ground truth ranking. We propose a similar synchronization-based algorithm for the rank-aggregation problem, which integrates in a globally consistent ranking many pairwise rank-offsets or partial rankings, given by different rating systems on the same set of items, an approach which yields significantly more accurate results than other aggregation methods, including Rank-Centrality, a recent state-of-the-art algorithm. Furthermore, we discuss the problem of semi-supervised ranking when there is available information on the ground truth rank of a subset of players, and propose an algorithm based on SDP which is able to recover the ranking of the remaining players, subject to such hard constraints. Finally, synchronization-based ranking, combined with a spectral technique for the densest subgraph problem, makes it possible to extract locally-consistent partial rankings, in other words, to identify the rank of a small subset of players whose pairwise rank comparisons are less noisy than the rest of the data, which other methods are not able to identify. We discuss a number of related open questions and variations of the ranking problem in other settings, which we defer for future investigation.

Keywords: ranking, angular synchronization, spectral algorithms, semidefinite programming, rank aggregation, partial rankings, least squares, singular value decomposition, densest subgraph problem.

 (15) M. Cucuringu, I. Koutis, S. Chawla, G. Miller, and R. Peng, "Simple and Scalable Constrained Clustering: A Generalized Spectral Method", AISTATS 2016 (Artificial Intelligence and Statistics Conference) [BibTeX] (2016)

We present a principled spectral approach to the well-studied constrained clustering problem. It reduces clustering to a generalized eigenvalue problem on Laplacians. The method works in nearly-linear time and provides concrete guarantees for the quality of the clusters, at least for the case of 2-way partitioning. In practice this translates to a very fast implementation that consistently outperforms existing spectral approaches. We support this claim with experiments on various data sets: our approach recovers correct clusters in examples where previous methods fail, and handles data sets with millions of data points - two orders of magnitude larger than before.

Keywords: constrained clustering, generalized eigenvalues problem, Laplacian linear systems

 (14) M. Cucuringu, M. P. Rombach, S. H. Lee, M. A. Porter, "Detection of Core-Periphery Structure in Networks Using Spectral Methods and Geodesic Paths", European Journal of Applied Mathematics, Vol. 27, No. 6: 846-887 [BibTeX] (2016)

We introduce several novel and computationally ecient methods for detecting “core-periphery structure” in networks. Core- periphery structure is a type of meso-scale structure that includes densely-connected core vertices and sparsely-connected peripheral vertices. Core vertices are well-connected both among themselves and to peripheral vertices, which are not well-connected to any vertices. Our first method, which is based on transportation in networks, aggregates information from many geodesic paths in a network and yields a score for each vertex that reflects the likelihood that that vertex is a core vertex. Our second method is based on a low-rank approximation of the adjacency matrix of the network, which can often be expressed as a tensor-product matrix. Our third approach uses the bottom eigenvector of the random-walk Laplacian to infer a coreness score and a classification into core and peripheral vertices. Additionally, we design an objective function to (1) help classify vertices into core or peripheral vertices and (2) provide a goodness-of-fit criterion for classifications into core versus peripheral vertices. To examine the performance of our methods, we apply our algorithms to both synthetically-generated networks and a variety of real-world data sets.

Keywords: Networks, core-periphery structure, shortest-path algorithms, low-rank matrix approximations, graph Laplacians, spectral algorithms.

 (13) M. Cucuringu, J. Woodworth, "Point Localization and Density Estimation from Ordinal kNN Graphs Using Synchronization", 2015 IEEE Machine Learning for Signal Processing Workshop (Short version) [BibTeX] (2015)

We consider the problem of embedding unweighted, directed k-nearest neighbor graphs in low-dimensional Euclidean space. The k-nearest neighbors of each vertex provides ordinal information on the distances between points, but not the distances themselves. We use this ordinal information along with the low-dimensionality to recover the coordinates of the points up to arbitrary similarity transformations (rigid transformations and scaling). Furthermore, we also illustrate the possibility of robustly recovering the underlying density via the Total Variation Maximum Penalized Likelihood Estimation (TV-MPLE) method. We make existing approaches scalable by using an instance of a local-to-global algorithm based on group synchronization, recently proposed in the literature in the context of sensor network localization and structural biology, which we augment with a scaling synchronization step. We demonstrate the scalability of our approach on large graphs, and show how it compares to the Local Ordinal Embedding (LOE) algorithm, which was recently proposed for recovering the configuration of a cloud of points from pairwise ordinal comparisons between a sparse set of distances.

Keywords: k-nearest-neighbor graphs, ordinal constraints, graph embeddings, density estimation, eigenvector synchronization, linear programming, divide-and-conquer.

 (12) M. Cucuringu, "Synchronization over Z2 and community detection in multiplex signed networks with constraints", Journal of Complex Networks, 3 (3):469-506 [BibTeX] (2015)

Finding group elements from noisy measurements of their pairwise ratios is also known as the synchronization problem, first introduced in the context of the group SO(2) of planar rotations. The usefulness of synchronization over the group Z2 has been demonstrated in recent algorithms for localization of sensor networks and three-dimensional structuring of molecules. In this paper, we focus on synchronization over Z2 and consider the problem of identifying communities in a multiplex network when the interaction between the nodes is described by a signed (and possibly weighted) measure of similarity, and when the multiplex network has a natural separation into two components. In the setting when one has the additional information that certain subsets of nodes represent the same (unknown) group element, we consider and compare several algorithms for the synchronization problem over Z2, based on spectral and semidefinite programming relaxations (SDP), and message passing algorithms. In other words, all nodes within such a subset represent the same unknown group element, and one has available noisy pairwise measurements between pairs of nodes that belong to different non-overlapping subsets. Following a recent analysis of the eigenvector method for synchronization over SO(2), we analyze the robustness to noise of the eigenvector method for synchronization over Z2, when the underlying graph of pairwise measurements is the Erdos-Renyi random graph, using results from random matrix theory on the largest eigenvalue of rank-1 deformation of large random matrices. We also propose a message passing synchronization algorithm, inspired by the standard belief propagation algorithm for approximating marginal distributions, that outperforms the existing eigenvector synchronization algorithm only for certain classes of graphs and noise models, and enjoys the flexibility of incorporating additional (possibly non-convex) constraints that may not be easily accommodated by any of the other spectral or SDP-based methods. We apply the synchronization methods both to synthetic models and a real data set of roll call voting patterns in the U.S. Congress across time, to identify the two existing communities, i.e., the Democratic and Republican parties, and discuss a number of related open problems.
 (11) S. H. Lee, M. Cucuringu, M. A. Porter, "Density-Based and Transport-Based Core-Periphery Structures in Networks", Physical Review E, Vol. 89, No. 3: 032810 [BibTeX] (2014)

Networks often possess mesoscale structures, and studying them can yield insights into both structure and function. It is most common to study community structure, but numerous other types of mesoscale structures also exist. In this paper, we examine core-periphery structures based on both density and transport. In such structures, core network components are well-connected both among themselves and to peripheral components, which are not well-connected to anything. We examine core-periphery structures in a wide range of examples of transportation, social, and financial networks—including road networks in large urban areas, a rabbit warren, a dolphin social network, a European interbank network, and a migration network between counties in the United States. We illustrate that a recently developed transport-based notion of node coreness is very useful for characterizing transportation networks. We also generalize this notion to examine core versus peripheral edges, and we show that the resulting diagnostic is also useful for transportation networks. To examine the properties of transportation networks further, we develop a family of generative models of roadlike networks. We illustrate the effect of the dimensionality of the embedding space on transportation networks, and we demonstrate that the correlations between different measures of coreness can be very different for different types of networks.

 (10) M. Cucuringu, A. Singer, D. Cowburn, "Eigenvector Synchronization, Graph Rigidity and the Molecule Problem", Information and Inference: A Journal of the IMA, 1 (1), pp. 2167 [BibTeX] (2012)

The graph realization problem has received a great deal of attention in recent years, due to its importance in applications such as wireless sensor networks and structural biology. In this paper, we extend on previous work and propose the 3D-ASAP algorithm, for the graph realization problem in $\mathbb{R}^3$, given a sparse and noisy set of distance measurements. 3D-ASAP is a divide and conquer, non-incremental and non-iterative algorithm, which integrates local distance information into a global structure determination. Our approach starts with identifying, for every node, a subgraph of its 1-hop neighborhood graph, which can be accurately embedded in its own coordinate system. In the noise-free case, the computed coordinates of the sensors in each patch must agree with their global positioning up to some unknown rigid motion, that is, up to translation, rotation and possibly reflection. In other words, to every patch there corresponds an element of the Euclidean group Euc(3) of rigid transformations in $\mathbb{R}^3$, and the goal is to estimate the group elements that will properly align all the patches in a globally consistent way. Furthermore, 3D-ASAP successfully incorporates information specific to the molecule problem in structural biology, in particular information on known substructures and their orientation. In addition, we also propose 3D-SP-ASAP, a faster version of 3D-ASAP, which uses a spectral partitioning algorithm as a preprocessing step for dividing the initial graph into smaller subgraphs. Our extensive numerical simulations show that 3D-ASAP and 3D-SP-ASAP are very robust to high levels of noise in the measured distances and to sparse connectivity in the measurement graph, and compare favorably to similar state-of-the art localization algorithms.

Keywords: Graph realization, distance geometry, eigenvectors, synchronization, spectral graph theory, rigidity theory, SDP, the molecule problem, divide-and-conquer.

 (9) M. Cucuringu, V. Blondel, P. Van Dooren, "Extracting spatial information from networks with low-order eigenvectors", Physical Review E 87, 032803 [BibTeX] (2013)

We consider the problem of inferring meaningful spatial information in networks from incomplete information on the connection intensity between the nodes of the network. We consider two spatially distributed networks: a population migration flow network within the US, and a network of mobile phone calls between cities in Belgium. For both networks we use the eigenvectors of the Laplacian matrix constructed from the link intensities to obtain informative visualizations and capture natural geographical subdivisions. We observe that some low order eigenvectors localize very well and seem to reveal small geographically cohesive regions that match remarkably well with political and administrative boundaries. We discuss possible explanations for this observation by describing diffusion maps and localized eigenfunctions. In addition, we discuss a possible connection with the weighted graph cut problem, and provide numerical evidence supporting the idea that lower order eigenvectors point out local cuts in the network. However, we do not provide a formal and rigorous justification for our observations.

Keywords: Eigenvector localization, diffusion maps, visualization of large data sets, social-economic networks, community detection, gravity model.

 (8) M. Cucuringu, Y. Lipman , A. Singer, "Sensor network localization by eigenvector synchronization over the Euclidean group", ACM Transactions on Sensor Networks, 8 (3), pp. 1-42 [BibTeX] (2012)

We present a new approach to localization of sensors from noisy measurements of a subset of their Euclidean distances. Our algorithm starts by finnding, embedding and aligning uniquely realizable subsets of neighboring sensors called patches. In the noise-free case, each patch agrees with its global positioning up to an unknown rigid motion of translation, rotation and possibly reflection. The reflections and rotations are estimated using the recently developed eigenvector synchronization algorithm, while the translations are estimated by solving an overdetermined linear system. The algorithm is scalable as the number of nodes increases, and can be implemented in a distributed fashion. Extensive numerical experiments show that it compares favorably to other existing algorithms in terms of robustness to noise, sparse connectivity and running time. While our approach is applicable to higher dimensions, in the current paper we focus on the two dimensional case.

Keywords: Sensor networks, distance geometry, eigenvectors, synchronization, rigidity theory, spectral graph theory.

 (7) M. Cucuringu, M. W. Mahoney, "Localization on low-order eigenvectors of data matrices", Technical Report (arXiv) [BibTeX] (2011)

Eigenvector localization refers to the situation when most of the components of an eigenvector are zero or near-zero. This phenomenon has been observed on eigenvectors associated with extremal eigenvalues, and in many of those cases it can be meaningfully interpreted in terms of "structural heterogeneities" in the data. For example, the largest eigenvectors of adjacency matrices of large complex networks often have most of their mass localized on high-degree nodes; and the smallest eigenvectors of the Laplacians of such networks are often localized on small but meaningful community-like sets of nodes. Here, we describe localization associated with low-order eigenvectors, i.e., eigenvectors corresponding to eigenvalues that are not extremal but that are \buried" further down in the spectrum. Although we have observed it in several unrelated applications, this phenomenon of low-order eigenvector localization defies common intuitions and simple explanations, and it creates serious difficulties for the applicability of popular eigenvector-based machine learning and data analysis tools. After describing two examples where low-order eigenvector localization arises, we present a very simple model that qualitatively reproduces several of the empirically-observed results. This model suggests certain coarse structural similarities among the seemingly-unrelated applications where we have observed low-order eigenvector localization, and it may be used as a diagnostic tool to help extract insight from data graphs when such low-order eigenvector localization is present.
 (6) F. Blanchet-Sadri, E. Allen, C. Byrum, M. Cucuringu and R. Mercas, "Counting Bordered Partial Words by Critical Positions" , The Electronic Journal of Combinatorics, Vol. 18, pp. 138 [BibTeX] (2011)

A partial word, sequence over a finite alphabet that may have some undefined positions or holes, is bordered if one of its proper prefixes is compatible with one of its suffixes. The number theoretical problem of enumerating all bordered full words (the ones without holes) of a fixed length n over an alphabet of a fixed size k is well known. In this paper, we investigate the number of bordered partial words having h holes with the parameters k, n. It turns out that all borders of a full word are simple, and so every bordered full word has a unique minimal border no longer than half its length. Counting bordered partial words is made extremely more difficult by the failure of that combinatorial property since there is now the possibility of a minimal border that is nonsimple. Here, we give recursive formulas based on our approach of the so-called simple and nonsimple critical positions.

Keywords: Theory of formal languages; Number theory; Combinatorics on words; Partial words; Bordered partial words; Simple border; Simply bordered partial words; Critical positions

 (5) F. Blanchet-Sadri, M. Cucuringu, "Counting primitive partial words", Journal of Automata, Languages and Combinatorics 15, 3/4, 199-227 [BibTeX] (2010)

A word is primitive if it is not a power of another word. The number of primitive words of a fixed length over an alphabet of a fixed size is well known and relates to the M¨obius function. In this paper, we investigate the number of primitive partial words which are strings that may contain “do not know” symbols.

Keywords: Combinatorics on words; Words; Partial words; Primitive words; Primitive partial words; Mobius function; Periods; Exact periods.

 (4) M. Cucuringu, J. Puente, and D. Shue, "Model Selection in Undirected Graphical Models with Elastic Net ", Technical Report (arXiv) [BibTeX] (2010)
Structure learning in random fields has attracted considerable attention due to its difficulty and importance in areas such as remote sensing, computational biology, natural language processing, protein networks, and social network analysis. We consider the problem of estimating the probabilistic graph structure associated with a Gaussian Markov Random Field (GMRF), the Ising model and the Potts model, by extending previous work on l1 regularized neighborhood estimation to include the elastic net l1 + l2 penalty. Additionally, we show numerical evidence that the edge density plays a role in the graph recovery process. Finally, we introduce a novel method for augmenting neighborhood estimation by leveraging pair-wise neighborhood union estimates.

 (3) A. Singer, M. Cucuringu, "Uniqueness of Low-Rank Matrix Completion by Rigidity Theory", SIAM Journal on Matrix Analysis and Applications, 31 (4), pp. 1621-1641 [BibTeX] (2010)

The problem of completing a low-rank matrix from a subset of its entries is often encountered in the analysis of incomplete data sets exhibiting an underlying factor model with applications in collaborative filtering, computer vision, and control. Most recent work has been focused on constructing efficient algorithms for exact or approximate recovery of the missing matrix entries and proving lower bounds for the number of known entries that guarantee a successful recovery with high probability. A related problem from both the mathematical and algorithmic points of view is the distance geometry problem of realizing points in a Euclidean space from a given subset of their pairwise distances. Rigidity theory answers basic questions regarding the uniqueness of the realization satisfying a given partial set of distances. We observe that basic ideas and tools of rigidity theory can be adapted to determine uniqueness of low-rank matrix completion, where inner products play the role that distances play in rigidity theory. This observation leads to efficient randomized algorithms for testing necessary and sufficient conditions for local completion and for testing sufficient conditions for global completion. Crucial to our analysis is a new matrix, which we call the completion matrix, that serves as the analogue of the rigidity matrix.

Keywords: low-rank matrices, missing values, rigidity theory, iterative methods, collaborative filtering

 (2) M. Cucuringu, R. Strichartz, "Infinitesimal Resistance Metrics on Sierpinski Gasket Type Fractals", Analysis, Vol. 28, Issue 3, pp. 319-331 [BibTeX] (2008)

We prove the existence of an infinitesimal resistance metric on the Sierpinksi gasket (SG) at boundary points, junction points and periodic points. This is a renormalized limit of the effective resistance metric as we zoom in on the point, and satisfies a self-similar identity. We obtain similar results on PCF fractals with three boundary points.
 (1) M. Cucuringu, R. Strichartz, "Self-Similar Energy Forms on the Sierpinski Gasket with Twists", Potential Analysis, Volume 27, Issue 1, pp. 45-60 [BibTeX] (2007)

By introducing twists into the iterated function system that defines the Sierpinski gasket, we are able to construct a unique regular energy form that satisfies a self-similar identity with any prescribed projective weights. Our construction is explicit (involving finding a root of a 4th order polynomial), and we are able to find explicitly a polynomial identity for the algebraic variety containing the smooth manifold of admissible weights. Without the twists, there are obstructions to existence, and a complete description due to Sabot is quite complicated.