Mihai Cucuringu

Alan Turing Research Fellow

Department of Statistics + Mathematical Institute

University of Oxford

Alan Turing Institute

[Homepage]      [Research]     [CV]      [Personal]      [Links]    

  (18) M. Cucuringu, R. Erban "ADM-CLE approach for detecting slow variables in continuous time Markov chains and dynamic data", SIAM Journal on Scientific Computing, 39(1), B76-B101 (2017), (arXiv:1504.01786)
In this paper we build on the anisotropic diffusion map (ADM) framework for detecting intrinsic slow variables in high-dimensional dynamic data. We develop and analyze a new ADM-Langevin approach, which has the potential of being more efficient than the existing ADM method for a large class of chemical systems. The contribution of our paper is three-fold. First, we are able to improve the computational efficiency of the existing ADM approach and avoid the very computationally intensive step of running local bursts simulations by using an approximation of the chemical Langevin equation. Second, we exploit the spectrum of each local covariance matrix, and locally build a sparse ellipsoid-like neighborhood graph at each point in the data set, as opposed to computing the similarity between all pairs of points in the domain, thus making our approach scalable to networks with thousands or even tens of thousands of nodes. As our third and main contribution, we propose a robust approximation method for the mapping from the observable space to the dynamically meaningful inaccessible space, which allows us to investigate a Markov-Chain based approach to estimating the stationary distribution of the detected slow variable, without any a-priori knowledge of it. Along the way, we derive analytically an expression for the conditional distribution of the fast variable given the slow variable, thus rendering our approach entirely deterministic, as opposed to existing approaches that rely on stochastic simulation algorithms.

Key words. Anisotropic diffusion maps, random-walk Laplacian, eigenvectors, chemical networks, intrinsic variables, stochastic simulation algorithms, separation of scale, slow variables, stationary distribution, high dimensional data, manifold learning, Langevin equation, complex dynamical systems.

  (17) M. Cucuringu, "Sync-Rank: Robust Ranking, Constrained Ranking and Rank Aggregation via Eigenvector and SDP Synchronization", IEEE Transactions on Network Science and Engineering, 3 (1): 58-79, (2016), arXiv:1504.01070, (arXiv:1504.01070)
We consider the classic problem of establishing a statistical ranking of a set of n items given a set of inconsistent and incomplete pairwise comparisons between such items. Instantiations of this problem occur in numerous applications in data analysis (e.g., ranking teams in sports data), computer vision, and machine learning. We formulate the above problem of ranking with incomplete noisy information as an instance of the group synchronization problem over the group SO(2) of planar rotations, whose usefulness has been demonstrated in numerous applications in recent years in computer vision and graphics, sensor network localization and structural biology. Its least squares solution can be approximated by either a spectral or a semidefinite programming (SDP) relaxation, followed by a rounding procedure. We show extensive numerical simulations on both synthetic and real-world data sets (Premier League soccer games, a Halo 2 game tournament and NCAA College Basketball games), which show that our proposed method compares favorably to other ranking methods from the recent literature. Existing theoretical guarantees on the group synchronization problem imply lower bounds on the largest amount of noise permissible in the data while still achieving exact recovery of the ground truth ranking. We propose a similar synchronization-based algorithm for the rank-aggregation problem, which integrates in a globally consistent ranking many pairwise rank-offsets or partial rankings, given by different rating systems on the same set of items, an approach which yields significantly more accurate results than other aggregation methods, including Rank-Centrality, a recent state-of-the-art algorithm. Furthermore, we discuss the problem of semi-supervised ranking when there is available information on the ground truth rank of a subset of players, and propose an algorithm based on SDP which is able to recover the ranking of the remaining players, subject to such hard constraints. Finally, synchronization-based ranking, combined with a spectral technique for the densest subgraph problem, makes it possible to extract locally-consistent partial rankings, in other words, to identify the rank of a small subset of players whose pairwise rank comparisons are less noisy than the rest of the data, which other methods are not able to identify. We discuss a number of related open questions and variations of the ranking problem in other settings, which we defer for future investigation.

Key words. ranking, angular synchronization, spectral algorithms, semidefinite programming, rank aggregation, partial rankings, least squares, singular value decomposition, densest subgraph problem.

  (16) M. Cucuringu, I. Koutis, S. Chawla, G. Miller, and R. Peng, "Simple and Scalable Constrained Clustering: A Generalized Spectral Method", AISTATS 2016 (Artificial Intelligence and Statistics Conference) (arXiv:1601.04746)
We present a principled spectral approach to the well-studied constrained clustering problem. It reduces clustering to a generalized eigenvalue problem on Laplacians. The method works in nearly-linear time and provides concrete guarantees for the quality of the clusters, at least for the case of 2-way partitioning. In practice this translates to a very fast implementation that consistently outperforms existing spectral approaches. We support this claim with experiments on various data sets: our approach recovers correct clusters in examples where previous methods fail, and handles data sets with millions of data points - two orders of magnitude larger than before.

Key words. constrained clustering, generalized eigenvalues problem, Laplacian linear systems

  (15) M. Cucuringu, M. P. Rombach, S. H. Lee, M. A. Porter, "Detection of Core-Periphery Structure in Networks Using Spectral Methods and Geodesic Paths", European Journal of Applied Mathematics, Vol. 27, No. 6: 846-887 (2016), (arXiv:1410.6572), (arXiv:1410.6572)
We introduce several novel and computationally ecient methods for detecting “core-periphery structure” in networks. Core- periphery structure is a type of meso-scale structure that includes densely-connected core vertices and sparsely-connected peripheral vertices. Core vertices are well-connected both among themselves and to peripheral vertices, which are not well-connected to any vertices. Our first method, which is based on transportation in networks, aggregates information from many geodesic paths in a network and yields a score for each vertex that reflects the likelihood that that vertex is a core vertex. Our second method is based on a low-rank approximation of the adjacency matrix of the network, which can often be expressed as a tensor-product matrix. Our third approach uses the bottom eigenvector of the random-walk Laplacian to infer a coreness score and a classification into core and peripheral vertices. Additionally, we design an objective function to (1) help classify vertices into core or peripheral vertices and (2) provide a goodness-of-fit criterion for classifications into core versus peripheral vertices. To examine the performance of our methods, we apply our algorithms to both synthetically-generated networks and a variety of real-world data sets.

Key words. Networks, core-periphery structure, shortest-path algorithms, low-rank matrix approximations, graph Laplacians, spectral algorithms.

  (14) M. Cucuringu, J. Woodworth, "Point Localization and Density Estimation from Ordinal kNN graphs using Synchronization", submitted, (2015), (arXiv:1504.00722); Conference version accepted to 2015 IEEE Machine Learning for Signal Processing Workshop (Boston, 2015)
We consider the problem of embedding unweighted, directed k-nearest neighbor graphs in low-dimensional Euclidean space. The k-nearest neighbors of each vertex provides ordinal information on the distances between points, but not the distances themselves. We use this ordinal information along with the low-dimensionality to recover the coordinates of the points up to arbitrary similarity transformations (rigid transformations and scaling). Furthermore, we also illustrate the possibility of robustly recovering the underlying density via the Total Variation Maximum Penalized Likelihood Estimation (TV-MPLE) method. We make existing approaches scalable by using an instance of a local-to-global algorithm based on group synchronization, recently proposed in the literature in the context of sensor network localization and structural biology, which we augment with a scaling synchronization step. We demonstrate the scalability of our approach on large graphs, and show how it compares to the Local Ordinal Embedding (LOE) algorithm, which was recently proposed for recovering the configuration of a cloud of points from pairwise ordinal comparisons between a sparse set of distances.

Key words. k-nearest-neighbor graphs, ordinal constraints, graph embeddings, density estimation, eigenvector synchronization, linear programming, divide-and-conquer.

 (13) M. Cucuringu, "Synchronization over Z2 and community detection in multiplex signed networks with constraints", Journal of Complex Networks, 3 (3):469-506, (2015)

Finding group elements from noisy measurements of their pairwise ratios is also known as the synchronization problem, first introduced in the context of the group SO(2) of planar rotations. The usefulness of synchronization over the group Z2 has been demonstrated in recent algorithms for localization of sensor networks and three-dimensional structuring of molecules. In this paper, we focus on synchronization over Z2 and consider the problem of identifying communities in a multiplex network when the interaction between the nodes is described by a signed (and possibly weighted) measure of similarity, and when the multiplex network has a natural separation into two components. In the setting when one has the additional information that certain subsets of nodes represent the same (unknown) group element, we consider and compare several algorithms for the synchronization problem over Z2, based on spectral and semidefinite programming relaxations (SDP), and message passing algorithms. In other words, all nodes within such a subset represent the same unknown group element, and one has available noisy pairwise measurements between pairs of nodes that belong to different non-overlapping subsets. Following a recent analysis of the eigenvector method for synchronization over SO(2), we analyze the robustness to noise of the eigenvector method for synchronization over Z2, when the underlying graph of pairwise measurements is the Erdos-Renyi random graph, using results from random matrix theory on the largest eigenvalue of rank-1 deformation of large random matrices. We also propose a message passing synchronization algorithm, inspired by the standard belief propagation algorithm for approximating marginal distributions, that outperforms the existing eigenvector synchronization algorithm only for certain classes of graphs and noise models, and enjoys the flexibility of incorporating additional (possibly non-convex) constraints that may not be easily accommodated by any of the other spectral or SDP-based methods. We apply the synchronization methods both to synthetic models and a real data set of roll call voting patterns in the U.S. Congress across time, to identify the two existing communities, i.e., the Democratic and Republican parties, and discuss a number of related open problems.
  (12) S. H. Lee, M. Cucuringu, M. A. Porter, "Density-Based and Transport-Based Core-Periphery Structures in Networks", Physical Review E, Vol. 89, No. 3: 032810 (2014), (arXiv:1310.8387)

Networks often possess mesoscale structures, and studying them can yield insights into both structure and function. It is most common to study community structure, but numerous other types of mesoscale structures also exist. In this paper, we examine core-periphery structures based on both density and transport. In such structures, core network components are well-connected both among themselves and to peripheral components, which are not well-connected to anything. We examine core-periphery structures in a wide range of examples of transportation, social, and financial networks—including road networks in large urban areas, a rabbit warren, a dolphin social network, a European interbank network, and a migration network between counties in the United States. We illustrate that a recently developed transport-based notion of node coreness is very useful for characterizing transportation networks. We also generalize this notion to examine core versus peripheral edges, and we show that the resulting diagnostic is also useful for transportation networks. To examine the properties of transportation networks further, we develop a family of generative models of roadlike networks. We illustrate the effect of the dimensionality of the embedding space on transportation networks, and we demonstrate that the correlations between different measures of coreness can be very different for different types of networks.

 (11) M. Cucuringu, A. Singer, D. Cowburn, "Eigenvector Synchronization, Graph Rigidity and the Molecule Problem", Information and Inference: A Journal of the IMA, 1 (1), pp. 2167 (2012), arXiv:1111.3304, (arXiv)

The graph realization problem has received a great deal of attention in recent years, due to its importance in applications such as wireless sensor networks and structural biology. In this paper, we extend on previous work and propose the 3D-ASAP algorithm, for the graph realization problem in $\mathbb{R}^3$, given a sparse and noisy set of distance measurements. 3D-ASAP is a divide and conquer, non-incremental and non-iterative algorithm, which integrates local distance information into a global structure determination. Our approach starts with identifying, for every node, a subgraph of its 1-hop neighborhood graph, which can be accurately embedded in its own coordinate system. In the noise-free case, the computed coordinates of the sensors in each patch must agree with their global positioning up to some unknown rigid motion, that is, up to translation, rotation and possibly reflection. In other words, to every patch there corresponds an element of the Euclidean group Euc(3) of rigid transformations in $\mathbb{R}^3$, and the goal is to estimate the group elements that will properly align all the patches in a globally consistent way. Furthermore, 3D-ASAP successfully incorporates information specific to the molecule problem in structural biology, in particular information on known substructures and their orientation. In addition, we also propose 3D-SP-ASAP, a faster version of 3D-ASAP, which uses a spectral partitioning algorithm as a preprocessing step for dividing the initial graph into smaller subgraphs. Our extensive numerical simulations show that 3D-ASAP and 3D-SP-ASAP are very robust to high levels of noise in the measured distances and to sparse connectivity in the measurement graph, and compare favorably to similar state-of-the art localization algorithms.

Keywords: Graph realization, distance geometry, eigenvectors, synchronization, spectral graph theory, rigidity theory, SDP, the molecule problem, divide-and-conquer.

 (10) M. Cucuringu, V. Blondel, P. Van Dooren, "Extracting spatial information from networks with low-order eigenvectors", Physical Review E 87, 032803 (2013), arXiv:1111.0920, (arXiv)

We consider the problem of inferring meaningful spatial information in networks from incomplete information on the connection intensity between the nodes of the network. We consider two spatially distributed networks: a population migration flow network within the US, and a network of mobile phone calls between cities in Belgium. For both networks we use the eigenvectors of the Laplacian matrix constructed from the link intensities to obtain informative visualizations and capture natural geographical subdivisions. We observe that some low order eigenvectors localize very well and seem to reveal small geographically cohesive regions that match remarkably well with political and administrative boundaries. We discuss possible explanations for this observation by describing diffusion maps and localized eigenfunctions. In addition, we discuss a possible connection with the weighted graph cut problem, and provide numerical evidence supporting the idea that lower order eigenvectors point out local cuts in the network. However, we do not provide a formal and rigorous justification for our observations.

Keywords: Eigenvector localization, diffusion maps, visualization of large data sets, social-economic networks, community detection, gravity model.

 (9) M. Cucuringu, Y. Lipman , A. Singer, "Sensor network localization by eigenvector synchronization over the Euclidean group", ACM Transactions on Sensor Networks, 8 (3), pp. 1-42 (2012)

We present a new approach to localization of sensors from noisy measurements of a subset of their Euclidean distances. Our algorithm starts by finnding, embedding and aligning uniquely realizable subsets of neighboring sensors called patches. In the noise-free case, each patch agrees with its global positioning up to an unknown rigid motion of translation, rotation and possibly reflection. The reflections and rotations are estimated using the recently developed eigenvector synchronization algorithm, while the translations are estimated by solving an overdetermined linear system. The algorithm is scalable as the number of nodes increases, and can be implemented in a distributed fashion. Extensive numerical experiments show that it compares favorably to other existing algorithms in terms of robustness to noise, sparse connectivity and running time. While our approach is applicable to higher dimensions, in the current paper we focus on the two dimensional case.

Keywords: Sensor networks, distance geometry, eigenvectors, synchronization, rigidity theory, spectral graph theory.

 (8) M. Cucuringu, M. W. Mahoney, "Localization on low-order eigenvectors of data matrices", Technical Report, arXiv:1109.1355 (2011) (arXiv)

Eigenvector localization refers to the situation when most of the components of an eigenvector are zero or near-zero. This phenomenon has been observed on eigenvectors associated with extremal eigenvalues, and in many of those cases it can be meaningfully interpreted in terms of "structural heterogeneities" in the data. For example, the largest eigenvectors of adjacency matrices of large complex networks often have most of their mass localized on high-degree nodes; and the smallest eigenvectors of the Laplacians of such networks are often localized on small but meaningful community-like sets of nodes. Here, we describe localization associated with low-order eigenvectors, i.e., eigenvectors corresponding to eigenvalues that are not extremal but that are \buried" further down in the spectrum. Although we have observed it in several unrelated applications, this phenomenon of low-order eigenvector localization defies common intuitions and simple explanations, and it creates serious difficulties for the applicability of popular eigenvector-based machine learning and data analysis tools. After describing two examples where low-order eigenvector localization arises, we present a very simple model that qualitatively reproduces several of the empirically-observed results. This model suggests certain coarse structural similarities among the seemingly-unrelated applications where we have observed low-order eigenvector localization, and it may be used as a diagnostic tool to help extract insight from data graphs when such low-order eigenvector localization is present.
 (7) F. Blanchet-Sadri, E. Allen, C. Byrum, M. Cucuringu and R. Mercas, "Counting Bordered Partial Words by Critical Positions" , The Electronic Journal of Combinatorics, Vol. 18, 2011, #P138.

A partial word, sequence over a finite alphabet that may have some undefined positions or holes, is bordered if one of its proper prefixes is compatible with one of its suffixes. The number theoretical problem of enumerating all bordered full words (the ones without holes) of a fixed length n over an alphabet of a fixed size k is well known. In this paper, we investigate the number of bordered partial words having h holes with the parameters k, n. It turns out that all borders of a full word are simple, and so every bordered full word has a unique minimal border no longer than half its length. Counting bordered partial words is made extremely more difficult by the failure of that combinatorial property since there is now the possibility of a minimal border that is nonsimple. Here, we give recursive formulas based on our approach of the so-called simple and nonsimple critical positions.

Keywords: Theory of formal languages; Number theory; Combinatorics on words; Partial words; Bordered partial words; Simple border; Simply bordered partial words; Critical positions

 (6) F. Blanchet-Sadri, M. Cucuringu, "Counting primitive partial words", Journal of Automata, Languages and Combinatorics 15 (2010) 3/4, 199-227.
(Chapter 6 in book by F. Blanchet-Sadri, "Algorithmic Combinatorics on Partial Words", Chapman & Hall/CRC Press, Boca Raton, FL, 2008)

A word is primitive if it is not a power of another word. The number of primitive words of a fixed length over an alphabet of a fixed size is well known and relates to the M¨obius function. In this paper, we investigate the number of primitive partial words which are strings that may contain “do not know” symbols.

Keywords: Combinatorics on words; Words; Partial words; Primitive words; Primitive partial words; Mobius function; Periods; Exact periods.

 (5) M. Cucuringu, J. Puente, and D. Shue, "Model Selection in Undirected Graphical Models with Elastic Net ", Technical Report, 2010, arXiv: 1111.0559 (arXiv)

Structure learning in random fields has attracted considerable attention due to its difficulty and importance in areas such as remote sensing, computational biology, natural language processing, protein networks, and social network analysis. We consider the problem of estimating the probabilistic graph structure associated with a Gaussian Markov Random Field (GMRF), the Ising model and the Potts model, by extending previous work on l1 regularized neighborhood estimation to include the elastic net l1 + l2 penalty. Additionally, we show numerical evidence that the edge density plays a role in the graph recovery process. Finally, we introduce a novel method for augmenting neighborhood estimation by leveraging pair-wise neighborhood union estimates.

 (4) A. Singer, M. Cucuringu, "Uniqueness of Low-Rank Matrix Completion by Rigidity Theory", SIAM Journal on Matrix Analysis and Applications, 31 (4), pp. 1621-1641 (2010) (arXiv)

The problem of completing a low-rank matrix from a subset of its entries is often encountered in the analysis of incomplete data sets exhibiting an underlying factor model with applications in collaborative filtering, computer vision, and control. Most recent work has been focused on constructing efficient algorithms for exact or approximate recovery of the missing matrix entries and proving lower bounds for the number of known entries that guarantee a successful recovery with high probability. A related problem from both the mathematical and algorithmic points of view is the distance geometry problem of realizing points in a Euclidean space from a given subset of their pairwise distances. Rigidity theory answers basic questions regarding the uniqueness of the realization satisfying a given partial set of distances. We observe that basic ideas and tools of rigidity theory can be adapted to determine uniqueness of low-rank matrix completion, where inner products play the role that distances play in rigidity theory. This observation leads to efficient randomized algorithms for testing necessary and sufficient conditions for local completion and for testing sufficient conditions for global completion. Crucial to our analysis is a new matrix, which we call the completion matrix, that serves as the analogue of the rigidity matrix.

Keywords: low-rank matrices, missing values, rigidity theory, iterative methods, collaborative filtering

 (3) F. Blanchet-Sadri, M. Cordier, M. Cucuringu and R. Kirsch, "Combinatorics on Border Correlations of Partial Words", International Conference on Automata, Languages and Related Topics, Debrecen, Hungary, October 21-24, 2008

In this paper, we consider the border sets of partial words of length n over a finite alphabet, and study the combinatorics of specific representations of them, called border correlations, which are binary vectors of length n indicating the borders. We characterize precisely which of these vectors are valid border correlations. We establish a one-to-one correspondence between the set of valid border correlations and the set of valid ternary correlations of a given length, the latter being ternary vectors representing the strong and strictly weak period sets. This gives an efficient algorithm for generating the valid ternary correlations of a given length. It turns out that the set of border correlations of partial words over an arbitrary alphabet of cardinality at least two is the same as the set of border correlations of partial words over the binary alphabet. We also give a correspondence between the ternary correlation of a partial word and its refined border correlation which specifies the lengths of all the word’s bordered cyclic shifts’ minimal borders. Finally, we investigate the population size, that is, the number of partial words sharing a given border correlation, and obtain formulas to compute it. Our approach is based on the fact that the sets of all border correlations of a given length form distributive lattices under suitably defined partial orderings.

Keywords: Theory of formal languages; Combinatorics on words; Partial words; Border correlations; Refined border correlations; Ternary correlations; Population size.

 (2) M. Cucuringu, R. Strichartz, "Infinitesimal Resistance Metrics on Sierpinski Gasket Type Fractals"- Analysis , Vol. 28, Issue 3 (2008), page 319-331

We prove the existence of an infinitesimal resistance metric on the Sierpinksi gasket (SG) at boundary points, junction points and periodic points. This is a renormalized limit of the effective resistance metric as we zoom in on the point, and satisfies a self-similar identity. We obtain similar results on PCF fractals with three boundary points.
 (1) M. Cucuringu, R. Strichartz, "Self-Similar Energy Forms on the Sierpinski Gasket with Twists" - Potential Analysis, 27 ( Aug 2006), page 45-60

By introducing twists into the iterated function system that defines the Sierpinski gasket, we are able to construct a unique regular energy form that satisfies a self-similar identity with any prescribed projective weights. Our construction is explicit (involving finding a root of a 4th order polynomial), and we are able to find explicitly a polynomial identity for the algebraic variety containing the smooth manifold of admissible weights. Without the twists, there are obstructions to existence, and a complete description due to Sabot is quite complicated.

Undergraduate Theses

© 2017 Mihai Cucuringu