An Introduction to Topological Data Analysis: Fundamental and Practical Aspects for Data Scientists

Frédéric Chazal

1 Inria Saclay - Île-de-France Research Centre, Palaiseau, France

Bertrand Michel

2 Ecole Centrale de Nantes, Nantes, France


With the recent explosion in the amount, the variety, and the dimensionality of available data, identifying, extracting, and exploiting their underlying structure has become a problem of fundamental importance for data analysis and statistical learning. Topological data analysis ( tda ) is a recent and fast-growing field providing a set of new topological and geometric tools to infer relevant features for possibly complex data. It proposes new well-founded mathematical theories and computational tools that can be used independently or in combination with other data analysis and statistical learning techniques. This article is a brief introduction, through a few selected topics, to basic fundamental and practical aspects of tda for nonexperts.

1 Introduction and Motivation

Topological data analysis (tda) is a recent field that emerged from various works in applied (algebraic) topology and computational geometry during the first decade of the century. Although one can trace geometric approaches to data analysis quite far back into the past, tda really started as a field with the pioneering works of Edelsbrunner et al. (2002) and Zomorodian and Carlsson (2005) in persistent homology and was popularized in a landmark article (Carlsson, 2009). tda is mainly motivated by the idea that topology and geometry provide a powerful approach to infer robust qualitative, and sometimes quantitative, information about the structure of data [e.g., Chazal (2017)].

tda aims at providing well-founded mathematical, statistical, and algorithmic methods to infer, analyze, and exploit the complex topological and geometric structures underlying data that are often represented as point clouds in Euclidean or more general metric spaces. During the last few years, a considerable effort has been made to provide robust and efficient data structures and algorithms for tda that are now implemented, available, and easy to use through standard libraries such as the GUDHI library 1 (C++ and Python) Maria et al. (2014) and its R software interface Fasy et al. (2014a), Dionysus 2, PHAT 3, DIPHA 4, or Giotto 5. Although it is still rapidly evolving, tda now provides a set of mature and efficient tools that can be used in combination with, or complementarily to, other data science tools.

The Topological Data Analysis Pipeline

tda has recently seen developments in various directions and application fields, and there now exists a large variety of methods inspired by topological and geometric approaches. Providing a complete overview of all these approaches is beyond the scope of this introductory survey. However, many standard ones rely on the following basic pipeline, which will serve as the backbone of this article and is illustrated by a short code sketch after the list:

  • 1. The input is assumed to be a finite set of points coming with a notion of distance—or similarity—between them. This distance can be induced by the metric in the ambient space (e.g., the Euclidean metric when the data are embedded in R d ) or comes as an intrinsic metric defined by a pairwise distance matrix. The definition of the metric on the data is usually given as an input or guided by the application. It is, however, important to notice that the choice of the metric may be critical to revealing interesting topological and geometric features of the data.
  • 2. A “continuous” shape is built on top of the data in order to highlight the underlying topology or geometry. This is often a simplicial complex or a nested family of simplicial complexes, called a filtration, which reflects the structure of the data on different scales. Simplicial complexes can be seen as higher-dimensional generalizations of neighboring graphs that are classically built on top of data in many standard data analysis or learning algorithms. The challenge here is to define structures that provably reflect relevant information about the structure of the data and that can be effectively constructed and manipulated in practice.
  • 3. Topological or geometric information is extracted from the structures built on the top of the data. This may either result in a full reconstruction, typically a triangulation, of the shape underlying the data from which topological/geometric features can be easily extracted or in crude summaries or approximations from which the extraction of relevant information requires specific methods, such as persistent homology. Beyond the identification of interesting topological/geometric information and its visualization and interpretation, the challenge at this step is to show its relevance, in particular its stability with respect to perturbations or the presence of noise in the input data. For that purpose, understanding the statistical behavior of the inferred features is also an important question.
  • 4. The extracted topological and geometric information provides new families of features and descriptors of the data. They can be used to better understand the data—in particular, through visualization—or they can be combined with other kinds of features for further analysis and machine learning tasks. This information can also be used to design well-suited data analysis and machine learning models. Showing the added value and the complementarity (with respect to other features) of the information provided using tda tools is an important question at this step.
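
As an illustration of these four steps, here is a minimal sketch in Python using the GUDHI library introduced below; the synthetic point cloud and all parameter values are our own illustrative choices, not prescriptions from the text.

```python
import numpy as np
import gudhi

# Step 1: input data with a notion of distance (here, the Euclidean
# metric on a noisy sample of the unit circle in R^2).
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 100)
points = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0.0, 0.05, (100, 2))

# Step 2: build a nested family of simplicial complexes (a filtration)
# on top of the data; here, the Vietoris-Rips filtration.
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
simplex_tree = rips.create_simplex_tree(max_dimension=2)

# Step 3: extract topological information, here persistent homology.
diagram = simplex_tree.persistence()

# Step 4: the diagram is a new multiscale descriptor of the data that
# can be visualized or fed to further analysis and learning tasks.
print(diagram[:5])  # a few (dimension, (birth, death)) pairs
```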

Topological Data Analysis and Statistics

Until quite recently, the theoretical aspects of TDA and topological inference mostly relied on deterministic approaches. These deterministic approaches do not take into account the random nature of data and the intrinsic variability of the topological quantity they infer. Consequently, most of the corresponding methods remain exploratory, without being able to efficiently distinguish between information and what is sometimes called the “topological noise” (see Section 6.2 further in the article).

A statistical approach to TDA means that we consider data as generated from an unknown distribution but also that the topological features inferred using TDA methods are seen as estimators of topological quantities describing an underlying object. Under this approach, the unknown object usually corresponds to the support of the data distribution (or part of it). The main goals of a statistical approach to topological data analysis can be summarized as the following list of problems:

  • Topic 1: proving consistency and studying the convergence rates of TDA methods.
  • Topic 2: providing confidence regions for topological features and discussing the significance of the estimated topological quantities.
  • Topic 3: selecting relevant scales on which the topological phenomenon should be considered, as a function of observed data.
  • Topic 4: dealing with outliers and providing robust methods for TDA.

Applications of Topological Data Analysis in Data Science

On the application side, many recent promising and successful results have demonstrated the interest in topological and geometric approaches in an increasing number of fields such as material science ( Kramar et al., 2013 ; Nakamura et al., 2015 ; Pike et al., 2020 ), 3D shape analysis ( Skraba et al., 2010 ; Turner et al., 2014b ), image analysis ( Qaiser et al., 2019 ; Rieck et al., 2020 ), multivariate time series analysis ( Khasawneh and Munch, 2016 ; Seversky et al., 2016 ; Umeda, 2017 ), medicine ( Dindin et al., 2020 ), biology ( Yao et al., 2009 ), genomics ( Carrière and Rabadán, 2020 ), chemistry ( Lee et al., 2017 ; Smith et al., 2021 ), sensor networks De Silva and Ghrist (2007) , or transportation ( Li et al., 2019 ), to name a few. It is beyond our scope to give an exhaustive list of applications of tda . On the other hand, most of the successes of tda result from its combination with other analysis or learning techniques (see Section 6.5 for a discussion and references). So, clarifying the position and complementarity of tda with respect to other approaches and tools in data science is also an important question and an active research domain.

The overall objective of this survey article is two-fold. First, it intends to provide data scientists with a brief and comprehensive introduction to the mathematical and statistical foundations of tda. For that purpose, the focus is put on a few selected, but fundamental, tools and topics: simplicial complexes (Section 2) and their use for exploratory topological data analysis (Section 3), geometric inference (Section 4), and persistent homology theory (Section 5), which play a central role in tda. Second, this article also aims at demonstrating how, thanks to recent software progress, tda tools can be easily applied in data science. In particular, we show how the Python version of the GUDHI library allows us to easily implement and use the tda tools presented in this article (Section 7). Our goal is to quickly provide the data scientist with a few basic keys (and relevant references) so that they can get a clear understanding of the basics of tda and be able to start using tda methods and software on their own problems and data.

Other reviews on tda can be found in the literature and are complementary to our work. Wasserman (2018) presented a statistical view of tda, focusing in particular on the connections between tda and density clustering. Sizemore et al. (2019) proposed a survey about the applications of tda to neuroscience. Finally, Hensel et al. (2021) proposed a recent overview of applications of tda to machine learning.

2 Metric Spaces, Covers, and Simplicial Complexes

As topological and geometric features are usually associated with continuous spaces, data represented as finite sets of observations do not directly reveal any topological information per se . A natural way to highlight some topological structure out of data is to “connect” data points that are close to each other in order to exhibit a global continuous shape underlying the data. Quantifying the notion of closeness between data points is usually done using a distance (or a dissimilarity measure), and it often turns out to be convenient in tda to consider data sets as discrete metric spaces or as samples of metric spaces. This section introduces general concepts for geometric and topological inference; a more complete presentation of the topic is given in the study by Boissonnat et al. (2018).

Metric Spaces

Recall that a metric space ( M , ρ ) is a set M with a function ρ : M × M → R + , called a distance, such that for any x , y , z ∈ M , the following is the case:

  • i) ρ ( x , y ) ≥ 0 and ρ ( x , y ) = 0 if and only if x = y ,
  • ii) ρ ( x , y ) = ρ ( y , x ), and
  • iii) ρ ( x , z ) ≤ ρ ( x , y ) + ρ ( y , z ).

Given a metric space (M, ρ), the set K(M) of its compact subsets can be endowed with the so-called Hausdorff distance; given two compact subsets A, B ⊆ M, the Hausdorff distance d_H(A, B) between A and B is defined as the smallest nonnegative number δ such that for any a ∈ A, there exists b ∈ B such that ρ(a, b) ≤ δ, and for any b ∈ B, there exists a ∈ A such that ρ(a, b) ≤ δ (see Figure 1). In other words, if for any compact subset C ⊆ M, we denote by d(., C) : M → R+ the distance function to C defined by d(x, C) := inf_{c ∈ C} ρ(x, c) for any x ∈ M, then one can prove that the Hausdorff distance between A and B is given by either of the two following equalities:

$$ d_H(A, B) = \max\Big\{ \sup_{a \in A} d(a, B),\ \sup_{b \in B} d(b, A) \Big\} = \sup_{x \in M} \big| d(x, A) - d(x, B) \big|. $$

Figure 1. Left: the Hausdorff distance between two subsets A and B of the plane. In this example, d_H(A, B) is the distance between the point a in A that is farthest from B and its nearest neighbor b in B. Right: the Gromov–Hausdorff distance between A and B. A can be rotated (an isometric embedding of A in the plane) to reduce its Hausdorff distance to B. As a consequence, d_GH(A, B) ≤ d_H(A, B).

It is a basic and classical result that the Hausdorff distance is indeed a distance on the set of compact subsets of a metric space. From a tda perspective, it provides a convenient way to quantify the proximity between different data sets issued from the same ambient metric space. However, it sometimes occurs that one has to compare data sets that are not sampled from the same ambient space. Fortunately, the notion of the Hausdorff distance can be generalized to the comparison of any pair of compact metric spaces, giving rise to the notion of the Gromov–Hausdorff distance.
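
As a small aside not in the original article, the Hausdorff distance between two finite point clouds can be computed with SciPy, whose directed_hausdorff function returns the one-sided distance; the symmetric distance is the maximum of the two directed ones.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = A + 0.1  # a small perturbation of A (illustrative data)

d_AB = directed_hausdorff(A, B)[0]  # sup over a in A of d(a, B)
d_BA = directed_hausdorff(B, A)[0]  # sup over b in B of d(b, A)
print(max(d_AB, d_BA))              # the Hausdorff distance d_H(A, B)
```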

Two compact metric spaces, ( M 1 , ρ 1 ) and ( M 2 , ρ 2 ), are isometric if there exists a bijection ϕ : M 1 → M 2 that preserves distances, that is, ρ 2 ( ϕ ( x ), ϕ ( y )) = ρ 1 ( x , y ) for any x , y ∈ M 1 . The Gromov–Hausdorff distance measures how far two metric spaces are from being isometric.

Definition 1. The Gromov–Hausdorff distance d GH ( M 1 , M 2 ) between two compact metric spaces is the infimum of the real numbers r ≥ 0 such that there exists a metric space ( M , ρ ) and two compact subspaces C 1 and C 2 ⊂ M that are isometric to M 1 and M 2 and such that d H ( C 1 , C 2 ) ≤ r.

The Gromov–Hausdorff distance will be used later, in Section 5, for the study of stability properties of persistence diagrams.

Connecting pairs of nearby data points by edges leads to the standard notion of the neighboring graph, from which the connectivity of the data can be analyzed, for example, using clustering algorithms. To go beyond connectivity, a central idea in TDA is to build higher-dimensional equivalents of neighboring graphs by connecting not only pairs but also (k + 1)-tuples of nearby data points. The resulting objects, called simplicial complexes, allow us to identify new topological features such as cycles, voids, and their higher-dimensional counterparts.

Geometric and Abstract Simplicial Complexes

Simplicial complexes can be seen as higher-dimensional generalizations of graphs. They are mathematical objects that are both topological and combinatorial, a property making them particularly useful for tda .

Given a set X = { x 0 , … , x k } ⊂ R d of k + 1 affinely independent points, the k -dimensional simplex σ = [ x 0 , … , x k ] spanned by X is the convex hull of X . The points of X are called the vertices of σ , and the simplices spanned by the subsets of X are called the faces of σ . A geometric simplicial complex K in R d is a collection of simplices such that the following are the case:

  • (i) any face of a simplex of K is a simplex of K and
  • (ii) the intersection of any two simplices of K is either empty or a common face of both.

The union of the simplices of K is a subset of R d called the underlying space of K that inherits from the topology of R d . So, K can also be seen as a topological space through its underlying space. Notice that once its vertices are known, K is fully characterized by the combinatorial description of a collection of simplices satisfying some incidence rules.

Given a set V , an abstract simplicial complex with the vertex set V is a set K ~ of finite subsets of V such that the elements of V belong to K ~ and for any σ ∈ K ~ , any subset of σ belongs to K ~ . The elements of K ~ are called the faces or the simplices of K ~ . The dimension of an abstract simplex is just its cardinality minus 1 and the dimension of K ~ is the largest dimension of its simplices. Notice that simplicial complexes of dimension 1 are graphs.
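
To make the closure condition of this definition concrete, here is a toy Python sketch (our own illustration; representing simplices as frozensets is an arbitrary encoding, not a standard API).

```python
from itertools import combinations

def is_simplicial_complex(K):
    """Check that every nonempty proper subset of every simplex is in K."""
    return all(
        frozenset(face) in K
        for simplex in K
        for k in range(1, len(simplex))
        for face in combinations(simplex, k)
    )

# The boundary of a triangle: three vertices and three edges.
K = {frozenset(s) for s in [(0,), (1,), (2,), (0, 1), (1, 2), (0, 2)]}
print(is_simplicial_complex(K))                            # True
print(is_simplicial_complex(K | {frozenset((0, 1, 2))}))   # True (full triangle)
print(is_simplicial_complex({frozenset((0, 1, 2))}))       # False (faces missing)
```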

The combinatorial description of any geometric simplicial complex K obviously gives rise to an abstract simplicial complex K̃. The converse is also true: one can always associate with an abstract simplicial complex K̃ a topological space |K̃| such that if K is a geometric complex whose combinatorial description is the same as that of K̃, the underlying space of K is homeomorphic to |K̃|. Such a K is called a geometric realization of K̃. As a consequence, abstract simplicial complexes can be seen as topological spaces, and geometric complexes can be seen as geometric realizations of their underlying combinatorial structure. So, one can consider simplicial complexes at the same time as combinatorial objects that are well suited for effective computations and as topological spaces from which topological properties can be inferred.

Building Simplicial Complexes From Data

Given a data set, or more generally, a topological or metric space, there exist many ways to build simplicial complexes. We present here a few classical examples that are widely used in practice.

A first example is an immediate extension of the notion of the α-neighboring graph. Assume that we are given a set of points X in a metric space (M, ρ) and a real number α ≥ 0. The Vietoris–Rips complex Rips_α(X) is the set of simplices [x_0, …, x_k] such that ρ(x_i, x_j) ≤ α for all (i, j) (see Figure 2). It follows immediately from the definition that this is an abstract simplicial complex. However, in general, even when X is a finite subset of R^d, Rips_α(X) does not admit a geometric realization in R^d; in particular, it can be of a dimension higher than d.

Figure 2. The Čech complex Cech_α(X) (left) and the Vietoris–Rips complex Rips_{2α}(X) (right) of a finite point cloud in the plane R². The bottom part of Cech_α(X) is the union of two adjacent triangles, while the bottom part of Rips_{2α}(X) is the tetrahedron spanned by the four vertices and all its faces. The dimension of the Čech complex is 2; the dimension of the Vietoris–Rips complex is 3, so the latter is not embedded in R².

Closely related to the Vietoris–Rips complex is the Čech complex Cech_α(X), defined as the set of simplices [x_0, …, x_k] such that the k + 1 closed balls B(x_i, α) have a non-empty intersection (see Figure 2). Notice that these two complexes are related by

$$ \mathrm{Rips}_{\alpha}(X) \subseteq \mathrm{Cech}_{\alpha}(X) \subseteq \mathrm{Rips}_{2\alpha}(X), $$

and that if X ⊂ R^d, then Cech_α(X) and Rips_{2α}(X) have the same one-dimensional skeleton, that is, the same set of vertices and edges.
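
The remark in Figure 2, that the Vietoris–Rips complex of a planar point set need not be embeddable in R², can be checked directly with GUDHI; the four-point configuration and scale below are illustrative choices.

```python
import gudhi

# Four points at the corners of a unit square: all pairwise distances
# are at most sqrt(2) < 1.5, so Rips_1.5 contains the full tetrahedron.
square = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
rips = gudhi.RipsComplex(points=square, max_edge_length=1.5)
st = rips.create_simplex_tree(max_dimension=3)
print(st.dimension())  # 3: a three-dimensional complex built on planar points
```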

The Nerve Theorem

The Čech complex is a particular case of a family of complexes associated with covers. Given a cover U = (U_i)_{i∈I} of M, that is, a family of sets U_i such that M = ∪_{i∈I} U_i, the nerve of U is the abstract simplicial complex C(U) whose vertices are the U_i's and such that

$$ \sigma = [U_{i_0}, \ldots, U_{i_k}] \in C(U) \quad \text{if and only if} \quad \bigcap_{j=0}^{k} U_{i_j} \neq \emptyset. $$

Given a cover of a data set, where each set of the cover can be, for example, a local cluster or a grouping of data points sharing some common properties, its nerve provides a compact and global combinatorial description of the relationship between these sets through their intersection patterns (see Figure 3 ).

Figure 3. A point cloud sampled in the plane and a cover of open sets for this point cloud (left). The nerve of this cover is a triangle (right). A vertex corresponds to a set of the cover, whereas an edge corresponds to a non-empty intersection between two sets of the cover.

A fundamental theorem in algebraic topology relates, under some assumptions, the topology of the nerve of a cover to the topology of the union of the sets of the cover. To be formally stated, this result, known as the Nerve theorem, requires the introduction of a few notions.

Two topological spaces, X and Y, are usually considered as being the same from a topological point of view if they are homeomorphic, that is, if there exist two continuous bijective maps f : X → Y and g : Y → X such that f ∘ g and g ∘ f are the identity maps of Y and X, respectively. In many cases, asking X and Y to be homeomorphic turns out to be too strong a requirement to ensure that X and Y share the same topological features of interest for tda. Two continuous maps f_0, f_1 : X → Y are said to be homotopic if there exists a continuous map H : X × [0, 1] → Y such that for any x ∈ X, H(x, 0) = f_0(x) and H(x, 1) = f_1(x). The spaces X and Y are then said to be homotopy equivalent if there exist two maps, f : X → Y and g : Y → X, such that f ∘ g and g ∘ f are homotopic to the identity maps of Y and X, respectively. The maps f and g are then called homotopy equivalences. The notion of homotopy equivalence is weaker than the notion of homeomorphism; if X and Y are homeomorphic, then they are obviously homotopy equivalent, but the converse is not true. However, spaces that are homotopy equivalent still share many topological invariants; in particular, they have the same homology (see Section 4).

A space is said to be contractible if it is homotopy equivalent to a point. Basic examples of contractible spaces are the balls and, more generally, the convex sets in R^d. Open covers for which all elements and all their nonempty intersections are contractible have the following remarkable property.

Theorem 1 (Nerve theorem). Let U = ( U i ) i ∈ I be a cover of a topological space X by open sets such that the intersection of any subcollection of the U i ’s is either empty or contractible. Then, X and the nerve C ( U ) are homotopy equivalent.

It is easy to verify that convex subsets of Euclidean spaces are contractible. As a consequence, if U = ( U i ) i ∈ I is a collection of convex subsets of R d , then C ( U ) and ∪ i ∈ I U i are homotopy equivalent. In particular, if X is a set of points in R d , then the Čech complex Cech α ( X ) is homotopy equivalent to the union of balls ∪ x ∈ X B ( x , α ) .

The Nerve theorem plays a fundamental role in tda ; it provides a way to encode the topology of continuous spaces into abstract combinatorial structures that are well suited for the design of effective data structures and algorithms.

3 Using Covers and Nerves for Exploratory Data Analysis and Visualization: The Mapper Algorithm

Using the nerve of covers as a way to summarize, visualize, and explore data is a natural idea that was first proposed for tda in the study by Singh et al. (2007) , giving rise to the so-called Mapper algorithm.

Definition 2. Let f : X → R^d, d ≥ 1, be a continuous function and let U = (U_i)_{i∈I} be a cover of R^d. The pull-back cover of X induced by (f, U) is the collection of open sets (f^{-1}(U_i))_{i∈I}. The refined pull-back is the collection of connected components of the open sets f^{-1}(U_i), i ∈ I.

The idea of the Mapper algorithm is, given a data set X and a well-chosen real-valued function f : X → R d , to summarize X through the nerve of the refined pull-back of a cover U of f ( X ) (see Figure 4A ). For well-chosen covers U (see below), this nerve is a graph providing an easy and convenient way to visualize the summary of the data. It is described in Algorithm 1 and illustrated on a simple example in Figure 4B .

Figure 4. (A) The refined pull-back cover of the height function on a surface in R³ and its nerve. (B) The Mapper algorithm on a point cloud sampled around a circle, with the height function as filter. First, the pull-back cover of the height function defined on the point cloud is computed and refined (left). Second, the nerve of the refined pull-back is visualized as a graph (right).

Algorithm 1. The Mapper algorithm.

The Mapper algorithm is very simple (see Algorithm 1), but it raises several questions about the various choices left to the user, which we briefly discuss in the following.

The Choice of f

The choice of the function f, sometimes called the filter or lens function, strongly depends on the features of the data that one expects to highlight. The following are among those classically encountered in the literature:

  • Density estimates: the Mapper complex may help to understand the structure and connectivity of high-density areas (clusters).
  • PCA coordinates, coordinate functions obtained from a nonlinear dimensionality reduction (NLDR) technique, or eigenfunctions of graph Laplacians may help to reveal and understand ambiguities in the use of nonlinear dimensionality reductions.
  • The centrality function f(x) = ∑_{y∈X} d(x, y) and the eccentricity function f(x) = max_{y∈X} d(x, y) sometimes turn out to be good choices that do not require any specific knowledge about the data.
  • For data sampled around one-dimensional filamentary structures, the distance function to a given point allows us to recover the underlying topology of the filamentary structures Chazal et al. (2015d).

The Choice of the Cover U

When f is a real-valued function, a standard choice is to take U to be a set of regularly spaced intervals of equal length, r > 0, covering the set f ( X ) . The real r is sometimes called the resolution of the cover, and the percentage g of overlap between two consecutive intervals is called the gain of the cover. Note that if the gain g is chosen below 50 % , then every point of the real line is covered by, at most, 2 open sets of U , and the output nerve is a graph. It is important to notice that the output of Mapper is very sensitive to the choice of U , and small changes in the resolution and gain parameters may result in very large changes in the output, making the method very unstable. A classical strategy consists in exploring some range of parameters and selecting the ones that turn out to provide the most informative output from the user perspective.

The Choice of the Clusters

The Mapper algorithm requires the clustering of the preimages of the open sets U ∈ U. There are two strategies to compute the clusters. A first strategy consists in applying, for each U ∈ U, a clustering algorithm, chosen by the user, to the preimage f^{-1}(U). A second, more global, strategy consists in building a neighboring graph on top of the data set X, for example, a k-NN graph or an ε-graph, and, for each U ∈ U, taking the connected components of the subgraph with the vertex set f^{-1}(U).
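
To make the whole procedure concrete, here is a minimal, self-contained Mapper sketch following the steps above, with a height filter, a cover by regularly spaced overlapping intervals, and DBSCAN as the user-chosen clustering algorithm; the function name, the parameter values, and the choice of DBSCAN are our own illustrative assumptions, and more complete implementations are available in standard tda software.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def mapper_graph(X, f, n_intervals=10, gain=0.3, eps=0.3):
    """Toy Mapper: nerve of the refined pull-back of an interval cover."""
    fmin, fmax = f.min(), f.max()
    length = (fmax - fmin) / n_intervals              # the resolution r
    members = []
    for i in range(n_intervals):
        lo = fmin + i * length - gain * length        # intervals enlarged
        hi = fmin + (i + 1) * length + gain * length  # by the gain g
        idx = np.where((f >= lo) & (f <= hi))[0]
        if len(idx) == 0:
            continue
        # Refine the pull-back: cluster each preimage separately.
        labels = DBSCAN(eps=eps, min_samples=1).fit_predict(X[idx])
        members += [set(idx[labels == lab]) for lab in set(labels)]
    # Nerve: one vertex per cluster, an edge when two clusters share points.
    edges = [(a, b) for a in range(len(members))
             for b in range(a + 1, len(members)) if members[a] & members[b]]
    return members, edges

# Point cloud around a circle, with the height (y-coordinate) as filter.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(theta), np.sin(theta)]
nodes, edges = mapper_graph(X, f=X[:, 1])
print(len(nodes), "nodes,", len(edges), "edges")  # expect a cycle-like graph
```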

Theoretical and Statistical Aspects of Mapper

Based on the results on stability and the structure of Mapper proposed in the study by Carrière and Oudot (2017) , advances toward a statistically well-founded version of Mapper have been made recently in the study by Carriere et al. (2018) . Unsurprisingly, the convergence of Mapper depends on both the sampling of the data and the regularity of the filter function. Moreover, subsampling strategies can be proposed to select a complex in a Rips filtration on a convenient scale, as well as the resolution and the gain for defining the Mapper graph. The case of stochastic and multivariate filters has also been studied by Carrière and Michel (2019) . An alternative description of the probabilistic convergence of Mapper, in terms of categorification, has also been proposed in the study by Brown et al. (2020) . Other approaches have been proposed to study and deal with the instabilities of the Mapper algorithm in the works of Dey et al. (2016) , Dey et al. (2017) .

Data Analysis With Mapper

As an exploratory data analysis tool, Mapper has been successfully used for clustering and feature selection. The idea is to identify specific structures in the Mapper graph (or complex), in particular, loops and flares. These structures are then used to identify interesting clusters or to select features or variables that best discriminate the data in these structures. Applications on real data, illustrating these techniques, may be found, for example, in the studies by Carrière and Rabadán (2020) , Lum et al. (2013) , Yao et al. (2009) .

4 Geometric Reconstruction and Homology Inference

Another way to build covers and use their nerves to exhibit the topological structure of data is to consider the union of balls centered on the data points. In this section, we assume that X_n = {x_0, …, x_n} is a subset of R^d sampled i.i.d. according to a probability measure μ with compact support M ⊂ R^d. The general strategy to infer topological information about M from μ proceeds in two steps that are discussed in the following part of this section:

  • 1. X n is covered by a union of balls of a fixed radius centered on the x i ’s. Under some regularity assumptions on M , one can relate the topology of this union of balls to the one of M and
  • 2. from a practical and algorithmic perspective, topological features of M are inferred from the nerve of the union of balls, using the Nerve theorem.

In this framework, it is indeed possible to compare spaces through isotopy equivalence, a stronger notion than homeomorphism: X ⊆ R^d and Y ⊆ R^d are said to be (ambient) isotopic if there exists a continuous map H : [0, 1] × R^d → R^d such that for any t ∈ [0, 1], H_t = H(t, .) : R^d → R^d is a homeomorphism, H_0 is the identity map of R^d, and H_1(X) = Y. Obviously, if X and Y are isotopic, then they are homeomorphic. The converse is not true: a knotted circle and an unknotted circle in R³ are homeomorphic but not isotopic (notice that although this claim seems rather intuitive, its formal proof requires nonobvious algebraic topology tools).

4.1 Distance-Like Functions and Reconstruction

Given a compact subset K of R^d and a nonnegative real number r, the union of balls of radius r centered on K, K^r = ∪_{x∈K} B(x, r), called the r-offset of K, is the r-sublevel set of the distance function d_K : R^d → R defined by d_K(x) = inf_{y∈K} ‖x − y‖; in other words, K^r = d_K^{-1}([0, r]). This remark allows us to use differential properties of distance functions and to compare the topology of the offsets of compact sets that are close to each other with respect to the Hausdorff distance.
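
As a quick numerical illustration (ours, not the article's), the identity K^r = d_K^{-1}([0, r]) can be checked by evaluating the distance function on a grid with a k-d tree; the point set, grid, and radius are arbitrary.

```python
import numpy as np
from scipy.spatial import cKDTree

K = np.array([[0.0, 0.0], [1.0, 0.0]])  # a compact (finite) set in R^2
tree = cKDTree(K)

# Sample d_K on a grid; thresholding at r gives a discretization of the
# r-offset K^r = d_K^{-1}([0, r]).
xs, ys = np.meshgrid(np.linspace(-1, 2, 100), np.linspace(-1, 1, 100))
grid = np.c_[xs.ravel(), ys.ravel()]
d_K = tree.query(grid)[0]   # distance from each grid point to K
r = 0.4
print((d_K <= r).sum(), "grid points fall in the r-offset")
```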

Definition 3 (Hausdorff distance in R^d). The Hausdorff distance between two compact subsets K, K′ of R^d is defined by

$$ d_H(K, K') = \| d_K - d_{K'} \|_{\infty} = \sup_{x \in \mathbb{R}^d} | d_K(x) - d_{K'}(x) |. $$

In our setting, the considered compact sets are the data set X_n and the support M of the measure μ. When M is a smooth compact submanifold, under mild conditions on d_H(X_n, M), for some well-chosen r, the offsets of X_n are homotopy equivalent to M Chazal and Lieutier (2008), Niyogi et al. (2008) (see Figure 5 for an illustration). These results extend to larger classes of compact sets and lead to stronger results on the inference of the isotopy type of the offsets of M Chazal et al. (2009c), Chazal et al. (2009d). They also lead to results on the estimation of other geometric and differential quantities such as normals Chazal et al. (2009c), curvatures Chazal et al. (2009e), or boundary measures Chazal et al. (2010) under assumptions on the Hausdorff distance between the underlying shape and the data sample.

Figure 5. An example of a point cloud X_n sampled on the surface of a torus in R³ (top left) and its offsets for different values of radii r_1 < r_2 < r_3. For well-chosen values of the radius (e.g., r_1 and r_2), the offsets are clearly homotopy equivalent to a torus.

These results rely on the 1-semiconcavity of the squared distance function d_K², that is, the convexity of the map x → ‖x‖² − d_K²(x), and can be naturally stated in the following general framework.

Definition 4. A function ϕ : R d → R + is distance-like if it is proper (the preimage of any compact set in R is a compact set in R d ) and x → ‖ x ‖ 2 − ϕ 2 ( x ) is convex.

Thanks to its semiconcavity, a distance-like function ϕ has a well-defined, but not continuous, gradient ∇ ϕ : R d → R d that can be integrated into a continuous flow ( Petrunin, 2007 ) that allows us to track the evolution of the topology of its sublevel sets and to compare it to one of the sublevel sets of close distance-like functions.

Definition 5. Let ϕ be a distance-like function and let ϕ r = ϕ −1 ([0, r ]) be the r-sublevel set of ϕ.

  • • A point x ∈ R d is called α-critical if ‖∇ x ϕ ‖ ≤ α. The corresponding value r = ϕ ( x ) is also said to be α-critical.
  • • The weak feature size of ϕ at r is the minimum r ′ > 0 such that ϕ does not have any critical value between r and r + r ′ . We denote it by wfs ϕ ( r ) . For any 0 < α < 1 , the α-reach of ϕ is the maximum r such that ϕ −1 ((0, r ]) does not contain any α-critical point.

The weak feature size wfs_ϕ(r) (resp. the α-reach) measures the regularity of ϕ around its r-level set (resp. its 0-level set). When ϕ = d_K is the distance function to a compact set K ⊂ R^d, the 1-reach coincides with the classical reach from geometric measure theory Federer (1959). Its estimation from random samples has been studied by Aamari et al. (2019). An important property of distance-like functions is that the topology of their sublevel sets ϕ^r can only change when r crosses a 0-critical value.

Lemma 1 (Isotopy lemma, Grove (1993)). Let ϕ be a distance-like function and let r_1 < r_2 be two positive numbers such that ϕ has no 0-critical point, that is, no point x such that ∇ϕ(x) = 0, in the subset ϕ^{-1}([r_1, r_2]). Then all the sublevel sets ϕ^{-1}([0, r]) are isotopic for r ∈ [r_1, r_2].

As an immediate consequence of the isotopy lemma, all the sublevel sets of ϕ between r and r + wfs ϕ ( r ) have the same topology. Now the following reconstruction theorem from Chazal et al. (2011b) provides a connection between the topology of the sublevel sets of close distance-like functions.

Theorem 2 (Reconstruction theorem). Let ϕ, ψ be two distance-like functions such that ‖ϕ − ψ‖_∞ < ε, with reach_α(ϕ) ≥ R for some positive ε and α. Then, for every r ∈ [4ε/α², R − 3ε] and every η ∈ (0, R), the sublevel sets ψ^r and ϕ^η are homotopy equivalent when

$$ \varepsilon \leq \frac{R}{5 + 4/\alpha^2}. $$

Under similar but slightly more technical conditions, the Reconstruction theorem can be extended to prove that the sublevel sets are indeed homeomorphic and even isotopic (Chazal et al., 2009c; Chazal et al., 2008).

Coming back to our setting and taking ϕ = d_M and ψ = d_{X_n}, the distance functions to the support M of the measure μ and to the data set X_n, the condition reach_α(d_M) ≥ R can be interpreted as a regularity condition on M 6. The Reconstruction theorem combined with the Nerve theorem tells us that, for well-chosen values of r and η, the η-offsets of M are homotopy equivalent to the nerve of the union of balls of radius r centered on X_n, that is, the Čech complex Cech_r(X_n).

From a statistical perspective, the main advantage of these results involving the Hausdorff distance is that the estimation of the considered topological quantities boils down to support estimation questions that have been widely studied (see Section 4.3 ).

4.2 Homology Inference

The above results provide a mathematically well-founded framework to infer the topology of shapes from a simplicial complex built on top of an approximating finite sample. However, from a more practical perspective, it raises two issues. First, the Reconstruction theorem requires both a regularity assumption through the α-reach condition, which may not always be satisfied, and the choice of a radius r for the balls used to build the Čech complex Cech_r(X_n). Second, Cech_r(X_n) provides a topologically faithful summary of the data through a simplicial complex that is usually not well suited for further data processing; one often needs topological descriptors that are easier to handle, in particular numerical ones, that can easily be computed from the complex. This second issue is addressed by considering the homology of the considered simplicial complexes in the next paragraph, while the first issue will be addressed in the next section with the introduction of persistent homology.

Homology in a Nutshell

Homology is a classical concept in algebraic topology, providing a powerful tool to formalize and handle the notion of the topological features of a topological space or of a simplicial complex in an algebraic way. For any dimension k , the k -dimensional “holes” are represented by a vector space H k , whose dimension is intuitively the number of such independent features. For example, the zero-dimensional homology group H 0 represents the connected components of the complex, the one-dimensional homology group H 1 represents the one-dimensional loops, the two-dimensional homology group H 2 represents the two-dimensional cavities, and so on.

To avoid technical subtleties and difficulties, we restrict the introduction of homology to the minimum that is necessary to understand its usage in the rest of the article. In particular, we restrict our attention to homology with coefficients in Z_2, that is, the field with two elements, 0 and 1, such that 1 + 1 = 0, which turns out to be geometrically a little more intuitive. However, all the notions and results presented in the sequel naturally extend to homology with coefficients in any field. We refer the reader to the study by Hatcher (2001) for a complete and comprehensible introduction to homology and to the study by Ghrist (2017) for a recent, concise, and very good introduction to applied algebraic topology and its connections to data analysis.

Let K be a (finite) simplicial complex and let k be a nonnegative integer. The space of k-chains on K, C_k(K), is the set whose elements are the formal (finite) sums of k-simplices of K. More precisely, if {σ_1, …, σ_p} is the set of k-simplices of K, then any k-chain can be written as

$$ c = \sum_{i=1}^{p} \varepsilon_i \sigma_i \quad \text{with } \varepsilon_i \in \mathbb{Z}_2. $$

If c′ = ∑_{i=1}^{p} ε′_i σ_i is another k-chain and λ ∈ Z_2, the sum c + c′ is defined as c + c′ = ∑_{i=1}^{p} (ε_i + ε′_i) σ_i and the product λ.c as λ.c = ∑_{i=1}^{p} (λ.ε_i) σ_i, making C_k(K) a vector space with coefficients in Z_2. Since we are considering coefficients in Z_2, geometrically, a k-chain can be seen as a finite collection of k-simplices and the sum of two k-chains as the symmetric difference of the two corresponding collections 7.
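
The symmetric-difference picture can be made concrete in a couple of lines of Python (a toy illustration with built-in sets, assuming k-simplices encoded as frozensets of vertices).

```python
# Two one-chains over Z_2, each a set of edges.
c = {frozenset((0, 1)), frozenset((1, 2))}
c_prime = {frozenset((1, 2)), frozenset((2, 3))}
# Their sum: the shared simplex cancels since 1 + 1 = 0 in Z_2.
print(c ^ c_prime)  # {frozenset({0, 1}), frozenset({2, 3})}
```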

The boundary of a k-simplex σ = [v_0, …, v_k] is the (k − 1)-chain

$$ \partial_k(\sigma) = \sum_{i=0}^{k} [v_0, \ldots, \hat{v}_i, \ldots, v_k], $$

where [v_0, …, v̂_i, …, v_k] is the (k − 1)-simplex spanned by all the vertices except v_i 8. As the k-simplices form a basis of C_k(K), ∂_k extends as a linear map from C_k(K) to C_{k−1}(K) called the boundary operator. The kernel Z_k(K) = {c ∈ C_k(K) : ∂_k(c) = 0} of ∂_k is called the space of k-cycles of K, and the image B_k(K) = {c ∈ C_k(K) : ∃ c′ ∈ C_{k+1}(K), ∂_{k+1}(c′) = c} of ∂_{k+1} is called the space of k-boundaries of K. The boundary operators satisfy the following fundamental property:

$$ \partial_{k-1} \circ \partial_k = 0 \quad \text{for any } k \geq 1. $$

In other words, any k -boundary is a k -cycle, that is, B k ( K ) ⊆ Z k ( K ) ⊆ C k ( K ). These notions are illustrated in Figure 6 .

Figure 6. Some examples of chains, cycles, and boundaries on a two-dimensional complex K: c_1, c_2, and c_4 are one-cycles; c_3 is a one-chain but not a one-cycle; c_4 is a one-boundary, namely, the boundary of the two-chain obtained as the sum of the two triangles surrounded by c_4. The cycles c_1 and c_2 span the same element in H_1(K), as their difference is the two-chain represented by the union of the triangles surrounded by the union of c_1 and c_2.

Definition 6 (simplicial homology group and Betti numbers). The kth (simplicial) homology group of K is the quotient vector space

$$ H_k(K) = Z_k(K) / B_k(K). $$

The k th Betti number of K is the dimension β k ( K ) = dim  H k ( K ) of the vector space H k ( K ) .
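
For instance, the Betti numbers of the boundary of a triangle, which is homeomorphic to a circle, can be computed with GUDHI's simplex tree; note that GUDHI derives Betti numbers from a persistence computation, so one must be run first (an illustrative sketch).

```python
import gudhi

st = gudhi.SimplexTree()
for edge in [(0, 1), (1, 2), (0, 2)]:
    st.insert(list(edge))  # faces (here, the vertices) are added automatically

# persistence_dim_max=True asks GUDHI to also compute homology in the
# top dimension of the complex (dimension 1 here).
st.persistence(persistence_dim_max=True)
print(st.betti_numbers())  # [1, 1]: one connected component, one loop
```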

Figure 7 gives the Betti numbers of several simple spaces. Two cycles, c , c ′ ∈ Z k ( K ), are said to be homologous if they differ by a boundary, that is, if there exists a ( k + 1)-chain d such that c ′ = c + ∂ k +1 ( d ). Two such cycles give rise to the same element of H k . In other words, the elements of H k ( K ) are the equivalence classes of homologous cycles.

Figure 7. The Betti numbers of the circle (top left), the two-dimensional sphere (top right), and the two-dimensional torus (bottom). The blue curves on the torus represent two independent cycles whose homology classes form a basis of its one-dimensional homology group.

Simplicial homology groups and Betti numbers are topological invariants; if K , K ′ are two simplicial complexes whose geometric realizations are homotopy equivalent, then their homology groups are isomorphic and their Betti numbers are the same.

Singular homology is another notion of homology that allows us to consider larger classes of topological spaces. It is defined for any topological space X similarly to simplicial homology, except that the notion of the simplex is replaced by the notion of the singular simplex, which is just any continuous map σ : Δ k → X where Δ k is the standard k -dimensional simplex. The space of k -chains is the vector space spanned by the k -dimensional singular simplices, and the boundary of a simplex σ is defined as the (alternated) sum of the restriction of σ to the ( k − 1)-dimensional faces of Δ k . A remarkable fact about singular homology is that it coincides with simplicial homology whenever X is homeomorphic to the geometric realization of a simplicial complex. This allows us, in the sequel of this article, to indifferently talk about simplicial or singular homology for topological spaces and simplicial complexes.

Observing that if f : X → Y is a continuous map, then for any singular simplex σ : Δ k → X in X , f ° σ : Δ k → Y is a singular simplex in Y , one easily deduces that continuous maps between topological spaces canonically induce homomorphisms between their homology groups. In particular, if f is a homeomorphism or a homotopy equivalence, then it induces an isomorphism between H k ( X ) and H k ( Y ) for any nonnegative integer k . As an example, it follows from the Nerve theorem that for any set of points X ⊂ R d and any r > 0, the r -offset X r and the Čech complex Cech r ( X ) have isomorphic homology groups and the same Betti numbers.

As a consequence, the Reconstruction theorem (Theorem 2) leads to the following result on the estimation of Betti numbers.

Theorem 3. Let M ⊂ R^d be a compact set such that reach_α(d_M) ≥ R > 0 for some α ∈ (0, 1), and let X be a finite set of points such that d_H(M, X) = ε < R/(5 + 4/α²). Then, for every r ∈ [4ε/α², R − 3ε] and every η ∈ (0, R), the Betti numbers of Cech_r(X) are the same as those of M^η.

In particular, if M is a smooth m-dimensional submanifold of R d , then β k ( Cech r ( X ) ) = β k ( M ) for any k = 0, … , m.

From a practical perspective, this result raises three difficulties: first, the regularity assumption involving the α-reach of M may be too restrictive; second, the computation of the nerve of a union of balls requires a tricky predicate testing whether a finite collection of balls has a non-empty intersection; third, the estimation of the Betti numbers relies on the scale parameter r, whose choice may be a problem.

To overcome these issues, Chazal and Oudot (2008) established the following result, which offers a solution to the first two problems.

Theorem 4. Let M ⊂ R^d be a compact set such that wfs(M) = wfs_{d_M}(0) ≥ R > 0, and let X be a finite set of points such that d_H(M, X) = ε < (1/9) wfs(M). Then, for any r ∈ [2ε, (1/4)(wfs(M) − ε)] and any η ∈ (0, R),

$$ \beta_k(M^{\eta}) = \operatorname{rk}\big( H_k(\mathrm{Rips}_r(X)) \rightarrow H_k(\mathrm{Rips}_{4r}(X)) \big), $$

where rk(H_k(Rips_r(X)) → H_k(Rips_{4r}(X))) denotes the rank of the homomorphism induced by the (continuous) canonical inclusion Rips_r(X) ↪ Rips_{4r}(X).

Although this result leaves the question of the choice of the scale parameter r open, it is proven in the study by Chazal and Oudot (2008) that a multiscale strategy whose description is beyond the scope of this article provides some help in identifying the relevant scales on which Theorem 4 can be applied.

4.3 Statistical Aspects of Homology Inference

According to the stability results presented in the previous section, a statistical approach to topological inference is strongly related to the problems of estimating the support of a distribution and its level sets under the Hausdorff metric. A large number of methods and results are available for estimating the support of a distribution in statistics. For instance, the Devroye and Wise estimator (Devroye and Wise, 1980) defined on a sample X_n is a particular offset of X_n. The convergence rates of both X_n and the Devroye and Wise estimator to the support of the distribution for the Hausdorff distance were studied by Cuevas and Rodríguez-Casal (2004) in R^d. More recently, the minimax rates of convergence of manifold estimation for the Hausdorff metric, which are particularly relevant for topological inference, have been studied by Genovese et al. (2012). There is also a large body of literature about level set estimation in various metrics (see, for instance, Cadre, 2006; Polonik, 1995; Tsybakov, 1997) and, more particularly, for the Hausdorff metric Chen et al. (2017). All these works about support and level set estimation shed light on the statistical analysis of topological inference procedures.

In the study by Niyogi et al. (2008), it was shown that the homotopy type of Riemannian manifolds with a reach larger than a given constant can be recovered with high probability from offsets of a sample on (or close to) the manifold. This article was probably the first to consider the topological inference problem in probabilistic terms. The result was derived from a retraction argument and from tight bounds on the packing number of the manifold, used to control the Hausdorff distance between the manifold and the observed point cloud. Homology inference in the noisy case, in the sense that the distribution of the observations is concentrated around the manifold, was also studied by Niyogi et al. (2008), Niyogi et al. (2011). The assumption that the geometric object is a smooth Riemannian manifold is only used to control in probability the Hausdorff distance between the sample and the manifold and is not actually necessary for the "topological part" of the result; in this respect, the results are similar to those of the studies by Chazal et al. (2009d), Chazal and Lieutier (2008) in the particular framework of Riemannian manifolds. Starting from the result of the study by Niyogi et al. (2008), the minimax rates of convergence of the homology type have been studied by Balakrishnan et al. (2012) under various models for Riemannian manifolds with a reach larger than a constant. In contrast, a statistical version of the work of Chazal et al. (2009d) has not yet been proposed.

More recently, following the ideas of Niyogi et al. (2008) , Bobrowski et al. (2014) have proposed a robust homology estimator for the level sets of both density and regression functions, by considering the inclusion map between nested pairs of estimated level sets (in the spirit of Theorem 4 above) obtained using a plug-in approach from a kernel estimator.

4.4 Going Beyond Hausdorff Distance: Distance to Measure

It is well known that distance-based methods in tda may fail completely in the presence of outliers: adding even a single outlier to the point cloud can change the distance function dramatically (see Figure 8 for an illustration). To address this drawback, Chazal et al. (2011b) introduced an alternative distance function that is robust to noise, the distance-to-measure.

Figure 8. The effect of outliers on the sublevel sets of distance functions: adding just a few outliers to a point cloud may dramatically change its distance function and the topology of its offsets.

Given a probability distribution P in R^d and a real parameter 0 ≤ u ≤ 1, the notion of the distance to the support of P may be generalized as the function

$$ \delta_{P,u}(x) = \inf\{ t > 0 : P(B(x, t)) \geq u \}, $$

where B(x, t) is the closed Euclidean ball of center x and radius t. To avoid issues due to discontinuities of the map P → δ_{P,u}, the distance-to-measure (DTM) function with parameter m ∈ [0, 1] and power r ≥ 1 is defined by

$$ d_{P,m,r}(x) = \left( \frac{1}{m} \int_0^m \delta_{P,u}^{\,r}(x) \, du \right)^{1/r}. $$

A nice property of the DTM proved by Chazal et al. (2011b) is its stability with respect to perturbations of P in the Wasserstein metric. More precisely, the map P → d_{P,m,r} is m^{−1/r}-Lipschitz, that is, if P and P̃ are two probability distributions on R^d, then

$$ \| d_{P,m,r} - d_{\tilde{P},m,r} \|_{\infty} \leq m^{-1/r} \, W_r(P, \tilde{P}), $$

where W_r is the Wasserstein distance for the Euclidean metric on R^d, with exponent r 9. This property implies that the DTMs associated with distributions that are close in the Wasserstein metric have close sublevel sets. Moreover, when r = 2, the function d²_{P,m,2} is semiconcave, ensuring strong regularity properties on the geometry of its sublevel sets. Using these properties, Chazal et al. (2011b) showed that, under general assumptions, if P̃ is a probability distribution approximating P, then the sublevel sets of d_{P̃,m,2} provide a topologically correct approximation of the support of P.

In practice, the measure P is usually only known through a finite set of observations X_n = {X_1, …, X_n} sampled from P, raising the question of the approximation of the DTM. A natural idea to estimate the DTM from X_n is to plug the empirical measure P_n into the definition of the DTM in place of P. This "plug-in strategy" corresponds to computing the distance to the empirical measure (DTEM). For m = k/n, the DTEM satisfies

$$ d_{P_n, k/n, r}^{\,r}(x) = \frac{1}{k} \sum_{j=1}^{k} \| x - X_n \|_{(j)}^{\,r}, $$

where ‖x − X_n‖_{(j)} denotes the distance between x and its jth nearest neighbor in {X_1, …, X_n}. This quantity can easily be computed in practice since it only requires the distances between x and the sample points. The convergence of the DTEM to the DTM has been studied by Chazal et al. (2017) and Chazal et al. (2016b).
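
For r = 2, the DTEM is thus a root mean squared distance to the k nearest sample points, which a k-d tree computes directly. The sketch below (our illustration; the sample, the outlier, and k are arbitrary) also shows the robustness to outliers that motivates the DTM.

```python
import numpy as np
from scipy.spatial import cKDTree

def dtem(x, Xn, k):
    """d_{P_n, k/n, 2}(x): root mean squared distance to the k nearest points."""
    dists = cKDTree(Xn).query(x, k=k)[0]
    return np.sqrt(np.mean(dists ** 2, axis=-1))

rng = np.random.default_rng(0)
Xn = np.vstack([rng.normal(size=(500, 2)), [[10.0, 10.0]]])  # one far outlier

q = np.array([[10.0, 10.0]])     # query point located at the outlier
print(cKDTree(Xn).query(q)[0])   # distance function: 0, fooled by the outlier
print(dtem(q, Xn, k=50))         # DTEM: large, since the outlier has low mass
```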

The introduction of the DTM has motivated further works and applications in various directions such as topological data analysis ( Buchet et al., 2015a ), GPS trace analysis ( Chazal et al., 2011a ), density estimation ( Biau et al., 2011 ), hypothesis testing Brécheteau (2019) , and clustering ( Chazal et al., 2013 ), just to name a few. Approximations, generalizations, and variants of the DTM have also been considered ( Guibas et al., 2013 ; Phillips et al., 2014 ; Buchet et al., 2015b ; Brécheteau and Levrard, 2020 ).

5 Persistent Homology

Persistent homology is a powerful tool used to efficiently compute, study, and encode multiscale topological features of nested families of simplicial complexes and topological spaces. Not only does it provide efficient algorithms to compute the Betti numbers of each complex in the considered families, as required for homology inference in the previous section, but it also encodes the evolution of the homology groups of the nested complexes across scales. Ideas and preliminary results underlying persistent homology theory can be traced back to the 20th century, in particular to the works of Barannikov (1994), Frosini (1992), and Robins (1999). It started to develop rapidly in its modern form after the seminal works of Edelsbrunner et al. (2002) and Zomorodian and Carlsson (2005).

5.1 Filtrations

A filtration of a simplicial complex K is a nested family of subcomplexes (K_r)_{r∈T}, where T ⊆ R, such that for any r, r′ ∈ T, if r ≤ r′ then K_r ⊆ K_{r′}, and K = ∪_{r∈T} K_r. The subset T may be either finite or infinite. More generally, a filtration of a topological space M is a nested family of subspaces (M_r)_{r∈T}, where T ⊆ R, such that for any r, r′ ∈ T, if r ≤ r′ then M_r ⊆ M_{r′}, and M = ∪_{r∈T} M_r. For example, if f : M → R is a function, then the family M_r = f^{-1}((−∞, r]), r ∈ R, defines a filtration called the sublevel set filtration of f.

In practical situations, the parameter r ∈ T can often be interpreted as a scale parameter, and filtrations classically used in TDA often belong to one of the two following families.

Filtrations Built on Top of Data

Given a subset X of a compact metric space (M, ρ), the families of Vietoris–Rips complexes (Rips_r(X))_{r∈R} and Čech complexes (Cech_r(X))_{r∈R} are filtrations 10. Here, the parameter r can be interpreted as a resolution at which one considers the data set X. For example, if X is a point cloud in R^d, thanks to the Nerve theorem, the filtration (Cech_r(X))_{r∈R} encodes the topology of the whole family of unions of balls X^r = ∪_{x∈X} B(x, r) as r goes from 0 to +∞. As the notion of filtration is quite flexible, many other filtrations have been considered in the literature and can be constructed on top of data, such as the so-called witness complex, popularized in tda by De Silva and Carlsson (2004), the weighted Rips filtrations Buchet et al. (2015b), or the so-called DTM filtrations Anai et al. (2019) that allow us to handle data corrupted by noise and outliers.

Sublevel Sets Filtrations

Functions defined on the vertices of a simplicial complex give rise to another important example of a filtration: let K be a simplicial complex with vertex set V and let f : V → R. Then f can be extended to all simplices of K by f(σ) = max{f(v_i) : i = 0, …, k} for any simplex σ = [v_0, …, v_k] ∈ K, and the family of subcomplexes K_r = {σ ∈ K : f(σ) ≤ r} defines a filtration called the sublevel set filtration of f. Similarly, one can define the upper-level set filtration of f.

In practice, even if the index set is infinite, all the considered filtrations are built on finite sets and are indeed finite. For example, when X is finite, the Vietoris–Rips complex Rips r ( X ) changes only at a finite number of indices, r . This allows us to easily handle them from an algorithmic perspective.
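
As an illustration, a sublevel set filtration can be encoded in a GUDHI simplex tree by assigning to each simplex the maximum of the function over its vertices, as in the definition above; the complex and the function values are illustrative.

```python
import gudhi

f = {0: 0.0, 1: 1.0, 2: 2.0}  # a function on the vertices
st = gudhi.SimplexTree()
for v, value in f.items():
    st.insert([v], filtration=value)
for simplex in [[0, 1], [1, 2], [0, 2], [0, 1, 2]]:
    st.insert(simplex, filtration=max(f[v] for v in simplex))

# Simplices are enumerated in nondecreasing order of filtration value.
for simplex, value in st.get_filtration():
    print(simplex, value)
```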

5.2 Starting With a Few Examples

Given a filtration Filt = (F_r)_{r∈T} of a simplicial complex or a topological space, the homology of F_r changes as r increases: new connected components can appear, existing components can merge, loops and cavities can appear or be filled, etc. Persistent homology tracks these changes, identifies the appearing features, and associates a lifetime with them. The resulting information is encoded as a set of intervals called a barcode or, equivalently, as a multiset of points in R², where the coordinates of each point are the start and end points of the corresponding interval.

Before giving formal definitions, we introduce and illustrate persistent homology on a few simple examples.

Let f : [0, 1] → R be the function of Figure 9A and let F_r = f^{-1}((−∞, r]), r ∈ R, be the sublevel set filtration of f. All the sublevel sets of f are either empty or a union of intervals, so the only nontrivial topological information they carry is their zero-dimensional homology, that is, their number of connected components. For r < a_1, F_r is empty, but at r = a_1, a first connected component appears in F_{a_1}. Persistent homology thus registers a_1 as the birth time of a connected component and starts to keep track of it by creating an interval starting at a_1. Then, F_r remains connected until r reaches the value a_2, where a second connected component appears. Persistent homology starts to keep track of this new connected component by creating a second interval starting at a_2. Similarly, when r reaches a_3, a new connected component appears and persistent homology creates a new interval starting at a_3. When r reaches a_4, the two connected components created at a_1 and a_3 merge together to give a single larger component. At this step, persistent homology follows the rule that the most recently appeared component in the filtration dies: the interval started at a_3 is thus ended at a_4, and a first persistence interval encoding the life span of the component born at a_3 is created. When r reaches a_5, as in the previous case, the component born at a_2 dies, and the persistence interval (a_2, a_5) is created. The interval created at a_1 remains until the end of the filtration, giving rise to the persistence interval (a_1, a_6) if the filtration is stopped at a_6, or (a_1, +∞) if r goes to +∞ (notice that in this latter case, the filtration remains constant for r > a_6). The obtained set of intervals encoding the life spans of the different homological features encountered along the filtration is called the persistence barcode of f. Each interval (a, a′) can be represented by the point of coordinates (a, a′) in the plane R². The resulting set of points is called the persistence diagram of f. Notice that a function may have several copies of the same interval in its persistence barcode. As a consequence, the persistence diagram of f is indeed a multiset where each point has an integer-valued multiplicity. Last, for technical reasons that will become clear in the next section, one adds to the persistence diagram all the points of the diagonal Δ = {(b, d) : b = d}, each with infinite multiplicity.

Figure 9. (A) Example 1: the persistence barcode and the persistence diagram of a function f : [0,1] → R. (B) Example 2: the persistence barcode and the persistence diagram of the height function (projection on the z-axis) defined on a surface in R^3.
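The zero-dimensional persistence of Example 1 can be reproduced numerically by discretizing the interval and applying the same lower-star construction on a path graph. The sketch below uses GUDHI and a hypothetical sampled function standing in for the f of Figure 9A.

import numpy as np
import gudhi as gd

# Discretize a function on [0, 1] with several local minima
x = np.linspace(0, 1, 500)
f = np.cos(4 * np.pi * x) + x

st = gd.SimplexTree()
for i, v in enumerate(f):
    st.insert([i], filtration=v)  # each vertex appears at its function value
for i in range(len(f) - 1):
    st.insert([i, i + 1], filtration=max(f[i], f[i + 1]))  # edges at the max of their endpoints

# Pairs (dimension, (birth, death)): the persistence barcode/diagram of f
print(st.persistence())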

Let f : M → R now be the function of Figure 9B, where M is a two-dimensional surface homeomorphic to a torus, and let F_r = f^{−1}((−∞, r]), r ∈ R, be the sublevel set filtration of f. The zero-dimensional persistent homology is computed as in the previous example, giving rise to the red bars in the barcode. Now, the sublevel sets also carry one-dimensional homological features. When r goes through the height a_1, the sublevel sets F_r, which were homeomorphic to two discs, become homeomorphic to the disjoint union of a disc and an annulus, creating a first cycle homologous to σ_1 in Figure 9B. An interval (in blue) representing the birth of this new one-cycle is thus started at a_1. Similarly, when r goes through the height a_2, a second cycle, homologous to σ_2, is created, giving rise to the start of a new persistence interval. These two cycles are never filled (indeed, they span H_1(M)), and the corresponding intervals remain until the end of the filtration. When r reaches a_3, a new cycle is created that is filled and thus dies at a_4, giving rise to the persistence interval (a_3, a_4). So now, the sublevel set filtration of f gives rise to two barcodes, one for zero-dimensional homology (in red) and one for one-dimensional homology (in blue). As previously stated, these two barcodes can equivalently be represented as diagrams in the plane.

In this last example, we consider the filtration given by a union of growing balls centered on the finite set of points C in Figure 10 . Notice that this is the sublevel set filtration of the distance function to C , and thanks to the Nerve theorem, this filtration is homotopy equivalent to the Čech filtration built on the top of C . Figure 10 shows several level sets of the filtration as follows:

  • a) For the radius r = 0, the union of balls is reduced to the initial finite set of points, each of them corresponding to a zero-dimensional feature, that is, a connected component; an interval is created for the birth of each of these features at r = 0.
  • b) Some of the balls started to overlap, resulting in the death of some connected components that get merged together; the persistence diagram keeps track of these deaths, putting an end point to the corresponding intervals as they disappear.
  • c) New components have merged, giving rise to a single connected component and, so, all the intervals associated with a zero-dimensional feature have been ended, except the one corresponding to the remaining components; two new one-dimensional features have appeared, resulting in two new intervals (in blue) starting on their birth scale.
  • d) One of the two one-dimensional cycles has been filled, resulting in its death in the filtration and the end of the corresponding blue interval.
  • e) All the one-dimensional features have died; only the long (and never dying) red interval remains.

As in the previous examples, the final barcode can equivalently be represented as a persistence diagram where every interval (a, b) is represented by the point of coordinates (a, b) in R^2. Intuitively, the longer an interval in the barcode or, equivalently, the farther from the diagonal the corresponding point in the diagram, the more persistent, and thus relevant, the corresponding homological feature across the filtration. Notice also that for a given radius r, the k-th Betti number of the corresponding union of balls is equal to the number of persistence intervals corresponding to k-dimensional homological features and containing r. So, the persistence diagram can be seen as a multiscale topological signature encoding the homology of the union of balls for all radii as well as its evolution across the values of r.
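In practice, the persistence of such a union of growing balls is computed through a nerve-like construction; in low dimensions, the alpha complex provides an equivalent but much smaller filtration. A minimal GUDHI sketch follows, with a hypothetical noisy circle standing in for the point cloud C; note that GUDHI's AlphaComplex uses squared radii as filtration values.

import numpy as np
import gudhi as gd

# Hypothetical point cloud: noisy samples of a circle (one prominent one-cycle expected)
theta = 2 * np.pi * np.random.rand(100)
pts = np.stack([np.cos(theta), np.sin(theta)], axis=1) + 0.1 * np.random.randn(100, 2)

st = gd.AlphaComplex(points=pts).create_simplex_tree()
diag = st.persistence()  # filtration values are squared radii here
gd.plot_persistence_diagram(diag)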

Figure 10. The sublevel set filtration of the distance function to a point cloud and the construction of its persistence barcode as the radius of balls increases. The blue curves in the unions of balls represent one-cycles associated with the blue bars in the barcodes. The persistence diagram is finally defined from the persistence barcodes. Panels (A–E) correspond to the steps (a)–(e) described above.

5.3 Persistent Modules and Persistence Diagrams

Persistence diagrams can be formally and rigorously defined in a purely algebraic way. This requires some care, and we only give the basic necessary notions here, leaving aside technical subtleties and difficulties. We refer the readers interested in a detailed exposition to Chazal et al. (2016a).

Let Filt = (F_r)_{r ∈ T} be a filtration of a simplicial complex or a topological space. Given a nonnegative integer k and considering the homology groups H_k(F_r), we obtain a sequence of vector spaces where the inclusions F_r ⊆ F_{r′}, r ≤ r′, induce linear maps between H_k(F_r) and H_k(F_{r′}). Such a sequence of vector spaces together with the linear maps connecting them is called a persistence module.

Definition 7. A persistence module V over a subset T of the real numbers R is an indexed family of vector spaces (V_r | r ∈ T) and a doubly indexed family of linear maps (v_s^r : V_r → V_s | r ≤ s) which satisfy the composition law v_t^s ∘ v_s^r = v_t^r whenever r ≤ s ≤ t, and where v_r^r is the identity map on V_r.

In many cases, a persistence module can be decomposed into a direct sum of interval modules I(b, d) of the form

⋯ → 0 → Z_2 → ⋯ → Z_2 → 0 → ⋯,

where the maps Z_2 → Z_2 are identity maps while all the other maps are 0; in other words, the vector space at index r is Z_2 for r in an interval with endpoints b and d, and 0 otherwise. Denoting by b (resp. d) the infimum (resp. supremum) of the interval of indices corresponding to nonzero vector spaces, such a module can be interpreted as a feature that appears in the filtration at index b and disappears at index d. When a persistence module V can be decomposed as a direct sum of interval modules, one can show that this decomposition is unique up to reordering the intervals (see (Chazal et al., 2016a, Theorem 2.7)). As a consequence, the set of resulting intervals is independent of the decomposition of V and is called the persistence barcode of V. As in the examples of the previous section, each interval (b, d) in the barcode can be represented as the point of coordinates (b, d) in the plane R^2. The disjoint union of these points, together with the diagonal Δ = {x = y}, is a multi-set called the persistence diagram of V.

The following result, from ( Chazal et al., 2016a , Theorem 2.8), gives some necessary conditions for a persistence module to be decomposable as a direct sum of interval modules.

Theorem 5. Let V be a persistence module indexed by T ⊂ R. If T is a finite set or if all the vector spaces V_r are finite-dimensional, then V is decomposable as a direct sum of interval modules. Moreover, for any s, t ∈ T, s ≤ t, the number β_s^t of intervals starting before s and ending after t is equal to the rank of the linear map v_t^s and is called the (s, t)-persistent Betti number of the filtration.

As both conditions above are satisfied for the persistent homology of filtrations of finite simplicial complexes, an immediate consequence of this result is that the persistence diagrams of such filtrations are always well defined.

Indeed, it is possible to show that persistence diagrams can be defined as soon as the following simple condition is satisfied.

Definition 8. A persistence module V indexed by T ⊂ R is q-tame if for any r < s in T, the rank of the linear map v_s^r : V_r → V_s is finite.

Theorem 6 Chazal et al. (2009a) , Chazal et al. (2016a) . If V is a q-tame persistence module, then it has a well-defined persistence diagram. Such a persistence diagram dgm ( V ) is the union of the points of the diagonal Δ of R 2 , counted with infinite multiplicity, and a multi-set above the diagonal in R 2 that is locally finite. Here, by locally finite, we mean that for any rectangle R with sides parallel to the coordinate axes that does not intersect Δ , the number of points of dgm ( V ) , counted with multiplicity, contained in R is finite. Also, the part of the diagram made of the points with the infinite second coordinate is called the essential part of the diagram.

The construction of persistence diagrams of q-tame modules is beyond the scope of this article, but it gives rise to the same notion as in the case of decomposable modules. It can be done either by following the algebraic approach based upon the decomposability properties of modules or by adopting a measure theoretic approach that allows us to define diagrams as integer-valued measures on a space of rectangles in the plane. We refer the reader to Chazal et al. (2016a) for more information.

Although persistence modules encountered in practice are decomposable, the general framework of the q-tame persistence module plays a fundamental role in the mathematical and statistical analysis of persistent homology. In particular, it is needed to ensure the existence of limit diagrams when convergence properties are studied (see Section 6 ).

A filtration Filt = ( F r ) r ∈ T of a simplicial complex or of a topological space is said to be tame if for any integer k , the persistence module ( H k ( F r )∣ r ∈ T ) is q -tame. Notice that the filtrations of finite simplicial complexes are always tame. As a consequence, for any integer k , a persistence diagram denoted dgm k (Filt) is associated with the filtration Filt. When k is not explicitly specified and when there is no ambiguity, it is usual to drop the index k in the notation and to talk about “the” persistence diagram dgm(Filt) of the filtration Filt. This notation has to be understood as “dgm k (Filt) for some k .”

5.4 Persistence Landscapes

The persistence landscape introduced in the study by Bubenik (2015) is an alternative representation of persistence diagrams. This approach aims at representing the topological information encoded in persistence diagrams as elements of a Hilbert space, for which statistical learning methods can be directly applied. The persistence landscape is a collection of continuous, piecewise linear functions λ : N × R → R that summarizes a persistence diagram dgm.

A birth–death pair p = (b, d) ∈ dgm is transformed into the point ((b + d)/2, (d − b)/2) (see Figure 11). Remember that the points with infinite persistence have been simply discarded in this definition. The landscape is then defined by considering the set of functions created by tenting the features of the rotated persistence diagram: to each p = (b, d) ∈ dgm, one associates the tent function

Λ_p(t) = max(0, min(t − b, d − t)),

which vanishes outside (b, d) and peaks at height (d − b)/2 above the midpoint (b + d)/2.

Figure 11. Example of a persistence landscape (right) associated with a persistence diagram (left). The first landscape is in blue, the second one in red, and the last one in orange. All the other landscapes are zero.

The persistence landscape λ_dgm of dgm is a summary of the arrangement of piecewise linear curves obtained by overlaying the graphs of the functions {Λ_p}_{p ∈ dgm}. Formally, the persistence landscape of dgm is the collection of functions

λ_dgm(k, t) = kmax_{p ∈ dgm} Λ_p(t), t ∈ [0, T], k ∈ N,

where kmax is the k-th largest value in the set; in particular, 1max is the usual maximum function. Given k ∈ N, the function λ_dgm(k, ⋅) : R → R is called the k-th landscape of dgm. It is not difficult to see that the map that associates to each persistence diagram its corresponding landscape is injective. In other words, formally, no information is lost when a persistence diagram is represented through its persistence landscape.
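For readers who prefer code, here is a small numpy sketch of this definition that evaluates the k-th landscape of a diagram of finite (birth, death) pairs on a grid; the function and variable names are ours, not a library API.

import numpy as np

def kth_landscape(dgm, k, ts):
    # k-th persistence landscape of dgm (array of finite (b, d) pairs) on the grid ts
    if len(dgm) < k:
        return np.zeros(len(ts))
    # Tent functions: Lambda_p(t) = max(0, min(t - b, d - t))
    tents = np.array([np.maximum(np.minimum(ts - b, d - ts), 0.0) for b, d in dgm])
    tents.sort(axis=0)  # sort the values at each t; the k-th largest sits in row -k
    return tents[-k]

For instance, kth_landscape(np.array([[0.0, 0.8], [0.2, 0.5]]), 1, np.linspace(0, 1, 1000)) returns the first landscape of a two-point diagram.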

The advantage of the persistence landscape representation is two-fold. First, persistence diagrams are mapped to elements of a functional space, opening the door to the use of a broad variety of statistical and data analysis tools for further processing of topological features (see Bubenik (2015), Chazal et al. (2015c), and Section 6.3.1). Second, and fundamental from a theoretical perspective, persistence landscapes share the same stability properties as persistence diagrams (see Section 5.7).

5.5 Linear Representations of Persistent Homology

A persistence diagram without its essential part can be represented as a discrete measure on Δ⁺ = {p = (b, d) : b < d < ∞}. With a slight abuse of notation, we can write the following:

dgm = Σ_{p ∈ dgm} δ_p,

where the features are counted with multiplicity and where δ_p denotes the Dirac measure at p = (b, d). Most of the persistence-based descriptors that have been proposed to analyze persistence can be expressed as linear transformations of the persistence diagram, seen as a point process:

Ψ(dgm) = Σ_{p ∈ dgm} f(p),     (5)

for some function f defined on Δ⁺ and taking values in a Banach space.

In most cases, we want these transformations to apply independently to each homological dimension. For a given homological dimension k ∈ N, we then consider a linear transformation of the persistence diagram restricted to the topological features of dimension k:

Ψ_k(dgm) = Σ_{p ∈ dgm_k} f_k(p),

where dgm_k is the persistence diagram of the topological features of dimension k and where f_k is defined on Δ⁺ and takes values in a Banach space.

Betti Curve

The simplest way to represent persistent homology is the Betti function, or Betti curve. The Betti curve of homological dimension k is defined as

β_k(t) = Σ_{(b,d) ∈ dgm_k} w(b, d) 1_{t ∈ [b,d]}, t ∈ R,

where w is a weight function defined on Δ⁺. In other words, the Betti curve is the (weighted) number of bars of the k-dimensional barcode containing t. This descriptor is a linear representation of persistent homology, obtained by taking f in (5) such that f_{(b,d)}(t) = w(b, d) 1_{t ∈ [b,d]}. A typical choice for the weight function is an increasing function of the persistence, w(b, d) = w̃(d − b), where w̃ is an increasing function defined on R⁺. One of the first applications of Betti curves can be found in the study by Umeda (2017).
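As a sanity check of the definition, the Betti curve is straightforward to implement by hand; in the sketch below (names are ours), the weight defaults to w ≡ 1, in which case β_k(t) is simply the number of k-dimensional bars alive at time t.

import numpy as np

def betti_curve(dgm_k, ts, w=lambda b, d: 1.0):
    # Weighted count of the intervals [b, d] of dgm_k containing each t in ts
    return np.array([sum(w(b, d) for b, d in dgm_k if b <= t <= d) for t in ts])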

Persistence Surface

The persistence surface (also called persistence image) is obtained by convolving a diagram with a kernel. It was introduced in the study by Adams et al. (2017). For K : R^2 → R a kernel and H a 2 × 2 bandwidth matrix (i.e., a symmetric positive definite matrix), let, for u ∈ R^2,

K_H(u) = |H|^{−1/2} K(H^{−1/2} u).

Let w : R^2 → R⁺ be a weight function defined on Δ⁺. The persistence surface of homological dimension k associated with a diagram dgm, with kernel K and bandwidth matrix H, is defined by

ρ_k(u) = Σ_{p ∈ dgm_k} w(p) K_H(u − p), u ∈ R^2.

The persistence surface is obviously a linear representation of persistent homology. Typical weight functions are increasing functions of the persistence.
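A minimal sketch with an isotropic Gaussian kernel, that is, H = σ²I, and a persistence-based weight is given below; actual implementations, such as the PersistenceImage class of gudhi.representations, typically rotate the diagram to (birth, persistence) coordinates first, which we skip here for brevity.

import numpy as np

def persistence_surface(dgm_k, xs, ys, sigma=0.1, w=lambda b, d: d - b):
    # Convolution of the weighted diagram with a Gaussian kernel (H = sigma^2 * I)
    xx, yy = np.meshgrid(xs, ys)
    surf = np.zeros_like(xx)
    for b, d in dgm_k:
        surf += w(b, d) * np.exp(-((xx - b) ** 2 + (yy - d) ** 2) / (2 * sigma ** 2))
    return surf / (2 * np.pi * sigma ** 2)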

Other Linear Representations of Persistence

Many other linear representations of persistence have been proposed in the literature, such as the persistence silhouette ( Chazal et al., 2015b ), the accumulated persistence function ( Biscio and Møller, 2019 ), and variants of the persistence surface ( Reininghaus et al., 2015 ; Kusano et al., 2016 ; Chen et al., 2017 ).

Considering persistence diagrams as discrete measures and their vectorizations as linear representations is an approach that has also proven fruitful for studying distributions of diagrams Divol and Chazal (2020) and the metric structure of the space of persistence diagrams Divol and Lacombe (2020) (see Sections 5.6 and 6.3).

5.6 Metrics on the Space of Persistence Diagrams

To exploit the topological information and topological features inferred from persistent homology, one needs to be able to compare persistence diagrams, that is, to endow the space of persistence diagrams with a metric structure. Although several metrics can be considered, the most fundamental one is known as the bottleneck distance.

Recall that a persistence diagram is the union of a discrete multi-set in the half-plane above the diagonal Δ and, for technical reasons that will become clear below, of Δ, where each point of Δ is counted with infinite multiplicity. A matching (see Figure 12) between two diagrams, dgm_1 and dgm_2, is a subset m ⊆ dgm_1 × dgm_2 such that every point in dgm_1\Δ and dgm_2\Δ appears exactly once in m. In other words, for any p ∈ dgm_1\Δ and any q ∈ dgm_2\Δ, ({p} × dgm_2) ∩ m and (dgm_1 × {q}) ∩ m each contain a single pair. The bottleneck distance between dgm_1 and dgm_2 is then defined by

d_b(dgm_1, dgm_2) = inf_{matching m} max_{(p,q) ∈ m} ‖p − q‖_∞.
Figure 12. Perfect matching and the bottleneck distance between a blue and a red diagram. Notice that some points of both diagrams are matched to points of the diagonal.

The practical computation of the bottleneck distance boils down to the computation of a perfect matching in a bipartite graph for which classical algorithms can be used.

The bottleneck metric is an L_∞-like metric. It turns out to be the natural one to express the stability properties of persistence diagrams presented in Section 5.7, but it suffers from the same drawbacks as the usual L_∞ norms; that is, it is completely determined by the largest distance among the pairs and does not take into account the closeness of the remaining pairs of points. A variant to overcome this issue, the so-called Wasserstein distance between diagrams, is sometimes considered. Given p ≥ 1, it is defined by

W_p(dgm_1, dgm_2) = ( inf_{matching m} Σ_{(x,y) ∈ m} ‖x − y‖_∞^p )^{1/p}.

Useful stability results for persistence in the W_p metric exist in the literature, in particular in the study by Cohen-Steiner et al. (2010), but they rely on assumptions that make them consequences of the stability results in the bottleneck metric. A general study of the space of persistence diagrams endowed with W_p metrics has been carried out in the study by Divol and Lacombe (2020), who proposed a general framework, based upon optimal partial transport, in which many important properties of persistence diagrams can be proven in a natural way.
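Both metrics are available in GUDHI, as the following minimal sketch on two small hypothetical diagrams shows; the Wasserstein routine relies on the POT optimal transport package.

import numpy as np
import gudhi as gd
from gudhi.wasserstein import wasserstein_distance

dgm1 = np.array([[0.0, 1.0], [0.2, 0.5]])
dgm2 = np.array([[0.1, 1.1]])

db = gd.bottleneck_distance(dgm1, dgm2)  # bottleneck distance
w1 = wasserstein_distance(dgm1, dgm2, order=1.0, internal_p=np.inf)  # 1-Wasserstein distance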

5.7 Stability Properties of Persistence Diagrams

A fundamental property of persistent homology is that persistence diagrams of filtrations built on the top of data sets turn out to be very stable with respect to some perturbations of the data. To formalize and quantify such stability properties, we first need to make precise the notion of perturbation that is allowed.

Rather than working directly with filtrations built on the top of data sets, it turns out to be more convenient to define a notion of proximity between persistence modules, from which we will derive a general stability result for persistent homology. Then, most of the stability results for specific filtrations will appear as a consequence of this general theorem. To avoid technical discussions, from now on, we assume, without loss of generality, that the considered persistence modules are indexed by R .

Definition 9. Let V, W be two persistence modules indexed by R. Given δ ∈ R, a homomorphism of degree δ between V and W is a collection Φ of linear maps ϕ_r : V_r → W_{r+δ}, r ∈ R, such that for any r ≤ s, ϕ_s ∘ v_s^r = w_{s+δ}^{r+δ} ∘ ϕ_r.

An important example of a homomorphism of degree δ is the shift endomorphism 1_V^δ, which consists of the family of linear maps (v_{r+δ}^r). Notice also that homomorphisms of modules can naturally be composed: the composition of a homomorphism Ψ of degree δ between U and V and a homomorphism Φ of degree δ′ between V and W naturally gives rise to a homomorphism ΦΨ of degree δ + δ′ between U and W.

Definition 10. Let δ ≥ 0. Two persistence modules V, W are δ-interleaved if there exist two homomorphisms of degree δ, Φ from V to W and Ψ from W to V, such that ΨΦ = 1_V^{2δ} and ΦΨ = 1_W^{2δ}.

Although it does not define a metric on the space of persistence modules, the notion of closeness between two persistence modules may be defined as the smallest nonnegative δ such that they are δ -interleaved. Moreover, it allows us to formalize the following fundamental theorem ( Chazal et al., 2009a ; Chazal et al., 2016a ).

Theorem 7 (Stability of persistence). Let V and W be two q-tame persistence modules. If V and W are δ-interleaved for some δ ≥ 0, then

d_b(dgm(V), dgm(W)) ≤ δ.
Although purely algebraic and rather abstract, this result is an efficient tool to easily establish concrete stability results in TDA. For example, we can easily recover the first persistence stability result that appeared in the literature ( Cohen-Steiner et al., 2005 ).

Theorem 8. Let f, g : M → R be two real-valued functions defined on a topological space M that are q-tame, that is, such that the sublevel set filtrations of f and g induce q-tame modules at the homology level. Then for any integer k,

d_b(dgm_k(f), dgm_k(g)) ≤ ‖f − g‖_∞,

where dgm_k(f) (resp. dgm_k(g)) is the persistence diagram of the persistence module (H_k(f^{−1}((−∞, r])) | r ∈ R) (resp. (H_k(g^{−1}((−∞, r])) | r ∈ R)), where the linear maps are the ones induced by the canonical inclusion maps between sublevel sets.

Proof. Denoting δ = ‖f − g‖_∞, we have, for any r ∈ R, f^{−1}((−∞, r]) ⊆ g^{−1}((−∞, r + δ]) and g^{−1}((−∞, r]) ⊆ f^{−1}((−∞, r + δ]). This interleaving between the sublevel sets of f and g induces a δ-interleaving between the persistence modules at the homology level, and the result follows from the direct application of Theorem 7.

Theorem 7 also implies a stability result for the persistence diagrams of filtrations built on the top of data.

Theorem 9. Let X and Y be two compact metric spaces and let Filt(X) and Filt(Y) be the Vietoris–Rips or Čech filtrations built on the top of X and Y. Then

d_b(dgm(Filt(X)), dgm(Filt(Y))) ≤ 2 d_GH(X, Y),

where d_GH denotes the Gromov–Hausdorff distance and where dgm(Filt(X)) and dgm(Filt(Y)) denote the persistence diagrams of the filtrations Filt(X) and Filt(Y).

As we already noticed in Example 3 of Section 5.2 , the persistence diagrams can be interpreted as multiscale topological features of X and Y . In addition, Theorem 9 tells us that these features are robust with respect to perturbations of the data in the Gromov–Hausdorff metric. They can be used as discriminative features for classification or other tasks (see, for example, Chazal et al. (2009b) for an application to nonrigid 3D shape classification).

We now give similar results for the alternative representations of persistent homology introduced before. From the definition of the persistence landscape, we immediately observe that λ(k, ⋅) is one-Lipschitz, and thus, stability properties similar to those of persistence diagrams are satisfied for landscapes.

Proposition 1 (stability of persistence landscapes; Bubenik (2015)). Let dgm and dgm′ be two persistence diagrams (without their essential parts), and let λ and λ′ be their respective persistence landscapes. For any t ∈ R and any k ∈ N, we have the following:

  • (i) λ(k, t) ≥ λ(k + 1, t) ≥ 0.
  • (ii) |λ(k, t) − λ′(k, t)| ≤ d_b(dgm, dgm′).

A large class of linear representations is continuous with respect to the Wasserstein metric W_s on the space of persistence diagrams and with respect to the Banach norm of the linear representation of persistence. Generally speaking, it is not always possible to upper bound the modulus of continuity of the linear representation operator. However, in the case where s = 1, it is even possible to show a stability result if the weight function takes small values for points close to the diagonal (see Divol and Lacombe (2020), Hofer et al. (2019b)).

Stability Versus Discriminative Capacity of Persistence Representations

The results of the study by Divol and Lacombe (2020) showed that continuity and stability are only possible with weight functions taking small values for points close to the diagonal. However, in general, there is no specific reason to consider that points close to the diagonal are less important than others for a given learning task. From a machine learning perspective, it is thus also relevant to design linear representations with general weight functions, although it would be more difficult to prove the consistency of the corresponding methods without at least the continuity of the representation. Stability is thus important but maybe too strong a requirement for many problems in data sciences. Designing linear representations that are sensitive to specific parts of persistence diagrams, rather than globally stable, may prove to be a good strategy in practice.

6 Statistical Aspects of Persistent Homology

Persistent homology by itself does not take into account the random nature of data and the intrinsic variability of the topological quantities it infers. We now present a statistical approach to persistent homology, in the sense that data are considered to be generated from an unknown distribution. We start with several consistency results for persistent homology inference.

6.1 Consistency Results for Persistent Homology

Assume that we observe n points (X_1, …, X_n) in a metric space (M, ρ) drawn i.i.d. from an unknown probability measure μ whose support is a compact set denoted X_μ. The Gromov–Hausdorff distance allows us to compare X_μ with compact metric spaces not necessarily embedded in M. In the following, an estimator X̂ of X_μ is a function of X_1, …, X_n that takes values in the set of compact metric spaces.

Let Filt(X_μ) and Filt(X̂) be two filtrations defined on X_μ and X̂. Starting from Theorem 9, a natural strategy for estimating the persistent homology of Filt(X_μ) consists in estimating the support X_μ. Note that in some cases, the space M can be unknown and the observations X_1, …, X_n are then only known through their pairwise distances ρ(X_i, X_j), i, j = 1, …, n. The use of the Gromov–Hausdorff distance allows us to consider this set of observations as an abstract metric space of cardinality n, independently of the way it is embedded in M. This general framework includes the more standard approach consisting in estimating the support with respect to the Hausdorff distance by restricting the values of X̂ to the compact sets included in M.

The finite set X_n ≔ {X_1, …, X_n} is a natural estimator of the support X_μ. In several contexts discussed in the following, X_n shows optimal rates of convergence to X_μ with respect to the Hausdorff distance. For some constants a, b > 0, we say that μ satisfies the (a, b)-standard assumption if for any x ∈ X_μ and any r > 0,

μ(B(x, r)) ≥ min(a r^b, 1),

where B(x, r) denotes the closed ball of radius r centered at x. This assumption has been widely used in the literature of set estimation under the Hausdorff distance (Cuevas and Rodríguez-Casal, 2004; Singh et al., 2009). Under this assumption, it can be easily derived that the rate of convergence of dgm(Filt(X_n)) to dgm(Filt(X_μ)) for the bottleneck metric is upper bounded by O(((log n)/n)^{1/b}). More precisely, this rate upper bounds the minimax rate of convergence over the set of probability measures on the metric space (M, ρ) satisfying the (a, b)-standard assumption on M.

Theorem 10. (Chazal et al. (2014)) For some positive constants a and b, let P(a, b) be the set of all the probability measures μ on the metric space (M, ρ) satisfying the (a, b)-standard assumption on M. Then, it holds that

sup_{μ ∈ P(a,b)} E[ d_b( dgm(Filt(X_μ)), dgm(Filt(X_n)) ) ] ≤ C ( (log n)/n )^{1/b},

where the constant C only depends on a and b.

Under additional technical assumptions, the corresponding lower bound can be shown (up to a logarithmic term) (see Chazal et al. (2014)). By applying stability results, similar consistency results can be easily derived under alternative generative models as soon as a consistent estimator of the support under the Hausdorff metric is known. For instance, from the results of the study by Genovese et al. (2012) about Hausdorff support estimation under additive noise, it can be deduced that the minimax convergence rates for the persistence diagram estimation are faster than (log n)^{−1/2}. Moreover, as soon as a stability result is available for some given representation of persistence, similar consistency results can be directly derived from the consistency for persistence diagrams.

Estimation of the Persistent Homology of Functions

Theorem 7 opens the door to the estimation of the persistent homology of functions defined on R^d, on a submanifold of R^d, or, more generally, on a metric space. The persistent homology of regression functions has also been studied by Bubenik et al. (2010). The alternative approach of Bobrowski et al. (2014), which is based on the inclusion map between nested pairs of estimated level sets, can be applied with kernel density and regression kernel estimators to estimate the persistent homology of density functions and regression functions. Another direction of research on this topic concerns various versions of robust TDA. One solution is to study the persistent homology of the upper-level sets of density estimators (Fasy et al., 2014b). A different approach, more closely related to the distance function but robust to noise, consists in studying the persistent homology of the sublevel sets of the distance to measure defined in Section 4.4 (Chazal et al., 2017).

6.2 Statistics of Persistent Homology Computed on a Point Cloud

For many applications, in particular when the point cloud is not sampled on or close to a geometric shape, persistence diagrams can be quite complex to analyze. In particular, many topological features lie close to the diagonal. Since they correspond to topological structures that die very soon after they appear in the filtration, these points are generally considered as noise (see Figure 13 for an illustration). Confidence regions of persistence diagrams are rigorous answers to the problem of distinguishing between the signal and the noise in these representations.

Figure 13. (A,B) Two persistence diagrams for two configurations of MBP. (C) MDS configuration for the matrix of bottleneck distances. (D) Persistence diagram and confidence region for the persistence diagram of an MBP.

The stability results given in Section 5.7 motivate the use of the bottleneck distance to define confidence regions. However, alternative distances in the spirit of Wasserstein distances can be proposed too. When estimating a persistence diagram dgm with an estimator dgm̂, we typically look for some value η_α such that

P( d_b(dgm̂, dgm) ≥ η_α ) ≤ α,

for α ∈ (0, 1). Let B_α be the closed ball of radius η_α for the bottleneck distance, centered at dgm̂ in the space of persistence diagrams. Following Fasy et al. (2014b), we can visualize the signatures of the points belonging to this ball in various ways. A first option is to center a box of side length 2η_α at each point of the persistence diagram dgm̂. An alternative solution is to visualize the confidence set by adding a band at (vertical) distance η_α/2 from the diagonal (the bottleneck distance being defined for the ℓ_∞ norm) (see Figure 13 for an illustration). The points outside the band are then considered as significant topological features (see Fasy et al. (2014b) for more details).

Several methods have been proposed in the study by Fasy et al. (2014b) to estimate η α in different frameworks. These methods mainly rely on stability results for persistence diagrams; confidence sets for diagrams can be derived from confidence sets in the sample space.

Subsampling Approach

This method is based on a confidence region for the support K of the distribution of the sample in the Hausdorff distance. Let X̃_b be a subsample of size b drawn from the sample X_n, where b = o(n/log n). Let q_b(1 − α) be the (1 − α) quantile of the distribution of Haus(X̃_b, X_n). Take η̂_α ≔ 2 q̂_b(1 − α), where q̂_b(1 − α) is an estimation of q_b(1 − α) obtained by a standard Monte Carlo procedure. Under the (a, b)-standard assumption and for n large enough, Fasy et al. (2014b) showed that the probability that d_b(dgm(Filt(X_μ)), dgm(Filt(X_n))) exceeds η̂_α is asymptotically upper bounded by α.

Bottleneck Bootstrap

The stability results often lead to conservative confidence sets. An alternative strategy is the bottleneck bootstrap introduced in the study by Chazal et al. (2016b). We consider the general setting where a persistence diagram dgm̂ is defined from the observation (X_1, …, X_n) in a metric space. This persistence diagram corresponds to the estimation of an underlying persistence diagram dgm, which can be related, for instance, to the support of the measure or to the sublevel sets of a function related to this distribution (for instance, a density function when the X_i's are in R^d). Let (X_1*, …, X_n*) be a sample from the empirical measure defined from the observations (X_1, …, X_n). Let also dgm̂* be the persistence diagram derived from this sample. We can then take for η_α the quantity η̂_α defined by

P( d_b(dgm̂*, dgm̂) > η̂_α ∣ X_1, …, X_n ) = α.

Note that η ^ α can be easily estimated using Monte Carlo procedures. It has been shown in the study by Chazal et al. (2016b) that the bottleneck bootstrap is valid when computing the sublevel sets of a density estimator.

Bootstrapping Persistent Betti Numbers

As already mentioned, confidence regions based on stability properties of persistence may lead to very conservative confidence regions. Based on the concept of stabilizing statistics Penrose and Yukich (2001), asymptotic normality of persistent Betti numbers has been shown recently by Krebs and Polonik (2019) and Roycraft et al. (2020) under very mild conditions on the filtration and the distribution of the sample cloud. In addition, bootstrap procedures are also shown to be valid in this framework. More precisely, a smoothed bootstrap procedure together with a convenient rescaling of the point cloud seems to be a promising approach for bootstrapping TDA features from point cloud data.

6.3 Statistics for a Family of Persistence Diagrams or Other Representations

Up to now in this section, we have only considered statistics based on a single observed persistence diagram. We now consider a new framework where several persistence diagrams (or other representations) are available, and we are interested in providing central tendency descriptors, confidence regions, and hypothesis tests for topological descriptors built on this family.

6.3.1 Central Tendency for Persistent Homology

Mean and expectations of distributions of diagrams.

The space of persistence diagrams being a general metric space but not a Hilbert space, the definition of a mean persistence diagram is neither obvious nor unique. A first natural approach to defining a central tendency in this context is to consider Fréchet means of distributions of diagrams. Their existence has been proven in the study by Mileyko et al. (2011), and they have also been characterized in the study by Turner et al. (2014a). However, they may not be unique, and they turn out to be difficult to compute in practice. To partly overcome these problems, different approaches have been recently proposed based on numerical optimal transport Lacombe et al. (2018) or linear representations and kernel-based methods Divol and Chazal (2020).

Topological Signatures From Subsamples

Central tendency properties of persistent homology can also be used to compute topological signatures for very large data sets, as an alternative approach to overcome the prohibitive cost of persistence computations. Given a large point cloud, the idea is to extract many subsamples, to compute the persistence landscape for each subsample, and then to combine the information.

For any positive integer m, let X = {x_1, …, x_m} be a sample of m points drawn from a measure μ in a metric space M whose support is denoted by X_μ. We assume that the diameter of X_μ is finite and upper bounded by T/2, where T is the same constant as in the definition of persistence landscapes in Section 5.4. For ease of exposition, we focus on the case k = 1 and set λ(t) = λ(1, t). However, the results we present in this section hold for k > 1. The corresponding persistence landscape (associated with the persistence diagram of the Čech or Vietoris–Rips filtration) is λ_X, and we denote by Ψ_μ^m the measure induced by μ^{⊗m} on the space of persistence landscapes. Note that the persistence landscape λ_X can be seen as a single draw from the measure Ψ_μ^m. The point-wise expectation of the (random) persistence landscape under this measure is defined by E_{Ψ_μ^m}[λ_X(t)], t ∈ [0, T]. The average landscape E_{Ψ_μ^m}[λ_X] has a natural empirical counterpart, which can be used as its unbiased estimator. Let S_1^m, …, S_ℓ^m be ℓ independent samples of size m from μ^{⊗m}. We define the empirical average landscape as

λ̄_ℓ^m(t) = (1/ℓ) Σ_{i=1}^ℓ λ_{S_i^m}(t), t ∈ [0, T],

and propose to use λ̄_ℓ^m to estimate λ_{X_μ}. Note that computing the persistent homology of X_n is O(exp(n)), whereas computing the average landscape is O(ℓ exp(m)).
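The following sketch illustrates this subsampling procedure with the Landscape vectorizer of gudhi.representations; the data set, the parameters m and ell, and the grid [0, 1] are hypothetical choices, the sample_range is fixed so that all landscapes are evaluated on a common grid before averaging, and each subsample is assumed to produce at least one one-dimensional feature.

import numpy as np
import gudhi as gd
from gudhi.representations import Landscape

def average_landscape(data, m=100, ell=30, resolution=1000):
    # Empirical average of the first landscape (k = 1) over ell subsamples of size m
    LS = Landscape(num_landscapes=1, resolution=resolution, sample_range=[0.0, 1.0])
    acc = np.zeros(resolution)
    for _ in range(ell):
        sub = data[np.random.choice(len(data), size=m, replace=False)]
        st = gd.AlphaComplex(points=sub).create_simplex_tree()  # filtration in squared radii
        st.persistence()
        acc += LS.fit_transform([st.persistence_intervals_in_dimension(1)])[0]
    return acc / ell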

Another motivation for this subsampling approach is that it can also be applied when μ is a discrete measure with support X_N = {x_1, …, x_N} lying in a metric space M. This framework is very common in practice, when a continuous (but unknown) measure is approximated by a discrete uniform measure μ_N on X_N.

The average landscape E Ψ μ m [ λ X ] is an interesting quantity on its own, since it carries some stable topological information about the underlying measure μ, from which the data are generated.

Theorem 11. [Chazal et al. (2015a)] Let X ∼ μ^{⊗m} and Y ∼ ν^{⊗m}, where μ and ν are two probability measures on M. For any p ≥ 1, we have

‖ E_{Ψ_μ^m}[λ_X] − E_{Ψ_ν^m}[λ_Y] ‖_∞ ≤ m^{1/p} W_p(μ, ν),

where W_p is the pth Wasserstein distance on M.

The result of Theorem 11 is useful for two reasons. First, it tells us that for a fixed m, the expected “topological behavior” of a set of m points carries some stable information about the underlying measure from which the data are generated. Second, it provides a lower bound for the Wasserstein distance between two measures, based on the topological signature of samples of m points.

6.3.2 Asymptotic Normality

As in the previous section, we consider several persistence diagrams (or other representations). The next step after giving central tendency descriptors of persistent homology is to provide asymptotic normality results for these quantities, together with bootstrap procedures to derive confidence regions. It is of course easier to show such results for functional representations of persistence. Following this strategy, the studies by Chazal et al. (2015b) and Chazal et al. (2015c) propose confidence bands for landscapes from the observation of landscapes λ_1, …, λ_N drawn i.i.d. from a random distribution in the space of landscapes. The asymptotic validity and the uniform convergence of the multiplier bootstrap are shown in this framework. Note that similar results can also be obtained for many representations of persistence, in particular by showing that the corresponding functional spaces are Donsker spaces.

6.4 Other Statistical Approaches to Topological Data Analysis

Statistical approaches for tda are attracting increasing interest, and many other methods have been proposed in recent years or are still the subject of active research, as illustrated in the following non-exhaustive list of examples.

Hypothesis Testing

Several methods have been proposed for hypothesis testing procedures for persistent homology, mostly based on permutation strategies and for two-sample testing. Robinson and Turner (2017) focused on pairwise distances of persistence diagrams, whereas Berry et al. (2020) studied more general functional summaries. Hypothesis tests based on kernel approaches have been proposed in the study by Kusano (2019). A two-stage hypothesis test of filtering and testing for persistence images was also presented in the study by Moon and Lazar (2020).

Persistence Homology Transform

The representations introduced before are all transformations derived from the persistence diagram computed from a fixed filtration built over a data set. The persistence homology transform introduced in the studies by Curry et al. (2018) , Turner et al. (2014b) to study shapes in R d takes a different path by looking at the persistence homology of the sublevel set filtration induced by the projection of the considered shape in each direction in R d . It comes with several interesting properties; in particular, the persistence homology transform is a sufficient statistic for distributions defined on the set of geometric and finite simplicial complexes embedded in R d .

Bayesian Statistics for Topological Data Analysis

A Bayesian approach to persistence diagram inference has been proposed in the study by Maroulas et al. (2020) by viewing a persistence diagram as a sample from a point process. This Bayesian method computes the point process posterior intensity based on a Gaussian mixture intensity for the prior.

6.5 Persistent Homology and Machine Learning

Using tda and, more specifically, persistent homology for machine learning is a subject that attracts a lot of attention and has generated intense research activity. Although the recent progress in this area goes far beyond the scope of this article, we briefly introduce the main research directions with a few references to help the newcomer to the field get started.

Topological Data Analysis for Exploratory Data Analysis and Descriptive Statistics

In some domains, tda can be fruitfully used as a tool for exploratory analysis and visualization. For example, the Mapper algorithm provides a powerful approach to exploring and visualizing the global topological structure of complex data sets. In some cases, persistence diagrams obtained from data can be directly interpreted and exploited for better understanding of the phenomena from which the data have been generated. This is, for example, the case in the study of force fields in granular media ( Kramar et al., 2013 ) or of atomic structures in glass ( Nakamura et al., 2015 ) in material science, in the study of the evolution of convection patterns in fluid dynamics ( Kramár et al., 2016 ), and in machining monitoring ( Khasawneh and Munch, 2016 ) or in the analysis of nanoporous structures in chemistry ( Lee et al., 2017 ) where topological features can be rather clearly related to specific geometric structures and patterns in the considered data.

Persistent Homology for Feature Engineering

There are many other cases where persistence features cannot be easily or directly interpreted but present valuable information for further processing. However, the highly nonlinear nature of diagrams prevents them from being immediately used as standard features in machine learning algorithms.

Persistence landscapes and linear representations of persistence diagrams offer a first option to convert persistence diagrams into elements of a vector space that can be directly used as features in classical machine learning pipelines. This approach has been used, for example, for protein binding ( Kovacev-Nikolic et al., 2016 ), object recognition ( Li et al., 2014 ), or time series analysis. In the same vein, the construction of kernels for persistence diagrams that preserve their stability properties has recently attracted some attention. Most of them have been obtained by considering diagrams as discrete measures in R 2 . Convolving a symmetrized (with respect to the diagonal) version of persistence diagrams with a 2D Gaussian distribution, Reininghaus et al. (2015) introduced a multiscale kernel and applied it to shape classification and texture recognition problems. Considering the Wasserstein distance between projections of persistence diagrams on lines, Carriere et al. (2017) built another kernel and tested its performance on several benchmarks. Other kernels, still obtained by considering persistence diagrams as measures, have also been proposed in the study by Kusano et al. (2017) .
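Several of these kernels are implemented in gudhi.representations; for instance, the sliced Wasserstein kernel of Carriere et al. (2017) can be evaluated as in the following sketch, where the two small diagrams are hypothetical.

import numpy as np
from gudhi.representations import SlicedWassersteinKernel

dgms = [np.array([[0.0, 1.0], [0.2, 0.5]]), np.array([[0.1, 0.8]])]
SW = SlicedWassersteinKernel(num_directions=10, bandwidth=1.0)
K = SW.fit(dgms).transform(dgms)  # Gram matrix, usable in any kernel-based learner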

Various other vector summaries of persistence diagrams have been proposed and then used as features for different problems. For example, basic summaries were considered in the study by Bonis et al. (2016) and combined with quantization and pooling methods to address nonrigid shape analysis problems; Betti curves extracted from persistence diagrams were used with one-dimensional convolutional neural networks (CNNs) to analyze time-dependent data and recognize human activities from inertial sensors in the studies by Dindin et al. (2020) , Umeda (2017) ; persistence images were introduced in the study by Adams et al. (2017) and were considered to address some inverse problems using linear machine learning models in the study by Obayashi et al. (2018) .

The kernels and vector summaries of persistence diagrams mentioned above are built independently of the considered data analysis or learning task. Moreover, it appears that in many cases, the relevant topological information is not carried by the whole persistence diagram but is concentrated in some localized regions that may not be obvious to identify. This usually makes the choice of a relevant kernel or vector summary very difficult for the user. To overcome this issue, various authors have proposed learning approaches that allow us to learn the relevant topological features for a given task. In this direction, Hofer et al. (2017) proposed a deep learning approach to learn the parameters of persistence image representations of persistence diagrams, while Kim et al. (2020) introduced a neural network layer for persistence landscapes. In the study by Carrière et al. (2020a) , the authors introduced a general neural network layer for persistence diagrams that can be either used to learn an appropriate vectorization or directly integrated in a deep neural network architecture. Other methods, inspired from k -means, propose unsupervised methods to vectorize persistence diagrams ( Royer et al., 2021 ; Zieliński et al., 2010 ), some of them coming with theoretical guarantees ( Chazal et al., 2020 ).

Persistent Homology for Machine Learning Architecture Optimization and Model Selection

More recently, tda has seen new developments in machine learning where persistent homology is no longer used for feature engineering but as a tool to design, improve, or select models (see Carlsson and Gabrielsson (2020) , Chen et al. (2019) , Gabrielsson and Carlsson (2019) , Hofer et al. (2019a) , Moor et al. (2020) , Ramamurthy et al. (2019) , Rieck et al. (2019) ). Many of these tools rely on the introduction of loss or regularization functions depending on persistent homology features, raising the problem of their optimization. Building on the powerful tools provided by software libraries such as PyTorch or TensorFlow, practical methods allowing us to encode and optimize a large family of persistence-based functions have been proposed and experimented on ( Poulenard et al., 2018 ; Gabrielsson et al., 2020 ). A general framework for persistence-based function optimization based on stochastic subgradient descent algorithms with convergence guarantees has been recently proposed and implemented in an easy-to-use software tool ( Carriere et al., 2020b ). With a different perspective, another theoretical framework to study the differentiable structure of functions of persistence diagrams has been proposed in the study by Leygonie et al. (2021) .

7 Topological Data Analysis for Data Sciences With the GUDHI Library

In this section, we illustrate TDA methods using the Python library GUDHI 11 ( Maria et al., 2014 ) together with popular libraries such as NumPy ( Walt et al., 2011 ), scikit-learn ( Pedregosa et al., 2011 ), and pandas ( McKinney, 2010 ). This section aims at demonstrating that the topological signatures of TDA can be easily computed and exploited using GUDHI. More illustrations with Python notebooks can be found in the tutorial GitHub 12 of GUDHI.

7.1 Bootstrap and Comparison of Protein Binding Configurations

This example is borrowed from Kovacev-Nikolic et al. (2016). In this article, persistent homology is used to analyze protein binding; more precisely, it compares closed and open forms of the maltose-binding protein (MBP), a large biomolecule consisting of 370 amino acid residues. The analysis is not based on geometric distances in R^3 but on a metric of dynamical distances defined by

D_ij = 1 − |C_ij|,

where C is the correlation matrix between residues. The data can be downloaded at this link 13.

import numpy as np
import gudhi as gd
import pandas as pd
import seaborn as sns

# Load the correlation matrix of the first (closed) MBP configuration
corr_protein_1 = pd.read_csv("mypath/1anf.corr_1.txt", header=None, delim_whitespace=True)

# Dynamical distance matrix D_ij = 1 - |C_ij|
dist_protein_1 = 1 - np.abs(corr_protein_1.values)

# Vietoris-Rips filtration built on the distance matrix
rips_complex_1 = gd.RipsComplex(distance_matrix=dist_protein_1, max_edge_length=1.1)
simplex_tree_1 = rips_complex_1.create_simplex_tree(max_dimension=2)

# Persistence diagram of the filtration
diag_1 = simplex_tree_1.persistence()
gd.plot_persistence_diagram(diag_1)

For comparing persistence diagrams, we use the bottleneck distance. The block of statements given below computes the persistence intervals and the bottleneck distances for zero-dimensional and one-dimensional homology:

# Persistence intervals and bottleneck distance in dimension 0
interv0_1 = simplex_tree_1.persistence_intervals_in_dimension(0)
interv0_2 = simplex_tree_2.persistence_intervals_in_dimension(0)
bot0 = gd.bottleneck_distance(interv0_1, interv0_2)

# Persistence intervals and bottleneck distance in dimension 1
interv1_1 = simplex_tree_1.persistence_intervals_in_dimension(1)
interv1_2 = simplex_tree_2.persistence_intervals_in_dimension(1)
bot1 = gd.bottleneck_distance(interv1_1, interv1_2)

In this way, we can compute the matrix M of bottleneck distances between the fourteen MBPs. Finally, we apply a multidimensional scaling (MDS) method to find a configuration in R^2 that approximately matches the bottleneck distances (see Figure 13C). We use the scikit-learn library for the MDS as follows:

import matplotlib.pyplot as plt
from sklearn import manifold

# M is the 14 x 14 matrix of pairwise bottleneck distances computed above
mds = manifold.MDS(n_components=2, dissimilarity="precomputed")
config = mds.fit(M).embedding_

# The first seven configurations are closed, the last seven are open
plt.scatter(config[0:7, 0], config[0:7, 1], color='red', label="closed")
plt.scatter(config[7:, 0], config[7:, 1], color='blue', label="open")
plt.legend(loc=1)

We now define a confidence band for a diagram using the bottleneck bootstrap approach. We resample over the lines (and columns) of the matrix of distances, and we compute the bottleneck distance between the original persistence diagram and the bootstrapped persistence diagram. We repeat the procedure many times, and finally, we estimate the 95% quantile of this collection of bottleneck distances. We take the value of this quantile to define a confidence band on the original diagram (see Figure 13D). However, such a procedure should be considered with caution because, as far as we know, the validity of the bottleneck bootstrap has not been proven in this framework.
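A possible implementation of this bootstrap, in the spirit of the code above, is sketched below; the helper name, the number of replications, and the use of the band option of plot_persistence_diagram to draw the confidence band are our own choices, not a prescribed recipe.

import numpy as np
import gudhi as gd

def bottleneck_bootstrap_quantile(dist, interv1, n_boot=100, alpha=0.05):
    # (1 - alpha) quantile of the bottleneck distances between the original
    # one-dimensional intervals and those of resampled distance matrices
    n = dist.shape[0]
    boots = []
    for _ in range(n_boot):
        idx = np.random.choice(n, size=n, replace=True)  # resample lines and columns
        st = gd.RipsComplex(distance_matrix=dist[np.ix_(idx, idx)],
                            max_edge_length=1.1).create_simplex_tree(max_dimension=2)
        st.persistence()
        boots.append(gd.bottleneck_distance(interv1, st.persistence_intervals_in_dimension(1)))
    return np.quantile(boots, 1 - alpha)

eta = bottleneck_bootstrap_quantile(dist_protein_1, interv1_1)
gd.plot_persistence_diagram(diag_1, band=2 * eta)  # band of width 2 * eta above the diagonal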

7.2 Classification for Sensor Data

In this experiment, the 3D acceleration of 3 walkers (A, B, and C) has been recorded using the sensor of a smartphone 14. Persistent homology is not sensitive to the choice of axes, so no preprocessing is necessary to align the 3 time series according to the same axes. From these three time series, we have picked, at random, sequences of 8 s in the complete time series, that is, 200 consecutive points of acceleration in R^3. For each walker, we extract 100 time series in this way. The next block of statements computes the persistence of the alpha complex filtration for data_A_sample, one of the 100 time series of acceleration of Walker A.

# Alpha complex filtration on the 200 points of acceleration in R^3
alpha_complex_sample = gd.AlphaComplex(points=data_A_sample)
simplex_tree_sample = alpha_complex_sample.create_simplex_tree(max_alpha_square=0.3)
diag_Alpha = simplex_tree_sample.persistence()

From diag_Alpha, we can then easily compute and plot the persistence landscapes (see Figure 14A). For all 300 time series, we compute the persistence landscapes for dimensions 0 and 1, and we keep the first three landscapes in each dimension. Moreover, each persistence landscape is discretized on 1,000 points. Each time series is thus described by 6,000 topological variables. To predict the walker from these features, we use a random forest (Breiman, 2001), which is known to be efficient in such a high-dimensional setting. We split the data into train and test samples at random several times. We finally obtain an averaged classification accuracy of around 0.95. We can also visualize the most important variables in the random forest (see Figure 14B and the sketch below).
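The whole pipeline can be sketched as follows with gudhi.representations and scikit-learn, where diagrams_dim0 and diagrams_dim1 (two lists of 300 persistence diagrams) and the array labels of walker identities are assumed to have been computed as described above.

import numpy as np
from gudhi.representations import Landscape
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Three landscapes per homological dimension, each discretized on 1,000 points: 6,000 features
LS = Landscape(num_landscapes=3, resolution=1000)
feat = np.concatenate([LS.fit_transform(diagrams_dim0),
                       LS.fit_transform(diagrams_dim1)], axis=1)

X_train, X_test, y_train, y_test = train_test_split(feat, labels, test_size=0.3)
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(clf.score(X_test, y_test))  # test accuracy
# clf.feature_importances_ gives the variable importances shown in Figure 14B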

Figure 14. (A) First three landscapes for zero-homology of the alpha shape filtration defined for a time series of acceleration of Walker A. (B) Variable importances of the landscape coefficients for the classification of walkers. The first 3,000 coefficients correspond to the three landscapes of dimension 0 and the last 3,000 coefficients to the three landscapes of dimension 1. There are 1,000 coefficients per landscape. Note that the first landscape of dimension 0 is always the same using the Rips complex (a trivial landscape), and consequently, the corresponding coefficients have a zero-importance value.

8 Discussion

In this introductory article, we propose an overview of the most standard methods in the field of topological data analysis. We also present the mathematical foundations of TDA, covering its topological, algebraic, geometric, and statistical aspects. The robustness of TDA methods (coordinate invariance and deformation invariance) and the compressed representation of data they offer make them very attractive tools for data analysis, machine learning, and explainable AI. Many applications have been proposed in this direction during the last few years. Finally, TDA constitutes an additional possible approach in the data scientist's toolbox.

Of course, TDA is not suited to address all kinds of problems, and practitioners may face several potential issues when applying TDA methods. On the algorithmic side, computing persistent homology can be time and resource consuming. Even if there is still room for improvement, recent computational advances have made TDA an effective method for data science, thanks to libraries such as GUDHI. Moreover, combining TDA with quantization methods, graph simplification, or dimension reduction methods may reduce the computational cost of the TDA algorithms. Another potential problem we can face with TDA is that returning to the data points to interpret the topological signatures can be tricky, because these signatures correspond to equivalence classes of cycles. This can be a problem when there is a need to identify which part of the point cloud "has created" a given topological signature. TDA is in fact more suited to solving data science problems dealing with a family of point clouds, each point cloud being described by its persistent homology. Finally, the topological and geometric information that can be extracted from the data is not always sufficient on its own for solving a given problem in the data sciences. Combining topological signatures with other types of descriptors is generally a relevant approach.

Today, TDA is an active field of research, at the crossroads of many scientific fields. In particular, there is currently an intense effort to effectively combine machine learning, statistics, and TDA. In this perspective, we believe that there is still a need for statistical results which demonstrate and quantify the interest of these data science approaches based on TDA.

Acknowledgments

We thank Kovacev-Nikolic et al. (2016) for making their data available.

1 https://gudhi.inria.fr/

2 http://www.mrzv.org/software/dionysus/

3 https://bitbucket.org/phat-code/phat

4 https://github.com/DIPHA/dipha

5 https://giotto-ai.github.io/gtda-docs/0.4.0/library.html

6 As an example, if M is a smooth compact submanifold, then reach_0(ϕ) is always positive and is known as the reach of M (Federer, 1959).

7 Recall that the symmetric difference of two sets A and B is the set AΔB = (A \ B) ∪ (B \ A).

8 Notice that as we are considering coefficients in Z_2, here −1 = 1 and thus (−1)^i = 1 for any i.

9 See Villani (2003) for a definition of the Wasserstein distance.

10 We take here the convention that for r < 0, Rips_r(X) = Čech_r(X) = ∅.

11 http://gudhi.gforge.inria.fr/python/latest/

12 https://github.com/GUDHI/TDA-tutorial

13 https://www.researchgate.net/publication/301543862_corr

14 The dataset can be downloaded at the link http://bertrand.michel.perso.math.cnrs.fr/Enseignements/TDA/data_acc

Author Contributions

All authors listed have made a substantial, direct, and intellectual contribution to the work and approved it for publication.

Funding

This work was partly supported by the French ANR Chair in Artificial Intelligence TopAI.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

  • Aamari E., Kim J., Chazal F., Michel B., Rinaldo A., Wasserman L., et al. (2019). Estimating the Reach of a Manifold . Electron. J. Stat. 13 ( 1 ), 1359–1399. 10.1214/19-ejs1551 [ CrossRef ] [ Google Scholar ]
  • Adams H., Emerson T., Kirby M., Neville R., Peterson C., Shipman P., et al. (2017). Persistence Images: a Stable Vector Representation of Persistent Homology . J. Machine Learn. Res. 18 ( 8 ), 1–35. [ Google Scholar ]
  • Anai H., Chazal F., Glisse M., Ike Y., Inakoshi H., Tinarrage R., et al. (2020). “ Dtm-based Filtrations ,” in Topological Data Analysis (Springer; ), 33–66. 10.1007/978-3-030-43408-3_2 [ CrossRef ] [ Google Scholar ]
  • Balakrishna S., Rinaldo A., Sheehy D., Singh A., Wasserman L. A. (2012). Minimax Rates for Homology Inference . J. Machine Learn. Res. - Proc. Track 22 , 64–72. [ Google Scholar ]
  • Berry E., Chen Y.-C., Cisewski-Kehe J., Fasy B. T. (2020). Functional Summaries of Persistence Diagrams . J. Appl. Comput. Topol. 4 ( 2 ), 211–262. 10.1007/s41468-020-00048-w [ CrossRef ] [ Google Scholar ]
  • Biau G., Chazal F., Cohen-Steiner D., Devroye L., Rodriguez C. (2011). A Weighted K-Nearest Neighbor Density Estimate for Geometric Inference . Electron. J. Stat. 5 , 204–237. 10.1214/11-ejs606 [ CrossRef ] [ Google Scholar ]
  • Biscio C. A., Møller J. (2019). The Accumulated Persistence Function, a New Useful Functional Summary Statistic for Topological Data Analysis, with a View to Brain Artery Trees and Spatial point Process Applications . J. Comput. Graphical Stat. 28 , 671–681. 10.1080/10618600.2019.1573686 [ CrossRef ] [ Google Scholar ]
  • Bobrowski O., Mukherjee S., Taylor J. (2014). Topological Consistency via Kernel Estimation . arXiv preprint arXiv:1407.5272. [ Google Scholar ]
  • Boissonnat J.-D., Chazal F., Yvinec M. (2018). Geometric and Topological Inference , Vol. 57 . Cambridge University Press. [ Google Scholar ]
  • Bonis T., Ovsjanikov M., Oudot S., Chazal F. (2016). “ Persistence-based Pooling for Shape Pose Recognition ,” in Proceedings of Computational Topology in Image Context - 6th International Workshop, CTIC 2016, Marseille, France, June 15-17, 2016, 19–29. 10.1007/978-3-319-39441-1_3 [ CrossRef ] [ Google Scholar ]
  • Brécheteau C. (2019). A Statistical Test of Isomorphism between Metric-Measure Spaces Using the Distance-To-A-Measure Signature . Electron. J. Stat. 13 ( 1 ), 795–849. 10.1214/19-ejs1539 [ CrossRef ] [ Google Scholar ]
  • Brécheteau C., Levrard C. (2020). A K-Points-Based Distance for Robust Geometric Inference . Bernoulli 26 ( 4 ), 3017–3050. 10.3150/20-bej1214 [ CrossRef ] [ Google Scholar ]
  • Breiman L. (2001). Random Forests . Machine Learn. 45 ( 1 ), 5–32. 10.1023/a:1010933404324 [ CrossRef ] [ Google Scholar ]
  • Brown A., Bobrowski O., Munch E., Wang B. (2020). Probabilistic Convergence and Stability of Random Mapper Graphs . J. Appl. Comput. Topol. 5 , 99–140. 10.1007/s41468-020-00063-x [ CrossRef ] [ Google Scholar ]
  • Brüel-Gabrielsson R., Nelson B. J., Dwaraknath A., Skraba P., Guibas L. J., Carlsson G. (2019). A Topology Layer for Machine Learning . arXiv preprint arXiv:1905.12200. [ Google Scholar ]
  • Bubenik P., Carlsson G., Kim P. T., Luo Z.-M. (2010). Statistical Topology via morse Theory Persistence and Nonparametric Estimation . Algebraic Methods Stat. Probab. 516 , 75–92. 10.1090/conm/516/10167 [ CrossRef ] [ Google Scholar ]
  • Bubenik P. (2015). Statistical Topological Data Analysis Using Persistence Landscapes . J. Machine Learn. Res. 16 , 77–102. [ Google Scholar ]
  • Buchet M., Chazal F., Dey T. K., Fan F., Oudot S. Y., Wang Y. (2015a). “ Topological Analysis of Scalar fields with Outliers ,” in Proc. Sympos. On Computational Geometry . [ Google Scholar ]
  • Buchet M., Chazal F., Oudot S., Sheehy D. R. (2015b). “ Efficient and Robust Persistent Homology for Measures ,” in Proceedings of the 26th ACM-SIAM symposium on Discrete algorithms. SIAM. 10.1137/1.9781611973730.13 [ CrossRef ] [ Google Scholar ]
  • Cadre B. (2006). Kernel Estimation of Density Level Sets . J. Multivar. Anal. 97 ( 4 ), 999–1023. 10.1016/j.jmva.2005.05.004 [ CrossRef ] [ Google Scholar ]
  • Carlsson G., Gabrielsson R. B. (2020). “ Topological Approaches to Deep Learning ,” in Topological Data Analysis (Springer; ), 119–146. 10.1007/978-3-030-43408-3_5 [ CrossRef ] [ Google Scholar ]
  • Carlsson G. (2009). Topology and Data . Bull. Amer. Math. Soc. 46 ( 2 ), 255–308. 10.1090/s0273-0979-09-01249-x [ CrossRef ] [ Google Scholar ]
  • Carriere M., Chazal F., Glisse M., Ike Y., Kannan H. (2020). A Note on Stochastic Subgradient Descent for Persistence-Based Functionals: Convergence and Practical Aspects . arXiv preprint arXiv:2010.08356. [ Google Scholar ]
  • Carrière M., Chazal F., Ike Y., Lacombe T., Royer M., Umeda Y. (2020). “ Perslay: a Neural Network Layer for Persistence Diagrams and New Graph Topological Signatures ,” in International Conference on Artificial Intelligence and Statistics (PMLR; ), 2786–2796. [ Google Scholar ]
  • Carrière M., Michel B. (2019). Approximation of Reeb Spaces with Mappers and Applications to Stochastic Filters . arXiv preprint arXiv:1912.10742. [ Google Scholar ]
  • Carriere M., Michel B., Oudot S. (2018). Statistical Analysis and Parameter Selection for Mapper . J. Machine Learn. Res. 19 ( 12 ). [ Google Scholar ]
  • Carriere M., Oudot S. (2017). “ Sliced Wasserstein Kernel for Persistence Diagrams ,” in Proceedings of the 34th International Conference on Machine Learning (ICML-17). [ Google Scholar ]
  • Carrière M., Oudot S. (2015). Structure and Stability of the 1-dimensional Mapper . arXiv preprint arXiv:1511.05823. [ Google Scholar ]
  • Carrière M., Rabadán R. (2020). “ Topological Data Analysis of Single-Cell Hi-C Contact Maps ,” in Topological Data Analysis (Springer; ), 147–162. 10.1007/978-3-030-43408-3_6 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Chen D., Guibas L., Jiang X., Sommer C. (2011a). Data-driven Trajectory Smoothing . Proc. ACM SIGSPATIAL GIS. 10.1145/2093973.2094007 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Cohen-Steiner D., Glisse M., Guibas L., Oudot S. (2009a). Proximity of Persistence Modules and Their Diagrams . SCG , 237–246. 10.1145/1542362.1542407 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Cohen-Steiner D., Guibas L. J., Mémoli F., Oudot S. Y. (2009b). Gromov-hausdorff Stable Signatures for Shapes Using Persistence . Comput. Graphics Forum (proc. SGP 2009) 28 , 1393–1403. 10.1111/j.1467-8659.2009.01516.x [ CrossRef ] [ Google Scholar ]
  • Chazal F., Cohen-Steiner D., Lieutier A. (2009d). A Sampling Theory for Compact Sets in Euclidean Space . Discrete Comput. Geom. 41 ( 3 ), 461–479. 10.1007/s00454-009-9144-8 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Cohen-Steiner D., Lieutier A. (2009c). Normal Cone Approximation and Offset Shape Isotopy . Comp. Geom. Theor. Appl. 42 ( 6-7 ), 566–581. 10.1016/j.comgeo.2008.12.002 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Cohen-Steiner D., Lieutier A., Thibert B. (2008). Stability of Curvature Measures . Comput. Graphics Forum (proc. SGP 2009) , 1485–1496. [ Google Scholar ]
  • Chazal F., Cohen-Steiner D., Mérigot Q. (2010). Boundary Measures for Geometric Inference . Found. Comput. Math. 10 , 221–240. 10.1007/s10208-009-9056-2 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Cohen-Steiner D., Mérigot Q. (2011b). Geometric Inference for Probability Measures . Found. Comput. Math. 11 ( 6 ), 733–751. 10.1007/s10208-011-9098-0 [ CrossRef ] [ Google Scholar ]
  • Chazal F., de Silva V., Glisse M., Oudot S. (2016a). The Structure and Stability of Persistence Modules . SpringerBriefs in Mathematics . Springer. 10.1007/978-3-319-42545-0 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Fasy B. T., Lecci F., Michel B., Rinaldo A., Wasserman L. (2014a). Robust Topological Inference: Distance to a Measure and Kernel Distance . To appear in J. Machine Learn. Res. [ Google Scholar ]
  • Chazal F., Fasy B. T., Lecci F., Michel B., Rinaldo A., Wasserman L. (2015a). “ Subsampling Methods for Persistent Homology ,” in Proceedings of the 32nd International Conference on Machine Learning (ICML-15). [ Google Scholar ]
  • Chazal F., Fasy B. T., Lecci F., Rinaldo A., Singh A., Wasserman L. (2013a). On the Bootstrap for Persistence Diagrams and Landscapes . arXiv preprint arXiv:1311.0376. [ Google Scholar ]
  • Chazal F., Fasy B. T., Lecci F., Rinaldo A., Wasserman L. (2015b). Stochastic Convergence of Persistence Landscapes and Silhouettes . J. Comput. Geom. 6 ( 2 ), 140–161. [ Google Scholar ]
  • Chazal F., Glisse M., Labruère C., Michel B. (2014b). Convergence Rates for Persistence Diagram Estimation in Topological Data Analysis . Proceedings of Machine Learning Research. [ Google Scholar ]
  • Chazal F., Guibas L. J., Oudot S. Y., Skraba P. (2013b). Persistence-based Clustering in Riemannian Manifolds . J. ACM (Jacm) 60 ( 6 ), 41. 10.1145/2535927 [ CrossRef ] [ Google Scholar ]
  • Chazal F. (2017). “ High-dimensional Topological Data Analysis ,” in Handbook of Discrete and Computational Geometry . 3rd Edn, Chapter 27 (CRC Press). [ Google Scholar ]
  • Chazal F., Huang R., Sun J. (2015c). Gromov-Hausdorff Approximation of Filamentary Structures Using Reeb-type Graphs . Discrete Comput. Geom. 53 ( 3 ), 621–649. 10.1007/s00454-015-9674-1 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Levrard C., Royer M. (2020). Optimal Quantization of the Mean Measure and Application to Clustering of Measures . arXiv preprint arXiv:2002.01216. [ Google Scholar ]
  • Chazal F., Lieutier A. (2008). Smooth Manifold Reconstruction from Noisy and Non-uniform Approximation with Guarantees . Comput. Geom. 40 ( 2 ), 156–170. 10.1016/j.comgeo.2007.07.001 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Massart P., Michel B. (2016b). Rates of Convergence for Robust Geometric Inference . Electron. J. Statist. 10 , 2243–2286. 10.1214/16-ejs1161 [ CrossRef ] [ Google Scholar ]
  • Chazal F., Oudot S. Y. (2008). “ Towards Persistence-Based Reconstruction in Euclidean Spaces ,” in Proceedings of the Twenty-Fourth Annual Symposium on Computational Geometry, SCG ’08 (New York, NY: ACM), 232–241. 10.1145/1377676.1377719 [ CrossRef ] [ Google Scholar ]
  • Chen C., Ni X., Bai Q., Wang Y. (2019). “ A Topological Regularizer for Classifiers via Persistent Homology ,” in The 22nd International Conference on Artificial Intelligence and Statistics, 2573–2582. [ Google Scholar ]
  • Chen Y.-C., Genovese C. R., Wasserman L. (2015). Density Level Sets: Asymptotics, Inference, and Visualization . arXiv preprint arXiv:1504.05438. [ Google Scholar ]
  • Chen Y.-C., Genovese C. R., Wasserman L. (2017). Density Level Sets: Asymptotics, Inference, and Visualization . J. Am. Stat. Assoc. 112 ( 520 ), 1684–1696. 10.1080/01621459.2016.1228536 [ CrossRef ] [ Google Scholar ]
  • Cohen-Steiner D., Edelsbrunner H., Harer J., Mileyko Y. (2010). Lipschitz Functions Have L P -Stable Persistence . Found. Comput. Math. 10 ( 2 ), 127–139. 10.1007/s10208-010-9060-6 [ CrossRef ] [ Google Scholar ]
  • Cohen-Steiner D., Edelsbrunner H., Harer J. (2005). Stability of Persistence Diagrams . SCG, 263–271. 10.1145/1064092.1064133 [ CrossRef ] [ Google Scholar ]
  • Cuevas A., Rodríguez-Casal A. (2004). On Boundary Estimation . Adv. Appl. Probab. 36 ( 2 ), 340–354. 10.1239/aap/1086957575 [ CrossRef ] [ Google Scholar ]
  • Curry J., Mukherjee S., Turner K. (2018). How many Directions Determine a Shape and Other Sufficiency Results for Two Topological Transforms . arXiv preprint arXiv:1805.09782. [ Google Scholar ]
  • De Silva V., Carlsson G. (2004). “ Topological Estimation Using Witness Complexes ,” in Proceedings of the First Eurographics Conference on Point-Based Graphics, SPBG ’04 (Aire-la-Ville, Switzerland: Eurographics Association), 157–166. [ Google Scholar ]
  • De Silva V., Ghrist R. (2007). Homological Sensor Networks . Notices Am. Math. Soc. 54 ( 1 ). [ Google Scholar ]
  • Devroye L., Wise G. L. (1980). Detection of Abnormal Behavior via Nonparametric Estimation of the Support . SIAM J. Appl. Math. 38 ( 3 ), 480–488. 10.1137/0138038 [ CrossRef ] [ Google Scholar ]
  • Dey T. K., Mémoli F., Wang Y. (2016). “ Multiscale Mapper: Topological Summarization via Codomain Covers ,” in Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms (Society for Industrial and Applied Mathematics; ), 997–1013. 10.1137/1.9781611974331.ch71 [ CrossRef ] [ Google Scholar ]
  • Dey T. K., Mémoli F., Wang Y. (2017). Topological Analysis of Nerves, Reeb Spaces, Mappers, and Multiscale Mappers . Proc. Sympos. Comput. Geom. (SoCG). [ Google Scholar ]
  • Dindin M., Umeda Y., Chazal F. (2020). “ Topological Data Analysis for Arrhythmia Detection through Modular Neural Networks ,” in Canadian Conference on Artificial Intelligence (Springer; ), 177–188. 10.1007/978-3-030-47358-7_17 [ CrossRef ] [ Google Scholar ]
  • Divol V., Chazal F. (2020). The Density of Expected Persistence Diagrams and its Kernel Based Estimation . J. Comput. Geom. 10 ( 2 ), 127–153. [ Google Scholar ]
  • Divol V., Lacombe T. (2020). Understanding the Topology and the Geometry of the Persistence Diagram Space via Optimal Partial Transport . J. Appl. Comput. Topol. 5 ( 1 ), 1–53. 10.1007/s41468-020-00061-z [ CrossRef ] [ Google Scholar ]
  • Edelsbrunner H., Letscher D., Zomorodian A. (2002). Topological Persistence and Simplification . Discrete Comput. Geom. 28 , 511–533. 10.1007/s00454-002-2885-2 [ CrossRef ] [ Google Scholar ]
  • Fasy B. T., Kim J., Lecci F., Maria C. (2014a). Introduction to the R Package Tda . arXiv preprint arXiv:1411.1830. [ Google Scholar ]
  • Fasy B. T., Lecci F., Rinaldo A., Wasserman L., Balakrishnan S., Singh A. (2014b). Confidence Sets for Persistence Diagrams . Ann. Stat. 42 ( 6 ), 2301–2339. 10.1214/14-aos1252 [ CrossRef ] [ Google Scholar ]
  • Federer H. (1959). Curvature Measures . Trans. Amer. Math. Soc. 93 , 418. 10.1090/s0002-9947-1959-0110078-1 [ CrossRef ] [ Google Scholar ]
  • Frosini P. (1992). “ Measuring Shapes by Size Functions ,” in Intelligent Robots and Computer Vision X: Algorithms and Techniques (International Society for Optics and Photonics; ), Vol. 1607 , 122–133. [ Google Scholar ]
  • Gabrielsson R. B., Carlsson G. (2019). “ Exposition and Interpretation of the Topology of Neural Networks ,” in 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA) (IEEE; ), 1069–1076. 10.1109/icmla.2019.00180 [ CrossRef ] [ Google Scholar ]
  • Genovese C. R., Perone-Pacifico M., Verdinelli I., Wasserman L. (2012). Manifold Estimation and Singular Deconvolution under Hausdorff Loss . Ann. Statist. 40 , 941–963. 10.1214/12-aos994 [ CrossRef ] [ Google Scholar ]
  • Ghrist R. (2017). Homological Algebra and Data . preprint. [ Google Scholar ]
  • Grove K. (1993). Critical point Theory for Distance Functions . Proc. Symposia Pure Math. 54 . 10.1090/pspum/054.3/1216630 [ CrossRef ] [ Google Scholar ]
  • Guibas L., Morozov D., Mérigot Q. (2013). Witnessed K-Distance . Discrete Comput. Geom. 49 , 22–45. 10.1007/s00454-012-9465-x [ CrossRef ] [ Google Scholar ]
  • Hatcher A. (2001). Algebraic Topology . Cambridge Univ. Press. [ Google Scholar ]
  • Hensel F., Moor M., Rieck B. (2021). A Survey of Topological Machine Learning Methods . Front. Artif. Intell. 4 , 52. 10.3389/frai.2021.681108 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Hofer C. D., Kwitt R., Niethammer M. (2019b). Learning Representations of Persistence Barcodes . J. Machine Learn. Res. 20 ( 126 ), 1–45. [ Google Scholar ]
  • Hofer C., Kwitt R., Niethammer M., Dixit M. (2019a). “ Connectivity-optimized Representation Learning via Persistent Homology ,”in Proceedings of the 36th International Conference on Machine Learning Vol. 97 . Proceedings of Machine Learning Research (PMLR), 2751–2760. [ Google Scholar ]
  • Hofer C., Kwitt R., Niethammer M., Uhl A. (2017). Deep Learning with Topological Signatures . arXiv preprint arXiv:1707.04041. [ Google Scholar ]
  • Khasawneh F. A., Munch E. (2016). Chatter Detection in Turning Using Persistent Homology . Mech. Syst. Signal Process. 70-71 , 527–541. 10.1016/j.ymssp.2015.09.046 [ CrossRef ] [ Google Scholar ]
  • Kim K., Kim J., Zaheer M., Kim J., Chazal F., Wasserman L. (2020). “ Pllay: Efficient Topological Layer Based on Persistence Landscapes ,” in 34th Conference on Neural Information Processing Systems (NeurIPS; ) [ Google Scholar ]
  • Kovacev-Nikolic V., Bubenik P., Nikolić D., Heo G. (2016). Using Persistent Homology and Dynamical Distances to Analyze Protein Binding . Stat. Appl. Genet. Mol. Biol. 15 ( 1 ), 19–38. 10.1515/sagmb-2015-0057 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kramar M., Goullet A., Kondic L., Mischaikow K. (2013). Persistence of Force Networks in Compressed Granular media . Phys. Rev. E Stat. Nonlin Soft Matter Phys. 87 ( 4 ), 042207. 10.1103/PhysRevE.87.042207 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Kramár M., Levanger R., Tithof J., Suri B., Xu M., Paul M., et al. (2016). Analysis of Kolmogorov Flow and Rayleigh-Bénard Convection Using Persistent Homology . Physica D: Nonlinear Phenomena 334 , 82–98. 10.1016/j.physd.2016.02.003 [ CrossRef ] [ Google Scholar ]
  • Krebs J. T., Polonik W. (2019). On the Asymptotic Normality of Persistent Betti Numbers . arXiv preprint arXiv:1903.03280. [ Google Scholar ]
  • Kusano G., Fukumizu K., Hiraoka Y. (2017). Kernel Method for Persistence Diagrams via Kernel Embedding and Weight Factor . arXiv preprint arXiv:1706.03472. [ Google Scholar ]
  • Kusano G., Hiraoka Y., Fukumizu K. (2016). “ Persistence Weighted Gaussian Kernel for Topological Data Analysis ,” in International Conference on Machine Learning. 2004–2013. [ Google Scholar ]
  • Kusano G. (2019). On the Expectation of a Persistence Diagram by the Persistence Weighted Kernel . Jpn. J. Indust. Appl. Math. 36 ( 3 ), 861–892. 10.1007/s13160-019-00374-2 [ CrossRef ] [ Google Scholar ]
  • Lacombe T., Cuturi M., Oudot S. (2018). Large Scale Computation of Means and Clusters for Persistence Diagrams Using Optimal Transport . NeurIPS. [ Google Scholar ]
  • Lee Y., Barthel S. D., Dłotko P., Moosavi S. M., Hess K., Smit B. (2017). Quantifying Similarity of Pore-Geometry in Nanoporous Materials . Nat. Commun. 8 , 15396. 10.1038/ncomms15396 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Leygonie J., Oudot S., Tillmann U. (2019). A Framework for Differential Calculus on Persistence Barcodes . arXiv preprint arXiv:1910.00960. [ Google Scholar ]
  • Li C., Ovsjanikov M., Chazal F. (2014). “ Persistence-based Structural Recognition ,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003–2010. 10.1109/cvpr.2014.257 [ CrossRef ] [ Google Scholar ]
  • Li M. Z., Ryerson M. S., Balakrishnan H. (2019). Topological Data Analysis for Aviation Applications . Transportation Res. E: Logistics Transportation Rev. 128 , 149–174. 10.1016/j.tre.2019.05.017 [ CrossRef ] [ Google Scholar ]
  • Lum P. Y., Singh G., Lehman A., Ishkanov T., Vejdemo-Johansson M., Alagappan M., et al. (2013). Extracting Insights from the Shape of Complex Data Using Topology . Sci. Rep. 3 , 1236. 10.1038/srep01236 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Maria C., Boissonnat J.-D., Glisse M., Yvinec M. (2014). “ The Gudhi Library: Simplicial Complexes and Persistent Homology ,” in International Congress on Mathematical Software (Springer; ), 167–174. 10.1007/978-3-662-44199-2_28 [ CrossRef ] [ Google Scholar ]
  • Maroulas V., Nasrin F., Oballe C. (2020). A Bayesian Framework for Persistent Homology . SIAM J. Math. Data Sci. 2 ( 1 ), 48–74. 10.1137/19m1268719 [ CrossRef ] [ Google Scholar ]
  • McKinney W. (2010). “ Data Structures for Statistical Computing in python ,” in Proceedings of the 9th Python in Science Conference (TX: SciPy Austin; ), Vol. 445 , 51–56. 10.25080/majora-92bf1922-00a [ CrossRef ] [ Google Scholar ]
  • Mileyko Y., Mukherjee S., Harer J. (2011). Probability Measures on the Space of Persistence Diagrams . Inverse Probl. 27 ( 12 ), 124007. 10.1088/0266-5611/27/12/124007 [ CrossRef ] [ Google Scholar ]
  • Moon C., Lazar N. A. (2020). Hypothesis Testing for Shapes Using Vectorized Persistence Diagrams . arXiv preprint arXiv:2006.05466. [ Google Scholar ]
  • Moor M., Horn M., Rieck B., Borgwardt K. (2020). “ Topological Autoencoders ,” in International Conference on Machine Learning (PMLR; ), 7045–7054. [ Google Scholar ]
  • Nakamura T., Hiraoka Y., Hirata A., Escolar E. G., Nishiura Y. (2015). Persistent Homology and many-body Atomic Structure for Medium-Range Order in the Glass . Nanotechnology 26 ( 30 ), 304001. 10.1088/0957-4484/26/30/304001 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Niyogi P., Smale S., Weinberger S. (2011). A Topological View of Unsupervised Learning from Noisy Data . SIAM J. Comput. 40 ( 3 ), 646–663. 10.1137/090762932 [ CrossRef ] [ Google Scholar ]
  • Niyogi P., Smale S., Weinberger S. (2008). Finding the Homology of Submanifolds with High Confidence from Random Samples . Discrete Comput. Geom. 39 ( 1-3 ), 419–441. 10.1007/s00454-008-9053-2 [ CrossRef ] [ Google Scholar ]
  • Obayashi I., Hiraoka Y. (2017). Persistence Diagrams with Linear Machine Learning Models . arXiv preprint arXiv:1706.10082. [ Google Scholar ]
  • Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., et al. (2011). Scikit-Learn: Machine Learning in Python . J. Machine Learn. Res. 12 ( Oct ), 2825–2830. [ Google Scholar ]
  • Penrose M. D., Yukich J. E. (2001). Central Limit Theorems for Some Graphs in Computational Geometry . Ann. Appl. Probab. 11 , 1005–1041. 10.1214/aoap/1015345393 [ CrossRef ] [ Google Scholar ]
  • Petrunin A. (2007). “ Applied Manifold Geometry ,” in Surveys in Differential Geometry (Somerville, MA: Int. Press; ), Vol. XI , 137–483. 10.1142/9789812770721_0003 [ CrossRef ] [ Google Scholar ]
  • Phillips J. M., Wang B., Zheng Y. (2014). Geometric Inference on Kernel Density Estimates . arXiv preprint 1307.7760. [ Google Scholar ]
  • Pike J. A., Khan A. O., Pallini C., Thomas S. G., Mund M., Ries J., et al. (2020). Topological Data Analysis Quantifies Biological Nano-Structure from Single Molecule Localization Microscopy . Bioinformatics 36 ( 5 ), 1614–1621. 10.1093/bioinformatics/btz788 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Polonik W. (1995). Measuring Mass Concentrations and Estimating Density Contour Clusters-An Excess Mass Approach . Ann. Stat. 23 , 855–881. 10.1214/aos/1176324626 [ CrossRef ] [ Google Scholar ]
  • Poulenard A., Skraba P., Ovsjanikov M. (2018). Topological Function Optimization for Continuous Shape Matching . Comput. Graphics Forum 37 , 13–25. 10.1111/cgf.13487 [ CrossRef ] [ Google Scholar ]
  • Qaiser T., Tsang Y.-W., Taniyama D., Sakamoto N., Nakane K., Epstein D., et al. (2019). Fast and Accurate Tumor Segmentation of Histology Images Using Persistent Homology and Deep Convolutional Features . Med. image Anal. 55 , 1–14. 10.1016/j.media.2019.03.014 [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Ramamurthy K. N., Varshney K., Mody K. (2019). “ Topological Data Analysis of Decision Boundaries with Application to Model Selection ,” in International Conference on Machine Learning (PMLR; ), 5351–5360. [ Google Scholar ]
  • Reininghaus J., Huber S., Bauer U., Kwitt R. (2015). “ A Stable Multi-Scale Kernel for Topological Machine Learning ,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4741–4748. 10.1109/cvpr.2015.7299106 [ CrossRef ] [ Google Scholar ]
  • Rieck B. A., Togninalli M., Bock C., Moor M., Horn M., Gumbsch T., Borgwardt K. (2019). “ Neural Persistence: A Complexity Measure for Deep Neural Networks Using Algebraic Topology ,” in International Conference on Learning Representations (ICLR 2019) (OpenReview; ) [ Google Scholar ]
  • Rieck B., Yates T., Bock C., Borgwardt K., Wolf G., Turk-Browne N., et al. (2020). Uncovering the Topology of Time-Varying Fmri Data Using Cubical Persistence . Adv. Neural Inf. Process. Syst. 33 . [ Google Scholar ]
  • Robins V. (1999). Towards Computing Homology from Finite Approximations . Topology Proc. 24 , 503–532. [ Google Scholar ]
  • Robinson A., Turner K. (2017). Hypothesis Testing for Topological Data Analysis . J. Appl. Comput. Topol. 1 ( 2 ), 241–261. 10.1007/s41468-017-0008-7 [ CrossRef ] [ Google Scholar ]
  • Roycraft B., Krebs J., Polonik W. (2020). Bootstrapping Persistent Betti Numbers and Other Stabilizing Statistics . arXiv preprint arXiv:2005.01417 [ Google Scholar ]
  • Royer M., Chazal F., Levrard C., Ike Y., Umeda Y. (2021). “ Atol: Measure Vectorisation for Automatic Topologically-Oriented Learning ,” in International Conference on Artificial Intelligence and Statistics (PMLR; ) [ Google Scholar ]
  • Barannikov S. (1994). “ The Framed morse Complex and its Invariants ,” in Adv. Soviet Math. , Vol. 21 . Providence, RI: Amer. Math. Soc., 93–115. 10.1090/advsov/021/03 [ CrossRef ] [ Google Scholar ]
  • Seversky L. M., Davis S., Berger M. (2016). “ On Time-Series Topological Data Analysis: New Data and Opportunities ,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 59–67. 10.1109/cvprw.2016.131 [ CrossRef ] [ Google Scholar ]
  • Singh A., Scott C., Nowak R. (2009). Adaptive Hausdorff Estimation of Density Level Sets . Ann. Statist. 37 ( 5B ), 2760–2782. 10.1214/08-aos661 [ CrossRef ] [ Google Scholar ]
  • Singh G., Mémoli F., Carlsson G. E. (2007). “ Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition ,” in Eurographics Symposium on Point-Based Graphics (SPBG), 91–100. [ Google Scholar ]
  • Sizemore A. E., Phillips-Cremins J. E., Ghrist R., Bassett D. S. (2019). The Importance of the Whole: Topological Data Analysis for the Network Neuroscientist . Netw. Neurosci. 3 ( 3 ), 656–673. 10.1162/netn_a_00073 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Skraba P., Ovsjanikov M., Chazal F., Guibas L. (2010). “ Persistence-based Segmentation of Deformable Shapes ,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 (IEEE Computer Society Conference on; ), 45–52. 10.1109/cvprw.2010.5543285 [ CrossRef ] [ Google Scholar ]
  • Smith A. D., Dłotko P., Zavala V. M. (2021). Topological Data Analysis: Concepts, Computation, and Applications in Chemical Engineering . Comput. Chem. Eng. 146 , 107202. 10.1016/j.compchemeng.2020.107202 [ CrossRef ] [ Google Scholar ]
  • Tsybakov A. B. (1997). On Nonparametric Estimation of Density Level Sets . Ann. Stat. 25 ( 3 ), 948–969. 10.1214/aos/1069362732 [ CrossRef ] [ Google Scholar ]
  • Turner K., Mileyko Y., Mukherjee S., Harer J. (2014a). Fréchet Means for Distributions of Persistence Diagrams . Discrete Comput. Geom. 52 ( 1 ), 44–70. 10.1007/s00454-014-9604-7 [ CrossRef ] [ Google Scholar ]
  • Turner K., Mukherjee S., Boyer D. M. (2014b). Persistent Homology Transform for Modeling Shapes and Surfaces . Inf. Inference 3 ( 4 ), 310–344. 10.1093/imaiai/iau011 [ CrossRef ] [ Google Scholar ]
  • Umeda Y. (2017). Time Series Classification via Topological Data Analysis . Trans. Jpn. Soc. Artif. Intell. 32 ( 3 ), D–G72_1. 10.1527/tjsai.d-g72 [ CrossRef ] [ Google Scholar ]
  • van der Walt S., Colbert S. C., Varoquaux G. (2011). The Numpy Array: a Structure for Efficient Numerical Computation . Comput. Sci. Eng. 13 ( 2 ), 22–30. 10.1109/mcse.2011.37 [ CrossRef ] [ Google Scholar ]
  • Villani C. (2003). Topics in Optimal Transportation . American Mathematical Society. [ Google Scholar ]
  • Wasserman L. (2018). Topological Data Analysis . Annu. Rev. Stat. Appl. 5 , 501–532. 10.1146/annurev-statistics-031017-100045 [ CrossRef ] [ Google Scholar ]
  • Yao Y., Sun J., Huang X., Bowman G. R., Singh G., Lesnick M., et al. (2009). Topological Methods for Exploring Low-Density States in Biomolecular Folding Pathways . J. Chem. Phys. 130 ( 14 ), 144115. 10.1063/1.3103496 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
  • Zieliński B., Lipiński M., Juda M., Zeppelzauer M., Dłotko P. (2019). “ Persistence Bag-Of-Words for Topological Data Analysis ,” in Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19). [ Google Scholar ]
  • Zomorodian A., Carlsson G. (2005). Computing Persistent Homology . Discrete Comput. Geom. 33 ( 2 ), 249–274. 10.1007/s00454-004-1146-y [ CrossRef ] [ Google Scholar ]


Adams , H. , and Coauthors , 2017 : Persistence images: A vector representation of persistent homology . J. Mach. Learn. Res. , 18 ( 8 ), 1 – 35 , https://www.jmlr.org/papers/volume18/16-337/16-337.pdf ; Corrigendum, https://jmlr.org/papers/volume18/16-337/pi_erratum.pdf .


Boser , B. E. , I. M. Guyon , and V. N. Vapnik , 1992 : A training algorithm for optimal margin classifiers. Proc. Fifth Annual Workshop on Computational Learning Theory: COLT’92 , Pittsburgh, PA, ACM, 144–152, https://doi.org/10.1145/130385.130401 .

Brenowitz , N. D. , T. Beucler , M. Pritchard , and C. S. Bretherton , 2020 : Interpreting and stabilizing machine-learning parametrizations of convection . J. Atmos. Sci. , 77 , 4357 – 4375 , https://doi.org/10.1175/JAS-D-20-0082.1 .

Bubenik , P. , 2015 : Statistical topological data analysis using persistence landscapes . J. Mach. Learn. Res. , 16 , 77 – 102 .

Carlsson , G. , 2009 : Topology and data . Bull. Amer. Math. Soc. , 46 , 255 – 308 , https://doi.org/10.1090/S0273-0979-09-01249-X .

Carlsson , G. , and A. Zomorodian , 2009 : The theory of multidimensional persistence . Discrete Comput. Geom. , 42 , 71 – 93 , https://doi.org/10.1007/s00454-009-9176-0 .

Carlsson , G. , G. Singh , and A. Zomorodian , 2009 : Computing multidimensional persistence. International Symposium on Algorithms and Computation , Lecture Notes in Computer Science, Vol. 5878, Springer, 730–739, https://doi.org/10.1007/978-3-642-10631-6_74 .

Cerri , A. , B. D. Fabio , M. Ferri , P. Frosini , and C. Landi , 2013 : Betti numbers in multidimensional persistent homology are stable functions . Math. Methods Appl. Sci. , 36 , 1543 – 1557 , https://doi.org/10.1002/mma.2704 .

Chung , M. K. , P. Bubenik , and P. T. Kim , 2009 : Persistence diagrams of cortical surface data. International Conference on Information Processing in Medical Imaging , Lecture Notes in Computer Science, Vol. 5636, Springer, 386–397, https://doi.org/10.1007/978-3-642-02498-6_32 .

Clough , J. , N. Byrne , I. Oksuz , V. A. Zimmer , J. A. Schnabel , and A. King , 2020 : A topological loss function for deep-learning based image segmentation using persistent homology . IEEE Trans. Pattern Anal. Mach. Intell. , 44 , 8766 – 8778 , https://doi.org/10.1109/TPAMI.2020.3013679 .

Cohen-Steiner , D. , H. Edelsbrunner , and D. Morozov , 2006 : Vines and vineyards by updating persistence in linear time. Proc. 22nd Annual Symp. on Computational Geometry , Sedona, AZ, ACM, 119–126, https://doi.org/10.1145/1137856.1137877 .

Denby , L. , 2020 : Discovering the importance of mesoscale cloud organization through unsupervised classification . Geophys. Res. Lett. , 47 , e2019GL085190 , https://doi.org/10.1029/2019GL085190 .

Deshmukh , V. , S. Baskar , E. Bradley , T. Berger , and J. D. Meiss , 2022 : Machine learning approaches to solar-flare forecasting: Is complex better? arXiv, 2202.08776v1, https://doi.org/10.48550/arXiv.2202.08776 .

Ebert-Uphoff , I. , and K. Hilburn , 2020 : Evaluation, tuning, and interpretation of neural networks for working with images in meteorological applications . Bull. Amer. Meteor. Soc. , 101 , E2149 – E2170 , https://doi.org/10.1175/BAMS-D-20-0097.1 .

Edelsbrunner , H. , and J. L. Harer , 2010 : Computational Topology: An Introduction . American Mathematical Society, 294 pp.

Fletcher , S. , and M. Z. Islam , 2018 : Comparing sets of patterns with the Jaccard index . Australas. J. Inf. Syst. , 22 , https://doi.org/10.3127/ajis.v22i0.1538 .

Gagne , D. J. , II , S. E. Haupt , D. W. Nychka , and G. Thompson , 2019 : Interpretable deep learning for spatial analysis of severe hailstorms . Mon. Wea. Rev. , 147 , 2827 – 2845 , https://doi.org/10.1175/MWR-D-18-0316.1 .

Gardner , R. J. , E. Hermansen , M. Pachitariu , Y. Burak , N. A. Baas , B. A. Dunn , M.-B. Moser , and E. I. Moser , 2022 : Toroidal topology of population activity in grid cells . Nature , 602 , 123 – 128 , https://doi.org/10.1038/s41586-021-04268-7 .

Gentine , P. , M. Pritchard , S. Rasp , G. Reinaudi , and G. Yacalis , 2018 : Could machine learning break the convection parameterization deadlock? Geophys. Res. Lett. , 45 , 5742 – 5751 , https://doi.org/10.1029/2018GL078202 .

Ghrist , R. , 2008 : Barcodes: The persistent topology of data . Bull. Amer. Math. Soc. , 45 , 61 – 76 , https://doi.org/10.1090/S0273-0979-07-01191-3 .

Gumley , L. , J. Descloitres , and J. Schmaltz , 2010 : Creating reprojected true color MODIS images: A tutorial. Space Science and Engineering Center Tech. Rep., 17 pp., https://cdn.earthdata.nasa.gov/conduit/upload/946/MODIS_True_Color.pdf .

ITU-R , 2011 : Studio encoding parameters of digital television for standard 4:3 and wide-screen 16:9 aspect ratios. International Telecommunication Union Tech. Rep. BT.601, 20 pp., https://www.itu.int/dms_pubrec/itu-r/rec/bt/R-REC-BT.601-7-201103-I!!PDF-E.pdf .

Jaccard , P. , 1901 : Étude comparative de la distribution florale dans une portion des alpes et du jura . Bull. Soc. Vaud. Sci. Nat. , 37 , 547 – 579 , https://doi.org/10.5169/seals-266450 .

Kim , H. , and C. Vogel , 2019 : Deciphering active wildfires in the southwestern USA using topological data analysis . Climate , 7 , 135 , https://doi.org/10.3390/cli7120135 .

Kramár , M. , R. Levanger , J. Tithof , B. Suri , M. Xu , M. Paul , M. F. Schatz , and K. Mischaikow , 2016 : Analysis of Kolmogorov flow and Rayleigh–Bénard convection using persistent homology . Physica D , 334 , 82 – 98 , https://doi.org/10.1016/j.physd.2016.02.003 .

Krasnopolsky , V. M. , M. S. Fox-Rabinovitz , and D. V. Chalikov , 2005 : New approach to calculation of atmospheric model physics: Accurate and fast neural network emulation of longwave radiation in a climate model . Mon. Wea. Rev. , 133 , 1370 – 1383 , https://doi.org/10.1175/MWR2923.1 .

Lawson , P. , A. B. Sholl , J. Brown , B. T. Fasy , and C. Wenk , 2019 : Persistent homology for the quantitative evaluation of architectural features in prostate cancer histology . Sci. Rep. , 9 , 1139 , https://doi.org/10.1038/s41598-018-36798-y .

L’Ecuyer , T. S. , and Coauthors , 2015 : The observed state of the energy budget in the early twenty-first century . J. Climate , 28 , 8319 – 8346 , https://doi.org/10.1175/JCLI-D-14-00556.1 .

Maria , C. , J.-D. Boissonnat , M. Glisse , and M. Yvinec , 2014 : The Gudhi library: Simplicial complexes and persistent homology. Mathematical Software—ICMS 2014 , Lecture Notes in Computer Science, Vol. 8592, Springer, 167–174.

McGovern , A. , R. Lagerquist , D. J. Gagne , G. E. Jergensen , K. L. Elmore , C. R. Homeyer , and T. Smith , 2019 : Making the black box more transparent: Understanding the physical implications of machine learning . Bull. Amer. Meteor. Soc. , 100 , 2175 – 2199 , https://doi.org/10.1175/BAMS-D-18-0195.1 .

McGovern , A. , I. Ebert-Uphoff , D. J. Gagne , and A. Bostrom , 2022 : Why we need to focus on developing ethical, responsible, and trustworthy artificial intelligence approaches for environmental science . Environ. Data Sci. , 1 , e6 , https://doi.org/10.1017/eds.2022.5 .

Merritt , R. B. , 2021 : Visualizing planetary Rossby waves with topological data analysis. M.S. thesis, Dept. of Statistics, University of Georgia, 53 pp., https://esploro.libs.uga.edu/esploro/outputs/9949375354302959 .

Mileyko , Y. , S. Mukherjee , and J. Harer , 2011 : Probability measures on the space of persistence diagrams . Inverse Probl. , 27 , 124007 , https://doi.org/10.1088/0266-5611/27/12/124007 .

Moroni , D. , and M. A. Pascali , 2021 : Learning topology: Bridging computational topology and machine learning . Pattern Recognit. Image Anal. , 31 , 443 – 453 , https://doi.org/10.1134/S1054661821030184 .

Muszynski , G. , K. Kashinath , V. Kurlin , M. Wehner , and Prabhat , 2019 : Topological data analysis and machine learning for recognizing atmospheric river patterns in large climate datasets . Geosci. Model Dev. , 12 , 613 – 628 , https://doi.org/10.5194/gmd-12-613-2019 .

Ofori-Boateng , D. , H. Lee , K. M. Gorski , M. J. Garay , and Y. R. Gel , 2021 : Application of topological data analysis to multi-resolution matching of aerosol optical depth maps . Front. Environ. Sci. , 9 , 684716 , https://doi.org/10.3389/fenvs.2021.684716 .

Rasp , S. , M. S. Pritchard , and P. Gentine , 2018 : Deep learning to represent subgrid processes in climate models . Proc. Natl. Acad. Sci. USA , 115 , 9684 – 9689 , https://doi.org/10.1073/pnas.1810286115 .

Rasp , S. , H. Schulz , S. Bony , and B. Stevens , 2020 : Combining crowdsourcing and deep learning to explore the mesoscale organization of shallow convection . Bull. Amer. Meteor. Soc. , 101 , E1980 – E1995 , https://doi.org/10.1175/BAMS-D-19-0324.1 .

Reininghaus , J. , S. Huber , U. Bauer , and R. Kwitt , 2015 : A stable multi-scale kernel for topological machine learning. Proc. IEEE Conf. on Computer Vision and Pattern Recognition , Boston, MA, Institute of Electrical and Electronics Engineers, 4741–4748, https://doi.org/10.1109/CVPR.2015.7299106 .

RIVET Developers , 2020 : RIVET. RIVET Developers, https://github.com/rivetTDA/rivet/ .

Rudin , C. , 2019 : Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead . Nat. Mach. Intell. , 1 , 206 – 215 , https://doi.org/10.1038/s42256-019-0048-x .

Schmit , T. J. , P. Griffith , M. M. Gunshor , J. M. Daniels , S. J. Goodman , and W. J. Lebair , 2017 : A closer look at the ABI on the GOES-R series . Bull. Amer. Meteor. Soc. , 98 , 681 – 698 , https://doi.org/10.1175/BAMS-D-15-00230.1 .

Schultz , M. , C. Betancourt , B. Gong , F. Kleinert , M. Langguth , L. Leufen , A. Mozaffari , and S. Stadtler , 2021 : Can deep learning beat numerical weather prediction? Philos. Trans. Roy. Soc. , A379 , 20200097 , https://doi.org/10.1098/rsta.2020.0097 .

Schwartz , R. , J. Dodge , N. A. Smith , and O. Etzioni , 2020 : Green AI . Commun. ACM , 63 , 54 – 63 , https://doi.org/10.1145/3381831 .

Segovia-Dominguez , I. , Z. Zhen , R. Wagh , H. Lee , and Y. R. Gel , 2021 : TLife-LSTM: Forecasting future COVID-19 progression with topological signatures of atmospheric conditions. Advances in Knowledge Discovery and Data Mining , Lecture Notes in Computer Science, Vol. 12712, Springer, 201–212.

Sena , C. Á. P. , J. A. R. da Paixão , and J. R. de Almeida França , 2021 : A topological data analysis approach for retrieving local climate zones patterns in satellite data . Environ. Challenges , 5 , 100359 , https://doi.org/10.1016/j.envc.2021.100359 .

Stevens , B. , and Coauthors , 2020 : Sugar, gravel, fish and flowers: Mesoscale cloud patterns in the trade winds . Quart. J. Roy. Meteor. Soc. , 146 , 141 – 152 , https://doi.org/10.1002/qj.3662 .

Strommen , K. , M. Chantry , J. Dorrington , and N. Otter , 2023 : A topological perspective on weather regimes . Climate Dyn. , https://doi.org/10.1007/s00382-022-06395-x , in press.

Sun , H. , W. Manchester , and Y. Chen , 2021 : Improved and interpretable solar flare predictions with spatial and topological features of the polarity inversion line masked magnetograms . Space Wea. , 19 , e2021SW002837 , https://doi.org/10.1029/2021SW002837 .

Topaz , C. M. , L. Ziegelmeier , and T. Halverson , 2015 : Topological data analysis of biological aggregation models . PLOS ONE , 10 , e0126383 , https://doi.org/10.1371/journal.pone.0126383 .

Tymochko , S. , E. Munch , J. Dunion , K. Corbosiero , and R. Torn , 2020 : Using persistent homology to quantify a diurnal cycle in hurricanes . Pattern Recognit. Lett. , 133 , 137 – 143 , https://doi.org/10.1016/j.patrec.2020.02.022 .

Xu , J. , W. Zhou , Z. Fu , H. Zhou , and L. Li , 2021 : A survey on green deep learning . arXiv, 2111.05193v2, https://doi.org/10.48550/arXiv.2111.05193 .

Yuval , J. , and P. A. O’Gorman , 2020 : Stable machine-learning parameterization of subgrid processes for climate modeling at a range of resolutions . Nat. Commun. , 11 , 3295 , https://doi.org/10.1038/s41467-020-17142-3 .

Zhu , X. X. , D. Tuia , L. Mou , G.-S. Xia , L. Zhang , F. Xu , and F. Fraundorfer , 2017 : Deep learning in remote sensing: A comprehensive review and list of resources . IEEE Geosci. Remote Sens. Mag. , 5 , 8 – 36 , https://doi.org/10.1109/MGRS.2017.2762307 .

Two different ways to extract desired information from imagery: (a) the pure ML approach, in which image information is extracted by using a complex ML model, typically a deep neural network, and (b) the TDA approach, in which image information is extracted by using TDA followed by a simpler ML model/method. The latter can lead to more transparent and computationally efficient approaches.

Examples of the four cloud types from the sugar, flowers, fish, and gravel dataset from Rasp et al. (2020) . Note that Rasp et al. (2020) use the term “flower,” whereas we follow Stevens et al. (2020) in referring to this type as the plural “flowers.” Image credit: Figure 1 in Rasp et al. (2020) , © American Meteorological Society. Used with permission.

Three shapes that each have the same homology—a single connected component, a single one-dimensional hole, and no higher-dimensional holes.

(a) A grayscale image ranging between intensity 0 for black and 255 for white, and four superlevelsets from (a) at cutoff values (b) 209, (c) 151, (d) 94, and (e) 10. The pixels included in the superlevelsets are colored white. (f) The persistence barcode for (a), with vertical lines from left to right indicating the intensities corresponding to the four superlevelsets in (b)–(e). (g) The equivalent persistence diagram. In the barcode, diagram, and landscape, red elements (bars, points, and lines, respectively) indicate connected components ( H 0 features) and blue elements indicate one-dimensional holes ( H 1 features).

Examples of deformations that all result in the same persistence barcode: (a) the original image and the transformations: (b) scaling, (c) rotation, (d) translation, (e) uneven scaling, (f) shearing, and (g),(h) general homeomorphisms.

The process of computing a persistence landscape from a barcode. Beginning with (a) the barcode (which already has the infinite bar removed), we (b) raise a “mountain” above each bar to obtain the “mountain range.” Our first persistence landscape function is the piecewise-linear function that follows the highest edges of the mountain range in (b) [shown as the solid line in (c)], and (c) the next function is obtained by deleting in (b) the lines corresponding to this first function and then finding the piecewise-linear function that follows the highest edges of this modified mountain range in (b) (the dotted line). Further landscape functions are computed similarly by deleting previous functions in (b) and tracing along the highest remaining edges. In (c), only the first three landscape functions are shown.

(a) A 32 × 32 sample image from the grayscale MODIS imagery used in the sugar, flowers, fish, and gravel dataset, along with its (b) persistence barcode, (c) persistence diagram, and (d) persistence landscape with the first five piecewise-linear functions for each of the H 0 and H 1 classes.

An illustration of the annotation and subsampling process: (a) a full 14° × 21° image, with an example fish annotation outlined in blue, with the subsample boxes outlined in various colors inside, (b) an enlarged view of this annotated region, including the same color-coded subsamples, (c) the corresponding landscapes for each subsample, and (d) the resulting averaged landscape.

A cartoon showing the sampling and concatenation of a persistence landscape into a vector. In this figure, we see the landscape, each landscape function individually, and the process of sampling each landscape function evenly and then concatenating the results into a single vector v .

Plots of the embedded landscape test points for each pairwise combination: (a) sugar vs flowers (89.25%), (b) fish vs sugar (86%), (c) flowers vs gravel (81%), (d) fish vs gravel (79.5%), (e) sugar vs gravel (71.25%), and (f) fish vs flowers (57%). Flower points are red, sugar points are yellow, gravel points are green, and fish points are blue, with color representing the class assigned in the crowdsourced dataset in Rasp et al. (2020) . The SVM separating plane for each pair is shown in wireframe, and the percentage of correctly classified points for each combination (test performance) is reported in parentheses earlier in the caption.

The (a) sugar and (d) flowers samples farthest from the separating hyperplane (most extreme example) and (b),(e) their respective landscapes. Also shown are the “virtual” (c) sugar and (f) flowers landscapes obtained by traveling along the line normal to the separating hyperplane.





A Primer on Topological Data Analysis to Support Image Analysis Tasks in Environmental Science



Topological data analysis (TDA) is a tool from data science and mathematics that is beginning to make waves in environmental science. In this work, we seek to provide an intuitive and understandable introduction to a tool from TDA that is particularly useful for the analysis of imagery, namely, persistent homology. We briefly discuss the theoretical background but focus primarily on understanding the output of this tool and discussing what information it can glean. To this end, we frame our discussion around a guiding example of classifying satellite images from the sugar, flowers, fish, and gravel dataset produced for the study of mesoscale organization of clouds by Rasp et al. (2020). We demonstrate how persistent homology and its vectorization, persistence landscapes, can be used in a workflow with a simple machine learning algorithm to obtain good results, and we explore in detail how we can explain this behavior in terms of image-level features. One of the core strengths of persistent homology is how interpretable it can be, so throughout this paper we discuss not just the patterns we find but why those results are to be expected given what we know about the theory of persistent homology. Our goal is that readers of this paper will leave with a better understanding of TDA and persistent homology, will be able to identify problems and datasets of their own for which persistent homology could be helpful, and will gain an understanding of the results they obtain from applying the included GitHub example code.

Information such as the geometric structure and texture of image data can greatly support the inference of the physical state of an observed Earth system, for example, in remote sensing to determine whether wildfires are active or to identify local climate zones. Persistent homology is a branch of topological data analysis that allows one to extract such information in an interpretable way—unlike black-box methods like deep neural networks. The purpose of this paper is to explain in an intuitive manner what persistent homology is and how researchers in environmental science can use it to create interpretable models. We demonstrate the approach to identify certain cloud patterns from satellite imagery and find that the resulting model is indeed interpretable.

© 2023 American Meteorological Society. For information regarding reuse of this content and general copyright information, consult the AMS Copyright Policy ( www.ametsoc.org/PUBSReuseLicenses ).

1. Introduction

Methods for image analysis have become an essential tool for many environmental science (ES) applications to automatically extract key information from satellite imagery or from gridded model output ( Schultz et al. 2021 ; Zhu et al. 2017 ; Gagne et al. 2019 ; Ebert-Uphoff and Hilburn 2020 ). Machine learning (ML) methods such as convolutional neural networks (CNNs) are now the dominant technique for many such tasks, where they operate as black boxes ( McGovern et al. 2019 ). This is undesirable for high stakes applications ( Rudin 2019 ; McGovern et al. 2022 ). In this paper, we show how a tool that is beginning to be used in the community, namely, topological data analysis (TDA), can be combined with ML methods for interpretable image analysis. TDA is a mathematical discipline that can quantify geometric information from an image in a predictable and well-understood way. In section 4d , we give a novel example of how we can leverage this understanding to give a strong interpretation of ML results in terms of image features.

TDA has proven highly successful in aiding the analysis of data in a variety of applications, including neuroscience ( Chung et al. 2009 ; Gardner et al. 2022 ), fluid dynamics ( Kramár et al. 2016 ), and cancer histology ( Lawson et al. 2019 ). In environmental science, TDA has recently shown potential to help identify atmospheric rivers ( Muszynski et al. 2019 ), detect solar flares ( Deshmukh et al. 2022 ; Sun et al. 2021 ), identify which wildfires are active ( Kim and Vogel 2019 ), quantify the diurnal cycle in hurricanes ( Tymochko et al. 2020 ), identify local climate zones ( Sena et al. 2021 ), detect and visualize Rossby waves ( Merritt 2021 ), and forecast COVID-19 spread using atmospheric data ( Segovia-Dominguez et al. 2021 ). The purpose of this article is to provide an intuitive introduction to TDA for the environmental science community—using a meteorological application as a guiding example—and an understanding of where TDA might be applied. This article is accompanied by easy-to-follow sample code provided as a GitHub repository ( https://github.com/zyjux/sffg_tda ) that we hope will be used by the community in new applications.

To provide a gentle introduction to TDA for the ES community, we illustrate its use on a practical example. We chose the application of classifying the mesoscale organization of clouds, specifically distinguishing four types of organization (sugar, gravel, fish, and flowers) identified by Stevens et al. (2020) . This task provides an ideal case study for our exploration of topological data analysis for several reasons: 1) the four organization patterns are well known from the seminal paper by Stevens et al. (2020) , and meteorological experts were able to reliably identify them from satellite-visible imagery; 2) the task can be formulated as a classification of patches of single-image monochromatic imagery, for which a common TDA algorithm (persistent homology) is well suited; 3) TDA has never been applied to this application, so it is novel; and 4) a well-developed benchmark dataset with reliable crowdsourced labels is publicly available for this task ( Rasp et al. 2020 ).

Several ML approaches have already been developed with good success for this benchmark dataset to classify the four different types ( Rasp et al. 2020 ). We emphasize that we are not seeking to match or exceed the performance of those ML approaches. Rather, we use this application to demonstrate TDA as an approach that can increase transparency, decrease computational effort, and be feasible even if few labeled data samples are available (see section 1c ).

In this paper, we focus on the TDA concept that is most appropriate for image analysis: persistent homology. We will provide a detailed introduction in section 3 but in this subsection give a short preview of key concepts to be discussed.

Homology is the classical study of connectivity and the presence of holes of various dimensions, giving large-scale geometric information. Persistent homology provides a descriptor with information on the texture of an image (how rough or smooth it is), which can be vectorized into a format useful for machine learning. It does this by scaling through all the intensity values in an image and recording at what intensities connected components and holes appear and disappear. Particularly on images, persistent homology and its vectorizations can be efficiently computed, so for image analysis (from models or satellites) the computational effort of implementing persistent homology is small.
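As a rough illustration of this intensity sweep, the following sketch uses the GUDHI library ( Maria et al. 2014 ) to compute the persistent homology of the superlevel sets of a grayscale image; the random array is a placeholder standing in for a real satellite patch. Since GUDHI's cubical complex filters by sublevel sets, we negate the intensities to sweep from bright to dark.

    import numpy as np
    import gudhi

    # Placeholder 32 x 32 grayscale patch; replace with a real satellite image.
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(32, 32)).astype(float)

    # Negate so that GUDHI's sublevel-set filtration sweeps the superlevel
    # sets of the original image, from intensity 255 down to 0.
    cc = gudhi.CubicalComplex(top_dimensional_cells=-img)
    diag = cc.persistence()  # list of (dimension, (birth, death)) pairs

    # Undo the negation to read off the intensities at which each connected
    # component (H0) and one-dimensional hole (H1) appears and disappears.
    # The one essential component never dies (its death reads as -inf here).
    h0 = [(-b, -d) for dim, (b, d) in diag if dim == 0]
    h1 = [(-b, -d) for dim, (b, d) in diag if dim == 1]
    print(len(h0), "connected components,", len(h1), "holes")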

The results of persistent homology computations can be displayed as either persistence diagrams or persistence barcodes. We focus here on barcodes in which each feature (connected component or hole) appears as a bar that starts at the intensity value at which the feature appears and ends at the intensity at which it disappears. The lengths of these bars indicate the persistence of each feature. The raw output of persistent homology is not suitable for most machine learning tasks as the output vector varies in length from sample to sample. While there are many proposed solutions to this, in this paper we use persistence landscapes, which translate a barcode into a mountain range, with the height of each mountain representing the persistence of the corresponding feature. The landscape is obtained from this mountain range by taking the n highest profiles as piecewise-linear functions, where n is a hyperparameter.
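A minimal sketch of this vectorization step is shown below, using GUDHI's Landscape transformer together with a linear support vector machine from scikit-learn; the diagrams and labels are hypothetical placeholders for the per-image (birth, death) pairs computed as above and the corresponding crowdsourced classes.

    import numpy as np
    from gudhi.representations import Landscape
    from sklearn.svm import SVC

    # Placeholder diagrams and labels; in practice, one array of finite
    # (birth, death) pairs per image and its crowdsourced cloud-type label.
    rng = np.random.default_rng(0)
    diagrams = [np.sort(rng.random((20, 2)), axis=1) for _ in range(40)]
    labels = rng.integers(0, 2, size=40)

    # Keep the n highest landscape functions (the hyperparameter n above),
    # sample each at `resolution` points, and concatenate the samples into
    # one fixed-length vector per image.
    landscape = Landscape(num_landscapes=5, resolution=100)
    vectors = landscape.fit_transform(diagrams)

    clf = SVC(kernel="linear")  # a simple ML model on top of the landscapes
    clf.fit(vectors, labels)
    print(clf.score(vectors, labels))

Because every image yields a vector of the same length, any standard classifier can be used downstream; the linear SVM has the added benefit that its separating hyperplane can itself be interpreted in landscape space.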

Persistent homology is a deterministic mathematical transformation (just like, say, the well-known Fourier transform). In the following, we first explore the advantages that persistent homology inherits from being a deterministic algorithm:

Transparency: All the internal steps of the algorithm are known and well-understood, and the method has a high degree of theoretical interpretation, giving it far more transparency than most ML methods. In section 4d , we use this theoretical background to understand what image features are driving differences in the output of persistent homology.

Known failure modes: No technique is perfect, and there will always be situations that cause errors and incorrect results. To use a method in practice, it is important to understand in what situations it struggles and what sorts of errors can result. Because persistent homology is a deterministic method, we can both theoretically predict these failure modes and interpret experimental results in terms of the original feature space.

No need for large, labeled datasets: As a deterministic algorithm, persistent homology does not require large, reliably labeled datasets. Instead, a small set of representative examples can be used to explore the different patterns that emerge in the transformed data. TDA is often used in combination with a simple ML model, and the number of labeled samples to obtain good performance is smaller than would be required to train a CNN or similar tool without TDA. This is a huge advantage for environmental science datasets, which are frequently large and detailed but almost entirely unlabeled.

Environmentally friendly: Many CNNs for image analysis tasks are known to have a surprisingly high carbon footprint due to the extensive computational resources required for model training ( Schwartz et al. 2020 ; Xu et al. 2021 ). TDA is more in line with the Green AI movement ( Schwartz et al. 2020 ; Xu et al. 2021 ), as it enables context-driven numerical results without the environmental impact inherent in training a deep neural network.

Next, we discuss the key abilities that persistent homology brings to image analysis tasks. These fall into three general categories: the incorporation of spatial context into a deterministic algorithm, the detection of texture and contrast, and invariance under certain transformations. The categories are discussed further below:

Incorporating spatial context: Many deterministic algorithms, as well as fully connected neural networks, struggle to incorporate the spatial context inherent in satellite data. Integrating this spatial context is precisely what motivated the development of CNNs, but CNNs are costly to train and challenging to make explainable. Persistent homology naturally incorporates spatial context, so patterns that are evident in this spatial context can be incorporated without resorting to CNNs or other spatially informed neural network architectures.

Detection of texture and contrast: Persistent homology excels at detecting contrast differences—regions (small or large) that differ from the surrounding average, which gives a representation of the texture present in an image. This focus on texture is useful in analyzing satellite weather imagery, as texture is frequently a key distinguishing factor, even more than a cloud being a particular shape or size.

Invariance to homeomorphisms: The notion of not wanting to be constrained by a particular geometry brings us to the final advantage: invariance under a common class of transformations called homeomorphisms (see section 3e ).

For some image analysis tasks, TDA methods can be used as a stand-alone tool, but for the majority of tasks, one would first use TDA to extract topological features and then afterward add a simple machine learning algorithm, as shown in Fig. 1b . For example, the sample application in section 4 uses TDA followed by a support vector machine (SVM). TDA can thus be viewed as a transparent means to construct new, physically meaningful, and interpretable features that may reduce the need for black-box machine learning algorithms. Using TDA in this way can support the goals of creating ethical, responsible, and trustworthy artificial intelligence approaches for environmental science outlined in McGovern et al. (2022) , since transparency is a key requirement for ML approaches to be used in tasks that affect life-and-death decision-making ( Rudin 2019 ), such as severe weather forecasting.

Fig. 1.


As mentioned before, we are not attempting to set a new benchmark for accuracy in classification, nor are we declaring that this method renders existing techniques obsolete. Instead, we seek to raise awareness of a promising technique with significant potential for ES applications and to provide the reader with a high-level understanding of how TDA works, what sorts of questions can be asked using TDA, and how the answers obtained can be interpreted and understood. The case study in section 4 provides examples of the sorts of questions TDA can help to address, including reports of negative examples, that is, situations in which persistent homology is not able to distinguish between classes; these are as informative as positive examples for understanding the best use of TDA.

The remainder of this article is organized as follows: section 2 discusses in detail the sample application of classifying the mesoscale organization of clouds. Section 3 provides an introduction to the key concepts of topological data analysis. Section 4 illustrates the use of these TDA concepts for the sample application from section 2 in combination with a simple support vector machine. In particular, in section 4d , we provide a detailed and novel discussion of the characteristic image-level features that our combined TDA–SVM algorithm uses to classify. This highlights how the learned patterns can be exposed and discussed in the original feature space, which is one of the greatest strengths of persistent homology and TDA. Section 5 provides an overview of advanced TDA concepts that are beyond the scope of this paper. Section 6 provides conclusions and suggests future work.

2. Guiding application: Classifying the mesoscale organization of clouds from satellite data

To illustrate the use of TDA we consider the task of identifying patterns of mesoscale (20–1000 km) organization of shallow clouds from satellite imagery, which has recently attracted much attention ( Stevens et al. 2020 ; Rasp et al. 2020 ; Denby 2020 ). Climate models, because of their low spatial resolution, cannot model clouds at their natural scale ( Gentine et al. 2018 ; Rasp et al. 2018 ). Since clouds play a major role in the radiation budget of Earth ( L’Ecuyer et al. 2015 ), the limited representation of clouds in climate models causes significant uncertainty for climate prediction ( Gentine et al. 2018 ). There has been progress in addressing this limitation from the climate modeling side, for example, using ML to better represent subgrid processes ( Krasnopolsky et al. 2005 ; Rasp et al. 2018 ; Yuval and O’Gorman 2020 ; Brenowitz et al. 2020 ).

A different approach is to build a better understanding of cloud organization in satellite imagery ( Stevens et al. 2020 ; Rasp et al. 2020 ; Denby 2020 ). One goal is to track the frequency of occurrence of certain cloud patterns across the globe, reaching back in time as far as satellite imagery allows, to better understand changes to the underlying meteorological conditions. To this end, in 2020 a group of scientists from an International Space Science Institute (ISSI) international team identified the primary types of mesoscale cloud patterns seen in Moderate Resolution Imaging Spectroradiometer (MODIS; Gumley et al. 2010 ) true color satellite imagery, focusing on boreal winter (December–February) over a trade wind region east of Barbados ( Stevens et al. 2020 ). Using visual inspection they identified four primary mesoscale cloud patterns, namely, sugar, gravel, fish, and flowers (shown in Fig. 2 ). Subsequent study of these four cloud types using radar imagery ( Stevens et al. 2020 ) and median vertical profiles of temperature, relative humidity, and vertical velocity ( Rasp et al. 2020 ) indicates that the four cloud types occur in climatologically distinct environments and are thus a good indication of those environments.

Fig. 2.

While humans are fairly consistent at recognizing these four patterns after some training, it is difficult to describe them objectively so that a machine can be programmed to do the same. Deep learning offers a potential solution; however, most deep learning approaches require a large number of labeled images to learn from.

Rasp et al. (2020) addressed the lack of labeled data for this application with a two-step crowdsourcing process. First, they developed a crowdsourcing environment and recruited experts to label 10 000 images. Experts used a simple interface to mark rectangular boxes in the imagery and label them with one of the four patterns. The labeled dataset enabled the use of supervised learning algorithms as a second step, that is, the algorithms were supplied with pairs of input images and output labels and then trained to estimate output labels from given imagery. Two types of supervised deep learning algorithms were developed: one for object recognition and one for segmentation. Both algorithms performed well ( Rasp et al. 2020 ).

In contrast, unsupervised learning approaches seek to develop models from unlabeled data samples. Clustering—which divides unlabeled input samples into groups that are similar in some way—is a classic unsupervised learning algorithm. For example, Denby (2020) trained an unsupervised deep learning algorithm, in combination with a hierarchical clustering algorithm, for a closely related application, namely, grouping image patches from Geostationary Operational Environmental Satellite ( Schmit et al. 2017 ) imagery into clusters of similar cloud patterns. Their algorithm identified a hierarchy of clustered mesoscale cloud patterns, but since the classes of cloud patterns were generated by a black-box algorithm rather than by domain scientists, their meaning is less understood than the four patterns from Rasp et al. (2020) . Indeed, a necessary step that comes after the unsupervised learning is to test whether the patterns identified by an algorithm correspond to climatologically distinct environments and if so which ones.

TDA is an alternative approach to address the lack of labels. With TDA we seek to match imagery to the original four classes identified by Stevens et al. (2020) , yet only require a small number of labeled samples. We map patches of the MODIS imagery into topological space and then investigate whether there are significant differences in the topological properties that we can leverage to distinguish the patterns. TDA can thus be viewed as a means of sophisticated feature engineering, giving new, physically meaningful topological features. Our motivation is that this approach would allow us to identify the well-established patterns from Stevens et al. (2020) but with two key differences: 1) this approach does not require a large number of labels (less crowdsourcing required) and 2) this approach is more transparent than the supervised [such as Rasp et al. (2020) ] and unsupervised [such as Denby (2020) ] deep learning approaches, since topological properties can be understood intuitively.

We note that TDA can also be used in an unsupervised fashion similar to the approach of Denby (2020) , but with more transparency and computational efficiency. On its own, TDA provides an embedding of the image data. However, rather than the embeddings being a learned property of a neural network whose properties can only be inferred after it is trained, and then only with difficulty, the TDA embedding is deterministically based on topological properties of the image. For this primer, however, we focus on the supervised task of identifying the previously established patterns of Stevens et al. (2020) .

The dataset from Rasp et al. (2020) provides approximately 50 000 individual cloud-type “annotations” (where each annotation is a rectangle placed on an image surrounding a particular cloud type) on around 10 000 base images. To evaluate the quality of these crowdsourced annotations, Rasp et al. (2020) used a comparison of intersection-over-union (IOU) scores (also known as the Jaccard index; Jaccard 1901 ; Fletcher and Islam 2018 ) between annotators analyzing the same image, and their analysis indicated that these annotations were generally of high quality. See Fig. 2 for examples of these cloud types. In general, sugar-type clouds are small, relatively uniformly distributed clouds; gravel-type clouds are somewhat larger than sugar clouds and tend to show more organization; flowers-type clouds are yet larger clouds that clump together with areas of clear sky between; and fish-type clouds form distinctive mesoscale skeletal patterns. Each image in the dataset is a 14° × 21° (latitude–longitude) visible color MODIS image from the Terra or Aqua satellite. On these images, annotators could draw rectangular annotations encompassing a single cloud type and could apply as many annotations to each image as they desired, so long as each annotation encompassed at least 10% of the image.

3. Introduction to topological data analysis

In this section, we provide a brief introduction to relevant mathematical topics. For more details, we refer readers to Carlsson (2009) , Ghrist (2008) , and Edelsbrunner and Harer (2010) .

In the broadest sense, topology is the study of the fundamental shapes of abstract mathematical objects. When we speak of the “topology” of an object, we speak of properties that do not change under a smooth reshaping of the object, as if it is made of a soft rubber. Some example properties include how many connected components the object contains, how many holes or voids it contains, and in what ways the object loops back on itself. In this paper, we focus on the first two properties: connectivity and holes.

Homology is one of the tools from topology that focuses on connectivity and holes. The d -dimensional homology H d (for d ∈ ℤ≥0) counts the number of d -dimensional holes (or voids) in an object. For d = 0, the 0-dimensional homology H 0 captures the number of connected components present in an object. For d ≥ 1, the homology H d captures holes: a one-dimensional hole is one that can be traced around with a one-dimensional loop (like a loop of string), while a two-dimensional hole is a void. As shown in Fig. 3 , these holes and the surrounding surface need not be circular. Because homology is only interested in counting the presence of these features, it is invariant under any transformation of the space that does not create or destroy any holes or components. In our application of grayscale images, no holes of dimension two or larger can exist, as that would require a dataset that is at least three-dimensional.

Fig. 3.

Whereas homology focuses on global features of the space, there is an extension—known as “sublevelset/superlevelset persistent homology”—that captures more small-scale geometry ( Edelsbrunner and Harer 2010 ). Superlevelset persistent homology is the primary tool from TDA that we use in this paper, because it gives the best descriptor of image texture.

For an example of superlevelset persistent homology being computed on a simple surface, see Fig. 4 . The input to superlevelset persistent homology is a d -dimensional space plus an intensity value at every point. In our example, this is a grayscale image with two spatial dimensions ( d = 2) with the pixel values as intensities. This input is then converted into superlevelsets: each superlevelset is a binary mask of the original space, in which only points that have an intensity value greater than a particular cutoff value have been included. As this cutoff value sweeps down from the maximum intensity, the homology of each superlevelset is computed and the cutoff values at which homological features (connected components, holes) appear and disappear are tracked.
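To make the sweep concrete, here is a minimal sketch (assuming only NumPy and SciPy; the array img and the function name are ours, not from the accompanying repository) that thresholds a grayscale image at a decreasing sequence of cutoffs and counts the connected components of each superlevelset, which is the H 0 part of the information that persistent homology tracks:

```python
import numpy as np
from scipy import ndimage

def components_per_cutoff(img, cutoffs):
    """Count connected components of each superlevelset {img >= t}.

    This captures only the H_0 count at each cutoff; persistent homology
    additionally matches components across cutoffs to get birth/death values.
    """
    counts = []
    for t in cutoffs:
        mask = img >= t                        # binary superlevelset mask
        _, n_components = ndimage.label(mask)  # 4-connected labeling
        counts.append(n_components)
    return counts

img = np.random.rand(96, 96)                     # stand-in grayscale patch
cutoffs = np.linspace(img.max(), img.min(), 20)  # sweep bright -> dark
print(components_per_cutoff(img, cutoffs))
```

Persistent homology goes further than this sketch by matching features between consecutive cutoffs, which is what yields the birth and death values discussed next.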

Fig. 4.

For our example, this means that as the cutoff value decreases, more and more pixels with gradually decreasing intensities are included in the superlevelsets, and we track the connected components and holes that appear and disappear. Because we are using superlevelsets in which we start by including the highest intensities, we can view connected components appearing at high-intensity value as being analogous to cloud tops, which are typically brighter, and holes as darker regions within these bright clouds.

This added interpretability motivates our focus here on superlevelset persistent homology, which is a simple variation (reflection) of the more commonly used sublevelset persistent homology. Sublevelset persistent homology is computed the same way but instead of each set including all the pixels with intensities above the cutoff value, pixels with intensities below the cutoff value are included, and the cutoff value is viewed as sweeping from low intensities up to high intensities. Throughout this paper, we will frequently omit the prefix “superlevelset” in “superlevelset persistent homology” and simply use “persistent homology” to refer to this technique—this should not be confused with the variant of persistent homology that takes a cloud of data points as its input ( Carlsson 2009 ).

In practice, it is not necessary to compute the homology for infinitely many superlevelsets; there are algorithms that discretize the data and then use linear algebra to implement this computation efficiently. These implementations are fast for low-dimensional data (e.g., the two-dimensional grayscale images used in our guiding example) but become more resource intensive when the input space consists of higher-order tensors. In this work, all homological and other TDA computations were performed using the GUDHI software package in Python ( Maria et al. 2014 ).
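As an illustration of what such a computation looks like in practice, the following sketch (ours, not the repository code) computes superlevelset persistence of an image with GUDHI. GUDHI's cubical complex filters by sublevelsets, so a standard trick is to negate the image, since sublevelsets of -img are superlevelsets of img:

```python
import numpy as np
import gudhi

img = np.random.rand(96, 96)   # stand-in for a grayscale satellite patch

# GUDHI's cubical complex computes *sub*levelset persistence, so we negate
# the intensities to obtain superlevelset persistence of the original image.
cc = gudhi.CubicalComplex(top_dimensional_cells=-img)
diag = cc.persistence()        # list of (dimension, (birth, death)) pairs

# Births and deaths are on the negated scale; flip signs to read intensities.
for dim, (birth, death) in diag[:5]:
    print(f"H_{dim}: born at intensity {-birth:.3f}, dies at {-death:.3f}")
```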

There are two main ways to display the output of persistent homology: persistence barcodes ( Fig. 4f ) and persistence diagrams ( Fig. 4g ).

In a barcode, each homological feature that appears is represented by a horizontal bar, which stretches from the cutoff value at which the corresponding feature first appears (is born) to the value at which it disappears (dies). Because we are using superlevelset persistent homology, our cutoff values are decreasing; thus, the intensity values on the x axis are decreasing from left to right. The persistence of each feature is the length of its bar. To distinguish between different homological classes, we color the bars depending on what dimension the homological feature is. We use red bars for zero-dimensional features (connected components) and blue bars for one-dimensional features (holes). The y axis of a persistence barcode simply enumerates the bars, which are typically ordered by birth value.

A persistence diagram contains the same information as a persistence barcode but represents each feature as a point rather than as a bar. In persistence diagrams, both the x axis and y axis represent intensity. The x coordinate of this point is given by the birth cutoff value, while the y coordinate is the death cutoff value. Because features always die at a higher cutoff value than they are born, all points lie above the diagonal line y = x . The persistence of a feature is represented by how far a persistence diagram point lies above the diagonal. We present persistence diagrams here to familiarize the reader with their use in, for example, Kim and Vogel (2019) , Tymochko et al. (2020) , and Sena et al. (2021) , but for our case study we focus on barcodes and landscapes.
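Both displays can be generated directly from a computed diagram; the sketch below (assuming matplotlib is installed) uses GUDHI's built-in plotting helpers:

```python
import numpy as np
import matplotlib.pyplot as plt
import gudhi

img = np.random.rand(96, 96)   # stand-in grayscale patch
diag = gudhi.CubicalComplex(top_dimensional_cells=-img).persistence()

gudhi.plot_persistence_barcode(diag)   # one horizontal bar per feature
plt.show()

gudhi.plot_persistence_diagram(diag)   # one point per feature, above y = x
plt.show()
```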

In persistent homology, there are features that have infinite persistence—features that are born at a particular intensity but never die. The most common example of this is that the first connected component to appear will eventually become the only remaining connected component, as all other components eventually merge into it at high enough cutoff values. These infinite persistence points are represented as infinite bars (rays) in persistence barcodes, stretching out of the frame to the right and as infinite points appearing on a special “+∞” line in persistence diagrams.

To demonstrate how to read a persistence barcode we return to Fig. 4 . The four vertical lines in the barcode in Fig. 4f correspond to the four superlevelsets in the middle row, where white pixels show regions included in the superlevelset. In Fig. 4b , we see the first connected component appear, corresponding to the top, infinite-length red bar in the barcode. In Fig. 4c , two small components in the lower corners appear, corresponding to the two short red bars in the barcode crossed by the second vertical line. The short length of these bars indicates that these components are short lived, and soon merge into the larger component. In Fig. 4d , we see both the central one-dimensional hole, which corresponds to the blue bar, and the connected component within that hole, which corresponds to the red bar that is about to end near the third vertical line in the barcode. In Fig. 4e , we see almost the entire image is in the superlevelset, because the cutoff value is very small. However, the upper two corners have just been included as two new components, which are even shorter lived than the lower-corner components, as indicated by their extremely short red bars.

The first thing to notice about this barcode is that there are relatively few red bars, apart from the infinite-length bar, and those bars are very short. This indicates that few connected components appear and disappear as we scale through intensity values and thus the base image is very smooth. There is one red bar of reasonable length, so we would expect there to be one somewhat significant “bump,” a bright region surrounded by darker regions, which is precisely what we see in the middle of Fig. 4a . We also notice that there is one relatively long blue bar, which tells us that there is a hole (dark region surrounded by brighter regions), which persists for a relatively wide range of intensities.

Persistent homology is invariant under homeomorphisms of the input space, which are continuous deformations with continuous inverses (see Fig. 5 ). Examples include all the rigid motions of the plane (rotation and translation), affine transformations (scaling, skewing, etc.), as well as more radical reshapings, so long as no “ripping” occurs. Superlevelset persistent homology is invariant under all such transformations. So, a cloud that has been reshaped, expanded, and moved, but which retains the same overall texture as in its original incarnation would have the same superlevelset persistence barcodes. See section 5 for some brief comments on versions of persistent homology that can distinguish between such different deformations of an image.

Fig. 5.

Persistence barcodes and diagrams have a drawback: they are not always convenient inputs for use in machine learning tasks, as described by Bubenik (2015) , Adams et al. (2017) , and Mileyko et al. (2011) , since they do not naturally live in a vector space. To deal with this, we use persistence landscapes to summarize and vectorize the persistence diagram ( Bubenik 2015 ). A persistence landscape is a collection of piecewise-linear functions that capture the essence of the persistence diagram.

An example of a persistence landscape computed from a small persistence barcode is shown in Fig. 6 . We separate out a particular homological dimension (e.g., H 0 or H 1 ) and remove any infinite bars and then create a new figure containing a collection of isosceles right triangles with hypotenuses along the x axis, one for each bar in the barcode. These triangles are scaled so that the triangle corresponding to a bar is the same width as that bar. We view this collection of triangles as a “mountain range” and begin to decompose it into landscape functions. The first landscape function is the piecewise-linear function that follows the uppermost edge of the union of these triangles, that is, it is the top silhouette of the mountain range. To compute the next landscape function, we delete the first landscape function from the mountain range and then find the piecewise-linear function that follows the uppermost edge of this new figure at every point, and so forth for the further landscape functions. This collection of piecewise-linear functions is the persistence landscape. The x axis still represents intensity, and the height of each peak is proportional to the length of the bar from which it came and is thus a measure of persistence.
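The decomposition just described is simple enough to sketch directly. In the toy code below (the function name and toy barcode are ours, for illustration only), each finite bar contributes a "tent" function peaked at the bar's midpoint, and the k-th landscape function takes the k-th largest tent value at each point of an intensity grid:

```python
import numpy as np

def landscape_functions(bars, grid, k=3):
    """First k landscape functions, sampled on `grid`, for finite
    (birth, death) bars. Each bar yields a tent function of height
    (death - birth) / 2; the k-th landscape function is the k-th
    largest tent value at each grid point."""
    tents = np.stack([np.clip(np.minimum(grid - b, d - grid), 0, None)
                      for b, d in bars])   # shape (num_bars, len(grid))
    tents = -np.sort(-tents, axis=0)       # sort each column descending
    out = np.zeros((k, grid.size))
    out[:min(k, len(bars))] = tents[:k]
    return out

bars = [(0.2, 0.9), (0.3, 0.5), (0.6, 0.8)]   # toy barcode (finite bars)
grid = np.linspace(0.0, 1.0, 101)
L = landscape_functions(bars, grid, k=2)       # L[0] is the top silhouette
```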

Fig. 6.

This representation is stable—small changes to the input will only result in small changes in the persistence landscape ( Bubenik 2015 ). While the entire persistence landscape determines a persistence barcode exactly, in our work we retain only the top several persistence landscape functions, which means that we obtain a descriptive summary capturing the information of high-persistence points but ignore some information about low-persistence points.

To compute landscapes, we once again use the GUDHI software package running in Python ( Maria et al. 2014 ).
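A sketch of that computation (illustrative, not the repository code) uses the Landscape class from gudhi.representations, which takes a diagram as an array of finite (birth, death) pairs. Note that GUDHI's representations module depends on scikit-learn, and that to compare landscapes across images one would fix sample_range rather than letting it default per diagram:

```python
import numpy as np
import gudhi
from gudhi.representations import Landscape

img = np.random.rand(96, 96)   # stand-in grayscale patch
diag = gudhi.CubicalComplex(top_dimensional_cells=-img).persistence()

def finite_pairs(diag, dim):
    """Finite (birth, death) pairs of one homological dimension, taken
    from GUDHI's list of (dimension, (birth, death)) tuples."""
    return np.array([[b, d] for k, (b, d) in diag
                     if k == dim and d != float("inf")])

H0 = finite_pairs(diag, 0)
LS = Landscape(num_landscapes=5, resolution=200)   # 5 functions x 200 samples
vec = LS.fit_transform([H0])[0]                    # vector of length 1000
```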

We now look at a realistic example in Fig. 7 . Reviewing the barcode in Fig. 7b , we first notice two red bars with high persistence: the infinite bar as well as another that stretches nearly all the way across the barcode. This indicates that there are two high-intensity regions that are separated by a dark region. Over middling intensities, there are few bars, indicating that outside the two bright regions we already identified, there is little going on. Finally, when we get to lower intensities (darker regions), there are many short bars, representing subtle variations in the dark regions. This information, however, is somewhat hard to read in a barcode with as many bars as this. Thus, we turn to landscapes as a way to summarize this information in a more readable format.

Fig. 7.

In the landscape in Fig. 7d , the single, tall, red mountain indicates that the original image ( Fig. 7a ) contains two connected components that persist over a large range of intensity levels (as the infinite-persistence component is implicit in the landscape). The only blue mountains in the landscape are much smaller than the red mountains and are mainly to the right of them. This indicates that while the original image contains numerous holes within the connected components, they only appear near the bottom end of the intensity range, that is, the holes do not appear until we have started including relatively dark regions. As an example, consider the single extremely dark pixel in the upper left-hand corner adjacent to the bright clouds and surrounded by a moderately dark region. This hole contributes a moderately tall blue peak far to the right in the landscape, as it does not appear until the relatively dark region surrounding it is included into the bright adjacent component, but it will not fill in until the nearly black pixel in the middle of the hole is included.

In general, high-persistence features (long bars, tall mountains) give information about large-scale features—the presence of two bright clouds in Fig. 7a , for example. On the other hand, low-persistence features (short bars, small mountains) give information about texture. In Fig. 7 , the short bars and small mountains appearing at lower intensity values indicate that the background darkness in the image is relatively noisy rather than being uniformly black or smoothly graded. The few small blue mountains in Fig. 7d indicate that the bright clouds also contain some textural elements—regions of slightly darker cloud within brighter regions.

4. Environmental science satellite case study

Now that we have established the basic theory of TDA, we return to its application to classifying mesoscale clouds.

We want to use persistent homology and landscapes to compare the clouds present in the rectangular annotations on images in the dataset of Rasp et al. (2020) , so we need to find a consistent vector representation for these annotations. While the overall images in the dataset are of consistent size (1400 × 2100 pixels), the annotations are not. An annotated region covering more area has inherently more complexity, which would yield a barcode with more bars (and thus a different vector representation) than a smaller annotation. To account for this, we implemented a subsampling routine, which is illustrated in Fig. 8 . For each annotation, we randomly chose six 96 × 96 pixel regions (the subsamples) and computed their persistence landscapes individually. This is shown in Figs. 8a–c .

Each landscape consisted of 10 piecewise-linear functions: 5 giving information on connected components (plotted in red) and 5 giving information on one-dimensional holes (plotted in blue). The height of each function was recorded at 200 evenly spaced locations along the intensity axis, giving a vector of length 200 representing that function. These 10 vectors of length 200 were concatenated to yield a vector of length 2000 (i.e., a point in ℝ^2000), representing the persistence landscape of that particular subsample. This vectorization is shown in Fig. 9 . At that point, the persistent homology of each annotation was represented by a small point cloud of six points in ℝ^2000, one for each of the subsamples.

To obtain a single point to represent the annotation, we took the componentwise vector average of the six points in the point cloud, giving us the single vector that we can use to compare and analyze our annotations. Additionally, we can display and interpret this average vector as a landscape, in that it can be displayed as 10 functions, 5 representing the persistence and intensity ranges of connected components and 5 representing the same for one-dimensional holes. This averaged landscape is visualized in Fig. 8d . In particular, we can view (for instance) the first 200 values of the average landscape vector as the heights of the first average landscape function at the 200 evenly spaced intensity values where we sampled the piecewise landscape functions. We can view these values as coming from two equivalent formulations: first, as described above, as a pointwise vector average, or second, as the average height of the first piecewise landscape function at each of the 200 evenly spaced intensity values across the six subsamples.
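A compact sketch of this subsample-landscape-average pipeline is given below. It is our simplified reading of the procedure, not the repository code: the function names and the random patch selection are illustrative, and for real cross-annotation comparisons the landscape sample_range should be fixed globally so that all vectors sample the same intensity grid.

```python
import numpy as np
import gudhi
from gudhi.representations import Landscape

rng = np.random.default_rng(0)

def landscape_vector(patch, num_landscapes=5, resolution=200):
    """Length-2000 vector for one 96 x 96 patch: 5 H0 landscape functions
    followed by 5 H1 functions, each sampled at 200 intensity values."""
    diag = gudhi.CubicalComplex(top_dimensional_cells=-patch).persistence()
    LS = Landscape(num_landscapes=num_landscapes, resolution=resolution)
    parts = []
    for dim in (0, 1):
        pairs = np.array([[b, d] for k, (b, d) in diag
                          if k == dim and d != float("inf")])
        if len(pairs) == 0:   # no finite features in this dimension
            parts.append(np.zeros(num_landscapes * resolution))
        else:
            parts.append(LS.fit_transform([pairs])[0])
    return np.concatenate(parts)   # shape (2000,)

def annotation_vector(image, n_subsamples=6, size=96):
    """Average the landscape vectors of random size x size subsamples."""
    h, w = image.shape
    vecs = []
    for _ in range(n_subsamples):
        i = rng.integers(0, h - size + 1)   # top-left corner of the patch
        j = rng.integers(0, w - size + 1)
        vecs.append(landscape_vector(image[i:i + size, j:j + size]))
    return np.mean(vecs, axis=0)
```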

Fig. 8.

Once we obtained vectorized representations of each annotation, we sought to visualize the dataset. Because ℝ^2000 cannot be visualized directly, we applied a dimensionality reduction algorithm to yield a representation that we can plot. We used principal component analysis (PCA), as it is a widely used and relatively simple technique, which in our case produced quite good results. We found that patterns in the data were visible upon projecting down onto the first three principal components, which captured over 90% of the variation in the high-dimensional data. We also note that the principal component vectors from repeated random samplings were extremely consistent, indicating that our projections were quite stable.
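With scikit-learn (our assumption; the text does not name a library), the projection takes a few lines, where X stands in for the stacked length-2000 annotation vectors:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(700, 2000)   # stand-in for the averaged landscape vectors

pca = PCA(n_components=3)
X3 = pca.fit_transform(X)                    # 3D coordinates for plotting
print(pca.explained_variance_ratio_.sum())   # >0.90 on the real data, per the text
```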

Once the data were projected down to three dimensions, we could visualize the data as a point cloud, with points colored according to which cloud pattern they represent. As a note, the PCA algorithm was entirely unsupervised with regard to these cloud pattern labels; it used only the vectorized representation of the persistence landscapes.

We analyzed each of the six pairs of classes separately by training an SVM ( Boser et al. 1992 ) to find the plane that best separates the two classes of projected data in three dimensions. We considered running the SVM on the high-dimensional data, but initial testing indicated that this tended to overfit the data and actually resulted in reduced classification performance. The SVM was trained on a random sample of 350 annotations of each class, then performance metrics were computed for a test set of 200 random annotations of each class. For visualizations of the test data and SVM separating plane as well as performance metrics for each of the six pairwise comparisons, see Fig. 10 .
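A minimal version of one pairwise comparison, again assuming scikit-learn and with stand-in arrays in place of the real projections, looks as follows:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Stand-ins for the 3D PCA projections of two classes (350 training and
# 200 test annotations per class, as in the text).
X3_train = np.random.rand(700, 3)
y_train = np.repeat([0, 1], 350)
X3_test = np.random.rand(400, 3)
y_test = np.repeat([0, 1], 200)

clf = SVC(kernel="linear")   # finds a separating plane in three dimensions
clf.fit(X3_train, y_train)
print(accuracy_score(y_test, clf.predict(X3_test)))
```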

Fig. 10.

Overall, the projected points separated well, with one notable exception. We began our investigation by comparing the sugar versus flowers patterns, because these are visually the most dissimilar ( Fig. 2 ). As expected, these classes had the most distinct separation out of the six possible pairs, as seen in Fig. 10a . While the separation was not perfect, most of the error comes in the form of some intermingling near the separating hyperplane. The performance of the algorithm in separating these classes was striking, given that only a small, random subset of each annotation was included and that the PCA projection algorithm was entirely blind to the data labels. This example is a clear indicator of the potential that persistent homology has to usefully extract textural and shape differences in satellite imagery.

While not quite as exceptional, the flowers and gravel patterns also separate well, as seen in Fig. 10c . There is again a degree of intermingling near the separating plane and in this case that intermingling extends a bit farther to either side. This is what we would expect, based on the visual presentation of the cloud regimes; the gravel class falls somewhere between sugar and flowers in terms of cloud size and organization.

We begin to see the algorithm struggle a bit more when we attempt to separate the sugar regime from the gravel regime. As we can see in Fig. 10e , the intermingling of data points stretches throughout the point cloud, although there is still a difference in densities between the classes on either side of the separating hyperplane.

However, there are two classes that are effectively indistinguishable by this algorithm: the flowers and fish patterns. The plot of these points can be seen in Fig. 10f , and it is apparent that there is no effective linear separator between these classes. While there is a “separating” hyperplane plotted, it is much less relevant in this case than in the others; the data points are remarkably evenly mixed. This proved to be the case even when more principal components were included. A potential explanation for why the algorithm struggles so much with this task is that the distinguishing features of fish versus flowers are simply too large scale for the subsampling technique to pick up on. The fish pattern is characterized by its mesoscale skeletal structure, particularly in its difference from the flowers regime, which is more randomly distributed. This mesoscale organization is simply not visible to the subsamples, as the 96 × 96 patches are too small to detect that skeletal structure. Future analyses could include using larger patches to better capture these features and perhaps distinguish between these classes more effectively. We also note that in Rasp et al. (2020) , the fish pattern was the most controversial among the expert labelers, so it is perhaps not surprising that our algorithm also struggles.

When we look at fish versus sugar and fish versus gravel in Figs. 10b and 10d , respectively, we can see how similar these plots appear to those in Figs. 10a and 10c , in which the flowers pattern was compared with sugar and gravel. This similarity is made even more remarkable by the fact that the sugar samples in these plots were drawn separately rather than being reused for the pairwise comparisons (and similarly for the gravel samples). While the algorithm is not doing well at distinguishing between fish and flowers, we can at least see that its behavior is consistent: fish and flowers are projected similarly into the 3D embedding space, so they compare similarly to the other classes.

Overall, this case study suggests that it is possible to use persistent homology to quantify and understand the shape and texture of satellite cloud data. While there are cases where the algorithm struggles, these are understandable in terms of the visual task being requested and are internally consistent from sample to sample. Moreover, in the cases where the algorithm does well, it does so consistently across repeated samples and suggests that when these tools are appropriately applied, excellent results can be obtained from very limited sample sizes.

As an example of how this separation can be interpreted, we examine the case of sugar versus flowers. Recall that in Fig. 10a , we saw that this pair of classes had the strongest separation in the dataset.

To begin, we explore what can be learned just from the summarized data, without looking at examples. To discover what the separating plane between sugar points and flowers points represents, we create “virtual” landscapes. We first lift the separating plane in ℝ^3 to the hyperplane in ℝ^2000 consisting of all the points that project (under PCA) into the separating plane in ℝ^3. Next, we find the line normal to this hyperplane that passes through the geometric center of the data. Finally, we choose points on this line that fall at the outer extent of the data point cloud. These points live in the landscape embedding space ℝ^2000 but are not sampled data points. However, by applying the inverse landscape embedding, we can visualize the landscape-like set of curves that would give this embedded point. Virtual landscapes for sugar and flowers can be seen in Figs. 11c and 11f , respectively.
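One simplified reading of this construction is sketched below (illustrative names; clf, pca, and X3 are the fitted linear SVM, PCA, and projected data from the earlier sketches). Because PCA is an affine map, points chosen along the normal of the separating plane in ℝ^3 can be pushed through pca.inverse_transform to obtain points in ℝ^2000, which are then reshaped into landscape-like curves:

```python
import numpy as np

# clf: fitted SVC(kernel="linear"); pca: fitted PCA(n_components=3); X3: data.
w3 = clf.coef_[0] / np.linalg.norm(clf.coef_[0])  # unit normal of the plane
center3 = X3.mean(axis=0)                          # geometric center of data
extent = np.abs((X3 - center3) @ w3).max()         # data reach along the normal

virtual = {}
for sign, side in [(+1, "side A"), (-1, "side B")]:
    p3 = center3 + sign * extent * w3              # outer extent on each side
    p2000 = pca.inverse_transform(p3.reshape(1, -1))[0]  # lift to R^2000
    virtual[side] = p2000.reshape(10, 200)         # 5 H0 + 5 H1 curves
# Plot each row of virtual[...] against the sampled intensity grid to
# display the corresponding "virtual" landscape.
```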

Fig. 11.

An advantage of this approach is that it synthesizes trends from the real data into a readable, controllable format that demonstrates how SVM is separating these classes. When we compare these virtual landscapes with the actual landscapes farthest from the separating hyperplane (seen in Figs. 11b,e ), we can see that the virtual landscapes are smoother but that the overall shapes are remarkably similar.

We can also interpret the shapes of these landscapes in terms of the features present in the images. Let us examine the images and corresponding landscapes in Fig. 11 . The most prominent feature in the two landscapes is the tall red peak in the sugar landscape, shown in Fig. 11b . Recall that the red lines denote zero-dimensional homology (connected components), while the blue lines denote one-dimensional homology (holes). This red peak represents the presence of strongly persistent connected components, that is, separated regions of bright cloud strongly contrasting against a much darker background. The sharpness of this peak also indicates that these features are similar in both the intensity of the cloud top and the intensity of the surrounding background. The comparatively low blue curves with only a small peak at the end indicate a lack of one-dimensional homological features (holes), and thus the texture within connected components is relatively uniform. Looking at the image in Fig. 11a , we see these observations borne out: there are numerous small clouds of similar brightnesses, which stand in stark contrast to the overall uniformly dark background, matching the tall H 0 peak. Because these clouds are relatively small, there is little discernible texture within each cloud, which corresponds to the relative absence of H 1 features.

On the other hand, the flowers landscape in Fig. 11e displays a lower red peak, with more separated curves. This lack of concentration indicates that there is more variation in the intensity values at which connected components appear (the brightest part of the component) and at which they merge together (the intensity of the bridge connecting that component to another), while the lower height indicates that these features are overall less persistent—they merge into one another more rapidly. Additionally, there is a much stronger H 1 signal in this case than in the sugar landscape, meaning that the connected components have more internal texture, with numerous holes appearing and disappearing over a wide range of intensity values. These observations match with what we see in Fig. 11d . The clouds in this image are much larger and cover more of the frame, with varying intensities within and between the clouds, leading to the more varied H 0 landscape. This image also shows much more internal texture to the clouds, with far more of a dimpling effect than in the sugar example.

In summary, this example shows how the patterns learned by the TDA–SVM algorithm can be translated back to homological features which in turn correspond to weather-relevant features in the original image. This is made possible by the fact that the SVM model can be represented by a single separating plane, which can be translated back into the space of persistence landscapes and then interpreted, yielding a highly interpretable approach to the pairwise classification problem.

e. Comparison of this classifier with those in Rasp et al. ( 2020 )

1) Accuracy

The accuracy of our approach cannot be directly compared with the deep learning algorithms in Rasp et al. (2020) because they address different tasks. The task considered here is to choose a single class (out of two) for an annotation assumed to consist of a single cloud type based on several small patches (96 × 96 pixels). In contrast, the task considered in Rasp et al. (2020) is much more complex, namely, to assign one or more labels for an annotation based on a very large image. We chose the simpler task for our TDA approach in order to expose the properties of a TDA algorithm, and trying to implement a multilabel assignment (e.g., by using a sliding window approach) likely would have made this exploration more complicated without providing new insights. However, even without a direct comparison, it is obvious from the results that this first TDA–SVM approach does not come close to matching the accuracy of the deep learning approaches.

2) Required data samples

Our approach only requires a few hundred labeled data samples to develop a classifier. This reduces the required labeling effort by two orders of magnitude relative to the tens of thousands of labeled samples in Rasp et al. (2020) .

3) Computational effort

Computations were performed on a Surface Pro 6 computer with an Intel Core i5-8250U CPU. The computational bottleneck in this case was computing the persistent homology: for 800 samples (and therefore 4800 subsamples to compute persistent homology for), approximately 45 min of wall-clock time was required. This is already much less computational time than is generally required to train a deep network, and it is likely that this could be significantly improved by parallelizing, because each sample can be processed entirely separately.
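Since each annotation is processed independently, the computation parallelizes trivially; a sketch using only the standard library is shown below (annotation_vector is the hypothetical per-annotation helper sketched in section 4, which must be defined at module top level so that multiprocessing can pickle it):

```python
from concurrent.futures import ProcessPoolExecutor

# `images` would hold the annotation crops as 2D arrays; one worker
# process computes the landscape vector for each annotation.
with ProcessPoolExecutor() as pool:
    vectors = list(pool.map(annotation_vector, images))
```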

4) Interpretability and failure modes

Our approach yields a highly interpretable model that provides an intuitive explanation of how the algorithm distinguishes different classes, while the deep learning methods do not. Furthermore, the interpretation of the separating plane in our model makes it easy to provide insights into failure modes, that is, which types of mesoscale patterns can be easily or not so easily distinguished by their topological features, and thus by this approach.

5. Advanced TDA concepts

In this section, we briefly discuss and provide references for some advanced TDA concepts that are beyond the scope of this article, along with motivations for when and why readers might find them useful.

Figure 5 shows that while persistent homology measures some spatial aspects of the intensity function, it is also invariant under nice deformations (“homeomorphisms”) of the domain. However, there is another (very popular) type of persistent homology, constructed using growing offsets of a shape, or unions of growing balls, that distinguishes between different deformations of the domain ( Carlsson 2009 ; Ghrist 2008 ). We expect that this variant of persistent homology will also find applications in atmospheric science, and we refer the reader to Tymochko et al. (2020) for such an example.

We are particularly interested in exploring the use of TDA to analyze cloud properties from satellite imagery, for example, to detect convection. While the example here looked at large-scale organization of clouds, to analyze properties like convection we would zoom far into a single cloud and analyze its texture, for example, seek to identify whether there is a “bubbling” texture apparent in a considered area of the cloud. Preliminary analysis leads us to believe that it might be necessary to use more sophisticated TDA tools for this purpose than discussed here, such as vineyards ( Cohen-Steiner et al. 2006 ) or “contour realization of computed k -dimensional hole evolution in the Rips complex (CROCKER)” plots ( Topaz et al. 2015 ), which incorporate temporal context by analyzing the topological properties of sequences of images rather than individual images.

We have considered persistent homology that varies over a single parameter—the intensity of the satellite image. However, one frequently encounters situations in which two or more parameters naturally arise. For example, one can perform superlevelset persistent homology on a two-channel image, containing the intensities with respect to two frequencies, with respect to the parameter from either the first channel or the second. In these contexts, multiparameter persistence ( Carlsson et al. 2009 ; Carlsson and Zomorodian 2009 ; Cerri et al. 2013 ; RIVET Developers 2020 ) allows one to consider both parameters at once, even though the underlying mathematics is more subtle, and computations are more difficult. A version of multiparameter persistence was applied recently to the atmospheric domain in Strommen et al. (2023) .

Persistence barcodes and diagrams are not ideal as inputs into machine learning algorithms because they are not vectors residing in a linear space. This is evidenced by the fact that averages of persistence diagrams need not be unique ( Mileyko et al. 2011 ). There is a wide array of options for transforming persistence diagrams for use in machine learning, including not only persistence landscapes ( Bubenik 2015 ) but also persistence images ( Adams et al. 2017 ) and stable kernels ( Reininghaus et al. 2015 ), for example. TDA has been gaining traction in machine learning tasks as more tools become available to integrate it into existing workflows in both neural network layers ( Moroni and Pascali 2021 ) and loss functions ( Clough et al. 2020 ). As an example application, TDA has recently been used to compare models with differing grids and resolutions ( Ofori-Boateng et al. 2021 ).

There is a variant of superlevelset persistent homology called extended persistent homology ( Edelsbrunner and Harer 2010 , their section VII.3), which performs two sweeps (instead of just one) over the range of intensity values. Extended superlevelset persistent homology detects all of the features measured by superlevelset persistent homology, plus more. It may be the case that one can extract more discriminative information from a satellite image by instead computing the extended persistence diagram.

6. Conclusions and future work

The primary contributions of this paper are as follows: 1) It presents, to the best of our knowledge, the first attempt to provide a comprehensive, easy-to-understand introduction to popular TDA concepts customized for the environmental science community. In particular, we seek to provide readers with an intuitive understanding of the topological properties that can be extracted using TDA by translating cloud imagery into persistence landscapes, interpreting the landscapes, and then highlighting the topological properties in the original images. 2) In a case study, we demonstrate step by step the process of applying TDA, combined with a simple machine learning model, to extract information from real-world meteorological imagery. The case study focuses on how to use TDA to classify the mesoscale organization of clouds from satellite imagery, which has not previously been addressed with TDA. 3) The most novel contribution is the interpretation procedure we developed that projects the class separation planes identified by the SVM algorithm back into topological space. This, in turn, allows us to fully understand the strategy used by the classifier in meteorological image space, thus providing a fully interpretable classifier.

In future work we seek to explore several of the advanced methods outlined in section 5 . We believe that there are many applications to be explored with TDA, including the applications suggested by Rasp et al. (2020) for their methods, namely, “detecting atmospheric rivers and tropical cyclones in satellite and model output, classifying ice and snow particles images obtained from cloud probe imagery, or even large-scale weather regimes” ( Rasp et al. 2020 ). Furthermore, as discussed in section 1 , TDA has already been shown to be useful to identify certain properties of atmospheric rivers, wildfires, and hurricanes, and we expect TDA to find additional use in those areas as well. Our group is particularly interested in using TDA to detect convection in clouds and to distinguish blowing dust from, say, blowing snow in satellite imagery.

We have only scratched the surface here of exploring how TDA can support image analysis tasks in environmental science, but we hope that this primer will accelerate the use of TDA for this purpose.

The work by authors Ver Hoef and Ebert-Uphoff was supported in part by the National Science Foundation under Grant OAC-1934668 and under Grant ICER-2019758, which funds the NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography. Author Adams thanks the Institute of Science and Technology Austria for hosting him during this research. We thank Charles H. White at CIRA (CSU) for constructive feedback and excellent suggestions for this paper. The authors also thank Rasp et al. (2020) for making their dataset publicly available.

The underlying crowdsourced data from Rasp et al. (2020) are available online ( https://github.com/raspstephan/sugar-flower-fish-or-gravel ). The code used in our analysis is available as a GitHub repository ( https://github.com/zyjux/sffg_tda ).

AMS Publications

Chorus

Get Involved with AMS

Affiliate sites.

Email : [email protected]

Phone : 617-227-2425

Fax : 617-742-8718

Headquarters:

45 Beacon Street

Boston, MA 02108-3693

1200 New York Ave NW

Washington, DC 200005-3928

  • Privacy Policy & Disclaimer
  • Get Adobe Acrobat Reader
  • © 2024 American Meteorological Society
  • [66.249.64.20|195.190.12.77]
  • 195.190.12.77

Character limit 500 /500

Book cover

  • Conference proceedings
  • © 2020

Topological Data Analysis

The Abel Symposium 2018

  • Nils A. Baas 0 ,
  • Gunnar E. Carlsson 1 ,
  • Gereon Quick 2 ,
  • Markus Szymik 3 ,
  • Marius Thaule 4

Department of Mathematical Sciences, NTNU, Trondheim, Norway

You can also search for this editor in PubMed   Google Scholar

Department of Mathematics, Stanford University, Stanford, USA

Part of the book series: Abel Symposia (ABEL, volume 15)

43k Accesses

92 Citations

4 Altmetric

  • Table of contents

About this book

Editors and affiliations, bibliographic information.

  • Publish with us

Buying options

  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
  • Durable hardcover edition

Tax calculation will be finalised at checkout

Other ways to access

This is a preview of subscription content, log in via an institution to check for access.

Table of contents (19 papers)

Front matter, a fractal dimension for measures via persistent homology.

  • Henry Adams, Manuchehr Aminian, Elin Farnell, Michael Kirby, Joshua Mirth, Rachel Neville et al.

DTM-Based Filtrations

  • Hirokazu Anai, Frédéric Chazal, Marc Glisse, Yuichi Ike, Hiroya Inakoshi, Raphaël Tinarrage et al.

Persistence Diagrams as Diagrams: A Categorification of the Stability Theorem

  • Ulrich Bauer, Michael Lesnick

The Persistence Landscape and Some of Its Properties

  • Peter Bubenik

Topological Approaches to Deep Learning

  • Gunnar Carlsson, Rickard Brüel Gabrielsson

Topological Data Analysis of Single-Cell Hi-C Contact Maps

  • Mathieu Carrière, Raúl Rabadán

Neural Ring Homomorphisms and Maps Between Neural Codes

  • Carina Pamela Curto, Nora Youngs

Radius Functions on Poisson–Delaunay Mosaics and Related Complexes Experimentally

  • Herbert Edelsbrunner, Anton Nikitenko, Katharina Ölsböck, Peter Synak

Iterated Integrals and Population Time Series Analysis

  • Chad Giusti, Darrick Lee

Prediction in Cancer Genomics Using Topological Signatures and Machine Learning

  • Georgina Gonzalez, Arina Ushakova, Radmila Sazdanovic, Javier Arsuaga

Topological Adventures in Neuroscience

  • Kathryn Hess

Percolation on Homology Generators in Codimension One

  • Yasuaki Hiraoka, Tatsuya Mikami

Hyperplane Neural Codes and the Polar Complex

  • Vladimir Itskov, Alexander Kunin, Zvi Rosen

Analysis of Dynamic Graphs and Dynamic Metric Spaces via Zigzag Persistence

  • Woojin Kim, Facundo Mémoli, Zane Smith

Canonical Stratifications Along Bisheaves

  • Vidit Nanda, Amit Patel

Inverse Problems in Topological Persistence

  • Steve Oudot, Elchanan Solomon

Sparse Circular Coordinates via Principal \(\mathbb {Z}\) -Bundles

  • Jose A. Perea

Same But Different: Distance Correlations Between Topological Summaries

  • Katharine Turner, Gard Spreemann

Certified Mapper: Repeated Testing for Acyclicity and Obstructions to the Nerve Lemma

  • Mikael Vejdemo-Johansson, Alisa Leshchenko

This book gathers the proceedings of the 2018 Abel Symposium, which was held in Geiranger, Norway, on June 4-8, 2018. The symposium offered an overview of the emerging field of "Topological Data Analysis". This volume presents papers on various research directions, notably including applications in neuroscience, materials science, cancer biology, and immune response. Providing an essential snapshot of the status quo, it represents a valuable asset for practitioners and those considering entering the field.

  • topological data analysis
  • data analysis
  • persistent homology
  • persistence diagrams
  • stability theorem
  • deep learning
  • single-cell Hi-C contact maps
  • neural codes
  • Poisson–Delaunay mosaics
  • prediction in cancer genomics
  • dynamic graphs
  • inverse problems
  • nerve lemma
  • neural science

Nils A. Baas, Gereon Quick, Markus Szymik, Marius Thaule

Gunnar E. Carlsson

Book Title : Topological Data Analysis

Book Subtitle : The Abel Symposium 2018

Editors : Nils A. Baas, Gunnar E. Carlsson, Gereon Quick, Markus Szymik, Marius Thaule

Series Title : Abel Symposia

DOI : https://doi.org/10.1007/978-3-030-43408-3

Publisher : Springer Cham

eBook Packages : Mathematics and Statistics , Mathematics and Statistics (R0)

Copyright Information : Springer Nature Switzerland AG 2020

Hardcover ISBN : 978-3-030-43407-6 Published: 26 June 2020

Softcover ISBN : 978-3-030-43410-6 Published: 26 June 2021

eBook ISBN : 978-3-030-43408-3 Published: 25 June 2020

Series ISSN : 2193-2808

Series E-ISSN : 2197-8549

Edition Number : 1

Number of Pages : XIV, 515

Number of Illustrations : 53 b/w illustrations, 146 illustrations in colour

Topics : Algebraic Topology , Biomedicine, general , Mathematics of Computing

Policies and ethics

  • Find a journal
  • Track your research

Computational Topology and Topological Data Analysis

Project description:, connectedness, alpha shapes, persistent betti numbers, some applications, papers and talks:.

IMAGES

  1. Topological Data Analysis with Applications

    thesis on topological data analysis

  2. Topological Data Analysis

    thesis on topological data analysis

  3. Topological data analysis

    thesis on topological data analysis

  4. Topological Data Analysis

    thesis on topological data analysis

  5. Topological Data Analysis, Data Visualization and Machine Learning

    thesis on topological data analysis

  6. Revealing the complexities of the magnetization reversal mechanism with

    thesis on topological data analysis


Related Theses and Articles

  1. An Introduction to Practical Topological Data Analysis

    New mathematically well-founded theories have given birth to the field of Topological Data Analysis (TDA). TDA is a field of mathematics that analyzes data from a fundamentally different perspective. TDA is mainly motivated by the idea of utilizing powerful methods and approaches in geometry and topology to ...

  2. Topological Data Analysis Using the Mapper Algorithm

    Topological data analysis is an expanding field that attempts to obtain qualitative information from a data set using topological ideas. There are two common methods of topological data analysis: persistent homology and the Mapper algorithm; the focus of this thesis is on the latter.

  3. PDF Applications of Topology to Data Analysis

    This thesis aims to serve as an introduction to Topological Data Analysis (TDA), a collection of methods that seek to quantify the topological and geometric features of data using algebraic topology. The theory behind persistent homology, a stable multi-scale approach for characterizing the structure of data, is presented here.


  4. Persistent Homology and Machine Learning

    Topological data analysis refers to the analysis of data using topology. The goal of the methodology is to unravel the shape, or structure, of data [Ghr08], [Car09], [Was18]. In this thesis, we examine feature engineering using persistent homology to determine its effectiveness in training machine learning algorithms [KJ19]. (A toy illustration of this recipe appears in the code sketches following this list.)


  5. Topological Data Analysis with Applications

    ... topology, and he has spent the last 20 years on the development of topological data analysis. He is also passionate about the transfer of scientific findings to real-world applications, leading him to found the topological data analysis-based company Ayasdi in 2008. Mikael Vejdemo-Johansson is Assistant Professor in the Department of Mathematics ...

  6. Introduction to Topological Data Analysis

    As the amount and variety of data continues to grow, it often becomes too difficult to manage and extract meaningful information from data, especially in high dimensions. Thus, Topological Data Analysis (TDA), which provides a framework for analyzing data, has been a rapidly growing field.

  7. The Nerve Theorem and its Applications in Topological Data Analysis

    ... dimensional and complex data context it is challenging to differentiate between significant information and noise, let alone to visualize the data. Mathematical methods enable us to strategically analyze the data and look for structures, relationships and hidden features. Topological Data Analysis (TDA) presents one approach to tackle this challenge ...

  8. Explaining the Power of Topological Data Analysis in Graph Machine Learning

    Topological Data Analysis is gaining much traction in the fields of statistics, computer science, and mathematics. The underlying concept of TDA is understanding the shape of data, which focuses on detecting and encoding topological features such as connected components, loops, voids, etc., that are present in datasets in order to improve ...

  9. Topological Data Analysis and Machine Learning

    Topological data analysis refers to approaches for systematically and reliably computing abstract 'shapes' of complex data sets. There are various applications of topological data analysis in life and data sciences, with growing interest among physicists. We present a concise review of applications of topological data analysis to physics ...


  10. Thesis: Applications of Topological Data Analysis to Natural Language Processing

    Topological Data Analysis (TDA) uses ideas from topology to study the "shape" of data. It provides a set of tools to extract features, such as holes, voids, and connected components, from complex high-dimensional data. This thesis presents an introductory exposition of the mathematics underlying the two main ...

  11. BDCC (Big Data and Cognitive Computing)

    Topological data analysis is a collection of algorithms aimed at extracting structural information from a dataset [1,2,3]. Topological data analysis is concerned with a very crude structure; e.g., it can distinguish a set of data points in the plane uniformly spread over a certain area, such as a disk, from a set of points in the plane scattered around a circle, a figure-eight shape ... (The code sketches following this list illustrate this disk-versus-circle distinction.)

  12. Topological Machine Learning Data Analysis for the ...

    Topological data analysis (TDA) and geometric manifold learning are combined in this study, which adds a significant amount of originality. Manifold learning gives both global and local descriptors, whereas topology takes a different approach. We provide a method for combining these two features to create an understandable visual representation of ...

  13. A Primer on Topological Data Analysis to Support Image ...

    Topological data analysis (TDA) is a tool from data science and mathematics that is beginning to make waves in environmental science. In this work, we seek to provide an intuitive and understandable introduction to a tool from TDA that is particularly useful for the analysis of imagery, namely, persistent homology. We briefly discuss the theoretical background but focus primarily on ...

  14. Topology of Deep Neural Networks

    Keywords: neural networks, topology change, Betti numbers, topological complexity, persistent homology. A key insight of topological data analysis is that "data has shape" (Carlsson, 2013, 2014). That data sets often have nontrivial topologies, which may be exploited in their analysis, is ...


  15. Topological Data Analysis Helps to Improve Accuracy of Deep Learning

    Our research question is comparing topological data analysis to state-of-the-art deep learning models for text classification. The particular scenario we are looking at is fake news detection in a situation when very few training data are available. We show that features generated by topological data analysis improve performance of deep learning ...

  16. Topological Data Analysis for Genomics and Evolution

    From the table of contents: 3.8 Euler Characteristics in Topological Data Analysis; 3.9 Exploratory Data Analysis with Mapper; 3.10 Summary; 3.11 Suggestions for Further Reading; 4 Dimensionality Reduction, Manifold Learning, and Metric Geometry; 4.1 A Quick Refresher on Eigenvectors and Eigenvalues; 4.2 Background on PCA and MDS; 4.3 Manifold ...

  17. Topological Data Analysis (arXiv:2103.14131 [cs.LG], 25 Mar 2021)

    This prior work suggests that topological features could be useful for various NLP tasks. However, Michel et al. (2017) found a negative result: text classification accuracy doesn't improve much if one extracts the topological features from the word embedding.

  18. Computational Topology and Topological Data Analysis

    The presentations at the Topological Data Analysis workshop that we organized at the 2017 SIAM Snowbird meeting: part I and part II; Z. Alexander, A Topology-Based Approach for Nonlinear Time-Series Analysis with Applications in Computer Performance Analysis, Ph.D. thesis, University of Colorado, 2012.

  19. Topological Data Analysis

    This thesis is a mathematical exposition of the theory behind Topological Data Analysis (TDA), complemented by two applications in medicine and the financial realm. We start by establishing the foundation of homology theory, then study the reconstruction of the underlying manifold from point cloud data. This is followed by the theory of persistent homology ...
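
Several entries above (notably entry 11) invoke the canonical example that persistent homology distinguishes points scattered around a circle from points filling a disk. The following minimal sketch makes this concrete using the GUDHI Python interface; the sample sizes, the max_edge_length value, and the 0.5 lifetime threshold are arbitrary choices for illustration, not values taken from any of the works listed.

    import numpy as np
    import gudhi  # GUDHI's Python interface

    rng = np.random.default_rng(0)

    def h1_lifetimes(points):
        # Lifetimes (death - birth) of the 1-dimensional (loop) features
        # in the Vietoris-Rips filtration of the point cloud.
        rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
        st = rips.create_simplex_tree(max_dimension=2)
        st.persistence()  # compute all persistence pairs
        h1 = st.persistence_intervals_in_dimension(1)
        return h1[:, 1] - h1[:, 0] if len(h1) > 0 else np.array([])

    # 200 points on the unit circle: one long-lived loop is expected.
    theta = rng.uniform(0.0, 2.0 * np.pi, 200)
    circle = np.column_stack((np.cos(theta), np.sin(theta)))

    # 200 points drawn uniformly from the unit disk: no long-lived loop.
    phi = rng.uniform(0.0, 2.0 * np.pi, 200)
    radius = np.sqrt(rng.uniform(0.0, 1.0, 200))
    disk = np.column_stack((radius * np.cos(phi), radius * np.sin(phi)))

    for name, pts in (("circle", circle), ("disk", disk)):
        lifetimes = h1_lifetimes(pts)
        # 0.5 is an arbitrary prominence threshold for this toy example.
        print(name, "-> prominent loops:", int((lifetimes > 0.5).sum()))

On a typical run, the circle sample reports one prominent loop and the disk sample none, which is exactly the crude structural distinction described in entry 11.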
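
Entry 4 describes feature engineering with persistent homology for machine learning. A common recipe is to compute one persistence diagram per point cloud, summarize each diagram as a fixed-length vector, and train an ordinary classifier on those vectors. The sketch below follows that recipe under stated assumptions: the three hand-rolled summary statistics and the logistic-regression model are toy choices for illustration, not the features or models used in that thesis.

    import numpy as np
    import gudhi
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)

    def diagram_features(points):
        # Hand-rolled 3-dimensional summary of the H1 persistence diagram:
        # (number of loops, longest lifetime, total lifetime).
        rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
        st = rips.create_simplex_tree(max_dimension=2)
        st.persistence()
        h1 = st.persistence_intervals_in_dimension(1)
        if len(h1) == 0:
            return [0.0, 0.0, 0.0]
        lifetimes = h1[:, 1] - h1[:, 0]
        return [float(len(lifetimes)), float(lifetimes.max()), float(lifetimes.sum())]

    # Toy two-class problem: noisy circles (label 1) versus uniform disks (label 0).
    X, y = [], []
    for _ in range(20):
        t = rng.uniform(0.0, 2.0 * np.pi, 100)
        noisy_circle = np.column_stack((np.cos(t), np.sin(t))) + 0.05 * rng.normal(size=(100, 2))
        X.append(diagram_features(noisy_circle))
        y.append(1)
        r = np.sqrt(rng.uniform(0.0, 1.0, 100))
        uniform_disk = np.column_stack((r * np.cos(t), r * np.sin(t)))
        X.append(diagram_features(uniform_disk))
        y.append(0)

    scores = cross_val_score(LogisticRegression(), np.array(X), np.array(y), cv=5)
    print("5-fold CV accuracy:", scores.mean())

Because the circle class has exactly one long-lived loop and the disk class has none, even these three crude features separate the classes; in practice one would replace them with a standard vectorization such as persistence landscapes or persistence images.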