library(sbm)
library(ggplot2)
library(network)
library(GGally)
library(RColorBrewer)
library(knitr)
library(igraph)
library(mclust)
library(patchwork)
colSet <- RColorBrewer::brewer.pal(n = 8, name = "Set2") 📊 Stochastic Block Models Tutorial
Principle
0. Requirements
This tutorial illustrates the use of block models for the analysis of (ecological) network. It is mainly based on the vignette of the package sbm. The package is on the CRAN. Its development version is on Github.
The packages required for the analysis are sbm plus some others for data manipulation and representation:
1. First simulation from SBM
In this lab, we explore the network model Stochastic Block Models (SBM).
A Stochastic Block Model assumes that nodes belong to latent groups, and connection probabilities depend only on group memberships.
1.1 Simulating a network with the sbm package
First define the parameters.
K = 3; # nb of blocks
blockProp <- c(.25, 0.5 ,.25) # group proportions
alpha = diag(.4, 3) + 0.05
connectParam <- list(mean = alpha) # connectivity matrix: affiliation networkShow the connectivity matrix
print(connectParam$mean)
#> [,1] [,2] [,3]
#> [1,] 0.45 0.05 0.05
#> [2,] 0.05 0.45 0.05
#> [3,] 0.05 0.05 0.45\(\alpha\) is a \(3 \times 3\) matrix. We set all diagonal entries to \(0.45\), representing the intra-block connectance. All off-diagonal entries are small (\(0.05\)), indicating that block members primarily interact within their own block. This defines a clear community structure.
We can now simulate with the sampleSimpleSBM function.
nbNodes <- 100 # nb of vertices
mySimNet <- sampleSimpleSBM(nbNodes,
model = 'bernoulli',
directed = FALSE,
blockProp = blockProp,
connectParam = connectParam ,
dimLabels = 'Species')Explore the simulated object
mySimNet
#> Simple Stochastic Block Model -- bernoulli variant
#> =====================================================================
#> Dimension = ( 100 ) - ( 3 ) blocks and no covariate(s).
#> =====================================================================
#> * Useful fields
#> $nbNodes, $modelName, $dimLabels, $nbBlocks, $nbCovariates, $nbDyads
#> $blockProp, $connectParam, $covarParam, $covarList, $covarEffect
#> $expectation, $indMemberships, $memberships
#> * R6 and S3 methods
#> $rNetwork, $rMemberships, $rEdges, plot, print, coef
str(mySimNet$networkData)
#> int [1:100, 1:100] NA 0 0 1 0 0 0 0 0 0 ...
table(mySimNet$memberships)/nbNodes
#>
#> 1 2 3
#> 0.35 0.46 0.191.2 Displaying the simulated network with sbm and ggnet
We represent the data using either the adjacency matrix or the graph representation.
1.2.a Adjacency matrix representation
We use the plotMyMatrix function of the sbmpackage to display the adjacency matrix.
plotOptions <- list(colNames = TRUE, rowNames = TRUE)
sbm::plotMyMatrix(mySimNet$networkData,
dimLabels = list(row = "species", col = "species"),
plotOptions = plotOptions)Check the help for plotMyMatrix to see all the options that are already available. In addition, the output is a ggplot object, so you can modify any features you need afterwards (legend, text size, etc.).
Each node is randomly assigned to one of the three blocks. The structure we intended to simulate is not immediately visible. We need to reorder the nodes by block to reveal it.
U <- order(mySimNet$memberships)
myMat <- mySimNet$networkData[U,U]
plotMyMatrix(myMat,
dimLabels = list(row = "species", col = "species"),
plotOptions = plotOptions)1.2.b Network representation
GGally is a package based on ggplot2 to plot networks.
ggnet2(mySimNet$networkData,label = TRUE,color=colSet[1])From this first figure, we can guess that the network has the three communities we simulated. We can highlight this structure by coloring nodes according to their block.
Coloring the nodes
net = network(mySimNet$networkData, directed = FALSE)
ggnet2(net,
node.color = mySimNet$memberships,
palette = "Set2",
label = TRUE,
label.size = 3)You can also simulate weighted networks using the argument “model=poisson”.
Simulation of a weighted network
K = 2; # nb of blocks
blockProp <- c(.35, 0.65 ) # group proportions
alpha = matrix(0,K,K)
alpha[1,1] = 10
alpha[2,1] = 3
alpha[1,2] = 15
alpha[2,2] = 5
connectParam <- list(mean = alpha)
nbNodes <- 40 # nb of vertices
mySimPoisson <- sampleSimpleSBM(nbNodes,
model = 'poisson',
directed = TRUE,
blockProp = blockProp,
connectParam = connectParam ,
dimLabels = 'Species')
plot(mySimPoisson)2. Exercices 🧪
2.1 Ex 1: Erdos Renyi 🔥
- Adapting the previous code, generate a random network with \(n=50\) nodes, \(K=1\) block and with density \(\alpha = 0.5\).
This particular model with one block is called the Erdos Reyni model. It is the same as the one you simulated with igraph( igraph::sample_gnm(n=50,m=12)). With sample_gnm you specified the number of edges \(m\). Here, we specify a probability of edge, the realized number of edges is random. Note that \[\mathbb{E}[N] = \sum_{i<j}P(X_{ij}=1) = \frac{n(n-1)}{2} \alpha\]
Correction
nbNodes = 50
sim_ER = sampleSimpleSBM(nbNodes,blockProp=c(1/2),connectParam = list(mean=matrix(0.5,1,1)),model="bernoulli",directed = FALSE)- Transform the adjacency matrix into a graph using
igraph
Be careful: sbm sets NA on the diagonal of the adjacency matrix to indicate that self-interactions are ignored. In igraph, this diagonal must be set to \(0\).
sim_ER_adj <- sim_ER$networkData
diag(sim_ER_adj) <- 0 Correction
G1 <- graph_from_adjacency_matrix(sim_ER_adj, mode = "undirected")- Count the number of edges. Compare to the previous formula.
Correction
ecount(G1)
#> [1] 600
nbNodes*(nbNodes-1)/2*0.5
#> [1] 612.5- Display the degree distribution
Correction
hist(degree(G1),main="Degree dist.",xlab="Degree") The degree distribution has a low variability
- Compute the other statistics you learned on Day 1: edge density, diameter, closeness etc….
Quick reminder: try using edge_density(), diameter(), and closeness() from igraph.
Correction
components(G1)
#> $membership
#> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
#> [39] 1 1 1 1 1 1 1 1 1 1 1 1
#>
#> $csize
#> [1] 50
#>
#> $no
#> [1] 1
edge_density(G1)
#> [1] 0.4897959
diameter(G1,unconnected=FALSE)
#> [1] 2
diameter(G1, unconnected=TRUE)
#> [1] 2
get_diameter(G1, directed=TRUE)
#> + 3/50 vertices, from 43bb643:
#> [1] 1 2 5
count_triangles(G1)
#> [1] 121 222 79 153 108 115 86 152 112 191 75 166 201 169 111 109 149 206 100
#> [20] 110 164 108 69 240 117 126 210 139 127 56 195 285 73 153 153 143 220 185
#> [39] 184 114 165 108 189 90 150 104 207 91 172 131
transitivity(G1)
#> [1] 0.50625532.2 Ex 2: Community structure 🔥🔥
We now consider the community network we simulated before.
mySimNet_adj <- mySimNet$networkData
diag(mySimNet_adj) <- 0
G2 <- graph_from_adjacency_matrix(mySimNet_adj, mode="undirected")
Z_sim <- mySimNet$memberships- Try to recover the “modules” with the cluster technics seen in Day 2 (cluster_louvain, cluster_edge_betweenness…)
Correction
G2_EB.mod<-cluster_edge_betweenness(G2)
G2_LE.mod<-cluster_leading_eigen(G2)
G2_ML.mod<-cluster_louvain(G2)- Compare the clusters obtained with the true (simulated ones)?
When willing to compare two clusterings, we can resort to several tools. You saw the alluvial plot in Day 2. You can also compute the Adjusted Rand Index (ARI). ARI is a measure of similarity between two clusterings that corrects for chance, giving a score where 1 means perfect agreement, 0 corresponds to random labeling, and negative values indicate worse-than-random agreement.
Correction
table(G2_EB.mod$membership,Z_sim)
#> Z_sim
#> 1 2 3
#> 1 35 0 0
#> 2 0 0 18
#> 3 0 46 0
#> 4 0 0 1
adjustedRandIndex(G2_EB.mod$membership,Z_sim)
#> [1] 0.9921279
adjustedRandIndex(G2_LE.mod$membership,Z_sim)
#> [1] 0.8844131
adjustedRandIndex(G2_ML.mod$membership,Z_sim)
#> [1] 1In this particular case, we simulated a really clear structure of communities. As a consequence, these groups are well recovered by any method. If you have time, do the same exercice but simulating a less clear structure: \[\alpha = \left(\begin{array}{ccc} 0.4 & 0.2 & 0.2 \\ 0.2 & 0.3 & 0.2\\ 0.2 & 0.2 & 0.35 \end{array} \right) \]
2.3 Ex 3: A more complexe structure 🔥🔥🔥
We now propose to simulate a network with a more complex structure.
Simulation Using what you have learned before, simulate a non-oriented network of \(n=100\) nodes clustered in \(K= 4\) blocks respecting the following organization:
The first block is made of around \(10\%\) of the nodes. The other blocks are of approximatively equal size.
Members of Block 1 are highly connected to everybody except Block 4 but including themselves
Members of Block 2 are only highly connected inside their group and to Blocks 1 and 4 but poorly connected to Block 3.
Members of Block 3 are not connected inside their group and only connected to block 1.
Members of Block 4 have very few connections except with block 2.
Proposition of connection matrix
K = 4
blockProp=c(1/10,3/10,3/10,3/10)
alpha3 = matrix(0.05,K,K)
alpha3[1,1:2] =alpha3[1:2,1] = 0.3
alpha3[1,3] = alpha3[3,1] = 0.3
alpha3[2,4] = alpha3[4,2] = 0.3
alpha3[2,2] = 0.4
alpha3[3,3] = 0.05
connectParam=list(mean=alpha3)
nbNodes = 100
mySimNet3 <- sampleSimpleSBM(nbNodes,
model = 'bernoulli',
directed = FALSE,
blockProp = blockProp,
connectParam =connectParam ,
dimLabels = 'Species')
plot(mySimNet3)
#> Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
#> ℹ Please use tidy evaluation idioms with `aes()`.
#> ℹ See also `vignette("ggplot2-in-packages")` for more information.
#> ℹ The deprecated feature was likely used in the sbm package.
#> Please report the issue at <https://github.com/GrossSBM/sbm/issues>.
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
#> ℹ The deprecated feature was likely used in the sbm package.
#> Please report the issue at <https://github.com/GrossSBM/sbm/issues>.- Explore the structure of the network quickly as before. Are the algorithm of community research able to recover the blocks?
Correction
mySimNet3_adj <- mySimNet3$networkData
diag(mySimNet3_adj) <- 0
G3 <- graph_from_adjacency_matrix(mySimNet3_adj, mode="undirected")
Z_sim3 <- mySimNet3$memberships
G3_EB.mod<-cluster_edge_betweenness(G3)
G3_LE.mod<-cluster_leading_eigen(G3)
G3_ML.mod<-cluster_louvain(G3)
adjustedRandIndex(G3_EB.mod$membership,Z_sim3)
#> [1] 0.2689459
adjustedRandIndex(G3_LE.mod$membership,Z_sim3)
#> [1] 0.2161923
adjustedRandIndex(G3_ML.mod$membership,Z_sim3)
#> [1] 0.09911642The community detection algorithms are unable to recover the blocks because the blocks are not communities. They encode a more complex structure.
2.4 Ex 4: Foodweb 🔥🔥🔥
Simulate a foodweb (binary directed network) with \(5\) trophic levels.
3. Simulation of bipartite 🕸️
3.1 Code 💻️
sbm can also be used to simulate bipartite networks. Here is an example.
K = 3; # nb of blocks
L = 4
nbNodes <- c(50,80) # nb of vertices
blockProp <- list(row = c(.25, 0.5 ,.25),col=c(0.1,0.3,0.4,0.2)) # group proportions
alpha <- matrix(0.2, K, L) # connectivity matrix
alpha[1,1] = 0.4
alpha[1,2] = 0.1
alpha[1,3] = 0.5
alpha[1,4] = 0.01
alpha[2,1] = 0.3
alpha[2,3] = alpha[2,4] = 0.6
alpha[3,1] = alpha[3,4] = 0.05
connectParam <- list(mean = alpha)
myBipartiteNet <- sampleBipartiteSBM(nbNodes, blockProp = blockProp, connectParam = connectParam , model = 'bernoulli', dimLabels = c('pollinators','plants'))
plot(myBipartiteNet)3.2 Ex: simulation of a 🌼-🐝 network 🧪🔥🔥
Use the function sampleBipartiteSBM to to simulate a bipartite network featuring specialists, generalists, and strong embeddedness.
4. Fitting a SBM by VEM
After having simulated more or less complexe networks, we now propose to fit the SBM.
As explained in the course: 1/ The parameters are estimated by maximizing a lower bound of the likelihood. 2/ The clustering of the nodes is recovered by selecting the most probable clustering a posteriori. 3/ The number of blocks is determined using a penalized likelihood criterion.
4.1 Codes for inference
4.1.a Estimation and model selection
The inference task is performed with the function estimateSampleSBM of the packages sbm and blockmodels.
res_estim <- estimateSimpleSBM(netMat = mySimNet$networkData,model = "bernoulli",
dimLabels ='species',
estimOptions = list(verbosity = 0))The algorithm aims to identify the optimal number of blocks \(K\). It explores the space of candidate models through a forward–backward procedure and generates multiple initializations for the VEM algorithm.
The progression of this search can be visualized in the plots produced during the estimation process by setting the argument estimOptions = list(plot = 1).
By default, the best model in terms of Integrated Classification Likelihood (ICL) is sent back. In fact, all the model are stored internally. The user can have a quick glance at them via the $storedModels field:
res_estim$storedModels %>%
ggplot() +
aes(x = nbBlocks, y = ICL) + geom_line() + geom_point(alpha = 0.5) + geom_vline(xintercept = res_estim$storedModels$nbBlocks[which.max(res_estim$storedModels$ICL)],color = "red")4.1.b Output of the estimation
The estimated objects can be obtained as follows:
res_estim$blockProp
#> [1] 0.4596204 0.3499500 0.1904296
res_estim$connectParam
#> $mean
#> [,1] [,2] [,3]
#> [1,] 0.43002865 0.05791270 0.06620821
#> [2,] 0.05791270 0.44723747 0.05058541
#> [3,] 0.06620821 0.05058541 0.45838160
res_estim$memberships
#> [1] 2 3 2 2 1 2 1 1 2 1 1 1 2 1 1 3 1 3 3 2 1 1 1 1 1 1 2 3 2 2 3 1 1 1 2 2 1
#> [38] 2 3 1 3 1 3 1 3 3 3 3 2 1 1 1 2 2 1 2 1 1 2 1 1 1 1 2 2 2 1 2 2 2 1 2 1 2
#> [75] 3 3 1 1 1 1 1 1 3 3 1 3 3 2 1 2 1 2 2 2 1 2 2 2 1 2Note that in this particular case, where we estimate the structure of a simulated network, we can compare the true clustering with the estimated one. In this case, the Adjusted Rand Index (ARI) is equal to \(1\) (or approximately \(1\)), although the block labels may be permuted. This is due to the label switching phenomenon: the labels of the blocks are not identifiable and can be arbitrarily reordered without affecting the clustering structure.
table(res_estim$memberships, mySimNet$memberships)
#>
#> 1 2 3
#> 1 0 46 0
#> 2 35 0 0
#> 3 0 0 19However the ARI is perfect:
adjustedRandIndex(res_estim$memberships, mySimNet$memberships)
#> [1] 14.1.c Displaying results
The default plot function represents the network data reordered according to the memberships estimated in the SBM. One can also plot the expected network which, in case of the Bernoulli model, corresponds to the probability of connection between any pair of nodes in the network.
plot(res_estim, type = "data", dimLabels = list(row = 'species', col = 'species')) |
plot(res_estim, type = "expected",estimOptions=list(legend=TRUE))If one wants to follow the analysis with a different number of blocks (for instance \(K=4\)), the user can update the current simpleSBMfit thanks to the setModel method:
res_estim$setModel(4)
res_estim$nbBlocks
#> [1] 4
res_estim$setModel(3) # Going back to the best model4.2 Exercices 🧪
- When considering the so-called complex network, is the algorithm able to recover the blocks?
Correction
res3 <- estimateSimpleSBM(mySimNet3$networkData)
adjustedRandIndex(res3$memberships, mySimNet3$memberships)Unlike community detection methods, SBM inference recovers clusters not by specifically searching for communities, but by identifying blocks of nodes that ‘behave’ similarly within the graph. In this sense, SBM inference is agnostic.
- Adapt the code to infer the bipartite networks you simulated
Correction
res_Bi <- estimateBipartiteSBM(myBipartiteNet$networkData,dimLabels = c(row = "pollinators", col = "plants"))
adjustedRandIndex(myBipartiteNet$memberships$pollinators, res_Bi$memberships$pollinators)
adjustedRandIndex(myBipartiteNet$memberships$plants, res_Bi$memberships$plants)Check the quality of the results: display the reordered matrix, compare the clusterings.