Network theory is a subset of graph theory that studies the relations between discrete objects or actors. It is used across many disciplines, including physics, engineering, biology, social sciences, and finance.
Networks are made up of nodes and edges.
Nodes or vertices are the discrete entities of the graph or dataset. These can represent Twitter followers, Facebook friends, participants in a study, items in a questionnaire, words in a text or conversation, or any other discrete concept.
Edges or links are the relations among the nodes. These can be either binary (e.g., Twitter follower or not) or weighted (e.g., correlation coefficient).
Networks may be directed or undirected. In a directed network, the order of the relationship matters. In the Twitter example, the relationship between followers is directed (whether you follow someone is coded separately from whether they follow you). By comparison, a Facebook network is undirected (all friends are friends of each other).
Networks take as input either an edge list and a node list together, or an adjacency matrix. An adjacency matrix is a square matrix in which both the row and column names are nodes.
edgeList <- cbind(a = 1:5, b = c(5,2,4,3,1))
edgeList
## a b
## [1,] 1 5
## [2,] 2 2
## [3,] 3 4
## [4,] 4 3
## [5,] 5 1
nodeList <- cbind("id" = 1:5)
nodeList
## id
## [1,] 1
## [2,] 2
## [3,] 3
## [4,] 4
## [5,] 5
adjMat <- matrix(0, 5, 5)
adjMat[edgeList] <- 1
adjMat
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0 0 0 0 1
## [2,] 0 1 0 0 0
## [3,] 0 0 0 1 0
## [4,] 0 0 1 0 0
## [5,] 1 0 0 0 0
Our first example: the Cat in the Hat! Nodes are individual words, and edges are weighted by how close together those words appear in the text.
# Packages used throughout this example
library(dplyr)     # pipes, joins, count()
library(stringr)   # str_squish(), str_sub()
library(tidytext)  # unnest_tokens(), stop_words, get_sentiments()
library(reshape2)  # melt(), dcast()
# Read in a text file as text
catInTheHat <- readLines("CatintheHat.txt")
head(catInTheHat)
## [1] "the sun did not shine." "it was too wet to play."
## [3] "so we sat in the house" "all that cold, cold, wet day."
## [5] "" "i sat there with sally."
# Remove white space
catInTheHat <- str_squish(catInTheHat) %>%
# convert to data frame
as.data.frame(stringsAsFactors = FALSE)
# column name is "." - needs to be something easy to call
colnames(catInTheHat) <- "Text"
tidyCat <- catInTheHat %>%
unnest_tokens(output = word, input = Text)
# To maintain order of words in poem
tidyCat$wordOrder <- 1:nrow(tidyCat)
head(tidyCat)
## word wordOrder
## 1 the 1
## 1.1 sun 2
## 1.2 did 3
## 1.3 not 4
## 1.4 shine 5
## 2 it 6
To maintain the order of the poem, we are doing this part before removing stop words.
# Function for matching words that are close by
# Essentially offsets the "word" column by X# in each direction
# next = the word after, prior = the word before
closeWords <- function(dataframe, distance) {
dataframe[, paste0("prior.", distance)] <- dataframe[(
# NA if words are at the beginning of poem
ifelse(dataframe$wordOrder - distance >= 1,
dataframe$wordOrder - distance, NA)), "word"]
dataframe[, paste0("next.", distance)] <- dataframe[(
# NA if words are at the end of poem
ifelse(dataframe$wordOrder + distance <= nrow(dataframe),
dataframe$wordOrder + distance, NA)), "word"]
return(dataframe)
}
# select words 1-3 away
# Any distance is fine! What matters to you will change based on the text.
for (i in 1:3) {
tidyCat <- closeWords(dataframe = tidyCat, distance = i)
}
rm(i)
# Melt to create a DF that represents the relation between words
# rows = 6 instances of each word
# wordPair = the word X# before or after that word in the poem
buildingMatrix <- tidyCat %>%
melt(id.vars = c("word", "wordOrder"),
variable.name = "relation",
value.name = "wordPair")
# Maximum distance between nodes
maxWeight <- 3
# So relations 1 word apart are weighted = 3, 3 words apart = 1, farther = 0
buildingMatrix$weight <- ifelse(is.na(buildingMatrix$wordPair), 0,
# reverse code
maxWeight + 1 -
# distance from word
as.numeric(str_sub(buildingMatrix$relation, start = -1)))
head(buildingMatrix)
## word wordOrder relation wordPair weight
## 1 the 1 prior.1 <NA> 0
## 2 sun 2 prior.1 the 3
## 3 did 3 prior.1 sun 3
## 4 not 4 prior.1 did 3
## 5 shine 5 prior.1 not 3
## 6 it 6 prior.1 shine 3
# Turn it into a weighted matrix
# Full = all 236 words in the poem
adjacencyMatrix <- dcast(
  data = buildingMatrix,
  formula = word ~ wordPair,
  value.var = "weight",
  # word pairs repeat across the poem, so aggregate repeated pairs explicitly
  # (otherwise dcast falls back to counting rows)
  fun.aggregate = sum)
# Make row names words
rownames(adjacencyMatrix) <- adjacencyMatrix$word
# Remove columns "word" and "NA"
adjacencyMatrix <-
  adjacencyMatrix[, !(names(adjacencyMatrix) %in% c("word", "NA"))] %>%
# convert to matrix
as.matrix
# Insert NAs for duplicates
adjacencyMatrix[lower.tri(adjacencyMatrix,diag = FALSE)] <- NA
# head(adjacencyMatrix)
# nodeList with no-stop-words
nodeList <- tidyCat %>%
anti_join(stop_words) %>%
# Repetitions of each word - for node size later
# sort = FALSE: keep in same order as matrix
count(word, sort = FALSE) %>%
# Classifications of words
left_join(get_sentiments("bing"))
head(nodeList)
## # A tibble: 6 x 3
## word n sentiment
## <chr> <int> <chr>
## 1 bad 2 negative
## 2 ball 5 <NA>
## 3 bed 1 <NA>
## 4 bent 1 negative
## 5 bet 2 <NA>
## 6 bit 3 <NA>
# Select only those rows and columns where the words appear in the nodeList
# AKA - limit to Cat in the Hat without stop words
adjacencyMatrix <- adjacencyMatrix[
which(rownames(adjacencyMatrix) %in% nodeList$word),
colnames(adjacencyMatrix) %in% nodeList$word]
# Check that the matrix and node list are looking at the same things
isTRUE(all.equal(rownames(adjacencyMatrix), colnames(adjacencyMatrix))) &&
  isTRUE(all.equal(rownames(adjacencyMatrix), nodeList$word))
## [1] TRUE
# head(adjacencyMatrix)
igraph is a collection of network analysis tools with an emphasis on efficiency, portability, and ease of use.
# install.packages("igraph")
library(igraph)
catInTheHatGraph <- graph.adjacency(
adjmatrix = adjacencyMatrix,
mode = "undirected",
weighted = TRUE)
plot.igraph(catInTheHatGraph)
# Cool, but not super informative
# Make the vertex size proportionate to the number of repetitions
# 1) Change nodeList column names to match what igraph expects for vertex attributes
colnames(nodeList) <- c("name", "size", "sentiment")
# 2) Attach nodeList DF as the vertex attributes
vertex.attributes(catInTheHatGraph) <- nodeList
plot.igraph(catInTheHatGraph)
# Make the line widths proportionate to the number of times the words co-occur (the weight)
edge.attributes(catInTheHatGraph)$width <- E(catInTheHatGraph)$weight
# OR - change the name of the "weight" attribute to "width"
# names(edge.attributes(catInTheHatGraph)) <- "width"
plot.igraph(catInTheHatGraph)
# Those are super-close together. Aspect ratio default = 1. Change to 0 to make farther apart.
plot.igraph(catInTheHatGraph,
asp = 0)
# color the nodes based on sentiment
# There is no easy way to do this like in ggplot
nodeList$color = ifelse(is.na(nodeList$sentiment), "mediumpurple1",
ifelse(nodeList$sentiment == "negative", "red", "green"))
vertex.attributes(catInTheHatGraph) <- nodeList
# Check to make sure it worked
vertex.attributes(catInTheHatGraph)
## # A tibble: 112 x 4
## name size sentiment color
## <chr> <int> <chr> <chr>
## 1 bad 2 negative red
## 2 ball 5 <NA> mediumpurple1
## 3 bed 1 <NA> mediumpurple1
## 4 bent 1 negative red
## 5 bet 2 <NA> mediumpurple1
## 6 bit 3 <NA> mediumpurple1
## 7 bite 1 <NA> mediumpurple1
## 8 book 1 <NA> mediumpurple1
## 9 books 3 <NA> mediumpurple1
## 10 bow 1 <NA> mediumpurple1
## # ... with 102 more rows
# Add some color
plot.igraph(catInTheHatGraph,
asp = 0,
layout = layout.fruchterman.reingold, #this is default
vertex.label.color = "black",
vertex.color = adjustcolor(col = V(catInTheHatGraph)$color,
# so overlapping vertices are visible. alpha range 0 (transparent):1 (opaque)
alpha.f = 0.6),
# vertex.frame.color = "darkgreen", #if you wanted to change the color around the vertices
vertex.label.cex = 1, #size of the label text. 1 is default
edge.color = "lightsteelblue4",
main = "Cat in the Hat")
# Plot only edges above a threshold weight
min <- 1 # Doesn't work for 2 in this example - probably too few edges remain
plot(catInTheHatGraph,
     asp = 0,
     # give edges at or below the threshold width NA
     edge.width = ifelse(E(catInTheHatGraph)$weight > min,
                         E(catInTheHatGraph)$weight,
                         NA))
# converting this adjacency matrix to an edge list
# All possible combinations between words, whether they existed, and how close
edgeList <- melt(adjacencyMatrix,
varnames = c("word", "wordpair"),
value.name = "weight")
# Only want relations that actually existed - otherwise plot is bizarre
edgeList <- edgeList[which(edgeList$weight > 0 & !is.na(edgeList$weight)), ]
# so edgeList has the column "width"
edgeList$width <- edgeList$weight
edgeGraph <- graph_from_data_frame(d = edgeList, directed = FALSE, vertices = nodeList)
plot.igraph(edgeGraph,
asp = 0,
layout = layout.fruchterman.reingold, #this is default
vertex.label.color = "black",
vertex.color =
# so overlapping vertices are visible. alpha range 0 (transparent):1 (opaque)
adjustcolor(col = V(edgeGraph)$color, alpha.f = 0.6),
# vertex.frame.color = "darkgreen", #if you wanted to change the color around the vertices
vertex.label.cex = 1, #size of the label text. 1 is default
edge.color = "lightsteelblue4",
main = "Cat in the Hat")
# Make sure it doesn't interfere with other network graphing packages
detach(package:igraph)
Now, we’ll go over some different ways that we can analyze networks. These techniques can be used to learn something about a particular network, or they can be used to obtain measures for other aims (e.g., for use as predictors or outcomes in a regression).
One of the most common and obvious things to measure in a network is centrality. Centrality is linked up with ideas of prominence, status, social capital (in social networks), or importance more generally in other networks (e.g., a critical symptom for intervention in a symptom network).
There are several ways to measure centrality, and each takes a slightly different definition of centrality or importance. The four that seem to be most common are degree, closeness, betweenness, and power.
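To make these measures concrete, here is a minimal sketch using igraph and the Cat in the Hat graph built above (an illustration, not part of the original pipeline; it assumes igraph is re-loaded). One caveat: igraph’s closeness() and betweenness() treat edge weights as distances, while our weights code proximity, so we invert the weights for those two measures.
library(igraph)
# degree: number of edges attached to each word
degree(catInTheHatGraph)
# closeness: inverse of the total distance from a word to all other words
closeness(catInTheHatGraph, weights = 1 / E(catInTheHatGraph)$weight)
# betweenness: number of shortest paths that pass through each word
betweenness(catInTheHatGraph, weights = 1 / E(catInTheHatGraph)$weight)
# power: Bonacich power centrality
power_centrality(catInTheHatGraph)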
One thing we’re often interested in is the extent to which there is some underlying structure to a particular network. In social network terms, these groupings are called communities, and we can look for them using community detection algorithms. The image below shows a hypothetical network, with communities depicted by color.
There are several different approaches to community detection, but virtually all of them follow a similar logic: communities are groups of nodes that are more densely connected to one another than to the rest of the network. The algorithm we’ll focus on is called the walktrap algorithm, which looks for communities using random walks. A random walk is just what it sounds like: imagine each node is a place, and each edge is a path. A random walk starts at a place (node), selects a random path (edge), walks along that path to a new place (node), and then keeps going. Clusters or communities are defined as sets of nodes with short random walks between them (relative to the larger network).
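As a quick sketch of what this looks like in practice (again assuming igraph is loaded and reusing the Cat in the Hat graph from above), the walktrap algorithm is available in igraph as cluster_walktrap():
library(igraph)
# Random walks of length 4 (the default), using our co-occurrence weights
wordCommunities <- cluster_walktrap(catInTheHatGraph,
                                    weights = E(catInTheHatGraph)$weight,
                                    steps = 4)
membership(wordCommunities) # which community each word was assigned to
sizes(wordCommunities)      # number of words in each community
# Plot the graph with nodes colored by community
plot(wordCommunities, catInTheHatGraph, asp = 0)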
Three related measures in network analysis are clustering, transitivity, and small worlds. The first two are measures of local clustering, and the third is a bit different but relies on those measures.
Clustering (the clustering coefficient) measures how densely a node’s neighbors are inter-connected. Like other measures, there are several approaches to clustering coefficients, but the basic idea is to take the ratio of the number of connections among a node’s neighbors over the number of possible connections among those neighbors.
Transitivity is another measure of local clustering, and is based on the idea of triads (groups of three nodes). It is the number of closed triads over the number of possible triads, and thus ranges from 0 (no triads are closed) to 1 (every triad is closed). In a sense, transitivity asks: to what extent should we expect friends of friends to be friends themselves?
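A toy example may help (a sketch, assuming igraph is loaded): a triangle with a tail. The triad A-B-C is closed, but the triples running through D (e.g., A-C-D) are open.
library(igraph)
# A triangle (A, B, C) with a tail (D)
g <- graph_from_literal(A - B, B - C, A - C, C - D)
# 1 triangle = 3 closed triples; 5 connected triples in total, so 3/5
transitivity(g, type = "global")
## [1] 0.6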
The smallworldness of a network is best captured by the common notion of six degrees of separation, or the six degrees of Kevin Bacon. The basic premise is that many networks are structured such that the number of steps between two randomly selected nodes is smaller than would be expected by chance. This is why it’s called smallworldness, and it can be measured in a couple of ways.
One smallworldness index is defined as the average clustering coefficient over the average shortest path length; the other uses average transitivity over the average shortest path length. In both cases, each term is taken relative to its value in a random network with the same number of nodes and edges (these are the _random components in the smallworldIndex() output below).
So, both conceptualize smallworldness as local clustering over network-level distance. This captures the intuition of a small world as being more inter-connected than one would expect by chance. A network is said to be a small world if this index is greater than 1 (some researchers use a stricter threshold of 3).
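As a rough sketch of that computation (an illustration, not the qgraph implementation; it assumes an igraph object g, and uses sample_gnm() to generate the chance-level comparison graph):
library(igraph)
smallworldSketch <- function(g) {
  # Random graph with the same number of nodes and edges as g
  gRand <- sample_gnm(n = vcount(g), m = ecount(g))
  trans     <- transitivity(g, type = "global")     # observed local clustering
  transRand <- transitivity(gRand, type = "global") # clustering expected by chance
  apl     <- mean_distance(g)                       # observed average shortest path
  aplRand <- mean_distance(gRand)                   # path length expected by chance
  (trans / transRand) / (apl / aplRand)             # > 1 suggests a small world
}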
Our next example involves a network built from a correlation matrix. The correlation matrix we’ll work with is built into the qgraph package, and consists of inter-item correlations for the Dutch version of the NEO-PI-R. Participants were 500 psychology students.
As alluded to earlier, this is becoming an increasingly common method for conceptualizing measurement in personality and psychopathology (see, e.g., Costantini & Perugini, 2014, and Wetzel et al., 2017).
In addition to loading the data itself, we’ll also load a list called big5groups. This provides a key linking each item to one of the Big Five domains.
# Packages for this example
library(qgraph)   # provides the big5 data and the network functions below
library(ggplot2)  # for the density plot later
# Load big5 dataset:
data(big5)
data(big5groups)
First, let’s take a look at the network. For this, we’ll use the qgraph() function again. For a correlation network, we first need the correlation matrix. We’ll do this by passing cor(data), or cor(big5) in this case, as the first argument (remember that before we just passed in the edge list).
There is one additional complication for correlation networks: what counts as an edge? In this example, we’ll use .1 as the threshold, which is arbitrary. You could instead use significance as a threshold, or some other procedure (qgraph has an adaptive lasso method and a partial correlation method built in, for example). But those take forever to run, so for the sake of time, we’re going to be arbitrary (code for significance as a threshold is included below, commented out). For the significance threshold, we would also give it the sample size.
big5_net <- qgraph(cor(big5),
groups = big5groups,
threshold = .1)
#big5_net <- qgraph(cor(big5),
# groups = big5groups,
# threshold = "sig",
# sampleSize = nrow(big5))
Isn’t it beautiful! Each of the Big 5 sits in its own little cluster, with plenty of connections across the Big 5, of course. However, it is sort of a lie: when you pass in a groups variable like we did, qgraph uses the “groups” layout, which puts everything within a group into a circle together. Let’s change this to the “spring” layout, which is a commonly used layout in networks (it’s what the Twitter graph used; it puts nodes closer together based on how related they are).
qgraph(cor(big5),
groups = big5groups,
threshold = .1,
layout = "spring")
Okay, that looks about right: each of the Big 5 is sort of in its own neighborhood, but there is plenty of mixing. Notice that Extraversion is very spread out, and Openness is sort of out there on its own.
Next, let’s look at centrality for this network. We can look at many of the same measures that we did before, but for the sake of time, I’ll limit this example to betweenness.
big5_cent <- centrality_auto(big5_net)
big5_node_cent <- as_tibble(big5_cent$node.centrality, rownames = "item")
Note: If anyone is feeling ambitious, take a look at the other centrality measures. Note, however, that centrality_auto() does not provide degree for correlation networks (though one could get it; see the sketch below).
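For instance, here is one hedged way to get degree anyway: count each item’s edges in the thresholded correlation matrix (using the same arbitrary .1 threshold as above):
# Number of correlations above threshold for each item (minus the self-correlation)
big5Degree <- colSums(abs(cor(big5)) > .1) - 1
head(big5Degree)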
Let’s look at betweenness, which is the number of shortest paths a node sits on. This gets at whether some items are acting as bridges more often.
psych::describe(big5_node_cent$Betweenness)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 240 103.62 119.05 65.5 82.68 77.84 0 918 918 2.42 9.6
## se
## X1 7.68
The average betweenness is pretty high here (about 104), and the SD is also very high (about 119). Let’s take a look at its distribution:
# turn it into a tibble
as_tibble(big5_node_cent$Betweenness) %>%
# it's calling the variable value by default, so we'll rename it
rename(betweenness = value) %>%
# pass that to ggplot
ggplot(aes(x = betweenness))+
# let's see a density plot
geom_density()
Looks like it’s pretty right-skewed, though not nearly as bad as degree in the twitter graph.
Next, we might want to represent centrality in the graph itself. Let’s make different nodes different sizes based on their betweenness; that way, we can get a quick view of which nodes are most central (from a betweenness perspective), and if that seems to be concentrated in any of our a priori groups.
qgraph(big5_net,
vsize = (big5_node_cent$Betweenness/100),
layout = "spring")
As you can see, item 177 is the highest in betweenness. Additionally, it looks like Extraversion has a lot of high-betweenness nodes, but there is a fair amount of spread in that domain. Neuroticism appears to have some very central nodes too, and Conscientiousness and Agreeableness are potentially next in line. Openness is much less central.
Next, we can look at the local clustering metrics (cluster coefficient and transitivity), and see if the NEO-PI-R has a small world structure.
qgraph has a few options for cluster coefficients. For unweighted graphs (i.e., binary undirected connections, like Facebook friends), the Watts & Strogatz (1998) algorithm (clustWS()) is commonly used. For weighted graphs, we have a few options. We will use the Zhang & Horvath (2005) clustering coefficient, which is intended for weighted graphs. The corresponding qgraph function is clustZhang().
For those curious, try the others:
* clustOnnela()
* clustcoef_auto()
qgraph implements the regular version of each, and also a signed version. The latter is a generalization of the former for signed correlations (which we have), so we’ll want to look at that index (see Costantini & Perugini, 2014).
big5_net_clust <- clustZhang(big5_net)
# Notice that we're taking the signed clustZhang
psych::describe(big5_net_clust$signed_clustZhang) %>%
knitr::kable(caption = "Signed clustering coefficient descriptives for NEO Big 5 correlation network")
| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| X1 | 1 | 240 | 0.1085488 | 0.0269858 | 0.1051004 | 0.1072887 | 0.0253883 | 0.0531074 | 0.185443 | 0.1323356 | 0.4291718 | -0.3714882 | 0.0017419 |
Looks like the mean is .11, suggesting a relatively small amount of clustering (the index ranges from -1 to +1).
Next, we can look at transitivity, which we can get with the transitivity() function. That function is in igraph, so we’ll have to turn our network into an igraph object first. Additionally, there are several options for calculating transitivity. We’ll look at it as calculated for unweighted graphs (i.e., it will ignore the weights, or strength of association between nodes). The main reason for this is that I’m not sure the algorithm igraph implements for weighted graphs (the Barrat algorithm) works with signed data (i.e., + and - weights).
big5_net %>%
# turn it into an igraph object
as.igraph() %>%
# run transitivity
transitivity("local") %>%
# check out descriptives
psych::describe()
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 240 0.49 0.06 0.49 0.49 0.06 0.34 0.65 0.31 0.1 -0.41 0
Okay, now let’s look at the smallworldness of our Big 5 correlation network. Remember, this is a measure of whether one can get across the graph more quickly than would be expected by chance. Values > 1 are considered by some an indication of smallworldness; others use a more stringent threshold of 3.
qgraph has two implementations of the smallworld index:
1. smallworldIndex()
2. smallworldness()

We’ll use the former; the latter takes a really long time to run.
smallworldIndex(big5_net)
## $transitivity
## [1] 0.505082
##
## $transitivity_random
## [1] 0.3583681
##
## $APL
## [1] 1.640202
##
## $APL_random
## [1] 1.600794
##
## $index
## [1] 1.375531
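To see where $index comes from, we can recompute it from the printed components: relative transitivity over relative average path length (APL).
(0.505082 / 0.3583681) / (1.640202 / 1.600794)
## [1] 1.375531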
Two things to point out:
1. Transitivity is a tiny bit different between igraph’s transitivity() and qgraph’s smallworldIndex(). This is due to slight differences in how it’s calculated.
2. The small world index is 1.38, which is above the lower threshold. This suggests that the Big 5, as measured by the NEO-PI-R, has a smallworld structure.
So, that brings us to the end of this example.
Costantini, G., & Perugini, M. (2014). Generalization of clustering coefficients to signed correlation networks. PLoS ONE, 9(2), e88669.
Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440-442.
Wetzel et al. (2017).
Zafarani, R., & Liu, H. (2009). Social Computing Data Repository at ASU [http://socialcomputing.asu.edu]. Tempe, AZ: Arizona State University, School of Computing, Informatics and Decision Systems Engineering.
Zhang, B., & Horvath, S. (2005). A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1), Article 17.