Introduction

Network theory is a subset of graph theory that studies the relations between discrete objects or actors. It is used across many disciplines, including physics, engineering, biology, social sciences, and finance.

(Figure: Twitter conversations in time)

Network Components

Networks are made up of nodes and edges.

Nodes or vertices are the discrete entities of the graph or dataset. These can represent Twitter followers, Facebook friends, participants in a study, items in a questionnaire, words in a text or conversation, or any other discrete concept.

Edges or links are the relations among the nodes. These can be either binary (e.g., Twitter follower or not) or weighted (e.g., correlation coefficient).

Networks may be directed or undirected. In a directed network, the order of the relationship matters. For the Twitter example, the relationship among followers is directed (whether you follow someone is coded separately from whether they follow you). By comparison, a Facebook network is undirected (all friends are friends of each other).
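To make the distinction concrete, here is a minimal sketch contrasting the two using the igraph package (introduced later in this tutorial); the edges and object names here are hypothetical.

# Twitter-style: "A -+ B" codes A following B; B following A is a separate tie
library(igraph)
twitterStyle <- graph_from_literal(A -+ B, B -+ A, A -+ C)
is_directed(twitterStyle)   # TRUE

# Facebook-style: "A - B" is a mutual friendship with no direction
facebookStyle <- graph_from_literal(A - B, A - C)
is_directed(facebookStyle)  # FALSE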

Networks take as input either both an edge list and a node list, or an adjacency matrix. An adjacency matrix is a square matrix in which both the row and column names are nodes.

Simple Input Examples

edgeList <- cbind(a = 1:5, b = c(5,2,4,3,1))
edgeList
##      a b
## [1,] 1 5
## [2,] 2 2
## [3,] 3 4
## [4,] 4 3
## [5,] 5 1
nodeList <- cbind("id" = 1:5)
nodeList
##      id
## [1,]  1
## [2,]  2
## [3,]  3
## [4,]  4
## [5,]  5
adjMat <- matrix(0, 5, 5)
adjMat[edgeList] <- 1
adjMat
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    0    0    0    0    1
## [2,]    0    1    0    0    0
## [3,]    0    0    0    1    0
## [4,]    0    0    1    0    0
## [5,]    1    0    0    0    0

Text Analysis

Cat in the Hat! Nodes are individual words, and edges represent how far apart those words appear in the text.

Creating an Adjacency Matrix from Text

# Load the packages used throughout this section
library(tidyverse)  # str_squish(), %>%, ggplot2, etc.
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
library(reshape2)   # melt() and dcast()

# Read in a text file as text
catInTheHat <- readLines("CatintheHat.txt")

head(catInTheHat)
## [1] "the sun did not shine."        "it was too wet to play."      
## [3] "so we sat in the house"        "all that cold, cold, wet day."
## [5] ""                              "i sat there with sally."
# Remove white space
catInTheHat <- str_squish(catInTheHat) %>% 
  # convert to data frame
  as.data.frame(stringsAsFactors = FALSE) 

# column name is "." - needs to be something easy to call
colnames(catInTheHat) <- "Text"

tidyCat <- catInTheHat %>% 
  unnest_tokens(output = word, input = Text)

# To maintain order of words in poem
tidyCat$wordOrder <- 1:nrow(tidyCat)

head(tidyCat)
##      word wordOrder
## 1     the         1
## 1.1   sun         2
## 1.2   did         3
## 1.3   not         4
## 1.4 shine         5
## 2      it         6

Weighting by Distance

To maintain the order of the poem, we are doing this part before removing stop words.

# Function for matching words that are close by
# Essentially offsets the "word" column by X# in each direction
# next = the word after, prior = the word before
     
closeWords <- function(dataframe, distance) {
  dataframe[, paste0("prior.", distance)] <- dataframe[(
    # NA if words are at the beginning of poem
    ifelse(dataframe$wordOrder - distance >= 1,
           dataframe$wordOrder - distance, NA)), "word"]
  
  dataframe[, paste0("next.", distance)] <- dataframe[(
    # NA if words are at the end of poem
    ifelse(dataframe$wordOrder + distance <= nrow(dataframe),
           dataframe$wordOrder + distance, NA)), "word"]
  
  return(dataframe)
}


# select words 1-3 away
# Any distance is fine! What matters to you will change based on the text.
for (i in 1:3) {
  tidyCat <- closeWords(dataframe = tidyCat, distance = i)
}
rm(i)

# Melt to create a DF that represents the relation between words
# rows = 6 instances of each word
# wordPair = the word X# before or after that word in the poem
buildingMatrix <- tidyCat %>% 
  melt(id.vars = c("word", "wordOrder"),
       variable.name = "relation",  
       value.name = "wordPair")

# Maximum distance between nodes
maxWeight <- 3

# So relations 1 word apart are weighted = 3, 3 words apart = 1, farther = 0
buildingMatrix$weight <- ifelse(is.na(buildingMatrix$wordPair), 0,
  # reverse code
  maxWeight + 1 - 
  # distance from word
  as.numeric(str_sub(buildingMatrix$relation, start = -1)))

head(buildingMatrix)
##    word wordOrder relation wordPair weight
## 1   the         1  prior.1     <NA>      0
## 2   sun         2  prior.1      the      3
## 3   did         3  prior.1      sun      3
## 4   not         4  prior.1      did      3
## 5 shine         5  prior.1      not      3
## 6    it         6  prior.1    shine      3

Create Adjacency Matrix

# Turn it into a weighted matrix
# Full = all 236 words in the poem
adjacencyMatrix <- dcast(
  data = buildingMatrix, 
  formula = word ~ wordPair, 
  value.var = "weight")

# Make row names words 
rownames(adjacencyMatrix) <- adjacencyMatrix$word
# Remove columns "word" and "NA"
adjacencyMatrix <- 
  adjacencyMatrix[, !names(adjacencyMatrix) %in% c("word", "NA")] %>% 
  # convert to matrix
  as.matrix()

# Insert NAs for duplicates
adjacencyMatrix[lower.tri(adjacencyMatrix,diag = FALSE)] <- NA

# head(adjacencyMatrix)

Create Node List

# nodeList with no-stop-words
nodeList <- tidyCat %>% 
  anti_join(stop_words) %>% 
  # Repetitions of each word - for node size later
  # sort = FALSE: keep in same order as matrix
  count(word, sort = FALSE) %>% 
  # Classifications of words
  left_join(get_sentiments("bing"))

head(nodeList)
## # A tibble: 6 x 3
##   word      n sentiment
##   <chr> <int> <chr>    
## 1 bad       2 negative 
## 2 ball      5 <NA>     
## 3 bed       1 <NA>     
## 4 bent      1 negative 
## 5 bet       2 <NA>     
## 6 bit       3 <NA>

Limit Adjacency Matrix

# Select only those rows and columns where the words appear in the nodeList 
# AKA - limit to Cat in the Hat without stop words
adjacencyMatrix <- adjacencyMatrix[
  which(rownames(adjacencyMatrix) %in% nodeList$word),  
  colnames(adjacencyMatrix) %in% nodeList$word]

# Check that the matrix and nodelist are looking at the same things
isTRUE(all.equal(rownames(adjacencyMatrix), colnames(adjacencyMatrix))) &&
  isTRUE(all.equal(rownames(adjacencyMatrix), nodeList$word))
## [1] TRUE
# head(adjacencyMatrix)

Text Plotting

igraph is a collection of network analysis tools with the emphasis on efficiency, portability and ease of use.

Adjacency Matrix

# install.packages("igraph")
library(igraph)

catInTheHatGraph <- graph.adjacency(
  adjmatrix = adjacencyMatrix, 
  mode = "undirected", 
  weighted = TRUE)

plot.igraph(catInTheHatGraph)

# Cool, but not super informative

# Make the vertex size proportionate to the number of repetitions
# 1) Change nodeList column names to match what igraph expects for vertex attributes
colnames(nodeList) <- c("name", "size", "sentiment")
# 2) Attach nodeList DF as the vertex attributes
vertex.attributes(catInTheHatGraph) <- nodeList
plot.igraph(catInTheHatGraph)

# Make the line widths proportionate to the number of times the words co-occur (the weight)
edge.attributes(catInTheHatGraph)$width <- E(catInTheHatGraph)$weight
# OR - change the name of the "weight" attribute to "width"
# names(edge.attributes(catInTheHatGraph)) <- "width"
plot.igraph(catInTheHatGraph)

# Those are super-close together. Aspect ratio default = 1. Change to 0 to make farther apart.
plot.igraph(catInTheHatGraph,
            asp = 0)

# color the nodes based on sentiment
# There is no easy way to do this like in ggplot
nodeList$color = ifelse(is.na(nodeList$sentiment), "mediumpurple1",
                           ifelse(nodeList$sentiment == "negative", "red", "green"))
vertex.attributes(catInTheHatGraph) <- nodeList

# Check to make sure it worked
vertex.attributes(catInTheHatGraph)
## # A tibble: 112 x 4
##    name   size sentiment color        
##    <chr> <int> <chr>     <chr>        
##  1 bad       2 negative  red          
##  2 ball      5 <NA>      mediumpurple1
##  3 bed       1 <NA>      mediumpurple1
##  4 bent      1 negative  red          
##  5 bet       2 <NA>      mediumpurple1
##  6 bit       3 <NA>      mediumpurple1
##  7 bite      1 <NA>      mediumpurple1
##  8 book      1 <NA>      mediumpurple1
##  9 books     3 <NA>      mediumpurple1
## 10 bow       1 <NA>      mediumpurple1
## # ... with 102 more rows
# Add some color
plot.igraph(catInTheHatGraph,
            asp = 0,
            layout = layout.fruchterman.reingold, #this is default
            vertex.label.color = "black", 
            vertex.color = adjustcolor(col = V(catInTheHatGraph)$color, 
                                     # so overlapping vertices are visible. alpha range 0 (transparent):1 (opaque)
                                       alpha.f = 0.6),
            # vertex.frame.color = "darkgreen", #if you wanted to change the color around the vertices
            vertex.label.cex = 1, #size of the label text. 1 is default
            edge.color = "lightsteelblue4",
            main = "Cat in the Hat")

# Plot above threshold
# (named minWeight rather than min to avoid masking base::min)
minWeight <- 1 # Doesn't work for 2 in this example - probably too few
plot(catInTheHatGraph,
     asp = 0,
     # plot.igraph controls line thickness via edge.width
     edge.width = ifelse(E(catInTheHatGraph)$weight > minWeight,
                         E(catInTheHatGraph)$weight, 
                         NA))

Creating an Edge List

# converting this adjacency matrix to an edge list
# All possible combinations between words, whether they existed, and how close
edgeList <- melt(adjacencyMatrix, 
                 varnames = c("word", "wordpair"), 
                 value.name = "weight")

# Only want relations that actually existed - otherwise plot is bizarre
edgeList <- edgeList[which(edgeList$weight > 0 & !is.na(edgeList$weight)), ]

Plotting with Edge & Node Lists

# so edgeList has the column "width"
edgeList$width <- edgeList$weight

edgeGraph <-  graph_from_data_frame(d = edgeList, directed = FALSE, vertices = nodeList)

plot.igraph(edgeGraph,
            asp = 0,
            layout = layout.fruchterman.reingold, #this is default
            vertex.label.color = "black", 
            vertex.color =
              # so overlapping vertices are visible. alpha range 0 (transparent):1 (opaque)
              adjustcolor(col = V(edgeGraph)$color, alpha.f = 0.6),
            # vertex.frame.color = "darkgreen", #if you wanted to change the color around the vertices
            vertex.label.cex = 1, #size of the label text. 1 is default
            edge.color = "lightsteelblue4",
            main = "Cat in the Hat")

# Make sure it doesn't interfere with other network graphing packages
detach(package:igraph)

Analyzing Networks

Now, we’ll go over some different ways that we can analyze networks. These techniques can be used to learn something about a particular network, or they can be used to obtain measures for other aims (e.g., for use as predictors or outcomes in a regression).

Centrality

One of the most common and obvious things to measure in a network is centrality. Centrality is linked to ideas of prominence, status, and social capital in social networks, or importance more generally in other kinds of networks (e.g., a critical symptom for intervention in a symptom network).

There are several ways to measure centrality, and each takes a slightly different definition of centrality or importance. The four that seem to be most common are degree, closeness, betweenness, and power.

  1. Degree: connections matter
    • The number of ties/connections the actor or ego has to others.
    • In directed graphs:
      • in-degree is the number of incoming ties
      • out-degree is the number of outgoing ties
  2. Closeness: being close (or really, not being far away) matters
    • Measures the extent to which a node is close to all other actors.
    • The inverse of the distance from that node to every other node in the network.
  3. Betweenness: bridging gaps is what matters
    • The number of shortest paths or geodesics the actor sits on.
    • Requires the network to be connected (there is some path between every possible pair of nodes).
  4. Power: it’s not just how many connections, but who you’re connected to
    • In broad strokes, it’s a weighted degree measure, with more weight given to connections to more central nodes.
    • Eigenvector centrality (which Google uses for PageRank) is a similar measure.
(Figure: different centrality measures)
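To make the four measures above concrete, here is a minimal sketch computing each of them with igraph on a small toy graph; the graph and object names are hypothetical, for illustration only.

# A hypothetical toy graph for illustrating centrality measures
library(igraph)
centralityToy <- graph_from_literal(A - B, A - C, A - D, B - C, D - E)

degree(centralityToy)                   # 1. degree: number of ties per node
closeness(centralityToy)                # 2. closeness: inverse of summed distances to all others
betweenness(centralityToy)              # 3. betweenness: number of shortest paths a node sits on
power_centrality(centralityToy)         # 4. Bonacich power centrality
eigen_centrality(centralityToy)$vector  # the related eigenvector centrality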

Communities

One thing we’re often interested in is the extent to which there is some underlying structure to a particular network. In social network terms, these structures are called communities, and we can look for them using community detection algorithms. The image below shows a hypothetical network, with communities depicted by color.

(Figure: hypothetical communities)


There are several different approaches to community detection, but virtually all of them follow a similar logic: communities are groups of nodes that are more densely connected to one another than to the rest of the network. The algorithm we’ll focus on is called the walktrap algorithm, which looks for communities using random walks. A random walk is just what it sounds like: imagine each node is a place, and each edge is a path. A random walk starts at a place (node), selects a random path (edge), walks along that path to a new place (node), and then keeps going. Clusters or communities are defined as sets of nodes connected by short random walks (relative to the larger network).
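As a minimal sketch, here is the walktrap algorithm run on a simulated graph with built-in community structure; the generator, its parameters, and the object names are all hypothetical, but cluster_walktrap() is the igraph function we use on real data below.

# Simulate three dense "islands" of 10 nodes, loosely connected to each other
library(igraph)
set.seed(42)
communityToy <- sample_islands(islands.n = 3, islands.size = 10,
                               islands.pin = 0.8, n.inter = 2)

# Random walks tend to stay trapped within densely connected regions
communityToyWT <- cluster_walktrap(communityToy)
membership(communityToyWT)  # which community each node was assigned to
sizes(communityToyWT)       # how many nodes fell in each community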

Clustering, Transitivity, and Small Worlds

Three related measures in network analysis are clustering, transitivity, and small worlds. The first two are measures of local clustering, and the third is a bit different but relies on those measures.

Clustering

The clustering coefficient is a measure of local clustering in a network, or how densely a node’s neighbors are interconnected. Like other measures, there are several approaches to computing clustering coefficients, but the basic idea is to take the ratio of the number of connections between a node’s neighbors over the number of possible connections between those neighbors.
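As a minimal sketch of that ratio, consider a small hypothetical graph; igraph’s transitivity() with type = "local" returns exactly this neighbor-connectedness ratio for each node.

library(igraph)
clusteringToy <- graph_from_literal(A - B, B - C, A - C, C - D)

# For node B: neighbors are A and C, and the A - C edge exists,
# so 1 of 1 possible neighbor connections -> coefficient of 1
transitivity(clusteringToy, type = "local")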

Transitivity

Transitivity is another measure of local clustering, and is based on the idea of triads (groups of three nodes). It is the number of closed triads over the number of possible triads. It thus ranges from zero (no triads are closed) to 1 (every triad is closed). In a sense, transitivity is asking: to what extent should we expect friends of friends to be friends themselves?
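A minimal sketch of the global version, on the same kind of hypothetical graph; here transitivity() counts closed triads over possible triads.

library(igraph)
transitivityToy <- graph_from_literal(A - B, B - C, A - C, C - D)

# One closed triad (A, B, C); the open triads running through C - D
# keep the ratio below 1
transitivity(transitivityToy, type = "global")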

Smallworldness

The smallworldness of a network is best captured by the common notion of six degrees of separation, or the six degrees of Kevin Bacon. The basic premise is that many networks are structured such that the number of steps between two randomly selected nodes is smaller than would be expected by chance. This is why it’s called smallworldness, and it can be measured in a couple of ways.

One smallworldness index is defined as the average clustering coefficient over the average shortest path length, and the other as the average transitivity over the average shortest path length (each taken relative to the value expected in a comparable random network).

So, they both conceptualize smallworldness as local clustering over network-level distance. This captures the intuition of a small world as being more inter-connected than one would expect by chance. A network is said to be a smallworld if this index is greater than 1 (some researchers use a stricter threshold of 3).
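As a minimal sketch of that index, the code below compares a simulated Watts-Strogatz small world against a random graph with the same number of nodes and edges; the generator and its parameters are hypothetical, and qgraph’s smallworldIndex() (used later) performs a comparison of this kind for you.

library(igraph)
set.seed(42)

# A ring lattice with a few randomly rewired edges: high clustering, short paths
swToy <- sample_smallworld(dim = 1, size = 100, nei = 3, p = 0.05)
randomToy <- sample_gnm(n = gorder(swToy), m = gsize(swToy))

# local clustering relative to chance, over distance relative to chance
(transitivity(swToy) / transitivity(randomToy)) /
  (mean_distance(swToy) / mean_distance(randomToy))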

Example 1: Social Network (Twitter followers)

Now let’s put all of this to use. For this first example, I’ll be using a dataset of anonymized Twitter follower networks. This data comes from Arizona State University’s Social Computing Program (Zafarani & Liu, 2009). The data we’ll be working with is a smaller subset of the original dataset, reduced to a more manageable size.

Twitter follower networks are directed networks: A can follow B without B following A.

library(tidyverse)
library(rio)
library(qgraph)
library(igraph)

twitter_sample_df <- import("twitter_sample.csv")
head(twitter_sample_df)
##    node_id alter_id
## 1 10127550       NA
## 2  6464887  5994113
## 3  5247112  1349110
## 4  5247112    53468
## 5  5247112  1202400
## 6  5290409  1349110

As you can see, the above data is an example of an edge list; it lists every edge in the network (i.e., each row is a pair of connected nodes).

We’ll be using both the qgraph and igraph libraries for this example. qgraph is pretty straightforward to use, and has the benefit of being made by psychologists (so it has a lot of what we, as psychologists, want). igraph has some functionality that qgraph doesn’t, so we will use it for a few measures.

Visualizing the Network

First, you may want to see what the network looks like. We can visualize the network with the qgraph() function from the qgraph library.

twitter_net <- 
  twitter_sample_df %>%
  # remove missing rows.
  # these can cause some issues when turning an
  # edgelist into a social network graph
  na.omit() %>%
  # Next, make sure the node_id and alter_id variables
  # are character variables. 
  mutate_if(is.numeric, as.character) %>% 
  qgraph()

A couple of things might jump out. First, it looks like some nodes have a lot of connections and some nodes have very few, so there are probably differences in the centrality of the nodes. Second, it looks like there are some pretty dense clusters of nodes in various places. This suggests there may be some communities in the network. First let’s look at centrality and then we’ll look for communities.

Centrality

It is very straightforward to get the centrality measures we care about in qgraph. You just use centrality() or centrality_auto(). The main difference is that centrality_auto() has some useful defaults for common types of graphs. In our case, we have a disconnected graph (not every node is connected), which presents a problem for certain measures (closeness, for example). centrality_auto() deals with this by calculating closeness for just the largest component (i.e., just the largest part of the network that is connected). So we’ll use centrality_auto() for our network.

twitter_cent <- centrality_auto(twitter_net)

The Centrality object has three different dataframes. We’ll work with the node.centrality dataframe, since we’re looking for centrality indices about nodes right now.

twitter_node_cent <- as_tibble(twitter_cent$node.centrality, rownames = "node_id")

twitter_node_cent
## # A tibble: 2,147 x 7
##    node_id Betweenness Closeness InDegree OutDegree OutExpectedInfluence
##    <chr>         <dbl>     <dbl>    <dbl>     <dbl>                <dbl>
##  1 6464887           0         0        0         1                    1
##  2 5247112           0         0        0         3                    3
##  3 5290409           0         0        0         1                    1
##  4 8512052           0         0        0         1                    1
##  5 5778622           0         0        0         1                    1
##  6 2838239           0         0        0         1                    1
##  7 412216            0         0        0         2                    2
##  8 6440079           0         0        0         1                    1
##  9 7181350           0         0        0         1                    1
## 10 4232907           0         0        0         2                    2
## # ... with 2,137 more rows, and 1 more variable: InExpectedInfluence <dbl>

Degree

First, we might want to get some descriptives on in-degree and out-degree. These have a straightforward interpretation in this twitter data:

  • in-degree = # of people that follow you
  • out-degree = # of people that you follow.
twitter_node_cent %>% 
  select(contains("Degree")) %>% 
  psych::describe() %>% 
  knitr::kable(caption = "Degree Centrality for Twitter Follower Network")
Degree Centrality for Twitter Follower Network

|           | vars |    n |     mean |        sd | median |   trimmed | mad | min | max | range |      skew | kurtosis |        se |
|:----------|-----:|-----:|---------:|----------:|-------:|----------:|----:|----:|----:|------:|----------:|---------:|----------:|
| InDegree  |    1 | 2147 | 1.005589 |  0.718525 |      1 | 0.9860384 |   0 |   0 |  16 |    16 |  8.941473 | 154.0393 | 0.0155069 |
| OutDegree |    2 | 2147 | 1.005589 | 17.290618 |      0 | 0.0139616 |   0 |   0 | 608 |   608 | 27.877039 | 857.0494 | 0.3731595 |

It looks like the average in-degree and out-degree are both quite low, but the variability in out-degree is very high. Since out-degree here is the number of people a user follows, this means there is much more variability in how many accounts people follow than in how many followers they have.

twitter_node_cent %>% 
  select(contains("Degree")) %>% 
  gather(type, degree) %>% 
ggplot(aes(x = degree, fill = type))+
  geom_density(alpha = .4)

As you can see, degree is highly skewed. This is pretty typical, and is why you’ll hear people say that centrality in networks follows a power law. It is also something to consider when planning studies looking at degree centrality.

Closeness

To get closeness from the centrality list we obtained above, we grab out that vector. Let’s take a look at the mean and standard deviation of that measure.

twitter_node_cent %>% 
  select(Closeness) %>% 
  summarize(mean = mean(Closeness, na.rm = TRUE), 
            sd = sd(Closeness, na.rm = TRUE))
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1     0     0

This network is so sparse that everyone has a closeness score of 0, even when looking just at the largest component.

Betweenness

To get betweenness, we just pull that vector from the centrality results list.

twitter_node_cent %>% 
  select(Betweenness) %>% 
  summarize(mean = mean(Betweenness, na.rm = TRUE), 
            sd = sd(Betweenness, na.rm = TRUE))
## # A tibble: 1 x 2
##    mean    sd
##   <dbl> <dbl>
## 1     0     0

This one again has a mean and SD of zero, which speaks to the sparsity of this network.

Power

This one is not in the centrality results object from qgraph, but igraph has a function for it. So we’ll turn our qgraph into an igraph and pass that to igraph’s power_centrality() function. We’re going to save the igraph object though, because we will need it again later.

twitter_igraph <-  
  # Take our qgraph network
  twitter_net %>% 
  # turn it into an igraph object with
  # as.igraph
  as.igraph()

twitter_power_cent <-
  # take igraph object made above
  twitter_igraph %>% 
  # calculate power centrality
  power_centrality() %>% 
  # turn it into a tibble so
  # we can easily plot it
  as_tibble() %>% 
  # rename the value column power
  rename(power = value)

psych::describe(twitter_power_cent$power) %>% 
  knitr::kable(caption = "Power Centrality Descriptives for Twitter Follower Network")
Power Centrality Descriptives for Twitter Follower Network

|    | vars |    n |      mean |        sd | median |   trimmed | mad | min |      max |    range |     skew | kurtosis |        se |
|:---|-----:|-----:|----------:|----------:|-------:|----------:|----:|----:|---------:|---------:|---------:|---------:|----------:|
| X1 |    1 | 2147 | 0.0580735 | 0.9985449 |      0 | 0.0008063 |   0 |   0 | 35.11241 | 35.11241 | 27.87704 | 857.0494 | 0.0215502 |
ggplot(twitter_power_cent, aes(x = power)) +
  geom_density()

You can see that this, like degree, is highly skewed.

A couple of takeaways here:

  1. Centrality can be measured in several different ways.
  2. Each measure considers a different aspect of centrality.
  3. Some may be more or less appropriate in particular contexts.
    • E.g., closeness and betweenness are not very useful in a sparse network like this one.

Communities

The igraph library has some more functionality for clustering algorithms. It looks like qgraph does not have a clustering algorithm for directed graphs, but igraph does: the function cluster_walktrap(), which uses the walktrap algorithm described above.

# run the cluster_walktrap function
twitter_net_communities <- cluster_walktrap(twitter_igraph)

# look at how membership is recorded
head(twitter_net_communities$membership)
## [1]  4  2  2  6 75 61

Membership is coded as an integer; nodes with the same integer belong to the same community.

Next, let’s see how many communities it found:

# how many communities?
length(unique(twitter_net_communities$membership))
## [1] 126

Okay, it looks like there are 126 communities. I suppose that is unsurprising, given how large that network is.

Now let’s visualize these communities. In qgraph(), you can color nodes based on community membership using the groups argument, so we’ll pass it the membership vector.

# make sure you use the twitter_net object
# not the igraph object
qgraph(twitter_net,
       groups = twitter_net_communities$membership)

You can see some of the groups that looked like communities when we first graphed it are indeed communities. But, some of the groups of nodes are actually not in the same communities, which is interesting. We don’t know much about this particular network, so we can’t say much about what these communities mean.

We’ll save the clustering coefficient, transitivity, and smallworldness for the next example. Calculating these in directed graphs gets a little tricky, so we’ll leave our Twitter data here.

Example 2: Correlation Network

Our next example will involve a network built from a correlation matrix. The correlation matrix we’ll work with is built into the qgraph package, and consists of inter-item correlations for the Dutch version of the NEO-PI-R. Participants were 500 psychology students.

As alluded to earlier, this is becoming an increasingly common method for conceptualizing measurement in personality and psychopathology; see, e.g., Wetzel et al. (2017).

In addition to loading the data itself, we’ll also load a list called big5groups. This basically provides a key linking each item to one of the Big Five domains.

# Load big5 dataset:
data(big5)
data(big5groups)

Visualize the network

First, let’s take a look at the network. For this, we’ll use the qgraph() function again. For a correlation network, we first need the correlation matrix itself. We’ll do this by passing in cor(data), or cor(big5) in this case, as the first argument (remember, before we just passed in the edge list).

There is one additional complication for correlation networks: what counts as an edge? In this example, we’ll use .1 as the threshold, which is arbitrary. You could instead use significance as a threshold, or some other procedure (qgraph has an adaptive lasso method and a partial correlation method built in, for example). But those take forever to run, so for the sake of time, we’re going to be arbitrary (code for significance as a threshold is included below, commented out). Lastly, we give it the sample size.

big5_net <- qgraph(cor(big5),
       groups = big5groups,
       threshold = .1)

#big5_net <- qgraph(cor(big5),
#       groups = big5groups,
#       threshold = "sig",
#       sampleSize = nrow(big5))

Isn’t it beautiful! Each of the Big 5 is in its own little cluster, with plenty of connections across the Big 5, of course. However, it is sort of a lie: when you pass a groups variable like we did, qgraph uses the groups layout, which puts everything within a group into a circle together. Let’s change this to the “spring” layout, a commonly used layout for networks (it’s what the Twitter graph used; it puts nodes closer together based on how related they are).

qgraph(cor(big5),
       groups = big5groups,
       threshold = .1,
       layout = "spring")

Okay, that looks about right - each of the Big 5 is sort of in its own neighborhood, but there is plenty of mixing. Notice that Extraversion is very spread out, and Openness is sort of out there on its own.

Centrality

Next, let’s look at centrality for this network. We can look at many of the same measures that we did before, but for the sake of time, I’ll limit this example to betweenness.

big5_cent <- centrality_auto(big5_net)
big5_node_cent <- as_tibble(big5_cent$node.centrality, rownames = "item")

Note: If anyone is feeling ambitious, take a look at the other centrality measures. Note, however, that centrality_auto() does not provide degree for correlation networks (though one could get it).

Betweenness

Let’s look at betweenness, which is the number of shortest paths a node sits on. This gets at whether some items are acting as bridges more often.

psych::describe(big5_node_cent$Betweenness)
##    vars   n   mean     sd median trimmed   mad min max range skew kurtosis
## X1    1 240 103.62 119.05   65.5   82.68 77.84   0 918   918 2.42      9.6
##      se
## X1 7.68

The average betweenness is pretty high here (about 104), and the SD is also very high (about 119). Let’s take a look at its distribution:

# turn it into a tibble
as_tibble(big5_node_cent$Betweenness) %>% 
  # it's calling the variable value by default, so we'll rename it
  rename(betweenness = value) %>%
  # pass that to ggplot
  ggplot(aes(x = betweenness))+
  # let's see a density plot
  geom_density()

Looks like it’s pretty right-skewed, though not nearly as bad as degree in the twitter graph.

Incorporating Centrality in Graph

Next, we might want to represent centrality in the graph itself. Let’s make different nodes different sizes based on their betweenness; that way, we can get a quick view of which nodes are most central (from a betweenness perspective), and if that seems to be concentrated in any of our a priori groups.

qgraph(big5_net, 
       vsize = (big5_node_cent$Betweenness/100),
       layout = "spring")

As you can see, item 177 is the highest in betweenness. Additionally, it looks like Extraversion has a lot of high-betweenness nodes, but there is a fair amount of spread in that domain. Neuroticism appears to have some very central nodes too, and Conscientiousness and Agreeableness are potentially next in line. Openness is much less central.

Clustering, Transitivity, and smallworldness

Next, we can look at the local clustering metrics (cluster coefficient and transitivity), and see if the NEO-PI-R has a small world structure.

Cluster Coefficient

qgraph has a few options for cluster coefficients. For unweighted graphs (i.e., binary undirected connections, like Facebook friends), the Watts & Strogatz (1998) algorithm (clustWS()) is commonly used. For weighted graphs, we have a few options. We will use the Zhang & Horvath (2005) clustering coefficient, which is intended for weighted graphs. The corresponding qgraph function is clustZhang().

For those curious, try the others:

  • clustOnnela()
  • clustcoef_auto()

qgraph() implements the regular version of each, and also a signed version. The latter is a generalization of the former for signed correlations (which we have), so we’ll want to look at that index (see Costantini & Perugini, 2014).

big5_net_clust <-  clustZhang(big5_net)

# Notice that we're taking the signed clustZhang
psych::describe(big5_net_clust$signed_clustZhang) %>% 
  knitr::kable(caption = "Clustering Coefficient Descriptives for NEO Big 5 Correlation Network")

Clustering Coefficient Descriptives for NEO Big 5 Correlation Network

|    | vars |   n |      mean |        sd |    median |   trimmed |       mad |       min |      max |     range |      skew |   kurtosis |        se |
|:---|-----:|----:|----------:|----------:|----------:|----------:|----------:|----------:|---------:|----------:|----------:|-----------:|----------:|
| X1 |    1 | 240 | 0.1085488 | 0.0269858 | 0.1051004 | 0.1072887 | 0.0253883 | 0.0531074 | 0.185443 | 0.1323356 | 0.4291718 | -0.3714882 | 0.0017419 |

Looks like the mean is .11, suggesting a relatively small amount of clustering (the signed coefficient ranges from -1 to +1).

Transitivity

Next, we can look at transitivity, which we can get with the transitivity() function. That function is in igraph, so we’ll have to turn our network into an igraph object first. Additionally, there are several options for calculating transitivity. We’ll look at it as calculated for unweighted graphs (i.e., it will ignore the weights, or strength of association between nodes). The main reason for this is that I’m not sure the weighted algorithm igraph implements (the Barrat algorithm) works with signed data (i.e., + and - weights).

big5_net %>% 
  # turn it into an igraph object
  as.igraph() %>% 
  # run transitivity
  transitivity("local") %>% 
  # check out descriptives
  psych::describe()
##    vars   n mean   sd median trimmed  mad  min  max range skew kurtosis se
## X1    1 240 0.49 0.06   0.49    0.49 0.06 0.34 0.65  0.31  0.1    -0.41  0

Smallworldness

Okay, now let’s look at the smallworldness of our Big 5 correlation network. Remember, this is a measure of whether one can get across the graph more quickly than would be expected by chance. Values greater than 1 are considered by some an indication of smallworldness; some use a more stringent threshold of 3.

qgraph has two implementations of the smallworld index:

  1. smallworldIndex()
  2. smallworldness()

We’ll use the former; the latter takes a long time to run.

smallworldIndex(big5_net)
## $transitivity
## [1] 0.505082
## 
## $transitivity_random
## [1] 0.3583681
## 
## $APL
## [1] 1.640202
## 
## $APL_random
## [1] 1.600794
## 
## $index
## [1] 1.375531

Two things to point out:

  1. Transitivity is a tiny bit different between igraph’s transitivity() and qgraph’s smallworldIndex(). This is due to slight differences in how it’s calculated.
  2. The small world index is 1.38, which is above the lower threshold. This suggests that the Big 5, as measured by the NEO-PI-R, has a smallworld structure.

So, that brings us to the end of this example. Some takeaways:

  1. Within the NEO-PI-R, Extraversion items appear to be the highest in betweenness - they bridge the gap between more distant personality items.
  2. The Big Five, as measured by the NEO-PI-R, is a small-ish world.

Minihacks

References

Costantini, G., & Perugini, M. (2014).
Watts, D. J., & Strogatz, S. H. (1998).
Wetzel, E., et al. (2017).
Zhang, B., & Horvath, S. (2005).
Zafarani, R., & Liu, H. (2009). Social Computing Data Repository at ASU [http://socialcomputing.asu.edu]. Tempe, AZ: Arizona State University, School of Computing, Informatics and Decision Systems Engineering.