(This is a chapter from Complex networks, complex processes.)

We can improve both the performance and the statistical properties of our simulations by changing the simulation approach we use. We *won't* try to optimise or improve the performance of synchronous dynamics, although there's certainly scope to do so: instead, we'll *replace* the synchronous approach with another technique that (it turns out) is better suited to the accurate simulation of large systems.

The technique we use is sometimes called *Gillespie's stochastic simulation algorithm* or simply *Gillespie simulation*. It was developed initially [Gil76] [Gil77] to perform *ab initio* chemical simulations, where a lot of molecules react according to a set of simple chemical rules – a situation that's very similar to a process over a network. Cao *et alia* [CGP06, section II] provide a very accessible description of the basic mathematics of the technique, which we'll develop in a network context below.

The essence of Gillespie simulation is the observation that we can manipulate the probabilities governing events. Instead of testing in every discrete timestep which of the available events can occur (for example the transition from susceptible to infected in SIR), we predict the instant of time at which the next event will occur – skipping the intermediate time when nothing happens. To put this another way, we convert the probabilities of individual events in *space* into aggregate probability distributions of events over *time*. If the simulation is such that a lot of "empty" timesteps occur, then this approach will avoid the costs of simulating them. It has the additional advantage of operating in continuous time with only a single event happening at each instant, which solves the problem of events affecting each other within a timestep.

Unfortunately these benefits come at the cost of some fairly subtle mathematics needed to manipulate the probability distributions into the required form. We'll deal with this first, and then encode the result as a new simulation dynamics that we can use to simulate epidemics using the *same* compartmented process models as we used for the synchronous case.

In the synchronous simulation in the previous chapter we took all the places at which an event could occur and probabilistically chose some of them for firing. Infection happens along SI edges. (We can also identify SS, SR, II, and RR edges, and these play important rôles in some epidemic models, although not in SIR.) SIR assumes that the dynamics occurs at these loci independently. If we denote the probability of an SI edge transmitting an infection as $\beta$ as usual, then the rate at which edges in the network transmit infection is given by $\beta [SI]$ where $[SI]$ denotes the number of SI edges in the network (the size of the locus, in other words). $[SI]$ is of course a function of time, since the population of SI edges is changed by the infection event. Similarly if infected nodes are removed with probability $\alpha$ the rate of recovery is given by $\alpha [I]$. In a sense the values of $[SI]$ and $[I]$ constitute the "state" of the dynamical system. Each infection event will decrease $[SI]$ by one and increase $[I]$ by a value that depends on the degree of the newly-infected node and how many of those adjacent nodes are susceptible. This indicates that the dynamics entwines three distinct features:

- the probabilities of different events;
- the number of places at which these events can occur; and
- the topology of the network that controls how the populations of different loci evolve.

It is this third feature that distinguishes the network formulation from the differential equation formulation, since it allows heterogeneity of evolution in both space and time.

Let us re-formulate the above in a way that's more explicitly continuous in nature. The probability that some SI edge will transmit infection in a small time $dt$ is given by $a_I \, dt = \beta [SI] \, dt$, and recovery similarly by $a_R \, dt = \alpha [I] \, dt$. We can now ask two questions: given the state of the network,

- when will the next event occur?, and
- what event will it be?

Clearly these are probabilistic questions, so the answers will be formulated as probability distributions. Let's define a probability distribution $P(\tau, e) \, d\tau$ as the probability that an event will happen in the interval $(t + \tau, t + \tau + d\tau)$ *and* that that event will be of type $e$, which for SIR will be either an infection ($I$) or a recovery ($R$) event. So at time $t$ we're looking at the distribution of the times $\tau$ between $t$ and the next event, and the identity of that event. This is a joint probability density function on the space of $\tau$ and $e$, where $\tau$ is a continuous random variable and $e$ is a discrete random variable. We can then draw a pair of values $(\tau, e)$ from this distribution to give us the time to the next event and its identity.

Note also that the value of $\tau$ answers the first question above, while the value of $e$ answers the second.

What do we expect from this distribution? Intuitively, a system where there are lots of places where events can occur should give rise to a high likelihood of drawing a small value of $\tau$ from the distribution: the events happen close together in time. Conversely, as the number of places available decreases, it becomes more likely that we'll draw a larger value of $\tau$.
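We can check this intuition numerically. Anticipating the result derived below – that inter-event times are exponentially distributed with the total event rate as parameter – the following standalone sketch (using only the standard library; the helper name is ours, not part of any simulation framework) confirms that the mean interval shrinks as the total rate grows:

```python
import random

def mean_interevent_time(a, n = 100000, seed = 42):
    # draw n exponentially-distributed inter-event times for total
    # event rate a and return their mean, which should approach 1/a
    rng = random.Random(seed)
    return sum(rng.expovariate(a) for _ in range(n)) / n

busy = mean_interevent_time(10.0)    # many active loci: high total rate
quiet = mean_interevent_time(0.5)    # few active loci: low total rate
```

A busy system (total rate 10) yields a mean wait of about 0.1, while a quiet one (rate 0.5) waits about 2 between events.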

We now need a way to specify $P(\tau, e)$ and to draw values from it.

Let's think about $P(\tau, e) \, d\tau$ a little more. We're looking for a value of $\tau$ at which the next event happens, and the identity of that event. Equivalently, we could say that we want the probability that *no* event happens in the interval $[t, t + \tau]$, *and* that an $e$ event happens in the interval $[t + \tau, t + \tau + d\tau]$. The use of the word "and" here suggests that we'll be multiplying together the probabilities of the two components. We defined the probability of a particular event happening above, so we can then re-phrase $P(\tau, e) \, d\tau$ a little differently:

$$ P(\tau, e) \, d\tau = P_0(\tau) \, a_e \, d\tau $$

where $P_0(\tau)$ is the probability of no event happening in the interval $(t, t + \tau)$ and $a_e$ is the probability of an event $e$ happening in an interval $d\tau$. Since we already know the values of $a_e$ from the model parameters $\alpha$ and $\beta$ and the sizes of the appropriate loci $[SI]$ and $[I]$, we just need an expression for $P_0(\tau)$. Let $a \, d\tau' = \sum_e a_e \, d\tau'$ be the probability that *some* event happens in an interval $d\tau'$, obtained simply by summing-up the component probabilities of the different events. We then have:

$$ P_0(\tau + d\tau') = P_0(\tau) \, (1 - a \, d\tau') $$

which is the probability that no event occurs in the interval $(t, t + \tau)$ *and then* that none occurs in the following interval $d\tau'$. This is a differential equation, the solution of which is:

$$ P_0(\tau) = e^{-a \tau} $$

Substituting back into the above we therefore have:

\begin{align*} P(\tau, e) &= P_0(\tau) \, a_e \\ &= a_e \, e^{-a \tau} \end{align*}

This is our joint probability distribution for the events defined by the various values of $a_e$. These values are *rates*, not probabilities: they are defined in terms of the number of places at which each event $e$ can occur.

To conduct a simulation, we need to be able to draw a pair $(\tau, e)$ from our distribution. However, we can't simply choose $\tau$ and $e$ independently of each other: the value of $P(\tau, e)$ depends on *all* the possible events, through the presence of $a$, the sum of all the event rates, in its definition. That means that the time to the next event depends on the number of events that could occur.

In other words, $P(\tau, e)$ is a **joint probability distribution** from which we need to draw a pair. Any joint probability distribution $P(a, b)$ can be re-written as $P(a, b) = P(a) \, P(b | a)$: the prior (independent) probability of $a$ occurring multiplied by the probability of $b$ occurring *given that* $a$ has occurred. In our case,

$$ P(\tau, e) = P(\tau) \, P(e | \tau) $$

where $P(\tau)$ is the probability that *some* event will occur on the interval $(t, t + \tau)$ and $P(e | \tau)$ is the probability that this event will be of type $e$ *given that* it occurs on this interval. Clearly $P(\tau)$ is simply the sum of the probabilities for all the events that may occur,

$$ P(\tau) = \sum_{e'} P(\tau, e') $$

and therefore:

$$ P(e | \tau) = \frac{P(\tau, e)}{\sum_{e'} P(\tau, e')} $$

These two equations are both single-variable probability distributions (over $\tau$ and $e$ respectively) expressed in terms of the joint probability distribution $P(\tau, e)$, and if we substitute for $P(\tau, e)$ from above we get:

\begin{align*} P(\tau) &= \sum_e a_e e^{-a \tau} \\ &= a \, e^{-a \tau} \\ \\ P(e | \tau) &= \frac{P(\tau, e)}{\sum_{e'} P(\tau, e')} \\ &= \frac{a_e e^{-a \tau}}{a \, e^{-a \tau} } \\ &= \frac{a_e}{a} \end{align*}

Note that $P(e | \tau)$ is in this case independent of $\tau$, since the event probabilities are constants.
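As a concrete (and purely illustrative) example of this result for SIR: suppose that at some instant the network has 50 SI edges and 100 infected nodes. The per-event-type rates and the resulting event-choice probabilities are then:

```python
beta = 0.2     # per-SI-edge infection probability (illustrative value)
alpha = 0.1    # per-I-node recovery probability (illustrative value)
nSI = 50       # size of the SI-edge locus at this instant, [SI]
nI = 100       # size of the I-node locus at this instant, [I]

aI = beta * nSI     # total infection rate, beta [SI] = 10.0
aR = alpha * nI     # total recovery rate, alpha [I] = 10.0
a = aI + aR

# P(e | tau) = a_e / a, independent of tau
pInfection = aI / a
pRecovery = aR / a
```

With these numbers the two event types happen to be equally likely: the next event is an infection with probability 0.5.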

Let's briefly return to the network scenario we're interested in. The value $\tau$ is the interval of time until the next event occurs in the network, whether that is the infection of the S node attached to an SI edge or the recovery of an I node. Which of these events happens is determined by $e$. The pair $(\tau, e)$ therefore fully defines the time and identity of the next event in the simulation. It remains to see how we choose these two values, and how the network evolves in response to the selected event.

In order to make use of $P(\tau, e)$ we have to be able to draw $\tau$ and $e$ from the joint distribution. We saw above that we can do this by drawing values from $P(\tau)$ and $P(e | \tau)$ individually, with the latter distribution actually being independent of time in our current case.

It may not be obvious how to draw from such distributions, but we can manipulate the probabilities to make it possible using only a source of uniformly-distributed random numbers on the range $(0, 1)$, which Python certainly has: `numpy.random.random()`. The trick is to observe that, for any probability density function $P(a)$, the value $P(a) \, da$ represents the probability that a value drawn from the distribution will lie between $a$ and $(a + da)$. From this we can construct a cumulative distribution function,

$$ F(x_0) = \int_{-\infty}^{x_0} P(a) \, da $$

where $F(x_0)$ represents the probability that a value drawn from $P(a)$ is less than or equal to $x_0$, also denoted $P(a \le x_0)$. If we now draw a value $r$ from a uniform distribution on $(0, 1)$ we can compute $x = F^{-1}(r)$, where $F^{-1}$ is the inverse of the cumulative distribution function, and $x$ will be distributed according to $P(a)$. This means we can convert a uniformly-distributed value into a value drawn from any probability distribution for which we can construct (and invert) a cumulative distribution function.
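Here's a minimal sketch of this inverse-transform trick, using an illustrative density of our own choosing (not one from the epidemic model): $P(x) = 2x$ on $[0, 1]$, whose CDF is $F(x) = x^2$ and whose inverse is $F^{-1}(r) = \sqrt{r}$:

```python
import math
import random

# illustrative density P(x) = 2x on [0, 1]:
# CDF F(x) = x^2, inverse CDF F^-1(r) = sqrt(r)
rng = random.Random(42)

# push uniform random numbers through the inverse CDF to get
# samples distributed according to P(x)
samples = [math.sqrt(rng.random()) for _ in range(100000)]

# the mean of this distribution is the integral of x . 2x on [0, 1] = 2/3
mean = sum(samples) / len(samples)
```

The sample mean comes out close to $2/3$, as the density predicts, even though we only ever drew uniform random numbers.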

In our case we have that $P(\tau) = a \, e^{-a \tau}$. Remember that $a$ is a constant, and that intervals can't be negative. This means that

\begin{align*} F(\tau) &= \int_{-\infty}^{\tau} a \, e^{-a \tau'} \, d\tau' \\ &= \int_0^{\tau} a \, e^{-a \tau'} \, d\tau' \\ &= -e^{-a \tau'} \, \bigg|_0^\tau \\ &= -e^{-a \tau} -(-e^0) \\ &= 1 - e^{-a \tau} \end{align*}

This is an awkward expression to manipulate, but we can observe that, if a number $r_1$ is uniformly distributed on $(0, 1)$, then so by definition is $1 - r_1$, so if we set $F(\tau) = 1 - r_1$ we can cancel out the constant 1s and get a simpler expression overall. We then have:

\begin{align*} 1 - r_1 &= F(\tau) \\ &= 1 - e^{-a \tau} \\ r_1 &= e^{-a \tau} \\ &= \frac{1}{e^{a \tau}} \\ e^{a \tau} &= \frac{1}{r_1} \\ a \tau &= \ln \frac{1}{r_1} \\ \tau &= \frac{1}{a} \, \ln \frac{1}{r_1} \end{align*}

The discrete case works similarly. If we draw a value $r_2$ on $(0, 1)$, then the value of $e$ we require is given by $\sum_{e' = 0}^{e - 1} a_{e'} \leq r_2 a < \sum_{e' = 0}^{e} a_{e'}$: the smallest $e$ such that the cumulative sum of the $a_{e'}$ up to and including $e$ exceeds $r_2 a$.

The upshot of all this probability theory is that we can choose a time to the next event $\tau$ and the identity of the next event $e$ from the distribution induced by the individual event probabilities and the size of the loci for the various events in the network, by drawing two uniformly-distributed numbers and performing two simple calculations [Gil76].
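Putting the two calculations together, a single Gillespie draw can be sketched as follows. This is a standalone illustration (not `epydemic`'s implementation) that takes a plain list of per-event rates $a_e$ and returns the pair $(\tau, e)$:

```python
import math
import random

def gillespie_draw(rates, rng):
    # rates is a list of per-event rates a_e; returns (tau, e), the time
    # to the next event and the index of the event that occurs
    a = sum(rates)
    if a == 0.0:
        return (None, None)    # no event can occur: the process is at equilibrium

    # time to the next event: tau = (1/a) ln(1/r1)
    r1 = rng.random()
    tau = (1.0 / a) * math.log(1.0 / r1)

    # identity of the next event: the smallest e whose cumulative rate exceeds r2 a
    xc = rng.random() * a
    xs = 0.0
    for (e, ae) in enumerate(rates):
        xs = xs + ae
        if xs > xc:
            return (tau, e)

    # guard against floating-point round-off at the boundary
    return (tau, len(rates) - 1)

rng = random.Random(42)
(tau, e) = gillespie_draw([10.0, 10.0], rng)    # two equally-likely event types
(tau2, e2) = gillespie_draw([0.0, 5.0], rng)    # only the second event is possible
```

Note how an event with zero rate can never be selected, and how a system with no non-zero rates at all signals equilibrium – exactly the termination condition the simulation loop below will use.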

We can now encode this approach as a new simulation dynamics, sub-classing `epydemic.Dynamics`, exactly as we previously did for discrete-time synchronous dynamics.

In [2]:

```
import cncp
import networkx
import math
import numpy
import pickle
import epyc
import epydemic
import pandas as pd
import matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
import seaborn
```

We'll add this new dynamics to `epydemic`'s simulation class hierarchy, as outlined in red below:

We'll start with the stochastic dynamics class itself:

In [3]:

```
class StochasticDynamics(epydemic.Dynamics):
    '''A dynamics that runs stochastically in :term:`continuous time`. This is a
    very efficient and statistically exact approach, but requires that the
    statistical properties of the events making up the process are known.'''

    def __init__( self, g = None ):
        '''Create a dynamics, optionally initialised to run on the given network.

        :param g: prototype network to run the dynamics over (optional)'''
        super(StochasticDynamics, self).__init__(g)

    def eventRateDistribution( self, t ):
        '''Return the event distribution, a sequence of (l, r, f) triples
        where l is the locus where the event occurs, r is the rate at
        which an event occurs, and f is the event function called to
        make it happen.

        Note that it's a rate we want, not a probability: the former can be
        obtained from the latter simply by multiplying the event probability
        by the number of times it's possible in the current network, which
        is the population of nodes or edges in a given state.

        It is perfectly fine for an event to have a zero rate. The process
        is assumed to have reached equilibrium if all events have zero rates.

        :param t: current time
        :returns: the event rate distribution'''
        dist = self.eventDistribution(t)
        return map((lambda (l, p, f): (l, p * len(l), f)), dist)

    def do( self, params ):
        '''Run the simulation using Gillespie dynamics. The process terminates
        when either there are no events with non-zero rates or when
        :meth:`at_equilibrium` returns True.

        :param params: the experimental parameters
        :returns: the experimental results dict'''
        # run the dynamics
        g = self.network()
        t = 0
        events = 0
        while not self.at_equilibrium(t):
            # pull the transition dynamics at this timestep
            transitions = self.eventRateDistribution(t)

            # compute the total rate of transitions for the entire network
            a = 0.0
            for (_, r, _) in transitions:
                a = a + r
            if a == 0:
                break              # no events with non-zero rates

            # calculate the timestep delta
            r1 = numpy.random.random()
            dt = (1.0 / a) * math.log(1.0 / r1)

            # calculate which event happens
            if len(transitions) == 1:
                # if there's only one, that's the one that happens
                (l, _, ef) = transitions[0]
            else:
                # otherwise, choose one at random based on the rates
                r2 = numpy.random.random()
                xc = r2 * a
                k = 0
                (l, xs, ef) = transitions[k]
                while xs < xc:
                    k = k + 1
                    (l, xsp, ef) = transitions[k]
                    xs = xs + xsp

            # increment the time
            t = t + dt

            # draw a random element from the chosen locus
            e = l.draw()

            # perform the event by calling the event function,
            # passing the dynamics, event time, network, and element
            ef(self, t, g, e)

            # increment the event counter
            events = events + 1

        # run any events posted for before the maximum simulation time
        self.runPendingEvents(self._maxTime)

        # add some more metadata
        (self.metadata())[self.TIME] = t
        (self.metadata())[self.EVENTS] = events

        # report results
        rc = self.experimentalResults()
        return rc
```

We need an event rate distribution rather than an event probability distribution, so we provide a method `eventRateDistribution()` that takes the probability distribution returned by `eventDistribution()` and, for each event, multiplies the probability of that event happening by the number of places at which the event can happen.
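To illustrate the conversion independently of `epydemic`'s classes – modelling loci as plain lists, with event names standing in for event functions – multiplying each event probability by the size of its locus turns the probability distribution into a rate distribution:

```python
# loci modelled as plain lists: the places where each event can occur
si_edges = [(1, 2), (3, 4), (5, 6)]    # an SI-edge locus of size 3
i_nodes = [2, 7]                       # an I-node locus of size 2

pInfect = 0.2
pRemove = 0.1

# an event (probability) distribution: (locus, probability, event name) triples
dist = [(si_edges, pInfect, 'infect'),
        (i_nodes, pRemove, 'remove')]

# the corresponding rate distribution: probability times locus size
rateDist = [(l, p * len(l), f) for (l, p, f) in dist]
```

The infection rate comes out as $0.2 \times 3 = 0.6$ and the recovery rate as $0.1 \times 2 = 0.2$: rates grow with the population of places where an event can fire.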

The important part of the class is the `do()` method, which implements the mechanism for drawing the $(\tau, e)$ pair as described above. In the code, `dt` is the interval to the next event ($\tau$), while `xc` is used to choose the event that occurs.

Then we need to bridge between this general framework and compartmented models, just as before:

In [4]:

```
class CompartmentedStochasticDynamics(StochasticDynamics):
    '''A :term:`stochastic dynamics` running a compartmented model. The
    behaviour of the simulation is completely described within the model
    rather than here.'''

    def __init__( self, m, g = None ):
        '''Create a dynamics over the given process model, optionally
        initialised to run on the given network.

        :param m: the compartmented model for the disease process
        :param g: prototype network to run the dynamics over (optional)'''
        super(CompartmentedStochasticDynamics, self).__init__(g)
        self._model = m

    def setUp( self, params ):
        '''Set up the experiment for a run. This performs the default action
        of copying the prototype network and then builds the model and
        uses it to initialise the nodes into the various compartments
        according to the parameters.

        :params params: the experimental parameters'''
        # perform the default setup
        super(CompartmentedStochasticDynamics, self).setUp(params)

        # build the model
        self._model.reset()
        self._model.build(params)

        # initialise the network from the model
        g = self.network()
        self._model.setUp(self, g, params)

    def eventDistribution( self, t ):
        '''Return the model's event distribution.

        :param t: current time
        :returns: the event distribution'''
        return self._model.eventDistribution(t)

    def experimentalResults( self ):
        '''Report the model's experimental results.

        :returns: the results as seen by the model'''
        return self._model.results(self.network())
```

We can now take the same parameters as we used in the synchronous case:

In [5]:

```
# ER network parameters
N = 5000
kmean = 5
pEdge = (kmean + 0.0) / N
# SIR parameters
pInfected = 0.01
pInfect = 0.2
pRemove = 0.1
# create a parameters dict containing the disease parameters we want
params = dict()
params[epydemic.SIR.P_INFECTED] = pInfected
params[epydemic.SIR.P_INFECT] = pInfect
params[epydemic.SIR.P_REMOVE] = pRemove
```

Plugging these parameters into our new simulation class, we get:

In [6]:

```
g = networkx.erdos_renyi_graph(N, pEdge)
m = epydemic.SIR()
sim = CompartmentedStochasticDynamics(m, g)
sto = sim.set(params).run()

with open('sto.pickle', 'wb') as handle:
    pickle.dump(sto, handle)
```

In [7]:

```
print "Epidemic covered {percent:.2f}% of the network".format(percent = ((sto['results']['compartments']['R'] + 0.0)/ N) * 100)
```

The two simulations are *supposed* to behave the same, for a suitably stochastic definition of "the same".

We can of course dig into the results in more detail. There are a lot of potentially interesting things to explore, and we'll just pick two of the most important: is one method faster than the other? And do they look like they generate a similar train of events?

First we load both datasets:

In [8]:

```
with open('sync.pickle', 'rb') as handle:
    syn = pickle.load(handle)
with open('sto.pickle', 'rb') as handle:
    sto = pickle.load(handle)
```

In [9]:

```
print "Elapsed simulation times:"
print "Synchronous {elapsed:.2f}s".format(elapsed = syn[epyc.Experiment.METADATA]['elapsed_time'])
print "Stochastic {elapsed:.2f}s".format(elapsed = sto[epyc.Experiment.METADATA]['elapsed_time'])
```

But a performance benefit is only useful if the results are correct: there's no point in doing the wrong things faster, after all. So we need to convince ourselves that, at the very least, the two simulations conducted for the same parameters produce plausibly comparable results – even while we accept that statistical variations might occur.

We can start by looking at the populations of the different compartments at equilibrium:

In [10]:

```
print "Node type sub-populations:"
print "Synchronous:", syn[epyc.Experiment.RESULTS]['compartments']
print "Stochastic:", sto[epyc.Experiment.RESULTS]['compartments']
```

Although we are using two different simulation techniques, we claim that they are "the same" in the sense of simulating the same process dynamics. One way to test this is to look at the distance between successive events. If the events are happening with similar distributions, we would expect the inter-event time distributions to be similar too.

To do this we need to capture when (in simulation time) each event occurs. We can do this quite simply, either by extending the simulation dynamics classes, or – more straightforwardly – by defining a new compartmented model whose results include the simulation times for events:

In [11]:

```
class SIR_EventDistribution(epydemic.SIR):
    '''An SIR model that also captures the times of all events.'''

    def __init__( self ):
        super(SIR_EventDistribution, self).__init__()

        # create a place to store the sequence of event times
        self._eventDistribution = []

    def reset( self ):
        super(SIR_EventDistribution, self).reset()
        self._eventDistribution = []

    def results( self, g ):
        rc = super(SIR_EventDistribution, self).results(g)

        # add the event times to the results
        rc['event_times'] = self._eventDistribution
        return rc

    def infect( self, dyn, t, g, (n, m) ):
        # perform the base event
        super(SIR_EventDistribution, self).infect(dyn, t, g, (n, m))

        # record the event time
        self._eventDistribution.append(t)

    def remove( self, dyn, t, g, n ):
        # perform the base event
        super(SIR_EventDistribution, self).remove(dyn, t, g, n)

        # record the event time
        self._eventDistribution.append(t)
```

In [31]:

```
# epidemic parameters
params = dict()
params[epydemic.SIR.P_INFECTED] = pInfected
params[epydemic.SIR.P_INFECT] = 0.05
params[epydemic.SIR.P_REMOVE] = 0.01
m = SIR_EventDistribution()
# run process over a larger ER network
g = networkx.erdos_renyi_graph(30000, 5.0 / 30000)
# synchronous dynamics
sim = epydemic.CompartmentedSynchronousDynamics(m, g)
syn_res = sim.set(params).run()
syn_events = syn_res[epyc.Experiment.RESULTS]['event_times']
# stochastic dynamics
sim = CompartmentedStochasticDynamics(m, g)
sto_res = sim.set(params).run()
sto_events = sto_res[epyc.Experiment.RESULTS]['event_times']
```

In [40]:

```
fig = plt.figure(figsize = (8, 5))
plt.title('Distribution of inter-event times')
plt.xlabel('Inter-event time')
plt.ylabel('$\log(\mathrm{events})$')

# work out inter-event times
l = 0
syn_inter = []
for i in xrange(1, len(syn_events) - 1):
    syn_inter.append(syn_events[i] - l)
    l = syn_events[i]
sto_inter = []
l = 0
for i in xrange(1, len(sto_events) - 1):
    sto_inter.append(sto_events[i] - l)
    l = sto_events[i]

# plot the histogram of the distribution
plt.hist([sto_inter, syn_inter],
         bins = range(10),
         log = True,
         label = ['stochastic', 'synchronous'])
plt.legend()
_ = plt.show()
```

The two distributions are *similar*, both dropping off exponentially as we'd expect. They don't follow exactly the same distribution, but that could just be the result of the stochastic nature of the process: we ran the two dynamics over the same network, but from different initial (random) seedings of nodes. Or it could be because the synchronous approach is less exact because of interactions between events. If we wanted a closer look, we'd have to perform some repetitions to see whether we got different results repeatedly or whether things evened out – but that's something for another time.

[CGP06] Ying Cao, Daniel Gillespie and Linda Petzold. Efficient step size selection for the tau-leaping simulation method. Journal of Chemical Physics **124**. 2006.

[Gil76] Daniel Gillespie. A general method for numerically simulating the stochastic time evolution of coupled chemical reactions. Journal of Computational Physics **22**, pages 403–434. 1976.

[Gil77] Daniel Gillespie. Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry **81**(25), pages 2340–2361. 1977.


In this chapter we'll build a discrete-time synchronous simulation of an epidemic using the `epydemic` library, making use of the compartmented model we coded earlier. And we'll discuss some of the advantages of this approach – but also its limitations, which lead us into continuous-time simulation of the same model.

Recall from our earlier discussion that discrete-event simulators have to make three key decisions:

- *when* (in simulation time) does the next event occur?,
- *where* in the network does it occur?, and
- *which* event is it that occurs?

A discrete-time simulation performs these decisions in a simulation loop that looks roughly as follows. At each timestep, the simulation collects all the places in which an event *might* occur (the "where" question). It then, for each of these places, decides *whether* the event occurs or not ("when") and, if it decides that it does, executes the event ("which"). It then moves to the next moment and repeats. Executing an event will typically change the places where future events can occur.

A discrete-time simulation is sometimes referred to as a **synchronous** simulation, because all the events in a given moment are performed in a batch.
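The loop just described can be sketched in a few lines. This is a standalone illustration, not `epydemic`'s actual code: loci are modelled as plain lists, and each event function simply receives the chosen element:

```python
import random

def synchronous_step(dist, rng):
    # one discrete timestep: for each (locus, probability, event function)
    # triple, test every place in the locus independently and fire the
    # event function on the places where the test succeeds
    nev = 0
    for (l, p, ef) in dist:
        if p > 0.0:
            # iterate over a copy, since firing an event may change the locus
            for e in list(l):
                if rng.random() <= p:
                    ef(e)
                    nev = nev + 1
    return nev

# toy usage: a certain event (p = 1) over a three-element locus
log = []
n = synchronous_step([([1, 2, 3], 1.0, log.append)], random.Random(42))
```

Note that every place is tested at every timestep, whether or not anything happens there: this per-timestep sweep is exactly the cost that the Gillespie approach avoids.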

Let's now build the code we need to create a synchronous simulation of an epidemic. We'll be making use of the `epydemic` library, and specifically its descriptions of compartmented disease models. Before we do that, however, we need to construct a general simulation framework that we can then specialise to perform the functions we need.

`epydemic` represents synchronous simulation using a small class hierarchy, and in this chapter we'll fill-out the part outlined in red in the following UML diagram:

(Actually what we'll describe is a slightly simpler version of `epydemic`, for ease of explanation. But it captures all the main points, and we'll come back to the code when we need the more advanced features.)

The decomposition of the three classes is as follows. `epydemic.Dynamics` defines the basic functionality of a discrete-event simulation, mainly concerning the way we get events to execute. `epydemic.SynchronousDynamics` specialises this framework to run in synchronous time, collecting together all the events for a given timestep, but without specifying exactly where the events come from. `epydemic.CompartmentedSynchronousDynamics` then binds the source of events to a compartmented model. (We describe *why* we do it this way below.)

In [1]:

```
import networkx
import epydemic
import epyc
import math
import numpy
import pickle
from copy import copy
import pandas as pd
import matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
import seaborn
```

Let's begin with the basic discrete-event dynamics:

In [2]:

```
class Dynamics(epyc.Experiment, object):
    '''A dynamical process over a network. This is the abstract base class
    for implementing different kinds of dynamics as computational experiments
    suitable for running under ``epyc``. Sub-classes provide synchronous and
    stochastic (Gillespie) simulation dynamics.'''

    # Additional metadata elements
    TIME = 'simulation_time'        #: Metadata element holding the logical simulation end-time.
    EVENTS = 'simulation_events'    #: Metadata element holding the number of events that happened.

    # the default maximum simulation time
    DEFAULT_MAX_TIME = 20000        #: Default maximum simulation time.

    def __init__( self, g = None ):
        '''Create a dynamics, optionally initialised to run on the given network.
        The network (if provided) is treated as a prototype that is copied before
        each individual simulation experiment.

        :param g: prototype network (optional)'''
        super(Dynamics, self).__init__()
        self._graphPrototype = g                 # prototype copied for each run
        self._graph = None                       # working copy of prototype
        self._maxTime = self.DEFAULT_MAX_TIME    # time allowed until equilibrium

    def network( self ):
        '''Return the network this dynamics is running over.

        :returns: the network'''
        return self._graph

    def setNetworkPrototype( self, g ):
        '''Set the network the dynamics will run over. This will be
        copied for each run of an individual experiment.

        :param g: the network'''
        self._graphPrototype = g

    def setMaximumTime( self, t ):
        '''Set the maximum default simulation time. The default is given
        by :attr:`DEFAULT_MAX_TIME`.

        :param t: the maximum time'''
        self._maxTime = t

    def at_equilibrium( self, t ):
        '''Test whether the model is at equilibrium. Override this method to provide
        alternative and/or faster simulations.

        :param t: the current simulation timestep
        :returns: True if we're done'''
        return (t >= self._maxTime)

    def setUp( self, params ):
        '''Before each experiment, create a working copy of the prototype network.

        :param params: parameters of the experiment'''
        # perform the default setup
        super(Dynamics, self).setUp(params)

        # make a copy of the network prototype
        self._graph = self._graphPrototype.copy()

    def tearDown( self ):
        '''At the end of each experiment, throw away the copy.'''
        # perform the default tear-down
        super(Dynamics, self).tearDown()

        # throw away the worked-on model
        self._graph = None

    def eventDistribution( self, t ):
        '''Return the event distribution, a sequence of (l, p, f) triples
        where l is the :term:`locus` of the event, p is the probability of an
        event occurring, and f is the :term:`event function` called to make it
        happen. This method must be overridden in sub-classes.

        It is perfectly fine for an event to have a zero probability.

        :param t: current time
        :returns: the event distribution'''
        raise NotImplementedError('eventDistribution()')
```

We make the dynamics class a sub-class of `epyc.Experiment`. We haven't discussed `epyc` yet – and there's no need to right now – but it provides functions for running lots of repetitions of simulations with a single command. We'll make extensive use of this later when we scale-up simulations.

An epidemic simulation takes place over a network. We can provide a network either to the constructor or by calling `setNetworkPrototype()`. This network is referred to as the *prototype* network. Every time we run the simulation, the prototype is copied into a *working* network that we then run the epidemic process over. This means we can repeatedly use the *same* network for *different* instances of the *same* process. The `setUp()` and `tearDown()` methods create and destroy the working copy.

We need to know when we should stop the simulation, and the most general answer to this is to have a maximum simulation time: that way we know we'll stop at some point. `setMaximumTime()` can be used to change this from the default value of 20000 timesteps; `at_equilibrium()` returns true if we have exceeded that time. Clearly we will often be able to do a better job of deciding whether a simulation has ended, in which case we should override this method.

Finally, we need a source of events. We get these in terms of a probability distribution that consists of a list of triples, each comprising a list of places where an event can occur in the network, the probability of that event happening at any given place, and the event function that we call when the event occurs. The `eventDistribution()` method returns the distribution for the given time, and for the moment is left undefined.

We should note what else this class *doesn't* provide: any way of actually selecting and executing events drawn from the distribution. For that we need to define a specific dynamics.

Any simulation dynamics has to answer the three questions we posed earlier: *when* does an event happen?, *where* in the network?, and *which* action is taken? Synchronous dynamics has simple answers to these questions. At each discrete timestep (*when*) it looks for all the places in the network where an event *could* occur (*where*), and chooses whether or not an event occurs at each place according to the probabilities given to the events by the probability distribution (*which*).

Providing this dynamics is simply a matter of turning this into code:

In [3]:

```
class SynchronousDynamics(Dynamics):
    '''A dynamics that runs synchronously in discrete time, applying local
    rules to each node in the network. These are simple to understand and
    simple to code for many cases, but can be statistically inexact and slow
    for large systems.'''

    # additional metadata
    TIMESTEPS_WITH_EVENTS = 'timesteps_with_events'   #: Metadata element holding the number of timesteps that actually had events occur within them

    def __init__( self, g = None ):
        '''Create a dynamics, optionally initialised to run on the given prototype
        network.

        :param g: prototype network to run over (optional)'''
        super(SynchronousDynamics, self).__init__(g)

    def do( self, params ):
        '''Synchronous dynamics.

        :param params: the parameters of the simulation
        :returns: a dict of experimental results'''
        # run the dynamics
        g = self.network()
        t = 0
        events = 0
        timestepEvents = 0
        while not self.at_equilibrium(t):
            # retrieve all the events, their loci, probabilities, and event functions
            dist = self.eventDistribution(t)

            # run through all the events in the distribution
            nev = 0
            for (l, p, ef) in dist:
                if p > 0.0:
                    # run through every possible element on which this event may occur
                    for e in copy(l.elements()):
                        # test for occurrence of the event on this element
                        if numpy.random.random() <= p:
                            # yes, perform the event
                            ef(self, t, g, e)

                            # update the event count
                            nev = nev + 1

            # add the events to the count
            events = events + nev
            if nev > 0:
                # we had events happen in this timestep
                timestepEvents = timestepEvents + 1

            # advance to the next timestep
            t = t + 1

        # add some more metadata
        (self.metadata())[self.TIME] = t
        (self.metadata())[self.EVENTS] = events
        (self.metadata())[self.TIMESTEPS_WITH_EVENTS] = timestepEvents

        # report results
        rc = self.experimentalResults()
        return rc
```

That's it! – one method called `do()`

that codes-up the simulation loop. While the simulation is not at equilibrium (as defined by the `at_equilibrium()`

method inherited from `Dynamics`

) we retrieve the event distribution. For each entry we run through all the possible places for an event and select randomly whether the event actually happens. We do this by using the `numpy.random.random()`

function, which returns a random number uniformly distributed over the range $[0, 1]$. If this random number is less than the probability associated with the event, then we "fire" the event by calling the associated event function, passing it the dynamics, the current simulation time, the network over which the process is running, and the place where the event occurs (a node or an edge in the network). We keep track of the number of events we fire, and also of the number of timesteps in which events are fired, which we'll use later when we think about the efficiency of this kind of simulation.
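We can sanity-check this per-element test in isolation: over many draws, comparing a uniform random number against a probability p fires the event in roughly a fraction p of the trials. (A standalone check, not part of the simulator.)

```python
import numpy

# Standalone check of the per-element test used in do(): an event with
# probability p should fire in about a fraction p of independent trials.
numpy.random.seed(1)
p = 0.2
n = 100000
fires = (numpy.random.random(n) <= p).sum()
print(fires / float(n))    # close to 0.2
```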

At the end of `do()`

we package up a short summary of the experiment as **metadata**: data about the way the simulation occurred. We store this in a dict that we inherit from `epyc.Experiment`

, accessed by the `metadata()`

method. Finally we return our `experimentalResults()`

, which is another method inherited from `epyc.Experiment`

that we'll come back to in a moment.

We're still missing some details, though: `SynchronousDynamics`

doesn't give us an event distribution, and doesn't give us any events.

In [4]:

```
class CompartmentedSynchronousDynamics(SynchronousDynamics):
    '''A :term:`synchronous dynamics` running a compartmented model. The
    behaviour of the simulation is completely described within the model
    rather than here.'''

    def __init__( self, m, g = None ):
        '''Create a dynamics over the given disease model, optionally
        initialised to run on the given prototype network.

        :param m: the model
        :param g: prototype network to run over (optional)'''
        super(CompartmentedSynchronousDynamics, self).__init__(g)
        self._model = m

    def setUp( self, params ):
        '''Set up the experiment for a run. This performs the default action
        of copying the prototype network and then builds the model and
        uses it to initialise the nodes into the various compartments
        according to the parameters.

        :param params: the experimental parameters'''

        # perform the default setup
        super(CompartmentedSynchronousDynamics, self).setUp(params)

        # build the model
        self._model.reset()
        self._model.build(params)

        # initialise the network from the model
        g = self.network()
        self._model.setUp(self, g, params)

    def eventDistribution( self, t ):
        '''Return the model's event distribution.

        :param t: current time
        :returns: the event distribution'''
        return self._model.eventDistribution(t)

    def experimentalResults( self ):
        '''Report the model's experimental results.

        :returns: the results as seen by the model'''
        return self._model.results(self.network())
```

The `setUp()`

method does the standard behaviour of building a copy of the network prototype, and then resets and builds the model and passes the working network to the model's `setUp()`

method. `eventDistribution()`

returns what the model says is the event distribution, which will also include implementations of the events. Finally, `experimentalResults()`

returns a dict containing what the model defines as the important results of running that particular model.

That's quite a lot of code, so let's pause and assess what we've built.

First of all we defined the basic structures of an epidemic process on a network: basically the ability to generate a working copy of a network several times, some definition of termination, and an abstract method for getting the event distribution. We then specialised this to provide a synchronous discrete-time simulation dynamics that takes the distribution and applies it to all possible places where events could occur, according to their probabilities. Rather than then specifying the event distributions and events by sub-classing, we instead bound the missing elements to an object defining a compartmented model of disease, allowing that to provide the details.

Why this way? – why not just sub-class `SynchronousDynamics`

to provide, for example, the events of SIR and their distribution? The answer is that SIR is a process that can run on several *different* simulation regimes as well as this one, notably the stochastic dynamics we'll look at later. If we defined SIR by sub-classing `SynchronousDynamics`

, we'd then need to re-define it if we introduced another simulation dynamics: two definitions of the same process, which is an invitation to mistakes.

It's far better to define a single process in a single class and then re-use it, and this is what we've done in defining the `CompartmentedModel`

class and sub-classing it to define SIR. This makes the simulation framework easier to use, but trickier to implement: the astute reader will have noticed that we didn't explain how `CompartmentedModel`

works inside, and that's because it's a bit complicated. But it's also largely irrelevant in practice: you don't need to know how this particular piece of code works in order to use it for network science experiments. (If you're interested, you can look at the code in `epydemic`

's github repo. But don't say you weren't warned.)

The message here is not that some simulation code is complicated, but rather that it's possible to *localise* that complexity where it can't do any harm. This keeps the user interface simpler and also means that we can now concentrate on the epidemics, not the code we use to simulate them.

Finally, at long last, let's run some code.

We have a compartmented model of SIR, and a synchronous discrete-time simulation framework, so let's run the former in the latter. We first need to define the parameters of our simulation, and for this experiment we'll use a small-ish ER network and some fairly nondescript SIR parameters:

In [5]:

```
# ER network parameters
N = 5000
kmean = 5
pEdge = (kmean + 0.0) / N
# SIR parameters
pInfected = 0.01
pInfect = 0.2
pRemove = 0.1
```

We can then create the network and the model, and bind them together with the simulation dynamics:

In [6]:

```
g = networkx.erdos_renyi_graph(N, pEdge)
m = epydemic.SIR()
sim = CompartmentedSynchronousDynamics(m, g)
```

In [7]:

```
import pickle

# create a parameters dict containing the disease parameters we want
params = dict()
params[epydemic.SIR.P_INFECTED] = pInfected
params[epydemic.SIR.P_INFECT] = pInfect
params[epydemic.SIR.P_REMOVE] = pRemove

# run the simulation
sync = sim.set(params).run()

# save the results for later
with open('sync.pickle', 'wb') as handle:
    pickle.dump(sync, handle)
```

Running the simulation returns a **results dict**. It's structured in a very particular way, with three top-level keys:

In [8]:

```
sync.keys()
```

Out[8]:

The `results`

key contains a dict of the experimental results that the simulation returned: its "real" results, if you like:

In [9]:

```
sync['results']
```

Out[9]:

In this case the results are a dict of compartments and their sizes, and a dict of loci and their sizes. We can see that in this case there are no infected nodes left, and therefore no SI edges – and therefore no way the simulation can infect any more nodes.
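The exact numbers differ from run to run, but the shape of the value is roughly as follows. (The sizes below are purely illustrative, and the `'loci'` key name is an assumption; the `'compartments'` key appears in the code later in this section.)

```python
# Purely illustrative: the shape of sync['results'] as described above,
# with made-up compartment sizes for an N = 5000 run. The 'loci' key
# name is an assumption for illustration.
results = { 'compartments': { 'S': 1423, 'I': 0, 'R': 3577 },
            'loci':         { 'I': 0, 'SI': 0 } }

# the epidemic is over: no infected nodes, so no SI edges to infect along
assert results['compartments']['I'] == 0
assert results['loci']['SI'] == 0
```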

The `parameters`

key contains a dict of the parameters we passed to the simulation:

In [10]:

```
sync['parameters']
```

Out[10]:

So we have the experimental results and the simulation parameters that gave rise to them immediately to hand. Note that this isn't *quite* all the information we might need, as it doesn't include the size or link probability of the underlying network prototype we passed to the simulation.

Finally, the `metadata`

key contains a dict of useful information about how the simulation progressed:

In [11]:

```
sync['metadata']
```

Out[11]:

These values might be important in assessing how the simulation worked. For the time being, let's just draw attention to the difference between two values: the overall simulation time (20000 timesteps, the default), and the number of timesteps in which events actually occurred. The former is *way* larger than the latter, suggesting that the simulation did an awful lot of ... well, nothing.

We can easily check whether we had an epidemic by checking the size of the largest outbreak, which in the case of an epidemic should scale linearly with `N`

, the size of the network:

In [12]:

```
print "Epidemic covered {percent:.2f}% of the network".format(percent = ((sync['results']['compartments']['R'] + 0.0)/ N) * 100)
```

The synchronous dynamics we encoded above works by evaluating the process dynamics at each discrete timestep. This is an obvious approach, but one that raises two questions: how expensive is it to evaluate the dynamics at each step?; and, what proportion of timesteps do we evaluate the dynamics with no effect, because nothing changes?

To answer the first question we can look at the `do()`

method on `SynchronousDynamics`

. At each timestep it retrieves all the places where an event might occur, which we know from our definition of SIR is any SI edge (for infection events) and any infected node (for removal events). For each place, it draws a random number and then possibly calls an event function. The amount of work therefore depends on the sizes of the two loci for events, which will presumably swell as the epidemic progresses: we might assume that in an average timestep about half the nodes are infected, and some smaller proportion of the edges are SI: we can't say much more without a lot more information about the structure of the network. The loci change as events occur, which means that `CompartmentedModel`

will have to ensure that it can efficiently track these changes (and indeed a lot of the code complexity addresses exactly this).
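One way to see why this tracking can be made efficient: a locus only needs constant-time insertion, constant-time removal, and uniform random draws. A standard trick (a list paired with an index dict) gives all three. This is a sketch of the idea, not epydemic's actual implementation:

```python
import random

# A sketch of how a locus might track its elements efficiently: a list
# paired with an index dict gives O(1) add, discard, and uniform random
# draw. An illustration of the idea, not epydemic's actual code.
class Locus(object):
    def __init__(self):
        self._elements = []   # elements in arbitrary order
        self._index = {}      # element -> its position in the list

    def add(self, e):
        '''Add an element in constant time.'''
        if e not in self._index:
            self._index[e] = len(self._elements)
            self._elements.append(e)

    def discard(self, e):
        '''Remove an element in constant time by swapping in the last one.'''
        i = self._index.pop(e, None)
        if i is not None:
            last = self._elements.pop()
            if i < len(self._elements):
                self._elements[i] = last
                self._index[last] = i

    def draw(self):
        '''Choose an element uniformly at random.'''
        return random.choice(self._elements)

    def __len__(self):
        return len(self._elements)
```

Infection and removal events then become cheap updates to two such loci, rather than rescans of the whole network.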

We alluded earlier to the answer to the second question. The result dict includes metadata that defines the number of timesteps and the number in which at least one event actually occurred. We can use these to determine the percentage of timesteps in which anything actually happened – and therefore calculate the "wasted" timesteps:

In [13]:

```
print "Of {n} cycles simulated, {no} ({percent:.2f}%) had no events ".format(n = sync['metadata']['simulation_time'],
no = sync['metadata']['simulation_time'] - sync['metadata']['timesteps_with_events'],
percent = (sync['metadata']['simulation_time'] - sync['metadata']['timesteps_with_events']) / (0.0 + sync['metadata']['simulation_time']) * 100)
```

A slightly more significant problem is one of statistical exactness: the extent to which the simulation actually performs according to the probabilities. We won't dig into this in too much detail, but the basic problem is simple to explain. In the `do()`

method, for each possible event, we collect the possible places the event can happen and then decide whether the event actually happens there. There's a hidden assumption here that all these choices are independent of one another, but that's not quite the case. For example, if two infected nodes are connected to the same susceptible node – so there are two SI edges in the locus for infection events – then we have two chances to infect the susceptible node in the same timestep. If the first happens to result in infection then the second can't (by definition), which makes the actual rate of infections in a timestep vary just slightly from the expected value. Similarly, we may happen to run the removal events before the infection events, and so nodes infected in the timestep don't have any possibility of recovering in that same timestep – even if the probability of recovery were set very high.
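A quick calculation shows the size of the first effect. With two SI edges into the same susceptible node, each tested independently with probability $p$, the node becomes infected with probability $1 - (1-p)^2$, slightly less than the $2p$ that naively counting "two chances" would suggest:

```python
# Two SI edges into one susceptible node: the node's per-timestep
# infection probability is 1 - (1 - p)**2, not 2*p.
p = 0.2
naive = 2 * p                 # counting each edge as a separate chance
actual = 1 - (1 - p) ** 2     # at least one of the two trials succeeds
print(naive)                  # 0.4
print(actual)                 # about 0.36
```

The gap grows with $p$ and with the number of SI edges meeting at the node, which is why the distortion depends on the parameters and the network structure.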

If these sound like trivial issues, well they may well be. But they may *not* be, depending on the exact combinations of parameters and network structures we encounter. That's a risk it'd be better not to take, as it introduces patterns into the simulation results that aren't there in the model descriptions, or indeed in any mathematical analysis we might make of them.

It might be that we're willing to accept these issues in the interests of simplicity: synchronous simulation is very easy to program and understand. But both the performance and the statistical exactness issues are caused by the same basic decision to use discrete time, and it turns out that we can address both by using a different simulation dynamics, one that works directly from the probability distributions in continuous time.

(This is a chapter from Complex networks, complex processes.)

Simulation is an enormous topic in computer science, with a long and distinguished history. It's easy to see why it's so important: whenever we use computers to study natural processes (or indeed man-made or engineering processes) we're taking a physical system, abstracting it into a computer model, and then building software that runs the model *as if* it were the real system running in the real world.

The process of model abstraction – of which compartmented models of disease are a prime example – is a process of simplification, of leaving out details in order to get to the essentials of the process we're interested in. This reduction of detail is sometimes criticised by those outside the scientific community: if you leave out the details, how do you know that your model is really saying anything about the real-world phenomenon? And that's a fair point. But simplification is essential if we're to understand the core behaviour of processes and not be distracted by all the details.

How do we know if a model says anything meaningful? We need to **verify** and **validate** it – two software engineering concepts that are sometimes summarised as "did we build it right?" and "did we build the right thing?".

By **verification** we mean examining the model to ensure that, to the best of our ability, the mathematics and code are faithful to the way we think the process operates. This examination might take any number of forms, from inspection of the code and maths by others, through the development of test suites to exercise the code and check it against known situations, to the use of the mathematically-based techniques of computer science formal methods. It's easy to get code wrong, and incorrect code tells us less than nothing about the phenomena we're interested in: never skimp on debugging, and never assume things are finally working completely correctly.

By **validation** we mean deciding whether the model does actually reflect the real world. This might take the form of creating a simple real-world experiment and performing it physically as well as in simulation, to see if the results match. Of course they never will match exactly, because in the process of simplification we'll have removed some of the details that affect the physical process. In simulation, a pendulum on a friction-free mount will swing forever; in reality, it never will, because the mount will never actually *be* friction-free.

Since simulation has so much history, it's unsurprising that there is a myriad of approaches to conducting simulations. Each choice has subtly different implications for the experiments and results obtained – often only really understood by those who've spent a lifetime with the given techniques.

In network science the simulations we typically use fall under the broad rubric of **discrete-event simulation**. What this means is that, to a simulator, the world is treated as a sequence of individually-identifiable events that happen in a sequence through time. In the case of disease models, the events are individual nodes being infected or recovering: individual, discrete "happenings" described individually and executed independently. Of course one event affects subsequent ones – you can't recover if you've never been infected in the first place – but that's about the *possible sequences* of events that can occur, not a relationship between one event and another at the coding level.

You can see the sequencing of events at work in the compartmented model. An infection event happens at SI edges and has a local effect: change the susceptible node's compartment, which in turn might generate more SI edges at which further infection events can occur. A recovery event happens at infected nodes, meaning it can *only* happen *if* an infection event previously happened there (or if the node was initially infected). The sequencing is implicit in the definition of the event loci, and of the events' effects – even though there's no explicit encoding within the events themselves of how they'll be sequenced.

A simulation occurs in **simulated time**, which is to say the time in the simulated world. This is typically different to real-world or **wallclock time**, which is how long the simulation takes to run on a computer. These two notions of time differ substantially. It's easy to see why: most biological and physical processes take an eternity from the perspective of a modern computer. The progression of a disease in an individual might take days, and we seldom want to wait that long for results. Simulation time often therefore passes more quickly than wallclock time.

We might need some way to relate simulation time to "real world" time, for example to see how many days an epidemic will last. In that case we'll need to develop ways to translate between simulated time and the "real" time of the phenomenon being studied. But often we don't care about this level of realism, and are happy to work in a more abstract world.

There's still another thing to consider, which is the issue of temporal resolution. Time, at least at the macro scale, is a continuous quantity, represented as a real number. A **continuous time** simulation represents time in this way, and also typically assumes that only one event happens at each (simulated) moment. This may sound restrictive, but the idea is that the events that do happen, happen instantaneously, so two events never need happen at *exactly* the same time: we can always put some infinitesimal gap between them. In SIR, this means people go from being susceptible to being infected instantaneously; if two people are infected, one of them is always infected before the other.

Another way to view time is to think of it as divided up into discrete chunks: seconds, for example. Instead of modelling a continuous stream of events, each occurring at a different instant, we think about blocks of time in which a set of events occur. This is a **discrete time** simulation.

Which of these approaches is "right"? Neither – and that's anyway the wrong question. They are both approximations of reality that we use to perform computational experiments. There are sometimes reasons to prefer one over the other, but often the choice is a matter of intellectual preference or coding convenience. At the risk of massive stereotyping, people with computer science backgrounds are often (at least initially) more comfortable with a discrete-time view, while people with a classical science background often find it easier to think about continuous time. (One reason for this may be that the mathematics taught in computer science programmes is typically overwhelmingly discrete and tends not to emphasise modelling with differential equations, which is where the continuous ideas come from.) There are good mathematical reasons to prefer continuous-time over discrete-time simulation, but both are available to you.

When working with random networks and stochastic processes there are additional complications due to the use of randomness. It is entirely possible that, just by a chance interaction, a disease on a network will die out. Run the *same* experiment on the *same* network with the *same* parameters – and you might get a disease that *doesn't* die out, because the chance interaction didn't happen this time.

Does this mean that such experiments aren't repeatable? No! – but it *does* mean that we need to be careful, perform repetitions, and be sure that we understand the implications of the various random factors that affect the outcome of each experiment. We'll have a lot more to say on this topic later.

What this discussion is getting at is that we need to be careful in going from the models we develop, their realisation in code, and their execution in simulation, to conclusions about the real world. We need to be sure that the conclusions we draw are supported by the simulations we've done, and that they match, to an appropriate degree, observations we can make about the real-world process we're simulating.

Let's look in overview at the process of discrete-event simulation, before we get into the coding details.

The basic process of simulation involves repeatedly deciding three things:

- *when* (in simulation time) does the next event occur?
- *where* in the network does it occur? and
- *which* event is it that occurs?

The event is then executed, and the process repeats – forever in principle, and in practice until some **termination condition** occurs. In network science we often use a termination condition of **equilibrium**, where the network has in some sense "stabilised" so we can look at its overall state. In SIR this might be when there are no infected nodes left in the network, since no further events are then possible.
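The skeleton of this loop can be sketched in a few lines (a toy illustration, not any particular simulator's code):

```python
import random

# A toy discrete-event loop: each iteration answers *when* (advance the
# clock), *where* (pick a place), and *which* (pick an event), stopping
# at the termination condition. Not any real simulator's code.
def simulate(places, events, rng, maxTime=100):
    t = 0
    while t < maxTime and places:           # equilibrium: nothing left to do
        t = t + 1                           # when: the next timestep
        where = rng.choice(sorted(places))  # where: a place in the network
        which = rng.choice(events)          # which: an event to fire there
        which(places, where)
    return t

# a trivial event that just uses up the place it fires at
def consume(places, where):
    places.discard(where)

rng = random.Random(42)
t = simulate(set(range(5)), [consume], rng)
print(t)    # 5: one event per timestep until no places remain
```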

How are these three decisions made? The details are what differentiates between the different methods of simulation. For our purposes, `epydemic`

provides a small framework for simulating epidemics on networks, with the decision-making either being coded directly or – more conveniently – being offloaded to a software encoding of a compartmented model. It's this framework we'll turn to next.

(This is a chapter from Complex networks, complex processes.)

Having developed a discrete compartmented model of disease, we now have to turn it into code. Most epidemic processes share a common form and can be simulated using a small set of common techniques. It therefore makes sense to capture the form of an epidemic process in code, and then use that code to drive a simulator. In this way we can focus on the epidemic process rather than on the process of simulation.

We make use of a Python library, `epydemic`

, written to provide a framework within which to conduct simulations of epidemic processes. `epydemic`

provides three main elements:

- A base class for describing epidemic processes quickly and cleanly;
- A small library of common epidemic processes that can be used as a starting point for defining additional processes; and
- Implementations of the two most common simulation regimes.

As well as providing the small-scale features we introduce in this chapter, `epydemic`

has features for performing large-scale simulations on parallel compute clusters, integrating cleanly with the `epyc`

simulation library. We'll discuss this integration in more detail later. You can also read the API documentation for a full description of `epydemic`

and its capabilities.

As we saw earlier, an epidemic simulation consists of two main components:

- A **model** of the disease process that describes how nodes in the network are infected, recover, and so forth, typically using either probabilities or fixed elapsed times; and
- A **dynamics** that applies the model to a network over the timespan of the simulation.

The former describes the way nodes evolve as the disease progresses; the latter describes how this evolution occurs in time. For the moment we'll focus on the model, which `epydemic`

represents by the class `epydemic.CompartmentedModel`

. We sub-class this class to create different compartmented disease models.

An instance of a sub-class of `epydemic.CompartmentedModel`

basically encodes exactly the kind of discrete model we developed earlier. Each node in the network resides in a **compartment**, a box representing the disease state of the node. We are typically interested in how the sizes of the compartments change over time. A **locus** is a place in the network where an **event** can occur, where an event typically changes the compartment of one or more nodes around the locus. An example event in SIR would be an infection event, whose locus is the set of SI edges and which causes the S end to become I and any edges to adjacent S nodes to be classified as SI (i.e., be added to the locus for possible future infection).

The significance of loci is that `epydemic`

keeps track of the nodes and edges in each locus at each stage of the simulation. In our SIR example, after every simulation event `epydemic`

checks whether any nodes should be removed from the infected locus and whether any edges should be added to the SI locus – and does so automatically in a way that is optimised to only check as little of the network as necessary. This both makes simulation more efficient and simplifies the epidemic process description.

An `epydemic`

event is simply a Python function. As such it can do anything Python can do – but typically will perform only some simple transitions of the compartments of nodes. `epydemic.CompartmentedModel`

provides two methods that perform these operations. `changeCompartment()`

changes the compartment of a node, making sure that this change is reflected in the process' loci. `markOccupied()`

marks an edge as having been used to spread the disease, which can be useful when exploring how the epidemic spread.

Events might want to do other things, for example keeping track of the simulation time at which the epidemic crossed a particular edge, which might be useful for doing animations. About the only restriction on event code is that it should use `changeCompartment()`

to change nodes' compartments, as this ensures that the loci are updated.

To see how all this works, let's define SIR for ourselves using `epydemic`

. This isn't actually necessary, as `epydemic`

already *has* an implementation of SIR (and indeed other compartmented models). But SIR is conceptually the simplest compartmented model, and demonstrates the approaches we'll use later.

In [1]:

```
import epydemic
import networkx
```

Let's first define a model for our disease. We know that SIR consists of three compartments: Susceptible, Infected, and Removed. There are two loci and two corresponding events: infected nodes (which can be subject to removal events), and SI edges (which can undergo infection events). The model requires two dynamical parameters: the probability of infection along an edge, and the probability of removal. We also require an initial seeding of the network in which nodes become infected with a given probability.

Let's see how this is coded in `epydemic`

:

In [2]:

```
class SIR(epydemic.CompartmentedModel):
    '''The Susceptible-Infected-Removed compartmented model of disease.
    Susceptible nodes are infected by infected neighbours, and are removed
    when they are no longer infectious.'''

    # the model parameters
    P_INFECTED = 'pInfected'   #: Parameter for probability of initially being infected.
    P_INFECT = 'pInfect'       #: Parameter for probability of infection on contact.
    P_REMOVE = 'pRemove'       #: Parameter for probability of removal.

    # the possible dynamics states of a node for SIR dynamics
    SUSCEPTIBLE = 'S'          #: Compartment for nodes susceptible to infection.
    INFECTED = 'I'             #: Compartment for nodes infected.
    REMOVED = 'R'              #: Compartment for nodes recovered/removed.

    # the locus for infection events
    SI = 'SI'                  #: Edge able to transmit infection.

    def __init__( self ):
        super(SIR, self).__init__()

    def build( self, params ):
        '''Build the SIR model.

        :param params: the model parameters'''
        pInfected = params[self.P_INFECTED]   # probability of a node being initially infected
        pInfect = params[self.P_INFECT]       # probability of infection
        pRemove = params[self.P_REMOVE]       # probability of removal

        self.addCompartment(self.SUSCEPTIBLE, 1 - pInfected)
        self.addCompartment(self.INFECTED, pInfected)
        self.addCompartment(self.REMOVED, 0.0)

        self.addLocus(self.INFECTED)
        self.addLocus(self.SUSCEPTIBLE, self.INFECTED, name = self.SI)

        self.addEvent(self.INFECTED, pRemove, lambda d, t, g, e: self.remove(d, t, g, e))
        self.addEvent(self.SI, pInfect, lambda d, t, g, e: self.infect(d, t, g, e))

    def remove( self, dyn, t, g, n ):
        '''Perform a removal event. This changes the compartment of
        the node to :attr:`REMOVED`.

        :param dyn: the dynamics
        :param t: the simulation time (unused)
        :param g: the network
        :param n: the node'''
        self.changeCompartment(g, n, self.REMOVED)

    def infect( self, dyn, t, g, (n, m) ):
        '''Perform an infection event. This changes the compartment of
        the susceptible-end node to :attr:`INFECTED`. It also marks the edge
        traversed as occupied.

        :param dyn: the dynamics
        :param t: the simulation time (unused)
        :param g: the network
        :param (n, m): the edge transmitting the infection, susceptible-infected'''
        self.changeCompartment(g, n, self.INFECTED)
        self.markOccupied(g, (n, m))
```

Let's look at the `build()`

method first. This is called to construct the epidemic model. It first extracts the three parameters for the simulation from the hash of parameters. It then declares the three compartments of SIR using the `addCompartment()`

method. The second parameter is the probability of a node being initially assigned to this compartment. (There are no initially-removed nodes.)

We then add the two loci using `addLocus()`

. Loci come in two flavours in `epydemic`

. **Node loci** capture nodes in a given compartment, while **edge loci** are edges linking nodes in two particular compartments. In this case, we have a node locus for infected nodes and an edge locus for SI edges (which we name for later).

Finally we bind events to each locus using `addEvent()`

. Events happen at a given locus with a given probability. An event is a function that takes four parameters: the simulation dynamics, the current simulation time, the `networkx`

network, and an element from the locus to which the event is bound (either a node or an edge). Since we represent events by methods on the model object, we need to wrap them in lambda expressions (Python closures) so that, when the event is triggered, it calls the correct method on the right model. We then bind these events to the correct loci. A locus may have several events associated with it if desired, and conversely the same event might occur at several loci.

The above code completely specifies the structure of the epidemic. We now need to specify what happens at each event. For a `remove()` event, we are passed a node and change its compartment using `changeCompartment()`. For an `infect()` event we are passed an SI edge, with the edge being aligned so that the compartments of its endpoints match the way we specified in defining the corresponding locus. We change the susceptible end's compartment to be infected, and mark the edge itself as "occupied", since the infection spread along it.

So far so good, but we still don't have anything to actually *run*. What we *do* have is the static description of a disease model that describes the probabilities of a node moving between different disease stages – together with code for the events that will occur as we progress through each stage.

What we still need is a way of deciding when the different progressions happen for the different nodes. This is the issue of simulation dynamics. There are many ways in which we can perform simulations, but the important point is that the model we described can be applied under *any* of these different dynamics – and that's generally true for most models developed using `epydemic`. We next need to explore the simulation under different dynamics to see how they differ.


The differential equations describe a **continuous** model, where the population sizes are assumed to be real numbers. This makes a certain amount of sense if we think of compartments as fractions of an overall population. From another perspective, however, it's clear that only whole numbers of people become sick, leading to a **discrete** model that places an integer number of individuals into each compartment. How do we reconcile these two views?

The continuous model is best thought of as modelling the large-scale, **macroscopic** behaviour of the epidemic, in which we don't really care about the exact numbers of individuals concerned. Also, for a large population, considering the relative sizes of compartments to a few decimal places of accuracy will still yield something close to a whole number of individuals per compartment when the compartment fractions are scaled up to the size of the overall population.

But we can also ask what happens at the **microscopic** scale, for individuals. In that case we want to know how the disease might evolve in a *single person*. Another way to think of this is that a compartmented model allows each individual person to traverse the compartments according to the probabilities associated with each transition.

Clearly the macroscopic and microscopic descriptions are related: we assume that, if we let a disease run through a population, then the ways in which individuals' infections evolve will integrate to reflect the macroscopic description in terms of fractions of the entire population.

Continuity isn't the only assumption implicit in this description, however. Let's re-visit the equations describing SIR:

$$ \frac{ds}{dt} = -\beta s(t) i(t) \hspace{1in} \frac{di}{dt} = \beta s(t) i(t) - \alpha i(t) \hspace{1in} \frac{dr}{dt} = \alpha i(t) $$Here $i(t)$ denotes the fraction of the population who are infected at time $t$. The rate of change in this population, $\frac{di}{dt}$, has two terms: a growth term $\beta s(t) i(t)$, and a reducing term $\alpha i(t)$. The growth term says that the infected population grows at a rate that is proportional to the total number of (susceptible, infected) pairs in the population, which is simply the product of the two population sizes: in each unit of time, all these people meet each other and a fraction $\beta$ of the susceptibles become infected.
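To make the macroscopic behaviour concrete, here's a minimal sketch (not from the text) that integrates these three equations with Euler's method; the parameter values $\beta = 0.5$ and $\alpha = 0.2$ are purely illustrative.

```python
def sir(beta, alpha, i0, dt = 0.01, tmax = 100.0):
    """Integrate the continuous SIR equations with Euler's method,
    returning the final compartment fractions (s, i, r)."""
    s, i, r = 1.0 - i0, i0, 0.0
    t = 0.0
    while t < tmax:
        ds = -beta * s * i               # ds/dt = -beta s i
        di = beta * s * i - alpha * i    # di/dt = beta s i - alpha i
        dr = alpha * i                   # dr/dt = alpha i
        s, i, r = s + ds * dt, i + di * dt, r + dr * dt
        t += dt
    return (s, i, r)

s, i, r = sir(beta = 0.5, alpha = 0.2, i0 = 0.01)
```

Since $\frac{ds}{dt} + \frac{di}{dt} + \frac{dr}{dt} = 0$, the three fractions always sum to 1: the total population is constant, with individuals simply moving between compartments.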

The assumption, clearly, is that all these pairs of people *do actually meet*, and this is a strong assumption. It's called the assumption of **well-mixing**, or alternatively of a **homogeneous** population. We discussed this earlier when we talked about attack rates and reproduction numbers. In "small" populations, well-mixing isn't a totally unreasonable assumption – although it *is* still an approximation of reality (even the people in my small village don't all meet each other every day). If we were to consider a population the size of Scotland, it's clearly implausible.

That doesn't mean we should throw the model away. The statistician George Box is quoted as saying, "*There is no need to ask the question 'Is the model true?'. If 'truth' is to be the 'whole truth' the answer must be 'No'. The only question of interest is 'Is the model illuminating and useful?'*" But the simplification of SIR to three differential equations does smear-out some structure that might be important – and, it turns out, *is* important in the sense that there are disease phenomena that occur in nature that don't occur in this system. Putting SIR onto a network is one way of addressing this.

So in moving to diseases on networks we're trying to address two issues:

- that populations exhibit structure and so are not well-mixed; and
- that diseases occur in individuals, not simply in populations.

To address the first issue, we use a network to represent individuals and their interactions, with the connection structure of the network providing the opportunity for different kinds of inhomogeneity. For the second issue, we develop a discrete description of SIR, consistent with the continuous version, that we can apply to the individual nodes of the network. We can then study how different network structures affect the properties of an epidemic.

The first step is conceptually the easier, but has some subtleties. The natural way to treat a population as a network is to have one node per individual in the population. Edges between nodes represent social interactions that are opportunities for infection. If a susceptible person is connected by an edge to an infected person, then there is an opportunity for the latter person to infect the former. Conversely, if there is no such edge, then the susceptible person cannot be infected by that infected individual, since there exists no social contact between them.

How might we construct this network? The simplest approach is undoubtedly to create a random network of some kind: perhaps an ER network, in which case we will obtain a "social network" for $N$ individuals who interact in a random way with a well-defined mean number of others. Simulating an epidemic will then involve running our to-be-designed discrete disease process over this network, and examining the results.

A moment's thought will show several problems with this approach. Firstly, not all contacts are created equal, as we saw when we discussed secondary attack rates: people in close contact (such as children in a nursery, or people in a care home) are more likely to infect one another than people in weaker contact (such as workers in a factory). We could address this issue, perhaps, by **weighting** the edges between people to capture the fact that "some edges are more infectious than others". Alternatively, we might argue that these factors will even out over a suitably heterogeneous population, and so if we focus on the probability of infection for an "average social contact" we can still extract meaningful information from any simulation.

Secondly, how are individuals to be connected in the infection network? Are their connections random? Do they exhibit a more clustered structure? Are there dense pockets of highly connected individuals, separated by sparse connections? These are questions of network degree, connectivity, and so forth – of network topology in general – and intuitively it seems clear that the choice may make a difference. We might, for example, expect a disease in a well-connected, high-mean-degree network to spread differently to the same disease on a network with lower connectivity.

Thirdly, we have described a **static** network whose connections don't change over time. Relating this back to the context we're considering, that doesn't seem appropriate. People might be expected to avoid individuals who are sick, or the sick individuals might be quarantined to preclude social contact. Either of these behaviours would be expected to remove social contacts – edges – from the part of the network around an infected individual.

(When I was growing up in England in the 1970s, parents actually demonstrated exactly the opposite behaviour. If a child got measles, for example, mothers all brought their children round for a play date with the explicit intention of getting them infected too – the logic being that exposing a child to the disease early was good for their immune systems, got the one-off infection "out of the way", and generally improved herd immunity. None of those arguments are at all wrong, but this approach to parenting seems to have gone out of fashion.)

In either case, we might think that it is more appropriate to adopt an approach that changes the structure of the network in response to infection, perhaps reducing the number of edges when a node is infected. In this case we have a **dynamic** or **adaptive** network structure, where the network responds to the progress of the process running over it. Again, we might decide that these effects will even-out and can be ignored to give an "average" result.

The upshot of this discussion is that we can take a simple representation – a static, random network with unweighted links – and then add more features if we think they might be relevant. As we do so we make the model more realistic – but also more complicated, and we add to the number of possible degrees of freedom.

Adding more factors in pursuit of realism may sound attractive, but we have to bear in mind that it also gives us a freedom we may not be able to use effectively. Consider the case where we reduce the number of edges to an infected individual. How many edges do we remove, and how do we select them? Will these choices make a critical difference, and how do they interact with the existing parameters of the model? In adding a new freedom we also add a considerable burden of analysis and simulation to check what effects our new freedom has. Might it be better to stick with the simplest case?

This argument might sound bogus to you: a cop-out just to reduce the amount of work we have to do. And if your primary interest is in the dynamics of a *particular* disease, about which you want to make accurate predictions – as would be the case for planning a clinical response to an outbreak – then of course you may strive to build *the most realistic model possible* and accept the associated extra work. On the other hand, if your primary interest is epidemic processes in general, you might be happy to stick to simpler models to see whether they *always* exhibit certain features which can then be generalised (with care) to *all* diseases. We'll see an example of this later in the case of epidemic thresholds, where certain combinations of infectiousness and recovery *necessarily* lead to epidemics pretty much regardless of everything else.

Now let's return to the second issue we identified above: moving from a continuous to a discrete description of the disease process.

Compartmented models of disease represent diseases as a collection of compartments. We notionally consider each individual in the population to be "in" a particular compartment at a given time. As their disease progresses, they move "from" one compartment "to" another, typically according to some stochastic process where their re-location happens with some probability. In addition, this probability may be affected by other factors, for example the presence of individuals in other compartments as neighbours. When looking at the overall disease behaviour (the macroscopic view) we are typically interested in how the relative sizes of the compartments change. When looking at the disease's progress (the microscopic view) we additionally need to know about the compartments of neighbouring individuals. It is precisely this microscopic behaviour that is missing from the continuous-process description of compartmented models.

How then do we describe interactions at the scale of individual nodes?

Let's look again (not for the last time) at the differential equations for SIR:

$$ \frac{ds}{dt} = -\beta s(t) i(t) \hspace{1in} \frac{di}{dt} = \beta s(t) i(t) - \alpha i(t) \hspace{1in} \frac{dr}{dt} = \alpha i(t) $$There are three compartments, and the three equations (one per compartment) tell us how their populations change. Looking at the last equation, we see that $r(t)$ increases at a rate proportional to $i(t)$, the size of the infected compartment. Similarly, looking at the first equation, $s(t)$ decreases at a rate proportional to the number of susceptible-infected pairs. In the second equation, these two effects both appear inverted – understandably, since individuals pass through infection to recovery, and rates have to balance out if we are to keep the population constant.

So much for the compartments: what does this mean for an individual?

We know that we are representing the interactions between individuals as network edges. Suppose that at some time we have a given susceptible individual. That individual cannot become infected spontaneously, but only through interaction with an individual who is infected at the same time and with whom she has some social contact, represented by an edge. So to determine whether the susceptible individual is infected, we need to know whether she has any edges that lead to individuals who are infected. We refer to such edges as **SI edges**: they connect a susceptible node to an infected node.

Suppose we have found an SI edge linking our susceptible node to an infected neighbour. The infection "passes along" this edge with a probability $\beta$, turning our susceptible node into an infected node, decreasing the population of the susceptible compartment by one and increasing the population in the infected compartment by one.

But there is also another effect. The edge down which the infection travelled is no longer an SI edge, since it now connects an infected node to *another* infected node. Furthermore any other SI edges that connected our formerly-susceptible node to infected nodes are also no longer SI edges. And finally, the fact that our formerly-susceptible node is now an infected node means that there may be new SI edges created, where there are edges between our node and a neighbouring susceptible node.

This is quite a bit more complicated than the equations suggest at first glance. It is perhaps simpler to think of it slightly differently. It is the population of SI edges, *not* the population of susceptible or infected individuals alone, which determines the rate of infection: that much is clear from the infection term. The infection dynamics happens, not at individual nodes, but at SI edges. We can think of the SI edges as a **locus** for the infection dynamics: a place at which infection possibly occurs. The edges in that locus are potentially changed by every infection **event**: every time an SI edge actually results in an infection.
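As a concrete illustration (a sketch, not the book's eventual implementation), here is how an infection event might update an SI-edge locus held as a set of (susceptible, infected) pairs, over a network stored as a plain adjacency dict:

```python
def infect_node(adj, compartment, si_edges, n):
    """Move node n from S to I, updating the SI-edge locus.
    SI edges are stored as (susceptible, infected) pairs."""
    compartment[n] = 'I'
    for m in adj[n]:
        if compartment[m] == 'S':
            # n's susceptible neighbours now sit on new SI edges
            si_edges.add((m, n))
        elif compartment[m] == 'I':
            # the edge from n to an infected neighbour is no longer SI
            si_edges.discard((n, m))

# a path network 1 - 2 - 3, with node 1 infected
adj = { 1: [2], 2: [1, 3], 3: [2] }
compartment = { 1: 'I', 2: 'S', 3: 'S' }
si_edges = { (2, 1) }

infect_node(adj, compartment, si_edges, 2)
# the locus is now {(3, 2)}: edge (2, 1) has gone, edge (3, 2) has appeared
```

A single event both consumes SI edges (those now joining two infected nodes) and creates new ones (those to the newly-infected node's susceptible neighbours), exactly as described above.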

Removal is simpler: a removal event doesn't depend on the structure around the infected node, but *once it has happened* it has an impact on the SI edges – and therefore, indirectly, on future infection events. The locus for removal events is therefore the population of nodes in the infected compartment, any of which may spontaneously be removed.

Summing-up the above, we can now formulate a discrete description of SIR.

The model consists of three compartments: susceptible (S), infected (I), and removed (R). Each node resides in exactly one compartment at any time. There are two loci for the dynamics: SI edges, and infected nodes. There are two events: infection happens at the SI locus with probability $\beta$, while removal happens at the I locus with probability $\alpha$. The infection event moves the S node into the I compartment; the removal event moves the I node into the R compartment. Removal therefore affects the contents of the I locus, and both events may affect the contents of the SI locus. If we compare this description to the three equations above it is hopefully easy to see the derivation.

What we've done is quite significant, though. We've moved from a description consisting of three continuous rates of change (the three differential equations) to a description consisting of two discrete events, each happening at a different locus. The events can be applied to individual nodes or edges in our network model, in which we would need to track exactly which nodes are in which compartments, and which edges are in the SI locus we're interested in. It's worth noting that we really don't care about removed nodes: they don't appear in either locus, and therefore can't affect the dynamics, other than by the fact that nodes that are removed are by definition *not* susceptible or infected.
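To make the two-loci, two-event description concrete, here is a minimal, self-contained sketch of the discrete dynamics. It uses the simplest possible scheme – a synchronous loop in which every possible event is tested in each timestep – rather than anything more sophisticated, and all names and parameter values are illustrative:

```python
import random

def discrete_sir(adj, beta, alpha, seeds, rng = random):
    """Run discrete SIR to completion on a network given as an adjacency
    dict. Infection events fire across SI edges with probability beta;
    removal events fire at infected nodes with probability alpha.
    Returns the final compartment sizes."""
    compartment = { n: 'S' for n in adj }
    for n in seeds:
        compartment[n] = 'I'
    while any(c == 'I' for c in compartment.values()):
        # infection events at the SI-edge locus
        to_infect = set()
        for n in adj:
            if compartment[n] == 'I':
                for m in adj[n]:
                    if compartment[m] == 'S' and rng.random() < beta:
                        to_infect.add(m)
        # removal events at the infected-node locus
        to_remove = [ n for n in adj if compartment[n] == 'I' and rng.random() < alpha ]
        # apply all of this timestep's events synchronously
        for m in to_infect:
            compartment[m] = 'I'
        for n in to_remove:
            compartment[n] = 'R'
    sizes = { 'S': 0, 'I': 0, 'R': 0 }
    for c in compartment.values():
        sizes[c] += 1
    return sizes

# a small complete network of 20 nodes, seeded with one infected node
N = 20
adj = { n: [ m for m in range(N) if m != n ] for n in range(N) }
sizes = discrete_sir(adj, beta = 0.3, alpha = 0.2, seeds = [0])
```

Every run ends with the infected compartment empty – every infected node is eventually removed – so the interesting outcome is how the population splits between S and R.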

The process description is an essential step along the way to simulation, but we're not quite there yet. We need to be able to express the above model in a computational form suitable to be executed. We need to be able to keep track of the populations in the different loci of the dynamics. And we need to choose where, and at what times, the different events occur.


When we created ER networks earlier, we started with an empty network of $N$ nodes and then added edges between pairs of nodes with a given probability $\phi$. We know that this will eventually lead to a network with mean degree $N\phi$. But let's look at the process from a slightly different perspective: what happens *as we add the edges*? Specifically, how do the nodes become connected as we add edges?

Intuitively we can argue as follows. We start with an empty network. Adding an edge necessarily builds a 2-node component. Adding another edge is (for a large network, anyway) overwhelmingly likely to pick two other nodes not in the first component, forming a second. We can continue like this for some time, but gradually it will become more likely that one of the nodes we choose to connect is not isolated but rather part of a larger cluster: indeed, *both* nodes may be parts of *different* clusters, which thereby become joined into a single one. As we continue to add edges, it becomes increasingly likely that the edges will be placed between increasingly large components, thereby connecting them. And as a component becomes larger, there are more ways to connect to it (since there are more nodes to choose as endpoints), so we might expect that large components grow at the expense of small components. Eventually the network may become one large component, but even before this we might expect that there will be one or more components that are large relative to the others and to the size of the network as a whole.

This is indeed what happens. As we add edges to the initially-empty network according to the ER process, we create a large number of small components that over time connect to each other. Because large components are easier to connect to they grow faster, which leads to the formation of a component that contains a large fraction of the nodes: the **giant component**.

Does the giant component necessarily form? A moment's thought will suggest not: if we only add a small number of edges, then clearly there won't be enough for a giant component to form.

Let's denote the size of the largest component in a network by $N_G$. How does $N_G$ vary as we add edges?

Starting from an empty network, we have $N_G = 1$ since every node is its own small cluster. The ratio of the size of the "giant" component to the size of the network, $\frac{N_G}{N} \rightarrow 0$ as $N \rightarrow \infty$: the giant component is an insignificant fraction of the nodes. As we add edges, we expect $N_G$ to increase. If we were to set $\phi = 1$ and add *all* possible edges, then at the end of the process we would have $\frac{N_G}{N} = 1$, the giant component containing all the nodes. We can think of $\frac{N_G}{N}$ as the probability that a node chosen at random will be in the giant component. Let's refer to this probability as $S$.
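We can estimate $S$ empirically by generating an ER network and measuring the size of its largest component. The following sketch (not from the text) uses a plain adjacency dict and a traversal rather than `networkx`, so that everything is visible:

```python
import random

def er_giant_fraction(n, kmean, rng = random):
    """Generate an ER network with n nodes and mean degree kmean,
    and return the fraction of nodes in its largest component."""
    phi = kmean / n
    adj = { u: [] for u in range(n) }
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < phi:
                adj[u].append(v)
                adj[v].append(u)

    # find the largest component by traversing from each unseen node
    seen = set()
    largest = 0
    for u in range(n):
        if u not in seen:
            component = { u }
            frontier = [ u ]
            while frontier:
                w = frontier.pop()
                for x in adj[w]:
                    if x not in component:
                        component.add(x)
                        frontier.append(x)
            seen |= component
            largest = max(largest, len(component))
    return largest / n
```

Running this for a range of mean degrees gives a direct estimate of $S$, which we can later compare against the analytical result derived below.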

How does a node $i$ end up outside the giant component? It means that, for every other node $j$ in the network,

- either $i$ is not connected to $j$; or
- $i$ is connected to $j$ but $j$ is itself not in the giant component.

For a particular node $j$, the probability of the first case is $(1 - \phi)$ (since the probability of there being an edge added is $\phi$); the probability of the second case is $\phi (1 - S)$, there being an edge between $i$ and $j$ (which is $\phi$) *and* $j$ not being in the giant component (which is $(1 - S)$). If we take the product of this probability over every $j$, then the probability we are looking for is given by the recurrence equation $1 - S = ((1 - \phi) + \phi (1 - S))^{N - 1}$. If we re-arrange this slightly,

$$ 1 - S = \left( 1 - \frac{\langle k \rangle}{N} S \right)^{N - 1} $$

where we used $\phi = \frac{\langle k \rangle}{N}$. Taking logs on both sides,

\begin{align*} \ln (1 - S) &\approx N \, \ln \left( 1 - \frac{\langle k \rangle}{N} S \right) \\ &\approx -N \frac{\langle k \rangle}{N} S \\ &= - \langle k \rangle S \\ \end{align*}where we approximated $N - 1 \approx N$ for large $N$, and used $\ln (1 - x) \approx -x$ for small $x$. Then we can take exponentials on each side, leading to:

\begin{align*} 1 - S &= e^{- \langle k \rangle S} \\ S &= 1 - e^{- \langle k \rangle S} \end{align*}This is still an awkward recurrence equation: $S$ appears on both sides. Situations like this often have no closed-form solution, but there's a trick to make progress, which is to make use of a graphical method.

In [1]:

```
import networkx
import math
import numpy
import matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
import seaborn
```

In [2]:

```
fig = plt.figure(figsize = (5, 5))

# create a set of points for S, evenly spaced over the interval [0.0, 1.0]
ss = numpy.linspace(0.0, 1.0)

# different kmeans and their associated line types
kmeans = [ 0.5, 1, 1.5, 2 ]
lines = [ 'r-', 'g-', 'b-', 'y-' ]

# build a function, parameterised by kmean, to run over S
def make_S( kmean ):
    return (lambda S: 1.0 - math.exp(-kmean * S))

# plot S against S
plt.plot(ss, ss, 'k--')

# plot the exponential curves for the different selected kmeans
for i in range(len(kmeans)):
    kmean = kmeans[i]
    line = lines[i]

    # map the appropriate function across S
    ys = [ make_S(kmean)(s) for s in ss ]

    # plot the curve
    plt.plot(ss, ys, line, label = '$\\langle k \\rangle = {k}$'.format(k = kmean))

plt.xlabel('$S$')
plt.title('Solutions for $S = 1 - e^{-\\langle k \\rangle S}$ for different values of $\\langle k \\rangle$')
plt.legend(loc = 'upper left')
_ = plt.show()
```

So by inspection for $\langle k \rangle = 1.5$ there is a solution at approximately $S = 0.58$, while for $\langle k \rangle = 2$ there is a solution at approximately $S = 0.8$ – 80% of the nodes in the network are in the giant component.
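We can check these graphical readings numerically. A simple fixed-point iteration, started from $S = 1$ so that it avoids the trivial solution $S = 0$, converges to the intersection point (a sketch; the tolerance is an arbitrary choice):

```python
import math

def giant_size(kmean, tol = 1e-10):
    """Solve S = 1 - e^(-kmean S) by fixed-point iteration,
    starting from S = 1 to avoid the trivial solution S = 0."""
    S = 1.0
    while True:
        S_new = 1.0 - math.exp(-kmean * S)
        if abs(S_new - S) < tol:
            return S_new
        S = S_new

# giant_size(1.5) is approximately 0.583; giant_size(2) approximately 0.797
```

These agree with the values read off the plot above.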

Looking at the lines for the different values of $\langle k \rangle$, notice that as $\langle k \rangle$ increases the corresponding curve starts out steeper. Shallow curves never intersect the line $y = S$ away from the origin, meaning no giant component emerges; as the curves get steeper, a solution emerges starting at low values of $S$ and gradually moving towards $S = 1$. The separator between these two regimes occurs when the initial gradient of the curve matches that of $y = S$, when the curve and the line are tangent to each other at $S = 0$. This separator is referred to as a **critical transition** or a **critical threshold**, because it's the critical value at which behaviour abruptly changes. It happens when:

$$ \frac{d}{dS} \left( 1 - e^{-\langle k \rangle S} \right) = 1 $$

and so:

$$ \langle k \rangle e^{-\langle k \rangle S} = 1 $$At $S = 0$ we discover that the critical threshold $\langle k \rangle_c = 1$.

We can of course also relate $\langle k \rangle_c$ back to $\phi$, the probability of adding an edge, and discover the critical threshold probability $\phi_c$ below which the giant component doesn't form, but above which it does (a point we explore a little more below). For $\langle k \rangle_c = 1$ we have $\phi_c = \frac{1}{N}$.

Let these two results sink in for a minute. Firstly, a mean degree of 1 – every node attached to on average one neighbour – is enough to start forming a giant component and therefore, by implication, to take the network towards being connected. Secondly, for a large ER network even a vanishingly small number of edges will result in the formation of a giant component – and that number gets smaller as the network gets bigger! This all suggests that giant components will be common, so a lot of the networks we encounter in applications will have one.

Alternatively we can observe that, while it's hard to find $S$ in terms of $\langle k \rangle$, it is easy to find $\langle k \rangle$ in terms of $S$:

\begin{align*} S &= 1 - e^{-\langle k \rangle S} \\ 1 - S &= e^{-\langle k \rangle S} \\ \ln (1 - S) &= -\langle k \rangle S \\ \langle k \rangle &= - \frac{\ln (1 - S)}{S} \end{align*}Since we're actually interested in $S$ we can plot the curve rotated by ninety degrees for clarity, which yields:

In [3]:

```
fig = plt.figure(figsize = (5, 5))
ss = numpy.linspace(0.0, 1.0, endpoint = False)[1:]   # omit 0.0 and 1.0 to avoid division-by-zero and log-of-zero errors
plt.xlim([0, 4])
plt.xlabel("$\\langle k \\rangle$")
plt.ylabel("$S$")
plt.plot([ -math.log(1.0 - S) / S for S in ss ], ss, 'r-')
plt.title('Expected size of giant component')
_ = plt.show()
```

This makes the critical nature of $\langle k \rangle_c = 1$ even more clear. As $\langle k \rangle$ grows beyond $\langle k \rangle_c$, the expected size of the giant component rapidly approaches the size of the network itself.

The existence and value of the critical threshold was first proven by Erdős and Rényi [ER59] in a paper that really marks the very start of network science. It shows that, even for small mean degrees, an ER network will have a giant component, and as the mean degree gets larger, that component will span the entire network. Looking at the graph above, you can see that the curve asymptotically approaches $S = 1$ as $\langle k \rangle \rightarrow \infty$. It is never *certain* that the process will connect the network – it's stochastic, after all – but it rapidly becomes overwhelmingly likely.

So much for the mathematics: let's look at the emergence of the giant component computationally.

The `networkx` function `number_connected_components()` computes the number of components in a network. To look at the giant component forming, we therefore need to count the number of components over the region around the critical threshold. We expect to see the number of components rapidly drop towards 1, and the fraction of nodes in the largest component rapidly increase towards 1.

We could therefore create an empty network and progressively add edges to it, counting the number of components as we go. We already have the code for this in our earlier from-scratch ER network generator: however, looking at the code, while the *result* is a random network, the *process* by which edges are added is actually very regular, and we should probably avoid such unnecessary regularity in case it makes a difference. One could easily imagine that adding edges in a regular fashion might generate components faster (or slower?) than truly random addition.

What we could do instead is to build a random network and then re-construct it by emptying it and then adding the same edges in a random order. This destroys any artefacts coming from the way in which we added the edges in the first place.

We first define an iterator that will randomise a list:

In [4]:

```
from copy import copy

class permuted:
    """An iterator for the elements of an array in a random order."""

    def __init__( self, es ):
        """Create an iterator for the elements of an array in a random order.

        :param es: the original elements"""
        self.elements = copy(es)    # copy the data to be permuted

    def __iter__( self ):
        """Return the iterator.

        :returns: a random iterator over the elements"""
        return self

    def __next__( self ):
        """Return a random element.

        :returns: a random element of the original collection"""
        n = len(self.elements)
        if n == 0:
            raise StopIteration
        else:
            i = int(numpy.random.random() * n)
            v = self.elements[i]
            del self.elements[i]
            return v

    next = __next__    # Python 2 compatibility
```

In [9]:

```
def growing_component_numbers( n, es ):
    """Build the graph with n nodes and add edges randomly from es, returning
    a list of the number of components in the graph as we add edges in a
    random order taken from a list of possible edges.

    :param n: the number of nodes
    :param es: the edges
    :returns: the number of components as each edge is added"""

    # create an empty graph
    g = networkx.empty_graph(n)

    # add edges to g taken at random from the edge set,
    # and compute components after each edge
    cs = []
    for e in permuted(es):
        g.add_edge(*e)
        nc = networkx.number_connected_components(g)
        cs.append(nc)
    return cs
```

In [10]:

```
# create an ER network and grab its edges
er = networkx.erdos_renyi_graph(2000, 0.01)
es = list(er.edges())

# replay these edges
component_number = growing_component_numbers(2000, es)

# plot components against edges
fig = plt.figure(figsize = (5, 5))
plt.title("Consolidation of components as edges are added")
plt.xlabel("$|E|$")
plt.ylabel("Components")
plt.plot(range(len(component_number)), component_number, 'b-')

# edge at which the giant component forms
i = component_number.index(1)

# highlight the formation of the giant component
ax = fig.gca()
ax.annotate("$|E| = {e} ({p}\\%)$".format(e = i, p = int(((i + 0.0) / len(es)) * 100)),
            xy = (i, 1),
            xytext = (len(component_number) / 2, component_number[0] / 2),
            arrowprops = dict(facecolor = 'black', width = 1, shrink = 0.05))
_ = plt.show()
```

The giant component forms well before we've added all the edges.

(Remember that this is a stochastic process. It's *possible* that a giant component would *never* form for a network, just by chance. However, for an ER network with 2000 nodes $\phi_c = \frac{1}{N} = 0.0005$, so $\phi = 0.01$ is well above the critical threshold.)

But *how* does the giant component form? Does it steadily accrete, or does it form suddenly as previously disconnected components connect? We can explore this by plotting the size of the largest component as we add edges, using the function `connected_components()` that returns the components as sets of nodes:

In [11]:

```
def growing_component_sizes( n, es ):
    """Build the graph with n nodes and edges taken from es, returning
    a list of the size of the largest component as we add edges in a
    random order taken from a list of possible edges.

    :param n: the number of nodes
    :param es: the edges
    :returns: list of largest component sizes as each edge is added"""
    g = networkx.empty_graph(n)
    cs = []
    for e in permuted(es):
        g.add_edge(*e)

        # pick the largest component (the one with the most node members)
        gc = len(max(networkx.connected_components(g), key = len))
        cs.append(gc)
    return cs
```

We can then plot the size of the largest component against the *number* of components on the same axes:

In [12]:

```
# compute list of component sizes as we add edges, re-using the
# ER edges we computed earlier
component_size = growing_component_sizes(2000, es)

fig = plt.figure(figsize = (5, 5))
plt.title("Emergence of the giant component as edges are added")

# plot the number of components
ax1 = fig.gca()
ax1.set_xlabel("Edges")
ax1.set_ylabel("Components", color = 'b')
ax1.plot(range(i), component_number[:i], 'b-', label = 'Components')
for t in ax1.get_yticklabels():
    t.set_color('b')

# plot component sizes against edges
ax2 = ax1.twinx()
ax2.set_ylabel("Component size", color = 'r')
ax2.plot(range(i), component_size[:i], 'r-', label = "Component size")
for t in ax2.get_yticklabels():
    t.set_color('r')
_ = plt.show()
```

Now isn't *that* interesting... Let's try to interpret what's happening. Quite early-on in the process of adding edges, there's a sudden jump in the size of the largest component in the network. Well before we get to the giant component, we start getting a component of hundreds, and then thousands, of nodes. The process by which we're adding edges is random and smooth, but nonetheless results in a sudden change in the connectivity of the network. The network consists of lots of small components that suddenly – over the course of adding a relatively small number of edges – join up and create an enormously larger component consisting of most of the nodes, which then itself gradually grows until it contains *all* the nodes. Below this threshold the network is composed of small, isolated collections of nodes; above it, it rapidly becomes one big component.

This is the first example we've seen of a critical transition, also known as a **phase change**: during a steady, incremental, process, the network changes from one state into another, very different state – and does so almost instantaneously.

We should examine the area around the critical point in more detail. First we need to locate it. Since the characteristic of the critical point is that the slope of the graph suddenly increases, we can look for it by looking at the slope of the data series:

In [13]:

```
def critical_point( cs, slope = 1 ):
    """Find the critical point in a sequence. We define the critical point
    as the index at which the derivative of the sequence first exceeds
    the desired slope. We ignore the direction of the slope.

    :param cs: the sequence of component sizes
    :param slope: the desired slope of the graph (defaults to 1)
    :returns: the point at which the slope of the sequence exceeds the desired slope"""
    for i in xrange(1, len(cs)):
        if abs(cs[i] - cs[i - 1]) > slope:
            return i
    return None
```

In [14]:

```
# find the critical point
cp = critical_point(component_size, slope = 50)
# some space either side of the critical point, with the
# right-hand side being more interesting and so getting more room
bcp = int(cp * 0.8)
ucp = int(cp * 3)
fig = plt.figure(figsize = (5, 5))
plt.title("Details of the phase transition")
# plot the number of components
ax1 = fig.gca()
ax1.set_xlabel("Edges")
ax1.set_ylabel("Components", color = 'b')
ax1.plot(range(bcp, ucp), component_number[bcp:ucp], 'b-', label = 'Components')
for t in ax1.get_yticklabels():
    t.set_color('b')
# plot component sizes against edges
ax2 = ax1.twinx()
ax2.set_ylabel("Component size", color = 'r')
ax2.plot(range(bcp, ucp), component_size[bcp:ucp], 'r-', label = "Component size")
for t in ax2.get_yticklabels():
    t.set_color('r')
# add a vertical line to show where we decided the critical point was
ax1.plot([cp, cp],        # x's: vertical line at the critical point
         ax1.get_ylim(),  # y's: the y axis' extent
         'k:')
_ = plt.show()
```

While the *number of components* comes down fairly smoothly, the *size of the largest component* jumps quickly as smaller components amalgamate.

In [16]:

```
def make_er_giant_component_size_by_kmean( n ):
    """Return a model function for a network with the given number
    of nodes, computing the fractional size of the giant component
    for different mean degrees.

    :param n: the number of nodes"""
    def model( kmean ):
        phi = kmean / n
        er = networkx.erdos_renyi_graph(n, phi)
        gc = len(max(networkx.connected_components(er), key = len))
        S = (gc + 0.0) / n
        return S
    return model

fig = plt.figure(figsize = (5, 5))
# plot the observed behaviour
kmeans = numpy.linspace(0.0, 5.0, num = 20)
sz = map(make_er_giant_component_size_by_kmean(2000), kmeans)
plt.scatter(kmeans, sz, color = 'r', marker = 'D', label = 'experimental')
# plot the theoretical behaviour, skipping S = 0 to avoid dividing by zero
ss = numpy.linspace(0.0, 1.0, endpoint = False)[1:]
plt.plot(map((lambda S: - math.log(1.0 - S) / S), ss), ss, 'k,', label = 'predicted')
plt.xlim([0, 5])
plt.ylim([0.0, 1.0])
plt.title('Expected vs observed sizes of giant component')
plt.xlabel('$\\langle k \\rangle$')
plt.ylabel('$S$')
plt.legend(loc = 'lower right')
_ = plt.show()
```

Any single experiment, though, generates *one specific* ER network that *might* happen to have properties that cause a giant component to form, or not form, or form with a slightly different size than predicted, just because of some fluke of the way the edges are added. The mathematical expression gives us the expected behaviour that's overwhelmingly probable in the case of large ($N \rightarrow \infty$) networks – but it can be misleading in any single case, and in smaller networks.
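To get a feel for the size of these flukes, we can repeat the experiment several times at the same parameters and look at the spread of outcomes. The sketch below uses small pure-Python stand-ins for the `networkx` machinery (the `er_edges` and `largest_component_fraction` helpers are hypothetical, written just for this illustration), with a deliberately small network so fluctuations are visible:

```python
import random

def er_edges(n, phi, rng):
    """Generate ER edges: each possible pair appears with probability phi."""
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < phi]

def largest_component_fraction(n, edges):
    """Fraction of nodes in the largest component, found with union-find."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path-halving
            x = parent[x]
        return x
    for (u, v) in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
    sizes = {}
    for x in range(n):
        r = find(x)
        sizes[r] = sizes.get(r, 0) + 1
    return max(sizes.values()) / n

rng = random.Random(42)
n, kmean = 200, 2.0
# twenty independent networks with identical parameters
fractions = [largest_component_fraction(n, er_edges(n, kmean / n, rng))
             for _ in range(20)]
spread = max(fractions) - min(fractions)
```

With only 200 nodes the giant-component fraction varies noticeably from run to run; repeating with larger `n` shows the spread shrinking, which is exactly why averaging over repetitions works.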

There are many more properties of components we could explore, but we'll stop here: Newman [New10] presents many more calculations, for example about how the distribution of component sizes changes as edges are added.

There's an important point to make about all we've said above. You'll have noticed that a lot of the arguments relied on averaging, for example in identifying the *average* (mean) degree as greater than 1, or finding the *expected* size of the giant component. You might have wondered whether these sorts of calculations would be possible if for whatever reason we weren't able to do averaging.

Averaging works well for large networks: indeed, for really large networks we *have* to rely on statistical techniques, as all the details will generally be unavailable. And it's certainly the case that a lot of phenomena of interest for complex networks (and complex processes) depend strongly on these statistical properties, with only very weak dependence on the details. This means we can often ignore the fine structure, the **micro-scale structure**, of a network and treat networks as instances of classes defined by their **macro-scale structure**, the high-level summary statistics. Indeed, this is the basis for the techniques for managing variance by repetition that we'll see later when we scale-out our simulations.

*But*. (There was obviously a *but* coming.) There are also examples in which fine structure *does* matter – and even more cases where variations or irregularities in the structure make a huge difference. We'll see examples of these later, but an easily-understood example is the way an epidemic spreads on a network with communities of more-than-averagely-connected nodes: easily within communities, but with more difficulty between them because of the lesser connectivity. This is true even for networks with the same mean degree: the modular structure changes the process' behaviour.

The ER networks are special not because they're random – lots of networks have randomness – but because they're *so perfectly* random. They have, on average (that word again...), no fine structure to worry about, and so arguments based on averaging work, both for properties like the degrees of nodes and also for repeating experiments over different networks with the same parameters.

What about for more complex situations? It turns out that the other main class of networks, the powerlaw networks, have similar (but different) regularities that can similarly be exploited. There are other cases that don't have such nice features, and – while we can sometimes fall back on more powerful mathematical techniques, such as those associated with generating functions – we'll often be placed in situations where only extensive and careful simulation will get us anywhere. And simulation often requires an understanding of how the network is put together at a macro level as well as some understanding at least of the micro level, so the mathematical and computational views remain entwined.


Networks consist of nodes connected by edges. We've already looked at the notion of a path in terms of providing a "route to follow" to get from one node to another. We can look at paths between pairs of nodes to see whether they exist – is it possible to navigate from one node to the other? – and find paths of different lengths, including a possibly unique shortest path. We also considered one way of raising this local property to the global network level in order to find the network's diameter: in the network as a whole, what is the longest shortest path between *any* pair of nodes?

There's another such global question related to paths: is it always possible to find a path between any pair of nodes in the network? Clearly there's a major difference between networks for which the answer is yes, and other networks: in the former case, while it may be *hard* to find a path between two nodes, it will always be *possible*; in the latter case, some attempts at navigation are doomed to failure.

A network for which there is always a path between any pair of nodes is called **connected**. Connectivity is the property that says that navigation is always possible.

How do we determine if a network is connected? At some level we need to check that paths exist between all pairs of nodes, but that's going to be extremely expensive for large networks. Fortunately there's a simpler way, and even more fortunately `networkx` provides it built-in.
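That simpler way is, at heart, a single graph traversal: start a breadth-first search from any node, and the network is connected exactly when the search reaches every node. A minimal sketch of the idea, using a plain dictionary-of-sets adjacency representation (so this `is_connected()` is our own illustrative version, not the `networkx` one):

```python
from collections import deque

def is_connected(adjacency):
    """Check connectivity with one breadth-first traversal.
    adjacency maps each node to the set of its neighbours."""
    nodes = list(adjacency)
    if not nodes:
        return True
    seen = {nodes[0]}
    frontier = deque([nodes[0]])
    while frontier:
        n = frontier.popleft()
        for m in adjacency[n]:
            if m not in seen:
                seen.add(m)
                frontier.append(m)
    # connected exactly when the traversal reached every node
    return len(seen) == len(nodes)

# a path 0-1-2 is connected; an isolated extra node disconnects it
path = {0: {1}, 1: {0, 2}, 2: {1}}
assert is_connected(path)
path[3] = set()
assert not is_connected(path)
```

One traversal touches each edge at most twice, so the check is linear in the size of the network rather than quadratic in the number of node pairs.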

In [1]:

```
import networkx
import numpy
import itertools
import cncp
import matplotlib as mpl
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
import matplotlib.cm as cmap
import seaborn
```

We can use `networkx`'s `is_connected()` function to test the network's connectivity:

In [2]:

```
l = cncp.lattice_graph(10, 10)
print 'Lattice connected? {c}'.format(c = networkx.is_connected(l))
```

In [3]:

```
l.add_node(9999)
print 'Lattice with extra node connected? {c}'.format(c = networkx.is_connected(l))
l.add_edge(9999, 1)
print 'Did the new edge re-connect things? {c}'.format(c = networkx.is_connected(l))
```

We can **disconnect** it by removing edges, for example by "snipping off the corner":

In [4]:

```
l.remove_edges_from([ (0, 1), (0, 10) ])
print 'Still connected? {c}'.format(c = networkx.is_connected(l))
```

This works because we happen to know the way the nodes are labelled by `lattice_graph()`, so we know which edges we need to remove. We could also have removed a band of edges across the centre of the lattice, or on a diagonal: as long as we interrupt the path between *any one pair* of nodes, the network will no longer be connected.

These ideas work with larger groups of nodes as well. For example, suppose we place two networks "side by side", having edges internally but none between them:

In [5]:

```
# create two lattices
l1 = cncp.lattice_graph(5, 5)
l2 = cncp.lattice_graph(5, 5)
# re-label the second lattice so that the node labels will be unique
l2p = networkx.relabel_nodes(l2, lambda n: n + 1000)
# combine the two lattices together to form a single network
l = networkx.compose(l1, l2p)
print 'Two-lattice network connected? {c}'.format(c = networkx.is_connected(l))
```

Notice what we did to make this work:

- we built the two networks independently;
- then we re-labelled one of them to make the node labels unique; and
- then composed them together.

Our lattice-creation function always labels nodes in the same way in the networks it creates, so after the first step we have two networks with a common set of node labels. If we'd simply composed these networks together as-is, `networkx` would have assumed that two nodes with the same label were *the same node* and would have combined them – and then combined all the edges too. We'd have ended up with a single lattice! By re-labelling the second network's nodes we ensure they're recognised as distinct, and therefore when we combine the two networks we get a network with two lattices "side by side" and no edges between them.

Adding a single edge between nodes in the two lattices is of course enough to connect the network:

In [6]:

```
l.add_edge(0, 1000)
print 'Two-lattice network connected with extra edge? {c}'.format(c = networkx.is_connected(l))
```

In [7]:

```
l.remove_node(1000)
print 'Is the network still connected after removing a critical node? {c}'.format(c = networkx.is_connected(l))
```

And of course one of the "lattices" is now missing a node.

Let's return to the lattices we used to create the network above. Each of the lattices was itself a network, which we then joined together to form the overall lattices-side-by-side network. But we can also observe that – in this case, although not necessarily – the two lattices were themselves connected. It was possible to go from any node in one lattice to any node *in the same lattice*; when we put them side-by-side, this stopped being the case; and then when we added an edge it became possible to go from any node in one lattice to any node *in either lattice*.

So after we placed the lattices side-by-side we had a network with two **sub-networks**, each of which was connected, but the network taken together was disconnected. This property of being a connected sub-network of a larger structure is called being a **component** (or sometimes a **connected component**, although that's a bit tautologous). When we connected the two components together we created a single connected network, a single component.

Each component is an "island" of connectivity. Navigation is possible "on the island", but impossible "off the island". The number of components in a network is a measure of how many "islands" there are. We can use `networkx` to count both their number and their size:

In [8]:

```
print "Newly-split network as {c} components".format(c = networkx.number_connected_components(l))
# compute the sizes of the components
cs = list(networkx.connected_components(l))
for i in range(len(cs)):
print 'Component {i} contains {n} nodes'.format(i = i, n = len(cs[i]))
```

The components don't come out in any guaranteed order, so we can use `max()` or `sorted()` to explicitly put them into the right order:

In [9]:

```
print 'Largest component has {n} nodes'.format(n = len(max(networkx.connected_components(l), key = len)))
```

The significance of components really becomes clear when we consider different ways of generating networks, especially using random processes. Many such processes don't actually guarantee to generate a connected network: they add edges between nodes randomly, so it's entirely possible that some nodes may be isolated or that two or more components may form. If this is important for an application, we need to be careful to make sure the network is connected *before* we start work on it. There are two basic ways to do this:

- we can check whether the network is connected using `is_connected()`, and if not throw it away and start again; or
- we can take the largest component from the network as-is.

Neither method is necessarily better. For the first, it might be that we *never* get a connected network because of some combination of parameters to the generator (for example the network has three nodes and we only ever add one edge: extreme, but you get the idea). For the second, we'll necessarily end up with a network that has fewer nodes than we thought: possibly less than half, depending on exactly how many components the generator gives rise to. So which method we adopt depends on the application, and we'll have to think carefully about the constraints of each scenario we explore.
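Both strategies are easy to express in code. The sketch below uses small pure-Python stand-ins (`er_adjacency` and `components` are hypothetical helpers written for this illustration, not part of `networkx` or `cncp`), and bounds the re-sampling loop so a badly-chosen edge probability can't loop forever:

```python
import random

def er_adjacency(n, phi, rng):
    """Adjacency sets for an ER network of n nodes and edge probability phi."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < phi:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def components(adj):
    """List the components of a network as sets of nodes."""
    seen, cs = set(), []
    for start in adj:
        if start not in seen:
            comp, frontier = {start}, [start]
            while frontier:
                n = frontier.pop()
                for m in adj[n]:
                    if m not in comp:
                        comp.add(m)
                        frontier.append(m)
            seen |= comp
            cs.append(comp)
    return cs

rng = random.Random(1)

# strategy 1: re-sample until connected (bounded, in case phi is too small)
for _ in range(100):
    adj = er_adjacency(50, 0.2, rng)
    if len(components(adj)) == 1:
        break

# strategy 2: keep only the largest component of whatever we get
adj2 = er_adjacency(50, 0.05, rng)
giant = max(components(adj2), key = len)
```

Strategy 1 preserves the node count but biases the sample towards connected networks; strategy 2 keeps the first network drawn but shrinks it, which is the trade-off described above.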

*quite* components?

In [10]:

```
# build left network
l = networkx.Graph()
l.add_edges_from([ (1, 2), (2, 3), (3, 4), (4, 5) ])
# build right network
r = networkx.Graph()
r.add_edges_from([ (1, 2), (1, 3), (1, 4), (1, 5) ])
# create the figure
fig = plt.figure(figsize = (10, 5))
# draw left network
ax1 = fig.add_subplot(1, 2, 1) # one row of two columns, first box
ax1.grid(False) # no grid
ax1.get_xaxis().set_ticks([]) # no ticks on the axes
ax1.get_yaxis().set_ticks([])
networkx.draw_networkx(l, ax = ax1, node_size = 100)
# draw right network
ax2 = fig.add_subplot(1, 2, 2) # one row of two columns, second box
ax2.grid(False) # no grid
ax2.get_xaxis().set_ticks([]) # no ticks on the axes
ax2.get_yaxis().set_ticks([])
networkx.draw_networkx(r, ax = ax2, node_size = 100)
```

In [11]:

```
print 'Left network diameter {ld}.'.format(ld = networkx.diameter(l))
print 'Right network diameter {ld}.'.format(ld = networkx.diameter(r))
```

Clearly it's "quicker to get around" the right-hand network. So what would be the "quickest" network we could imagine? The minimum case is when the diameter of the network is 1. Remembering the definition of diameter as the longest shortest path, this would mean that the shortest path between any pair of nodes was 1 – or, to put it another way, every node was adjacent to every other. Such a network is called a **clique** (which rhymes with "speak", *not* with "click"). In the graph theory literature, the clique of $n$ nodes is referred to as $K_n$.

We can create cliques algorithmically:

In [12]:

```
# create a clique of five nodes
k5 = networkx.Graph()
for (n, m) in itertools.combinations(range(5), 2):
    k5.add_edge(n, m)
# draw the clique
fig = plt.figure(figsize = (5, 5))
ax = fig.gca()
ax.grid(False) # no grid
ax.get_xaxis().set_ticks([]) # no ticks on the axes
ax.get_yaxis().set_ticks([])
networkx.draw_networkx(k5, node_size = 100)
plt.title('$K_5$')
_ = plt.show()
```

If you're not familiar with Python's `itertools` package, it provides a whole suite of useful ways to combine sets of data. `itertools.combinations()` takes a collection `l` and a number `i` and produces all combinations of `i` objects taken from `l` – in this case all pairs of nodes, with each pair appearing exactly once.
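We can check that claim directly – each unordered pair appears exactly once, so a clique on $n$ nodes picks up exactly $n(n - 1)/2$ edges:

```python
import itertools

# all pairs from four items, each unordered pair exactly once
pairs = list(itertools.combinations(range(4), 2))
assert pairs == [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

# so K_10 has 10 * 9 / 2 = 45 edges
assert len(list(itertools.combinations(range(10), 2))) == 45
```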

`networkx` will, unsurprisingly, create cliques directly:

In [13]:

```
fig = plt.figure(figsize = (5, 5))
ax = fig.gca()
ax.grid(False) # no grid
ax.get_xaxis().set_ticks([]) # no ticks on the axes
ax.get_yaxis().set_ticks([])
networkx.draw_networkx(networkx.complete_graph(10), node_size = 100)
plt.title('$K_{10}$')
_ = plt.show()
```

The fraction of the edges a network actually has out of the maximum number it *could* have is sometimes referred to as its **density**. It isn't a measure of connectivity *per se*, but can provide a useful metric for deciding whether a network is well-connected or sparse – concepts we'll come back to later.
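Density is simple to compute: for an undirected simple network it's the number of edges divided by the $N(N - 1)/2$ edges a clique of the same size would have (this is also what `networkx.density()` returns). A quick sketch:

```python
def density(n_nodes, n_edges):
    """Fraction of the possible edges that are actually present
    in an undirected simple network."""
    possible = n_nodes * (n_nodes - 1) // 2
    return n_edges / possible

# a clique of 5 nodes contains all 10 possible edges
assert density(5, 10) == 1.0

# a 10x10 lattice has 180 edges, so it's quite sparse
assert density(100, 180) == 180 / 4950
```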

In the lattices-side-by-side example above we had two components that we connected with a single edge. Suppose we scale things up a bit, to a large network with several large components. Suppose we then add a small number of edges between the components, thereby connecting the network. We now have a connected network and a single component: is there anything else to say about the matter?

Well clearly there is. The sub-networks are no longer components, it's true, but they're still recognisably more connected *within* themselves than *between* themselves. We refer to these almost-components as **communities** or **modules**.

While the idea of being a component is very clear-cut, being a community is a lot more delicate. When is a collection of nodes "connected enough" internally and "not connected enough" externally to be termed a community? Can we always identify the communities of a network? As the number of edges increases, and the number of paths between pairs of nodes in two communities increases, at what point do they cease to be two communities and become one?

These are all interesting questions, which we'll return to later: the notion of community-finding is a very active research topic. For the time being, it's sufficient to observe that the component (or community) structure of a network might have an influence on its properties, and in particular on how processes operate over it.


So far we've looked at ER networks from a practical perspective, through simulation. This **numerical** approach is typical for computer scientists, and is very powerful. It has the enormous advantage of working for *any* network using the *same* set of techniques (and code). It has the enormous disadvantage, however, of often providing very little insight as to *why* the answer is as it is: why, for example, does an ER network have the bell-shaped degree distribution that it has, and what does this imply?

Often the numerical approach is the best we can hope for, especially in the face of irregular or otherwise "awkward" networks. But the ER network has a very regular construction process: surely we might expect to be able to do better?

An alternative to simulation in such cases is to take an **analytical** approach, to try to find closed-form mathematical expressions that answer the key questions we want to pose. This approach only works in some cases – although these cases are vitally important and interesting, and it turns out that there are other analytic techniques that work for a still broader class of networks – but it has the advantage of not requiring simulation that may be time-consuming and subject to various statistical constraints: analysis provides precise, uniform answers.

In this chapter we'll look at some properties of ER networks from this perspective and derive mathematical expressions for them. We'll focus only on those properties that are most important from a practical perspective: the degree distribution and the mean degree. (The Wikipedia page for ER networks describes – but doesn't derive – lots of other properties of largely theoretical interest.) We'll do this from first principles and at some length, to demonstrate the sorts of mathematical arguments that'll be common in what's to come.

We'll start by returning to the degree distribution, the numbers of nodes with given numbers of immediate neighbours in the network. We observed earlier that we can interpret the degree distribution in terms of probability: what is the probability of a node $v$ chosen at random having a given degree $k$? In normal probability notation this would be written $P(deg(v) = k)$, the probability that $deg(v)$, the degree of $v$, is equal to $k$. For brevity we will usually write this as $p_k$. Taken over the whole network, this will yield a degree distribution, where the probability of all possible degrees in the network sum to one: $\sum_k p_k = 1$.

So what is the degree distribution for an ER network? At first acquaintance, many non-mathematicians would argue something like this: the generating process adds an edge between any pair of nodes with a fixed probability $\phi$, with every edge (and every node) treated equally. Therefore, we'd expect every node to have roughly the same degree as every other – a degree distribution that's *uniform* – consistent with the uniformity of the generating process.

Does that sound reasonable? – it did to me when I first made this argument. But we know from the simulation we did earlier that this *isn't* what happens: we actually get a *normal* distribution of degrees, not a uniform one. (If you need more convincing about this, read the rest of this section and then skip to the epilogue at the end of the chapter.) Clearly there must be another way of thinking about the process.

Let's re-phrase the question: in an ER network, how does a node end up having degree $k$? We can answer this by looking back at the construction process, where we iterated through all the pairs of nodes and added an edge between them with a given, fixed, probability $\phi$ (which we denoted `pEdge` in the code). So each node *could in principle* have been connected to $N - 1$ other nodes: that's the maximum degree it could have, since we've excluded the possibility of self-loops or parallel edges. For each of these potential edges, we essentially tossed a coin to decide whether the edge was included or not – except that the "coin" came down "heads" with a probability $\phi$, and therefore came down "tails" with a probability $(1 - \phi)$ (since there are only two alternatives, and their probabilities have to sum to 1). Let's refer to each such decision – add an edge or don't – as an *action*. For each node we perform $N - 1$ actions, one per potential edge, and for a node to have degree $k$ we have to perform $k$ "add" actions and $(N - 1 - k)$ "don't-add" actions. We can perform these actions in any order.

How many ways are there to perform this sequence of actions? Suppose we have a bag of $a$ actions: how many ways are there to select $b$ actions from the bag? The answer is given by the **binomial coefficient**, denoted $\binom{a}{b}$:

$$ \binom{a}{b} = \frac{a!}{b! \, (a - b)!} $$
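As a quick numerical check of this formula (a Python 3 sketch; `math.comb` is available from Python 3.8 onwards):

```python
import math

def binomial(a, b):
    """Number of ways to choose b items from a bag of a: a! / (b! (a - b)!)."""
    return math.factorial(a) // (math.factorial(b) * math.factorial(a - b))

# 10 ways to pick 2 actions from a bag of 5
assert binomial(5, 2) == 10

# matches the standard library's own implementation
assert binomial(20, 7) == math.comb(20, 7)
```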

So, returning to our original question, we have $\binom{N - 1}{k}$ ways to perform $k$ "add" actions from a possible $N - 1$ actions, with the remainder being "don't-add" actions. This is the number of possible sequences that, for a given node, can result in that node having degree $k$. From elementary probability theory, to work out the probability of a sequence of actions happening we multiply-out the probabilities of the individual actions: "this *and* this *and* this" and so forth. So for each sequence of $k$ add actions and $(N - 1 - k)$ don't-add actions we multiply the probabilities of each action together to get the probability of them *all* happening, and then multiply this compound probability by the number of ways these actions can happen so as to still give us the $k$ edges we want.

Putting all this together, what is the probability that a node $v$ taken at random from an ER model consisting of $N$ nodes and edge probability $\phi$ will have degree $k$? For a node to have degree $k$ we need to perform a sequence of actions consisting of $k$ add actions (each occurring with probability $\phi$ ); *and* we need $(N - 1 - k)$ don't-add actions (occurring with probability $1 - \phi$); *and* there are $\binom{N - 1}{k}$ ways in which these actions can be arranged. Expressing this as maths, we get:

$$ p_k = \binom{N - 1}{k} \, \phi^k \, (1 - \phi)^{N - 1 - k} $$

This is a distribution well known in statistics as the **binomial distribution**. It's important to note that $\phi$ is a constant, and that each add action is independent of each other add action: it doesn't get any easier to add edges over time. (If this seems like an obvious thing to say, we only say it because this turns out to be different to the approach we'll take to BA networks later.)

Given that we are dealing with large graphs, we will simplify the $N - 1$ term to $N$, since it makes very little difference as $N \rightarrow \infty$, yielding:

$$ p_k = \binom{N}{k} \, \phi^k \, (1 - \phi)^{N - k} $$

What happens as $N$ gets larger and larger? Clearly $\binom{N}{k}$ also gets larger and larger (there are more and more ways to choose the $k$ edges), and $(1 - \phi)^{N - k}$ gets smaller and smaller (since $1 - \phi$ is by definition less than 1), while $\phi^k$ stays the same size. What happens therefore depends on whether the rising term or the falling term dominates in the limit, which isn't blindingly obvious – but fortunately the answer *is* known: the binomial distribution converges to another distribution, the **Poisson distribution**, as $N \rightarrow \infty$. The Poisson distribution is basically the normal distribution for systems built from discrete events, and is given by:

$$ p_k = \frac{(N \phi)^k \, e^{-N \phi}}{k!} $$

While this form is easier to work with, it's a lot less suggestive. The binomial form is probably to be preferred as a way of thinking about the distribution simply because each of the factors within it relates to a real, concrete phenomenon: add actions, don't-add actions, their probabilities (summing to 1), and the number of ways of combining them.

It's also worth noting that, in using an analytical approach, we were able to appeal to lots of known results in mathematics about the number of possible combinations of actions, or the ways functions behave in the limit – and with no need to write any code or burn any computer time.
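We can also check the convergence claim numerically: hold the mean degree $N\phi$ fixed while $N$ grows, and the binomial and Poisson distributions draw together. (A Python 3 sketch; `math.comb` needs Python 3.8 or later.)

```python
import math

def binomial_pk(N, phi, k):
    """Binomial degree distribution for an ER network of N nodes."""
    return math.comb(N, k) * phi**k * (1 - phi)**(N - k)

def poisson_pk(kmean, k):
    """Poisson limit of the binomial, parameterised by the mean degree."""
    return kmean**k * math.exp(-kmean) / math.factorial(k)

# hold the mean degree at 5 while N grows, and track the largest
# pointwise gap between the two distributions over small k
gaps = []
for N in [100, 1000, 10000]:
    phi = 5.0 / N
    gaps.append(max(abs(binomial_pk(N, phi, k) - poisson_pk(5.0, k))
                    for k in range(20)))
```

Each tenfold increase in $N$ shrinks the gap by roughly a factor of ten, as the limit argument suggests.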

In [1]:

```
import math
import numpy
import matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
import seaborn
```

In [3]:

```
def poisson( n, pEdge ):
    '''Return a model function for the Poisson distribution with n nodes and
    edge probability pEdge.

    :param n: number of nodes
    :param pEdge: probability of an edge being added between a pair of nodes'''
    def model( k ):
        return (pow(n * pEdge, k) * math.exp(-n * pEdge)) / math.factorial(k)
    return model

fig = plt.figure()
plt.xlabel("$k$")
plt.ylabel("$p_k$")
plt.title('Poisson degree distribution, $N = {n}, \phi = {phi}$'.format(n = 1000, phi = 0.05))
plt.plot(xrange(100), map(poisson(1000, 0.05), xrange(100)))
_ = plt.show()
```

The graph is roughly symmetric around $k = 50$, suggesting that this is the mean. Looking at the parameters of the distribution, however, we plotted 1000 nodes with an edge probability of 0.05, which multiplied out also gives 50. That's suggestive, but we need to *prove* that it's the case *always*.

First let's re-visit the idea of a mean. The mean of any random variable can be written as the sum of each value the variable can take, multiplied by the probability of it taking that value. For the mean degree, we therefore have:

\begin{align} \langle k \rangle &= 1 \times p_1 + 2 \times p_2 + \cdots \\ &= \sum_{k = 1}^N k \, p_k \end{align}

(The maximum node degree is actually $N - 1$ since we're looking at simple networks, so we only really need to sum $k$ up to $N - 1$ rather than $N$ – but that just means that $p_N = 0$, so the sum works out anyway.) For the Poisson distribution underlying an ER network, we can code-up this definition using the formula above to work out the probability for each $k$. If $N = 1000$ and $\phi = 0.05$ as above, then:

In [4]:

```
kmean = 0
p = poisson(1000, 0.05)
for k in xrange(1, 100):
    kmean = kmean + k * p(k)
print 'Computed mean degree = {kmean}'.format(kmean = kmean)
```

Close enough. But we can do better: we can obtain an analytic result, computing a formula for the mean degree given $N$ and $\phi$. Equating the two definitions above, we get:

$$ \langle k \rangle = \sum_{k = 1}^N k \, \binom{N}{k} \, \phi^k \, (1 - \phi)^{N - k} $$

So we need to find out the value of the sum on the right-hand side. To do this we can use the **binomial theorem**, which states that:

$$ (p + q)^n = \sum_{d = 0}^{n} \binom{n}{d} \, p^d \, q^{n - d} $$

Now, if we differentiate both sides with respect to $p$, we get:

\begin{align*} n(p + q)^{n - 1} &= \sum_{d = 1}^{n} \binom{n}{d} \, d \, p^{d - 1} \, q^{n - d} \end{align*}

and multiplying both sides by $p$:

\begin{align*} np(p + q)^{n - 1} &= \sum_{d = 1}^{n} d \binom{n}{d} \, p^d \, q^{n - d} \end{align*}

The right-hand side now looks very like the form we're looking for from above. If we express it in terms of $N$, $\phi$, and $k$ to get the notation straight, and let $q = 1 - p$, then:

\begin{align*} N\phi(\phi + (1 - \phi))^{N - 1} &= \sum_{k = 1}^{N} k \binom{N}{k} \, \phi^k \, (1 - \phi)^{N - k} \\ N\phi &= \sum_{k = 1}^{N} k \binom{N}{k} \, \phi^k \, (1 - \phi)^{N - k} \\ &= \langle k \rangle \end{align*}

So the mean of the binomial degree distribution is given by $N \phi$. Looking at the equations, we can see that $N$ and $\phi$ are the only parameters: we need to know them, *and only them*, to compute the distribution for any value of $k$. We can therefore say that $N$ and $\phi$ *completely characterise* the distribution.

There is another implication of this. Since $\langle k \rangle = N\phi$, for large $N$ we can make use of the fact that the binomial distribution converges to the Poisson distribution and re-write the probability distribution for an ER network in terms of the network's mean degree:

$$ p_k = \frac{\langle k \rangle^k e^{-\langle k \rangle}}{k!} $$

This means that given any two of $N$, $\phi$, and $\langle k \rangle$, we can compute the third, and we have all we need to completely characterise the degree distribution of an ER network. Put still another way, if we want an ER network with a specific number of nodes and a specific mean degree, we can compute the link probability $\phi = \frac{\langle k \rangle}{N}$ we need to construct it.
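As a sanity check on that last formula, we can build a network with a chosen mean degree and measure what we actually get. Again a small pure-Python stand-in for the generator (the `er_mean_degree` helper is hypothetical, written for this illustration):

```python
import random

def er_mean_degree(n, phi, rng):
    """Sample an ER network and return its mean degree (twice the
    number of edges divided by the number of nodes)."""
    edges = sum(1 for i in range(n) for j in range(i + 1, n)
                if rng.random() < phi)
    return 2.0 * edges / n

rng = random.Random(5)
n, target = 1000, 10.0
phi = target / n              # the link probability the formula gives us
kmean = er_mean_degree(n, phi, rng)
# kmean should land close to the target mean degree of 10
```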

Earlier we asserted that many people, on first seeing the generating process for the ER model, assume that it will result in a uniform degree distribution. I certainly did. Since it's such a common reaction, it's perhaps worth exploring a little why it's also wrong.

The argument for a uniform degree distribution goes roughly as follows: since the edge probability is independent for every edge, we'd expect that, at each node, we select roughly the same number of edges to add, and therefore there's no reason for one node to be preferred over another, so they should all have roughly the same degree.

The problem here is that it takes a statement about *edges* and subtly converts it into a statement about *nodes*. Just because we select edges with a constant probability doesn't imply that we do so uniformly at the node level – so uniformly, in fact, that every node ends up having *exactly* the same number of edges. Put that way, a uniform degree distribution actually sounds rather unlikely! The process only says that, *over the graph as a whole*, edges are added with constant probability: it does not say anything about the *local* behaviour of edge addition around an individual node. It is this that allows for the possibility of non-uniform distribution.

This observation – that global behaviour, and typically global regularity, doesn't lead to local regularity – is perhaps the single most important thing to bear in mind about complex networks. It's tempting to think that large-scale regularity emerges from lots of small-scale regularity, but that isn't necessarily the case: the small scale could be irregular, but the irregularities could even out. Conversely, it's tempting to think that something that looks regular and well-behaved on the outside has component pieces that are regular and well-behaved – and again that isn't necessarily the case. The lesson here is that things can be more complex than they seem. On the other hand, it also means that we can often ignore local noise and make use of global properties, as long as we're careful.

The description we used for the ER generator is an example of what mathematicians call a *Bernoulli process*, where we look at the sequence of actions needed to generate a given outcome and compute how many ways there are for those actions to occur at random. Bernoulli processes occur whenever we encounter actions being performed one after the other according to some random driver, and the argument above is completely typical of how one deals with them.


Let's now look at the best-understood complex network. If there's a poster child for network science, it's the "random graph", or more properly, the *Erdős-Rényi* or *ER network*. We mentioned Erdős and Rényi in the introduction as the mathematicians who first gave shape to the idea that large networks with essentially random structure might still show some useful statistical properties that made them more comprehensible. In this chapter we'll see what these regularities are. ER networks are complex enough to allow us to demonstrate techniques that will apply in other circumstances, but are simple and well-behaved enough to make the analysis fairly straightforward.

We'll explore the ER network in some detail, both through simulation and through mathematical analysis. We'll do it this way for a good reason: in the real world, networks often can't be guaranteed to have exactly the properties that the mathematical techniques require, while computer simulation really needs to be driven by an understanding of what's going on in a network at a fundamental level and of how its mathematical features contribute to that behaviour. For these reasons it's not safe only to understand how to simulate networks: you need to be able at least to follow the mathematical analysis as well. Conversely, understanding real networks and applications requires the techniques of simulation as well as analysis.

We'll start by building ER networks using `networkx` and explore some of the properties that we developed earlier. We'll then look at the same properties (and more) from a more mathematical perspective, and relate the code to the maths to show how the two views interrelate.

To build an Erdős-Rényi (or ER) network with $N$ vertices, we proceed as follows:

- Build a graph $G = (V, E)$ with $N$ vertices and no edges, so $|V| = N$ and $E = \emptyset$
- For each pair of vertices $v_1, v_2 \in V$ with $v_1 \neq v_2$, add an edge $(v_1, v_2)$ to $E$ with probability $\phi$

That's it! – a very simple process for constructing what turns out to be a very interesting class of networks. There are four things to notice here, all of which turn out to be very important for what follows.

Firstly, the ER model has two parameters: the number of nodes in the network $N$, and the probability $\phi$ of an edge occurring between any given pair of nodes. The combination of these two parameters defines a **class** of networks, whose individual members differ in exactly which pairs of nodes happen to be connected at the (random) connection stage.

Secondly, the probability of an edge appearing between any pair of nodes is an independent event: it doesn't matter whether a node is already heavily connected or not, the chances of its being linked to any other node is just $\phi$ – and this probability doesn't change over time.

Thirdly, we disallow both self-loops and parallel edges, thereby creating a simple network.

Fourthly, we build the network "all at once", with all its nodes and all its edges in place before we do any further analysis.

To build such a network, we need to turn the description into code. We can do this in two ways using `networkx`:

- by implementing the construction process ourselves; or
- by using the built-in generator function.

The latter is clearly entirely adequate in practice, but for demonstration purposes, we'll do both.

In [1]:

```
import networkx
import math
import numpy
import matplotlib as mpl
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
import matplotlib.colors as colors
import matplotlib.cm as cm
import seaborn
from JSAnimation import IPython_display
from matplotlib import animation
```

In [2]:

```
def erdos_renyi_graph_from_scratch( n, pEdge ):
"""Build the graph with n nodes and a probability pEdge of there
being an edge between any pair of nodes.
:param n: number of nodes in the network
:param pEdge: probability that there is an edge between any pair of nodes
:returns: a network"""
g = networkx.empty_graph(n)
# run through all the possible edges
ne = 0
for i in xrange(n):
for j in xrange(i + 1, n):
if numpy.random.random() <= pEdge:
ne = ne + 1
g.add_edge(i, j, { 'added': ne })
return g
```

(We use `n` for $N$ and `pEdge` for $\phi$.) Notice the way we run through the pairs of nodes so that we only try to generate an edge once between each pair. This works because the graph we're building is undirected and we want at most one edge between each pair of nodes *in either order*. (There are also directed ER networks: to build one of those we'd want to try each pair *in each order* to allow for directionality.)

The key `networkx` method here is `add_edge`, which adds an edge between a pair of nodes. Its optional third parameter is a dictionary of attribute/value pairs that are associated with the edge, and we use this to record the order in which the edge was added so we can visualise the growth of the network below.

We can then use this function to build an ER network, for example with 5000 nodes and a 5% probability of there being an edge between any pair of nodes:

In [3]:

```
g_from_scratch = erdos_renyi_graph_from_scratch(5000, 0.05)
```

`networkx` has a built-in "generator" function for ER networks that we can use to build a graph with the same properties as above:

In [4]:

```
g_from_generator = networkx.erdos_renyi_graph(5000, 0.05)
```

`g_from_scratch` and `g_from_generator` are both instances of the class of ER networks. They aren't *the same network*, though, even though they have the same parameters, because they've been created by stochastic processes and so will have different connections between their nodes. However, they will both share certain statistical characteristics that we'll come back to after we look at the growth processes in more detail.

It can sometimes be useful to see how these graphs grow, by means of animation. We can use `matplotlib` to draw a graph progressively, one edge at a time, showing how the degrees of the nodes evolve as the edge set grows. We can then use the `JSAnimation` plug-in to generate an in-line animation, or save the animation to a file and link to it.

`matplotlib`'s animation functions are quite involved. The core is a function that creates a figure for each frame of the animation, which `matplotlib` then links together like the pages of a flick-book. There's quite a lot of set-up involved too, though: the following code is heavily commented to (hopefully) show what's going on.

In [26]:

```
def animate_growing_graph( g, edges, fig, ax = None, pos = None, cmap = None, **kwords ):
"""Animate the growth of a network, showing how edges are added and
how node degrees evolve. Slow if done for a large graph. Returns a
matplotlib animation object that can be saved to a file for later
or shown in-line in a notebook.
:param g: the network
:param edges: the edges, in the order they were added
:param fig: the figure to draw into
:param ax: (optional) the axes to draw into (defaults to main figure axes)
:param pos: (optional) layout for the network (default is to use the spring layout)
:returns: an animation object"""
# fill in the defaults
if ax is None:
# figure main axes
ax = fig.gca()
if pos is None:
# layout the network using the spring layout
pos = networkx.spring_layout(g, iterations = 100, k = 2/math.sqrt(g.order()))
if cmap is None:
cmap = cm.hot
if ('frames' not in kwords.keys()) or (kwords['frames'] is None):
# animate at one second per edge
kwords['frames'] = int(len(edges) * (1.0 / kwords['interval']))
# manipulate the axes, since this isn't a data plot
ax.set_xlim([-0.2, 1.2]) # axes bounded around 1
ax.set_ylim([-0.2, 1.2])
ax.grid(False) # no grid
ax.get_xaxis().set_ticks([]) # no ticks on the axes
ax.get_yaxis().set_ticks([])
# work out the colour map for the degrees of the network, picking
# colours linearly from the length of the colour map
ds = g.degree().values()
max_degree = max(ds)
min_degree = min(ds)
norm = colors.Normalize(vmin = min_degree, vmax = max_degree)
mappable = cm.ScalarMappable(norm, cmap)
# We now create all the graphical elements we need for the animation as matplotlib
# lines and patches. Essentially this defines what's in the final frame of the animation.
# We'll then make everything invisible and, as the animation progresses, make the elements
# appear in the right order. It's a lot faster to do it this way rather than re-building
# each frame from nothing as we go -- although that works too.
# generate node markers based on positions
nodeMarkers = dict()
nodeDegrees = dict()
for v in g.nodes_iter():
circ = plt.Circle(pos[v], radius = 0.02, zorder = 2) # place node markers at the top of the z-order
ax.add_patch(circ)
nodeMarkers[v] = circ
nodeDegrees[v] = 0
# build the list of edges as they were added
edgeMarkers = []
edgeEndpoints = []
for (i, j) in edges:
xs = [ pos[i][0], pos[j][0] ]
ys = [ pos[i][1], pos[j][1] ]
line = plt.Line2D(xs, ys, zorder = 1) # place edge markers down the z-order
ax.add_line(line)
edgeMarkers.append(line)
edgeEndpoints.append((i, j))
# work out the "time shape" of the animation
nFrames = kwords['frames'] # frames in the animation
framesPerEdge = max(int(nFrames / len(edges)), 1) # frames per edge
# add colourbar for node degree
kmax = max(g.degree().values())
cax = fig.add_axes([ 0.9, 0.125, 0.05, 0.775 ])
norm = mpl.colors.Normalize(0, kmax)
cb = mpl.colorbar.ColorbarBase(cax, cmap = cmap,
norm = norm,
orientation = 'vertical',
ticks = range(kmax + 1))
# initialisation function hides all the edges, colours all nodes
# as having degree zero
def init():
x = 1
for em in edgeMarkers:
em.set(alpha = 0)
for vm in nodeMarkers.values():
vm.set(color = mappable.to_rgba(0))
# per-frame drawing for animation
def frame( f ):
# frame number boundaries for various transitions in the animation "shape"
atEdge = int((f + 0.0) / framesPerEdge) # the edge we've reached with this frame
if framesPerEdge == 1:
a = 1
else:
a = ((f + 0.0) % framesPerEdge) / framesPerEdge
if atEdge < len(edgeMarkers):
edgeMarkers[atEdge].set(alpha = a)
if(a == 1):
(i, j) = edgeEndpoints[atEdge]
nodeDegrees[i] = nodeDegrees[i] + 1
nodeMarkers[i].set(color = mappable.to_rgba(nodeDegrees[i]))
nodeDegrees[j] = nodeDegrees[j] + 1
nodeMarkers[j].set(color = mappable.to_rgba(nodeDegrees[j]))
# return the animation with the functions etc set up
return animation.FuncAnimation(fig, frame, init_func = init, **kwords)
```

In [6]:

```
# build the network, which annotates the edges with their order of addition
er = erdos_renyi_graph_from_scratch(100, 0.03)

# pull the edges as a dict from edge to order
er_edges_dict = networkx.get_edge_attributes(er, 'added')

# return a list of edges in order of addition
er_edges = sorted(er_edges_dict.keys(),
                  key = (lambda e: er_edges_dict[e]))
```

We can then generate and show the animation:

In [29]:

```
fig = plt.figure(figsize = (8, 6))
anim = animate_growing_graph(er, er_edges, fig, frames = 100)
IPython_display.display_animation(anim, default_mode = 'once')
```

Out[29]: