Workshop 3: Inference

Overview

Aim of the session

  • We’ve seen several definitions of Poisson processes.

  • We can now generate our own point patterns, given the definition of a process.

  • Can we go the other way? Given a point pattern, can we estimate the intensity function that generated it?

  • We will cover parametric estimation of \(\lambda(t;\theta)\) in 1 dimension.

Outline

  • What is inference?

  • Deriving the Poisson process likelihood

  • Maximum likelihood estimation

  • Examples and representing uncertainty

Disclaimers, again.

  • I will be using R; feel free to reuse and adapt this code.

  • There is no requirement to use R - pick whatever language you are most comfortable in.

  • Focus on temporal point patterns here - core ideas extend to spatial processes. (I am deliberately leaving extensions for you!)

Inference for Poisson Processes

Inference is the process of using data to draw conclusions about a process beyond the realisation that you have available to you.

  • We have one (or more) point patterns, which are realisations from some point process.

  • We want to construct an intensity model for that point process.

  • We will consider the method of maximum likelihood in this lecture.

Suggested Poster Topic
  • Inference for self-exciting point processes.

  • Bayesian inference for point process models.

Review of Maximum Likelihood Inference

We have observations \(\mathbf{x} = (x_1, \ldots, x_n)\) and an intensity function \(\lambda(t;\theta)\), which is described by some parameters \(\theta\), e.g.:

\[\lambda(t ; \theta) = \theta,\]

\[\lambda(t ; \theta) = \exp\{\theta_0 + \theta_1 t\}.\]

We want to pick the parameters \(\theta = \hat \theta\) which make the observed point pattern most likely.

To do that, we write down an expression for the probability of our observed point pattern and find the combination of parameter values that maximises that expression.

\[\Pr( X | \theta) = \mathcal{L}(\theta ; X)\]

\[ \hat \theta = \underset{\theta \in \Theta}{\operatorname{arg\,max}} \ \ \mathcal{L}(\theta; X)\]

Likelihood of a Poisson Process

The likelihood of a dataset is the probability of observing that dataset, considered as a function of the model parameters.

We want to find the parameters \(\hat \theta\) that make the observed data most probable - this will be our point estimate of the parameter value(s).

In a Poisson process there are two random components:

  • How many events we see within the process, \(\Pr(N(A) = n)\);

  • Where those events are located on our observation window \(A \subseteq S\).

We are interested in the joint probability of seeing the particular number of points that we did and observing those points in the locations they happened to appear:


\[\mathcal{L}(\theta;X) = \Pr(N(A) = n \ \&\ X_1 = x_1 \ \&\ \ldots \ \&\ X_n = x_n).\]


Since \(\Pr(Y \cap Z) = \Pr(Y | Z)\Pr(Z)\), it follows that


\[ \mathcal{L}(\theta;X) = \Pr(N(A) = n)\Pr(X_1 = x_1 \ \&\ \ldots \ \&\ X_n = x_n \mid N = n).\]


Because event locations in a Poisson process are independent of one another, this can be further simplified:

\[ \mathcal{L}(\theta;X) = \Pr(N = n)\prod_{i=1}^{n}\Pr(X_i = x_i | N = n).\]

Recall that our integrated intensity or compensator is defined as follows:

\[\Lambda(A; \theta) = \int_{s \in A} \lambda(s;\theta) \mathrm{d}s. \]

Question
  • How can \(\Lambda(A)\) be interpreted in terms of:
      1. an area or volume;
      2. an event count?

We can therefore replace \(\Pr(N = n)\) with the Poisson pmf:

\[ \frac{(\Lambda(A;\theta))^n \exp\{-\Lambda(A;\theta)\}}{n!} \prod_{i=1}^{n}\Pr(X_i = x_i | N = n).\]

Question

How can we get an expression for the probability of observing an event at any given location?

Since event locations are independent of the total number of events, with density proportional to the intensity on \(A\), we can calculate the probability density function of an individual observation by scaling \(\lambda(s ; \theta)\) so that it integrates to 1 over \(A\):

\[ f_{X_i}(x_i) = \frac{\lambda(x_i; \theta)}{\Lambda(A; \theta)}.\]

For a homogeneous process, for example, this density is simply the uniform density \(1/|A|\).

Substituting this into the previous expression, we get that

\[ \mathcal{L}(\theta; x) = \frac{(\Lambda(A;\theta))^n \exp\{-\Lambda(A; \theta)\}}{n!} \prod_{i=1}^{n} \frac{\lambda(x_i; \theta)}{\Lambda(A; \theta)} \]

Cancelling redundant terms gets us to the Poisson process likelihood function:

\[ \mathcal{L}(\theta; x) = (n!)^{-1} \exp\{-\Lambda(A;\theta)\} \prod_{i=1}^{n}\lambda(x_i; \theta).\]

We can see that this function has two components: one concerned with the number of events that we see, and the other with the locations of those events on \(A\).

\[ \mathcal{L}(\theta; x) = \underbrace{(n!)^{-1} \exp\{-\Lambda(A;\theta)\}}_{\text{Number of Events}}\ \ \underbrace{\prod_{i=1}^{n}\lambda(x_i; \theta)}_{\text{Locations}}.\]

Question

What would happen if the term containing \(\Lambda(A;\theta)\) was not included in the likelihood?

Does this suggest why \(\Lambda(A;\theta)\) is sometimes called the compensator?

Maximising this likelihood function over \(\theta\) is equivalent to maximising the log-likelihood \(\ell\):

\[ \ell(\theta; x) = \log \mathcal{L}(\theta; x) = - \log(n!) - \Lambda(A; \theta) + \sum_{i =1}^{n} \log(\lambda(x_i;\theta)).\]
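
As a quick illustration of this formula, here is a minimal R sketch of the log-likelihood for a generic model. The names pp_llh, lambda_fun and Lambda_A are illustrative placeholders rather than functions from any package: lambda_fun should return \(\lambda(x_i;\theta)\) at each observed event and Lambda_A is the integrated intensity over \(A\).

# Minimal sketch: evaluate the Poisson process log-likelihood
# x:          vector of observed event times
# lambda_fun: vectorised function giving lambda(x; theta) at the events
# Lambda_A:   integrated intensity over the observation window A
pp_llh <- function(x, lambda_fun, Lambda_A){
  n <- length(x)
  sum(log(lambda_fun(x))) - Lambda_A - lgamma(n + 1)
}

# e.g. for a homogeneous rate of 5 on (0, 25):
# pp_llh(x, lambda_fun = function(t) rep(5, length(t)), Lambda_A = 5 * 25)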

Question

What are two reasons that we might prefer to maximise the log-likelihood, rather than the likelihood?

Hint: one reason is analytical, the other computational.

Special Case: Homogeneous Poisson Process

In the special case that \(\lambda(s) = \lambda\), the likelihood becomes particularly simple:

\[ \mathcal{L}(\lambda; x) = \frac{\lambda^n \exp\{-\lambda|A|\}}{n!}.\]

The corresponding log-likelihood is:

\[\ell(\lambda; x) = n\log \lambda -\lambda|A| - \log(n!).\]

In this simple case, we can show that the MLE is equivalent to the method of moments estimate \(\hat \lambda = \frac{n}{|A|}\).

# Function to calculate the HPP log-likelihood
hpp_llh <- function(lambda, n, t_min, t_max){
  
  # log-likelihood is -Inf for negative rates
  if (lambda < 0) return(-Inf)
  
  # |A|: length of the observation interval
  mod_A <- t_max - t_min
  
  # n log(lambda) - lambda |A| - log(n!)
  n * log(lambda) - (lambda * mod_A) - lgamma(n + 1)
}

# Calculate llh at a sequence of potential lambda values
lambda_values <- seq(from = 1, to = 8, length.out = 1001)
llh_values <- rep(NA, 1001)

for (i in 1:1001) {
  llh_values[i] <- hpp_llh(
    lambda = lambda_values[i],
    n = 125,
    t_min = 0,
    t_max = 25)
}

# Plot log-likelihood function 
plot(
  x = lambda_values,
  y = llh_values,
  ylab = "log-likelihood",
  xlab = expression(lambda), 
  type = "l",
  lwd = 2
)
abline(v = 5, col = "darkorange", lwd = 2)
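
As a quick numerical check (a sketch using hpp_llh() from above; it is not a substitute for the derivation asked for in the task below), R's one-dimensional optimiser optimise() should recover the maximum at \(\hat \lambda = n / |A| = 125 / 25 = 5\).

# Numerically maximise the HPP log-likelihood over lambda
opt_hpp <- optimise(
  f = hpp_llh,
  interval = c(0.01, 20),
  maximum = TRUE,
  n = 125,
  t_min = 0,
  t_max = 25)

opt_hpp$maximum   # should be very close to n / |A| = 5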

Task
  • Derive this result for yourself, along with a corresponding expression for an approximate 95% confidence interval for \(\lambda\).

  • If you are working in R, can you modify hpp_llh() to take as input a vector of lambda values and avoid this for loop?

  • Can you modify hpp_llh() to take the point pattern as an input, rather than the number of points?

Inhomogeneous Poisson Processes

For almost all inhomogeneous Poisson process models, numerical methods will be required to evaluate \(\Lambda(A;\theta) = \int_{s\in A} \lambda(s;\theta) \mathrm{d}s\) and to optimise \(\ell(\theta; X)\).

General approach:

  1. Write or use a function to evaluate \(\Lambda(A;\theta) = \int_{s\in A} \lambda(s;\theta) \mathrm{d}s\), numerically if it has no closed form (see the sketch after this list).

  2. Write a function to evaluate \(\ell(\theta; X)\).

  3. Optimise using gradient descent or some other method.
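
Below is a minimal sketch of steps 1 and 2 for the case where the integral has to be computed numerically. It uses R's built-in integrate() for one-dimensional quadrature; the names Lambda_numeric, nllh_numeric and lambda_fun are illustrative placeholders, not functions from any package.

# Step 1: numerically integrate a generic intensity over A = (a, b)
# (lambda_fun must be vectorised in its second argument)
Lambda_numeric <- function(theta, a, b, lambda_fun){
  integrate(function(s) lambda_fun(theta, s), lower = a, upper = b)$value
}

# Step 2: negative log-likelihood built from the numerical integral
# (the log(n!) constant is dropped as it does not depend on theta)
nllh_numeric <- function(theta, x, a, b, lambda_fun){
  Lambda_numeric(theta, a, b, lambda_fun) - sum(log(lambda_fun(theta, x)))
}

# Step 3 would then pass nllh_numeric to optim(), as in the worked example below.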

Tips:

  • Make use of built-in optimisers unless this will be the focus of your project: optim() or optimise() in R, scipy.optimize in Python.

  • Relevant R packages include {NHPoisson}, {spatstat} and {splancs}.

Suggested Poster Topic

Review of packages for point process analysis in R / Python.

Worked Example

For our example case we will take


\[\lambda(t;\theta) = \theta_1 \sin(t) + \theta_2.\]

In this case we can evaluate \(\Lambda(A;\theta)\) analytically for \(A\) of the form \((0, t_\text{max})\):


\[ \Lambda(A;\theta) = \theta_1 [1 - \cos(t_\text{max})] + \theta_2 t_\text{max}.\]

First we define functions to evaluate \(\lambda\) and \(\Lambda\):

# Intensity function: lambda(t; theta) = theta_1 sin(t) + theta_2
lambda <- function(theta, time){
  theta[1] * sin(time) + theta[2]
}

# Integrated intensity over (0, time)
Lambda <- function(theta, time){
  theta[1] * (1 - cos(time)) + theta[2] * time 
}
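
As a quick sanity check on this algebra (optional, and assuming the two functions just defined), the analytic integral over the observation window used below, \((\pi, 5\pi)\), should agree with numerical quadrature:

# Analytic vs numerical integral of lambda over (pi, 5*pi); both should equal 200*pi
Lambda(c(40, 50), 5 * pi) - Lambda(c(40, 50), pi)
integrate(function(s) lambda(c(40, 50), s), lower = pi, upper = 5 * pi)$value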

This allows us to plot the true intensity function, which is useful for checking our code / method.

# set simulation parameters
t_min <- pi
t_max <- 5 * pi 
theta_true <- c(40, 50)

# Evaluate lambda on regular grid for plotting
t <- seq(t_min, t_max, length.out = 1001)
lambda_vals <- lambda(theta = theta_true, time = t)

plot(t, lambda_vals, type = "l", lwd = 2, ylim = c(0,100), bty = "n")

We can then simulate a point pattern from this intensity.

# Simulate a point pattern by thinning / rejection sampling

# Simulate from bounding HPP
set.seed(4321)
lambda_max <- 90
n_homo <- rpois(n = 1, lambda = lambda_max * (t_max - t_min))
x_homo <- runif(n = n_homo, min = t_min, max = t_max)

# Thin point pattern
prob_keep <- lambda(theta = theta_true, time = x_homo) / lambda_max
is_kept <- rbinom(n = n_homo, size = 1, prob = prob_keep)

x_inhomo <- x_homo[is_kept == 1]
n_inhomo <- length(x_inhomo)

plot(
  x = t, 
  y = lambda_vals,
  type = "l", 
  lwd = 2,
  ylim = c(0,100),
  bty = "n") 
points(
  x = x_inhomo,
  y = rep(0, n_inhomo),
  pch = 16,
  col = rgb(0,0,0,0.2))

Now that we have our simulated point pattern, we can try to recover the simulation parameters from it.

Note: in applications there is never a “real” parametric intensity function.

To find the MLE, we will need to write a function to evaluate the log-likelihood.

  • We will use R’s built-in optimiser, optim(), to find the MLE. This does minimisation, not maximisation.

  • So we need a function to evaluate the negative log-likelihood for a given set of parameters theta, point pattern x and observation interval A.

# Negative log-likelihood 

nllh_ihpp <- function(theta, x, A){
  # !! Need the intensity to be non-negative everywhere:
  # the minimum of theta[1]*sin(t) + theta[2] over a full period is theta[2] - |theta[1]|
  if (abs(theta[1]) >= theta[2]) return(1e8)
  
  # Integral term 
  Lambda_term <- Lambda(theta, A[2]) - Lambda(theta, A[1])
  
  # Intensity term
  lambda_term <- -sum(log(lambda(theta, time = x))) 

  return(lambda_term + Lambda_term)
}

# Test: evaluate for lambda = 0 * sin(t) + 5
nllh_ihpp( theta = c(0, 5), x = x_inhomo, A = c(pi, 5*pi))
[1] -933.4102

We can then optimise this two-parameter function using the following code, which returns a list containing all the information we need.

opt <- optim(
  par = c(0,5),             # starting parameter values
  fn = nllh_ihpp,           # function we want to optimise
  method = "L-BFGS-B",      # A smarter version of gradient descent
  hessian = TRUE,           # Return numerical estimate of Hessian at MLE
  A = c(t_min, t_max),      # Remaining parameters of lambda(t;theta)
  x = x_inhomo
  )

opt
$par
[1] 41.40042 49.25828

$value
[1] -1910.736

$counts
function gradient 
      14       14 

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

$hessian
           [,1]       [,2]
[1,]  0.2734041 -0.2297891
[2,] -0.2297891  0.4482445

We can use the point estimates from above to plot our point estimate for the intensity function. Not bad!

fitted_intensity <- lambda(theta = opt$par, time = t)

plot(
 x = t,
 y = lambda_vals,
 type = "l",
 lwd = 2,
 ylim = c(0, 100)) 
points(
  x = x_inhomo,
  y = rep(0, n_inhomo),
  pch = 16, 
  col = rgb(0, 0, 0, 0.2)) 
lines(
  x = t, 
  y = fitted_intensity,
  col = "#E87800",
  lwd = 2)

Adding uncertainty

A point estimate alone wouldn’t be sufficient if we were estimating a scalar, and it isn’t sufficient when we are estimating a function either.

To visualise the uncertainty in our estimated intensity function, we can draw lots of values of \(\hat \theta\) from its sampling distribution and plot the intensity function associated with each set of sampled values.

Review of Likelihood inference - part 2

The maximum likelihood estimator has a sampling distribution that is (asymptotically) Gaussian:

\[\hat \theta \sim MVN(\theta, I^{-1}(\theta)). \]

If we have a large enough data set (and some mild regularity conditions hold), we can use this result to approximate the error we make when estimating \(\theta\):

\[\hat \theta - \theta \sim MVN(0, I^{-1}(\theta)). \]

Unfortunately we do not know the true value of \(\theta\) (otherwise we wouldn’t need to estimate it!) and so we don’t know the information matrix \(I(\theta)\) either.

Instead, we replace it by the observed information at the MLE - a matrix that we do know.

\[\hat \theta - \theta \sim MVN(0, I_O^{-1}(\hat\theta)) \]

In fact, this observed information matrix is the negative of the Hessian of the log-likelihood, evaluated at the MLE. Since we minimised the negative log-likelihood, the Hessian that optim() has numerically approximated for us is exactly this observed information matrix.

\[\hat \theta - \theta \sim MVN(0, - H^{-1}(\hat \theta))\]

(Recall that this Hessian matrix is just a square matrix filled with second-order partial derivatives of our log-likelihood function. We evaluate these partial derivatives at the maximum likelihood estimate to describe the local curvature of the log-likelihood function.)
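
Before simulating, it can be useful to read off marginal standard errors and approximate 95% Wald intervals for each parameter directly from this matrix. Here is a short sketch using the optim() output stored in opt above:

# Approximate covariance of the MLE: inverse of the observed information.
# Because we minimised the *negative* log-likelihood, opt$hessian is already
# the observed information, so no sign change is needed.
cov_theta <- solve(opt$hessian)

# Standard errors and approximate 95% Wald intervals for theta_1 and theta_2
se_theta <- sqrt(diag(cov_theta))
cbind(
  lower = opt$par - 1.96 * se_theta,
  estimate = opt$par,
  upper = opt$par + 1.96 * se_theta)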

We can also use this Gaussian approximation to simulate 1000 values from the approximate sampling distribution of the MLE \(\hat\theta\). To do this, I will use a function from the {mgcv} package to draw samples from the corresponding multivariate Gaussian distribution.

mle_samples <- mgcv::rmvn(n = 1000, mu = opt$par, V = solve(opt$hessian))

plot(
  mle_samples,
  col = rgb(232/255, 120/255, 0, 0.4),
  xlab = "theta_1",
  ylab = "theta_2",
  asp = 1,
  pch = 16,
  bty = "n")

points(x = theta_true[1], y = theta_true[2], pch = "+", cex = 2)

Finally, we can visualise the uncertainty in the estimated intensity by overlaying each of these sampled curves on the true intensity function.

plot(
  x = t,
  y = lambda_vals, 
  type = "l", 
  ylab = expression(lambda(t)),
  lwd = 2, 
  ylim = c(0,100), 
  bty = "n")  

for (i in 1:1000) {
  lines(
    x = t, 
    y = lambda(mle_samples[i,], t),
    type = "l",
    lwd = 2, 
    col = rgb(232/255, 120/255, 0, 0.04)
  )
}

lines(x = t, y = lambda_vals,  lwd = 2) 

legend(
  "topleft", 
  legend = c("true", "estimated"), 
  col = c("black", rgb(232/255, 120/255, 0, 0.4)), 
  lwd = c(2, 2), 
  bty = "n"
)

Suggested Activity
  • Formally test whether a sinusoidal intensity improves the model fit over a HPP.
  • How does the length of the observation interval impact the power of this test?
Suggested Activity

Extend the code given to include a phase shift and phase compression:

\[ \lambda(t;\theta) = a \sin(b t + c) + d.\]

Poster Ideas

Suggested Poster Topic
  • Pick your own \(\lambda(s;\theta)\) or \(\lambda(x;\theta)\) for simulation and inference.

  • How to compare different models \(\lambda_1(s;\theta)\) vs \(\lambda_2(s;\theta)\)?

  • Estimation when \(\Lambda(A;\theta)\) is an intractable integral.

  • Nonparametric (kernel density) estimation of \(\lambda(s)\).

  • How can we assess the goodness of fit of a point process model?

  • How can the idea of “residuals” be extended to point processes?

  • Simulation and Inference of Poisson cluster / simple inhibition processes.

  • Estimating the intensity of history-dependent processes using Expectation-Maximisation Algorithms.

Wrapping Up

That is all for today, make sure to have a go at the suggested activities in the second hour while I am around to help.

That is all of the technical content covered. Next session will be shorter, highlighting some applications and extensions of Poisson processes and giving general poster advice.

If you haven’t already, talk to me about your poster idea.