# Function to calculate the log-likelihood of a homogeneous Poisson process
hpp_llh <- function(lambda, n, t_min, t_max){
  if (lambda < 0) return(-Inf)
  mod_A <- t_max - t_min
  n * log(lambda) - (lambda * mod_A) - sum(log(n:1))
}
# Calculate llh at a sequence of potential lambda values
lambda_values <- seq(from = 1, to = 8, length.out = 1001)
llh_values <- rep(NA, 1001)

for (i in 1:1001) {
  llh_values[i] <- hpp_llh(
    lambda = lambda_values[i],
    n = 125,
    t_min = 0,
    t_max = 25)
}
# Plot log-likelihood function
plot(
  x = lambda_values,
  y = llh_values,
  ylab = "log-likelihood",
  xlab = expression(lambda),
  type = "l",
  lwd = 2
)
abline(v = 5, col = "darkorange", lwd = 2)
Workshop 3: Inference
Overview
Aim of the session
We’ve seen several definitions of Poisson processes.
We can now generate our own point patterns, given the definition of a process.
Can we go the other way? Given a point pattern, can we estimate the intensity function that generated it?
We will cover parametric estimation of \(\lambda(t;\theta)\) in 1 dimension.
Outline
What is inference?
Deriving the Poisson process likelihood
Maximum likelihood estimation
Examples and representing uncertainty
Disclaimers, again.
I will be using R; feel free to reuse and adapt this code.
There is no requirement to use R; pick whatever language you are most comfortable in.
We focus on temporal point patterns here - the core ideas extend to spatial processes. (I am deliberately leaving extensions for you!)
Inference for Poisson Processes
Inference is the process of using data to draw conclusions about a process beyond the realisation that you have available to you.
We have one (or more) point patterns, which are realisations from some point process.
We want to construct an intensity model for that point process.
We will consider the method of maximum likelihood in this lecture. We will not cover:
- Inference for self-exciting point processes.
- Bayesian inference for point process models.
Review of maximum likelihood inference
We have observations \(\mathbf{x} = (x_1, \ldots, x_n)\) and an intensity function \(\lambda(t;\theta)\) described by some parameters \(\theta\), e.g.:
\[\lambda(t ; \theta) = \theta,\]
\[\lambda(t ; \theta) = \exp\{\theta_0 + \theta_1 t\}.\]
We want to pick the parameters \(\theta = \hat \theta\) which make the observed point pattern most likely.
To do that we find an expression for the probability of our observed point pattern and find the combination of parameter values that maximise that expression.
\[\Pr( X | \theta) = \mathcal{L}(\theta ; X)\]
\[ \hat \theta = \underset{\theta \in \Theta}{\operatorname{arg\,max}} \ \ \mathcal{L}(\theta; X)\]
Likelihood of a Poisson Process
The likelihood of a dataset is the probability of observing that data set, considered as a function of the model parameters.
We want to find the parameters \(\hat \theta\) that make the observed data most probable - this will be our point estimate of the parameter value(s).
In a Poisson process there are two random components:
How many events we see within the process, \(\Pr(N(A) = n)\);
Where those events are located on our observation window \(A \subseteq S\).
We are interested in the joint probability of seeing the particular number of points that we did and observing those points in the locations they happened to appear:
\[\mathcal{L}(\theta;X) = \Pr(N(A) = n \ \&\ X_1 = x_1 \ \&\ \ldots \ \&\ X_n = x_n).\]
Since \(\Pr(Y \cap Z) = \Pr(Y | Z)\Pr(Z)\), it follows that
\[ \mathcal{L}(\theta;X) = \Pr(N(A) = n)\Pr(X_1 = x_1 \ \&\ \ldots \ \&\ X_n = x_n \mid N = n).\]
Because event locations in a Poisson process are independent of one another, this can be further simplified:
\[ \mathcal{L}(\theta;X) = \Pr(N = n)\prod_{i=1}^{n}\Pr(X_i = x_i | N = n).\]
Recall that our integrated intensity or compensator is defined as follows:
\[\Lambda(A; \theta) = \int_{s \in A} \lambda(s;\theta) \mathrm{d}s. \]
- How can \(\Lambda(A)\) be interpreted in terms of:
- an area or volume;
- an event count?
We can therefore replace \(\Pr(N = n)\) with the Poisson pmf:
\[ \frac{(\Lambda(A;\theta))^n \exp\{-\Lambda(A;\theta)\}}{n!} \prod_{i=1}^{n}\Pr(X_i = x_i | N = n).\]
How can we get an expression for the probability of observing an event at any given location?
Since event locations are independent of the total number of events and have density proportional to the intensity on \(A\), we can calculate the probability density function of an individual observation by scaling \(\lambda(s ; \theta)\) so that it integrates to 1 over \(A\).
\[ f_{X_i}(x_i) = \frac{\lambda(x_i; \theta)}{\Lambda(A; \theta)}.\]
Substituting this into the previous expression, we get that
\[ \mathcal{L}(\theta; x) = \frac{(\Lambda(A;\theta))^n \exp\{-\Lambda(A; \theta)\}}{n!} \prod_{i=1}^{n} \frac{\lambda(x_i; \theta)}{\Lambda(A; \theta)} \]
Cancelling redundant terms gets us to the Poisson process likelihood function:
\[ \mathcal{L}(\theta; x) = (n!)^{-1} \exp\{-\Lambda(A;\theta)\} \prod_{i=1}^{n}\lambda(x_i; \theta).\]
We can see that this function has two components: one concerned with the number of events that we see, and the other with the locations of those events on \(A\).
\[ \mathcal{L}(\theta; x) = \underbrace{(n!)^{-1} \exp\{-\Lambda(A;\theta)\}}_{\text{Number of Events}}\ \ \underbrace{\prod_{i=1}^{n}\lambda(x_i; \theta)}_{\text{Locations}}.\]
What would happen if the term containing \(\Lambda(A;\theta)\) was not included in the likelihood?
Does this suggest why \(\Lambda(A;\theta)\) is sometimes called the compensator?
Maximising this likelihood function over \(\theta\) is equivalent to maximising the log-likelihood \(\ell\):
\[ \ell(\theta; x) = \log \mathcal{L}(\theta; x) = - \log(n!) - \Lambda(A; \theta) + \sum_{i =1}^{n} \log(\lambda(x_i;\theta)).\]
What are two reasons that we might prefer to maximise the log-likelihood, rather than the likelihood?
Hint: one reason is analytical, the other computational.
Special Case: Homogeneous Poisson Process
In the special case that \(\lambda(s) = \lambda\), the likelihood becomes particularly simple:
\[ \mathcal{L}(\lambda; x) = \frac{\lambda^n \exp\{-\lambda|A|\}}{n!}.\]
The corresponding log-likelihood is:
\[\ell(\lambda; x) = n\log \lambda -\lambda|A| - \log(n!).\]
In this simple case, we can show that the MLE is equivalent to the method of moments estimate \(\hat \lambda = \frac{n}{|A|}\).
Derive this result for yourself, along with a corresponding expression for an approximate 95% confidence interval for \(\lambda\).
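If you want to check your derivation numerically, here is a minimal sketch using the example values from the code above (n = 125 events observed on (0, 25), so the MLE should be 5). The standard error below comes from the observed information \(n / \hat\lambda^2\); confirm that it matches your own expression.

```r
# MLE and approximate 95% Wald confidence interval for a homogeneous Poisson process
# (a sketch to check against your own derivation, using n = 125 events on (0, 25))
n <- 125
mod_A <- 25
lambda_hat <- n / mod_A                 # MLE, equal to the method of moments estimate
se_lambda <- sqrt(lambda_hat / mod_A)   # sqrt of inverse observed information n / lambda_hat^2
c(lower = lambda_hat - 1.96 * se_lambda,
  upper = lambda_hat + 1.96 * se_lambda)
```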
- If you are working in R, can you modify hpp_llh() to take a vector of lambda values as input and avoid the for loop used earlier? (One possible approach is sketched below.)
- Can you modify hpp_llh() to take the point pattern itself as input, rather than the number of points?
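One possible vectorised version for the first exercise (the name hpp_llh_vec and the details are my own suggestion; there are many ways to do this):

```r
# Vectorised HPP log-likelihood: accepts a vector of candidate lambda values
hpp_llh_vec <- function(lambda, n, t_min, t_max){
  mod_A <- t_max - t_min
  out <- rep(-Inf, length(lambda))    # -Inf wherever the intensity would be negative
  ok <- lambda >= 0
  out[ok] <- n * log(lambda[ok]) - lambda[ok] * mod_A - sum(log(n:1))
  out
}

# Replaces the earlier for loop with a single call
llh_values <- hpp_llh_vec(lambda_values, n = 125, t_min = 0, t_max = 25)
```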
Inhomogeneous Poisson Processes
For almost all inhomogeneous Poisson process models numerical methods will be required to evaluate \(\Lambda(A;\theta) = \int_{s\in A} \lambda(s;\theta) \mathrm{d}s\) and to optimise \(\ell(\theta; X)\).
General approach:
Write or use a function to evaluate \(\Lambda(A;\theta) = \int_{s\in A} \lambda(s;\theta) \mathrm{d}s\).
Write a function to evaluate \(\ell(\theta; X)\).
Optimise using gradient descent or some other method.
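To make the first two steps concrete, here is a minimal sketch (my own illustration; the worked example below uses an analytic \(\Lambda\) instead) that evaluates \(\Lambda(A;\theta)\) by numerical integration with R's integrate() and returns the log-likelihood for an arbitrary intensity function:

```r
# Log-likelihood of an inhomogeneous Poisson process on (t_min, t_max),
# with the integrated intensity evaluated numerically via integrate().
# `intensity` should be a vectorised function of (time, theta).
ipp_llh <- function(theta, x, t_min, t_max, intensity){
  lambda_x <- intensity(x, theta)
  if (any(lambda_x <= 0)) return(-Inf)   # intensity must be positive at every event
  Lambda_A <- integrate(function(t) intensity(t, theta),
                        lower = t_min, upper = t_max)$value
  sum(log(lambda_x)) - Lambda_A          # log(n!) constant omitted
}
```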
Tips:
- Make use of built-in optimisers unless this will be the focus of your project - optim() or optimise() in R, scipy.optimize in Python. (A small example is sketched below.)
- Several R packages implement point process methods, e.g. {NHPoisson}, {spatstat}, {splancs}, …
- Review the packages available for point process analysis in R / Python before writing your own implementation.
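For instance, the one-parameter HPP log-likelihood from earlier can be maximised directly with optimise() (a sketch, assuming hpp_llh() and the example values n = 125 on (0, 25)):

```r
# Maximise the HPP log-likelihood over a search interval for lambda;
# extra arguments are passed through to hpp_llh(). The maximum should be near 5.
optimise(hpp_llh, interval = c(0.01, 20), maximum = TRUE,
         n = 125, t_min = 0, t_max = 25)
```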
Worked Example
For our example case we will take
\[\lambda(t;\theta) = \theta_1 \sin(t) + \theta_2.\]
In this case we can evaluate \(\Lambda(A;\theta)\) analytically for \(A\) of the form \((0, t_\text{max})\):
\[ \Lambda(A;\theta) = \theta_1 [1 - \cos(t_\text{max})] + \theta_2 t_\text{max}.\]
First we define functions to evaluate \(\lambda\) and \(\Lambda\):
<- function(theta, time){
lambda 1] * sin(time) + theta[2]
theta[
}
<- function(theta, time){
Lambda 1] * (1 - cos(time)) + theta[2] * time
theta[ }
This allows us to plot the true intensity function, which is useful for checking our code / method.
# set simulation parameters
t_min <- pi
t_max <- 5 * pi
theta_true <- c(40, 50)

# Evaluate lambda on regular grid for plotting
t <- seq(t_min, t_max, length.out = 1001)
lambda_vals <- lambda(theta = theta_true, time = t)

plot(t, lambda_vals, type = "l", lwd = 2, ylim = c(0, 100), bty = "n")
We can then simulate a point pattern from this intensity.
# Simulate a point pattern by thinning / rejection sampling
# Simulate from bounding HPP
set.seed(4321)
lambda_max <- 90
n_homo <- rpois(n = 1, lambda = lambda_max * (t_max - t_min))
x_homo <- runif(n = n_homo, min = t_min, max = t_max)

# Thin point pattern
prob_keep <- lambda(theta = theta_true, time = x_homo) / lambda_max
is_kept <- rbinom(n = n_homo, size = 1, prob = prob_keep)

x_inhomo <- x_homo[is_kept == 1]
n_inhomo <- length(x_inhomo)
plot(
x = t,
y = lambda_vals,
type = "l",
lwd = 2,
ylim = c(0,100),
bty = "n")
points(
x = x_inhomo,
y = rep(0, n_inhomo),
pch = 16,
col = rgb(0,0,0,0.2))
Now that we have our simulated point pattern, we can try to recover the simulation parameters from it.
Note: in applications there is never a “real” parametric intensity function.
To find the MLE, we will need to write a function to evaluate the log-likelihood.
We will use R’s built-in optimiser to find the MLE. This performs minimisation rather than maximisation, so we need a function to evaluate the negative log-likelihood for a given set of parameters theta, point pattern x and observation interval A.
# Negative log-likelihood
nllh_ihpp <- function(theta, x, A){
  # !! Need intensity non-negative everywhere
  if (theta[1] > theta[2]) return(1e8)

  # Integral term
  Lambda_term <- Lambda(theta, A[2]) - Lambda(theta, A[1])

  # Intensity term
  lambda_term <- -sum(log(lambda(theta, time = x)))

  return(lambda_term + Lambda_term)
}
# Test: evaluate for lambda = 0 * sin(t) + 5
nllh_ihpp(theta = c(0, 5), x = x_inhomo, A = c(pi, 5 * pi))
[1] -933.4102
We can then optimise this two-parameter function using the following code, which returns a list containing all the information we need.
opt <- optim(
  par = c(0, 5),         # starting parameter values
  fn = nllh_ihpp,        # function we want to optimise
  method = "L-BFGS-B",   # a smarter version of gradient descent
  hessian = TRUE,        # return numerical estimate of Hessian at MLE
  A = c(t_min, t_max),   # remaining arguments passed to nllh_ihpp()
  x = x_inhomo
)
opt
$par
[1] 41.40042 49.25828
$value
[1] -1910.736
$counts
function gradient
14 14
$convergence
[1] 0
$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
$hessian
[,1] [,2]
[1,] 0.2734041 -0.2297891
[2,] -0.2297891 0.4482445
We can use the point estimates from above to plot our point estimate for the intensity function. Not bad!
fitted_intensity <- lambda(theta = opt$par, time = t)
plot(
x = t,
y = lambda_vals,
type = "l",
lwd = 2,
ylim = c(0, 100))
points(
x = x_inhomo,
y = rep(0, n_inhomo),
pch = 16,
col = rgb(0, 0, 0, 0.2))
lines(
x = t,
y = fitted_intensity,
col = "#E87800",
type = "l",
lwd = 2,
ylim = c(0,10))
Adding uncertainty
A point estimate wouldn’t be sufficient if we were estimating a scalar, and nor is it sufficient when we are estimating a function.
To visualise the uncertainty in our function we can draw lots of values for \(\hat \theta\) from its sampling distribution and plot the intensity function associated with each set of sampled values.
Review of Likelihood inference - part 2
The Maximum likelihood estimator has a sampling distribution that is (asymptotically) Gaussian:
\[\hat \theta \sim MVN(\theta, I^{-1}(\theta)). \]
If we have a large enough data set (and some mild regularity conditions hold), we can use this result to approximate the error we make when estimating \(\theta\):
\[\hat \theta - \theta \sim MVN(0, I^{-1}(\theta)). \]
Unfortunately we do not know the true value of \(\theta\) (otherwise we wouldn't need to estimate it!) and so we don't know the information matrix either.
Instead, we replace it by the observed information at the MLE - a matrix that we do know.
\[\hat \theta - \theta \sim MVN(0, I_O^{-1}(\hat\theta)) \]
In fact, this observed information matrix is the negative of the Hessian of the log-likelihood evaluated at the MLE - and because we minimised the negative log-likelihood, the Hessian that R has numerically approximated for us is exactly this observed information.
\[\hat \theta - \theta \sim MVN(0, - H^{-1}(\hat \theta))\]
(Recall that this Hessian matrix is just a square matrix filled with second-order partial derivatives of our log-likelihood function. We evaluate these partial derivatives at the maximum likelihood estimate to describe the local curvature of the log-likelihood.)
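As a small aside (my own sketch, not part of the original example), inverting this Hessian also gives approximate standard errors and Wald-type 95% confidence intervals for the individual parameters:

```r
# Approximate standard errors and 95% Wald intervals for theta_1 and theta_2,
# using the Hessian of the negative log-likelihood returned by optim()
theta_hat <- opt$par
se_theta <- sqrt(diag(solve(opt$hessian)))
cbind(lower = theta_hat - 1.96 * se_theta,
      estimate = theta_hat,
      upper = theta_hat + 1.96 * se_theta)
```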
We can use this result to simulate 1000 values from the approximate sampling distribution of the MLE \(\hat\theta\). To do this, I will use a function from the {mgcv}
package to draw samples from the corresponding multivariate Gaussian distribution.
mle_samples <- mgcv::rmvn(n = 1000, mu = opt$par, V = solve(opt$hessian))
plot(
  mle_samples,
  col = rgb(232/255, 120/255, 0, 0.4),
  xlab = "theta_1",
  ylab = "theta_2",
  asp = 1,
  pch = 16,
  bty = "n")
points(x = theta_true[1], y = theta_true[2], pch = "+", cex = 2)
Finally, we can visualise the uncertainty in the estimated intensity by overlaying the intensity curve implied by each of these sampled parameter vectors.
plot(
x = t,
y = lambda_vals,
type = "l",
ylab = expression("lambda(t)"),
lwd = 2,
ylim = c(0,100),
bty = "n")
for (i in 1:1000) {
lines(
x = t,
y = lambda(mle_samples[i,], t),
type = "l",
lwd = 2,
col = rgb(232/255, 120/255, 0, 0.04)
)
}
lines(x = t, y = lambda_vals, lwd = 2)
legend(
"topleft",
legend = c("true", "estimated"),
col = c("black", rgb(232/255, 120/255, 0, 0.4)),
lwd = c(2,1),
bty = "n"
)
- Formally test whether a sinusoidal intensity improves the model fit over an HPP. (One possible set-up is sketched after this list.)
- How does the length of the observation interval impact the power of this test?
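One possible set-up for the first question (my own sketch, reusing objects from the worked example) is a likelihood ratio test: the HPP is the sinusoidal model with \(\theta_1 = 0\), so the models are nested and differ by one parameter.

```r
# Likelihood ratio test of H0: lambda(t) = lambda_0 vs H1: lambda(t) = theta_1 sin(t) + theta_2.
# Both log-likelihoods omit the log(n!) constant, which cancels in the ratio.
A_len <- t_max - t_min
lambda_hat_hpp <- n_inhomo / A_len
llh_h0 <- n_inhomo * log(lambda_hat_hpp) - lambda_hat_hpp * A_len
llh_h1 <- -opt$value                           # opt minimised the negative log-likelihood
lrt_stat <- 2 * (llh_h1 - llh_h0)
pchisq(lrt_stat, df = 1, lower.tail = FALSE)   # approximate p-value
```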
Extend the code given to include a phase shift and phase compression
\[ \lambda(t;\theta) = a \sin(b t + c) + d.\]
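A possible starting point for this extension (my own sketch, following the parameterisation above with \(\theta = (a, b, c, d)\)):

```r
# Extended intensity: lambda(t; theta) = a * sin(b * t + c) + d, with theta = c(a, b, c, d)
lambda_ext <- function(theta, time){
  theta[1] * sin(theta[2] * time + theta[3]) + theta[4]
}

# Integrated intensity over (0, time), from integrating a*sin(b*t + c) + d (assumes b != 0)
Lambda_ext <- function(theta, time){
  theta[1] * (cos(theta[3]) - cos(theta[2] * time + theta[3])) / theta[2] + theta[4] * time
}
```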
Poster Ideas
Pick your own \(\lambda(s;\theta)\) or \(\lambda(x;\theta)\) for simulation and inference.
How to compare different models \(\lambda_1(s;\theta)\) vs \(\lambda_2(s;\theta)\)?
Estimation when \(\Lambda(A;\theta)\) is an intractable integral.
Nonparametric (kernel) estimation of \(\lambda(s)\).
How can we assess the goodness of fit of a point process model?
How can the idea of “residuals” be extended to point processes?
Simulation and Inference of Poisson cluster / simple inhibition processes.
Estimating the intensity of history-dependent processes using Expectation-Maximisation Algorithms.
Wrapping Up
That is all for today; make sure to have a go at the suggested activities in the second hour while I am around to help.
That is all of the technical content covered. Next session will be shorter, highlighting some applications and extensions of Poisson processes and giving general poster advice.
If you haven’t already, talk to me about your poster idea.