Background

$$\newcommand{\kon}{k_{\mathrm{on}}} \newcommand{\koff}{k_{\mathrm{off}}} \newcommand{\kcat}{k_{\mathrm{cat}}} \newcommand{\kuncat}{k_{\mathrm{uncat}}} \newcommand{\kms}{k_{m,\mathrm{S}}} \newcommand{\kmp}{k_{m,\mathrm{P}}} \newcommand{\dSdt}{\frac{d[\mathrm{S}]}{dt}} \newcommand{\dEdt}{\frac{d[\mathrm{E}]}{dt}} \newcommand{\dESdt}{\frac{d[\mathrm{ES}]}{dt}} \newcommand{\dPdt}{\frac{d[\mathrm{P}]}{dt}}$$

Enzyme Kinetics

Enzymes catalyze many critical chemical reactions in cells.

Describing a cell with a mathematical model (a long-standing goal of computational biologists) would entail modelling each enzyme-catalyzed chemical reaction.

However, although we may know the scheme for many enzymatic reactions (the responsible enzyme, the associated substrates, and the resultant products), we are often missing many of the details needed to construct a faithful mathematical model of the reaction.

Let's begin by introducing the mathematical model used to describe enzymatic reaction schemes. Consider the following enzymatically-catalyzed (uni uni) chemical reaction scheme:

$$ E+S \underset{\koff}{\overset{\kon}{\rightleftarrows}} ES \underset{\kuncat}{\overset{\kcat}{\rightleftarrows}}E+P $$

In this scheme E is an enzyme, S is its substrate, ES is the enzyme-substrate complex, which is an intermediate, and P is the product of the reaction. Each of those chemical species has a concentration in a fixed volume, which we denote with brackets (e.g. $[\mathrm{E}]$ = enzyme concentration).

If we make the simplifying assumption that the 4 molecular species are 'well-mixed' in solution, we can invoke the 'Law of Mass Action' under which the rate of each of the four included reactions is linear in the concentrations of the reactants (with an associated coefficient called the rate constant). The reactions in the above scheme are: enzyme-substrate association ($\kon$), dissociation ($\koff$), enzyme catalysis of substrate into product ($\kcat$), and enzyme-product re-association ("uncatalysis", $\kuncat$). The designation of 'substrate' and 'product' is our choice -- the model is entirely symmetric, which is reflected in the associated ODEs:

$$\begin{aligned} \frac{d[\mathrm{S}]}{dt} &= k_{\mathrm{off}}[\mathrm{ES}] - k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] \\ \frac{d[\mathrm{E}]}{dt} &= k_{\mathrm{off}}[\mathrm{ES}] - k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] + k_{\mathrm{cat}}[\mathrm{ES}] - k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \\ \frac{d[\mathrm{ES}]}{dt} &= - k_{\mathrm{off}}[\mathrm{ES}] + k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] - k_{\mathrm{cat}}[\mathrm{ES}] + k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \\ \frac{d[\mathrm{P}]}{dt} &= k_{\mathrm{cat}}[\mathrm{ES}] - k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \end{aligned}$$

This differential equation model describing the (deterministic) chemical kinetics for an enzymatically-catalyzed reaction in well-mixed conditions contains 4 kinetic parameters, i.e. 4 degrees of freedom, which we do not know a priori. These will be the subject of inference.

Note: the intracellular environment is not best described as well-mixed, and models of 'Macromolecular Crowding' have led to more accurate rate laws for these reactions in vivo. However, we will retain the well-mixed assumption for now.

Parameter Inference

There are 3 typical problems associated with ODE models:

  • Supplied with a complete specification of the system, the forward problem is to integrate the differential equations from some initial conditions forwards in time and predict the trajectory of the system. This is what is typically meant by "solving" the ODE system, but exact analytical solutions are rare, and numerical methods are often brought to bear to approximate system trajectories.
  • Supplied with one or more trajectories (data) but incomplete specification of the system, the inverse problem is to estimate parameters of the system (coefficients in the ODE expressions).
  • Finally, given some manipulable inputs, the control problem is to drive the system towards some desired state.

This post will explore a range of approaches for the inverse problem. Our goal will be to estimate the kinetic parameters of enzymatically-catalyzed chemical reactions from timeseries of concentrations of the molecular species.

Note: enzyme kinetic parameters are typically not inferred from metabolite timeseries data using the methods we will describe, but instead from specific enzyme assays. However, at the moment, these assays are limited to studying one enzyme at a time. The inference approaches described in this post can leverage data from emerging high-throughput assays, which measure many molecular concentrations at once.

The determination of the kinetic parameters for the enzymatic reactions of life is a major project, and reported values have been tabulated in databases such as BRENDA. However, my experience with these databases has been that the reported kinetic parameters are not internally consistent.

The Michaelis-Menten/Briggs-Haldane Approximation

Two assumptions commonly made at this point are:

  1. that the initial substrate concentration is much larger than the enzyme concentration ($[\mathrm{S_0}] \gg [\mathrm{E_0}]$).
  2. that the rates of enzyme-substrate association ($\kon$) and dissociation ($\koff$) are much greater than the rates of catalysis and uncatalysis (i.e. $\kon$, $\koff$ $\gg$ $\kcat$, $\kuncat$).

These assumptions permit a timescale separation argument called the "Quasi-Steady-State Approximation" (QSSA) in which we set $\dESdt = 0$, which enables the derivation of the traditional Reversible Michaelis-Menten/Briggs-Haldane expression:

$$\begin{aligned} \frac{d[\mathrm{P}]}{dt} &= \frac{ \frac{\kcat \, [\mathrm{E_T}] [\mathrm{S}]}{K_{m,\mathrm{S}}} - \frac{\koff \, [\mathrm{E_T}] [\mathrm{P}]}{K_{m,\mathrm{P}}}} {1+\frac{[\mathrm{S}]}{K_{m,\mathrm{S}}} + \frac{[\mathrm{P}]}{K_{m,\mathrm{P}}}} \\ \\ \frac{d[\mathrm{S}]}{dt} &= -\frac{d[\mathrm{P}]}{dt} \end{aligned}$$

in which we have introduced the "Michaelis Constants": $K_{m,\mathrm{S}} = \frac{\koff + \kcat}{\kon}$ and $K_{m,\mathrm{P}} = \frac{\koff + \kcat}{\kuncat}$.

The QSSA reduces the system from 4 variables to 2. However, there are still 4 kinetic parameters to estimate in this reduced model.

Note: another assumption typically made at this point is to assume that catalysis is irreversible ($\kuncat = 0$), leading to a further simplified expression for the rate of product formation $\frac{d[\mathrm{P}]}{dt}$. However, this assumption is quite often inaccurate, so we will not make it.

Exploring the Forward Model

A Standard Example

Before we explore techniques to estimate enzyme kinetic parameters from timeseries data, we need to generate timeseries data to begin with. We can accomplish that by fixing kinetic parameters, then solving the forward problem. It will turn out that integrating the differential equations forwards is a subroutine of both approaches to the inverse problem we'll see in this post, so developing a method for the forward problem is hardly wasted effort.

In order to produce a trajectory, we need to set initial conditions. We'll integrate the reaction kinetics of a hypothetical in vitro experiment, in which a fixed quantity of enzyme and substrate are added to the reaction at the outset, then left to react.

Note: in vivo we would expect the concentration of enzyme to vary over time, and the substrate to be replenished. We will generalize this approach to a more biologically-relevant setting in a future post.

Our initial conditions are:

  • $[E]_0$, the initial enzyme concentration, is set to 1 mM (millimolar, i.e. 1000 μM).
  • $[S]_0$, the initial substrate concentration, is set to 10 mM.

default_initial_conditions = {
    'S_0': 10e3,
    'E_0': 1e3,
    'ES_0': 0.0,
    'P_0': 0.0
}

Next, let's fix some generic rate constants:

  • $\kon \,$ of $10^6$ events per molar per second, or 1 per μM per second, is a typical rate constant for enzyme-substrate binding.
  • $\koff \,$ of 500/s results in a $\koff$/$\kon$ = $k_d$ of 500 μM, which is a typical $k_d$.
  • $\kcat \,$ is 30/s, a fairly slow but respectable $\kcat$.
  • $\kuncat \,$ of $\frac{\kon}{10}$ (here, 0.1 per μM per second) is often taken as the boundary below which the QSSA holds. Let's use $\kuncat = \frac{\kon}{100} = $ 0.01 per μM per second for good measure.

Our units are μM and seconds.

default_kinetic_params = {
    'k_on': 1,
    'k_off': 500,
    'k_cat': 30,
    'k_uncat': 0.01
}

def k_ms(p): return (p['k_off'] + p['k_cat']) / p['k_on']
def k_mp(p): return (p['k_off'] + p['k_cat']) / p['k_uncat']

default_kinetic_params['k_ms'] = k_ms(default_kinetic_params)
default_kinetic_params['k_mp'] = k_mp(default_kinetic_params)

To simulate the kinetics with small discrete time steps, we need a step size and a total number of steps:

dt = 1e-6
steps = 5e5

There are a variety of numerical methods to integrate systems of differential equations. The most straightforward is Euler's method, which we've written down explicitly for this system below:

# imports used throughout the rest of the post
import numpy as np
import pandas as pd
from numpy import sqrt, exp

def integrate_euler_full(kinetic_params, dt=dt, steps=steps, initial_conditions=default_initial_conditions):

    S, E, ES, P = initial_conditions.values()
    k_on, k_off, k_cat, k_uncat, k_ms, k_mp = kinetic_params.values()
    traj = [[S, E, ES, P]]

    for _ in range(int(steps)):

        dS = k_off * ES - k_on * E * S
        dE = k_off * ES - k_on * E * S + k_cat * ES - k_uncat * E * P
        dES = k_on * E * S - k_off * ES - k_cat * ES + k_uncat * E * P
        dP = k_cat * ES - k_uncat * E * P

        S += dS * dt
        E += dE * dt
        ES += dES * dt
        P += dP * dt

        traj.append([S, E, ES, P])

    return pd.DataFrame(traj, columns=['S', 'E', 'ES', 'P'], index=np.around(np.linspace(0, dt*steps, int(steps)+1), 6))

We'll also write down Euler's method for the Michaelis-Menten/Briggs-Haldane kinetics.
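Below is a minimal sketch of what that reduced integrator might look like, under the same unit conventions and parameter dictionary as integrate_euler_full. The function name integrate_euler_MM and the column names S_MM/P_MM are chosen to match how the trajectories are referenced later; the details are an assumption, not the notebook's original cell.

def integrate_euler_MM(kinetic_params, dt=dt, steps=steps, initial_conditions=default_initial_conditions):

    # reduced (QSSA) kinetics: track only [S] and [P]; total enzyme E_T is conserved
    S, P = initial_conditions['S_0'], initial_conditions['P_0']
    E_T = initial_conditions['E_0'] + initial_conditions['ES_0']
    k_off, k_cat = kinetic_params['k_off'], kinetic_params['k_cat']
    k_ms, k_mp = kinetic_params['k_ms'], kinetic_params['k_mp']
    traj = [[S, P]]

    for _ in range(int(steps)):

        # reversible Michaelis-Menten/Briggs-Haldane rate of product formation
        dP = (k_cat * E_T * S / k_ms - k_off * E_T * P / k_mp) / (1 + S / k_ms + P / k_mp)

        S -= dP * dt
        P += dP * dt

        traj.append([S, P])

    return pd.DataFrame(traj, columns=['S_MM', 'P_MM'], index=np.around(np.linspace(0, dt*steps, int(steps)+1), 6))

With the QSSA reduction, each Euler step costs a single rate evaluation instead of four, and the enzyme and complex concentrations never appear explicitly.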

Now we can integrate the reaction kinetics, and plot the trajectory. We'll overlay the Michaelis-Menten/Briggs-Haldane kinetics with dotted lines on top of the full kinetics (solid).

traj_euler_full = integrate_euler_full(default_kinetic_params)
traj_euler_mm = integrate_euler_MM(default_kinetic_params)

We can plainly see the validity of the Quasi-Steady-State Approximation (QSSA) in action in the trajectory: Enzyme E and Substrate S rapidly form Enzyme-Substrate complex ES, the concentration of which remains relatively constant throughout the course of the reaction (recall the QSSA is the approximation that $\dESdt = 0$). Thus, the Michaelis-Menten/Briggs-Haldane product concentration trajectory P_MM well approximates the full kinetics trajectory for the concentration of product P, since the requisite assumptions are valid, namely, (1) $[\mathrm{S_0}] \gg [\mathrm{E_0}]$ and (2) $\kon$, $\koff$ $\gg$ $\kcat$, $\kuncat$.

In practice, Michaelis-Menten/Briggs-Haldane kinetics are often assumed by default, risking the possibility of their misapplication. Let's take this opportunity to explore how the MM/BH kinetics diverge from the full kinetics when we violate the requisite assumptions.

Breaking the Michaelis-Menten/Briggs-Haldane Assumptions: Initial Substrate:Enzyme Ratio

Suppose first that the number of molecules of substrate is not much greater than the number of molecules of enzyme, which is a plausible regime for certain reactions in vivo.

initial_conditions = {
    'S_0': 2e3,
    'E_0': 1e3,
    'ES_0': 0.0,
    'P_0': 0.0
}

Then P_MM worsens significantly as an estimate of P.

Breaking the Michaelis-Menten/Briggs-Haldane Assumptions: Fast Enzyme-Substrate Complex Kinetics

Suppose further that the rates of association and dissociation of enzyme with substrate are not substantially faster than those of enzyme with product.

kinetic_params = {
    'k_on': 0.05,
    'k_off': 1,
    'k_cat': 50,
    'k_uncat': 0.5
}

kinetic_params['k_ms'] = k_ms(kinetic_params)
kinetic_params['k_mp'] = k_mp(kinetic_params)

Then the Michaelis-Menten/Briggs-Haldane kinetics diverge further.

In each of these latter trajectories, the criteria to make the Michaelis-Menten/Briggs-Haldane approximation are violated, leading to poor approximations to the full kinetics. We belabor this point here because in the following, we will seek to infer the parameters of the kinetics, and our inference will fit poorly if we fit to inappropriate kinetic expressions.

Comparing Integrators

All of the above trajectories are generated by Euler's Method, the most intuitive ODE integration technique. Unfortunately, Euler's Method's naïvete has drawbacks:

  • The error is first-order in the timestep size, which is large compared to higher-order methods.
  • The method is slow, since the fixed timestep must be small enough to resolve the fastest dynamics throughout the entire integration.

A variety of faster and more accurate (albeit more complicated) integrators have been proposed, many of which have implementations in scipy's integrate package. Due to their superior speeds and accuracies, we'll use these methods during inference. As a sanity check, we compare our basic Euler Method solver to scipy's:
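The scipy-based solver might look something like the minimal sketch below, built on scipy.integrate.solve_ivp. The name integrate_scipy_full matches the helper referenced later in the post, but the solver choice (LSODA), tolerances, and output grid here are assumptions.

from scipy.integrate import solve_ivp

def integrate_scipy_full(kinetic_params, t_max=dt*steps, n_points=500, initial_conditions=default_initial_conditions):

    # sketch of a scipy-based integrator for the full mass-action kinetics
    k_on, k_off, k_cat, k_uncat = (kinetic_params[k] for k in ['k_on', 'k_off', 'k_cat', 'k_uncat'])

    def f(t, u):
        S, E, ES, P = u
        return [k_off * ES - k_on * E * S,
                k_off * ES - k_on * E * S + k_cat * ES - k_uncat * E * P,
                k_on * E * S - k_off * ES - k_cat * ES + k_uncat * E * P,
                k_cat * ES - k_uncat * E * P]

    u0 = [initial_conditions[k] for k in ['S_0', 'E_0', 'ES_0', 'P_0']]
    t_eval = np.linspace(0, t_max, n_points + 1)
    sol = solve_ivp(f, (0, t_max), u0, method='LSODA', t_eval=t_eval, rtol=1e-8, atol=1e-8)

    return pd.DataFrame(sol.y.T, columns=['S', 'E', 'ES', 'P'], index=np.around(sol.t, 6))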

The lack of deviation gives us confidence both integration techniques are accurate. Meanwhile,

f'our naïve code takes {round(euler_time, 2)}s, whereas the optimized scipy code takes {round(scipy_time, 4)}s to generate the same trajectory.'
'our naïve code takes 1.16s, whereas the optimized scipy code takes 0.0064s to generate the same trajectory.'

Inference

We have seen how the trajectory of the chemical system is a function of the kinetic parameters. We would now like to invert that function to recover the kinetic parameters from an observed trajectory.

Suppose we know the initial concentrations of Enzyme E and Substrate S, and we measure the concentration of product P over the course of the reaction, which yields the following dataset:
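For concreteness, here is one plausible way such a dataset could be generated (the notebook's own data-generation cell is not shown): integrate the full kinetics with the default parameters and record [P] at evenly-spaced timepoints. The choice of 51 timepoints, the helper name simulate_measurements (which reappears below), and the layout of the observations series are assumptions consistent with how they are used later.

obs_times = np.around(np.linspace(0, dt*steps, 51), 6)   # t in 0...0.5 s; 51 timepoints is an assumption

def simulate_measurements(kinetic_params):
    # noiseless [P] at the observation timepoints after t=0
    traj = integrate_scipy_full(kinetic_params)
    return traj['P'].reindex(obs_times, method='nearest').values[1:]

observations = pd.Series(np.concatenate([[0.0], simulate_measurements(default_kinetic_params)]),
                         index=obs_times, name='P')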

There are two families of approaches to solving this inverse problem. We will explore the simplest variant of each type.

Note: If we had measurements for $\dPdt$ for various concentrations of $[\mathrm{S}]$, $[\mathrm{P}]$, and $[\mathrm{E}]$ we could estimate the kinetic parameters via nonlinear regression, a much simpler approach, which some readers may be familiar with. Concretely, supposing we had a set of measurements for the variables in brown, a nonlinear regression would permit us to fit the parameters in blue: $$\color{saddlebrown}{\dPdt}\color{black} = \frac{ \frac{\color{dodgerblue}{\kcat} \, \color{saddlebrown}{[\mathrm{E_T}]} \color{saddlebrown}{[\mathrm{S}]}} {\color{dodgerblue}{K_{m,\mathrm{S}}}} - \frac{\color{dodgerblue}{\koff} \, \color{saddlebrown}{[\mathrm{E_T}]} \color{saddlebrown}{[\mathrm{P}]}}{\color{dodgerblue}{K_{m,\mathrm{P}}}}} {1+\frac{\color{saddlebrown}{[\mathrm{S}]}}{\color{dodgerblue}{K_{m,\mathrm{S}}}} + \frac{\color{saddlebrown}{[\mathrm{P}]}}{\color{dodgerblue}{K_{m,\mathrm{P}}}}} $$ If we had assumed the reaction were irreversible ($k_{\mathrm{uncat}} = 0$), the Michaelis-Menten/Briggs-Haldane expression would have simplified further to $$\color{saddlebrown}{\dPdt}\color{black} = \frac{ \color{dodgerblue}{\kcat} \, \color{saddlebrown}{[\mathrm{E_T}]} \color{saddlebrown}{[\mathrm{S}]}} {\color{dodgerblue}{K_{m,\mathrm{S}}}\color{black} + \color{saddlebrown}{[\mathrm{S}]}} $$ Where $\color{dodgerblue}{\kcat} \, \color{saddlebrown}{[\mathrm{E_T}]}$ is often consolidated as $\color{dodgerblue}{V_{max}}$. It's difficult to imagine an assay for the activity of many enzymes in a cell at once. Careful Mass Spectrometry-based metabolomics and proteomics, or clever microscopy might measure $[\mathrm{S}]$, $[\mathrm{P}]$, and $[\mathrm{E}]$ but unlikely $\dPdt$ for many enzymes. We would presumably not be able to approximate $\dPdt$ via finite differences either, due to the likely sparsity of the measurement in time compared to the rates of the reactions. The computational approaches covered in this post are more elaborate (and expensive), but do not require measurement of $\dPdt$, making them better suited to existing assays.

Bayesian Approach: Inference by Sampling

[We assume the reader is familiar with Bayesian Inference in other settings.]

The goal of the Bayesian approach is to characterize a posterior distribution of kinetic parameters which could plausibly have generated the data. Bayes' theorem defines the posterior as the product of the prior and likelihood (up to a constant factor). Thus the Bayesian approach entails defining a prior and a likelihood for our problem.

Prior

If the kinetic parameters of our enzyme are not unlike the kinetic parameters of other enzymes, then the empirical distribution of kinetic parameters of other enzymes is a good prior for the parameters of our enzyme.

Since databases of observed enzyme kinetic parameters (e.g. BRENDA, SabioRK) appear to be unreliable, we'll use a previously curated set of kinetic parameters (from the supplement of Bar-Even et al.).

This database lists $k_{\mathrm{m}}$ and $\kcat$ for both "forwards" and "reverse" reactions, with respect to the direction biologists believe is "productive". From it we can extract distributions for $\kms$ and $\kcat$ from reactions in the forwards direction, and for $\kmp$ and $\koff$ from reactions in the reverse direction.

This plot is surprising: according to this database, enzymes appear to have roughly equal binding affinity for their substrates and products.

On the other hand, they have a fairly strong preference for catalyzing the reaction biologists think of as forwards (~10x).

These empirical distributions over $\kms$ and $\kcat$ in the forwards direction and $\kmp$ and $\koff$ in the reverse direction look sufficiently like normals in log space that we'll treat them as lognormals. However, we would like our inference procedure to estimate $\kon$, $\koff$, $\kcat$, and $\kuncat$. We can rearrange the expressions for $\kms$ and $\kmp$ to get expressions for the two parameters we're missing:

$$ \kon = \frac{\koff + \kcat}{\kms} \quad \mathrm{and} \quad \kuncat = \frac{\koff + \kcat}{\kmp}$$

Conveniently, the ratio of lognormal variables $\frac{X_1}{X_2}$ is also lognormal with $\mu_{1/2} = \mu_1 - \mu_2$ and $ \sigma^2_{1/2} = \sigma^2_1 + \sigma^2_2 - 2\sigma_{X_1, X_2}$. In order to use that fact, we approximate the sum $\koff + \kcat$ as log-normally distributed (a sum of lognormals is not exactly lognormal, but this is a common approximation), and compute its mean and variance empirically.

kcat_plus_koff = pd.Series(np.repeat(empirical_kcat.values, len(empirical_koff)) +
                           np.tile(empirical_koff.values, len(empirical_kcat)))

log_kcat_plus_koff_mean = np.log(kcat_plus_koff).mean()
log_kcat_plus_koff_var = np.log(kcat_plus_koff).var()

This permits us to produce empirical distributions for $\kon$ and $\kuncat$,

log_kon_normal = scipy.stats.norm(loc=log_kcat_plus_koff_mean-log_empirical_kms.mean(),
                                  scale=sqrt(log_kcat_plus_koff_var+log_empirical_kms.var()))

log_kuncat_normal = scipy.stats.norm(loc=log_kcat_plus_koff_mean-log_empirical_kmp.mean(),
                                     scale=sqrt(log_kcat_plus_koff_var+log_empirical_kmp.var()))

which, along with our empirical distributions for $\koff$ and $\kcat$, define a prior over the 4 kinetic parameters we wish to infer.

We might ask whether these are correlated lognormals...

...Not enough to include covariances in the prior. We set the prior covariance to be a diagonal matrix:
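A minimal sketch of what this prior might look like in code is below. It assumes log_koff_normal and log_kcat_normal were fit to the empirical $\koff$ and $\kcat$ distributions in the same way log_kon_normal and log_kuncat_normal were derived above (they are used again later). The helper names prior_logpdf, sample_prior, and exp_params appear in the MCMC code below; these definitions are plausible reconstructions, not the notebook's originals.

from scipy.stats import multivariate_normal

prior_mean = np.array([log_kon_normal.mean(), log_koff_normal.mean(),
                       log_kcat_normal.mean(), log_kuncat_normal.mean()])
prior_cov = np.diag([log_kon_normal.var(), log_koff_normal.var(),
                     log_kcat_normal.var(), log_kuncat_normal.var()])
prior_dist = multivariate_normal(mean=prior_mean, cov=prior_cov)

def prior_logpdf(k_on, k_off, k_cat, k_uncat):
    # log-density of a set of log-parameters under the prior
    return prior_dist.logpdf([k_on, k_off, k_cat, k_uncat])

def sample_prior():
    # draw one set of log-parameters from the prior
    return dict(zip(['k_on', 'k_off', 'k_cat', 'k_uncat'], prior_dist.rvs()))

def exp_params(log_θ):
    # map log-parameters back to rate constants before integrating
    return {k: np.exp(v) for k, v in log_θ.items()}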

Now that we have a prior, let's examine where the default parameters introduced in §2.1 land in this distribution. I had claimed they were "typical".

Likelihood

We need to define a likelihood $p(D|\theta)$ which measures the probability of producing the observed data given settings of the kinetic parameters $\theta = \{\kon, \koff, \kcat, \kuncat\}$. Our data $D = \{\mathrm{obs}_{[\mathrm{P}]}(t) \, ; t \in 0...0.5\}$ are an observed trajectory of concentrations of reaction product P. Each setting of the kinetic parameters corresponds to a trajectory of concentrations of P (via a numerical integration). Intuitively, parameter sets which result in trajectories very near the observed trajectory are more likely. Therefore, our likelihood should measure the distance between the observed $\{\mathrm{obs}_{[\mathrm{P}]}(t) \, ; t \in 0...0.5 \}$ and predicted $\{ u_{[\mathrm{P}]}(t, \theta) \, ; t \in 0...0.5\}$.

How far should the predicted trajectory be allowed to stray from the measured $\{\mathrm{obs}_{[\mathrm{P}]}(t) \}$? The likelihood is really our statement about the presumed noise in our measurements. If we believe our measurements to be noiseless, then our likelihood should concentrate tightly around our measurements (a dirac $\delta$ in the limit), and we would only admit kinetic parameters that interpolate the observed $\{\mathrm{obs}_{[\mathrm{P}]}(t) \}$ almost exactly. In reality, no measurement is noiseless, so we propose the following noise model:

Supposing the detection of each molecule of P is an independent binary random variable with error rate $\sigma$, the aggregate random variable $\mathrm{N}_{[\mathrm{P}]}(t)$ is Gaussian-distributed $\sim \mathcal{N}( \mathrm{obs}_{[\mathrm{P}]}(t), \, \sigma \cdot \sqrt{ \mathrm{obs}_{[\mathrm{P}]}(t)} )$. The variance of the Gaussian grows as the square root of the mean, via a Central Limit Theorem argument. We can represent this noise model (and consequently, likelihood) visually as:

σ = 5 # arbitrary magic number represents detection noise level

Concretely, the likelihood is the product of the Gaussian marginals centered around the measurements. These form a multivariate normal with diagonal covariance, since we don't model correlations between timepoints.

$$p(D|\theta) = \displaystyle \prod_{t\in\mathrm{obs}} p_t \left(u_{[\mathrm{P}]}(t, \theta)\right) \textrm{ where } p_t \textrm{ is the probability density function of } \mathrm{N}_{[\mathrm{P}]}(t) \sim \mathcal{N} \left( \mathrm{obs}_{[\mathrm{P}]}(t), \, \sigma \cdot \sqrt{ \mathrm{obs}_{[\mathrm{P}]}(t) } \right)$$

Which leaves us with a "hyperparameter" $\sigma$, which we have set arbitrarily above.

likelihood_dist = multivariate_normal(mean=observations.values[1:], cov=σ * np.diag(sqrt(observations.values[1:])))

def likelihood_logpdf(ut): return likelihood_dist.logpdf(ut)

Note: We will supply noiseless measurements to our inference algorithms. However, our inference procedures will assume noise in the measurements.

Metropolis-Hastings

We can now evaluate the prior $p(\theta)$ and the likelihood $p(D|\theta)$ of kinetic parameters $\theta = \{\kon, \koff, \kcat, \kuncat\}$. Those two distributions permit us to elaborate a Markov Chain Monte Carlo (MCMC) routine to sample from the posterior $p(\theta|D) \propto p(D|\theta) \cdot p(\theta)$. The algorithm is as follows:

Repeat:

  1. Draw kinetic parameters from the proposal distribution.
  2. Integrate the system with the proposed kinetic parameters.
  3. Evaluate the likelihood of the trajectory generated in step 2.
  4. Accept/Reject the proposal by a Metropolis-Hastings criterion.
  5. Append the current kinetic parameters to the Markov Chain.
  6. Construct a proposal distribution around the current kinetic parameters.

Since the likelihood assigns most of the probability mass to a fairly narrow region of parameter space, most parameter sets have extremely low probability. In order to preserve some numerical stability, we log-transform the usual Metropolis-Hastings expressions. Typically, $π_t = \mathrm{likelihood\_pdf}(u_t) \cdot \mathrm{prior\_pdf}(θ_t)$ and the acceptance criterion is $\frac{π_{t+1}}{π_t} > \mathrm{rand}([0,1])$. In log space, the acceptance criterion becomes $\log(π_{t+1}) - \log(π_t) > \log(\mathrm{rand}([0,1]))$ with $\log(π_t) = \mathrm{likelihood\_logpdf}(u_t) + \mathrm{prior\_logpdf}(θ_t)$.

def MH_MCMC(chain_length=1e3):

    θt = sample_prior()
    ut = simulate_measurements(exp_params(θt))
    πt = likelihood_logpdf(ut) + prior_logpdf(**θt)
    if all(ut == 0): return MH_MCMC(chain_length)

    cov = np.eye(4) * 5e-4
    i = 0
    accept_ratio = 0
    chain = []
    samples = []

    while i < chain_length:

        θtp1 = proposal(θt, cov)
        utp1 = simulate_measurements(exp_params(θtp1))
        πtp1 = likelihood_logpdf(utp1) + prior_logpdf(**θtp1)

        if πtp1 - πt > np.log(np.random.rand()):

            θt, ut, πt = θtp1, utp1, πtp1
            accept_ratio += 1

        chain.append(θt)
        samples.append(ut)

        i += 1

        if i % 100 == 0 and i > 300:
            # cov = pd.DataFrame(chain[100:]).cov()
            print(i, end='\r')

    chain = pd.DataFrame(chain)
    samples = pd.DataFrame(np.hstack((np.zeros((len(chain), 1)), samples)), columns=observations.index)
    accept_ratio = accept_ratio/chain_length

    return chain, samples, accept_ratio

Our proposal density for the time being can be a simple isotropic gaussian around the current parameters. We could update our Gaussian's covariance over the course of sampling, but that proved to be unhelpful for this problem.

def proposal(θt, cov):

    μ = [θt['k_on'], θt['k_off'], θt['k_cat'], θt['k_uncat']]

    θtp1 = dict(zip(['k_on', 'k_off', 'k_cat', 'k_uncat'], np.random.multivariate_normal(μ, cov)))

    return θtp1

Now let's put it into practice:

chain_length = 1e3
chain, samples, accept_ratio = MH_MCMC(chain_length=chain_length)
print('accept_ratio:', accept_ratio)
accept_ratio: 0.209
plot_chain(chain)
plot_samples(samples)
chain_1, samples_1, accept_ratio_1 = MCMC_run()
accept_ratio: 0.288
chain_2, samples_2, accept_ratio_2 = MCMC_run()
accept_ratio: 0.285
chain_3, samples_3, accept_ratio_3 = MCMC_run()
accept_ratio: 0.302

The above chains tell us something interesting: it appears to be possible to closely fit the observed data with very different parameter sets than the ones used to generate the observed trajectory -- except for $k_{\mathrm{cat}}$, which must be accurate. There must exist some subspace of parameters which yield similar $[\mathrm{P}](t)$ curves. We'll explore the topic of parameter identifiability in a future blog post.

Frequentist Approach: Inference by Optimization

In the previous section, we conducted a random walk in parameter space, biasing our random walk towards regions of the parameter space where both the prior probability and likelihood were greater. After a certain number of samples (the "burn in" phase), the samples from our random walk could be said to be drawn from the (exact) posterior distribution over the kinetic parameters.

An alternative approach begins with another premise:

  • Suppose we want to incorporate no prior knowledge, and let the timeseries data alone govern our determination of our enzyme's kinetic parameters.
  • Suppose as well that instead of searching for a distribution of plausible parameters, we're only interested in finding the single most likely set of parameters.

These two choices recast the inference task as an optimization problem. We'll explore two approaches to optimize the kinetic parameters to fit the observed data.

Forward Sensitivities

Optimization problems require an objective, such as minimizing a loss (or cost) function. Let's use the conventional squared error between our observations $\mathrm{obs}(t)$ and our trajectory $u(t, \theta)$ at the observed timepoints, scaled in accordance with the basic noise model we described in §3.1.2:
$G(u(t, \theta)) = \displaystyle\sum_{t \in \mathrm{obs}} \left(\frac{ \mathrm{obs}(t) - u(t, \theta)}{\sigma \cdot \sqrt{\mathrm{obs}(t)}}\right)^2$.

def loss(u):
    if not all(observations.index.isin(u.index)): return np.nan
    return sum(((observations - u.loc[observations.index, 'P'])/(σ * np.sqrt(observations))).dropna()**2)

In order to optimize $\theta$ with respect to our loss function $G$, we need a means to evaluate the gradient of the loss with respect to the parameters. Recall the gradient is the vector of first derivatives of $G$ with respect to each parameter $\theta_i$. The simplest way to compute each entry of the gradient ($\frac{dG(u(t, \theta))}{d\theta_i}$) would be to evaluate $G$ slightly above and below $\theta_i$ to numerically approximate the derivative. Concatenating those (for each $\theta_i$) would yield a numerical approximation of the gradient vector. This technique is called Finite Differences, and it's valid, though impractical. We'll implement it to check the validity of gradients we compute by other means.

def finite_differences(_, θ):

    grad = {}
    for k,v in θ.items():
        ϵ = np.sqrt(v * 1e-15)
        grad[k] = (loss(integrate_scipy_full({**θ, **{k:v+ϵ}})) - loss(integrate_scipy_full({**θ, **{k:v-ϵ}}))) / (2*ϵ)

    return grad

Beyond Finite Differences, we might try to analytically derive an expression for the gradient:

$$\frac{d}{d\theta_i}G(u(t, \theta)) = \frac{d}{d\theta_i} \sum_{t \in \mathrm{obs}} \left(\frac{ \mathrm{obs}(t) - u(t, \theta)}{\sigma \cdot \sqrt{\mathrm{obs}(t)}}\right)^2 = \sum_{t \in \mathrm{obs}} \left[ 2\left(\frac{ \mathrm{obs}(t) - u(t, \theta)}{(\sigma \cdot \sqrt{\mathrm{obs}(t)})^2}\right) \cdot \frac{-du(t, \theta)}{d\theta_i} \right]$$

However, the quantity $\frac{du(t, \theta)}{d\theta_i}$ (called the sensitivity of the solution to a parameter) is not immediately available. We can derive it as follows:

Our original differential equation is $\frac{du(t, \theta)}{dt} = f(u(t, \theta), \theta)$. If we take $\frac{\partial}{\partial\theta_i} \left[ \frac{du(t, \theta)}{dt} \right] = \frac{\partial}{\partial\theta_i} \left[ f(u(t, \theta), \theta) \right]$, we can rearrange as $\frac{d}{dt} \left[ \frac{\partial u(t, \theta)}{\partial\theta_i} \right] = \frac{\partial}{\partial\theta_i} \left[ f(u(t, \theta), \theta) \right]$ and then integrate over $t$ for

$$\int_{t_0}^T\frac{d}{dt} \left[ \frac{\partial u(t, \theta)}{\partial\theta_i} \right]\mathrm{d}t = \int_{t_0}^T\frac{\partial}{\partial\theta_i} \left[ f(u(t, \theta), \theta) \right]\mathrm{d}t = \int_{t_0}^T \left[ \frac{\partial f}{\partial u} \Big|_{u(t, \theta), \theta} \cdot \frac{\partial u}{\partial \theta_i} + \frac{\partial f}{\partial \theta_i} \Big|_{u(t, \theta), \theta} \right] \mathrm{d}t$$

What we've done is define an ODE whose solution (integral) is that missing quantity, the sensitivity $\frac{du(t, \theta)}{d\theta_i}$. This ODE is aptly named the forward sensitivity ODE. We can integrate both the original ODE and the sensitivity ODE forwards in time together from $t_0$ to $T$.

But first, we need to understand the constituent expressions: $\frac{\partial f}{\partial u} \Big|_{u(t, \theta), \theta}$, $\frac{\partial u}{\partial \theta_i}$, and $\frac{\partial f}{\partial \theta_i} \Big|_{u(t, \theta), \theta}$.

Recall,

$$\frac{du}{dt} = \frac{d}{dt}\begin{bmatrix}[\mathrm{S}] \\ [\mathrm{E}] \\ [\mathrm{ES}] \\ [\mathrm{P}] \end{bmatrix} = \begin{bmatrix} k_{\mathrm{off}}[\mathrm{ES}] - k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] \\ k_{\mathrm{off}}[\mathrm{ES}] - k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] + k_{\mathrm{cat}}[\mathrm{ES}] - k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \\ - k_{\mathrm{off}}[\mathrm{ES}] + k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] - k_{\mathrm{cat}}[\mathrm{ES}] + k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \\ k_{\mathrm{cat}}[\mathrm{ES}] - k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \end{bmatrix} = f(u(t, \theta), \theta)$$

$\frac{\partial f}{\partial u} \Big|_{u(t, \theta), \theta}$ is the derivative of the derivative $f$ with respect to the state $u$. Since both the state $u$ and its derivative $f$ are 4D, this quantity is a 4x4 Jacobian:

$$\frac{\partial f}{\partial u} = \begin{bmatrix}\frac{\partial f}{\partial [\mathrm{S}]} & \frac{\partial f}{\partial [\mathrm{E}]} & \frac{\partial f}{\partial [\mathrm{ES}]} & \frac{\partial f}{\partial [\mathrm{P}]} \end{bmatrix} = \begin{bmatrix} -k_{\mathrm{on}}[\mathrm{E}] & -k_{\mathrm{on}}[\mathrm{S}] & k_{\mathrm{off}} & 0 \\ -k_{\mathrm{on}}[\mathrm{E}] & -k_{\mathrm{on}}[\mathrm{S}] - k_{\mathrm{uncat}}[\mathrm{P}] & k_{\mathrm{off}} + k_{\mathrm{cat}} & -k_{\mathrm{uncat}}[\mathrm{E}] \\ k_{\mathrm{on}}[\mathrm{E}] & k_{\mathrm{on}}[\mathrm{S}] + k_{\mathrm{uncat}}[\mathrm{P}] & -k_{\mathrm{off}} - k_{\mathrm{cat}} & k_{\mathrm{uncat}}[\mathrm{E}] \\ 0 & -k_{\mathrm{uncat}}[\mathrm{P}] & k_{\mathrm{cat}} & -k_{\mathrm{uncat}}[\mathrm{E}] \end{bmatrix}$$

def f_u_Jacobian(S, E, ES, P, k_on, k_off, k_cat, k_uncat):

    return np.array([
        [-k_on * E, -k_on * S, k_off, 0],
        [-k_on * E, -k_on * S - k_uncat * P, k_off + k_cat, -k_uncat * E],
        [k_on * E, k_on * S + k_uncat * P, -k_off - k_cat, k_uncat * E],
        [0, -k_uncat * P, k_cat, -k_uncat * E]
    ])

$\frac{\partial f}{\partial \theta_i} \Big|_{u(t, \theta), \theta}$ is the derivative of the derivative $f$ with respect to one of the parameters $\theta_i$.

$$\frac{\partial f}{\partial k_{\mathrm{on}}} = \begin{bmatrix} -[\mathrm{E}][\mathrm{S}] \\ -[\mathrm{E}][\mathrm{S}] \\ [\mathrm{E}][\mathrm{S}] \\ 0 \end{bmatrix}, \qquad \frac{\partial f}{\partial k_{\mathrm{off}}} = \begin{bmatrix} [\mathrm{ES}] \\ [\mathrm{ES}] \\ -[\mathrm{ES}] \\ 0 \end{bmatrix}, \qquad \frac{\partial f}{\partial k_{\mathrm{cat}}} = \begin{bmatrix} 0 \\ [\mathrm{ES}] \\ -[\mathrm{ES}] \\ [\mathrm{ES}] \end{bmatrix}, \qquad \frac{\partial f}{\partial k_{\mathrm{uncat}}} = \begin{bmatrix} 0 \\ -[\mathrm{E}][\mathrm{P}] \\ [\mathrm{E}][\mathrm{P}] \\ -[\mathrm{E}][\mathrm{P}] \end{bmatrix}, \qquad $$
def f_k_on(S, E, ES, P): return np.array([[-E*S, -E*S, E*S, 0]]).T
def f_k_off(S, E, ES, P): return np.array([[ES, ES, -ES, 0]]).T
def f_k_cat(S, E, ES, P): return np.array([[0, ES, -ES, ES]]).T
def f_k_uncat(S, E, ES, P): return np.array([[0, -E*P, E*P, -E*P]]).T

# column names for a trajectory carrying the 4 state variables and their 16 sensitivities
cols = [v+u for u in ['', '_k_on', '_k_off', '_k_cat', '_k_uncat'] for v in ['S', 'E', 'ES', 'P']]

$\frac{\partial u}{\partial \theta_i}$ is the variable of integration, which means we only need to define a boundary condition for it, in this case, an initial value:

$$ \frac{\partial u}{\partial \theta_i} \Big|_{t_0} = \frac{\partial}{\partial \theta_i} u(0, \theta) $$

But since in our case, our fixed initial condition $u(0, \theta)$ does not depend on $\theta$, $\frac{\partial u}{\partial \theta_i} \Big|_{t_0} = 0$.

Now we're ready to augment our original Euler method to compute both $\int_{t_0}^T\frac{du(t, \theta)}{dt} \mathrm{d}t$ as before and add $\int_{t_0}^T\frac{\partial}{\partial\theta_i} \left[ f(u(t, \theta), \theta) \right] \mathrm{d}t$.
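A sketch of what that augmented integrator might look like is below (the function name is illustrative). Alongside the state $u$, we carry the $4 \times 4$ sensitivity matrix $\frac{\partial u}{\partial \theta}$, whose column $j$ obeys $\frac{d}{dt}\frac{\partial u}{\partial \theta_j} = \frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_j} + \frac{\partial f}{\partial \theta_j}$ with initial value 0.

def integrate_euler_full_sensitivities(kinetic_params, dt=dt, steps=steps, initial_conditions=default_initial_conditions):

    # Euler integration of the state u together with the 4x4 sensitivity matrix du/dθ
    k_on, k_off, k_cat, k_uncat = (kinetic_params[k] for k in ['k_on', 'k_off', 'k_cat', 'k_uncat'])
    u = np.array([initial_conditions[k] for k in ['S_0', 'E_0', 'ES_0', 'P_0']], dtype=float)
    dudθ = np.zeros((4, 4))                       # columns: du/dk_on, du/dk_off, du/dk_cat, du/dk_uncat
    traj = [np.concatenate([u, dudθ.T.ravel()])]

    for _ in range(int(steps)):

        S, E, ES, P = u
        f = np.array([k_off * ES - k_on * E * S,
                      k_off * ES - k_on * E * S + k_cat * ES - k_uncat * E * P,
                      k_on * E * S - k_off * ES - k_cat * ES + k_uncat * E * P,
                      k_cat * ES - k_uncat * E * P])
        J = f_u_Jacobian(S, E, ES, P, k_on, k_off, k_cat, k_uncat)
        dfdθ = np.hstack([f_k_on(S, E, ES, P), f_k_off(S, E, ES, P),
                          f_k_cat(S, E, ES, P), f_k_uncat(S, E, ES, P)])

        u = u + f * dt
        dudθ = dudθ + (J @ dudθ + dfdθ) * dt      # sensitivity ODE: d/dt (du/dθ) = (df/du)(du/dθ) + df/dθ

        traj.append(np.concatenate([u, dudθ.T.ravel()]))

    return pd.DataFrame(traj, columns=cols, index=np.around(np.linspace(0, dt*steps, int(steps)+1), 6))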

Unfortunately, as before, our simple-minded python code, although conceptually helpful, is too slow to use repeatedly inside a loop. Let's once again re-structure this code for scipy.

Our naïve code takes 22.08s, whereas the optimized scipy code takes 0.1982s to generate the same trajectory.

Recall, computing the sensitivity of the solution with respect to the parameters $\frac{du(t, \theta)}{d\theta_i}$ was in service of computing the gradient of our loss function with respect to the parameters:

$$\frac{dG(u(t, \theta))}{d\theta_i} = \sum_{t \in \mathrm{obs}} \left[ 2\left(\frac{ \mathrm{obs}(t) - u(t, \theta)}{(\sigma \cdot \sqrt{\mathrm{obs}(t)})^2}\right) \cdot \frac{-du(t, \theta)}{d\theta_i} \right]$$

Now, since we set up this problem such that we only observe $\mathrm{obs}_{[\mathrm{P}]}(t)$, we are only able to compare the integrated kinetics of $ u_{[\mathrm{P}]}(t, \theta)$ and so our gradient expression becomes:

$$\frac{dG(u(t, \theta))}{d\theta_i} = \sum_{t \in \mathrm{obs}} \left[ 2 \left( \frac{(\mathrm{obs}_{[\mathrm{P}]}(t) - u_{[\mathrm{P}]}(t, \theta))}{(\sigma \cdot \sqrt{\mathrm{obs}_{[\mathrm{P}]}(t)})^2} \right) \cdot \frac{ -d u_{[\mathrm{P}]}(t, \theta)}{d\theta_i} \right]$$
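As a sketch of how this gradient could be assembled from a trajectory carrying the sensitivity columns defined in cols above (the function name is illustrative; it assumes the observation timepoints lie on the integration grid, and the unused θ argument is kept only for signature compatibility with finite_differences):

def gradient_of_loss_from_sensitivities(traj, θ=None):

    # assemble dG/dθ_i from the product residuals and the 'P_k_*' sensitivity columns
    t_obs = observations.index[1:]
    obs = observations.values[1:]
    residual = obs - traj.loc[t_obs, 'P'].values
    weights = 2 * residual / (σ * np.sqrt(obs))**2

    return {k: np.sum(weights * -traj.loc[t_obs, 'P_' + k].values)
            for k in ['k_on', 'k_off', 'k_cat', 'k_uncat']}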

We notice that $\frac{ d u_{[\mathrm{P}]}(t, \theta)}{dk_{\mathrm{uncat}}}$ reaches $O(10^6)$. This disparity in the scales of the sensitivities may arise because the parameters span many orders of magnitude, so the response of the system to small perturbations in some parameters may be much greater than in others, especially integrated over time.

We've set ourselves the task of optimizing the 4 ODE parameters to minimize the squared error with respect to an observed timeseries. We need somewhere to begin optimization from. Let's use the means of the prior distributions (from §3.1.1) for each parameter as a starting point.

θ_0 = {'k_on': exp(log_kon_normal.mean()),
       'k_off': exp(log_koff_normal.mean()),
       'k_cat': exp(log_kcat_normal.mean()),
       'k_uncat': exp(log_kuncat_normal.mean())}

Our gradient descent routine iterates a loop:

  1. Integrate the system ODE and sensitivity ODE with the current parameters.
  2. Compute the gradient of the loss with the current parameters.
  3. Update the parameters with a gradient step.

Our optimization routine will use one extra trick: a momentum term. This amounts to updating the gradient step expression from

$$ \theta_{t+1} = \theta_t - \eta \cdot \frac{dG(u(t, \theta))}{d\theta} \qquad \mathrm{to} \qquad \begin{aligned} v_{t+1} &= \gamma \cdot v_t + \frac{dG(u(t, \theta))}{d\theta} \\ \theta_{t+1} &= \theta_t - \eta \cdot v_{t+1} \end{aligned}$$

This expression has two hyperparameters: $\eta$, the learning rate, and $\gamma$, the momentum parameter. We'll set $\eta = 0.01$ at first and decrease it as we converge towards a minimum. We set $\gamma$ to the typical 0.9.
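A simplified sketch of such a routine is below. The real optimize_by_gradient_descent in the notebook also decays $\eta$, records more diagnostics, and presumably defaults to a faster scipy-based sensitivity integrator; the defaults shown here (including the iteration count) are assumptions.

def optimize_by_gradient_descent(θ_0, integrate=integrate_euler_full_sensitivities,
                                 gradient=gradient_of_loss_from_sensitivities,
                                 η=0.01, γ=0.9, iterations=100):

    θ = dict(θ_0)
    v = {k: 0.0 for k in θ}
    record = []

    for _ in range(iterations):

        traj = integrate(θ)            # 1. integrate the system and sensitivity ODEs
        grad = gradient(traj, θ)       # 2. gradient of the loss at the current parameters

        for k in θ:                    # 3. momentum update, then gradient step
            v[k] = γ * v[k] + grad[k]
            θ[k] = θ[k] - η * v[k]

        record.append({**θ, 'loss': loss(traj)})

    return pd.DataFrame(record)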

optimization_record = optimize_by_gradient_descent(θ_0)
{'k_on': 1.0489415890739007, 'k_off': 9.301874926576735, 'k_cat': 26.789395694394965, 'k_uncat': 0.8975846605968499} 
	Loss:  2.135936  ->  2.135936 
	|Gradient|:  1.756622982117719 
	|v_t|:  1.756622982117719 
	η:  3.1622776601683754e-12 
	 {'k_on': 3.6718623703555924, 'k_off': 9.116898206276698, 'k_cat': 27.228235334781886, 'k_uncat': 0.7150114110837106}
plot_optimization_trajectory(optimization_record)

We said initially we didn't want to rely on prior knowledge of typical enzyme parameters in the sensitivity-based approach, but we've used them above to define our starting point for optimization, $\theta_0$. In many settings, we may not have a good estimate of where to begin from, and the conventional approach is to try multiple seeds. Let's try outlandishly wrong starting points, and see whether our gradient descent routine is able to find parameters which fit the data.

optimization_records = get_or_compute_optimization_runs()

As we did not add noise to our measurements, a loss of 0 is possible. However, none of our optimizations reach 0 loss. We would anticipate the magnitude of our gradients to approach 0 as well, as optimization proceeds successfully. Let's dig into these optimization runs a little further...

Some of our optimizations succeed in fitting the data, whereas others fail to. Once again, even those that fit the data quite well (and therefore approach zero loss) do not recover the original parameters the trajectory was generated with. This result warrants a closer inspection of the loss surface.

Visualizing $G(u(t, \theta))$, which is a function from $\mathbb{R}^4$ to $\mathbb{R}$, is tricky, but we can get a feel for it by examining orthogonal 2D slices.

First, we need to evaluate the loss everywhere on a grid.
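A sketch of what that grid evaluation might look like is below (the notebook's get_or_compute_* helpers presumably also cache results to disk; the parameter ranges shown here are assumptions).

import itertools

def compute_loss_hypergrid(grid_axes):

    # evaluate the loss at every point of a 4D parameter grid
    names = ['k_on', 'k_off', 'k_cat', 'k_uncat']
    shape = [len(grid_axes[n]) for n in names]
    hypergrid = np.zeros(shape)

    for idx in itertools.product(*(range(s) for s in shape)):
        θ = {n: grid_axes[n][i] for n, i in zip(names, idx)}
        hypergrid[idx] = loss(integrate_scipy_full(θ))

    return hypergrid

# e.g. an 11-point log10-spaced grid per parameter (these ranges are assumptions)
global_grid_axes = {'k_on': np.logspace(-3, 2, 11), 'k_off': np.logspace(-1, 4, 11),
                    'k_cat': np.logspace(-1, 4, 11), 'k_uncat': np.logspace(-4, 1, 11)}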

global_indices, global_loss_hypergrid = get_or_compute_global_loss_hypergrid()
global_loss_hypergrid.shape
(11, 11, 11, 11)
optimum_neighborhood_indices, optimum_neighborhood_loss_hypergrid = get_or_compute_optimum_neighborhood_loss_hypergrid(θ_opt)
optimum_neighborhood_loss_hypergrid.shape
(15, 15, 15, 15)

We'll first explore the loss function on a grid in $\log_{10}$ space. This will give us a feel for the global shape of the loss surface, so long as we're mindful of the $\log_{10}$-distortion. We can also project the optimization trajectories (as well as the global optimum) onto each 2D slice, while remembering that the trajectories are 4D, so their projections may not appear to be optimizing the loss in each 2D slice. We can also take the opportunity to visualize the MCMC trajectories as a comparison (in red).

Note: Run the notebook to interactively explore this landscape with sliders that translate the slices through the 4D space.

Note: The contour plots above suggest local optima, but those are artifacts of plotly's contour function and the sparsity of our grid.

To understand why our optimizations are failing to reach the optimum, we can also explore the loss around the optimum, or around the termination points of each optimization. Let's see the neighborhood of the optimum.

The optimum falls in a valley, which, once attained, likely has a very small gradient. This corresponds to the valley we saw in the $\log_{10}$-scaled global loss landscape above. Small gradients can be difficult to compute accurately with finite numerical precision, and our optimizer will stop once it's unable to distinguish which direction is downhill, because the loss surface is locally flat, and tiny steps in the direction of the approximate gradient do not decrease the loss.

Our parameters span many orders of magnitude. In our MCMC approach to this problem, we dealt with our kinetic parameters in log space. Let's take a quick detour and try gradient-based optimization for the log-parameters.

Forward Sensitivities in Log Space

Many of the expressions derived in the previous section remain the same when we optimize the loss $G$ with respect to the log-parameters. The Jacobian $\frac{\partial f}{\partial u} \Big|_{u(t, \theta), \theta}$ remains the same, however, the $\frac{\partial f}{\partial \theta} \Big|_{u(t, \theta), \theta}$ terms are replaced by $\frac{\partial f}{\partial \log(\theta)} \Big|_{u(t, \theta), \theta}$ terms:

$$\frac{\partial f}{\partial \log(k_{\mathrm{on}})} = \begin{bmatrix} -k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] \\ -k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] \\ k_{\mathrm{on}}[\mathrm{E}][\mathrm{S}] \\ 0 \end{bmatrix}, \qquad \frac{\partial f}{\partial \log(k_{\mathrm{off}})} = \begin{bmatrix} k_{\mathrm{off}}[\mathrm{ES}] \\ k_{\mathrm{off}}[\mathrm{ES}] \\ -k_{\mathrm{off}}[\mathrm{ES}] \\ 0 \end{bmatrix}, \qquad \frac{\partial f}{\partial \log(k_{\mathrm{cat}})} = \begin{bmatrix} 0 \\ k_{\mathrm{cat}}[\mathrm{ES}] \\ -k_{\mathrm{cat}}[\mathrm{ES}] \\ k_{\mathrm{cat}}[\mathrm{ES}] \end{bmatrix}, \qquad \frac{\partial f}{\partial \log(k_{\mathrm{uncat}})} = \begin{bmatrix} 0 \\ -k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \\ k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \\ -k_{\mathrm{uncat}}[\mathrm{E}][\mathrm{P}] \end{bmatrix}, \qquad $$
def f_log_k_on(S, E, ES, P, k_on): return np.array([[-k_on*E*S, -k_on*E*S, k_on*E*S, 0]]).T
def f_log_k_off(S, E, ES, P, k_off): return np.array([[k_off*ES, k_off*ES, -k_off*ES, 0]]).T
def f_log_k_cat(S, E, ES, P, k_cat): return np.array([[0, k_cat*ES, -k_cat*ES, k_cat*ES]]).T
def f_log_k_uncat(S, E, ES, P, k_uncat): return np.array([[0, -k_uncat*E*P, k_uncat*E*P, -k_uncat*E*P]]).T

We'll skip straight to the scipy implementation.

We'll once more start from the mean of the prior,

log_θ_0 = {'k_on': log_kon_normal.mean(),
           'k_off': log_koff_normal.mean(),
           'k_cat': log_kcat_normal.mean(),
           'k_uncat': log_kuncat_normal.mean()}
optimization_record = optimize_by_gradient_descent(log_θ_0, integrate=lambda log_θ: integrate_scipy_log_sensitivities(exp_params(log_θ)))
{'k_on': 0.04778164537946772, 'k_off': 2.2302159848479026, 'k_cat': 3.288006126168655, 'k_uncat': -0.1080478337126154} 
	Loss:  0.697513  ->  0.697513 
	|Gradient|:  0.5847699040405534 
	|v_t|:  0.5847699040405534 
	η:  9.999999999999989e-11 
	 {'k_on': 1.1154074189761212, 'k_off': 1.754124991901947, 'k_cat': 3.3975347971126704, 'k_uncat': 0}
plot_optimization_trajectory(optimization_record)

This seems promising. Let's once more start from many initial points.

log_optimization_records = get_or_compute_log_optimization_runs()
plot_optimization_records_losses_and_magnitudes(log_optimization_records)

Steps in log space can be quite large, often yielding much faster convergence. We see some trajectories still fail to approach optima.

Adjoint Method

There exists another way to evaluate the gradient of the loss with respect to the parameters. Our intention is not to improve the accuracy of the gradients, but rather to reduce the asymptotic computational complexity of evaluating the gradient, and thus decrease the runtime for optimizing the parameters of systems with many parameters.

Our enzyme kinetics system has 4 parameters, which is not many, so we will not experience a notable speedup in our optimizations, but including the approach is perhaps easier to illustrate granularly for a small system.

In the Forward Sensitivities approach, a system of Ordinary Differential Equations in 4 variables with 4 parameters had us integrating 16 sensitivity variables: $\frac{\partial [\mathrm{X}_i]}{\partial \theta_j}$ for each combination of $[\mathrm{X}_i]$ and $\theta_j$. The number of sensitivity ODE expressions scales as $O(|\mathrm{X}| \cdot |\theta|)$. The Adjoint Method for ODEs will permit us to reduce that to $O(|\mathrm{X}| + |\theta|)$. Let's walk through the derivation. Fair warning: it's lengthy (at least by my standards).

Derivation:

We had chosen our loss to be

$$G(u(t, \theta)) = \sum_{t \in \mathrm{obs}} \left(\frac{ \mathrm{obs}(t) - u(t, \theta)}{\sigma \cdot \sqrt{\mathrm{obs}(t)}}\right)^2 $$

Re-write $G$ as an integral of the instantaneous loss over continuous time, for which we can use dirac deltas $\delta$

$$\begin{aligned}G(u(t, \theta)) &= \int_{t_0}^T \left[ \left(\frac{ \mathrm{obs}(t) - u(t, \theta)}{\sigma \cdot \sqrt{\mathrm{obs}(t)}}\right)^2 \cdot \sum_{t_i \in \mathrm{obs}}\delta(t_i-t) \right] \, \mathrm{d}t \\ &=\int_{t_{0}}^{T}g(u(t,\theta))\, \mathrm{d}t \end{aligned}$$

We want to specify the constraint that $\frac{du}{\mathrm{d}t} = f(u(t,\theta),\theta)$ throughout the domain, which we can do via Lagrange multipliers. Introduce the auxiliary vector variable $\lambda$

$$G(u(t, \theta)) =\int_{t_{0}}^{T}g(u(t,\theta))\, \mathrm{d}t+\int_{t_{0}}^{T}\lambda^{\intercal}(t)\left(\frac{du}{\, \mathrm{d}t}-f(u(t,\theta),\theta) \right) \, \mathrm{d}t$$

Now, take $\frac{\partial}{\partial \theta_i}$ to evaluate the iᵗʰ entry of the gradient. Apply the chain rule to find the derivative of the RHS.

$$\frac{\partial}{\partial\theta_i}G(u(t, \theta)) =\int_{t_{0}}^{T}\left(\frac{\partial g}{\partial \theta_i}+\frac{\partial g}{\partial u}\frac{\partial u}{\partial \theta_i}\right)\, \mathrm{d}t+\int_{t_{0}}^{T}\lambda^{\intercal}(t)\left(\frac{d}{d\theta_i}\frac{\partial u}{\partial t}-\left(\frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_i}+\frac{\partial f}{\partial \theta_i}\right)\right)\, \mathrm{d}t$$

Extract the last integral from that expression and split it into two.

$$\int_{t_{0}}^{T}\lambda^{\intercal}(t)\left(\frac{d}{d\theta_i}\frac{\partial u}{\partial t}-\left(\frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_i}+\frac{\partial f}{\partial \theta_i}\right)\right)\, \mathrm{d}t =\int_{t_{0}}^{T}\lambda^{\intercal}(t)\left(\frac{d}{d\theta_i}\frac{\partial u}{\partial t}\right)\, \mathrm{d}t-\int_{t_{0}}^{T}\lambda^{\intercal}(t)\left(\frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_i}+\frac{\partial f}{\partial \theta_i}\right)\, \mathrm{d}t$$

For the first of those two expressions, apply integration by parts

$$\int_{t_{0}}^{T}\lambda^{\intercal}(t)\left(\frac{d}{d\theta_i}\frac{\partial u}{\partial t}\right)\, \mathrm{d}t =\big[\lambda^{\intercal}(t)\frac{\partial u}{\partial \theta_i}\big]_{t_{0}}^{T}-\int_{t_{0}}^{T}\left(\frac{d\lambda^{\intercal}(t)}{\, \mathrm{d}t}\frac{\partial u}{\partial \theta_i}\right)\, \mathrm{d}t$$

Now form the resultant expression

$$\begin{aligned} \frac{d}{d\theta_i}G(u(t, \theta)) & =\int_{t_{0}}^{T}\left(\frac{\partial g}{\partial \theta_i}+\frac{\partial g}{\partial u}\frac{\partial u}{\partial \theta_i}\right)\, \mathrm{d}t+\left[\big[\lambda^{\intercal}(t)\frac{\partial u}{\partial \theta_i}\big]_{t_{0}}^{T}-\int_{t_{0}}^{T}\left(\frac{d\lambda^{\intercal}(t)}{\, \mathrm{d}t}\frac{\partial u}{\partial \theta_i}\right)\, \mathrm{d}t-\int_{t_{0}}^{T}\lambda^{\intercal}(t)\left(\frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_i}+\frac{\partial f}{\partial \theta_i}\right)\, \mathrm{d}t\right] \\ % & =\big[\lambda^{\intercal}(t)\frac{\partial u}{\partial \theta_i}\big]_{t_{0}}^{T}+\int_{t_{0}}^{T}\left(\frac{\partial g}{\partial \theta_i}+\frac{\partial g}{\partial u}\frac{\partial u}{\partial \theta_i}-\frac{d\lambda^{\intercal}(t)}{\, \mathrm{d}t}\frac{\partial u}{\partial \theta_i}-\lambda^{\intercal}(t)\left(\frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_i}+\frac{\partial f}{\partial \theta_i}\right)\right)\, \mathrm{d}t \end{aligned}$$

Now here's the magic step. If we require:

$$\begin{aligned} \frac{d\lambda^{\intercal}(t)}{\, \mathrm{d}t} &=-\frac{\partial f}{\partial u}\lambda^{\intercal}(t)+\frac{\partial g}{\partial u}^{\intercal}\\ % \lambda(T) &=0 \end{aligned}$$

Then our expression simplifies

$$\begin{aligned} \frac{d}{d\theta_i}G(u(t, \theta)) &=-\big[\lambda^{\intercal}(t)\frac{\partial u}{\partial \theta_i}\big]_{t_{0}}^{T}+\int_{t_{0}}^{T}\left(\frac{\partial g}{\partial \theta_i}+\frac{\partial g}{\partial u}\frac{\partial u}{\partial \theta_i}-\left(-\frac{\partial f}{\partial u}\lambda(t)+\frac{\partial g}{\partial u}^{\intercal}\right)\frac{\partial u}{\partial \theta_i}-\lambda^{\intercal}(t)\left(\frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_i}+\frac{\partial f}{\partial \theta_i}\right)\right)\, \mathrm{d}t \\ % & =-\big[0\cdot\frac{\partial u}{\partial \theta_i}\Big|_T-\lambda^{\intercal}(t_{0})\cdot\frac{\partial u}{\partial \theta_i}\Big|_{t_0}\big]+\int_{t_{0}}^{T}\left(\frac{\partial g}{\partial \theta_i}+\frac{\partial g}{\partial u}\frac{\partial u}{\partial \theta_i}+\frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_i}\lambda^{\intercal}(t)-\frac{\partial g}{\partial u}\frac{\partial u}{\partial \theta_i}-\lambda^{\intercal}(t)\frac{\partial f}{\partial u}\frac{\partial u}{\partial \theta_i}-\lambda^{\intercal}(t)\frac{\partial f}{\partial \theta_i}\right)\, \mathrm{d}t\\ % & =\int_{t_{0}}^{T}\left(\frac{\partial g}{\partial \theta_i}-\lambda^{\intercal}(t)\frac{\partial f}{\partial \theta_i}\right)\, \mathrm{d}t \end{aligned}$$

The Adjoint Method expression for the gradient of the loss $G$ with respect to the parameters $\theta$ is $\frac{dG(u(t, \theta))}{d\theta_i} = \int_{t_{0}}^{T}\left(\frac{\partial g}{\partial \theta_i}-\lambda^{\intercal}(t)\frac{\partial f}{\partial \theta_i}\right)\, \mathrm{d}t$. The savings over the forward sensitivities gradient expression results from the fact that $\lambda^{\intercal}(t)$ is the same for every $\theta_i$. The Adjoint Method expression contains three terms:

  • $\frac{\partial g}{\partial \theta_i} = 0$ in our case, because our loss function $G(u(t, \theta)) = \sum_t (\frac{ \mathrm{obs}(t) - u(t, \theta)}{\sigma \cdot \sqrt{\mathrm{obs}(t)}})^2$ contains no explicit $\theta$ term, as it might if our loss included a regularization term.
  • $\frac{\partial f}{\partial \theta_i}$ appeared in our forward sensitivity ODEs. It hasn't changed.
  • That leaves us with $\lambda^{\intercal}(t)$, which we need to compute. In order to cancel terms in our derivation, we had defined $\lambda^{\intercal}(t)$ via an ODE: $$\begin{aligned}\frac{d\lambda^{\intercal}(t)}{dt} & =-\frac{\partial f}{\partial u}\lambda^{\intercal}(t)+\frac{\partial g}{\partial u}^{\intercal}\\ % \lambda(T) & =0\end{aligned}$$ We can solve for $\lambda^{\intercal}(t)$ by integrating this "adjoint ODE", however, since the boundary condition (the "initial value") occurs at the last timepoint, we need to integrate this differential expression backwards in time. $$-\frac{d\lambda^{\intercal}(t)}{dt} =\frac{\partial f}{\partial u}\lambda^{\intercal}(t)-\frac{\partial g}{\partial u}^{\intercal}$$ This adjoint ODE includes yet more terms: $\frac{\partial f}{\partial u}$ and $\frac{\partial g}{\partial u}^{\intercal}$.
  • $\frac{\partial f}{\partial u}$ is the Jacobian, also unchanged from our forward sensitivity ODEs. Resolving the entries of this matrix requires we have $u(t, \theta)$. Thus in order to integrate the adjoint ODE we will need to have already integrated the system ODE.
  • Finally, $\frac{\partial g}{\partial u}^{\intercal}$ is the derivative of the instantaneous loss with respect to the state. Our instantaneous loss is zero everywhere except the timepoints we observe: $$\begin{aligned} g(u(t,\theta)) &= \left(\frac{ \mathrm{obs}(t) - u(t, \theta)}{\sigma \cdot \sqrt{\mathrm{obs}(t)}}\right)^2 \cdot \sum_{t_i \in \mathrm{obs}}\delta(t_i-t) \\ \frac{\partial}{\partial u}g(u(t,\theta)) &= \begin{cases} \frac{-2( \mathrm{obs}(t) - u(t, \theta))}{(\sigma \cdot \sqrt{\mathrm{obs}(t)})^2} & t \in \mathrm{obs}\\ 0 & t \notin \mathrm{obs} \end{cases} \end{aligned}$$

The Adjoint Method for ODEs is thus:

  1. Integrate the system ODE $\frac{du}{dt} = f(u(\theta, t), \theta)$, providing $u(t, \theta)$.
  2. Integrate the adjoint ODE $-\frac{d\lambda^{\intercal}(t)}{dt} =\frac{\partial f}{\partial u}\lambda^{\intercal}(t)-\frac{\partial g}{\partial u}^{\intercal}$ in reverse from $\lambda(T) = 0$ to $t_0$, providing $\lambda(t)$.
  3. Evaluate the integral $\frac{dG(u(t, \theta))}{d\theta_i} = \int_{t_{0}}^{T}\left(\frac{\partial g}{\partial \theta_i}-\lambda^{\intercal}(t)\frac{\partial f}{\partial \theta_i}\right)dt$

Inspecting this procedure, we notice the integral in step 3 can be computed concurrently with step 2 for some time-savings: $\frac{dG(u(t, \theta))}{d\theta_i} = \int_{T}^{t_{0}}\left(-\frac{\partial g}{\partial \theta_i}+\lambda^{\intercal}(t)\frac{\partial f}{\partial \theta_i}\right)dt$.

Let's implement it.
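As a minimal sketch of the adjoint gradient on the Euler grid (the function name is illustrative, and it takes the plain forward trajectory with columns S, E, ES, P): integrate $\lambda$ backwards, apply a jump to its P component at each observed timepoint, and accumulate $-\lambda^{\intercal}\frac{\partial f}{\partial \theta}$ along the way. Here $\lambda$ is treated as a column vector, i.e. the transpose of the row-vector convention used in the derivation above.

def gradient_of_loss_via_adjoint_euler(traj, θ):

    k_on, k_off, k_cat, k_uncat = (θ[k] for k in ['k_on', 'k_off', 'k_cat', 'k_uncat'])
    times = traj.index.values
    obs_times_set = set(np.around(observations.index[1:], 6))
    λ = np.zeros(4)                      # boundary condition λ(T) = 0
    grad = np.zeros(4)

    for i in range(len(times) - 1, 0, -1):

        t = times[i]
        S, E, ES, P = traj.iloc[i]

        # at each observed timepoint, ∂g/∂u gives λ a jump in its P component
        if t in obs_times_set:
            obs_P = observations.loc[t]
            λ[3] -= -2 * (obs_P - P) / (σ * np.sqrt(obs_P))**2

        J = f_u_Jacobian(S, E, ES, P, k_on, k_off, k_cat, k_uncat)
        dfdθ = np.hstack([f_k_on(S, E, ES, P), f_k_off(S, E, ES, P),
                          f_k_cat(S, E, ES, P), f_k_uncat(S, E, ES, P)])

        h = t - times[i - 1]
        grad += (-λ) @ dfdθ * h          # accumulate -λᵀ ∂f/∂θ over the step
        λ = λ + (J.T @ λ) * h            # backward Euler step for the adjoint ODE between observations

    return dict(zip(['k_on', 'k_off', 'k_cat', 'k_uncat'], grad))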

plot_adjoint(u_0, λ_0)

We can check that our gradient computation via the Adjoint Method is correct by comparing it to the Finite Differences approximation of the gradient:

def cosine_distance(a, b): return 1 - np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))

grad_0_fd = pd.Series(finite_differences(None, θ_0)).iloc[0:4]
grad_0_adjoint_euler = pd.Series(grad_0)
cosine_distance(grad_0_fd, grad_0_adjoint_euler)
1.133973482914108e-05

That's good news, however,

f'A single step of the Adjoint Method with the naïve python Euler method takes {round(euler_time, 2)}s'
'A single step of the Adjoint Method with the naïve python Euler method takes 57.91s'

As before, we would like to use scipy to evaluate the integrals needed to compute the gradient. This time we run into two new issues:

  • One reason the scipy integrate package's integrators are faster is because they use adaptive timestepping methods. In our Euler Methods, we set our timestep size dt at the outset, which remains constant throughout integration, meaning our resulting trajectory $u(t, \theta)$ is really a vector of values of $u$ defined at uniformly-spaced values of $t$. Adaptive timestepping integrators yield trajectories supported on non-uniformly-spaced values of $t$, which we won't know prior to integration. If we integrate $\frac{du}{dt}$ with adaptive timestepping, and subsequently $-\frac{d\lambda}{dt}$, the resultant trajectories will not be supported on the same values of $t$. We can work around this issue by interpolating the trajectory of $u$ for values of $t$ between the $t$ the solution is defined at (see the sketch after this list).

  • Another issue arises from the fact that scipy's main integration method is not able to integrate differential equations with jump discontinuities of the sort we have in our Adjoint ODE with finite observations (as in the second plot above). Thankfully, Julia's feature-rich DifferentialEquations.jl library has bindings to python via diffeqpy, and does support this use case.
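To make the interpolation workaround from the first bullet concrete, here is a minimal sketch using scipy.interpolate.interp1d (the helper name and the choice of cubic interpolation are assumptions).

from scipy.interpolate import interp1d

def make_u_interpolant(traj):
    # wrap an adaptively-timestepped forward solution so the backward (adjoint) pass
    # can query u at arbitrary t
    return interp1d(traj.index.values, traj[['S', 'E', 'ES', 'P']].values.T,
                    kind='cubic', fill_value='extrapolate')

# usage: u_of_t = make_u_interpolant(integrate_scipy_full(θ_0)); S, E, ES, P = u_of_t(0.123)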

We recapitulate the results from our Euler method.

grad_0_adjoint_julia = pd.Series(grad_0_julia)
cosine_distance(grad_0_fd, grad_0_adjoint_julia)
1.7326465229228205e-06
f'A single step of the Adjoint Method with julia takes {round(julia_time, 2)}s'
'A single step of the Adjoint Method with julia takes 1.07s'

Now we can conceive of optimizing kinetic parameters with the adjoint method.

optimization_record = optimize_by_gradient_descent(θ_0, integrate=integrate_julia_full, gradient=gradient_of_loss_via_julia_adjoint_method)
plot_optimization_trajectory(optimization_record)

In principle we could use the adjoint method to compute derivatives in log space as well, but we'll "leave that as an exercise for the reader".

Further Reading

We have implemented the elementary techniques from the two families of approaches for ODE parameter inference. More advanced approaches from each family bear mentioning.

Advanced Bayesian Approaches

Bayesian approaches estimate a posterior distribution over the parameters to infer.

  • There are more practical and scalable variants of the inference by sampling approach. We implemented Markov-Chain Monte Carlo with a Metropolis-Hastings update criterion, which is hardly ever used in practice due to its simplemindedness.

  • A newer approach estimates the posterior distribution not by sampling but by optimizing the parameters of a surrogate distribution. This approach is named Variational Inference. As an introduction I would recommend these lecture notes. Many VI variants have been proposed, but of particular note are SVI and ADVI.

Advanced Frequentist Approaches

The frequentist approach is to optimize our parameters against our Loss function. We implemented Gradient Descent with Momentum using gradients computed via Forward Sensitivities and the Adjoint Method. There are more practical and scalable variants of each of those parts.

  • There exist many algorithms which use gradients to descend on a loss surface. Beyond momentum, we could have used any of the gradient-based optimization procedures often applied to optimize the parameters of Neural Networks (some of which are visualized here). Alternatively, for a few more evaluations of the ODE and the loss, we could have implemented a Line Search, which may have worked quite well on our relatively simple loss surface.
  • Beyond gradients, optimization can be dramatically accelerated by evaluating local second derivatives of the loss. Newton's method iteratively approximates the loss with a quadratic (a second-order Taylor expansion) and steps directly to the bottom of the paraboloid at each iteration, yielding much faster convergence. When the number of parameters is large, and evaluating the full Hessian (the matrix of second derivatives) at each iteration is intractable, pseudo-second-order optimization techniques such as BFGS (and the more practical L-BFGS) approximate the curvature of the loss surface from a sequence of gradients, and can work very well in practice (a hedged sketch follows this list).
  • Besides optimization techniques, the problem setting may admit alternate formulations of the optimization problem. Trajectory Optimization techniques may be brought to bear, particularly Multiple Shooting and Collocation methods.
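As a hedged illustration of the pseudo-second-order route, scipy's L-BFGS-B implementation can consume the adjoint gradient directly. The loss and adjoint_gradient names below are hypothetical stand-ins for the post's actual loss evaluation and gradient computation, not its real function names:

from scipy.optimize import minimize

# L-BFGS-B builds a low-rank curvature approximation from a history of
# gradients; the bounds keep the rate constants positive.
result = minimize(
    fun=loss,                                              # scalar loss L(θ), assumed helper
    x0=θ_0,
    jac=lambda θ: adjoint_gradient(u_0, θ, t_obs, y_obs),  # adjoint gradient
    method="L-BFGS-B",
    bounds=[(1e-9, None)] * 4,
)
θ_fit = result.x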

Implementations

This blog post is pedagogical. Proper implementations of these techniques should be used in practice.

  • The SciML ecosystem in Julia offers the best tooling I've found for these problems. DiffEqFlux.jl offers implementations of many of the frequentist approaches, and Turing.jl offers implementations of the Bayesian approaches. Both rely on DifferentialEquations.jl. [Disclaimer: I have the privilege of working with the people who built this tooling.]
  • pyPESTO appears to be the best-maintained python toolchain for parameter estimation for biological systems models. There may be good python tooling I'm not aware of (reach out!) especially originating from the Neural ODE literature, which has introduced the adjoint method to the Machine Learning community.

Conclusions

Having learned to infer the parameters of ODE models, let's now take a step back and ask whether we have accomplished something meaningful.

Do we know the right model?

Let us not pretend to the world that we will eventually have predictive models of biology in the same way we have predictive models of airplanes.

— Jeremy Gunawardena

Were we to dispense a specific number of enzyme and substrate molecules into a fixed volume of continuously mixed solvent at a known temperature, pH, and ionic concentration, then the Law of Mass Action we assumed at the outset would hold, and the form of our differential equation model would apply. However, as Lucas Pelkmans writes, "solution chemistry does not apply to cells, in which everything is inhomogeneous, crowded, out-of-equilibrium, and unmixed." In this post, we explored techniques to fit the parameters of a chosen differential equation model, but determining the right model to begin with may be the greater challenge. For example, in the case of enzymatic catalysis, the effective concentrations of enzyme and substrate may vary substantially inside a single cell, facilitated by phase separation (see The role of liquid–liquid phase separation in regulating enzyme activity and A new phase for enzyme kinetics). This is incompatible with the mass-action model we started with, as our kinetic parameters would appear to be functions of the concentrations of various intracellular metabolites rather than constants. Enzyme kinetic parameters are already functions of the pH, salt, and temperature conditions of the reaction environment, which is likely the cause of the inconsistent estimates of enzymes' kinetic parameters in databases such as BRENDA. Mathematical models from analytical chemistry may simply be inadequate to address the relevant phenomena in biology, which leads us to ask: what are the right models to describe these phenomena?

Further Reading:

Is a wrong model useful?

All models are wrong, but some are useful.

— George Box

Putting aside the former question, in this work, our synthetic data were explicitly generated from the Mass Action rate equations with fixed kinetic parameters, not from a biological system. Even so, our inference techniques did not recover the original kinetic parameters. Instead, they found alternate kinetic parameters which predicted the observations nearly as well. When we visualized the loss surface, we discovered a region of parameter space able to predict the observed data. This phenomenon has been called "Practical Identifiability" (or lack thereof) and "Model Sloppiness", and may be addressable by model order reduction and parameter space compression techniques. However, we should ask ourselves why we are concerned that the parameters of a perfectly predictive model are not unique. This question gets at the heart of scientific modeling. Our goal is to arrive at a model in which the variables and parameters have a tight correspondence with the physical phenomena they represent. If we achieve that, then our model will be general, and accurately predict the outcomes of interventions not previously observed -- it will extrapolate. Furthermore, it will be interpretable, which makes it easier to design interventions. For the time being, highly predictive black-box models are missing the hallmarks of traditional scientific models.

A highly predictive differential equation model with unphysical parameters is only marginally more useful than an overparameterized black-box model. In the future, we will aim for models composed of physically-salient variables and parameters.

Further Reading:

Is a large model useful?

With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

— John von Neumann

We alluded to the long-term aspiration of mathematical biologists to model entire cells. Cells contain millions of chemical species, so a Chemical Reaction Network model of the sort we used here would entail millions of variables, millions of reaction equations, and millions more kinetic parameters. To a physicist, a model should reveal the essence of a system, and parameters are best restricted or avoided altogether. Would a million-parameter ODE model be any more interpretable or insightful than a million-parameter neural network?

I believe that it would. A large physical model is an unusual thing today: models are mostly either physical, elegant, and concise ($f = ma$), or statistical, highly predictive, and overparameterized ($f = \mathrm{NN}_{\theta}(\cdot)$). I contend that large physical models provide many of the advantages of each of the former categories, namely, they may be highly predictive of complex phenomena yet interpretably so. The best example of a very large, very detailed, very useful model is Google Maps. The overwhelming quantity of data in Google's model of the world is useful in large part thanks to the exceptional interactive data visualization tool they provide to browse the model. The cell not only needs modeling, but representations and visualizations to explore and interrogate that model.

Is (math) modeling biology useful?

The best material model for a cat is another, or preferably the same cat.

— Arturo Rosenblueth and Norbert Wiener

If you speak to a biologist about a "model", they may assume you mean a mouse, not a system of equations. The shared use of the same term is hardly coincidental: both approaches offer surrogates to the systems we seek to understand. We should ask ourselves why we invest in mathematical models of biological systems rather than biological models of those systems. In particular, experimentalists continue to develop techniques to precisely perturb and broadly measure biological systems. If we can perfectly control the inputs and measure the outputs of a biological model, why bother with a mathematical model?

This is a serious question in light of the relative advantages of biological models, which include:

  • Accuracy: we continue to discover new principles governing the behavior of biological matter, meaning that even a mathematical model consolidating the entirety of our present understanding of cells would not be faithful to the real thing. Meanwhile, it is tautological that cells model themselves with perfect accuracy, even if the biochemical mechanisms mapping inputs to outputs are obscure.
  • Cost: cells "compute" their destinies with extremely little energy input, whereas large mathematical models require massive computing infrastructure and megawatts of power.
  • Parallelism: Experimentalists have developed technologies to interrogate biology at the level of the single cell, in multiplex. It is now possible to run millions of variants of an experiment simultaneously. That scale is not feasible in silico.

Despite those, mathematical models confer their own advantages, which justify their pursuit:

  • Timeliness: In silico experiments are quick to set up, and yield answers on a timescale of minutes rather than weeks.
  • Interpretability: The entirety of the mechanism mapping inputs to outputs is transparent to us, by default.
  • Reproducibility: Biological models often behave in accordance with covariates we do not measure or control. We expect mathematical models running on deterministic computers to be perfectly reproducible.
  • Theory: A rich body of scholarship in optimal design & control theory can be brought to bear against a mathematical model, as well as many other mathematical frameworks for the analysis of differential equations from the physical sciences.
  • Learning from mistakes: Insofar as mathematical models fail to predict the behavior of a system, they demonstrate the inaccuracies of their assumptions and ours.

Ultimately, we pursue mathematical models because, unlike a mouse model or a statistical model, the mathematical model alone represents understanding of the phenomenon. We apprehend the world as causes and effects, which we have formalized in the language of Differential Equations. True human understanding of the cell will be reached once we have a complete mathematical model.

Further Reading: