When do the mean and variance define an SDE?
November 1, 2021 | Author: Christopher Rackauckas
I recently saw a paper that made the following statement:
“Innes et al. [22] trained neural SDEs by backpropagating through the operations of the solver, however their training objective simply matched the first two moments of the training data, implying that it could not consistently estimate diffusion functions.”
However, the statement “could not consistently estimate diffusion functions” came with no reference and no proof in the appendix, so I was interested in figuring out the mathematical foundation behind the claim. Furthermore, I know from the DiffEqFlux documentation example that there is at least one case where a second-order method of moments seems to estimate the diffusion function. So a question arose: when do the mean and variance define an SDE?
Of course, this being 2021, a Twitter thread captures the full discussion. But I want to take a step back and summarize the evidence to show why it is still an unclear but interesting problem. Hopefully this will encourage someone to take up the challenge and prove whether the estimation is consistent or inconsistent, but for now I’ll start with the evidence we have on hand.
Before I continue, I would like to thank David Duvenaud, Patrick Kidger, and Sam Power for being such good sports. The fields of math and science work best when people can have cordial disagreements which further improve everyone’s understanding.
Consistent Estimation of Stochastic Differential Equations
Before we talk about evidence, we need to be clear about the problem that is being discussed. The question is, given information about the mean and variance, is it possible to estimate an SDE in a way that converges to the true SDE as the data becomes dense? In a more concrete form, if data is drawn from an autonomous SDE
$$dx = f(x,p)dt + g(x,p)dW_t$$
(we will stick to the autonomous case because the non-autonomous case is obviously not defined by the mean and variance), and if you had as many measurements of mean(x(t)) and variance(x(t)) as you would like, would minimizing the cost function
$$C(p) = \sum_{t_i} \left\| E[x(t_i)] - \text{mean}(x(t_i)) \right\| + \left\| V[x(t_i)] - \text{variance}(x(t_i)) \right\|$$
give you parameters $$p$$ that are the same as the parameters which generated the data? Note that this is a bit of a simplification, because it assumes we already know the structure of the SDE, i.e. $$f$$ and $$g$$. We can instead generalize to learning the structure by using neural SDEs, i.e. $$dx = NN_1(x,p)dt + NN_2(x,p)dW_t$$ where $$NN_1$$ and $$NN_2$$ are neural networks. The concrete question then is whether $$NN_1 \rightarrow f$$ and $$NN_2 \rightarrow g$$ as the amount of data about the means and variances goes to infinity.
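To make the cost function concrete, here is a minimal sketch in Julia of how one might compute such a moment-matching loss by ensemble simulation. This is not the DiffEqFlux documentation example; the drift and diffusion parameterizations, parameter values, and the “data” below are placeholders for illustration.

```julia
# Minimal sketch of a second-order method-of-moments loss for a parameterized SDE.
using DifferentialEquations, Statistics

f(u, p, t) = p[1] * u          # drift; placeholder parameterization
g(u, p, t) = p[2] * u          # diffusion; placeholder parameterization

ts = 0.0:0.1:1.0               # times at which mean/variance data are available
prob = SDEProblem(f, g, 1.0, (0.0, 1.0), [0.5, 0.2])

# Hypothetical data: the mean/variance time series we want to match
data_mean = exp.(0.5 .* ts)
data_var  = exp.(2 * 0.5 .* ts) .* (exp.(0.2^2 .* ts) .- 1.0)

function moment_loss(p; trajectories = 1000)
    ensemble = EnsembleProblem(remake(prob, p = p))
    sol = solve(ensemble, SOSRI(), EnsembleThreads();
                trajectories = trajectories, saveat = ts)
    # Ensemble values at each save point, then the empirical moments
    vals = [[sol[i].u[j] for i in 1:trajectories] for j in eachindex(ts)]
    m = mean.(vals)
    v = var.(vals)
    sum(abs, m .- data_mean) + sum(abs, v .- data_var)
end

moment_loss([0.5, 0.2])   # small when the parameters match the data-generating ones
```

Minimizing this loss over $$p$$ (or over neural network weights in the neural SDE case) is exactly the second-order method of moments in question.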
Evidence that Mean and Variance May Not Be Complete
Most of the discussion focuses on evidence for why the means and variances may not be a complete description of an SDE, and thus may not allow such an estimation to occur. There are two major pieces of evidence at play. For one, Patrick Kidger mentioned that in a recent paper he proved uniqueness of Brownian paths using collections of iterated integrals. These are essentially the instantaneous moments of the process, so there is a clear relationship. In that paper his proof required using infinitely many moments to prove the uniqueness of the SDE. This is of course evidence that you may need more than two moments, but it is not a proof that you need more than two moments. However, given the clarity of the proof, it is a sign pointing towards incompleteness of mean+variance information.
The other major piece of evidence is counterexamples, and this is really the meat of the discussion. An SDE has two notions of a solution: its strong solution and its weak solution. The strong solution is the solution to the SDE as a path, while the weak solution is the solution as a probability distribution at every point in time. Of course, it is possible for two stochastic processes to have different generating mechanisms but the same probability distribution at each time point. This is the big nagging idea behind why it seems estimation from the mean and variance cannot be consistent: can’t we just generate two SDEs $$dx = f_1(x,p)dt + g_1(x,p)dW_t$$ and $$dy = f_2(y,p)dt + g_2(y,p)dW_t$$ such that $$f_1 \neq f_2$$ or $$g_1 \neq g_2$$ but both have the same mean and variance at every time point? And going beyond just the mean and variance, given the looseness of the weak solution, for any SDE can’t you find another SDE which has the same probability distribution?
Note that if you can come up with such a construction then you will have settled the discussion for good. If for any SDE there is another SDE with the same mean and variance at every single time point, then of course it’s impossible to fully reconstruct the SDE from only knowing the time series of the mean and variance. So this is where most of the construction effort went. Patrick had a nice example of one such case: take $$y(0) \sim N(0, 1)$$ with either $$dy = dW_t$$ or $$dy = \frac{0.5\, y}{t + 1} dt$$. In both cases the solution always has mean zero, and the deterministic dynamics are chosen so that the standard deviation grows like $$\sqrt{1+t}$$, i.e. the variance grows linearly, just like the famous characterization of Brownian motion. Such a construction is clear for any case where mean(t) and variance(t) are linear. But what about when the means and variances are “sufficiently nonlinear”?
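For completeness, here is the short computation (my own verification, not from the thread) showing the two processes share the same first two moments. For the first process, $$dy = dW_t$$ gives $$y(t) = y(0) + W_t$$ with $$y(0) \sim N(0,1)$$ independent of the Brownian motion, so $$E[y(t)] = 0$$ and $$V[y(t)] = 1 + t$$. For the second, $$dy = \frac{0.5\, y}{t+1} dt$$ is a random ODE whose solution is $$y(t) = y(0)\sqrt{1+t}$$, so again $$E[y(t)] = 0$$ and $$V[y(t)] = (1+t)V[y(0)] = 1 + t$$. The means and variances agree for all $$t$$, yet one process is a diffusion and the other has no diffusion term at all.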
For that question, a few interesting pieces of evidence were brought up. Sam Power had a nice example with the overdamped Langevin equation: $$dx_i(t) = \nabla \log p_i (x_i(t)) dt + \sqrt{2} dW_t$$ for $$x_i(0) \sim p_i$$ gives $$x_i(t) \sim p_i$$ for all $$t \geq 0$$. This is a well-known fact from statistical physics used all of the time in things like biological modeling. However, there are two things this does not answer. One, it requires that $$p_i$$ be a stationary distribution, so it is again the case where the mean and variance are constant; two, it does not establish that there is another process with the same probability distribution. It shows that for any (stationary) $$p_i$$ you can construct an SDE that has $$p_i$$ as its probability distribution at all times, but can you construct two?
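As a concrete instance (my own example, not from the thread): if $$p_i$$ is the standard normal $$N(0,1)$$, then $$\nabla \log p_i(x) = -x$$ and the overdamped Langevin equation becomes the Ornstein-Uhlenbeck process $$dx = -x\, dt + \sqrt{2}\, dW_t$$. Started from $$x(0) \sim N(0,1)$$ it stays $$N(0,1)$$ for all time, so its mean and variance are constant even though the diffusion term is nontrivial.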
Patrick was able to dig up a nice paper from Gyöngy which shows how, given an Itô process whose probability distribution at each time is p(t), you can construct a Markovian SDE whose weak solution has those same marginals p(t). For a while I thought that may have settled it, but it doesn’t quite, because once again the ability to construct an SDE whose probability distribution is p(t) does not necessarily mean that you can construct two different SDEs whose probability distribution is p(t). If you take an SDE whose distribution is p(t) and then apply Gyöngy’s construction, what’s to say that it won’t give you back an SDE which is equivalent to the one you started with? To be a counterexample for consistency, you would want two SDEs with solutions x(t) and y(t) which both have weak distribution p(t) but for which the set of Brownian paths where $$x(t) \neq y(t)$$ has non-zero measure. Thus this paper gets really close to the heart of the problem but does not end it.
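For reference, my paraphrase of the mimicking result (see Gyöngy’s paper for the precise conditions): if $$dX_t = \beta_t dt + \sigma_t dW_t$$ is an Itô process with possibly path-dependent coefficients, then the Markovian SDE $$dY_t = b(t,Y_t)dt + s(t,Y_t)dW_t$$ with $$b(t,x) = E[\beta_t \mid X_t = x]$$ and $$s^2(t,x) = E[\sigma_t^2 \mid X_t = x]$$ admits a weak solution with the same one-dimensional marginal distributions as $$X_t$$. It matches marginals, not path laws, which is exactly why it does not by itself hand you two genuinely different SDEs with the same p(t).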
So at the end of the day, there is fairly strong evidence that the mean and variance may not be sufficient to define an SDE. And if mean(t) and variance(t) are linear, indeed it is clearly insufficient. However, even for the simplest linear SDE $$dX = aX dt + bX dW_t$$ the mean(t) and variance(t) are nonlinear, and there does not seem to be a known construction for how to create a similar process with the same mean and variance. There are very clear ways to construct an SDE whose weak solution is the probability distribution p(t), but is there a way to show that there are always two such SDEs with the same probability distribution p(t)? At least from this discussion there does not seem to be an answer to that, which leaves it inconclusive but with lots of evidence showing that it might be possible to come up with such a construction.
But is there evidence to the contrary?
Evidence That Mean and Variance May Be Complete
I was hoping the discussion would lead to some proof of incompleteness, but it hasn’t (so far). So let me take a second to describe why I think it is still not so obvious that the mean and variance are incomplete information.
The starting point is to just take the simplest SDE example: the linear SDE, i.e. geometric Brownian motion. This is the case $$dX = aX dt + bX dW_t$$. The analytical solution is well-known, and from it you can see that $$E[X(t)] = X_0 \exp(at)$$ and $$V[X(t)] = X_0^2 \exp(2at) (\exp(b^2 t)-1)$$. Thus it’s very clear that from a time series the mean would define $$a$$ and the variance would then uniquely define $$b$$, allowing you to recover the SDE. Of course, this is a very simple case, but can it be done in general?
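Here is a quick numerical check of that inversion (my own sketch, using the closed-form moments rather than simulated data): from $$E[X(t)]$$ and $$V[X(t)]$$ at any single time $$t > 0$$, the parameters can be read off as $$a = \frac{1}{t}\log(E[X(t)]/X_0)$$ and $$b^2 = \frac{1}{t}\log(1 + V[X(t)]/E[X(t)]^2)$$.

```julia
# Recovering (a, b) of dX = a X dt + b X dW_t from the mean and variance
# at a single time point, using the closed-form GBM moments as "data".
a, b, X0 = 0.7, 0.3, 2.0    # hypothetical true parameters
t = 1.5                     # any t > 0 works

m = X0 * exp(a * t)                              # E[X(t)]
v = X0^2 * exp(2 * a * t) * (exp(b^2 * t) - 1)   # V[X(t)]

a_hat = log(m / X0) / t
b_hat = sqrt(log(1 + v / m^2) / t)

@assert isapprox(a_hat, a; atol = 1e-12)
@assert isapprox(b_hat, b; atol = 1e-12)
```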
I think one might be able to use this fact to show that the mean and variance are complete information. I don’t have a full proof, but here’s a sketch. In the land of ODEs it’s well-known that any sufficiently regular ODE can be approximated locally by a linear (affine) ODE. I.e., if $$x' = f(x)$$, then near a point $$x_0$$ we have $$x' \approx f(x_0) + f'(x_0)(x - x_0)$$ when $$f$$ is “nice enough” (I believe it simply needs to be twice differentiable?). This fact is exploited in some numerical ODE solvers, such as those from the exponential Rosenbrock family, to give a convergent scheme in terms of local matrix exponentials of the Jacobian. This means that for a large class of ODEs, if you knew the Jacobian at every point you could recover the ODE.
It turns out that local linearization is a rather common phenomenon across types of differential equations, and a large class of SDEs are locally linear as well. In the land of numerical SDEs, this is used for stochastic exponential Rosenbrock methods (who would’ve thought?). The premise is very similar to the ODE case: since a sufficiently nice SDE $$dx = f(x)dt + g(x)dW_t$$ can be locally approximated by a linear SDE whose coefficients come from the Jacobians $$f'$$ and $$g'$$, you can use the analytical solution of the linear SDE to get accurate approximations to the nonlinear SDE over small time intervals which get pieced together. However, let’s go back to that first thought: isn’t the linear SDE fully defined by its mean and variance, and in higher dimensions by its mean and covariance matrix?
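To spell out why the linear SDE’s moments pin down its coefficients (a standard Itô computation, stated here for the scalar case): for $$dX = aX dt + bX dW_t$$, Itô’s formula gives

$$\frac{d}{dt}E[X] = a E[X], \qquad \frac{d}{dt}E[X^2] = (2a + b^2) E[X^2],$$

so the exponential growth rate of the mean identifies $$a$$ and the growth rate of the second moment identifies $$2a + b^2$$, hence $$b^2$$. Knowing the mean and variance trajectories of each local linear piece would therefore pin down the local drift and diffusion coefficients.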
This leads to the following proof sketch. Take any “sufficiently nice” SDE which admits a local linearization. Dense information about mean(t) and variance(t) would fully define the local information, giving a unique representation in terms of the locally linearized drift and diffusion coefficients. That would then say that mean(t) and variance(t) define the SDE up to an equivalence class of SDEs which share the same local linearization. You would then just need to show that the local linearization uniquely defines the SDE in a strong sense, which would be almost equivalent to a proof that stochastic exponential Rosenbrock without the nonlinear correction converges in a strong sense. If one could show these pieces, that would complete the proof, since then the local linearization and its “unlinearization” would be uniquely defined.
[Interestingly, all of the results I know for the convergence of stochastic exponential Rosenbrock methods are about their weak convergence, which is a bit of evidence that maybe it does not converge in a strong sense. Also interestingly, the odd case for linearizations would be when mean(t) and variance(t) are linear in time, since constant linear coefficients give exponentials in the mean and variance, and thus the linear case would have to be handled separately. That would explain why this case might not have the same uniqueness?]
Of course, that sketch doesn’t mean it’s proven; there are some big holes to fill there, but hopefully it gives a sense of how one could possibly do it. I would be interested in collaborating with anyone willing to dig deeper.
Some final thoughts
Let me end with a higher-level statement that really drives the reason why I think it’s not so obvious that mean and variance information is incomplete. For normal distributions it’s known that the mean and variance are a sufficient statistic, i.e. every Gaussian random variable is uniquely defined by its first two moments. A stochastic differential equation is, in a sense, locally Gaussian over time: at every point in time it makes a deterministic movement plus a Gaussian movement defined by an instantaneous mean and variance. Is it really that crazy to think that solutions of a stochastic differential equation might thus be well-defined by simply knowing the continuous evolution of its mean and variance? While SDEs have a lot of weird mathematical properties (look at stochastic Taylor series, non-differentiability almost everywhere, etc.), they do tend to have a lot of local structure that is preserved. If it’s locally Gaussian, and if that local Gaussian is fully defined, does that fully define the process? This seems like it could be a fundamental fact about SDEs, but its truth is still elusive to me. I don’t tend to think it’s so obvious that this structure is lost, but I would be willing to believe a proof in either direction if it ever presents itself. Anyway, I am curious to see what others find, or whether someone already knows the answer.
So for now, to me, it seems like this question doesn’t have a definitive answer. I would be happy to hear what others know on this topic and put this to rest.
Edit
This makes sense now. The increment information is what defines the SDE uniquely. Thus it’s the mean and the variance of the increments, $$mean[X(t) - X(t-dt)]$$ and $$variance[X(t) - X(t-dt)]$$, which directly define $$f$$ and $$g$$. This tells us the “right” way to fit an SDE is against such increment data. Nice!
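To spell out why increments do the trick (a standard Euler-Maruyama style expansion, my own addition): over a small step $$dt$$, conditional on $$X(t-dt) = x$$,

$$E[X(t) - X(t-dt) \mid X(t-dt) = x] \approx f(x) dt, \qquad V[X(t) - X(t-dt) \mid X(t-dt) = x] \approx g(x)^2 dt,$$

so the conditional increment mean recovers the drift $$f$$ and the conditional increment variance recovers $$g^2$$ pointwise as $$dt \to 0$$. (This determines $$g$$ only up to sign, which is fine since $$g$$ and $$-g$$ generate the same process in distribution.)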