Solving Differential Equations with Artificial Neural Networks

Dec, 2020

Artificial neural networks have been shown to be very flexible function approximators. Given sufficient width and depth of the network, the universal approximation theorem states that, in principle, any function can be represented by the network. As the network itself is differentiable, an efficient training is possible by evaluating the required derivatives for the optimization process analytically via the backpropagation algorithm. Furthermore, no prior knowledge about the function to be represented is required when implementing the architecture of the network.

A function approximator with the just mentioned properties is very versatile and therefore it is not surprising that artificial neural networks have been tried early on to be used in various disciplines. One idea is to use them as solvers for differential equations. Hereby, the neural network represents the solution function itself and the network parameters are tuned to fulfill the differential equation.

Here, I explore a possible architecture following ideas of Refs. [1-6] to solve differential equations. I start in a first section with a description of ordinary differential equations (ODEs) in general and traditional solving methods. If you are familiar with ODEs you may jump directly to the second section, where I describe the solution method with artificial neural networks and I derive all required equations in detail. I describe my implementation of the neural network with numpy from scratch and also a second implementation using tensorflow. I tested the architecture on several differential equations and I show and discuss results of my numerical study in the third section. The code of the implementation can be found on GitHub.

Index

1. Ordinary differential equations
- Traditional solving methods
- Solving by variational function approximation
2. Differential equation neural network (DeqNN)
3. Examples
- Bernoulli differential equation
- Harmonic oscillator
4. Conclusion
References

1. Ordinary differential equations

Assume a coupled set of first-order ordinary differential equations (ODEs) of a vector function $\vec x(t) \in \mathbb R^N$ $$ \begin{aligned} \vec G\big(\vec x(t), \vec x'(t), t\big) &= \vec 0 \end{aligned} $$ with initial condition $\vec x(t_0) = \vec x_0$. The term ordinary refers to the fact that the vector function $\vec x(t)$ is a function of a single variable $t$ compared to a partial differential equation (PDE), which depends on multiple variables. This differential equation system describes $N$ different functions $\vec x(t) = (x_1(t), x_2(t), \ldots, x_N(t))^\top$ with their derivatives $\vec x'(t) = (x_1'(t), x_2'(t), \ldots, x_N'(t))^\top$. The term first-order means that only first-order derivatives $\vec x'(t)$ occur. Given the function $\vec G(\vec x(t), \vec x'(t),t)$, the task is to find a function $\vec x(t)$ fulfilling the equation.

The consideration of only first-order equations is no restriction as any higher-order system can be reduced to first-order by defining derivatives of higher orders as new functions. A reduction of a higher-order system can be achieved with the following procedure. If one considers an arbitrary differential equation of order $n$ of a scalar function $x(t)$, i.e. $$ \begin{aligned} H(x(t), x'(t), x''(t), \ldots, x^{(n)}(t), t) = 0 \end{aligned} $$ one can define new functions $x_i(t) = x^{(i-1)}(t)$ or $\vec x(t) = (x(t), x'(t), ..., x^{(n)}(t))^\top$ which leads to $$ \begin{aligned} x_1'(t) &= x_2(t) \\ x_2'(t) &= x_3(t) \\ \vdots \\ x_n'(t) &= H^I(x_1(t), x_2(t), ..., x_n(t), t) \end{aligned} $$ where $H^I$ is obtained by solving $H$ for $x^{(n)}(t)$. Written vectorially as $\vec x'(t) = \vec K(\vec x(t), t) = (x_2, x_3, \ldots, H^I)^\top$, one can define $$ \begin{aligned} \vec G(\vec x(t), \vec x'(t), t) = \vec K(\vec x(t), t) - \vec x'(t) = 0 \end{aligned} $$ and we are back at the general expression in Eq. (1).

Please note that in the following, I often conveniently describe the variable $t$ as being a time, e.g. describing a physical system where $x(t)$ could be the trajectory of a particle and $\vec G$ Newton's equation of motion. Of course, the equations a completely general and $t$ and $\vec x(t)$ can be anything.

Traditional solving methods

Solving an ODE system is in general a difficult task. Only for particular systems and structures of the equations analytic solutions can be found. In all other cases, numerical techniques are required. The usual method is a step-by-step integration of the equation starting from the initial condition, where the simplest method is the Euler method.

The idea for the integration is the following. In general, a derivative can be approximated by the finite difference $$ \begin{aligned} \vec x'(t) \approx \frac{\vec x(t + \epsilon) - \vec x(t)}{\epsilon} \end{aligned} $$ which leads to $$ \begin{aligned} \vec x(t + \epsilon) = \vec x(t) + \epsilon \vec x'(t) \end{aligned} $$ Using $\vec x'(t) = \vec K(\vec x(t), t)$ from the differential equation (3), one obtains $$ \begin{aligned} \vec x(t + \epsilon) = \vec x(t) + \epsilon K(\vec x(t), t) \end{aligned} $$ Thus, starting from an initial value $\vec x(t_0) = \vec x_0$, the solution at a later time step $t_0 + \epsilon$ can be found by $$ \begin{aligned} \vec x(t_0+\epsilon) = \vec x(t_0) + \epsilon K(\vec x(t_0), t_0) \end{aligned} $$ This procedure can be repeated to obtain the approximate solution of $\vec x(t)$ at discrete time steps $t_i$. In practice, more sophisticated algorithms like Runge-Kutta methods are used, where more points within a time step are considered for more numerical stability and higher accuracy. Furthermore, with adaptive methods, the step width $\epsilon$ is adjusted dependent on the local error in each step.

While these methods work typically quite well in practice, they have inherent drawbacks concerning efficient numerical implementation. As the integration is an iterative method, a parallelization is possible only within each integration step but not over the integration range. Therefore, the calculation time scales linearly with the integration time range. Furthermore, as the method is a discrete integration, it is not possible to obtain a solution $\vec x(t)$ at arbitrary time $t$ after the integration. Only values at $\vec x(t_i)$ are available and a recalculate is required for a different value $\vec x(t_j)$ when $t_j$ is not within the set of $t_i$.

Solving by variational function approximation

Another method to solve an ODE is to make a variational ansatz, i.e. guess a solution $\vec x(t) = \hat{\vec x}(t, \vec \sigma)$, where $\vec \sigma$ are variational parameters which are chosen in such a way that the ODE $$ \begin{aligned} \vec G(\hat{\vec x}(t, \vec \sigma), \hat{\vec x}'(t, \vec \sigma), t) = \vec 0 \end{aligned} $$ is minimized. On large complex systems, a good intuition is required to be able to chose a suitable ansatz. Alternatively, a set of basis function $\vec a_i(t)$ might be chosen with a superposition ansatz $$ \begin{aligned} \hat{\vec x}(t, \vec \sigma) = \sum_i \sigma_i \vec a_i(t) \end{aligned} $$ Yet, in this case one needs sufficient intuition as well to chose the right basis function which are able to represent the correct solution approximately well.

This is where artificial neural networks (ANN) come into play to be used as a variational function ansatz using "arbitrary" basis functions. With this, no prior detailed knowledge about the solution is necessary. The advantage of a variational solution using ANNs is twofold. First, it has the advantage over iterative integration methods that the optimization process of the network can be fully parallelized. Second, once a suitable variational solution is found, a solution within the training range can be obtained at any time $t$ at constant evaluation time without having to recalculate anything again.

2. Differential equation neural network (DeqNN)

The idea is to use an ANN as a function approximator. Hereby, the network has a single one dimensional input representing the time $t$ and two $N$-dimensional outputs representing the function $\hat{\vec x}(t)$ and its derivative $\hat{\vec x}'(t)$. It is therefore an extension to a vanilla ANN as the derivative of the output with respect to the input is calculated simultaneously in the forward pass. To train the network, a batch of input times is passed to the network and the calculated function $\hat{\vec x}(t)$ and derivative $\hat{\vec x}'(t)$ are used to evaluate the differential equation $\vec G$. The evaluated differential equation and the initial condition are used to construct a loss function which is tried to be minimized in the training process. This architecture is inspired and similar to ideas proposed in Refs. [1-6]. The following figure gives a schematic overview of the network architecture. It will be described in the following in more detail.

Network architecture and forward pass

For the network, the following notation will be used

$N$: Number of coupled differential equations
$L$: Number of layers
$U_l$: Number of units in layer $l$
- Input layer $U_0 = 1$
- Output layer $U_L = N$
$A^{(l)}_{u}(t_s)$: Output of unit $u$ in layer $l$ for input time $t_s$
- Input layer $A^{(0)}_1(t_s) = t_s$
- Output layer $A^{(L)}_u(t_s) = \hat x_u(t_s)$
$\delta A^{(l)}_{u}(t_s)$: Derivative of network output $A^{(L)}_u(t_s)$ with respect to $A^{(l)}_{u}(t_s)$
$S$: Number of samples, sample $s$
$A^{(l)}_{su} = A_u^{(l)}(t_s)$: Output matrix of layer $l$ $$ \begin{aligned} A^{(l)} = \begin{pmatrix} A^{(l)}_1(t_1) & A^{(l)}_2(t_1) & \ldots & A^{(l)}_{U_l}(t_1)\\ A^{(l)}_1(t_2) & A^{(l)}_2(t_2) & \ldots & A^{(l)}_{U_l}(t_2)\\ \vdots\\ A^{(l)}_1(t_S) & A^{(l)}_2(t_S) & \ldots & A^{(l)}_{U_l}(t_S) \end{pmatrix} \in \mathbb R^{S\times U_l} \end{aligned} $$

The calculation of activations in each layer follows the standard procedure $$ \begin{aligned} Z^{(l)} &= A^{(l-1)} \cdot W^{(l)} + b^{(l)}\\ Z_{sk}^{(l)} &= \sum_p A^{(l-1)}_{sp} \cdot W^{(l)}_{pk} + b^{(l)}_{sk} \end{aligned} $$ $$ \begin{aligned} A^{(l)} &= a^{(l)}(Z^{(l)})\\ A^{(l)}_{sk} &= a^{(l)}(Z^{(l)}_{sk}) \end{aligned} $$ where $Z^{(l)} \in \mathbb R^{S\times U_l}$ is the result of the linear matrix operation and $a^{(l)}(z)$ the element-wise activation function. For the input and output layer there is no activation function, i.e. $a^{(0)}(z) = a^{(L)}(z) = z$. The weights and biases of each layer are $W^{(l)} \in \mathbb R^{U_{l-1} \times U_l}$ and $b^{(l)} \in \mathbb R^{S\times U_l}$. There is only one input feature, i.e. the time $t$. Thus, for a batch of size $S$, the input reads $$ \begin{aligned} Z^{(0)} = A^{(0)} = \vec t = \begin{pmatrix}t_1\\t_2\\\vdots\\t_S\end{pmatrix} \in \mathbb R^{S\times 1} \end{aligned} $$ The output of the network corresponds to the solution $\hat{\vec x}(t)$, i.e. $$ \begin{aligned} A^{(L)} = Z^{(L)} &= \begin{pmatrix} \hat x_1(t_1) & \hat x_2(t_1) & \ldots & \hat x_N(t_1)\\ \hat x_1(t_2) & \hat x_2(t_2) & \ldots & \hat x_N(t_2)\\ \vdots\\ \hat x_1(t_S) & \hat x_2(t_S) & \ldots & \hat x_N(t_S) \end{pmatrix}\\ &= \begin{pmatrix} \hat{\vec x}(t_1)^\top \\ \hat{\vec x}(t_2)^\top \\ \vdots \\ \hat{\vec x}(t_S)^\top \end{pmatrix} \in \mathbb R^{S\times N}\\ \end{aligned} $$

So far, the network is a vanilla ANN. Yet, in order to calculate the ODE we need not only $\vec x(t)$ but also the derivative $\vec x'(t)$. As the ANN is differentiable, one can obtain the derivative analytically. It corresponds to the derivative of the output with respect to the input. We define $$ \begin{aligned} \delta A^{(L)} &:= \frac{\mathrm dA^{(L)}}{\mathrm d A^{(0)}} = \frac{\mathrm d\hat{\vec x}}{\mathrm d \vec t} \\ &= \begin{pmatrix} \frac{\mathrm dA^{(L)}_{1}(t_1)}{\mathrm dt_1} & \frac{\mathrm dA^{(L)}_{2}(t_1)}{\mathrm dt_1} & \ldots & \frac{\mathrm dA^{(L)}_{N}(t_1)}{\mathrm dt_1}\\ \vdots\\ \frac{\mathrm dA^{(L)}_{1}(t_s)}{\mathrm dt_s} & \frac{\mathrm dA^{(L)}_{2}(t_s)}{\mathrm dt_s} & \ldots & \frac{\mathrm dA^{(L)}_{N}(t_s)}{\mathrm dt_s}\\ \vdots\\ \frac{\mathrm dA^{(L)}_{1}(t_S)}{\mathrm dt_S} & \frac{\mathrm dA^{(L)}_{2}(t_S)}{\mathrm dt_S} & \ldots & \frac{\mathrm dA^{(L)}_{N}(t_S)}{\mathrm dt_S} \end{pmatrix}\\ &= \begin{pmatrix} \hat{\vec x}'(t_1)^\top \\ \vdots \\ \hat{\vec x}'(t_s)^\top \\ \vdots \\ \hat{\vec x}'(t_S)^\top \end{pmatrix} \end{aligned} $$ Let us derive a recursive equation for the derivative $$ \begin{aligned} \delta A^{(l)} &= \underbrace{\frac{\mathrm dA^{(l)}}{\mathrm dA^{(0)}}}_{\in \mathbb R^{S\times U_l}} \\ &= \underbrace{\frac{\mathrm dZ^{(l)}}{\mathrm dA^{(0)}}}_{\in \mathbb R^{S\times U_l}} \odot \underbrace{\frac{\mathrm dA^{(l)}}{\mathrm dZ^{(l)}}}_{\in \mathbb R^{S\times U_l}} \\ &= \Big( \underbrace{\frac{\mathrm dA^{(l-1)}}{\mathrm dA^{(0)}}}_{\in \mathbb R^{S\times U_{l-1}}} \cdot \underbrace{\frac{\mathrm dZ^{(l)}}{\mathrm dA^{(l-1)}}}_{\in \mathbb R^{U_{l-1}\times U_l}} \Big) \odot \underbrace{\frac{\mathrm da^{(l)}}{\mathrm dZ^{(l)}}}_{\in \mathbb R^{S\times U_l}}\\ &= \left( \delta A^{(l-1)} \cdot W^{(l)} \right) \odot a^{'(l)}(Z^{(l)}) \end{aligned} $$ or element-wisely expressed as $$ \begin{aligned} \delta A^{(l)}_{sk} &= \sum_p \delta A^{(l-1)}_{sp} \cdot W^{(l)}_{pk} \cdot a^{'(l)}(Z_{sk}^{(l)}) \end{aligned} $$ The derivative of the initial layer is $$ \begin{aligned} \frac{\mathrm dA^{(0)}}{\mathrm dA^{(0)}} = \frac{\mathrm d\vec t}{\mathrm d\vec t} = \begin{pmatrix} 1 \\ \vdots \\ 1\end{pmatrix} \in \mathbb R^{S\times 1} \end{aligned} $$ As we can see, the derivative can be calculated using Eq. (6) in the forward pass together with equations (4) and (5). It can be represented by the following computational graph:

Loss function and training

To train the network, we generate a training input vector $A^{(0)} \in \mathbb R^{S\times 1}$ in each training epoch. Hereby, $A^{(0)}_{11} = t_0$ is always the time of the initial condition $\vec x(t_0) = \vec x_0$, while the rest of the values are randomly sampled from an interval $[t_{\mathrm{min}},t_{\mathrm{max}}]$. We will see that the reason to always add $t_0$ in the training data is that it will be used to enforce the initial condition to be fulfilled.

The loss function $L = L^D + L^I$ of the system consist of two parts. Hereby, the first part $L^D$ represents the error of the differential equation, while the second part $L^I$ represents the error of the initial condition.

The loss $L^D$ is the mean value over the batch of the sum of all elements of the squared differential equation (1)$$ \begin{aligned} L^D &= \frac{1}{S} \sum_{s=1}^S\sum_{n=1}^N G_n(\hat{\vec x}(t_s), \hat{\vec x}'(t_s))^2 \\ &= \frac{1}{S} \sum_{s=1}^S\sum_{n=1}^N \hat G_{sn}(A^{(L)}, \delta A^{(L)})^2 \end{aligned} $$ where the matrix $\hat G \in \mathbb R^{S\times N}$ with elements $\hat G_{sn}$ is defined by $$ \begin{aligned} \hat G &= \begin{pmatrix} \vec G(\hat{\vec x}(t_1), \hat{\vec x}'(t_1))^\top \\ \vdots \\ \vec G(\hat{\vec x}(t_s), \hat{\vec x}'(t_s))^\top \\ \vdots \\ \vec G(\hat{\vec x}(t_S), \hat{\vec x}'(t_S))^\top \\ \end{pmatrix} \end{aligned} $$ or $$ \begin{aligned} \hat G_{sn} &= G_n( (A^{(L)}_{s1}, \ldots, A^{(L)}_{sN})^\top, (\delta A^{(L)}_{s1}, \ldots, \delta A^{(L)}_{sN})^\top) \end{aligned} $$ One can see that each row of $\hat G$ only depends on the corresponding row of $A^{(L)}$ and $\delta A^{(L)}$. A minimization of this part of the loss function ensures that the differential equation is fulfilled.

The second part $L^I$ of the loss functions is the sum over all elements of the squared difference between the predicted solution of the initial condition time $t_0$, i.e. $\hat x_n(t_0) = A^{(L)}_{1n}$ and the real initial solution $x_n(t_0)$. $$ \begin{aligned} L^I &= \sum_{n=1}^N (A^{(L)}_{1n} - x_n(t_0))^2 \end{aligned} $$ This is the reason why we always add $t_0$ to the input vector as $A^{(0)}_{11}$ in order to be able to compute this loss. A minimization of this loss ensures that the initial condition is fulfilled.

To optimize the ANN, one can use any standard algorithm like batch gradient-descent, i.e. an iteration of the parameters via $$ \begin{aligned} W^{(l)}_{\mathrm{new}} &= W^{(l)} - \alpha \frac{\mathrm dL}{\mathrm dW^{(l)}} \\ b^{(l)}_{\mathrm{new}} &= b^{(l)} - \alpha \frac{\mathrm dL}{\mathrm db^{(l)}} \end{aligned} $$ or other advanced optimizers like Adam.

Derivation of backpropagation

To compute the required derivatives $\frac{\mathrm dL}{\mathrm dW^{(l)}}$ and $\frac{\mathrm dL}{\mathrm db^{(l)}}$, we must consider that now two quantities $Z^{(l)}$ and $\delta A^{(l)}$ depend on the parameters. This is different to a vanilla ANN where there is no quantity $\delta A^{(l)}$. In the following, all required equations to calculate the derivatives via backpropagation are derived in detail. The result is summarized in the next section.

Differential equation loss $L^D$ and weights $W^{(l)}$

Let us start with the first part of the loss $L^D$ representing the differential equation. The derivative with respect to the weights $W^{(l)}$ reads $$ \begin{aligned} \frac{\mathrm dL^D}{\mathrm dW^{(l)}_{ij}} &= \sum_{s,k} \frac{\mathrm dL^D}{\mathrm d Z_{sk}^{(l)}} \frac{\partial Z^{(l)}_{sk}}{\partial W^{(l)}_{ij}} + \sum_{s,k} \frac{\mathrm dL^D}{\mathrm d \delta A_{sk}^{(l)}} \frac{\partial \delta A^{(l)}_{sk}}{\partial W^{(l)}_{ij}} \end{aligned} $$ Using the element-wise definition of Eq. (4), we find $$ \begin{aligned} \frac{\partial Z^{(l)}_{sk}}{\partial W^{(l)}_{ij}} = \delta_{k,j} A^{(l-1)}_{si} \end{aligned} $$ and using Eq. (7) we find $$ \begin{aligned} \frac{\partial \delta A^{(l)}_{sk}}{\partial W^{(l)}_{ij}} &= \delta_{k,j} \cdot \delta A_{si}^{(l-1)} \cdot a^{'(l)}(Z_{sj}^{(l)}) \end{aligned} $$ Both expression contain $\delta_{k,j}$. Thus, the sum over $k$ in Eq. (8) vanishes and one obtains $$ \begin{aligned} \frac{\mathrm dL^D}{\mathrm dW^{(l)}_{ij}} &= \sum_{s} \left[ \delta^{(l)}_{sj} \cdot A_{si}^{(l-1)} + \gamma_{sj}^{(l)} \cdot \delta A_{si}^{(l-1)} \cdot a^{'(l)}(Z_{sj}^{(l)}) \right] \end{aligned} $$ or written in matrix notation $$ \begin{aligned} \frac{\mathrm dL^D}{\mathrm dW^{(l)}} &= (A^{(l-1)})^\top \cdot \delta^{(l)}\\ &\quad + (\delta A^{(l-1)})^\top \cdot (\gamma^{(l)} \odot a^{'(l)}(Z^{(l)})) \end{aligned} $$ In these expressions, we defined the two quantities $$ \begin{aligned} \delta^{(l)}_{sj} &= \frac{\mathrm dL_d}{\mathrm dZ_{sj}^{(l)}} \end{aligned} $$ $$ \begin{aligned} \gamma^{(l)}_{sj} &= \frac{\mathrm dL_d}{\mathrm d \delta A_{sj}^{(l)}} \end{aligned} $$

For $\delta^{(l)}_{sj}$ and $\gamma^{(l)}_{sj}$ one can find a recursive equation. We start with $\delta^{(l)}_{sj}$ $$ \begin{aligned} \delta_{sj}^{(l)} &= \frac{\mathrm dL^D}{\mathrm dZ_{sj}^{(l)}} \\ &= \sum_k \left[ \frac{\mathrm dL^D}{\mathrm dZ_{sk}^{(l+1)}} \sum_m \frac{\partial Z^{(l+1)}_{sk}}{\partial A^{(l)}_{sm}} \frac{\partial A^{(l)}_{sm}}{\partial Z^{(l)}_{sj}} + \frac{\mathrm dL^D}{\mathrm d\delta A^{(l)}_{sk}} \frac{\partial \delta A^{(l)}_{sk}}{\partial Z_{sj}^{(l)}} \right] \\ &= \sum_k \Big[ \delta_{sk}^{(l+1)} \cdot \sum_m W^{(l+1)}_{mk} \cdot a^{'(l)}(Z_{sj}^{(l)}) \cdot \delta_{m,j} \\&\qquad\qquad + \gamma_{sk}^{(l)} \cdot \sum_p \delta A^{(l-1)}_{sp} \cdot W^{(l)}_{pk} \cdot a^{''(l)}(Z_{sk}^{(l)}) \cdot \delta_{k,j} \Big]\\ &= \sum_k \delta_{sk}^{(l+1)} \cdot W^{(l+1)}_{jk} \cdot a^{'(l)}(Z_{sj}^{(l)}) \\&\qquad + \gamma_{sj}^{(l)} \cdot \sum_p \delta A^{(l-1)}_{sp} \cdot W^{(l)}_{pj} \cdot a^{''(l)}(Z_{sj}^{(l)}) \end{aligned} $$ or in matrix notation $$ \begin{aligned} \delta^{(l)} &= (\delta^{(l+1)} \cdot (W^{(l+1)})^\top) \odot a^{'(l)}(Z^{(l)}) \\&\qquad + \gamma^{(l)} \odot (\delta A^{(l-1)} \cdot W^{(l)}) \odot a^{''(l)}(Z^{(l)}) \end{aligned} $$ For the last layer one finds $$ \begin{aligned} \delta_{sj}^{(L)} &= \frac{\mathrm d L^D}{\mathrm d Z_{sj}^{(L)}} = \sum_{n,m} \frac{\partial L^D}{\partial \hat G_{sn}} \frac{\partial \hat G_{sn}}{\partial A^{(L)}_{sm}} \underbrace{\frac{\partial A^{(L)}_{sm}}{\partial Z^{(L)}_{sj}}}_{\delta_{m,j}}\\ &= \frac 2 S \sum_{n} \hat G_{sn} \frac{\partial \hat G_{sn}}{\partial A^{(L)}_{sj}} \cdot 1 \end{aligned} $$

Please note that in principle, the expression above has a second term, i.e. $$ \begin{aligned} \sum_{n,m} \frac{\partial L^D}{\partial \hat G_{sn}} \frac{\partial \hat G_{sn}}{\partial \delta A^{(L)}_{sm}} \underbrace{\frac{\partial \delta A^{(L)}_{sm}}{\partial Z^{(L)}_{sj}}}_{\propto a^{''(L)}(Z_{sj})=0} = 0 \end{aligned} $$ which is always zero as the activation function of the last layer is linear such that the second derivative vanishes.

The recursive equation for $\gamma^{(l)}$ reads $$ \begin{aligned} \gamma_{sj}^{(l)} &= \frac{\mathrm dL^D}{\mathrm d \delta A_{sj}^{(l)}} \\ &= \sum_k \frac{\mathrm dL^D}{\mathrm d \delta A_{sk}^{(l+1)}} \frac{\mathrm d \delta A_{sk}^{(l+1)}}{\mathrm d \delta A_{sj}^{(l)}} \\ &= \sum_k \gamma_{sk}^{(l+1)} \cdot W^{(l+1)}_{jk} \cdot a^{'(l+1)}(Z_{sk}^{(l+1)})\\ \end{aligned} $$ or in matrix notation $$ \begin{aligned} \gamma^{(l)} &= (\gamma^{(l+1)} \odot a^{'(l+1)}(Z^{(l+1)})) \cdot (W^{(l+1)})^\top \end{aligned} $$ For the last layer one finds $$ \begin{aligned} \gamma_{sj}^{(L)} &= \frac{\mathrm dL_d}{\mathrm d \delta A_{sj}^{(L)}} = \sum_{n} \frac{\mathrm dL_d}{\mathrm d \hat G_{sn}} \frac{\mathrm d \hat G_{sn}}{\mathrm d \delta A^{(L)}_{sj}}\\ &= \frac 2 S \sum_{n} \hat G_{sn} \frac{\mathrm d \hat G_{sn}}{\mathrm d \delta A^{(L)}_{sj}} \end{aligned} $$

Differential equation loss $L^D$ and biases $b^{(l)}$

Now we turn to the derivative with respect to the bias $b^{(l)}$. It reads $$ \begin{aligned} \frac{\mathrm dL^D}{\mathrm db_{ij}^{(l)}} &= \sum_{s,k} \frac{\mathrm dL^D}{\mathrm dZ^{(l)}_{sk}} \frac{\partial Z^{(l)}_{sk}}{\partial b^{(l)}_{ij}} + \sum_{s,k} \frac{\mathrm dL^D}{\mathrm d\delta A^{(l)}_{sk}} \frac{\partial \delta A^{(l)}_{sk}}{\partial b^{(l)}_{ij}} \end{aligned} $$ Using again the element-wise definition for $Z_{sk}^{(l)}$ and $\delta A_{sk}^{(l)}$ in Eqs. (4) and (7), we find $$ \begin{aligned} \frac{\partial Z^{(l)}_{sk}}{\partial b^{(l)}_{ij}} &= 1_{si} \cdot \delta_{k,j}\\ \frac{\partial \delta A^{(l)}_{sk}}{\partial b^{(l)}_{ij}} &= 0 \end{aligned} $$ Hereby, $1_{si} = 1$ are the matrix elements of an $S\times 1$ dimensional matrix required for dimension bookkeeping. With the defined quantity $\delta^{(l)}_{sj}$ in Eq. (9), we obtain $$ \begin{aligned} \frac{\mathrm dL^D}{\mathrm db_{ij}^{(l)}} &= \sum_s 1_{si} \cdot \delta_{sj}^{(l)} \end{aligned} $$ or in matrix notation $$ \begin{aligned} \frac{\mathrm dL^D}{\mathrm db^{(l)}} = 1^\top \cdot \delta^{(l)} \end{aligned} $$ where again $1 \in \mathbb R^{S\times 1}$ is required for dimension bookkeeping.

Initial condition loss $L_I$ and weights $W^{(l)}$

Now we consider the part of the loss function which ensures that the initial condition is fulfilled. This loss does not depend on $\delta A^{(l)}$. We find $$ \begin{aligned} \frac{\mathrm dL^I}{\mathrm dW_{ij}^{(l)}} &= \sum_k \frac{\mathrm dL^I}{\mathrm dZ_{1k}^{(l)}} \frac{\mathrm d Z_{1k}^{(l)}}{\mathrm dW_{ij}^{(l)}} = \eta_{1j}^{(l)} A_{1i}^{(l-1)} \end{aligned} $$ or in matrix notation $$ \begin{aligned} \frac{\mathrm dL^I}{\mathrm dW^{(l)}} &= (A^{(l-1)}_{1})^\top \cdot \eta^{(l)} \end{aligned} $$ where $A^{(l-1)}_1 \in \mathbb R^{1\times U_{l-1}}$ is the first row of $A^{(l-1)}$ and $\eta^{(l)} \in \mathbb R^{1\times U_l}$ with $$ \begin{aligned} \eta_{1j}^{(l)} &= \frac{\mathrm dL^I}{\mathrm dZ_{1j}^{(l)}} \end{aligned} $$ For $\eta^{(l)}$, the same recursive equation as for $\delta^{(l)}$ holds, i.e. $$ \begin{aligned} \eta_{1j}^{(l)} &= \sum_k \eta_{1k}^{(l+1)} \cdot W^{(l+1)}_{jk} \cdot a^{'(l)}(Z_{1j}^{(l)}) \end{aligned} $$ or in matrix notation $$ \begin{aligned} \eta^{(l)} &= (\eta^{(l+1)} \cdot (W^{(l+1)})^\top) \odot a^{'(l)}(Z^{(l)}_1) \end{aligned} $$ with $Z^{(l)}_1 \in \mathbb R^{1\times U_l}$ the first row of $Z^{(l)}$. For the last layer one finds $$ \begin{aligned} \eta^{(L)}_{1j} &= \frac{\mathrm dL^I}{\mathrm dZ_{1j}^{(L)}} = \sum_{n} \underbrace{\frac{\mathrm dL^I}{\mathrm dA_{1n}^{(L)}}}_{2 (A_{1n}^{(L)} - x_n(t_0))} \underbrace{\frac{\mathrm dA_{1n}^{(L)}}{\mathrm dZ_{1j}^{(L)}}}_{\delta_{n,j}}\\ &= 2 (A_{1j}^{(L)} - x_j(t_0)) \end{aligned} $$

Initial condition loss $L_I$ and biases $b^{(l)}$

Finally, lets look at the derivative of $L_I$ with respect to the bias terms $b^{(l)}$ $$ \begin{aligned} \frac{\mathrm dL^I}{\mathrm db_{ij}^{(l)}} &= \sum_{k} \frac{\mathrm dL^I}{dZ^{(l)}_{1k}} \frac{\mathrm dZ^{(l)}_{1k}}{\mathrm db^{(l)}_{ij}} = \eta_{1j}^{(l)} 1_{1i} \end{aligned} $$ or in matrix notation $$ \begin{aligned} \frac{\mathrm dL^I}{\mathrm db^{(l)}} &= 1^\top \cdot \eta^{(l)} \end{aligned} $$ where $1 \in \mathbb R^{1\times 1}$.

Equation summary

After the lengthy derivation above, let me summarize everything. The forward pass in the network is calculated via $$ \begin{aligned} Z^{(l)} &= A^{(l-1)} \cdot W^{(l)} + b^{(l)}\\ A^{(l)} &= a^{(l)}(Z^{(l)})\\ \delta A^{(l)} &= \left( \delta A^{(l-1)} \cdot W^{(l)} \right) \odot a^{'(l)}(Z^{(l)}) \end{aligned} $$ with values of the first layer $$ \begin{aligned} Z^{(0)} &= \begin{pmatrix}t_0 & t_1 & \ldots & t_{S-1}\end{pmatrix}^\top \in \mathbb R^{S\times 1}\\ \delta A^{(0)} &= \begin{pmatrix} 1 & 1 & \ldots & 1\end{pmatrix}^\top \in \mathbb R^{S\times 1} \end{aligned} $$

The loss $L = L^D + L^I$ consists of two parts with the differential equation loss $$ \begin{aligned} L^D &= \frac{1}{S} \sum_{s=1}^S\sum_{n=1}^N \hat G_{sn}(A^{(L)}, \delta A^{(L)})^2 \end{aligned} $$ and the initial condition loss $$ \begin{aligned} L^I &= \sum_{n=1}^N (A^{(L)}_{1n} - x_n(t_0))^2 \end{aligned} $$

The derivatives of the loss with respect to the weights and biases read $$ \begin{aligned} \frac{\mathrm dL^D}{\mathrm dW^{(l)}} &= (A^{(l-1)})^\top \cdot \delta^{(l)} + (\delta A^{(l-1)})^\top \cdot (\gamma^{(l)} \odot a^{'(l)}(Z^{(l)}))\\ \frac{\mathrm dL^D}{\mathrm db^{(l)}} &= 1^\top \cdot \delta^{(l)}\\ \frac{\mathrm dL^I}{\mathrm dW^{(l)}} &= (A^{(l-1)}_{1})^\top \cdot \eta^{(l)}\\ \frac{\mathrm dL^I}{\mathrm db^{(l)}} &= 1^\top \cdot \eta^{(l)}\\ \end{aligned} $$ where $$ \begin{aligned} \delta^{(l)} &= (\delta^{(l+1)} \cdot (W^{(l+1)})^\top) \odot a^{'(l)}(Z^{(l)}) \\&\qquad + \gamma^{(l)} \odot (\delta A^{(l-1)} \cdot W^{(l)}) \odot a^{''(l)}(Z^{(l)})\\ \gamma^{(l)} &= (\gamma^{(l+1)} \odot a^{'(l+1)}(Z^{(l+1)})) \cdot (W^{(l+1)})^\top \\ \eta^{(l)} &= (\eta^{(l+1)} \cdot (W^{(l+1)})^\top) \odot a^{'(l)}(Z^{(l)}_1) \end{aligned} $$ For the last layer it holds $$ \begin{aligned} \gamma_{sj}^{(L)} &= \frac 2 S \sum_{n} \hat G_{sn} \frac{\mathrm d \hat G_{sn}}{\mathrm d \delta A^{(L)}_{sj}}\\ \delta_{sj}^{(L)} &= \frac 2 S \sum_{n} \hat G_{sn} \frac{\mathrm d \hat G_{sn}}{\mathrm dA^{(L)}_{sj}}\\ \eta_{1j}^{(L)} &= 2 (A_{1j}^{(L)} - x_j(t_0)) \end{aligned} $$

Implementation

I implemented the ANN using python and numpy. The code can be found on GitHub. It consists of 3 base classes and several subclasses:

`Layer`

This class represents one layer of the network storing its weights $W^{(l)}$ and $b^{(l)}$. It has a forward() method calculating $Z^{(l)}$, $A^{(l)}$ and $\delta A^{(l)}$ and returning all quantities stored in a cache dict.

def forward(self, cache_prev):
    Z = np.dot(cache_prev["A"], self.W) + self.b
    A = self.a(Z)
    dA = np.multiply(np.dot(cache_prev["dA"], self.W), self.da(Z))
    return {"A": A, "Z": Z, "dA": dA}

It also has a backward() method calculating $\delta^{(l)}$, $\gamma^{(l)}$ and $\eta^{(l)}$ and the derivatives $\frac{\mathrm dL}{\mathrm dW^{(l)}}$ and $\frac{\mathrm dL}{\mathrm db}$.

def backward(self, cache, cache_prev, cache_next, layer_next):
    if cache_next:
        cache["gamma"] = np.dot(
            np.multiply(
                cache_next["gamma"],
                layer_next.da(cache_next["Z"])
            ),
            layer_next.W.T
        )
        cache["delta"] = (
            np.multiply(
                np.dot(cache_next["delta"], layer_next.W.T),
                self.da(cache["Z"])
            )
            + np.multiply(
                cache["gamma"],
                np.multiply(
                    np.dot(cache_prev["dA"], self.W),
                    self.da2(cache["Z"])
                )
            )
        )
        cache["eta"] = np.multiply(
            np.dot(cache_next["eta"], layer_next.W.T),
            self.da(cache["Z"][None,0,:])
        )

    S = cache["A"].shape[0]

    dLdW = (
        np.dot(cache_prev["A"].T, cache["delta"])
        + np.dot(
            cache_prev["dA"].T,
            np.multiply(
                cache["gamma"],
                self.da(cache["Z"])
            )
        )
        + np.dot(cache_prev["A"][None,0,:].T, cache["eta"])
    )

    dLdb = (
        np.dot(
            np.ones((S,1)).T,
            cache["delta"]
        )
        + np.dot(np.ones((1,1)).T, cache["eta"])
    )
    dLdb = dLdb.reshape(self.b.shape)

    return dLdW,dLdb

This base class has several subclasses with different activation functions, i.e. LayerLinear, LayerRelu, LayerTanh, LayerSigmoid, LayerSin.

`Optimizer`

This class represents an update algorithm for the weights of the network. I implemented two subclasses, i.e. OptimizerGradientDescent and OptimizerAdam.

`DeqNN`

This class represents the differential equation and must be derived to implement the methods G(x, dx), dGdx(x, dx) and dGddx(x, dx) calculating the differential equation $\hat G(\vec x, \vec x')$, $\frac{\mathrm d\hat G(\vec x, \vec x')}{\mathrm d\vec x}$ and $\frac{\mathrm d\hat G(\vec x, \vec x')}{\mathrm d\vec x'}$. It further implements the loss function

def L(self, G, x0):
    return 1./G.shape[0] * np.sum(np.square(G)) \
        + np.sum(np.square(self.x0 - x0))

a method for the forward pass of the network

def predict(self, t):
    caches = [{"A": t, "Z": t, "dA": np.ones((t.shape[0],1))}]
    for layer in self.layers:
        caches.append(layer.forward(caches[-1]))
    return caches[-1]["A"], caches[-1]["dA"], caches

a method for one step of the backward pass handling the values of the last layer

def dLdWdb(self, caches, G, dGdx, dGddx, l):
    cache_prev = caches[l]
    cache = caches[l+1]
    S = G.shape[0]

    # Handle last layer to compute initial values for backward pass
    if l == len(self.layers)-1:
        cache["delta"] = np.zeros(G.shape)
        cache["gamma"] = np.zeros(G.shape)
        for s in range(G.shape[0]):
            for j in range(G.shape[1]):
                for n in range(G.shape[1]):
                    cache["delta"][s,j] += G[s,n]*dGdx[s,n,j]
                    cache["gamma"][s,j] += G[s,n]*dGddx[s,n,j]
        cache["delta"] *= 2/S
        cache["gamma"] *= 2/S
        cache["eta"] = 2 * (cache["A"][None,0,:] - self.x0)
        return self.layers[l].backward(cache, cache_prev, None, None)

    # Usual backward step for all other layers
    else:
        cache_next = caches[l+2]
        return self.layers[l].backward(
            cache, cache_prev, cache_next, self.layers[l+1])

and a method for training the network

def train(self, Nepochs, Ns, tmin, tmax, optimizer, tol=1e-10,
        show_progress=10):
    loss = []
    optimizer.init_params(self.layers)
    for epoch in range(Nepochs):
        t = np.vstack([
            [self.t0], # Always add initial value time
            np.random.random([Ns-1,1])*(tmax-tmin) + tmin
        ])

        # Forward pass
        x,dx,caches = self.predict(t)
        G = self.G(x, dx, t)
        dGdx = self.dGdx(x, dx, t)
        dGddx = self.dGddx(x, dx, t)
        loss.append((epoch, self.L(G, x[0])))

        # Show progress
        if show_progress != None and epoch % show_progress == 0:
            print("Epoch {:05d}: Loss: {:.6f}".format(
                epoch, loss[epoch][1]))

        # Early stopping if loss is smaller than tolerance
        if loss[epoch][1] < tol:
            if show_progress:
                print("Solution converged after {}".format(epoch)
                    + " epochs with tolerance {}".format(tol))
            break

        # Backward pass and update
        for l in reversed(range(len(self.layers))):

            # One backward step
            dLdW, dLdb = self.dLdWdb(caches, G, dGdx, dGddx, l)

            # Parameter update
            optimizer.update(self.layers, dLdW, dLdb, l, epoch)

    return np.array(loss)

I verified my implementation by checking all calculated derivatives with a comparison to numerical derivatives.

`DeqNN` with tensorflow

In addition to the implementation from scratch, I also used tensorflow. By doing so, only the forward pass has to be implemented, while the backward pass is completely handled by tensorflow's autodiff feature. Here, the base class DeqNN has again a function for computing the loss

def L(self, x, dx, t):
    return (1./x.shape[0]
        * tf.math.reduce_sum(tf.math.square(self.G(x, dx, t)))
        + tf.math.reduce_sum(tf.math.square(self.x0 - x[0]))
    )

and a forward() method for computing the foward pass. To calculate the derivative with respect to the input, the usual forward pass of the sequential network is recorded on a tf.GradientTape such that the derivative with respect to the input can be calculated with the batch_jacobian() method. Compared to the full jacobian, this method calculates only the derivatives of the output with respect to the inputs in the same row. As the rows in our network are independent, this more efficient method can be used. As we have only a one-dimensional input, i.e. $t$, selecting the first element in the last dimension via [:,:,0] is enough which creates a $S\times N$ dimensional matrix.

def predict(self, t):
    with tf.GradientTape() as tape_dx:
        tape_dx.watch(t)
        x = self.model(t)

    return x, tape_dx.batch_jacobian(x, t)[:,:,0]

The class also has a train_step() method for calculating one step of the backward pass via the usual recording of the forward pass with a tf.GradientTape and calculating the derivatives with respect to the parameters of the network.

def train_step(self, t, optimizer):
    # Forward pass
    with tf.GradientTape() as tape:
        x, dx = self.predict(t)
        loss = self.L(x, dx, t)

    # Backward pass and update
    grads = tape.gradient(
        loss, self.model.trainable_variables
    )
    optimizer.apply_gradients(
        zip(grads, self.model.trainable_weights)
    )
    return loss

Finally, there is a train() method similar to the implementation with numpy.

In my tests, I found that for small networks, the implementation with tensorflow introduces too much overhead so that the implementation with numpy is much faster.

3. Examples

I tried the neural network solver on several differential equations to see how the architecture performs on different problems. The results are shown below.

Bernoulli differential equation

The Bernoulli differential equation reads $$ \begin{aligned} x'(t) + P(t) x(t) = Q(t) x(t)^n \end{aligned} $$ where $n$, $P(t)$ and $Q(t)$ are parameters. For $n=2$, $P(t)=-2/t$ and $Q(t) = -t^2$, the equation takes the form of the Riccati equation and can be solved analytically with $$ \begin{aligned} x(t) = \frac{5t^2}{t^5 + 5c} \end{aligned} $$ where $c$ is a constant determined by the initial condition. For $x(1) = 1$, the constant is $c=4$. In the implementation, we therefore need $$ \begin{aligned} G(x, x', t) &= x' - \frac{2}{t} x + t^2 x^2\\ \frac{\mathrm dG}{dx} &= - \frac{2}{t} + 2t^2 x\\ \frac{\mathrm dG}{dx'} &= 1 \end{aligned} $$

For the network architecture, I chose a single hidden layer with 10 tanh units and the optimization is done with the Adam optimizer. The solution is searched in the range $t \in [0.1,10]$, where I intentionally do not start at $t=0$ to avoid the singularity at $t=0$. Doing a hyperparameter search for the number of samples $S$ in each training epoch and the learning rate $\alpha$, I found optimal values $S = 50$ and $\alpha = 0.005$. With these parameters, the loss function $L$ is smaller than $10^{-6}$ within $11290$ training epochs. The result is shown in the following figure compared to the analytic solution

While the loss value $L$ is a good criteria for the overall convergence of the solution, it does not give insight about the correctness at individual points $t$. For this, the error of the found solution can be evaluated by different means. If the exact solution $x(t)$ is known, as it is the case for this example, one can calculate the difference to the obtained numerical solution $\hat x(t)$, i.e. $|x(t) - \hat x(t)|$. This is shown in the part (c) in the above figure. Alternatively, one can just evaluate the differential equation $G(\hat x(t), \hat x'(t), t)$ itself with the obtained solution as it should approach zero. This is shown in part (d) in the above figure. Overall, the obtained solution is quite accurate.

Let me mention a few observations about the activation function. Choosing a sigmoid activation function does also work, yet it performs worse than tanh. I assume that the superior performance of the tanh activation function results from the additional information in the negative range which is missing for sigmoid. A relu activation function does only work poorly. Here, not only a negative part is lacking but more importantly the second derivative of relu is zero. As can be seen in Eq. (9) for $\delta^{(l)}$, the second derivative of the activation function enters. For a vanishing second derivative this information is missing and thus, hinders an optimal learning process.

Harmonic oscillator

Next, let us consider a classical damped harmonic oscillator in one dimension, for which Newton's equation of motion read $$ \begin{aligned} x''(t) + \omega_0^2 x(t) + \gamma x'(t) = 0 \end{aligned} $$ with undamped frequency $\omega_0$ and damping constant $\gamma$ according to Eq. (2). To reduce this second-order system to a first-order system, one can define $$ \begin{aligned} \vec x(t) &= \begin{pmatrix} x_1(t) \\ x_2(t) \end{pmatrix} = \begin{pmatrix} x(t) \\ x'(t) \end{pmatrix} \\ \vec x'(t) &= \begin{pmatrix} x_1'(t) \\ x_2'(t) \end{pmatrix} = \begin{pmatrix} x_2(t) \\ -\omega_0^2 x_1(t) - \gamma x_2(t) \end{pmatrix} \end{aligned} $$ which leads to $$ \begin{aligned} \vec G(\vec x(t), \vec x'(t), t) = \begin{pmatrix} x_1'(t) - x_2(t)\\ x_2'(t) + \omega_0^2 x_1(t) + \gamma x_2(t) \end{pmatrix} = 0 \end{aligned} $$ The derivatives read $$ \begin{aligned} \frac{\mathrm dG_1}{\mathrm d \vec x} = \begin{pmatrix} \frac{\mathrm d G_1}{\mathrm d x_1}\\ \frac{\mathrm d G_1}{\mathrm d x_2} \end{pmatrix} = \begin{pmatrix} 0\\ -1 \end{pmatrix}\\ \frac{\mathrm dG_2}{\mathrm d \vec x} = \begin{pmatrix} \frac{\mathrm d G_2}{\mathrm d x_1}\\ \frac{\mathrm d G_2}{\mathrm d x_2} \end{pmatrix} = \begin{pmatrix} \omega_0^2\\ \gamma \end{pmatrix}\\ \frac{\mathrm dG_1}{\mathrm d \vec x'} = \begin{pmatrix} \frac{\mathrm d G_1}{\mathrm d x_1'}\\ \frac{\mathrm d G_1}{\mathrm d x_2'} \end{pmatrix} = \begin{pmatrix} 1\\ 0 \end{pmatrix}\\ \frac{\mathrm dG_2}{\mathrm d \vec x'} = \begin{pmatrix} \frac{\mathrm d G_2}{\mathrm d x_1'}\\ \frac{\mathrm d G_2}{\mathrm d x_2'} \end{pmatrix} = \begin{pmatrix} 0\\ 1 \end{pmatrix} \end{aligned} $$ I choose $\omega_0 = 1$, $\gamma=0.4$ and as initial condition $x_1(0) = 0$ and $x_2(0) = 1$. I search for a solution in the range $t\in[0,30]$. For this system, I chose a single hidden layer network with 100 units and sin activation functions. With a hyperparameter search, I found optimal values for the number of samples per training epoch $S=150$ and $\alpha=0.05$ for which the loss $L$ get lower than $10^{-5}$ within 289 epochs. The result is shown below compared to a solution obtained with a classical Runge-Kutta integration (for this system, an analytic solution also exists). One observes that the solution in this case is again quite accurate.

One might wonder, why I have chosen a sin activation function for this example. In the neural network solver, the activation functions can be considered as kind of basis functions for the solution ansatz. As the solution is an oscillating function in this case, a sin function as activation/basis makes perfectly sense and helps the learning process. The question is, whether one can find a solution with a tanh activation functions as well? To check which minimal model capacity is required, I trained a vanilla neural network on the analytic solution in the same time range, i.e. a neural network with a single input for the time $t$ and two outputs for $x_1(t)$ and $x_2(t)$. The tensorflow implementation is shown below

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
np.random.seed(1)
tf.random.set_seed(1)

# Analytic solution of the harmonic oscillator example
def func1(t):
    return 1.02062*np.exp(-0.2*t)*np.sin(0.979796*t)

def func2(t):
    return np.exp(-0.2*t) \
        * (np.cos(0.979796*t) - 0.204124*np.sin(0.979796*t))

# Random training points in the interval t=0...30
x_train = np.random.rand(200)*30
y_train = np.array([func1(x_train), func2(x_train)]).T

# Setup a two-layer model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation=tf.nn.tanh, input_shape=(1,)),
    tf.keras.layers.Dense(100, activation=tf.nn.tanh),
    tf.keras.layers.Dense(2)
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(0.01),
    loss=tf.keras.losses.MeanSquaredError()
)

# Train network
# As learning is noisy, keep best parameters during training
checkpoint = tf.keras.callbacks.ModelCheckpoint("weights",
    save_weights_only=True, monitor="loss", verbose=1, save_best_only=True
)
history = model.fit(x_train, y_train, epochs=5000, batch_size=100,
    callbacks=[checkpoint])
model.load_weights("weights")

# Plot result compared to analytic solution
x = np.linspace(0,30,1000)
y1 = func1(x)
y2 = func2(x)
yhat = model.predict(x)

fig = plt.figure(figsize=(6,4))
c1 = "#1F77B4"
c2 = "#E25140"
plt.subplots_adjust(left=0.12, right=0.95, top=0.95, hspace=0.6)

ax = plt.subplot(3, 1, 1)
ax.annotate('(a)', xycoords="axes fraction", xy=(-0.13,0.9))
plt.xlabel("$t$")
plt.ylabel("$x_1(t)$")
plt.plot(x, y1, color=c1, label="Analytic")
plt.plot(x, yhat[:,0], color=c2, linestyle="dotted", label="NN")
plt.legend(loc='upper right', frameon=False, ncol=2)

ax = plt.subplot(3, 1, 2)
ax.annotate('(b)', xycoords="axes fraction", xy=(-0.13,0.9))
plt.xlabel("$t$")
plt.ylabel("$x_2(t)$")
plt.plot(x, y2, color=c1, label="Analytic")
plt.plot(x, yhat[:,1], color=c2, linestyle="dotted", label="NN")

ax = plt.subplot(3, 1, 3)
ax.annotate('(c)', xycoords="axes fraction", xy=(-0.13,0.9))
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.semilogy(history.epoch, history.history["loss"], color="c1")

plt.savefig("ho_nn.png", dpi=200)

The result is shown below, where one can see that a two-layer neural network with 100 units each and a tanh activation function is capable of representing a decaying oscillation over several cycles reasonable well.

Now let's try this architecture in the differential equation neural network. A typical result for $S = 200$ and $\alpha = 0.01$ after 1650 epochs is shown below. The error is $L = 8.7\cdot 10^{-4}$.

We can see that the training fails. Even if the network should be capable to represent the solution in principle, I was not able to successfully train the differential equation neural network. There is only a rough agreement in the first cycle of the oscillation, but afterwards the solution simply decays to zero without further oscillations. Just to make sure, I exported the weights from the two-layer vanilla neural network above

np.save("W1.npy", model.layers[0].get_weights()[0])
np.save("b1.npy", model.layers[0].get_weights()[1])
np.save("W2.npy", model.layers[1].get_weights()[0])
np.save("b2.npy", model.layers[1].get_weights()[1])
np.save("W3.npy", model.layers[2].get_weights()[0])
np.save("b3.npy", model.layers[2].get_weights()[1])

and used these exact weights in my differential equation neural network. This produces the following result

How can we interpret this? Why is it not possible to train the DeqNN with the tanh activation functions? If we compare the absolute error $|x-\hat x|$ and the absolute value of the differential equation $|G|$ in the two examples of direct training and using pretrained weights, one can see that the absolute error is roughly one order of magnitude lower in the pretrained case, while $|G|$ is not. The value of $|G|$ is even in the same order compared to the apparently good solution with the sin activation functions, where there, the absolute error is actually one more order of magnitude further down.

My conclusion on this is that $|G|$ alone is not a good quantity to evaluate the correctness of the solution. This has severe consequences as $G$ directly enters the loss function, which is used to train the network. While $|G|$ is a good measure to evaluate how a solution fulfills the differential equation locally at a single point, it doesn't say anything about the global property of the solution. The used loss function measures two things: How good a solution fulfills the differential equation and how good the initial condition is fulfilled. While these two conditions are sufficient analytically to give a unique solution, numerically this might not be the case. In the solution we can see that the initial conditions is always fulfilled quite well, however for increasing time the solution seems to get worse, i.e. it deviates more and more from the real solution. However, this doesn't mean that the differential equation is less well fulfilled, as one can see by looking at $|G|$. It seems that, if not enforced by other means e.g. the sin activation function, the solution valid for the initial conditions transitions slightly into other solutions fulfilling the differential equation equally well. In an extreme case, if the neural network would have enough capacity, the training could converge to a solution where at each point $t$ the solution would correspond to another solution with different initial condition while simultaneously fulfilling exactly the differential equation.

Lastly, let us consider how well the neural network can predict solutions outside of its training range. For this, I take the trained network in the range $t\in[0,30]$ from above and let it predict solutions in the range $t\in[-10,40]$. The result is shown below compared to the analytic solution

We can see, that the network dramatically fails outside of the trained range. This is no special failure of the architecture but a general property of neural networks. There is simply no way how the network can know what happens outside of the trained range as it has never seen any samples from that range. If the function outside of the training range is similar to the function inside there might be still a good agreement but there is no guarantee for that.

Finally, how well does the DeqNN solver performs compared to the Runge-Kutta solver? Here, the DeqNN solver is much slower. For this simple example, the neural network cannot play its strength due to parallelization as the evaluation of the differential equation, i.e. the function $G(x(t),x'(t),t)$, is simply not costly enough such that the overhead by the neural network is too big.

4. Conclusion

As one can see from the results of the numerical study above, one can conclude that the DeqNN architecture is in principle capable to find solutions of differential equations. Let me mention a few observations from my experiments:

The training process can be difficult, as the chosen loss function does not guarantee a convergence to the solution fulfilling the initial condition, but can converge to different solutions, especially in ranges further away from the initial condition. Choosing an activation functions encoding the general structure of the solution, i.e. a sin function in a oscillating problem, helps the learning process. There have been ideas addressing this problem by "Curriculum learning", where the training interval is slowly increased during the learning process (see eg. Ref. [5]). However, in my example of the harmonic oscillator, this procedure didn't seem to help. I think, a modification of the loss function enforcing a more global convergence to the correct initial condition problem would help to address this problem.

I found, that activation functions with vanishing second derivative like relu functions do not work as the second derivative is required in the learning process. Furthermore, activation functions which can be positive and negative seem to work better than positive-only activation functions, i.e. tanh works better than sigmoid.

An evaluation of the network outside of the trained range leads in general to random results as the network has no way to know the correct solution.

Compared to traditional integration solver methods like Runge-Kutta, the neural network approach has advantages and disadvantages. The implementation with neural networks allows a parallel computation, while the integration always has to be done sequentially. Therefore, the computation time increases with increasing complexity of the problem in the integration solver, while it can be kept roughly constant in the neural network case. However in simple cases, like my examples, the overhead of the neural network is much larger and the simple integration of the problem is faster. Furthermore, once trained, the neural network allows the evaluation of the solution function at constant time at an arbitrary value. With an integration method, a recalculation is required if the value is not contained in the discretized integration range.

In principle, the architecture of the DeqNN can be easily extended to partial differential equations, where one uses multiple inputs instead of the single input and additional outputs representing all derivatives.

References

[1] I.E. Lagaris, A. Likas and D.I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks 9, 987 (1998)
[2] V.I. Avrutskiy, Backpropagation generalized for output derivatives, arXiv:1712.04185 (2017)
[3] M. L. Piscopo, M. Spannowsky, P. Waite, Solving differential equations with neural networks: Applications to the calculation of cosmological phase transitions, Physical Review D, 100, 016002 (2019)
[4] M. Mattheakis, D. Sondak, A. S. Dogra, P. Protopapas, Hamiltonian Neural Networks for solving differential equations, arXiv:2001.11107 (2020)
[5] C. Flamant, P. Protopapas, D. Sondak, Solving Differential Equations Using Neural Network Solution Bundles, arXiv:2006.14372 (2020)
[6] S. Chakraverty, S. Mall, Single layer Chebyshev neural network model with regression-based weights for solving nonlinear ordinary differential equations, Evol. Intel. 13, 687 (2020)