A quantum-mechanical framework for million-atom scale biological systems

Modifications to the Hartree–Fock algorithm

Unlike density functional theory (DFT), Hartree–Fock is a wavefunction-based method and does not directly operate on the electronic density. Each electronic interaction integral contains products of four wave functions. As the system grows, the amount of involved wave functions increases, and the number of electronic repulsion integrals (ERIs) scales with the fourth power of the system size, making their evaluation one of the main computational bottlenecks of Hartree–Fock calculations. To address this, we use systematic approximations that significantly reduce the computational effort. Both wavefunction-wavefunction³⁵ and density-density interactions are truncated using physically motivated thresholds. Since wave functions decay exponentially with distance, only nearby orbitals contribute significantly to overlaps (see subsection ‘Density cut-offs’). We assess the relevance of a density by calculating the overlap of its two constituent wave functions. For orbital types with negative signs, we use the absolute of the product to capture significant interactions that may otherwise cancel numerically but still produce meaningful contributions, especially to energy gradients, which are important for force calculations. Once densities below the threshold are excluded, the scaling of ERI evaluations drops from quartic to quadratic. To further reduce complexity to linear scaling (order N), we introduce a cut-off for the Coulomb interaction based on inter-density distance (see subsection ‘Coulomb cut-offs’). We use a short Coulomb cut-off of 10 Å—a more detailed discussion of this parametrization is found in a separate section. The distance of two densities is defined as the distance between the centres of the two wavefunction pairs that generate them. Only ERIs with Coulomb distances below the Coulomb cut-off are calculated. For ERIs with distances near the cut-off, we apply a smoothing function to ensure continuity.

An ERI is evaluated if the overlap threshold of both densities is exceeded and their centres lie within the Coulomb cut-off. Additionally, we compute the product of the two density relevance values (see the next subsections), divide it by the distance between their centres and compare it to a final relevance threshold. This avoids evaluating interactions between weakly overlapping densities and/or those separated by large distances. To further accelerate the process, we adopt an efficient and parallelizable ERI algorithm^53,54,55 based on the so-called Gaussian lobe functions^37,56 (see subsection ‘Basis function fitting’). Unlike conventional approaches^57,58,59,60, which use angular prefactors (x, y, z) to construct directional orbitals, our method builds positive and negative lobes from distinct Gaussian components. This removes angular dependencies and simplifies the integral expressions, making them uniform across orbital types. As a result, all ERI calculations can be performed by using the same expression, which enables a simple parallelisation for both ERIs and their derivatives (necessary to calculate atomic forces). We also precompute components of the ERIs which only depend on pairs of wave functions. This reduces the full calculation of the electronic repulsion between two densities consisting of in total of four Gaussian functions to only 6 additions, 8 multiplications, 10 memory accesses, 1 square root and the evaluation of the error function (see below). For most interactions with large separations, the error function can be approximated as unity, further reducing computational cost.

The following subsections detail the generation of the lobe-based basis functions, the calculation of the density relevance, the calculation of the Coulomb interaction relevance and the mathematical expressions for the modified ERI algorithm.

Basis function fitting

In electronic structure calculations, wave functions are commonly modelled as Slater-type orbitals (STOs), which for an s-orbital have the form ${\varphi }_{i}\left({{\boldsymbol{r}}}\right)={A}_{i}\exp \left(-{\alpha }_{i}{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{i}{||}\right)$. To enable efficient integration, STOs are typically approximated by using Gaussian functions of the form ${g}_{i}\left({{\boldsymbol{r}}}\right)={A}_{i}\exp \left(-{\alpha }_{i}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{i}{||}}^{2}\right)$.

Each STO is approximated by a linear combination of ${n}_{G}$ Gaussians as

$$\begin{array}{c}{\varphi }_{i}\left({{\boldsymbol{r}}}\right)={\sum }_{a=1}^{{{n}_{{{\rm{G}}}}}_{i}}{A}_{a}\exp \left(-{\alpha }_{a}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{a}{||}}^{2}\right),\end{array}$$

(1)

This approximation enables an analytic evaluation of key integrals^57,58,59,60, relevant for the overlap matrix, the kinetic and nuclei energy as well as the Coulomb and exchange interaction of the electrons in the system. Orbitals with higher angular momenta are typically expressed via Gaussians containing cartesian prefactors ($x,y,z,{x}^{2},{xy},{xz},\ldots$). Although these integrals can also be evaluated analytically, their complexity increases due to the recursive nature of the associated expressions^57,58,59,60. Furthermore, different angular momentum combinations require distinct equations, reducing parallelisation efficiency.

To avoid this bottleneck, we here adopt Gaussian lobe function expansions^37,56, which allow the construction of basis sets without cartesian prefactors. We show in this work that this approach, though currently not widely used, is particularly well-suited to modern computing architectures optimised for high parallelisation. It also enables the evaluation of ERIs using a unified matrix formalism, which is potentially compatible with GPU acceleration⁶¹. The algorithms for ERI evaluation are described in the section ‘Electron repulsion integral algorithm’.

For fitting, we start with a Slater-type orbital of the STO-6G type taken from the Basis Set Exchange⁶². In the case of p-orbitals, we approximate them with 6 Gaussian functions, three placed symmetrically slightly to the left and three to the right of the orbital centre. This corresponds in accuracy to a conventional STO-3G orbital. The fitting is performed on a three-dimensional real space grid. Fit parameters include 3 contraction exponents and 3 coefficients, with the latter multiplied by $-1$ on one side to model the nodal character of p-orbitals. For the optimisation of the parameters during the fitting process, the trust region algorithm is used. This is done via the function curve_fit from the mathematic Python library scipy⁶³.

The Gaussian lobe function expansion would also work for orbitals of higher angular momenta. For example, d-orbitals can be split into four lobes which would require twice as many Gaussians as used for p-orbitals (more for the special case of the ${{{\rm{d}}}}_{{z}^{2}}-{{\rm{orbital}}}$). The increased computational effort due to the higher number of Gaussian functions should be comparable to the increased complexity of evaluating integrals over Gaussian functions multiplied with cartesian polynomials of degree two in the conventional approaches.

Density cut-offs

A density ${\rho }_{{ij}}\left({{\boldsymbol{r}}}\right)$ is defined as the product of two wave functions:

$$\begin{array}{c}{\rho }_{{ij}}\left({{\boldsymbol{r}}}\right)={\varphi }_{i}\left({{\boldsymbol{r}}}\right){\varphi }_{j}\left({{\boldsymbol{r}}}\right)={\sum }_{a=1}^{{{n}_{{{\rm{G}}}}}_{i}}{{\sum }_{b=1}^{{{n}_{{{\rm{G}}}}}_{j}}{A}_{a}}A_{b} \exp \left(-{\alpha }_{a}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{a}{||}}^{2}-{\alpha }_{b}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{b}{||}}^{2}\right).\end{array}$$

(2)

To determine whether a given density is relevant for further calculations, we compute a scalar value $r\left({\rho }_{{ij}}\right)$, given by the integral over the absolute value of all pairwise products of Gaussian functions:

$$\begin{array}{c}r\left({\rho }_{{ij}}\right)=\int d{{\boldsymbol{r}}}{\sum }_{a=1}^{{{n}_{{{\rm{G}}}}}_{i}}{\sum }_{b=1}^{{{n}_{{{\rm{G}}}}}_{j}}\left|{A}_{a}\right|\left|{A}_{b}\right|\exp \left(-{\alpha }_{a}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{a}{||}}^{2}-{\alpha }_{b}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{b}{||}}^{2}\right).\end{array}$$

(3)

This equation is valid for s-type orbitals, where the integrand remains positive. Orbitals with nonzero angular momentum (e.g. p-orbitals) include prefactors $x,y,z$ which would require using the absolute values $\left|x\right|,\left|y\right|,\left|z\right|$ to ensure that the integrand is always positive. However, using these absolute values makes an analytic evaluation of the overlap integrals impossible. To resolve this, we developed an approach to represent each lobe of orbitals with a separate set of Gaussians. The previously described lobe-based basis functions (which achieve a nodal plane by summing Gaussians of different signs) are not suitable here, as their absolute-value integrals do not correspond to the absolute value of the full orbital. Instead, we independently fit each lobe of a p-type orbital using Gaussian functions, and then reconstruct the full orbital by combining lobes with different signs and different centres. The representation of a p-type orbital used for the density relevance calculations is given by:

$${\varphi }_{i}\left({{\boldsymbol{r}}}\right)={\sum }_{l=1}^{{{n}_{{{\rm{L}}}}}_{i}}{\sum }_{a=1}^{{{n}_{{{\rm{G}}}}}_{i}}{s}_{l}{A}_{a}$$

$$\begin{array}{c} \cdot \exp \left(-{{\alpha }_{x}}_{i,a}{\left(x-{{x}_{0}}_{i,l,a}\right)}^{2}-{{\alpha }_{y}}_{i,a}{\left(y-{{y}_{0}}_{i,l,a}\right)}^{2}-{{\alpha }_{z}}_{i,a}{\left(z-{{z}_{0}}_{i,l,a}\right)}^{2}\right)\end{array}$$

(4)

where ${s}_{l}$ is the sign of lobe $l$ and each cartesian direction has an individual contraction factor (e.g. ${{\alpha }_{x}}_{i,a}$) and offset (e.g. ${{x}_{0}}_{i,l,a}$) for higher flexibility upon fitting. The fitting process is identical to that described in the section ‘Basis function fitting’.

Finally, to determine if a density ${\rho }_{{ij}}$ is relevant for the calculation, we compute $r({\rho }_{{ij}})$ and compare it to a threshold value.

Coulomb cut-offs

In the Hartree–Fock formalism, the Coulomb interaction is modelled as the repulsion between two electronic densities. Following the method described above, Coulomb interactions are neglected if the separation between densities exceeds a defined cut-off. However, since densities are not point-like, a suitable distance metric must be established to estimate their effective separation. For this purpose, each density is assigned an origin, calculated as the arithmetic mean of the atomic positions associated with the two basis functions forming that density. The interaction distance between two densities ${\rho }_{{ab}}$ and ${\rho }_{{cd}}$ is defined as:

$$d\left({\rho }_{{ab}},{\rho }_{{cd}}\right)=\left|\left| \frac{{{{\boldsymbol{r}}}}_{a}+{{{\boldsymbol{r}}}}_{b}}{2}-\frac{{{{\boldsymbol{r}}}}_{c}+{{{\boldsymbol{r}}}}_{d}}{2}\right|\right|,$$

(5)

where ${{{\boldsymbol{r}}}}_{a},\ldots ,{{{\boldsymbol{r}}}}_{d}$ are the positions of the atoms involved in the four basis functions $a,b,c,d$.

A lower and upper cut-off threshold ${r}_{{{{\rm{c}}}}_{{{\rm{l}}}}}$ and ${r}_{{{{\rm{c}}}}_{{{\rm{u}}}}}$ (typically 8 and 10 Å), are introduced to control the inclusion of interactions. For density pairs with distances between these two thresholds, a smooth transition is applied using a cut-off function to avoid discontinuities. The scaling function (cut-off function) used is

$$\begin{array}{c}f\left(x\right)=1+2{x}^{3}-3{x}^{2},\end{array}$$

(6)

which smoothly interpolates between 1 and 0 as $x$ increases in the interval $\left[{\mathrm{0,1}}\right]$. The distance interval $[{r}_{{{{\rm{c}}}}_{{{\rm{l}}}}},{r}_{{{{\rm{c}}}}_{{{\rm{u}}}}}]$ is linearly mapped to the domain $\left[{\mathrm{0,1}}\right]$ of $f\left(x\right)$.

Electron repulsion integral algorithm

An electron repulsion integral (ERI) describes the interaction between two electronic densities:

$${e}_{{ijkl}}\equiv \left({ij},|,{kl}\right)=\iint d{{\boldsymbol{r}}}d{{{\boldsymbol{r}}}}^{{{{\prime} }}}\frac{{\varphi }_{i}\left({{\boldsymbol{r}}}\right){\varphi }_{j}\left({{\boldsymbol{r}}}\right) \cdot {\varphi }_{k}\left({{{\boldsymbol{r}}}}^{{{{\prime} }}}\right){\varphi }_{l}\left({{{\boldsymbol{r}}}}^{{{{\prime} }}}\right)}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}^{{{{\prime} }}}{||}}$$

$$={\sum }_{a=1}^{{{n}_{{{\rm{G}}}}}_{i}}{\sum }_{b=1}^{{{n}_{{{\rm{G}}}}}_{j}}{\sum }_{c=1}^{{{n}_{{{\rm{G}}}}}_{k}}{\sum }_{d=1}^{{{n}_{{{\rm{G}}}}}_{l}}\iint d{{\boldsymbol{r}}}d{{{\boldsymbol{r}}}}^{{{{\prime} }}}\frac{{A}_{a}{A}_{b}{A}_{c}{A}_{d}}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}^{{{{\prime} }}}{||}}$$

$$\cdot \exp \left(-{\alpha }_{a}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{a}{||}}^{2}-{\alpha }_{b}{{||}{{\boldsymbol{r}}}-{{{\boldsymbol{r}}}}_{b}{||}}^{2}-{\alpha }_{c}{{||}{{{\boldsymbol{r}}}}^{{{{\prime} }}}-{{{\boldsymbol{r}}}}_{c}{||}}^{2}-{\alpha }_{d}{{||}{{{\boldsymbol{r}}}}^{{{{\prime} }}}-{{{\boldsymbol{r}}}}_{d}{||}}^{2}\right).$$

$$\equiv {\sum }_{a=1}^{{{n}_{{{\rm{G}}}}}_{i}}{\sum }_{b=1}^{{{n}_{{{\rm{G}}}}}_{j}}{\sum }_{c=1}^{{{n}_{{{\rm{G}}}}}_{k}}{\sum }_{d=1}^{{{n}_{{{\rm{G}}}}}_{l}}\left({g}_{a}{g}_{b} | {g}_{c}{g}_{d}\right),$$

(7)

where each orbital is expanded in terms of Gaussian functions. Each ERI of the individual Gaussian functions is given by:

$$\left({g}_{a}{g}_{b}|{g}_{c}{g}_{d}\right)={O}_{{g}_{a}{g}_{b}}{O}_{{g}_{c}{g}_{d}}{P}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}{}_{1}F_{1} \left(\frac{1}{2},\frac{3}{2},-{P}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}{d}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}^{2}\right)$$

(8)

with the confluent hypergeometric function ${}_{1}F_{1}$, two overlaps ${O}_{{g}_{a}{g}_{b}}$ and ${O}_{{g}_{c}{g}_{d}}$, a factor ${P}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}$ and a distance ${d}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}$ which are defined as:

$${O}_{{g}_{a}{g}_{b}} := \sqrt{2}{\pi }^{\frac{5}{4}}{A}_{a}{A}_{b}\frac{1}{{\left({\alpha }_{a}+{\alpha }_{b}\right)}^{\frac{3}{2}}} \cdot \exp \left(-\frac{{\alpha }_{a}{\alpha }_{b}}{{\alpha }_{a}+{\alpha }_{b}} || {{{\boldsymbol{r}}}}_{a}-{{{\boldsymbol{r}}}}_{b} ||^{2}\right),$$

(9)

$${P}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}:=\frac{1}{\frac{1}{{\alpha }_{a}+{\alpha }_{b}}+\frac{1}{{\alpha }_{c}+{\alpha }_{d}}},$$

(10)

$${d}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}:= || {{{\boldsymbol{r}}}}_{{ab}}-{{{\boldsymbol{r}}}}_{{cd}} || ,{{{\boldsymbol{r}}}}_{{ab}}=\frac{{\alpha }_{a}{{{\boldsymbol{r}}}}_{a}+{\alpha }_{b}{{{\boldsymbol{r}}}}_{b}}{{\alpha }_{a}+{\alpha }_{b}}.$$

(11)

We define $x:={P}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}{d}_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}^{2}$ and evaluate the confluent hypergeometric function using the following equation:

$$\begin{array}{c}{}_{1}F_{1}\left(\frac{1}{2},\frac{3}{2},-x\right)=\sqrt{\pi }\frac{{{\rm{erf}}}\left(\sqrt{x}\right)}{2\sqrt{x}},\end{array}$$

(12)

where the error function can be efficiently approximated using:

$$\begin{array}{c}{{\rm{erf}}}\left(x\right)\cong 1-\left({c}_{1}t+{c}_{2}{t}^{2}+{c}_{3}{t}^{3}+{c}_{4}{t}^{4}+{c}_{5}{t}^{5}\right) \cdot \exp \left(-{x}^{2}\right),t=\frac{1}{1+{c}_{0}x}\end{array}$$

(13)

(with coefficients ${c}_{0}=0.3275911$, ${c}_{1}=0.254829592$, ${c}_{2}=0.284496736$, ${c}_{3}=1.421413741$, ${c}_{4}=-1.453152027$ and ${c}_{5}=1.061405429$)⁶⁴. For large $x$ the error function ${{\rm{erf}}}\left(x\right)$ rapidly approaches 1, since $1-{{\rm{erf}}}\left(4\right)\approx 1.5 \cdot {10}^{-8}$ and $1-{{\rm{erf}}}\left(5\right)\approx 1.5 \cdot {10}^{-12}$, and can be approximated accordingly.

We recall that we evaluate the ERIs of the form

$$\left({ij}|{kl}\right)={\sum }_{a=1}^{{{n}_{{{\rm{G}}}}}_{i}}{\sum }_{b=1}^{{{n}_{{{\rm{G}}}}}_{j}}{\sum }_{c=1}^{{{n}_{{{\rm{G}}}}}_{k}}{\sum }_{d=1}^{{{n}_{{{\rm{G}}}}}_{l}}\left({g}_{a}{g}_{b} | {g}_{c}{g}_{d}\right).$$

(14)

It will only be necessary to compute a small part of all possible combinations $\left({ij},|,{kl}\right)$, since we use both a density cut-off and a Coulomb cut-off. Therefore, and to reduce redundant computation, the quadruple summation over all Gaussians is reformulated as a double summation over precomputed Gaussian pairs ${g}_{a}{g}_{b}$ and ${g}_{c}{g}_{d}$, for which the quantities ${O}_{{g}_{a}{g}_{b}}$ and ${{{\boldsymbol{r}}}}_{{ab}}$ as well as ${O}_{{g}_{c}{g}_{d}}$ and ${{{\boldsymbol{r}}}}_{{cd}}$, respectively, are stored in lookup tables. This is practical because number of density pairs is considerably smaller than the total number of ERIs. The complete ERI evaluation algorithm can be visualised in terms of a three-dimensional tensor: the first and second axis are the sums described above and the third axis is the list of all relevant combinations $\left({ij},|,{kl}\right)$. Summing up this tensor along the first two axes yields the full list of ERIs. It must be noted that each integral $\left({g}_{a}{g}_{b} | {g}_{c}{g}_{d}\right)$ in this tensor is computed by using the same equation. However, the lengths of the first two axes ${{n}_{{{\rm{G}}}}}_{i}{{n}_{{{\rm{G}}}}}_{j}$ and ${{n}_{{{\rm{G}}}}}_{k}{{n}_{{{\rm{G}}}}}_{l}$ are not always the same as the number of used Gaussian functions is not the same for all wave functions.

To account for this, we modify the equation for $\left({ij}|{kl}\right)$ by setting the limits of the sums to $\max {{n}_{{{\rm{G}}}}}_{i}{{n}_{{{\rm{G}}}}}_{j}$ and $\max {{n}_{{{\rm{G}}}}}_{k}{{n}_{{{\rm{G}}}}}_{l}$ which leads to:

$$\left({ij}|{kl}\right)={\sum }_{a,b=1}^{\max {{n}_{{{\rm{G}}}}}_{i}{{n}_{{{\rm{G}}}}}_{j}}{\sum }_{c,d=1}^{\max {{n}_{{{\rm{G}}}}}_{k}{{n}_{{{\rm{G}}}}}_{l}}{\delta }_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}^{{ijkl}}\left({g}_{a}{g}_{b} | {g}_{c}{g}_{d}\right).$$

(15)

where ${\delta }_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}^{{ijkl}}$ is 1 if ${ab}\le {{n}_{{{\rm{G}}}}}_{i}{{n}_{{{\rm{G}}}}}_{j}$ and ${cd}\le {{n}_{{{\rm{G}}}}}_{k}{{n}_{{{\rm{G}}}}}_{l}$—otherwise it is 0. Therefore, ${\delta }_{{g}_{a}{g}_{b}{g}_{c}{g}_{d}}^{{ijkl}}$ describes if the index combination ${abcd}$ is defined for the set of Gaussian functions ${g}_{a}{g}_{b}{g}_{c}{g}_{d}$ within $\left({ij},|,{kl}\right)$.

This method allows the evaluation of all $\left({ij}|{kl}\right)$ with the same equation and is therefore highly parallelizable. Additionally, it saves memory, since no intermediate values are stored (all Gaussian ERIs $\left({g}_{a}{g}_{b} | {g}_{c}{g}_{d}\right)$ become contracted immediately in each step of the loop over $({ij}|{kl})$). Precomputing and storing intermediate values dependent on only one density (${O}_{{g}_{a}{g}_{b}}$, ${O}_{{g}_{c}{g}_{d}}$, ${{{\boldsymbol{r}}}}_{{ab}}$ and ${{{\boldsymbol{r}}}}_{{cd}}$) reduce the computational effort as well.

Implementation check

To ensure that the Hartree-Fock implementation runs correctly, especially under the implemented cut-off methods and basis function manipulations, results were benchmarked against the quantum-chemistry package ORCA^{13,65,66,67,68,69}. Specifically, we assessed the agreement between our results and a standard calculation using the STO-3G basis set, focusing on both total and orbital energies. This allowed us to ensure that the programme runs correctly and that lobe function expansion and interaction truncations do not lead to strongly deviating results. As a test system, we selected the molecule beta-carotene (C₄₀H₅₆), which involves 541,089,856 unique ERIs, of which 30,572,290 were considered to have relevant density contributions. Among these, 3,946,243 integrals met both density and Coulomb interaction relevance criteria.

The total electronic energy calculated using our implementation was –1528.547 hartree (Ha), compared with –1528.663 Ha obtained via ORCA. This corresponds to an absolute deviation of 0.115 Ha, or 0.0075%. For the orbital energies, we compared all valence orbitals and the ten lowest virtual orbitals, yielding a root mean square deviation of 7.98 mHa. Using only the Coulomb cut-off leads to a value of 7.57 mHa, whereas only density cut-offs result in an error of 6.07 mHa. No cut-offs at all result in a deviation of 3.23 mHa which is attributed to the modified basis functions in the lobe function expansion which do not match the STO-3G basis exactly and therefore lead to a small error.

In terms of computational performance, ORCA required 192 s for a beta-carotene calculation on an Intel i5 processor. In comparison, our implementation with all cut-offs was completed in 21 s. When disabling the Coulomb cut-off and using a minimal density threshold of 10⁻¹⁰, typical of many Hartree–Fock codes, the runtime increased to 131 s.

Parallel Hartree–Fock on large systems

The density cut-off was set to 10⁻⁴ and the Coulomb cut-off to 10 Å with a smooth transition to zero starting at 8 Å. The self-consistent cycle of the Hartree–Fock calculation used a convergence threshold of 10⁻⁶ for the RMSD of the density matrices. The STO-3G basis set was split into individual Gaussian functions to allow an ERI evaluation according to the above-mentioned Gaussian lobe function expansion algorithm.

It has to be kept in mind that the short Coulomb cut-off neglects long-range interactions since no PME algorithm or similar methods are used. In the divide-and-conquer approach this is forced to guarantee a subsequent evaluation of the clusters. For the applications of UV/Vis spectra and atomic energies, this truncation only has a small to negligible influence on the results. Other properties are also inaccessible if the divide-and-conquer approach is used: these include the system-wide molecular orbitals such as for example HOMO and LUMO and the band gap. Orthogonality between the divide-and-conquer regions is also not maintained, the approach is no longer variational and there would be no access to MD simulations. Note that for applications up to a few thousand atoms, the formalism runs well without the divide-and-conquer approach which makes the properties mentioned above accessible in this region.

System preparation for the large-scale calculations involved solvation in water using the solvation tool of the programme Visual Molecular Dynamics (VMD)⁷⁰. The solvated system was then partitioned into subsystems on a 3D mesh. Hartree-Fock calculations were carried out independently for each subsystem. To reconstruct the total electronic density, the density matrix elements were chosen from the subsystem containing both atoms corresponding to the two basis functions. If the basis functions were associated with different subsystems, the mean of the relevant density matrix elements was used.

The total electronic density was then calculated as

$$\begin{array}{c}{\rho }_{{{\rm{total}}}}\left({{\boldsymbol{r}}}\right)={\sum }_{a=1}^{{n}_{{{\rm{bf}}}}}{\sum }_{b=1}^{{n}_{{{\rm{bf}}}}}r\left(a,b\right){P}_{{ab}}{\rho }_{{ab}}\left({{\boldsymbol{r}}}\right),\end{array}$$

(16)

where ${n}_{{{\rm{bf}}}}$ is the total number of basis functions, $r\left(a,b\right)$ is a relevance determining whether the pair basis functions $a$ and $b$ contributes significantly, and ${P}_{{ab}}$ is the corresponding element of the density matrix.

Two methods were used for the interpolation of density matrix elements. One method was to compute the new density element as the mean of the density elements of the clusters that get merged. In the current and more accurate method, which is the one found in the source code, the cross-section between the line connecting the two atoms involved in one density and the border between the two involved clusters is computed. This yields two new lengths—between the cross-section point and the two atoms—which are normalised to a sum of 1 and serve as the prefactors for the density contributions from the two subsystems. The length in one subsystem serves as the prefactor for the density of the other subsystem, for example if one atom is close to the border, then the contribution of the other subsystem, where the atom is further away from the border, will be small. The described atoms are the atoms associated with the two basis functions that each density is composed of.

Spectral calculation parameters

For the quantum spectral calculations, we employed the RT-TDHF formalism, as it scales more favourably for larger systems than the linear-response time-dependent Hartree–Fock (LR-TDHF) approach, which theoretically scales with the sixth power of the system size.

In RT-TDHF the system is excited by an electric field pulse and propagated in time using quantum-mechanical time evolution. During the propagation, the dipole moments are recorded at each time step and the frequency-dependent polarizability and absorption spectra are obtained via discrete Fourier transform of the dipole time series. The systems were propagated over 2000 steps with a time step of 0.25 atomic units (a.u.), corresponding to a total propagation time of 500 a.u. (12.1 fs). The applied electric field pulse had a strength of 10⁻⁵ a.u., a standard deviation of 0.2 a.u. and a starting time of 2 a.u. During the Fourier transform, an attenuation factor of 0.01 a.u. was used to supress numerical noise.

Following established practice, the resulting spectra were rescaled^71,72 to account for the systematic shifts in excitation energies inherent to time-dependent Hartree–Fock and density functional theory calculations. A rescaling factor of 1.335 was applied, consistent with the benchmark study by Jacquemin et al., who computed TDHF spectra for a wide range of organic dyes⁷².

For the spectral calculations, the density cut-off was chosen as 10⁻⁶ and the Coulomb cut-off as 10 Å. This parametrization lead to results in good agreement with experimental data and in a test showed only small deviations from a calculation using no cut-offs. This test and calculations with further reduction of the density or Coulomb cut-off are shown in Supplementary Fig. 11. A density cut-off of 10^-4 showed similar results to a value of 10^-6 in this test but lead to some absorption peaks vanishing or decreasing in intensity in other tests with benzene and DNA, therefore we chose a cut-off of 10⁻⁶ as the standard parametrization. For the Coulomb cut-off 8 Å also lead to similar results to 10 Å—we nonetheless use 10 Å as the default Coulomb cut-off to have some buffer as the stability might differ from system and system and to not discard too many mid-range interactions.

Atomic energy processing

The atomic energies used for the comparison with AlphaFold’s pLDDT scores were smoothed using a Savitzky-Golay filter⁷³ with a window length of 150. To match the scale of pLDDT values, the computed energies were multiplied by the factor of 15 and shifted by a structure-specific offset (–430, –440 and –460 for PDB entries AF_AFA0A023FF81F1⁴⁷, PDB AF_AFA0A023IKK2F1⁴⁷ and AF_AFA0A023GPK8F1⁴⁷, respectively). The rescaled energies were truncated at the corresponding structure’s maximum and minimum pLDDT values to eliminate extreme atomic energy outliers.

A modified expression for atomic energies was employed, defined as

$$\begin{array}{c}{E}_{a}=\frac{1}{{Z}_{a}}{\sum }_{i=1}^{{{n}_{{{\rm{bfv}}}}}_{a}}{\sum }_{j=1}^{{n}_{{{\rm{bf}}}}}{P}_{{ij}}\left({H}_{{ij}}+\frac{1}{2}{G}_{{ij}}\right),\end{array}$$

(17)

where ${E}_{a}$ is the atomic energy of atom $a$, ${Z}_{a}$ is the atomic charge (used for normalisation), ${{n}_{{{\rm{bfv}}}}}_{a}$ is the number valence basis functions for atom $a$ (excluding core contributions), and ${n}_{{{\rm{bf}}}}$ is the total number of basis functions. ${P}_{{ij}}$, ${H}_{{ij}}$ and ${G}_{{ij}}$ denote the density matrix, core Hamiltonian and two-electron interaction matrix elements, respectively.

System preparation was based on the aforementioned PDB entries⁴⁷, with solvation performed using a 10 Å water shell generated via the solvation tool of Chimera⁷⁴.

The actual and predicted pLDDT scores for the three investigated structures are shown as graphs in Supplementary Fig. 12.

The accuracy of subsystem splitting was tested using the protein Evasin P1126 (PDB ID: AF_AFA0A023FF81F1)⁴⁷. For atomic energy calculations, only valence electrons were considered. Subsystems were constructed by segmenting the amino acid chain of the whole protein into smaller peptide fragments. Comparing a full Hartree-Fock calculation with the subsystem-based approach, we obtained a root mean square deviation of 0.385 Ha for the atomic energies, corresponding to a relative error of 0.444% and a Pearson correlation coefficient of 0.9986. This is shown in Supplementary Fig. 13.

Programme implementation

The described programme is written in the Python programming language⁷⁵. To achieve high computational performance, all computational demanding functions are written either in matrix form or compiled to machine code with just-in-time (JIT) compilation. For JIT compilation, the library Numba⁷⁶ was used, and matrix operations were performed with Numpy⁷⁷ or PyTorch⁷⁸. Visualisation of results and plotting were performed using the library Matplotlib⁷⁹.

Biomolecular structures and visualisation

All calculations in this article were based on biomolecular structures obtained from the RCSB Protein Data Bank (PDB)⁴⁷. The structures and their full references (including original research articles and PDB entries) are as follows: PDB 2M3X³³ (12-bladed propeller protein), PDB 1H68^36,80 (rhodopsin), PDB 6Q3G^1,81 (bacteriophage P68), PDB 1A7Y^5,82 (Actinomycin D), PDB 1D79^3,83 (DNA with 8 base pairs), PDB 2JYK^4,84 (DNA with 21 base pairs), PDB 1MNV^6,85 (Actinomycin D bound to DNA), PDB 8WL2^38,86 (flagellar motor), PDB 7PKR^39,87 (vault protein complex). Structures used for comparing AlphaFold’s pLDDT scores to atomic energies include: PDB AF_AFA0A023FF81F1 (Evasin P1126), PDB AF_AFA0A023IKK2F1 (MHC Class II Beta Chain) and PDB AF_AFA0A023GPK8F1 (Friend of Echinoid, Isoform H). These were obtained from the Computed Structure Models (CSM) section of the RCSB PDB⁴⁷.

Visualisations of biomolecular structures were created using ChimeraX^34,74,88,89 and the PDB Mol* Viewer⁹⁰ for Supplementary Figure 10. All other figure elements (including biomolecular structures at atomic resolution, electronic densities, absorption spectra and atomic energies) were generated with Matplotlib⁷⁹ for python.

Further testing with molecular dynamics

As a test system to study structural optimisation and molecular dynamics, we used a DNA tetramer (sequence ACGT). We calculated the forces for this structure with and without using the Coulomb cut-off. The parametrization of the Coulomb cut-off is the same for the electron repulsion integrals (ERIs) and for the ion-ion interactions. The result is shown in Supplementary Fig. 14.

In addition to this, a structural optimisation was performed for the molecule. For this the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm was used—a quasi-Newton method which uses an approximation of the Hessian and which is commonly used for structural optimisation or for nonlinear optimisation problems in general. Compared to gradient descent methods which do not precondition the gradient vector, the BFGS algorithm usually shows much faster convergence. However, a convergence to zero or values very close to zero is usually not achieved for larger molecules¹⁶ since this would also require to perform a structural optimisation at a larger scale. Therefore, the root-mean-square (RMS) of the forces is often improved by one or two orders of magnitude until the algorithm plateaus.

The development of the RMS of the forces during a BFGS optimisation of the ACGT molecule is shown in Supplementary Fig. 15 for two calculations without and with Coulomb cut-off (cut-off values at 3.5 and 10 Angstrom), respectively. The calculation was run over 30 optimisation steps. In each iteration, the optimal descent prefactor was determined using a line search utilising the golden section search algorithm which avoids re-evaluations as much as possible. During the line search, only the energy is needed and not the forces and therefore the programme presented in this work was run in energy-only mode to save computation time. In Supplementary Fig. 15 we observe convergence for the calculations with and without cut-off. Both converge in a very similar way and show oscillatory convergence at the end. The minimum RMS is achieved at step 28 with a value of around 0.0009 for both calculations.

In Supplementary Fig. 16 the atomic coordinates from before and after the structure optimisation are shown as an overlay. Additionally, at the bottom, the final coordinates from the structural optimisation with Coulomb cut-off are overlayed with the final coordinates from the optimisation without Coulomb cut-off. Here, we observe that the structural optimisation does have an influence on the geometry of the system (see top panel of Supplementary Fig. 16), as it is expected. Furthermore, the final geometry is not significantly influenced by the Coulomb cut-off as both calculations lead to a very similar result. The root-mean-square displacement between the two calculations is computed as 0.047 Angstrom.

It should also be mentioned that forces that deviate from the forces without cut-offs (both density and Coulomb cut-off) still are the derivative of a (now slightly changed) energy landscape. Nonetheless, the sum of the forces still remains zero as they are the exact derivative of the energy. This has been tested by ensuring that the forces obtained from the programme can be reproduced numerically as the limit case of the finite-differences method. Testing has been done for the forces as a sum of derivatives as well as for the derivatives of the integrals itself (overlap, kinetic, nuclei and electron repulsion integrals).

We continue the testing by investigating the stability of molecular dynamics simulations with and without the Coulomb cut-off. To do so, we performed an MD simulation of the same system that was used above for the investigation of the structural optimisation process. The simulation runs over 200 time steps with a step interval of 0.2 fs.

It has to be mentioned that a slight drift in the total energy can be seen with a value of 6.0e-7 Ha per atom and per femtosecond, this is caused by some of the starting forces being very large and is not an error which occurs due to cut-offs as it happens in the calculation without cut-offs as well. The MD simulation was done with large forces to test the performance of the cut-offs for more extreme situations beyond the equilibrium—for this reason, we also chose to work with the rather drastic cut-off scheme 3.5/10. However, to ensure that drift of the total energy is indeed caused only by the integrator, a simulation was done with a smaller time step for a smaller system (DNA with just one base pair, cytosine) with the Coulomb cut-off used. In this case it could be observed that the changes of the total energy can be reduced by decreasing the time step of the molecular dynamics simulation. To be precise, reducing the time step ten-fold leads to an energy conservation error of 8.0e-8 Ha per atom and per femtosecond. For an MD simulation with 10 steps of geometry optimisation prior to its start, a time step of 0.5 fs can be used with an energy conservation error of 2.9e-9 Ha/atom/fs.

In total, we observe very little difference between the kinetics in the two MD simulations with and without cut-off used. Of course, this is not a full test, therefore we also investigate the actual trajectories. However, from Supplementary Fig. 17 it can be seen that the low Coulomb cut-off does not lead to numerical instabilities or to larger forces than in the simulation with cut-off.

Next, we consider the trajectories obtained in the two simulations. We computed the RMSD between the coordinates of the two simulations at each time step. This is shown in Supplementary Fig. 18 below as the green line. Additionally, the mean travelled distance of the atoms in the simulation was computed for both simulations is also displayed in Supplementary Fig. 18.

Here, we see a difference in the trajectories of the two MD simulations but it is small compared to the total movement of the atoms. The maximum RMSD is 0.091 Angstrom. In Supplementary Fig. 19 the atomic coordinates from the start and the end of the MD simulation with no cut-off are overlayed and a second overlay is shown where the final MD coordinates of the simulation with no cut-off are compared with the final coordinates of the MD simulation with the Coulomb cut-off.

We observe again that the changes due to the cut-off are small compared to the total movement in the MD simulation. Note that here a quite aggressive cut-off configuration (3.5 and 10 Angstrom for the lower and upper cut-off) is used which is not recommended due the very small value of 3.5 Angstrom for the lower cut-off. In the following paragraph we discuss the reason for the large deviation between the lower and upper cut-off and why a higher cut-off value is recommended for force calculations.

The choice of the lower and upper cut-off plays an important role, specifically for force calculations. For the calculations that only compute the energy of the system, the difference between the two cut-offs only has the purpose to smoothen the transition to zero. However, if the energy gradient is computed, a fast transition from the actual value of the Coulomb interaction to zero can lead to very large derivatives. This can lead to the derivative between the two cut-off values being larger that the derivative for distances below the lower cut-off value. Such a behaviour has to be avoided to maintain a smooth transition to zero for the forces as well. This is achieved by increasing the distance between the values of the two Coulomb cut-offs, ideally the upper Coulomb cut-off should be at least 2.5 times larger than the lower cut-off to ensure a smooth transition to zero for the forces. For this reason, a cut-off configuration of 8/20 is more recommended which was tested at the beginning of this section for the force comparisons.

To investigate the effect of the divide-and-conquer method on atomic forces, we performed the same test as for the Coulomb cut-off where the forces are computed with and without the approximation used. Here, we study a larger system in the form of the DNA 12-mer ACGTACGTACGT which will be split into three separate sections during the divide-and-conquer formalism. We compute the forces using the Coulomb cut-off configuration (3.5/10) used for the investigation of the effects of the Coulomb cut-off.

Two different configurations are tested, one with a buffer zone of 8 and one with 10 Angstrom. Neighbouring atoms to the buffer zone and hydrogen atoms for bond termination have been added as described in the respective section in the main text.

From Supplementary Fig. 20 we can see that the errors of the forces are quite low, even for the smaller buffer zone at 8 Angstrom. One explanation for this is that the Coulomb cut-off already truncated a lot of the long-range interaction but nonetheless this test gives the information that no significant additional errors occur on top of those caused by the Coulomb cut-off. Furthermore, we see that the distribution of the electron density in the core region is not significantly influenced by the parts not included in the buffer zone because otherwise stronger deviations in the forces would be expected. Another explanation for the low error is the adding of neighbouring atoms to avoid breaking bonds and the replacement of heavier atoms with hydrogen atoms at the section borders which help to construct physically sensible boundary conditions, potentially reducing errors further.

For investigating interactions between molecules, we consider the binding energies between two molecules and two proteins as two test systems. To be precise, the two systems studied are the vitamin biotin (vitamin H) bound to the protein avidin (PDB 1AVD)^91,92 and the drug FK506 bound to the protein FKBP (PDB 1FKJ).^93,94 Both of these systems have high binding energies of –20.4 and –12.8 kcal/mol, respectively.^95,96 Since the computation of binding energies can be considered an area where high accuracies are necessary, it serves as a test if minimal-basis Hartree-Fock can still provide somewhat reliable results.

For this test we perform a single-point energy calculation using Hartree-Fock and try out different cut-off configurations, with no divide-and-conquer used. Since the whole protein with ligand is too large to study without cut-offs, we focus on the area around the ligand. Therefore, the system used in the calculations consists of the ligand and the environment within 4 Angstrom around the ligand. Additionally, bond cutting is accounted for by adding further atoms and replacing some atoms with hydrogen atoms to terminate bonds. The exact procedure is described in the corresponding section in the main text and the resulting systems as well as the complete protein-ligand complex are shown in Supplementary Fig. 21.

For the two systems discussed above, the binding energy is computed as the energy difference between the energy of the protein-ligand complex and the sum of energies of the protein and the ligand (with ‘protein’ we here refer to the part of the protein treated in the calculation, as described above). In Supplementary Table 1 the resulting energies are listed for different combinations of density and Coulomb cut-offs.

Here, we see limitations due to the use of cut-offs. Especially for the Coulomb cut-off configurations 3.5/10 and 8/10, the binding energy shows strong deviations from the results with no Coulomb cut-off—in some cases no binding is predicted. A reduced density cut-off of 1.0e-5 also show a deviation for the biotin/avidin complex. A higher Coulomb cut-off configuration (15/20) shows less strong deviation from the binding energy without a Coulomb cut-off. Lowering the density cut-off to values of 10⁻⁷ or 10⁻⁸ has little influence on the computed binding energy. However, for the calculations with the highest accuracy, the experimental binding energies can be reproduced with errors of 2.3 and 1.2 kcal/mol.

These results show that using low Coulomb cut-offs makes it impossible to use the formalism for applications which need a higher accuracy. In these cases, it is necessary to use no (or a much larger) Coulomb cut-off. The density cut-off at 10⁻⁶ appears to be a reasonable choice which should not compromise the accuracy significantly. Consequently, testing is required if a Coulomb cut-off or a higher density cut-off is used for applications beyond those studied in this work to ensure that the accuracy is sufficient for the investigated problems. If no Coulomb cut-off is used and a density cut-off smaller than 10⁻⁶ is employed, we observe that we can reproduce the experimental binding energies for the two studied systems within a reasonable error.

Source link

A quantum-mechanical framework for million-atom scale biological systems

Energy Storage Chemicals Market in Germany | Report – IndexBox

Discovery renews confirmation that Mars had the “chemistry for life”

Syria arrests Assad-era general accused of involvement in chemical massacre

Abandoned malls, whispers of nuclear war and young foreigners detained. This is what’s REALLY going on in Dubai… and the chilling warning one taxi driver gave to the Mail’s IAN BIRRELL

Chongqing Aims To Build Hub Role

US trade chief says tech restrictions to block Chinese autos

S’porean man found dead at residential unit in Hong Kong after failing to report for work

Dubai’s glitzy restaurants forced to change menus as Iran war hits trade

Juanjo Duran, Global Head of Media and Entertainment Content Partnerships at Google

Brighton fighter Louie Doherty eyes Bangkok title shot

China voices concerns over ‘discriminatory’ EU draft laws – World

Annual meeting reaffirms strength of Australia-Indonesia livex partnership

Chelsea vs AC Milan Set for Jakarta Showdown on August 8

Most Popular

Media & Entertainment CEO Pay Surged 117% In 2025

DSM Fresh Foods Appoints Rajneesh Bhasin as Independent Director to Strengthen Governance – scanx.trade

Chongqing Tongliang Long vs Shenzhen Peng City

Our Picks

HKSAR’s challenges warrant mission-oriented governance

‘Study in China’ trend facilitates mutual learning among civilizations

UAE oil firm chief says Hormuz still shut, Iran must open it without conditions

SUBSCRIBE TO OUR NEWSLETTER

A quantum-mechanical framework for million-atom scale biological systems

Modifications to the Hartree–Fock algorithm

Basis function fitting

Density cut-offs

Coulomb cut-offs

Electron repulsion integral algorithm

Implementation check

Parallel Hartree–Fock on large systems

Spectral calculation parameters

Atomic energy processing

Programme implementation

Biomolecular structures and visualisation

Further testing with molecular dynamics

Related Posts