Molecular deep learning at the edge of chemical space

Data preprocessing and analysis

Molecular representation and description

Each molecular structure in this study was represented as a SMILES⁴² string. CATS⁵⁶ descriptors and 2048-bit ECFPs⁵³ (using a radius of 2) were computed for each molecule.

Data collection and curation

Thirty-three labelled datasets of molecular structures with their corresponding experimental target property were used:

Moreover, small molecules were collected from ChEMBLv33⁴⁵ for model pretraining. Because molecular structures from ChEMBL were used for pretraining, molecules with a Bemis–Murcko scaffold⁷⁶ similar to any such scaffold in the labelled datasets (Tanimoto similarity coefficient⁵⁴ on EFCPs larger than 0.7) were removed. This included molecules without a Bemis–Murcko scaffold (that is, containing no rings).

For prospective virtual screening, the most recent CDK1 (CHEMBL308) data were fetched from ChEMBL v35⁴⁵ (accessed in April 2025) as an additional and independent dataset. Raw data were processed in accordance with previous work⁴⁸.

All bioactivity endpoints of the MoleculeACE⁴⁸ and CDK1 datasets were converted from continuous regression labels into binary classification labels. Molecules with an EC₅₀ (half maximal effective concentration) or K_i (inhibitory constant) of 100 nM or lower were labelled as bioactive, whereas less potent molecules were labelled as inactive. For the LIT-PCBA⁴⁹ and Ames mutagenicity dataset⁵⁰, their original binary classification labels were used.

For all datasets, SMILES strings were preprocessed using RDKit v. 2024.3.3⁷⁷. For each SMILES string, stereochemistry tokens as well as salts and solvents (Supplementary Table 2) were removed. Each molecule was sanitized, neutralized using predefined neutralization reactions (Supplementary Table 3), and its SMILES string was canonicalized. Disconnected structures and molecules that contained formal charges, contained complex ring systems (SMILES strings with a ring index of 9 or higher), non-standard isotopes or any atoms other than Cl, Br, H, C, N, O, F, S and I were removed. Molecules were removed if they contained more than 100 tokens in their canonical SMILES string or if they could not be featurized into CATS descriptors and/or ECFP fingerprints. Sizes of datasets before and after data curation are presented in Supplementary Table 1.

Virtual screening library (retrospective)

Molecules were gathered from three commercial screening libraries:

All molecules were processed in the same way as the training datasets. Finally, all unique molecules were aggregated.

Virtual screening library (prospective)

The most recent Specs library was downloaded from https://www.specs.net/index.php?view=databases&page=download (accessed in April 2025). All molecules were processed in the same way as the training datasets. To ensure that molecules were compatible with our experimental assay (likely soluble in 1% dimethylsulfoxide (DMSO)), physicochemical rules were enforced based on relaxed rule-of-five and Veber criteria:

(1)

Molecular weight between 200 g mol⁻¹ and 600 g mol⁻¹.
(2)

logP lower than 6.
(3)

Total polar surface area between 20 Å² and 140 Å².
(4)

Number of hydrogen bond donors lower than six.
(5)

Number of rotatable bonds lower than ten.
(6)

A maximum of two rule-of-five violations.

In addition, molecules with a terminal enone, isocyanate, quinone, aromatic nitro groups, azide or epoxide groups were removed by using SMILES Arbitrary Target Specification (SMARTS) patterns to prevent assay interference. Furthermore, to prevent the selection of trivially simple molecules and enrich the general-purpose Specs library for kinase-relevant chemical space, several general criteria were enforced:

(1)

A molecule must have an ATP mimetic core, that is, at least one heteroatom in a ring, or a fused carbocyclic system.
(2)

A molecule must have a polar anchor to ensure solubility or solvent interaction.
(3)

A molecule must have enough hydrophobic mass or planarity to fill the kinase ATP pocket (for example, gatekeeper region).
(4)

A molecule must have a directional H-bond donor/acceptor to ensure potential for hinge interaction.
(5)

A molecule must have two or more rings.

These rules are intentionally permissive for new chemotypes, and 98.6% of all kinase inhibitors that ever went into clinical trials⁷⁸ pass these filters. Finally, molecules with a Tanimoto similarity on ECFPs >0.7 to any molecule in the respective target’s data (PIM1, and CDK1) were removed. This left 185,298 and 185,336 screening molecules for PIM1 and CDK1, respectively.

Molecular cyclic skeletons

Cyclic skeletons⁵¹ (core ring systems without exocyclic substituents) were extracted from molecules to serve as the most fundamental molecular scaffold representation. From each molecule, Bemis–Murcko scaffolds⁷⁶ were obtained, removing peripheral substituents. Remaining double-bonded exocyclic substituents were then removed, and all atoms and bonds were made generic to obtain the final cyclic skeleton.

Data splitting

The curated molecular structures from ChEMBL were split into a training (80%, n = 1,230,041), a test set (10%, n = 153,755) and a validation (10%, n = 153,755) set, using random splitting. All labelled datasets were split into a training set (~50%), test set (~25%) and OOD set (~25%). To determine the OOD molecules, spectral clustering was performed on unique cyclic skeletons (see below). The molecules corresponding to the scaffolds in the clusters with the lowest mean cluster similarity to all other clusters that constituted approximately 25% of the total dataset size were taken as the OOD set. The remaining molecules (that is, the ~75% most similar molecules) were split randomly in a train and test set, with the test set being equal in size to the OOD set. An overview of all datasets is presented in Supplementary Table 1.

Spectral clustering

Spectral clustering was performed on a molecular similarity matrix A using Sci-kit learn⁷⁹. A is an n × n matrix where each element A_ij is the Tanimoto coefficient on ECFPs T_ij between two molecular structures. By using a molecular similarity matrix directly, we bypass the complex, high-dimensional and non-Euclidean nature of molecular structures. From this affinity matrix, the symmetrically normalized Laplacian was constructed as follows:

$${L}_{\mathrm{sym}}=I-{D}^{-1/2}A{D}^{-1/2},$$

(3)

where I is the identity matrix and D is the degree matrix. Subsequently, eigenvalue decomposition was performed:

$${L}_{\mathrm{sym}}={\bf{U}}\Lambda {{\bf{U}}}^{T}{.}$$

(4)

To determine the number of clusters k for spectral clustering, the eigenvalues λ₁, λ₂, …, λ_n were sorted in ascending order and the elbow (the point of maximal curvature) of the resulting sequence was estimated using the kneed algorithm⁸⁰. Finally, the spectral embeddings of the data were clustered by taking the top k (smallest) eigenvectors U_k, normalizing U_k to unit length and performing k-means clustering on all rows (u_k).

Chemical space visualization

To visualize each labelled dataset, molecules were first encoded as ECFPs. The resulting binary ECFPs were reduced to 100 components using truncated singular value decomposition and embedded into a two-dimensional space using t-distributed stochastic neighbour embedding with a perplexity value of 30 and default settings.

MCS fraction

To compute molecular core similarity, we computed the MCS fraction⁵⁵ between a molecule M_a and a reference molecule M_b as

$$\mathrm{MCS}\,\mathrm{fraction}=\frac{\left|\mathrm{MCS}\left({M}_{a},\,{M}_{b}\right)\right|}{\left|{M}_{a}\right|},$$

(5)

where |M_a| is the number of atoms in M_a and MCS is the maximal common substructure between the two molecules M_a and M_b, as determined by the FMCS algorithm⁷ in RDKit. A high MCS fraction indicates that a molecule shares a significant portion of their overall core structure with a reference molecule. This implementation is asymmetric.

Molecular complexity

To quantify molecular complexity we compute the well-established Bertz complexity⁸¹ and Böttcher complexity⁸². In addition, to align complexity measures with the task of reconstructing SMILES strings, we also estimated the complexity of the molecular graph directly⁸³ as

$${C}_{\mathrm{graph}}=V{\log }_{2}V-\mathop{\sum }\limits_{i\in {\mathscr{G}}}{V}_{i}{\log }_{2}{p}_{i},$$

(6)

where G represents the molecular graph, V is the total number of elements in the graph and V_i is the number of the elements in the ith set of elements. In a similar fashion, we also estimated the complexity of a SMILES string, via their entropy computed as

$${C}_{\mathrm{SMILES}}=-\mathop{\sum }\limits_{i\in S}{p}_{i}{\log }_{2}{p}_{i},$$

(7)

where S represents the set of unique tokens in a SMILES string, and p_i is the probability of the ith token occurring in S. Tokens representing the start, end of sequence, and padding were not considered. Moreover, for each molecule, we counted the number of SMILES tokens (excluding padding), the number of SMILES string branches (that is, ‘(‘ tokens), and the presence of 50 unique molecular patterns⁷⁴.

Virtual screening

Top-k molecules were selected in a multi-objective manner by selecting the k molecules closest to the utopia point⁶⁸. The distance to this ideal point can be calculated as

$${d}_{\mathrm{utopia}}=\sqrt{\mathop{\sum }\limits_{i=1}^{n}{\left({\mathrm{norm}}_{i}\right)}^{2}},$$

(8)

where norm_i is the normalized objective that is either maximized (for example, predicted bioactivity) or minimized (for example, prediction uncertainty):

$$\genfrac{}{}{0ex}{}{{{\rm{norm}}}_{i}}{[\text{max}]}=\frac{{x}_{\text{max}}-{x}_{i}}{{x}_{\text{max}}-{x}_{\text{min}}}{\rm{or}}\genfrac{}{}{0ex}{}{{{\rm{norm}}}_{i}}{[\text{min}]}=\frac{{x}_{i}-{x}_{\text{min}}}{{x}_{\text{max}}-{x}_{\text{min}}}.$$

(9)

Machine learning

Encoder

Canonical SMILES strings were encoded by a one-dimensional convolutional neural network. SMILES string character tokens were embedded using a randomly initialized trainable embedding layer of size 128. Several one-dimensional convolutional layers were used with a stride of 1 and no padding. Each layer was followed by a ReLU activation, standard max pooling with a kernel size equal to that of the convolutional layers, and dropout. Both convolutional and pooling layers used a stride of 1 and no padding. The final output was flattened and compressed to a latent vector (z) of size 128 using a fully connected layer. The following hyperparameters were optimized (see ‘Hyperparameter optimization’ section): the number of convolutional layers [2, 3], filter size [256, 512], kernel size [6, 8], weight decay on convolutional neural network weights [0, 1 × 10⁻⁴] and dropout [0, 0.1].

Decoder

Encoded latent molecular representations in z were reconstructed back to SMILES strings using a conditioned LSTM model. A randomly initialized trainable embedding layer of 128 neurons was used to embed SMILES string character tokens. The following hyperparameters were optimized (see ‘Hyperparameter optimization’ section): the number of LSTM layers (n_layers) [2, 3] and the LSTM hidden size (${\mathrm{size}}_{\mathrm{layers}}$) [256, 512]. Models were trained autoregressively without teacher forcing using next token prediction based on the tokens predicted in the previous steps rather than the ground truth tokens. A reconstruction loss normalized for sequence length was used (equation (1)). To condition the model, the LSTM hidden state h₀ was initialized with z for every molecule. To correctly match all dimensions of h₀ (${n}_{\mathrm{layers}},{\mathrm{size}}_{\mathrm{layers}}$), z was first transformed to ${n}_{\mathrm{layers}}\,\times {\mathrm{size}}_{\mathrm{layers}}$ with a fully connected layer, after which it was reshaped into ${{\rm{n}}}_{\mathrm{layers}}$ chunks of ${\mathrm{size}}_{\mathrm{layers}}$.

Approximate Bayesian classifier

Labels were predicted from either ECFPs or SMILES strings encoded into latent molecular representationsz. A MLP was used with several fully connected layers and an output layer consisting of two neurons. The number of MLP layers [2, 3] and the MLP hidden size [1,024, 2,048] were optimized (see ‘Hyperparameter optimization’ section). To estimate prediction uncertainty, anchored ensembling⁴⁷ was implemented on the MLP as in our previous work⁷⁴. We used an ensemble of M = 10 models. For each model, $m\in [1\ldots M]$, we regularized its parameters θ_m with a set of ‘anchored’ parameters θ_anchor,m that prevent different models in the ensemble to converge to the same parameter space. Each model is initiated with distinct θ_anchor, which is controlled by different random seeds. The classification loss in our implementation is defined as

$${{\mathscr{L}}}_{\mathrm{MLP}}=-\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}\mathop{\underbrace{\log {p}_{m}\left(\,{y|x}\right)s}}\limits_{\text{prediction}}+\mathop{\underbrace{\lambda {{||}{\theta }^{m}-{\theta }_{\mathrm{anchor}}^{m}{||}}^{2}}}\limits_{\mathrm{anchoring}},\,$$

(10)

where λ is a regularization coefficient (set to 3 × 10⁻⁴). To estimate the expected value ${\mathbb{E}}$ of each molecule x, we take the mean prediction over the whole ensemble, as follows:

$${\mathbb{E}}\left(\,{y|x}\right)=\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{p}_{m}\left(\,y,|,x\right).$$

(11)

Similarly, we estimate prediction uncertainty for each molecule $x$ as the mean entropy ${\mathbb{H}}$ over all models in the ensemble:

$${{{\mathbb{H}}}}\left({y|x}\right)=-\frac{1}{M}\mathop{\sum }\limits_{m=1}^{M}{p}_{m}\left(\,{y|x}\right)\log {p}_{m}\left(\,{y|x}\right).$$

(12)

JMM

Canonical SMILES strings were encoded into compressed latent molecular representations ${\bf{z}}$ using the encoder model described above. Subsequently, ${\bf{z}}$ was used to perform both molecular property prediction with a classifier and molecular reconstruction using the ${\bf{z}}$-conditioned decoder model²⁷. The model was optimized in a joint fashion using the following weighted composite loss function:

$${{\mathscr{L}}}_{\mathrm{JMM}}={{\mathscr{L}}}_{\mathrm{reconstruction}}+\gamma \times {{\mathscr{L}}}_{\mathrm{MLP}},$$

(13)

where the scalar γ was set at 0.1 based on preliminary experiments. A regular autoencoder was used as preliminary experiments showed no performance benefits of the more complex variational autoencoder⁸⁴ architecture.

RF

An RF classifier was trained on either ECFPs or CATS⁵⁶ descriptors. The following hyperparameters were optimized (see ‘Hyperparameter optimization’): the number of trees [100, 250, 500, 1,000], the maximal tree depth [10, 20, 30, ∞] and the minimal samples per split [2, 5, 10].

Model training

Autoencoders

Encoder–decoder models were (pre)trained to reconstruct SMILES strings of general drug-like molecules from ChEMBL using the Adam optimizer. Mini batches of 256 random molecules were sampled from the training data for 1,000,000 steps using uniform sampling. Gradients were clipped with a max norm of 5. Early stopping with a patience of up to 20 evaluation checkpoints was implemented by monitoring validation loss every 10,000 steps. The model checkpoint with the best validation loss was used.

Classifiers

Classifiers using ECFPs or SMILES strings as input were trained for molecular property prediction on each of the labelled dataset using a similar setup to the autoencoders. However, tenfold Monte Carlo cross-validation was used with 10% validation splits instead of one predefined data split. Mini batches of 64 were resampled during training based on the occurrence of their class with

$${P}_{c}=1-\frac{{n}_{c}}{N},$$

(14)

where P_c is the probability of sampling class c, n_c is the number of samples of class c, and N is the total number of samples. Models were trained for 5,000 steps with an early stopping patience of 10 evaluation checkpoints, performed every 10 steps. The model checkpoint with the best validation loss was used. For the RF control models, molecules were weighted inversely proportionally to their class frequency to mitigate class imbalance during training.

JMMs

Joint models, each consisting of a SMILES string encoder, a decoder and a classifier (Fig. 1b), were initialized with pretrained weights. The SMILES string encoders and classifiers used weights from models trained on the labelled datasets. For the decoder, decoder weights were used from an autoencoder pretrained on ChEMBL. Using a mini batch size of 64, the joint models were finetuned for 10,000 steps with an early stopping patience of 50 evaluation checkpoints, performed every 20 steps. A learning rate of 3 × 10⁻⁶ was used for the encoder and classifier, whereas a learning rate of 3 × 10⁻⁷ was used for the decoder. No weight decay was applied.

Hyperparameter optimization

Hyperparameters, as specified previously, were optimized for all autoencoders, classifiers (using ECFPs or SMILES strings), and RF control models using a simple grid search. Tenfold Monte Carlo cross-validation was used, repeatedly using 10% of the training data as a validation split. The hyperparameters of the model with the best mean validation loss was used.

Model evaluation

Predictions were evaluated according to the following performance metrics:

$$\text{Balanced accuracy}=\frac{1}{2}\left(\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}+\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}\right),$$

(15)

$$\text{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}},$$

(16)

where TP is the number of true (that is, correctly predicted) positives, TN is the number of true negatives, FN is the number of false negatives, and FP is the number of false positives. In addition, the hit rate (true positive rate) was determined for virtual screening experiments as

$$\text{Hit rate}=\frac{\mathrm{TP}}{P},$$

(17)

and enrichment factor as

$$\text{Enrichment factor}=\frac{{\mathrm{TP}}_{k}}{P/N}.$$

(18)

Here, TP_k is the number of correctly identified positives in the subset of k prioritized molecules, P is the total number of positives in the full dataset, and N is the total number of molecules in the full dataset.

Biological characterization

Sixty screening compounds were purchased from Specs Compound Handling B.V. and dissolved at 10 mM in 100% DMSO.

To screen for bioactivity, a point screening was first performed at a concentration of 10 µM (in 1% DMSO) in technical triplicates using the ADP-Glo Kinase Assay platform from Promega using the Chemi-Verse PIM1 Kinase Assay Kit and the Chemi-Verse CDK1/CyclinA2 Kinase Assay Kit from BPS Bioscience in Costar flat white 96-well plates. AZD1208 (CAS 1204144-28-4) and dinaciclib (CAS 779353-01-4), purchased from TargetMol Chemicals, were used as positive controls for PIM1 and CDK1, respectively. Bioactivity was measured as the area under the curve of an 18-step luminescence scan between 398 nm and 653 nm with an integration time of 1 s and a settle time of 100 ms, normalized for the signal in buffer-only wells.

For each target protein, the six compounds with the highest bioactivity at 10 µM were followed up with an 8-point dose–response curve using the same assay. Screening compounds were measured in technical triplicates from 10 µM to 0.0046 µM, whereas reference compounds were measured in duplicate.

Hardware and training set-up

All computational experiments were performed on a Lenovo ThinkSystem SD650-N v2 server equipped with Intel Xeon Platinum 8360Y central processing units and NVIDIA A100 (40 GB) graphics processing units. Up to five models were trained in parallel on a single graphics processing unit.

Software and code

All code was implemented in Python (v. 3.12). Deep learning models were implemented using PyTorch (v. 2.3.0)⁸⁵. Traditional machine learning models and clustering was implemented using Sci-kit learn v.1.4.0 (ref. ⁷⁹). All molecular data were handled using RDKit (v.2024.3.3)⁷⁷. For data visualization, R (v.4.3.0) and the R package ggplot2 (v.3.4.2) were used along with Adobe Illustrator.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Source link

Molecular deep learning at the edge of chemical space

Application of non-targeted analysis for the identification of novel environmental contaminants in amniotic fluid

Catalyst design enables efficient biomass upgrading under ambient conditions

Solar-powered airship reaches new heights – News

Abandoned malls, whispers of nuclear war and young foreigners detained. This is what’s REALLY going on in Dubai… and the chilling warning one taxi driver gave to the Mail’s IAN BIRRELL

US trade chief says tech restrictions to block Chinese autos

Leather MIG Welding Gloves – Heat Fire Resistant for Welding/Grilling/BBQ(Black/Brown/Blue)

Reebok plans 200 China stores by 2029 – report

Focusing on the Hydrogen Energy Commercial Vehicle Sector, HTWO Guangzhou Reached Strategic Cooperation with GAC Lincheng – Shanghai Metals Market

UAE students adapt as IB, UK exams stay cancelled despite classroom return

BitDelta Group CEO Dr. Demetrios Zamboglou Visits India as BitDelta India Begins Operations

Is UAE changing its Tour de France plans? Tadej Pogacar may get a new right-hand man following classics showdown