BIOPTIC shares ML benchmark for Small Molecules binding prediction on Polaris

A new benchmark for developing Machine Learning models has been revealed! Multiple data sources, aggressive filtering, and high-quality train/test splits: see more below.

November 18, 2024

The problem

One of the most practical tasks in structure-based drug discovery is predicting the binding affinity between a small molecule and a protein structure. Although this is the mainstream approach, protein structures are not always available. Some structures have disordered regions that don’t fold into a single shape, and others are difficult to crystallize. This is why we pursue a sequence-based approach, where a model is trained to make predictions using only the amino acid sequence, without a known 3D structure.

The need for better data and benchmarks

High-quality benchmarks drive model development, but we couldn’t find a suitable one, so we created our own. Today, we’re excited to introduce PLUMBER (Protein–Ligand Unseen Matching Benchmark for Evaluating Robustness) and to make it available to the community through Polaris, a new benchmarking platform for ML in drug discovery applications. This sequence-based benchmark for binding affinity prediction was developed with several strict criteria in mind.

First, models benefit significantly from large datasets, so we aimed to collect as much publicly available data as possible, drawing on the BindingDB, ChEMBL, and BioLiP2 databases.

Second, models perform better with quality data. We therefore extensively preprocess and clean the data to remove outliers, duplicates, noise, and low-quality data points. Starting from tens of millions of data points at the top of the funnel, we narrowed the dataset down to about 1.8M high-quality unique activity values (Ki, Kd, IC50, and EC50).
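To make the cleaning funnel above more concrete, here is a minimal sketch of what such a step could look like. It is not the PLUMBER pipeline (the actual pre-processing code is on Polaris); the record layout, the function names, the nM unit, and every threshold are assumptions chosen purely for illustration.

```python
import math
from collections import defaultdict
from statistics import median

# Hypothetical record layout: (protein_id, smiles, activity_type, value_nM).
ALLOWED_TYPES = {"Ki", "Kd", "IC50", "EC50"}


def to_p_activity(value_nm):
    """Convert a concentration in nM to -log10(molar concentration)."""
    return 9.0 - math.log10(value_nm)


def clean_activities(records, p_min=2.0, p_max=12.0, max_spread=1.0):
    """Filter, convert, and deduplicate raw activity records.

    All thresholds are illustrative assumptions, not PLUMBER's values.
    """
    grouped = defaultdict(list)
    for protein_id, smiles, activity_type, value_nm in records:
        # Keep only the four activity types named in the post.
        if activity_type not in ALLOWED_TYPES:
            continue
        # Drop missing or non-physical concentrations.
        if value_nm is None or value_nm <= 0:
            continue
        p_value = to_p_activity(value_nm)
        # Drop values outside a plausible range (outliers).
        if not (p_min <= p_value <= p_max):
            continue
        grouped[(protein_id, smiles, activity_type)].append(p_value)

    cleaned = {}
    for key, values in grouped.items():
        # Treat widely disagreeing replicates as noise and drop the pair.
        if max(values) - min(values) > max_spread:
            continue
        # Collapse duplicates into one value per (protein, ligand, type).
        cleaned[key] = median(values)
    return cleaned
```

Taking the median of replicate measurements rather than keeping the first one is one common way to obtain a single "unique activity value" per protein-ligand pair; other aggregation rules would be equally defensible.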

Third, we aim to apply the model to novel chemical spaces and proteins. To achieve this, we designed our test set to differ from the training set by employing a split from the recent PLINDER dataset. Under this split, the proteins in our test set differ from those in the training set more substantially than they would under random, temporal, or sequence-similarity-based splits. This provides a more realistic estimate of real-life performance and helps set the right goals, rather than encouraging overfitting to leaked benchmarks.
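The idea of keeping test proteins well separated from training proteins can be illustrated with a generic group-wise split. The sketch below is not the PLINDER split procedure. It assumes that each protein already has a cluster id from some similarity clustering, and the function names and data layout are hypothetical.

```python
def split_by_cluster(protein_to_cluster, test_clusters):
    """Assign proteins to train or test so that no cluster spans both."""
    train, test = [], []
    for protein_id, cluster_id in sorted(protein_to_cluster.items()):
        if cluster_id in test_clusters:
            test.append(protein_id)
        else:
            train.append(protein_id)
    return train, test


def leaked_clusters(train, test, protein_to_cluster):
    """Return cluster ids present in both splits (empty set = no leakage)."""
    train_ids = {protein_to_cluster[p] for p in train}
    test_ids = {protein_to_cluster[p] for p in test}
    return train_ids & test_ids


# Toy example with made-up protein and cluster ids.
clusters = {"A": 0, "B": 0, "C": 1, "D": 2}
train, test = split_by_cluster(clusters, test_clusters={1})
assert leaked_clusters(train, test, clusters) == set()
```

A split that ignores clusters, for example putting "A" in train and "B" in test, would make `leaked_clusters` return a non-empty set. That is the kind of train/test overlap a stricter split is meant to rule out.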

This benchmark enabled us to develop our sequence-based model, and now we’re excited to share it with the community to support the development of this emerging field. On Polaris, you can find the uploaded benchmark and the source code for all of the data pre-processing steps. We hope this will stimulate the creation of new ML methods that can accelerate our development of therapeutics for patients in need.

Link to benchmark on Polaris platform
