The diagram of the original virtual screening pipeline in DrugCLIP is shown below, with clickable nodes (blue boxes) highlighting the most essential classes and methods.
Bash script for virtual screening with DrugCLIP:

- Calls retrieval.py to run the workflow.

Responsibilities:

1. Because task=drugclip, unimol/tasks/drugclip.py is invoked and the DrugCLIP class is initialized.
2. Calls the build_model method of the DrugCLIP class.
3. Calls retrieval_multi_folds to execute the virtual screening pipeline.

DrugCLIP: class for task="drugclip".

build_model()

1. Calls unicore.models.build_model.
2. Because arch="drugclip", unimol/models/drugclip.py is invoked.
3. The BindingAffinityModel class is initialized, and its build_model method is called.

retrieval_multi_folds

This function orchestrates the multi-fold virtual screening process and manages caching of molecular and pocket embeddings. By default, it loops over 6 folds.
In each fold:

- If use-cache=True, precomputed molecular embeddings are loaded for the current fold.
- If use-cache=False:
  - load_mols_dataset reads the LMDB file containing molecular data.
  - A DataLoader is created.
  - For each batch, the tokens are embedded with model.mol_model.embed_tokens.
  - The embeddings are passed through model.mol_model.encoder to compute molecular representations.
  - Take the [CLS] token embedding, project it to a lower-dimensional space using mol_project, and normalize it to unit length.
  - Append the result to mol_reps. After all batches, convert mol_reps into a matrix of shape (num_samples x embedding_size).
- load_pockets_dataset reads the LMDB file containing pocket data.
- A DataLoader is created.
- For each batch, the tokens are embedded with model.pocket_model.embed_tokens.
- The embeddings are passed through model.pocket_model.encoder to compute pocket representations.
- Take the [CLS] token embedding, project it to a lower-dimensional space using pocket_project, and normalize it to unit length.
- Append the result to pocket_reps. After all batches, convert pocket_reps into a matrix of shape (num_samples x embedding_size).
- Multiply the pocket_reps matrix with the transpose of the mol_reps matrix to compute cosine similarities.
- Write the results to a .txt file.

load_mols_dataset()

1. Uses the LMDBDataset class, which reads LMDB files and returns all data for a single molecule when accessed by index (dataset[idx]).
2. Uses the AffinityMolDataset class, which prepares and organizes the molecule data into a structured dictionary format.
3. Uses the RemoveHydrogenDataset class to remove hydrogen atoms from the molecule representation.
4. Uses the NormalizeDataset class to center the 3D coordinates of the atoms.
5. Prepends (BOS) and appends (EOS) special tokens.
6. Computes edge types with the EdgeTypeDataset class.
7. Computes pairwise distances with the DistanceDataset class.
8. Combines all fields into a NestedDictionaryDataset object, ready for model input.

load_pockets_dataset()

Follows the same processing steps as load_mols_dataset(), but after step 3 it crops the pocket sequence if it contains more than a specified maximum number of atoms (default: 256).
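The preprocessing steps above can be illustrated with a small, dependency-free sketch. It is not the repository's code: the function names, the token IDs (BOS_ID, EOS_ID), the vocabulary, the zero placeholder coordinates for the special tokens, the edge-type formula, and the truncation-based cropping are all assumptions chosen to keep the example short. Plain Python lists stand in for the tensors the real dataset classes produce.

```python
import math

# Hypothetical token IDs and vocabulary size; the real values come from the
# dictionary used by the repository.
BOS_ID, EOS_ID = 1, 2
NUM_TYPES = 10


def remove_hydrogens(atoms, coords):
    """Drop hydrogen atoms together with their coordinates."""
    kept = [(a, c) for a, c in zip(atoms, coords) if a != "H"]
    return [a for a, _ in kept], [c for _, c in kept]


def crop(atoms, coords, max_atoms=256):
    """Keep at most max_atoms atoms (plain truncation; an assumption)."""
    return atoms[:max_atoms], coords[:max_atoms]


def center(coords):
    """Shift coordinates so that their centroid sits at the origin."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    return [[c[k] - centroid[k] for k in range(3)] for c in coords]


def add_special_tokens(token_ids, coords):
    """Prepend BOS and append EOS, each with a placeholder coordinate."""
    zero = [0.0, 0.0, 0.0]
    return [BOS_ID] + token_ids + [EOS_ID], [zero] + coords + [zero]


def edge_types(token_ids, num_types=NUM_TYPES):
    """One integer ID per ordered pair of token types."""
    return [[i * num_types + j for j in token_ids] for i in token_ids]


def distance_matrix(coords):
    """Pairwise Euclidean distances between all positions."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def preprocess(atoms, coords, vocab, max_atoms=None):
    """Run the steps in order; pass max_atoms only for pockets."""
    atoms, coords = remove_hydrogens(atoms, coords)
    if max_atoms is not None:
        atoms, coords = crop(atoms, coords, max_atoms)
    coords = center(coords)
    tokens, coords = add_special_tokens([vocab[a] for a in atoms], coords)
    return {
        "tokens": tokens,
        "coordinates": coords,
        "edge_type": edge_types(tokens),
        "distance": distance_matrix(coords),
    }


vocab = {"C": 4, "N": 5, "O": 6}  # made-up vocabulary
sample = preprocess(
    ["C", "H", "O"],
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    vocab,
)
print(sample["tokens"])  # [1, 4, 6, 2]
```

Cropping is placed directly after hydrogen removal to mirror the "after step 3" ordering described for pockets; how the real code selects which atoms to keep is not stated here, so the truncation above is only a stand-in.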
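BindingAffinityModel

This class implements the DrugCLIP architecture for binding affinity prediction.

1. Calls drugclip_architecture to set up the model configuration.
2. Creates the molecule model, mol_model (UniMolModel class).
3. Creates the pocket model, pocket_model (UniMolModel class).
4. The build_model method creates and returns an instance of the model.

Diagram notes:

- Edge types: unique IDs representing atom-type pairs.
- Distances: encode geometry into attention bias.
- Cosine similarity: unit-normalized embeddings → dot product.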
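As a minimal sketch of the scoring idea (unit-normalized embeddings compared by dot product), the example below uses plain Python lists and made-up 2-D vectors in place of the projected [CLS] embeddings. The helper names l2_normalize, similarity_matrix, and rank_molecules are hypothetical and do not appear in the repository.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(pocket_reps, mol_reps):
    """Dot product of every pocket row with every molecule row.

    This is the list-based equivalent of pocket_reps @ mol_reps.T. Because
    every row has unit length, each entry is a cosine similarity.
    """
    return [
        [sum(p * m for p, m in zip(pocket, mol)) for mol in mol_reps]
        for pocket in pocket_reps
    ]


def rank_molecules(scores_for_pocket, names, top_k=2):
    """Return the top_k (name, score) pairs, highest score first."""
    ranked = sorted(zip(names, scores_for_pocket), key=lambda t: t[1], reverse=True)
    return ranked[:top_k]


# Toy embeddings with made-up values.
mol_names = ["mol_a", "mol_b", "mol_c"]
mol_reps = [l2_normalize(v) for v in ([3.0, 4.0], [1.0, 0.0], [-1.0, 0.0])]
pocket_reps = [l2_normalize(v) for v in ([1.0, 0.0],)]

scores = similarity_matrix(pocket_reps, mol_reps)  # shape: 1 pocket x 3 molecules
top = rank_molecules(scores[0], mol_names)
print([name for name, _ in top])  # ['mol_b', 'mol_a']
```

Normalizing once, when the embeddings are computed, means the similarity step reduces to a single matrix product, which is also what makes cached embeddings directly reusable across queries.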