High-level original pipeline

The diagram of the original virtual screening pipeline in DrugCLIP is shown below, with clickable nodes (blue boxes) highlighting the most essential classes and methods.

Scripts

retrieval.sh

Bash script for virtual screening with DrugCLIP:

  1. Calls retrieval.py to run the workflow.
  2. Accepts command-line (CLI) arguments for customization.
  3. Sets the task to drugclip.
  4. If --use-cache=true, it loads pre-computed molecular embeddings.
  5. If --use-cache=false, it generates new molecular embeddings from scratch.
  6. When --use-cache=false, an LMDB file containing molecule information must be provided via the --MOL_PATH argument.
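
The cache rule in steps 4 to 6 can be sketched in Python as a small argument check. This is a minimal illustration, not the script's actual code: the helper names str2bool and parse_args, the default value of the cache flag, and the exact flag spellings are assumptions made for the example.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'true'/'false' style strings, as a shell script would pass them."""
    lowered = value.lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected true/false, got {value!r}")


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Cache-flag sketch (illustrative)")
    # Assumption: the cache is used unless the caller says otherwise.
    parser.add_argument("--use-cache", type=str2bool, default=True)
    parser.add_argument("--MOL_PATH", default=None,
                        help="LMDB file with molecules; needed when not using the cache")
    args = parser.parse_args(argv)
    # Without cached embeddings there is nothing to encode unless an LMDB file is given.
    if not args.use_cache and args.MOL_PATH is None:
        parser.error("--MOL_PATH is required when --use-cache=false")
    return args


args = parse_args(["--use-cache=false", "--MOL_PATH", "mols.lmdb"])
print(args.use_cache, args.MOL_PATH)
```

Failing early in the argument parser keeps a missing LMDB path from surfacing later as a harder-to-read error inside the embedding loop.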

retrieval.py

Responsibilities:

  1. Parse command-line (CLI) arguments.
  2. Set up the task using UniCore.
  3. Build the model for the specified architecture using the build_model method of the DrugCLIP class.
  4. Call retrieval_multi_folds to execute the virtual screening pipeline.

Classes

DrugCLIP class

Class for task="drugclip".

Important Methods

build_model()

retrieval_multi_folds()

This function orchestrates the multi-fold virtual screening process and manages caching of molecular and pocket embeddings. By default, it loops over 6 folds.

  1. Load checkpoint weights.
  2. Load or compute molecular embeddings.
  3. Load or compute pocket embeddings:
    1. load_pockets_dataset reads the LMDB file containing pocket data.
    2. A PyTorch DataLoader is created.
    3. A batch loop is executed:
      1. Prepare the model input: extract distances, edge types, and tokens; embed tokens using model.pocket_model.embed_tokens.
      2. Fuse and project distance and edge information for graph attention.
      3. Apply model.pocket_model.encoder to compute pocket representations.
      4. Extract the [CLS] token embedding, project it to a lower-dimensional space using pocket_project, and normalize it to unit length.
      5. Append the embeddings to the list pocket_reps. After all batches, concatenate pocket_reps into a matrix of shape (num_samples x embedding_size).
  4. Compute similarity matrices.
  5. Aggregate results across folds.
  6. Save results.
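
The overall control flow of the steps above can be sketched with plain Python lists. This is only an illustration under stated assumptions: load_or_compute, similarity_matrix, and multi_fold_scores are hypothetical helper names, each fold's checkpoint is replaced by a pair of stand-in encoder callables, pickle files stand in for the real cache format, and folds are combined by a plain average because the aggregation rule is not spelled out here.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn, use_cache=True):
    """Return cached embeddings when allowed and present; otherwise compute and store them."""
    if use_cache and os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    reps = compute_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(reps, fh)
    return reps


def similarity_matrix(pocket_reps, mol_reps):
    """Dot product of every pocket embedding with every molecule embedding."""
    return [[sum(p * m for p, m in zip(pocket, mol)) for mol in mol_reps]
            for pocket in pocket_reps]


def multi_fold_scores(fold_encoders, cache_dir, use_cache=True):
    """fold_encoders: one (encode_mols, encode_pockets) pair per fold.

    Each pair stands in for a model restored from that fold's checkpoint.
    """
    per_fold = []
    for fold, (encode_mols, encode_pockets) in enumerate(fold_encoders):
        mol_reps = load_or_compute(
            os.path.join(cache_dir, f"mols_fold{fold}.pkl"), encode_mols, use_cache)
        pocket_reps = load_or_compute(
            os.path.join(cache_dir, f"pockets_fold{fold}.pkl"), encode_pockets, use_cache)
        per_fold.append(similarity_matrix(pocket_reps, mol_reps))
    # Assumption: folds are combined by a plain average of their score matrices.
    n_pockets, n_mols = len(per_fold[0]), len(per_fold[0][0])
    return [[sum(s[i][j] for s in per_fold) / len(per_fold) for j in range(n_mols)]
            for i in range(n_pockets)]


# Two toy folds, each with two molecules and one pocket (already unit length).
folds = [
    (lambda: [[1.0, 0.0], [0.0, 1.0]], lambda: [[1.0, 0.0]]),
    (lambda: [[0.0, 1.0], [1.0, 0.0]], lambda: [[1.0, 0.0]]),
]
with tempfile.TemporaryDirectory() as tmp:
    scores = multi_fold_scores(folds, tmp)
print(scores)  # [[0.5, 0.5]]
```

Keeping one cache file per fold matters because each fold's checkpoint produces its own embedding space, so embeddings from different folds cannot be mixed.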

load_mols_dataset()

  1. Initializes the LMDBDataset class, which reads LMDB files and returns all data for a single molecule when accessed by index (dataset[idx]).
  2. Wraps it in the AffinityMolDataset class, which prepares and organizes the molecule data into a structured dictionary format.
  3. Applies the RemoveHydrogenDataset class to remove hydrogen atoms from the molecule representation.
  4. Uses the NormalizeDataset class to center the 3D coordinates of the atoms.
  5. Extracts atom information, tokenizes it, prepends a beginning-of-sequence (BOS) token, and appends an end-of-sequence (EOS) token.
  6. Generates unique identifiers for each edge between atoms using the EdgeTypeDataset class.
  7. Computes pairwise distances between atoms using the DistanceDataset class.
  8. Combines all the processed information into a single NestedDictionaryDataset object, ready for model input.
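
To make the chain of dataset wrappers more concrete, the sketch below applies the same transformations to a single toy molecule using plain functions. It is not the real class hierarchy: the vocabulary, the "[CLS]"/"[SEP]" token names used for BOS/EOS, the origin coordinates given to the special tokens, and the edge-ID formula are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy vocabulary; the real dictionary is loaded from a file.
VOCAB = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "C": 3, "N": 4, "O": 5}


def remove_hydrogens(atoms, coords):
    """Drop hydrogen atoms together with their coordinates."""
    kept = [(a, c) for a, c in zip(atoms, coords) if a != "H"]
    return [a for a, _ in kept], [c for _, c in kept]


def center(coords):
    """Shift coordinates so that their mean is the origin."""
    n = len(coords)
    mean = [sum(c[k] for c in coords) / n for k in range(3)]
    return [[c[k] - mean[k] for k in range(3)] for c in coords]


def tokenize(atoms):
    """Map atom symbols to IDs and add the BOS/EOS special tokens."""
    return [VOCAB["[CLS]"]] + [VOCAB[a] for a in atoms] + [VOCAB["[SEP]"]]


def edge_types(tokens):
    """One ID per ordered token pair (assumed formula: row * vocab size + column)."""
    return [[ti * len(VOCAB) + tj for tj in tokens] for ti in tokens]


def distances(coords):
    """Pairwise Euclidean distances."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def prepare(sample):
    atoms, coords = remove_hydrogens(sample["atoms"], sample["coordinates"])
    coords = center(coords)
    tokens = tokenize(atoms)
    # Assumption: special tokens get placeholder coordinates at the origin.
    padded = [[0.0, 0.0, 0.0]] + coords + [[0.0, 0.0, 0.0]]
    return {"tokens": tokens, "edge_type": edge_types(tokens), "distance": distances(padded)}


sample = {"atoms": ["C", "H", "O"],
          "coordinates": [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]}
print(prepare(sample)["tokens"])  # [1, 3, 5, 2]
```

The returned dictionary plays the role of the nested structure produced in step 8: every entry is indexed by token position, so tokens, edge types, and distances stay aligned.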

load_pockets_dataset()

Follows the same processing steps as load_mols_dataset(), except that after step 3 (hydrogen removal) it crops the pocket when it contains more than a specified maximum number of atoms (default: 256).
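
A minimal sketch of the cropping step follows. The rule for choosing which atoms survive is not described here, so this example assumes the atoms closest to the pocket centroid are kept; crop_pocket is a hypothetical helper, and only the 256-atom default comes from the text above.

```python
import math


def crop_pocket(atoms, coords, max_atoms=256):
    """Keep at most max_atoms atoms (assumption: those nearest the centroid)."""
    if len(atoms) <= max_atoms:
        return list(atoms), list(coords)
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    # Rank atom indices by distance to the centroid and keep the closest ones.
    order = sorted(range(n), key=lambda i: math.dist(coords[i], centroid))
    keep = sorted(order[:max_atoms])  # restore the original atom order
    return [atoms[i] for i in keep], [coords[i] for i in keep]


# Toy pocket: 300 carbon atoms spaced along the x axis.
atoms = ["C"] * 300
coords = [[float(i), 0.0, 0.0] for i in range(300)]
cropped_atoms, cropped_coords = crop_pocket(atoms, coords)
print(len(cropped_atoms))  # 256
```

Cropping before tokenization bounds the size of the pairwise distance and edge-type matrices, which grow quadratically with the number of atoms.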

BindingAffinityModel

This class implements the DrugCLIP architecture for binding affinity prediction.

Concepts

Edge Types

Unique IDs that represent atom-type pairs, one for each edge between two atoms.
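
One simple way to obtain such IDs is to combine the two token IDs of an atom pair into a single integer. The sketch below assumes the formula row_token * num_types + column_token; the function names and the vocabulary size of 6 are illustrative, not taken from the code base.

```python
def edge_type_ids(tokens, num_types):
    """Matrix of edge IDs: entry (i, j) encodes the pair (tokens[i], tokens[j])."""
    return [[ti * num_types + tj for tj in tokens] for ti in tokens]


def decode_edge_type(edge_id, num_types):
    """Recover the (row_token, column_token) pair from an edge ID."""
    return divmod(edge_id, num_types)


ids = edge_type_ids([3, 4], num_types=6)
print(ids)                       # [[21, 22], [27, 28]]
print(decode_edge_type(22, 6))   # (3, 4)
```

Because every token ID is smaller than num_types, each ordered pair maps to a distinct integer, which lets the model look up pair-specific parameters with a single index.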

Gaussian Basis Features

Pairwise distances are expanded in a Gaussian basis so that geometry enters the model as an attention bias.
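
As a rough illustration, a distance can be expanded into a vector of Gaussian responses, one per kernel. The sketch assumes fixed, evenly spaced kernel centers and a shared width; in a trained model these parameters may be learned and the resulting features are further projected before being added to the attention scores. The names gaussian_basis and distance_features are hypothetical.

```python
import math


def gaussian_basis(distance, centers, width):
    """Response of each Gaussian kernel to a single distance."""
    return [math.exp(-0.5 * ((distance - mu) / width) ** 2) for mu in centers]


def distance_features(dist_matrix, num_kernels=8, max_dist=8.0):
    """Expand every entry of a distance matrix into num_kernels features."""
    step = max_dist / (num_kernels - 1)
    centers = [k * step for k in range(num_kernels)]  # assumption: evenly spaced
    return [[gaussian_basis(d, centers, step) for d in row] for row in dist_matrix]


features = distance_features([[0.0, 1.5], [1.5, 0.0]])
print(len(features), len(features[0]), len(features[0][0]))  # 2 2 8
```

Each distance activates mostly the kernels whose centers lie near it, giving the attention mechanism a smooth, multi-scale description of how far apart two atoms are.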

Cosine Similarity Retrieval

Because the embeddings are normalized to unit length, cosine similarity reduces to a dot product.
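
The equivalence is easy to check numerically. The sketch below uses plain Python lists and hypothetical helper names; it assumes no embedding is the zero vector, since such a vector cannot be normalized.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Textbook cosine similarity, for comparison."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def rank_molecules(pocket_rep, mol_reps):
    """Indices of molecules sorted by decreasing similarity to the pocket."""
    pocket = l2_normalize(pocket_rep)
    scores = [dot(pocket, l2_normalize(m)) for m in mol_reps]
    return sorted(range(len(mol_reps)), key=lambda i: scores[i], reverse=True)


a, b = [3.0, 4.0], [4.0, 3.0]
print(abs(cosine(a, b) - dot(l2_normalize(a), l2_normalize(b))) < 1e-12)  # True
print(rank_molecules([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]))   # [1, 2, 0]
```

Normalizing once when the embeddings are computed means the whole pocket-by-molecule score table can later be produced with a single matrix multiplication.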