2000 to 3000 molecules are typically required over many years before a single compound is suitable for clinical use. AI has the potential to shorten the path to a clinical molecule. The aim is to design and develop novel, precision-engineered drugs with an improved probability of clinical success. This article introduces state-of-the-art AI resources available to pharmaceutical and biotech companies for drug discovery in oncology.
Drug discovery encompasses the target identification and validation, hit identification and lead optimization phases (see Fig. 1). It is followed by the drug development phases, in which preclinical, clinical and post-market trials are conducted.
Fig. 1: Key areas of application of artificial intelligence in drug discovery.
AI has the potential to improve all of those phases. It is particularly critical in oncology, where disease hypothesis generation, target identification and validation require the integration of genomics, functional genomics and genome engineering, combined in structural and functional ways to mimic the tumor phenotype. In the past, most drug targets were found by combing the published scientific literature for insights into molecular pathways or genetic variants linked to disease. Now the focus is on identifying original, novel targets through genomics and functional genomics enabled by artificial intelligence tools.
Current AI applications and tools in drug discovery
Current AI applications cover a broad range of tasks in the drug discovery and development pipelines for the different therapeutic modalities in oncology, including small molecules and biologics, antibodies, peptides, miRNAs, and gene editing therapies. The tasks can be classified in four classes:
For each of those steps and tasks, a vast number of AI resources exist, both proprietary and freely available.
AI Innovations: protein structure prediction models and generative chemistry
Among recent advances, Deep Learning (DL) enabled solutions are the most promising. This section introduces two of them: generative deep models for the design of novel chemical compounds, and highly efficient protein structure prediction models such as AlphaFold2 [4] and, more recently, the ESM Metagenomic Atlas [10], which open up innovative solutions for structure-based drug design and for target and activity modeling in the protein space.
I. Protein structure prediction models: AlphaFold2
AlphaFold2 is a DL model developed by DeepMind that predicts the folding of monomeric proteins, including those for which few homology templates are available [4]. The accuracy of its predictions, measured by the distance difference test score, is non-inferior to that of experimental methods, and confidence metrics are provided to guide the use of 3D structures produced with different levels of uncertainty. It has been shown that, when this uncertainty is taken into account, predictions can be applied to existing structural biology challenges and their quality is near that of experimental models [18].
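The confidence metrics mentioned above can be used to filter a predicted structure before any downstream use. The following is a minimal sketch, assuming the prediction is saved in PDB format with the per-residue confidence score (pLDDT, on a 0 to 100 scale) stored in the B-factor column, as is the convention in AlphaFold2 output files. The function names, the toy structure and the cutoff of 70 are illustrative assumptions, not part of AlphaFold2 itself.

```python
from typing import Dict, List


def parse_plddt(pdb_text: str) -> Dict[int, float]:
    """Return {residue number: pLDDT} read from the C-alpha atoms of a PDB string.

    Assumes a single chain and that the B-factor field (columns 61-66)
    holds the per-residue pLDDT score.
    """
    scores: Dict[int, float] = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 23-26 the residue number.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            residue_number = int(line[22:26])
            scores[residue_number] = float(line[60:66])
    return scores


def confident_residues(scores: Dict[int, float], cutoff: float = 70.0) -> List[int]:
    """Residue numbers whose pLDDT is at or above the (assumed) cutoff."""
    return sorted(r for r, s in scores.items() if s >= cutoff)


def format_ca_atom(serial: int, residue_name: str, residue_number: int, plddt: float) -> str:
    """Build a fixed-width PDB ATOM record for a C-alpha atom (dummy coordinates)."""
    return (
        f"ATOM  {serial:5d} {'CA':<4s} {residue_name:3s} A{residue_number:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


# Toy four-residue "prediction"; the scores are invented for illustration.
toy_pdb = "\n".join(
    [
        format_ca_atom(1, "MET", 1, 92.5),
        format_ca_atom(2, "ALA", 2, 65.0),
        format_ca_atom(3, "GLY", 3, 81.3),
        format_ca_atom(4, "SER", 4, 43.1),
    ]
)

plddt_by_residue = parse_plddt(toy_pdb)
print(confident_residues(plddt_by_residue))
```

In a real workflow the low-confidence residues would typically be masked or excluded before docking or structure-based design, rather than treated as reliable coordinates.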
To understand its utility in oncology we need to answer the following questions:
Fig 2. “Lock and Key” theory of drug-target interactions. Image source: Christopher Vakoc.
3.1 Understand the effects of genetic variants, especially those located in cancer driver genes or oncogenic mutations (TCGA and OncoKB). This task is important for understanding human disease, but it is hard. To discern whether mutations are deleterious, statistical approaches that compare healthy and sick populations are used almost exclusively today. AlphaFold2 can predict wild-type protein structures with high accuracy, but according to some reports [17] it cannot predict the impact of cancer missense mutations on protein structures, because its training data do not contain the altered structures of these mutated proteins. However, this limitation is expected to be circumvented by language-model-based prediction methods, such as ESM Metagenomic, which are better suited to quickly determining how mutations alter a protein's structure.
3.2 Protein design or de novo protein design. There are emerging efforts in this field such as AlphaDesign [9], a computational framework for de novo protein design that embeds AlphaFold2 as an oracle within an optimized design process. It was reported to enable rapid prediction of completely novel protein monomers starting from random sequences. The authors also note that a recent and unexpected ability of AlphaFold2 to predict the structure of protein complexes allows their framework to design higher-order complexes, and they point to the potential for designing proteins that bind to a pre-specified target protein (protein-protein interactions). The structural integrity of the predicted structures is validated by standard ab initio folding and structural analysis methods and, more extensively, by rigorous all-atom molecular dynamics simulations. Their approach also reveals the capacity of AlphaFold2 to predict proteins that switch conformation upon complex formation.
3.3 Transcriptional regulation engineering: Currently, the structural machinery of the protein-based signal transduction pathways that give the cell its form, motility, and function is not yet engineerable. AlphaFold2 may change this, not only in terms of de novo protein design but also in engineering multi-domain proteins with flexible linkers and programmable logic.
Deep Learning models for protein structure prediction such as AlphaFold2 and ESM Meta are an outstanding achievement, given the wider coverage, precision and speed with which folded protein states can now be predicted. They are still only one piece of the puzzle, however, and they highlight open challenges in drug discovery such as the prediction of protein-protein interaction complexes, allostery and dynamics, the relative positioning of protein domains in multi-domain proteins, the identification of immunogenic peptides (neoantigen prediction) and the prediction of the consequences of different types of mutations. Beyond those open challenges in early drug discovery, crucial questions about the in vivo efficacy and safety of any drug remain. Even if we are able to dock (and do structure-based design) on more targets than before and so discover ligands faster, this does not tell us whether a ligand will fail or succeed as a drug once it is tested in vivo or in clinical trials. Drugs fail in the clinic because the wrong targets have been chosen or because their effects are different than anticipated.
In conclusion, protein structure prediction models are an important piece, but only one piece, of the puzzle. They are useful for both target identification and drug design when optimally leveraged in more complex ML pipelines. The potential of those models to understand and engineer structural systems in cancer biology has only begun to be explored, and could hopefully be used to make progress on the in vivo questions that are crucial for drug development.
II. Generative models in drug discovery
De novo molecular design increasingly relies on generative models based on DL techniques, which propose novel compounds that are likely to possess desired properties or activities. Architectures such as Generative Adversarial Networks (GAN), Variational Autoencoders (VAE) and Deep Reinforcement Learning, often used in combination, can be trained on existing data sets and then used to generate novel compounds. Typically, the new compounds follow the same underlying statistical distributions of properties as the training data set. Additionally, different optimization strategies, including transfer learning, Bayesian optimization, reinforcement learning, and conditional generation, can direct the generation process toward desired aims regarding biological activity, synthesis route or chemical features.
What matters here is not the number of molecules a model can produce, but the ability of the resulting lead molecules to meet the safety and efficacy criteria of successful drugs and to withstand studies in animals and human patients. In particular, the generated molecules must achieve superior properties compared with a range of structurally diverse drugs and must satisfy other basic requirements, such as synthesizability and low off-target effects. Two illustrative works are GENTRL [11] and Chemistry42 (developed by Insilico Medicine).
Fig 3. GENTRL architecture. Image source [11]
Fig 4. Insilico Medicine generative procedures for CDK20 hits. Image source: [7]
There are many relevant works in the field of generative chemistry. Most of them are based on gradient optimization algorithms, like the two deep-learning examples referred to above, while other approaches are based on gradient-free stochastic methods such as evolutionary algorithms. Given the breadth of the field, we refer the reader to a summary [12] of exemplar methods for de novo molecular design, broken down by the coarseness of the molecular representation: in essence, whether molecular design is modeled on an atom-based, fragment-based, or reaction-based paradigm. In recognition of the challenges of this field, strong benchmarks that standardize the assessment of generative molecular models are now available as open-source projects [13,14].
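To illustrate the gradient-free alternative mentioned above, here is a toy evolutionary search. It is a sketch under strong assumptions: candidates are fixed-length lists of abstract fragment labels rather than real molecules, and `toy_score` is a made-up objective standing in for a real property predictor. Neither the fragment vocabulary nor the scoring comes from the cited works.

```python
import random
from typing import Callable, List, Tuple

Candidate = List[str]
FRAGMENTS = ["A", "B", "C", "D"]  # hypothetical fragment labels, not real chemistry


def toy_score(candidate: Candidate) -> float:
    """Made-up objective: reward 'C' fragments, penalise adjacent repeats."""
    reward = candidate.count("C")
    penalty = sum(1 for a, b in zip(candidate, candidate[1:]) if a == b)
    return reward - 0.5 * penalty


def mutate(candidate: Candidate, rng: random.Random) -> Candidate:
    """Copy the candidate and replace one randomly chosen fragment."""
    child = list(candidate)
    position = rng.randrange(len(child))
    child[position] = rng.choice(FRAGMENTS)
    return child


def evolve(
    score: Callable[[Candidate], float],
    length: int = 6,
    population_size: int = 20,
    generations: int = 30,
    seed: int = 0,
) -> Tuple[Candidate, List[float]]:
    """Evolutionary search using only score evaluations (no gradients)."""
    rng = random.Random(seed)
    population = [
        [rng.choice(FRAGMENTS) for _ in range(length)]
        for _ in range(population_size)
    ]
    best_per_generation: List[float] = []
    for _ in range(generations):
        # Truncation selection: keep the better half unchanged (elitism).
        population.sort(key=score, reverse=True)
        parents = population[: population_size // 2]
        # Refill the population with mutated copies of randomly chosen parents.
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
        best_per_generation.append(max(score(c) for c in population))
    return max(population, key=score), best_per_generation


best, history = evolve(toy_score)
print("".join(best), history[-1])
```

Because the better half of each generation is carried over unchanged, the best score never decreases from one generation to the next. In practice the score would come from a trained property model or a docking calculation, and the mutation step would operate on a chemically valid representation.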
The role of ML and AI in drug discovery is growing, with special interest in de novo molecular design methods because of their ability to navigate extremely large chemical spaces more effectively than either virtual screening or a human expert. To put this in context, the space of possible small organic molecules is estimated to contain 10^60 molecules. Despite early concerns regarding the use of automated methods for molecular design, often relating to the instability, reactivity, actionability, and synthetic feasibility of the molecules suggested, we now have a variety of tools at our disposal that are proficient generators of sensible molecular structures [12].
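A back-of-the-envelope calculation shows why exhaustively enumerating this space is out of reach. The space size of 10^60 is the estimate quoted in the text. The screening rate of one billion molecules per second is an assumed, deliberately optimistic figure chosen only for illustration.

```python
CHEMICAL_SPACE_SIZE = 10**60           # estimate quoted in the text
ASSUMED_MOLECULES_PER_SECOND = 10**9   # assumption: optimistic screening rate
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_to_enumerate(space_size: int, rate_per_second: int) -> float:
    """Years needed to evaluate every molecule once at a constant rate."""
    return space_size / rate_per_second / SECONDS_PER_YEAR


years = years_to_enumerate(CHEMICAL_SPACE_SIZE, ASSUMED_MOLECULES_PER_SECOND)
print(f"{years:.1e} years")
```

Even at that rate the enumeration would take on the order of 10^43 years, which is why methods that steer the search toward promising regions are attractive.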
Final Considerations
The application of DL is producing outstanding achievements in specific steps of the drug discovery pipeline, such as protein structure prediction and generative chemistry. To truly advance drug discovery in oncology, however, we still need to understand cancer biology better in the first place. It is currently not trivial to apply AI methods in the drug discovery context, to a good extent because of the difficulty of generating and labeling relevant biological and physiological data for questions related to efficacy and safety. In addition, the amount of data available in cancer biology does not qualify as 'big data': the datasets available for cancer therapeutics are substantially smaller than those available in other fields [16], and the data suffer from high technical heterogeneity, high dimensionality and a low signal-to-noise ratio. Omics data often suffer from measurement inconsistencies between cohorts, marked batch effects and dependencies on specific experimental platforms. Such a lack of consistency is a major hurdle when applying AI methods. Consensus on the measurement, alignment and normalization of tumour omics data will be critical for each data type. In agreement with [15], only when we are able to measure and capture relevant biological endpoints in vivo will we be able to advance the field significantly further and to apply the computational algorithms currently available to us fruitfully in drug discovery, with respect to compound efficacy and safety in the clinic.
Dr. Aurelia Bustos, MD, PhD