
Small Organic Molecules: Chemical Space, Reactions, Catalysis and Autocatalysis Workshop

24 Mar 20

Date: Tuesday 24th March 2020, 9.00 am – 6.00 pm, Department of Statistics, Large Lecture Theatre LG.01, 24-29 St Giles’, Oxford. Please register here.

9.00    Introduction and Welcome [Jotun Hein, Department of Statistics, Oxford]

Session I: Combinatorics and Algorithmics of Molecular Graphs

9.10    GDB and the Chemical Space [Jean-Louis Reymond, Berne, Switzerland]

Chemical space is a concept to organize molecular diversity by postulating that different molecules occupy different regions of a mathematical space where the position of each molecule is defined by its properties. Our aim is to develop methods to explicitly explore chemical space in the area of drug discovery. We have enumerated all possible molecules following simple rules of chemical stability and synthetic feasibility to form the Generated DataBases (GDB).  Exploring GDB in comparison to known molecules reveals that vast areas of chemical space are still entirely unknown yet are accessible for experimental exploration by straightforward synthetic methods. I will discuss how to visualize chemical space and exemplify the discovery and synthesis of new scaffolds for drug discovery.

9.50   Algorithms for Chemical Space Enumeration and Applications in Astrobiology [M Meringer, Germany]

From its very beginnings, the development of algorithms for enumerating chemical space was closely tied to NASA’s early exobiology activities. The complete and non-redundant generation of all connectivity isomers corresponding to a given molecular formula was part of the DENDRAL program, established in the mid-1960s. In the 1970s mathematicians provided new techniques that made these first approaches more efficient, and from the 1990s onwards implementations became available as software packages for ordinary computers. This talk reviews some of the most essential principles of exhaustive and non-redundant structure enumeration: starting from simple labeled graphs, we will see how to avoid isomorphic duplicates, how to introduce structural constraints, and how to generate molecular graphs efficiently. Recently these methods have been rediscovered for application in astrobiology and origins-of-life research, particularly for generating and analyzing virtual chemical compound libraries surrounding the most important biomolecules. Results concerning the genetically encoded amino acid alphabet, nucleotide analogs and the core of intermediary metabolism are summarized.
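As a rough illustration of what "exhaustive and non-redundant" means, the sketch below enumerates all simple graphs on a few labeled vertices and keeps one representative per isomorphism class by comparing a brute-force canonical form. The function names and the `max_degree` constraint (a crude stand-in for a valence limit) are assumptions made for this example; practical structure generators avoid the factorial-cost canonical form used here, so this is only workable for very small graphs.

```python
from itertools import combinations, permutations


def canonical_form(n, edges):
    """Smallest sorted edge tuple over all relabellings of the n vertices (brute force)."""
    best = None
    for perm in permutations(range(n)):
        relabelled = tuple(sorted(tuple(sorted((perm[u], perm[v]))) for u, v in edges))
        if best is None or relabelled < best:
            best = relabelled
    return best


def enumerate_graphs(n, max_degree=None):
    """Yield one canonical edge tuple per isomorphism class of simple graphs on n vertices.

    max_degree is a hypothetical structural constraint: graphs with any vertex
    of higher degree are skipped.
    """
    possible_edges = list(combinations(range(n), 2))
    seen = set()
    for k in range(len(possible_edges) + 1):
        for edges in combinations(possible_edges, k):
            if max_degree is not None:
                degree = [0] * n
                for u, v in edges:
                    degree[u] += 1
                    degree[v] += 1
                if max(degree) > max_degree:
                    continue
            form = canonical_form(n, edges)
            if form not in seen:  # isomorphic duplicates share a canonical form
                seen.add(form)
                yield form


if __name__ == "__main__":
    print(len(list(enumerate_graphs(4))))                # 11 graphs on 4 vertices
    print(len(list(enumerate_graphs(4, max_degree=2))))  # 7 of them have degree <= 2
```

Of the 64 labeled graphs on four vertices, only 11 are structurally distinct, which is the redundancy that isomorph-free generation is designed to remove.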

10.40   Coffee

11.00    Accelerating Graph Edit Distance Search by Chemical Space Enumeration [Dr John Mayfield or Dr Richard Gowers, NextMove Software, Cambridge]

The explosive growth of purchase-on-demand chemical databases, such as Enamine REAL and ZINC, is overwhelming traditional chemical database search technologies, whose run times scale linearly with database size. One approach to tackle this is to switch to Graph Edit Distance (GED) based chemical similarity, which can be used to efficiently find nearest neighbours whilst examining only a tiny fraction of a data set. The caveat is that these approaches trade disk space for run time and require pre-enumeration of large chemical spaces [1]. NextMove Software’s SmallWorld currently uses a pre-computed index of over 380 billion nodes and several trillion edges, requiring over 22 TB of disk space. This talk describes the many algorithmic and practical challenges of providing efficient chemical similarity search using such large indices (maps of chemical space).

[1] D.W. Williams, J. Huan and W. Wang, “Graph Database Indexing using Structured Graph Decomposition”, 23rd IEEE International Conference on Data Engineering, Istanbul, 2007.
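The idea of finding neighbours while touching only a small part of a pre-enumerated index can be sketched as a bounded breadth-first search. This is a generic illustration, not a description of how SmallWorld is implemented: the adjacency dictionary `TOY_INDEX`, its node labels and the function name are hypothetical, and the step count equals a true graph edit distance only if every intermediate graph is present in the index.

```python
from collections import deque

# Hypothetical pre-computed index: each node is a graph (here just a label),
# each edge links two graphs that differ by a single edit operation.
TOY_INDEX = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}


def neighbours_within(index, query, max_distance):
    """Return {node: number of edit steps} for nodes within max_distance of query."""
    distances = {query: 0}
    queue = deque([query])
    while queue:
        node = queue.popleft()
        if distances[node] == max_distance:
            continue  # do not expand beyond the search radius
        for nxt in index.get(node, ()):
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)
    return distances


if __name__ == "__main__":
    print(neighbours_within(TOY_INDEX, "A", 2))
    # {'A': 0, 'B': 1, 'C': 1, 'D': 2}
```

The work done depends on how many nodes lie within the search radius, not on the total size of the index, which is why the cost shifts from run time to the disk space needed to store the edges.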

11.40    Exploring Chemical Spaces Using Graph Transformations [Jakob L. Andersen, SDU, Denmark]

Modelling of chemical systems can be done at many different levels of detail. At one end of the spectrum is the very fine-grained level of quantum mechanics, which is computationally so demanding that large-scale modelling is practically impossible. At the other end we find vast chemical spaces in which the structure of molecules and the details of reactions are lost and mere network topology is left. At an intermediary level we can model molecules as graphs with vertex and edge attributes, and it then becomes natural to view each chemical reaction as a formal transformation of graphs. Going further, we can specify classes of reactions, as is common in chemistry, by modelling them as graph transformation rules. A chemical space can thus be specified implicitly by a graph grammar: a set of starting molecules and a set of rules for generating new molecules. In this talk we present this modelling paradigm and methods for automatically generating reaction networks. The explicit representation of atoms and their mapping through each reaction makes it possible, e.g., to get a computational handle on isotope labelling experiments. The methods are based on the Double Pushout (DPO) approach to graph transformation, which itself is based on category theory. By leveraging the compositionality of DPO rules and adapting the framework to chemistry, it is possible to introduce stereochemical information in the model while retaining practically efficient algorithms. We also present recent developments towards a rule-based stochastic simulation engine in which reactions are generated ad hoc, without requiring a full specification of the reaction network beforehand. An implementation of the graph transformation system is available in the MØD software package.

12.20-1.20 Lunch

Session II: Chemical Space and Drug Design

1.20    Using Artificial Intelligence to Optimise Small-Molecule Drug Design  [Nathan Brown, BenevolentAI]

The concept of in silico molecular design goes back decades, with a long history of published approaches using many different algorithms and models [1,2]. The major challenges in de novo molecular design are manifold, including identifying appropriate molecular representations for optimisation, scoring designed molecules against multiple modelled endpoints, and objectively quantifying the synthetic feasibility of the designed structures.

Multiobjective de novo design, now often referred to as generative chemistry, has recently seen a resurgence of interest. This renaissance has highlighted a step-change in successful applications of such methods. This presentation will review the development of de novo design methods over the years, from the author’s original work in this area in the early 2000s [3] to recent approaches that show great promise [4,5]. Through this review, improvements in important components of de novo design, including machine learning model predictions and automated synthesis planning, will also be presented.

[1] Nicolaou, C. A.; Brown, N.; Pattichis, C. S. Molecular optimization using computational multi-objective methods. Current Opinion in Drug Discovery and Development, 2007, 10(3), 316-324.
[2] Nicolaou, C. A.; Brown, N. Multi-objective optimization methods in drug design. Drug Discov. Today: Technol. 2013, 10(3), e427-e435.
[3] Brown, N.; McKay, B.; Gilardoni, F. A Graph-Based Genetic Algorithm and Its Application to the Multiobjective Evolution of Median Molecules. J. Chem. Inf. Comput. Sci. 2004, 44(3), 1079-1087.
[4] Neil, D.; Segler, M.; Guasch, L.; Ahmed, M.; Plumbley, D.; Sellwood, M.; Brown, N. Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design. 2018.
[5] Brown, N.; Fiscato, M.; Segler, M. H. S.; Vaucher, A. C. GuacaMol: Benchmarking Models for De Novo Molecular Design. J. Chem. Inf. Model. 2019, 59(3), 1096-1108.
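One common way to score designed molecules against several endpoints at once is to keep only the Pareto-optimal candidates, i.e. those that no other candidate beats on every objective. The abstract does not say which multi-objective scheme is used, so the sketch below is a generic illustration; the candidate names and score values are invented for the example, and higher scores are assumed to be better.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Return the sorted names of candidates that no other candidate dominates.

    candidates maps a (hypothetical) molecule name to a tuple of objective scores.
    """
    return sorted(
        name
        for name, score in candidates.items()
        if not any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
    )


if __name__ == "__main__":
    # Invented scores: (predicted potency, predicted synthetic feasibility).
    scores = {
        "mol_a": (0.9, 0.2),
        "mol_b": (0.5, 0.5),
        "mol_c": (0.4, 0.4),  # worse than mol_b on both objectives
        "mol_d": (0.1, 0.9),
    }
    print(pareto_front(scores))  # ['mol_a', 'mol_b', 'mol_d']
```

The surviving set represents the available trade-offs between objectives, without having to fix weights for them in advance.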

2.00    Learning from the Ligand [Garrett Morris, Department of Statistics, Oxford]

Protein-ligand docking is widely used in virtual screening to discover new inhibitors [1] and to predict the binding mode of a small molecule in a macromolecular target. Scoring functions play a key role in docking, attempting both to quantify the strength of binding and to predict native-like binding modes. Traditionally, scoring functions rely on three-dimensional interaction terms in their calculations, but occasionally include a term describing the ligand, such as the number of rotatable bonds [2,3]. I will present recent work [4] that shows how a rich set of quick-to-compute, ligand-based descriptors improves the ability of traditional structure-based scoring functions and newer machine learning models to predict protein-ligand binding affinities. I will also discuss the limitations of existing ensemble-based machine learning methods and highlight the value of ligand-based virtual screening.

[1] Ripphausen, P., D. Stumpfe, and J. Bajorath (2012). “Analysis of structure-based virtual screening studies and characterization of identified active compounds.” Future Medicinal Chemistry 4: 603-613.
[2] Morris, G. M., R. Huey, W. Lindstrom, M. F. Sanner, R. K. Belew, D. S. Goodsell, and A. J. Olson (2009). “AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility.” Journal of Computational Chemistry 30: 2785-2791.
[3] Morris, G. M., D. S. Goodsell, R. S. Halliday, R. Huey, W. E. Hart, R. K. Belew, and A. J. Olson (1998). “Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function.” Journal of Computational Chemistry 19: 1639-1662.
[4] Boyles, F., C. M. Deane, and G. M. Morris (2019). “Learning from the Ligand: Using Ligand-based Features to Improve Binding Affinity Prediction.” Bioinformatics btz665, 10.1093/bioinformatics/btz665.

2.40    Coffee Break

Session III: Autocatalysis and the Origins of Life

3.10    A History of Autocatalytic Sets [Wim Hordijk, Parmenides Foundation, Pullach, Germany]

The concept of autocatalytic sets was originally introduced by Stuart Kauffman in 1971. An autocatalytic set is a self-sustaining chemical reaction network in which all the molecules mutually catalyze each other’s formation from a basic food source. This notion is often seen as a “counterargument” against the dominant genetics-first view of the origin of life, focusing more on metabolism instead. However, it has taken several decades for this idea to really catch on. Thanks to theoretical as well as experimental progress in more recent research on autocatalytic sets, especially over the past 15 years, the concept now seems to be gaining significant interest and support. In this talk, a brief history of research on autocatalytic sets will be presented.

3.50    Small molecules driving autocatalysis in prokaryotic metabolism [Joana Xavier, Düsseldorf, Germany]

All life we know is autocatalytic: life needs previous life to originate, another phrasing of the chicken-and-egg conundrum. But then, how did life start? Biocatalysis is usually carried out by DNA-encoded enzymes, but this is another chicken-and-egg paradox, because enzymes are essential to synthesize and read DNA. What is often overlooked when dealing with cellular complexity is how frequently (~60%) enzymes are pockets for much simpler molecules that are then the de facto catalysts: cofactors. Cofactors, which are small organic molecules (known as vitamins) and oftentimes divalent metals, carry out catalysis in the enzyme active center and have been demonstrated to enhance reaction rates greatly without the need for the surrounding protein. In the first part of this talk I will present results from the exploration of the conserved essentiality of organic cofactor-biosynthesis routes in a variety of prokaryotes from both domains, bacteria and archaea. The results were clear: nine organic cofactors were found to be universally essential and conserved, and there is little redundancy in their metabolism. That work has since led many in the field of genome-scale metabolic modelling to incorporate cofactor biosynthesis in their formulations. Little did I know that when changing fields to the origin of life I would come face to face with the essentiality of cofactors again.

In the second part of this talk I will present recent results in which we modeled and analyzed the expansion of early biomolecular networks, represented as reflexively autocatalytic and food-generated sets (RAFs), from a manually curated and defined food set (FS) with geochemical evidence for its presence in the Archean Eon. Our starting reaction pool was a biosphere-level set of ~5997 reactions from prokaryotic anaerobes. We add organic cofactors to our FS, for they are orders of magnitude simpler than enzymes, have been synthesized abiotically in the laboratory and can catalyze biochemical reactions single-handedly. We show that ATP is not essential for a viable network to arise, but that another set of cofactors (more likely to have originated abiotically and early) is, and can expand the network to a size close to that of a modern real metabolic network. Finally, we uncovered RAFs for a species of bacteria (Moorella thermoacetica) and one of archaea (Methanococcus maripaludis) considered relics of early evolution, and show that they overlap in a highly connected, core primordial network.

4.30    Autocatalysis in Chemical Space [Søren Riis, Queen Mary University of London]

The theory of Reflexively Autocatalytic and Food-generated sets (RAFs), pioneered by Mike Steel in 2000, mathematically formalizes autocatalysis in a chemical reaction network (CRN). This is of great interest in the study of the origins of life. Here we investigate how to define the concepts necessary for using RAF theory in Chemical Space. Three components are needed: first, a formal definition of Chemical Space, in which each molecule is a node; second, reactions that operate on molecules/nodes in Chemical Space; and third, catalysis, in which molecules activate reactions. Given a set of food molecules and either predicted or user-given sets of reactions and catalysis, the RAF formalism can be applied. This was done in the simplest possible way for all components. We investigated the performance for three interesting cases: 1. An RNA world model, in which we succeeded in computationally replicating some famous laboratory results by von Kiedrowski. 2. A simple formose chemistry with 2 reactions, within which RAF sets were found. 3. An extension of this to 8 reactions and a richer chemistry.
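Applying the RAF formalism to a food set, reactions and catalysis assignments can be sketched with the reduction procedure commonly described in the RAF literature: compute which molecules are reachable from the food set, discard every reaction that lacks a reachable reactant or a reachable catalyst, and repeat until nothing changes. The abstract does not state which algorithm was used, so this is an assumption, and the molecule and reaction names in the toy network are invented for illustration.

```python
def closure(food, reactions):
    """Molecules producible from the food set using the given reactions (catalysts ignored)."""
    molecules = set(food)
    changed = True
    while changed:
        changed = False
        for reactants, products, _catalysts in reactions.values():
            if reactants <= molecules and not products <= molecules:
                molecules |= products
                changed = True
    return molecules


def max_raf(food, reactions):
    """Return the largest RAF subset of reactions (an empty dict if there is none).

    reactions maps a name to (reactants, products, catalysts), each a frozenset.
    """
    current = dict(reactions)
    while True:
        available = closure(food, current)
        kept = {
            name: r
            for name, r in current.items()
            if r[0] <= available and (r[2] & available)  # food-generated and catalysed
        }
        if len(kept) == len(current):
            return kept
        current = kept


if __name__ == "__main__":
    fs = frozenset
    food = {"a", "b"}
    toy = {
        "r1": (fs({"a", "b"}), fs({"ab"}), fs({"abb"})),
        "r2": (fs({"ab", "b"}), fs({"abb"}), fs({"ab"})),
        "r3": (fs({"a", "c"}), fs({"ac"}), fs({"ab"})),  # reactant c is never available
        "r4": (fs({"a"}), fs({"aa"}), fs({"zz"})),       # catalyst zz is never produced
    }
    print(sorted(max_raf(food, toy)))  # ['r1', 'r2']
```

In the toy network, r1 and r2 catalyse each other's products and are both built from food, so they form the RAF, while r3 and r4 are pruned in the first pass.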

4.50    General Discussion

We will collect questions from participants, but some are obvious: What are the potentials and limits of the present wave of Deep Learning (DL) and Big Data? What are the optimal DL architectures? Which data could be used to train DL models? Can some of these methods also be used to analyze or predict the dynamics of structures? How does DL deal with evolutionary correlations in the data? Can DL and explicit continuous Markov chain evolutionary modelling be combined? What are the analogues of dimensional reduction methods for directional data?

Pub – Royal Oak