Ben Birnbaum (@benbirnbaum)

Machine learning engineer. Former lead of the Machine Learning Team at @flatironhealth, now diving into drug discovery and computational chemistry.


Finally got around to reading this ICLR 2022 paper by Wenhao Gao, @rociomer3, and Connor W. Coley: arxiv.org/abs/2110.06389. I really liked this pragmatic yet principled approach to generative chemistry, and I highly recommend the full paper. Here’s my own summary:


Many ML approaches for generating analogs to molecules produce compounds that are not synthesizable. Naively, one could always filter the compounds produced by a generative algorithm by running a retrosynthetic planner and keeping only those for which a synthetic route is found.
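
To make the naive recipe concrete, here’s a minimal sketch of the generate-then-filter loop. The candidate generator and retrosynthetic planner are hypothetical callables, stand-ins for whatever model and planning tool you’d actually use:

```python
from typing import Callable, Iterable, Optional

# Minimal sketch of the naive "generate, then filter" approach. The
# candidates and the planner are hypothetical, not any specific tool.
def filter_synthesizable(
    candidates: Iterable[str],
    plan_route: Callable[[str], Optional[list]],
) -> list[tuple[str, list]]:
    kept = []
    for smiles in candidates:
        route = plan_route(smiles)  # one full retrosynthesis search per molecule
        if route is not None:
            kept.append((smiles, route))
    return kept
```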


But these planners are very slow, so this turns out not to be a practical fix.

The authors solve this problem by going bottom-up rather than top-down.


Instead of learning to generate molecules that optimize one or more properties and then filtering to the ones that can be synthesized, they learn to synthesize analogs of a molecule and then use an optimization algorithm (in this case a genetic algorithm) to improve those analogs.
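
Roughly, the outer loop looks like this sketch. The bit-vector genomes and the exact genetic operators here are illustrative, not necessarily the paper’s choices; the key point is that every individual is decoded into a synthesizable analog before scoring. `decode_to_analog` (the trained synthesis network) and `score` (the property oracle) are assumed helpers:

```python
import random

# Hedged sketch of the optimization loop: a population of fingerprint-like
# bit vectors is decoded into synthesizable analogs and evolved against a
# property oracle.
def genetic_optimize(seed_fps, decode_to_analog, score,
                     n_generations=10, population_size=64, mutation_rate=0.01):
    population = [list(fp) for fp in seed_fps]
    for _ in range(n_generations):
        ranked = sorted(population,
                        key=lambda fp: score(decode_to_analog(fp)),
                        reverse=True)
        parents = ranked[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))              # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutation_rate)  # bit-flip mutation
                     for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda fp: score(decode_to_analog(fp)))
```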


Their method is parameterized by which molecules can be used as building blocks and which reaction templates can be used to combine them. I love that these inputs are customizable, since in practice groups will often have their own sets that they want to use.
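
For example, with RDKit, building blocks are just SMILES strings and reaction templates are just reaction SMARTS (the amide-coupling template below is my own illustration, not one from the paper):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Building blocks are plain SMILES; reaction templates are reaction SMARTS.
building_blocks = [Chem.MolFromSmiles(s) for s in ("CC(=O)O", "CN")]
amide_coupling = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]"
)

for (product,) in amide_coupling.RunReactants(tuple(building_blocks)):
    Chem.SanitizeMol(product)
    print(Chem.MolToSmiles(product))  # CC(=O)NC
```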


The first step of their method is model training. They use the building blocks and reaction templates to randomly generate a bunch of synthetic plans and then train models to predict each step in those plans (e.g. which reaction template and/or building block will be used).
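
A hedged sketch of that data-generation step, assuming an `apply_template` helper that returns a product molecule or None when the template doesn’t apply:

```python
import random

# Sketch of generating one random synthetic plan for training data.
def random_plan(building_blocks, templates, apply_template, max_steps=4):
    mol = random.choice(building_blocks)
    steps = []
    for _ in range(max_steps):
        template = random.choice(templates)
        partner = random.choice(building_blocks)
        product = apply_template(template, mol, partner)
        if product is None:
            continue  # this template/partner pair doesn't react; try another
        steps.append((mol, template, partner, product))  # one training example
        mol = product
    return steps
```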


The trick is that the models have access not only to the inputs at each step but also to an embedding of the target molecule.
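
In PyTorch-style pseudocode, the conditioning looks something like this (the dimensions and the single template head are my own simplifications; the real method predicts several things, including building blocks):

```python
import torch
import torch.nn as nn

# Hedged sketch of the conditioning idea: each per-step model sees an
# embedding of the synthesis state so far *and* of the target molecule.
class StepPredictor(nn.Module):
    def __init__(self, fp_dim=2048, hidden=256, n_templates=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * fp_dim, hidden),   # [state ; target] concatenated
            nn.ReLU(),
            nn.Linear(hidden, n_templates),  # e.g. scores over reaction templates
        )

    def forward(self, state_fp, target_fp):
        return self.net(torch.cat([state_fp, target_fp], dim=-1))
```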


So, with a representation of what has been synthesized so far, as well as a representation of where the synthesis should go, the model has, at least in theory, what it needs to predict the next reaction step.


Once the model is trained, it can be run on a set of target compounds to find synthetic plans for those compounds. Sometimes the model will succeed, and sometimes it will fail.
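
Conceptually, decoding is a greedy rollout of the per-step model toward the target. In this sketch, `embed`, `templates`, `apply_step`, and the starting building block are all assumed helpers:

```python
# Hedged sketch of decoding: greedily roll out the trained model toward a
# target molecule, one reaction step at a time.
def decode_toward_target(model, start, target_fp, templates,
                         apply_step, embed, max_steps=4):
    state, route = start, []
    for _ in range(max_steps):
        scores = model(embed(state), target_fp)
        template = templates[int(scores.argmax())]
        new_state = apply_step(template, state)  # None if it can't apply
        if new_state is None:
            break
        route.append(template)
        state = new_state
    return state, route  # may be the target -- or a near miss ("failure")
```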


But this failure is actually a feature, not a bug. The compounds that are generated instead will tend to be analogs of the target, since they are close in embedding space, and they will be synthesizable by construction, since only the supplied building blocks and reaction templates were used.
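
A quick way to sanity-check that is to measure fingerprint similarity between the target and whatever the decoder actually produced. With RDKit (the aspirin example and its amide near-miss below are my own illustration):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Morgan-fingerprint Tanimoto similarity between two molecules."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, n_bits)
        for s in (smiles_a, smiles_b)
    ]
    return DataStructs.TanimotoSimilarity(*fps)

# Aspirin vs. a made-up amide near-miss a decoder might produce instead:
print(tanimoto("CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Oc1ccccc1C(=O)N"))
```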
