Ben Birnbaum (@benbirnbaum)

Machine learning engineer. Former lead of the Machine Learning Team at @flatironhealth, now diving into drug discovery and computational chemistry.


Finally got around to reading this ICLR 2022 paper by Wenhao Gao, @rociomer3, and Connor W. Coley: arxiv.org/abs/2110.06389. I really liked this pragmatic yet principled approach to generative chemistry, and I highly recommend the full paper. Here’s my own summary:


Many ML approaches for generating analogs to molecules produce compounds that are not synthesizable. Naively, one could always filter the compounds produced by a generative algorithm by running a retrosynthetic planner and keeping only those for which a synthetic route is found.
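
To make the naive recipe concrete, here’s a minimal sketch of the generate-then-filter loop. The candidate generator and retrosynthetic planner are hypothetical callables, stand-ins for whatever model and planning tool you’d actually use:

```python
from typing import Callable, Iterable, Optional

# Minimal sketch of the naive "generate, then filter" approach. The
# candidates and the planner are hypothetical, not any specific tool.
def filter_synthesizable(
    candidates: Iterable[str],
    plan_route: Callable[[str], Optional[list]],
) -> list[tuple[str, list]]:
    kept = []
    for smiles in candidates:
        route = plan_route(smiles)  # one full retrosynthesis search per molecule
        if route is not None:
            kept.append((smiles, route))
    return kept
```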


But these planners are very slow, so this turns out not to be a practical fix.

The authors solve this problem by going bottom-up rather than top-down.


Instead of learning to generate molecules that optimize one or more properties and then filtering to the ones that can be synthesized, they learn to synthesize analogs of a molecule and then use an optimization algorithm (in this case a genetic algorithm) to improve those analogs.
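
Roughly, the outer loop looks like this sketch. The bit-vector genomes and the exact genetic operators here are illustrative, not necessarily the paper’s choices; the key point is that every individual is decoded into a synthesizable analog before scoring. `decode_to_analog` (the trained synthesis network) and `score` (the property oracle) are assumed helpers:

```python
import random

# Hedged sketch of the optimization loop: a population of fingerprint-like
# bit vectors is decoded into synthesizable analogs and evolved against a
# property oracle.
def genetic_optimize(seed_fps, decode_to_analog, score,
                     n_generations=10, population_size=64, mutation_rate=0.01):
    population = [list(fp) for fp in seed_fps]
    for _ in range(n_generations):
        ranked = sorted(population,
                        key=lambda fp: score(decode_to_analog(fp)),
                        reverse=True)
        parents = ranked[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))              # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (random.random() < mutation_rate)  # bit-flip mutation
                     for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda fp: score(decode_to_analog(fp)))
```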


Their method is parameterized by which molecules can be used as building blocks and which reaction templates can be used to combine them. I love that these inputs are customizable, since in practice groups will often have their own sets that they want to use.
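
For example, with RDKit, building blocks are just SMILES strings and reaction templates are just reaction SMARTS (the amide-coupling template below is my own illustration, not one from the paper):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Building blocks are plain SMILES; reaction templates are reaction SMARTS.
building_blocks = [Chem.MolFromSmiles(s) for s in ("CC(=O)O", "CN")]
amide_coupling = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OH].[N;H2:3]>>[C:1](=[O:2])[N:3]"
)

for (product,) in amide_coupling.RunReactants(tuple(building_blocks)):
    Chem.SanitizeMol(product)
    print(Chem.MolToSmiles(product))  # CC(=O)NC
```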


The first step of their method is model training. They use the building blocks and reaction templates to randomly generate a bunch of synthetic plans and then train models to predict each step in those plans (e.g. which reaction template and/or building block will be used).
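
A hedged sketch of that data-generation step, assuming an `apply_template` helper that returns a product molecule or None when the template doesn’t apply:

```python
import random

# Sketch of generating one random synthetic plan for training data.
def random_plan(building_blocks, templates, apply_template, max_steps=4):
    mol = random.choice(building_blocks)
    steps = []
    for _ in range(max_steps):
        template = random.choice(templates)
        partner = random.choice(building_blocks)
        product = apply_template(template, mol, partner)
        if product is None:
            continue  # this template/partner pair doesn't react; try another
        steps.append((mol, template, partner, product))  # one training example
        mol = product
    return steps
```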


The trick is that the models have access not only to the inputs at each step but also to an embedding of the target molecule.
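
In PyTorch-style pseudocode, the conditioning looks something like this (the dimensions and the single template head are my own simplifications; the real method predicts several things, including building blocks):

```python
import torch
import torch.nn as nn

# Hedged sketch of the conditioning idea: each per-step model sees an
# embedding of the synthesis state so far *and* of the target molecule.
class StepPredictor(nn.Module):
    def __init__(self, fp_dim=2048, hidden=256, n_templates=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * fp_dim, hidden),   # [state ; target] concatenated
            nn.ReLU(),
            nn.Linear(hidden, n_templates),  # e.g. scores over reaction templates
        )

    def forward(self, state_fp, target_fp):
        return self.net(torch.cat([state_fp, target_fp], dim=-1))
```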


So, with a representation of what has been synthesized so far, as well as a representation of where the synthesis should go, the model has, at least in theory, what it needs to predict the next reaction step.


Once the model is trained, it can be run on a set of target compounds to find synthetic plans for those compounds. Sometimes the model will succeed, and sometimes it will fail.
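
Conceptually, decoding is a greedy rollout of the per-step model toward the target. In this sketch, `embed`, `templates`, `apply_step`, and the starting building block are all assumed helpers:

```python
# Hedged sketch of decoding: greedily roll out the trained model toward a
# target molecule, one reaction step at a time.
def decode_toward_target(model, start, target_fp, templates,
                         apply_step, embed, max_steps=4):
    state, route = start, []
    for _ in range(max_steps):
        scores = model(embed(state), target_fp)
        template = templates[int(scores.argmax())]
        new_state = apply_step(template, state)  # None if it can't apply
        if new_state is None:
            break
        route.append(template)
        state = new_state
    return state, route  # may be the target -- or a near miss ("failure")
```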


But this failure is actually a feature, not a bug. The compounds that are generated instead will tend to be analogs of the target, since they are close in embedding space, and they will be synthesizable by construction, since only the supplied building blocks and reaction templates were used.
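
A quick way to sanity-check that is to measure fingerprint similarity between the target and whatever the decoder actually produced. With RDKit (the aspirin example and its amide near-miss below are my own illustration):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a, smiles_b, radius=2, n_bits=2048):
    """Morgan-fingerprint Tanimoto similarity between two molecules."""
    fps = [
        AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), radius, n_bits)
        for s in (smiles_a, smiles_b)
    ]
    return DataStructs.TanimotoSimilarity(*fps)

# Aspirin vs. a made-up amide near-miss a decoder might produce instead:
print(tanimoto("CC(=O)Oc1ccccc1C(=O)O", "CC(=O)Oc1ccccc1C(=O)N"))
```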
