Ben Birnbaum

@benbirnbaum

Machine learning engineer. Former lead of the Machine Learning Team at @flatironhealth, now diving into drug discovery and computational chemistry.

Joined: 16-02-2009 22:41:53

319 Tweets

485 Followers

937 Following

Ben Birnbaum (@benbirnbaum)'s Twitter Profile Photo

Their method is parameterized by which molecules can be used as building blocks and which reaction templates can be used for combining molecules. I love that these inputs are customizable, since in practice groups often have their own building blocks and reaction templates that they want to use.
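Concretely, the two inputs can be thought of as plain data. Here's a minimal Python sketch; the molecule strings and template names are illustrative stand-ins, not the paper's actual inputs (which use SMILES strings and SMARTS reaction patterns):

```python
# Building blocks: purchasable starting molecules (as SMILES strings).
building_blocks = [
    "CCO",        # ethanol
    "CC(=O)O",    # acetic acid
    "c1ccccc1N",  # aniline
]

# Reaction templates: rules for combining molecules. Here each template
# is just a name plus the number of reactants it takes; a real template
# would be a SMARTS reaction pattern.
reaction_templates = [
    {"name": "esterification", "n_reactants": 2},
    {"name": "amide_coupling", "n_reactants": 2},
    {"name": "reduction", "n_reactants": 1},
]

def compatible_templates(n_available):
    """Return the templates usable with the given number of reactants."""
    return [t for t in reaction_templates if t["n_reactants"] <= n_available]
```

Because both lists are ordinary inputs, a group can swap in its own catalog and reaction set without touching the rest of the pipeline.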

The first step of their method is model training. They use the building blocks and reaction templates to randomly generate a bunch of synthetic plans and then train models to predict each step in those plans (e.g. which reaction template and/or building block will be used).
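A toy version of that data-generation step, with made-up building blocks and template names standing in for the real SMILES/SMARTS inputs (this is a sketch of the idea, not the paper's code):

```python
import random

building_blocks = ["CCO", "CC(=O)O", "c1ccccc1N"]
templates = ["esterification", "amide_coupling", "reduction"]

def random_plan(max_steps=3, rng=random):
    """Generate one random synthetic plan as a list of steps."""
    plan = [("start", rng.choice(building_blocks))]
    for _ in range(rng.randint(1, max_steps)):
        plan.append(("react", rng.choice(templates), rng.choice(building_blocks)))
    return plan

def training_pairs(plan):
    """Turn a plan into (state-so-far, next-action) supervision pairs."""
    return [(tuple(plan[:i]), plan[i]) for i in range(len(plan))]

rng = random.Random(0)          # seeded for reproducibility
plan = random_plan(rng=rng)
pairs = training_pairs(plan)    # what the step-prediction models train on
```

Each pair is "given the partial synthesis so far, predict the next action" — exactly the supervision the step-prediction models need.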

The trick is that the models have access not only to the inputs at each step but also to an embedding of the target molecule.

So, with a representation of what has been synthesized so far, as well as a representation of where the synthesis should go, the model has, at least in theory, what it needs to predict the next reaction step.
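In sketch form, assuming a toy embedding function and a toy linear scorer (nothing here is the paper's actual model — just an illustration of conditioning on both the state and the target):

```python
import math

def embed(smiles, dim=4):
    """Toy deterministic embedding: hash characters into a unit vector."""
    v = [0.0] * dim
    for i, ch in enumerate(smiles):
        v[i % dim] += ord(ch)
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def score_action(state_smiles, target_smiles, action_vec):
    """Score an action by how well it moves the state toward the target."""
    s, t = embed(state_smiles), embed(target_smiles)
    # The policy input combines the state embedding and the target embedding.
    return sum((ti - si) * ai for si, ti, ai in zip(s, t, action_vec))

def pick_next(state, target, actions):
    """Greedy choice among candidate actions (each a (name, vector) pair)."""
    return max(actions, key=lambda a: score_action(state, target, a[1]))

best = pick_next("C", "CCO", [("add_O", [1.0, 0, 0, 0]),
                              ("add_N", [0, 1.0, 0, 0])])
```

The key point is the signature of `score_action`: it sees both where the synthesis is and where it should end up.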

Once the model is trained, it can be run on a set of target compounds to find synthetic plans for those compounds. Sometimes the model will succeed, and sometimes it will fail.
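A toy decoding loop that captures the succeed-or-stop-nearby behavior (the policy and molecule chain here are made up for illustration):

```python
def decode(target, policy, max_steps=5):
    """Greedily apply policy steps; report whether the target was reached."""
    state = None
    for _ in range(max_steps):
        state = policy(state, target)
        if state == target:
            return state, True    # exact synthesis plan found
    return state, False           # stopped short: a nearby, synthesizable analog

# Toy policy: walk through a fixed chain of intermediates.
chain = ["CCO", "CCOC(C)=O", "CCOC(C)=O.O"]
def toy_policy(state, target):
    if state is None:
        return chain[0]
    i = chain.index(state)
    return chain[min(i + 1, len(chain) - 1)]

hit, ok = decode("CCOC(C)=O", toy_policy)    # target is reachable
miss, ok2 = decode("c1ccccc1", toy_policy)   # target is not; analog returned
```

Either way the decoder emits a molecule it knows how to make — which is exactly why the "failure" case is useful, as the next tweet explains.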

But this failure is actually a feature, not a bug. The compounds that are generated instead will tend to be analogs of the target, since they are close to it in embedding space, and they will also be synthesizable, since only the supplied building blocks and reaction templates were used.

The final step is to layer in an optimization algorithm such as a genetic algorithm. The procedure above is used to generate a bunch of analogs, which are then scored according to whatever metrics are of interest (e.g. docking, ML property prediction, MPO, etc.).

Each analog is represented by its embedding, and new embeddings are created by mating and mutating the embeddings of the best molecules generated. These new embeddings then guide the synthesis of new molecules, and the whole process repeats until convergence.
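A minimal sketch of one GA generation over embeddings, with uniform crossover and Gaussian mutation as plausible stand-ins for the actual operators (the scoring function here is a toy):

```python
import random

def mate(a, b, rng):
    """Uniform crossover: each coordinate comes from one parent."""
    return [ai if rng.random() < 0.5 else bi for ai, bi in zip(a, b)]

def mutate(v, rng, scale=0.1):
    """Add small Gaussian noise to each coordinate."""
    return [x + rng.gauss(0.0, scale) for x in v]

def evolve(population, score, rng, n_parents=2):
    """One GA generation: keep the best embeddings, breed replacements."""
    ranked = sorted(population, key=score, reverse=True)
    parents = ranked[:n_parents]
    children = [
        mutate(mate(rng.choice(parents), rng.choice(parents), rng), rng)
        for _ in range(len(population) - n_parents)
    ]
    return parents + children

rng = random.Random(0)
pop = [[rng.random() for _ in range(4)] for _ in range(6)]
new_pop = evolve(pop, score=sum, rng=rng)  # score=sum is a toy objective
```

In the real loop, `score` would be the docking/property/MPO metric, and each new embedding would be handed back to the decoder to synthesize the molecule it encodes.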

The results are competitive with the state of the art, with the very important distinction that they are always for molecules that can actually be synthesized.

Overall this is a really clever idea, and one that feels very customizable and extensible. Here are links to the full paper (again) and the GitHub repo:
arxiv.org/abs/2110.06389
github.com/wenhao-gao/Syn…
