RFdiffusion and BindCraft are very good at the tasks they were trained for. RFdiffusion produces protein backbones that fold, that respect the geometry of a specified hotspot or motif, and that are recognized as realistic by independent structure predictors. BindCraft produces sequence-structure pairs that AlphaFold is confident will fold into a binder geometry with reasonable interface metrics. ProteinMPNN, conditioned on a target backbone, produces sequences that match the fold distribution it learned from the PDB.
None of those three loss functions penalize developability liabilities. That is the problem this post is about.
What Their Loss Functions Actually Optimize
A short honest read of what the generative stack is doing:
- RFdiffusion is a denoising diffusion model trained on backbone coordinates. Its objective rewards generating backbones that score well under RoseTTAFold and that sit in the distribution of real protein folds. It does not see sidechain identity. It does not see hydrophobicity. It does not see chemical stability.
- ProteinMPNN is an inverse-folding model trained with a sequence-recovery objective on the PDB. Given a backbone, it predicts sequences that the training set’s folds tended to carry. It optimizes the distribution of residues that fit a given geometry. It does not optimize for aggregation, deamidation, isomerization, or glycan placement.
- BindCraft, in the common configuration, wraps ProteinMPNN and AlphaFold2 in a binder-biased loop, scoring designs on predicted AlphaFold interface metrics (ipTM, pAE of the interface, designed-chain pLDDT). These are binding-relevant signals, not developability signals.
The result, when you take raw outputs to wet lab, is predictable. Most designs fold. A meaningful fraction bind. A separate and often overlapping fraction carry developability liabilities that will cause problems at expression, purification, concentration, or long-term storage. If the campaign is a first-pass research-grade effort and you are not carrying these binders forward, this may not matter. If the campaign is going to feed anything downstream (affinity maturation, humanization, preclinical scale-up), filtering the raw outputs is non-negotiable.
What Goes Wrong in Raw Outputs
Specific patterns we see repeatedly in raw RFdiffusion plus ProteinMPNN and raw BindCraft outputs when they are scanned for developability:
Hydrophobic interface residues bleeding into the exterior surface. The generative stack is trained to build paratopes that make strong contacts with the target. Large hydrophobics (Phe, Trp, Leu, Ile) at the interface are a feature, not a bug. The issue is that these residues often extend past the interface footprint, leaving part of the hydrophobic sidechain exposed on the solvent-facing surface of the design. This produces exactly the HIC retention and PSR polyspecificity signatures that predict downstream CMC trouble, as measured in the Jain et al. 2017 clinical-stage panel (Proc. Natl. Acad. Sci. USA 114:944-949).
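To make the surface-exposure point concrete, here is a minimal sketch of the kind of check involved, assuming Biopython's Shrake-Rupley SASA implementation is available and the design is a single-chain PDB file. The hydrophobic residue set and the 60 Å² cutoff are illustrative assumptions, not validated thresholds.

```python
# Minimal sketch: flag solvent-exposed hydrophobic residues in a design.
# Assumes Biopython >= 1.79 (Bio.PDB.SASA); the residue set and the 60 A^2
# cutoff are illustrative choices, not validated thresholds.
from Bio.PDB import PDBParser
from Bio.PDB.SASA import ShrakeRupley

HYDROPHOBIC = {"PHE", "TRP", "LEU", "ILE", "MET", "VAL"}

def exposed_hydrophobics(pdb_path, chain_id="A", sasa_cutoff=60.0):
    structure = PDBParser(QUIET=True).get_structure("design", pdb_path)
    ShrakeRupley().compute(structure, level="R")  # per-residue SASA
    flagged = []
    for res in structure[0][chain_id]:
        if res.get_resname() in HYDROPHOBIC and res.sasa > sasa_cutoff:
            flagged.append((res.get_id()[1], res.get_resname(), round(res.sasa, 1)))
    return flagged  # (residue number, residue name, SASA in A^2)
```

Note that this flags residues one at a time; dedicated metrics like SAP score spatial clusters of exposed hydrophobics, which is a better proxy for the HIC and PSR behavior described above.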
Free cysteines. ProteinMPNN assigns cysteine at a frequency close to the PDB background rate. In real proteins, most cysteines are in structural disulfides or buried. In raw de novo designs, cysteines often appear on the surface or in unpaired positions, where they drive oxidation, intermolecular disulfide scrambling, and aggregation during purification. This is one of the most common reasons a promising de novo design fails to express cleanly.
N-glycosylation sequons. The N-X-S/T sequon (where X is any residue except proline) occurs purely by chance at some frequency in any ProteinMPNN sequence. If the sequon falls on the surface of the design and the design is expressed in a eukaryotic host, you get variable-occupancy glycosylation that produces a heterogeneous product. This is a particular risk for teams that go directly from a ProteinMPNN output to a mammalian expression test, or that display the design on yeast (yeast N-glycosylation differs from human and can cause additional problems downstream).
AlphaFold-confident but aggregation-prone designs. High AlphaFold pLDDT and high ipTM do not imply the design is soluble. pLDDT measures the predictor’s confidence in the fold. Aggregation is a thermodynamic property of the ensemble in solution, often driven by short aggregation-prone regions that the structure predictor confidently places in a buried or semi-buried position in the static model, but that become accessible during the folding pathway or in partially unfolded states. This is orthogonal to what AlphaFold measures, and the Sormanni and Vendruscolo CamSol work has explicitly separated these two signals.
Deamidation and isomerization motifs in exposed loops. NG, NS, DG appear at their background frequency in raw designs. When they land in a flexible exposed loop (common in BindCraft paratope regions), they are on the fast end of deamidation and isomerization kinetics and will accumulate damage in storage.
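The free-cysteine, sequon, and deamidation/isomerization checks above are all sequence-level and cheap to run. Here is a minimal sketch using only the standard library, with a motif set matching the one in the workflow below. It flags candidate positions but cannot distinguish buried from exposed sites; that judgment still requires the structural model.

```python
# Minimal sketch: sequence-level liability flags for a designed binder.
# Pure-regex pass; it cannot tell buried from exposed positions, so treat
# hits as candidates to inspect against the model, not automatic rejections.
import re

LIABILITY_PATTERNS = {
    "n_glyc_sequon": r"N[^P][ST]",   # N-X-S/T, X != Pro
    "deamidation":   r"N[GSTN]",     # NG/NS/NT/NN hotspots
    "isomerization": r"D[GSTH]",     # DG/DS/DT/DH hotspots
    "free_cysteine": r"C",           # every Cys needs a pairing story
}

def liability_flags(seq):
    seq = seq.upper()
    flags = {}
    for name, pattern in LIABILITY_PATTERNS.items():
        # lookahead so overlapping motifs are all reported; positions are 1-based
        hits = [m.start() + 1 for m in re.finditer(f"(?={pattern})", seq)]
        if hits:
            flags[name] = hits
    return flags

# Example: a short stretch carrying a sequon, a deamidation motif, and a free Cys
print(liability_flags("MKTAYIAKQRNGSQWERTCVNLS"))
```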
None of these failure modes are model bugs. They are consequences of optimizing the wrong objective relative to the shipping criterion.
Why This Matters More for De Novo Than for Humanized Antibodies
Humanized antibodies carry the evolutionary pressure of the original immune repertoire plus the human germline frameworks they were grafted onto. That pressure is not a developability filter per se, but the sequences involved have been through selection for expression, solubility, and chemical stability at some level (in vivo selection against catastrophically bad sequences, germline conservation across evolutionary time, iterative humanization engineering in industry practice). The result is that humanized antibodies tend to arrive at discovery with a baseline developability profile that is imperfect but not catastrophic.
De novo designs carry no such pressure. The training signal is structural and binding-related, not evolutionary. The generative distribution contains sequences that look nothing like anything that has ever had to be made in a cell and handled at high concentration. Some of those sequences are fine. Some of them are much worse than anything you would find in a naive mammalian immune repertoire. You cannot tell the difference by looking at the design, the AlphaFold ipTM, or the BindCraft score.
The practical conclusion is that the developability filter is a harder requirement for de novo campaigns than it is for humanized programs. The de novo space has no inbuilt floor. Filtering is the floor.
A Concrete Triage Workflow
What we recommend, for any RFdiffusion or BindCraft campaign that is going to feed anything downstream of first-pass research screening:
- Generate your N backbone candidates with RFdiffusion, constrained against your target and your chosen hotspot. Epitope selection upstream of this step matters more than any downstream filter, but that is a topic for a separate post.
- Design sequences for the backbones with ProteinMPNN (for RFdiffusion outputs), or use the BindCraft sequence-plus-structure pipeline directly.
- Run self-consistency filtering first. Fold each design with an independent predictor (ESMFold, ColabFold, Boltz-2) and keep only designs whose predicted structure matches the intended backbone to within a reasonable RMSD (a minimal version of this check is sketched after this list). This single step removes 50 to 80 percent of raw output and is the cheapest quality gate you can run. It is a binding-relevance filter, not a developability one, but it has to come first.
- Run a developability scan on the self-consistency survivors. Specifically, check: exposed hydrophobic surface area, aggregation-prone regions (APR scores from Aggrescan, CamSol profile, or SAP), N-X-S/T glycosylation sequons, free cysteines, NG/NS/NT/DG/DS/DH motifs in exposed loops, and overall charge distribution. Flag or remove the worst offenders.
- Rank the clean survivors by your actual binding-relevance metric (interface score, ipTM, binder pLDDT, whichever is primary for your campaign). Take the top K forward to synthesis and to display screening.
- At the display step, the wet-lab selection adds another filter layer. Yeast display screening naturally selects against the worst expression and folding failures, so developability problems that a sequence scan missed often still get filtered out experimentally. That is useful, but it is a much more expensive filter than the in silico step. Use the in silico filter first, then let display screening confirm.
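For the self-consistency step above, the core computation is a superposition-then-RMSD comparison between the designed backbone and the re-predicted structure. A minimal sketch, assuming the CA coordinates of both have already been extracted as matching N x 3 arrays, and using a 2.0 Å cutoff, which is a common scRMSD choice rather than a universal one:

```python
# Minimal sketch of the self-consistency gate: superpose the re-predicted
# CA trace onto the designed backbone (Kabsch) and threshold the RMSD.
# Assumes both inputs are (N, 3) numpy arrays of CA coordinates in the same
# residue order; the 2.0 A cutoff is a common scRMSD choice, not a universal one.
import numpy as np

def ca_rmsd(designed, predicted):
    P = designed - designed.mean(axis=0)
    Q = predicted - predicted.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)           # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation of P onto Q
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum(axis=1).mean()))

def passes_self_consistency(designed, predicted, cutoff=2.0):
    return ca_rmsd(designed, predicted) <= cutoff
```

In practice, most of the effort sits upstream of this function: extracting the CA traces, reconciling chain IDs and residue numbering between the design and the prediction, and deciding whether to compute the RMSD over the whole design or the binder chain only.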
The specific thresholds at each step depend on the campaign. For a research-grade tool binder, loose thresholds are fine. For anything that is going to be a lead for a therapeutic program, the thresholds tighten substantially. The Ranomics AI Binder Sprint pipeline runs this filter by default with thresholds tuned for the therapeutic track.
What You Get for This Filter
Concretely, running this in silico developability filter before synthesis tends to improve three numbers in a campaign.
Wet-lab hit rate goes up. Self-consistency plus developability filtering raises the fraction of synthesized designs that actually express and display. This is not because every developability liability kills expression outright, but because the filter correlates with “designs that look more like real proteins,” and those designs more reliably fold, express, and reach the display surface.
Synthesis budget goes down. If the filter removes 40 percent of an already self-consistency-filtered pool, you cut the gene synthesis cost of that order by 40 percent. For academic and seed-stage campaigns, this is a material saving. Gene synthesis is usually the largest line item in a binder campaign, larger than GPU compute.
Downstream campaigns inherit cleaner leads. The hits that come back from screening have fewer sequence liabilities, which means fewer red flags during the subsequent affinity maturation or humanization step. Less re-engineering, fewer points of potential regression.
A Note on What This Does Not Do
The filter does not replace experimental developability confirmation. HIC, DSF, DLS, SEC, thermal stability, and accelerated stability studies are real measurements, and the correlations from in silico scores to wet-lab behavior are strong but imperfect. What the filter does is triage. It removes the designs that are visibly likely to fail before you commit synthesis budget, so that your experimental budget goes to the designs that have a reasonable chance of clearing a full developability panel.
The filter also does not tell you whether the binder will work against your target. Binding is a separate question, tested experimentally by display screening, SPR, BLI, or a cell-based assay. The filter is necessary, not sufficient.
Where to Start
For DIY campaigns running RFdiffusion, ProteinMPNN, BindCraft, or any hybrid of the three, the fastest way to add a developability filter to your existing pipeline is to pipe the ProteinMPNN or BindCraft sequence output through Developability Scout. It is free, browser-based, and produces a scorecard per sequence. It slots in between step 3 (self-consistency) and step 5 (synthesis order) in the workflow above.
For teams that want the full campaign run end to end with developability filtering integrated into the pipeline by default, that is what the AI Binder Sprint is scoped to deliver. The Sprint includes RFdiffusion plus BindCraft plus ProteinMPNN in parallel, self-consistency filtering, developability filtering, synthesis of the filtered pool, yeast display selection, NGS hit calling, and validation of the top hits. Developability is a gate in that pipeline, not an afterthought.
For grant-scale single-target campaigns where the output is a ranked NGS hit list and the team is running synthesis and display in-house, the Binder Pilot program is the scoped-down version. Developability Scout is the free upstream piece for either path.
The one recommendation that applies to all three paths: do not take raw RFdiffusion or BindCraft outputs to wet lab. The filter is cheap, and skipping it is the most expensive way to learn what their loss functions do not cover.
Related Ranomics services
- Developability Scout: Free developability scan for de novo designs and antibody sequences before synthesis.
- AI Binder Sprint: Full de novo binder campaign with developability filtering and yeast display selection integrated.