The headline numbers in AI protein design papers — “90% of designs bind!” — are correct in their context and misleading out of it. In production binder design campaigns, the wet-lab hit rate for raw RFdiffusion or BindCraft output is a small fraction of the published rate. Aggressive developability filtering closes much of the gap. Hit rates approaching the method-paper headlines are exceptional and usually mean the target was particularly cooperative. Here are the failure modes that account for the gap, and the filters that close it.
What the published hit rates measure
When a method paper reports “90% binding,” it typically means: of designs that passed multiple curation steps internal to the method, 90% bound the target in a screening assay calibrated to that method. The denominator excludes:
- Designs that didn’t pass AlphaFold2 confidence filters
- Designs with low pLDDT or pTM scores
- Designs the authors discarded for obvious structural issues
- Designs that failed expression and were therefore never tested
- Designs that bound, but at off-target sites the method-development screen didn’t detect
The published rate is a useful method-development metric. It is not a project-planning metric. The number a customer cares about — designs ordered, fraction that produced a usable binder — is what we mean by “wet-lab hit rate” in this article.
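The difference between the two denominators is easiest to see as arithmetic. The sketch below uses invented, purely illustrative counts (they are not campaign data), and the `Funnel` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Funnel:
    """Counts at each stage of a design campaign (illustrative only)."""
    ordered: int          # designs ordered for synthesis
    passed_curation: int  # designs that survived in-method filters and manual triage
    tested: int           # curated designs that expressed and reached the binding assay
    bound: int            # tested designs scored as binders
    usable: int           # binders that also cleared functional and developability checks

    def published_rate(self) -> float:
        """Method-paper style rate: binders over the curated, expressed subset."""
        return self.bound / self.tested

    def wet_lab_rate(self) -> float:
        """Project-planning rate: usable binders over everything ordered."""
        return self.usable / self.ordered


# Hypothetical numbers chosen only to show how the two rates diverge.
example = Funnel(ordered=1000, passed_curation=200, tested=100, bound=90, usable=30)
print(f"published-style rate: {example.published_rate():.0%}")  # 90%
print(f"wet-lab hit rate:     {example.wet_lab_rate():.0%}")    # 3%
```

The same campaign yields both numbers; only the denominator and the definition of a hit change.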
Failure mode 1: aggregation and expression failure
The single largest failure category in our campaigns is designs that fail to express in soluble form. They aggregate during yeast or mammalian expression, or they fail to reach the cell surface in display experiments, or they form inclusion bodies in E. coli-based downstream production. The structural model the AI generated may be plausible, but the protein doesn’t fold to that model in cells.
Why it happens: AI design models are trained on PDB structures, which are by definition proteins that were successfully expressed, purified, and structurally characterized. The training set has selection bias. Designs that resemble the failure cases (proteins that never made it to the PDB) are not penalized by the model because the model never saw them.
How to filter: predicted aggregation propensity (TANGO, AGGRESCAN, or developability-aware variants), surface hydrophobicity scores, and net surface charge. Any design with a hydrophobic patch larger than ~400 Ų should be flagged. We documented the specific filters at length in “RFdiffusion outputs need a developability check.”
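A pre-screen along these lines might look like the sketch below. It assumes the largest hydrophobic patch area has already been computed by whatever structure-based tool you use and is passed in as a number. The net charge is a crude whole-sequence estimate standing in for a true surface charge, and the `max_abs_charge` cutoff is a placeholder assumption; only the ~400 Ų patch threshold comes from the text.

```python
HYDROPHOBIC_PATCH_LIMIT_A2 = 400.0  # from the text: flag patches above ~400 square angstroms


def approx_net_charge(sequence: str) -> int:
    """Crude net charge near neutral pH: +1 per K/R, -1 per D/E.

    Histidine and the termini are ignored, and buried residues are counted,
    so this is only a rough proxy for net surface charge.
    """
    seq = sequence.upper()
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def developability_flags(sequence: str, largest_patch_a2: float,
                         max_abs_charge: int = 10) -> list[str]:
    """Return a list of reasons to flag a design; an empty list means no flags.

    `largest_patch_a2` is assumed to be precomputed from the design model.
    `max_abs_charge` is an illustrative cutoff, not a value from the article.
    """
    flags = []
    if largest_patch_a2 > HYDROPHOBIC_PATCH_LIMIT_A2:
        flags.append("hydrophobic_patch")
    if abs(approx_net_charge(sequence)) > max_abs_charge:
        flags.append("extreme_net_charge")
    return flags
```

Returning reasons rather than a bare pass/fail keeps the record of why a design was dropped, which is the information a later design round needs.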
Failure mode 2: low affinity (Kd > 1 µM)
A design that “binds” the target on AlphaFold2 confidence metrics may bind with Kd in the high micromolar range. This passes binary “does it bind” assays but fails any quantitative selection. The failure isn’t in the structural model; it’s that the model’s scoring function is correlated with affinity but not equivalent to it. A geometrically perfect interface can still have insufficient enthalpic and entropic optimization.
Why it happens: AI design optimizes for “designability” — the probability that a sequence folds to a given backbone — and “binding mode plausibility.” It does not directly optimize for affinity. Some methods incorporate affinity-correlated proxies, but the proxy is imperfect.
How to filter: dock-and-rescore approaches (Rosetta interface analysis, AlphaFold2-multimer with confidence-on-docked-pose), interface buried surface area, and shape complementarity. Designs with buried surface area below ~600 Ų rarely produce sub-µM affinity binders.
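The buried-surface-area cutoff can be applied as simple arithmetic once solvent-accessible surface areas (SASA) are available. The sketch below assumes the SASA values were computed elsewhere, and it uses the convention that sums the area buried on both partners. The text does not say which convention the ~600 Ų figure refers to (some tools report half this value), so check that before reusing the threshold.

```python
def buried_surface_area(sasa_binder: float, sasa_target: float,
                        sasa_complex: float) -> float:
    """Total surface buried on complex formation, summed over both partners.

    All inputs are SASA values in square angstroms: each partner computed
    in isolation, and the complex computed as a whole.
    """
    return sasa_binder + sasa_target - sasa_complex


def passes_interface_filter(sasa_binder: float, sasa_target: float,
                            sasa_complex: float, min_bsa: float = 600.0) -> bool:
    """True if the buried area meets the minimum; 600 is the text's rough cutoff."""
    return buried_surface_area(sasa_binder, sasa_target, sasa_complex) >= min_bsa


# Hypothetical SASA values, for illustration only.
print(buried_surface_area(5000.0, 12000.0, 16000.0))      # 1000.0
print(passes_interface_filter(5000.0, 12000.0, 16600.0))  # False (only 400 buried)
```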
Failure mode 3: off-target binding (polyspecificity)
Designs that bind the intended target also bind everything else: serum albumin, off-target cell-surface receptors, plate plastic. This appears as “non-specific binding” in flow cytometry and as polyreactivity in PSR (polyspecificity reagent) assays. For therapeutic applications, polyspecificity is a developability dealbreaker.
Why it happens: hydrophobic patches and electrostatic asymmetries that drive on-target affinity also drive off-target binding. The design model sees the on-target benefit; it doesn’t penalize the off-target cost.
How to filter: PSR scoring during candidate triage, baculovirus particle (BVP) ELISA before commitment to soluble expression, and aggressive negative-selection rounds during yeast or mammalian display screening. We’ve documented the developability red flags in detail at “5 Developability Red Flags.”
Failure mode 4: binding the wrong epitope
The design binds the target — at a site other than the one specified by the hotspot conditioning. This is a partial success: you have a binder, but it doesn’t do what the customer needs (block a specific receptor interaction, neutralize a specific function). The wet-lab assay reports “yes, binds.” The functional assay reports “no effect.”
Why it happens: hotspot conditioning is a soft constraint, not a hard one. The model can satisfy the hotspot condition partially while finding a stronger binding mode elsewhere on the target. Diffusion models that “almost” hit the right epitope are surprisingly common output.
How to filter: AlphaFold2-multimer redocking with hotspot-residue contact monitoring, competition assays during display screening (titrate against a known epitope-blocker), and structural biology validation of leads (HDX-MS, cryo-EM if available, or competition with structurally characterized binders).
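Hotspot-residue contact monitoring on a redocked model reduces to a distance check. The sketch below assumes atom coordinates have already been extracted from the predicted complex into plain Python structures. The function name, the input layout, and the 5 Å contact cutoff are illustrative assumptions, not values from the text.

```python
from math import dist


def hotspot_contact_fraction(hotspot_atoms: dict, binder_atoms: list,
                             cutoff: float = 5.0) -> float:
    """Fraction of hotspot residues with any atom within `cutoff` of the binder.

    hotspot_atoms: maps a residue label (e.g. "A45") to a list of (x, y, z)
                   coordinates for that target residue's atoms.
    binder_atoms:  list of (x, y, z) coordinates for all binder atoms.
    cutoff:        contact distance in angstroms (an assumed, adjustable value).
    """
    if not hotspot_atoms:
        raise ValueError("at least one hotspot residue is required")
    contacted = 0
    for atoms in hotspot_atoms.values():
        # A residue counts as contacted if any atom pair is within the cutoff.
        if any(dist(a, b) <= cutoff for a in atoms for b in binder_atoms):
            contacted += 1
    return contacted / len(hotspot_atoms)


# Toy coordinates: the binder sits near A45 but far from A72.
hotspots = {"A45": [(0.0, 0.0, 0.0)], "A72": [(20.0, 0.0, 0.0)]}
binder = [(3.0, 0.0, 0.0)]
print(hotspot_contact_fraction(hotspots, binder))  # 0.5
```

A design that scores well on confidence metrics but contacts only a fraction of the specified hotspots is a candidate for the wrong-epitope failure described above.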
In our pipeline, we never let a campaign exit on a binary “binds or does not bind” readout. Every confirmed binder has to clear a downstream functional or competition assay before it counts as a deliverable, because the cases where binding does not translate to function are common enough that the binary gate is a confidence trap.
Failure mode 5: post-translational modification incompatibility
The design assumes a structure that depends on disulfide bonds, glycosylation, or specific PTMs that don’t form correctly in the chosen expression system. Yeast-displayed designs that depend on mammalian-style glycosylation are the most common case. Bacterial expression of designs that need disulfide formation is another.
Why it happens: AI design models don’t model PTM machinery. They predict structure assuming the relevant PTMs are installed correctly. When the expression system doesn’t deliver, the design doesn’t fold.
How to filter: PTM site prediction during the design phase, expression-system fit-checking (don’t display in yeast a design that assumes mammalian glycosylation), and the two-platform approach for designs where PTM dependence is unclear.
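A first-pass fit check can be done on sequence alone. The sketch below scans for the canonical N-linked glycosylation sequon (N-X-S/T, where X is not proline) and counts cysteines, then raises warnings depending on the expression system. The system labels and warning codes are hypothetical, and a sequon or a cysteine pair only indicates that a PTM might matter, not that the design depends on it.

```python
import re

# Lookahead so that overlapping sequons (e.g. "NNTS") are all reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def n_glyc_sequons(sequence: str) -> list[int]:
    """1-based positions of Asn residues in N-X-S/T sequons (X != Pro)."""
    return [m.start() + 1 for m in SEQUON.finditer(sequence.upper())]


def expression_warnings(sequence: str, system: str) -> list[str]:
    """Return warning codes for a sequence in a given expression system.

    `system` is one of the hypothetical labels "ecoli_cytoplasm", "yeast",
    or "mammalian"; the codes returned are illustrative names.
    """
    seq = sequence.upper()
    warnings = []
    if system == "ecoli_cytoplasm":
        if seq.count("C") >= 2:
            # The reducing cytoplasm generally disfavors disulfide formation.
            warnings.append("possible_disulfide_in_reducing_cytoplasm")
        if n_glyc_sequons(seq):
            warnings.append("sequon_present_but_no_n_glycosylation")
    elif system == "yeast" and n_glyc_sequons(seq):
        # Yeast glycosylates sequons, but with high-mannose glycans.
        warnings.append("sequon_present_non_mammalian_glycans")
    return warnings


print(n_glyc_sequons("MANGSAANPTK"))  # [3]; the N-P-T at position 8 is excluded
```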
The closed-loop fix
The pattern that closes the gap between published hit rates and production hit rates is iteration with feedback. Single-shot AI design produces a candidate list with a known hit-rate distribution. The hit rate for the next campaign improves only if the design model learns from the wet-lab outcome.
In practice, “learning” doesn’t mean retraining the diffusion model — that’s not feasible at customer-campaign scale. It means:
- Filter rules learned from this campaign feed the next campaign’s pre-screen filters. If a particular hotspot configuration produced 80% aggregators in the last run, deprioritize that configuration in the next.
- Sequence patterns that produced binders inform ProteinMPNN sampling temperature and constraint. Designs that worked share statistical signatures the model can be coaxed toward.
- Target-specific developability thresholds replace generic ones. A specific target’s binders may tolerate higher hydrophobicity than the generic threshold; a different target may need stricter filters.
The closed loop is what shifts hit rates from “random pass-through of design pipeline” to “candidates already vetted against the target’s specific failure modes.”
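The first kind of feedback, turning last round's aggregation outcomes into a pre-screen rule, can be sketched as a small tally. The record layout, the function names, and the `max_rate` and `min_designs` cutoffs below are all assumptions for illustration. The minimum-count guard is there so that a configuration with only a handful of designs is not penalized on noise.

```python
from collections import defaultdict


def aggregation_stats(results) -> dict:
    """Map each hotspot configuration to (aggregation_rate, n_designs).

    results: iterable of (config_label, aggregated) pairs, one per design
             tested in the previous round; `aggregated` is a bool.
    """
    counts = defaultdict(lambda: [0, 0])  # config -> [n_aggregated, n_total]
    for config, aggregated in results:
        counts[config][0] += int(aggregated)
        counts[config][1] += 1
    return {c: (agg / total, total) for c, (agg, total) in counts.items()}


def configs_to_deprioritize(results, max_rate: float = 0.5,
                            min_designs: int = 10) -> list[str]:
    """Configurations whose aggregation rate exceeds `max_rate`.

    Configurations with fewer than `min_designs` tested designs are skipped
    because their observed rate is too noisy to act on.
    """
    stats = aggregation_stats(results)
    return sorted(c for c, (rate, n) in stats.items()
                  if n >= min_designs and rate > max_rate)


# Hypothetical outcomes from a previous round.
last_round = ([("hs_A", True)] * 8 + [("hs_A", False)] * 2
              + [("hs_B", True)] * 1 + [("hs_B", False)] * 9
              + [("hs_C", True)] * 2)
print(configs_to_deprioritize(last_round))  # ['hs_A']
```

Here `hs_C` aggregated in every case but is left alone because two designs are too few to judge.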
Practical recommendations
If you’re running an AI binder design campaign and want to maximize hit rate:
- Apply developability filters before wet-lab synthesis, not after. Synthesizing 1,000 candidates and discarding 950 in screening is more expensive than synthesizing 100 pre-filtered candidates and screening all of them. Synthesis cost dominates wet-lab budget for AI campaigns at this scale.
- Plan for at least two design rounds. Single-round designs produce leads that are good enough to validate the workflow but rarely good enough to deliver to customers. The second round, with feedback from the first, is where the campaign-quality output happens.
- Run yeast or mammalian display, not just ELISA. Display screening surfaces polyspecificity and aggregation issues that ELISA assays miss. The triage value is worth the extra weeks.
- Budget for at least 1,000 designs synthesized per round. Below this, the per-round hit rate noise dominates the design-quality signal.
- Don’t trust the published hit rate of any AI-design vendor without seeing their last campaign’s numbers. Method-paper rates are the ceiling, not the floor; a vendor who claims to match or exceed them on real customer projects should be able to show the campaign data behind the claim.
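The point about per-round noise in the design-count recommendation can be made concrete with a binomial confidence interval. The sketch below uses the Wilson score interval, one common choice and not necessarily how any particular campaign is analyzed, to show how the uncertainty on an observed hit rate narrows as the number of designs grows. The example counts are hypothetical.

```python
from math import sqrt


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    hits: number of designs counted as hits
    n:    number of designs tested
    z:    normal quantile (1.96 gives an approximate 95% interval)
    """
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("require n > 0 and 0 <= hits <= n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed 5% hit rate at two hypothetical campaign sizes.
for hits, n in [(5, 100), (50, 1000)]:
    low, high = wilson_interval(hits, n)
    print(f"n={n}: observed {hits / n:.1%}, interval {low:.1%} to {high:.1%}")
```

With the same observed rate, the interval at the larger size is markedly tighter, which is what makes a round-over-round change in hit rate interpretable.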
In our practice, the underlying mindset matters more than any single filter on this list. We treat AI design outputs the way we treat any other protein design: the developability rules, hydrophobic-patch checks, charge balance, and structural sanity that apply to grafted CDRs and scaffold-designed proteins still apply to model output. The model is the starting point of the design process, not the finished design.
If you’re scoping an AI protein binder design campaign and want help anticipating which failure modes matter for your target, start a Binder Pilot or reach out via the contact page. We design and validate de novo binders end-to-end and publish hit rates without curation.