Why AI-Designed Binders Fail Wet Lab (and Fixes)

The headline numbers in AI protein design papers — “90% of designs bind!” — are correct in their context and misleading out of it. In production binder design campaigns, the wet-lab hit rate for raw RFdiffusion or BindCraft output is a small fraction of the published rate. Aggressive developability filtering closes much of the gap. Hit rates approaching the method-paper headlines are exceptional and usually mean the target was particularly cooperative. Here are the failure modes that account for the gap, and the filters that close it.

What the published hit rates measure

When a method paper reports “90% binding,” it typically means: of designs that passed multiple curation steps internal to the method, 90% bound the target in a screening assay calibrated to that method. The denominator excludes:

Designs that didn’t pass AlphaFold2 confidence filters
Designs with low pLDDT or pTM scores
Designs the authors discarded for obvious structural issues
Designs that failed expression and were therefore never tested
Designs that bound but to off-target sites the method-development screen didn’t detect

The published rate is a useful method-development metric. It is not a project-planning metric. The number a customer cares about — designs ordered, fraction that produced a usable binder — is what we mean by “wet-lab hit rate” in this article.
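
The difference between the two denominators is easy to see with a small funnel calculation. The sketch below is illustrative only: the `Funnel` class, its field names, and every count are hypothetical and not taken from any paper or campaign.

```python
from dataclasses import dataclass


@dataclass
class Funnel:
    """Counts at each stage of a design campaign (all values hypothetical)."""
    ordered: int     # designs actually synthesized
    expressed: int   # designs that expressed well enough to be assayed
    bound: int       # designs that bound in the screening assay
    usable: int      # binders that also cleared functional/developability checks


def method_paper_rate(f: Funnel) -> float:
    """Binders over designs that reached the binding assay."""
    return f.bound / f.expressed


def wet_lab_hit_rate(f: Funnel) -> float:
    """Usable binders over designs ordered: the project-planning number."""
    return f.usable / f.ordered


# Made-up numbers chosen only to show how the two rates diverge.
example = Funnel(ordered=500, expressed=100, bound=90, usable=20)
print(f"method-paper style rate: {method_paper_rate(example):.0%}")
print(f"wet-lab hit rate:        {wet_lab_hit_rate(example):.0%}")
```

Both numbers describe the same hypothetical campaign; only the denominator changes.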

Failure mode 1: aggregation and expression failure

The single largest failure category in our campaigns is designs that fail to express in soluble form. They aggregate during yeast or mammalian expression, or they fail to reach the cell surface in display experiments, or they form inclusion bodies in E. coli-based downstream production. The structural model the AI generated may be plausible, but the protein doesn’t fold to that model in cells.

Why it happens: AI design models are trained on PDB structures, which are by definition successfully expressed and crystallized proteins. The training set has selection bias. Designs that resemble the failure cases — proteins that didn’t make it to the PDB — are not penalized by the model because the model never saw them.

How to filter: predicted aggregation propensity (TANGO, AGGRESCAN, or developability-aware variants), surface hydrophobicity scores, and net surface charge. Any design with a hydrophobic patch larger than ~400 Å² should be flagged. We documented the specific filters at length in “RFdiffusion outputs need a developability check.”
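
A minimal sketch of such a pre-screen follows. It is not our production filter. It assumes the largest hydrophobic patch area has already been computed from the structural model by a separate tool and is passed in as `largest_patch_A2`. It uses whole-sequence GRAVY (Kyte-Doolittle) as a crude stand-in for a real surface hydrophobicity score. The `max_gravy` cutoff and all function names are hypothetical; only the ~400 Å² patch threshold comes from the text above.

```python
# Kyte-Doolittle hydropathy values per residue.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(seq: str) -> float:
    """Mean hydropathy over the whole sequence (not surface-specific)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def crude_net_charge(seq: str) -> int:
    """Count K/R as +1 and D/E as -1; ignores His, termini, and pKa shifts."""
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    return positive - negative


def developability_flags(seq: str, largest_patch_A2: float,
                         max_patch_A2: float = 400.0,
                         max_gravy: float = 0.0) -> list:
    """Return reasons to flag a design; an empty list means no flag raised.

    max_patch_A2 follows the ~400 A^2 rule of thumb in the text;
    max_gravy is a hypothetical placeholder to be tuned per target.
    """
    flags = []
    if largest_patch_A2 > max_patch_A2:
        flags.append("hydrophobic patch")
    if gravy(seq) > max_gravy:
        flags.append("high GRAVY")
    return flags
```

The crude charge estimate is returned separately rather than thresholded, since the acceptable range depends on the target and the assay buffer.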

Failure mode 2: low affinity (Kd > 1 µM)

A design that “binds” the target on AlphaFold2 confidence metrics may bind with Kd in the high micromolar range. This passes binary “does it bind” assays but fails any quantitative selection. The failure isn’t in the structural model; it’s that the model’s scoring function is correlated with affinity but not equivalent to it. A geometrically perfect interface can still have insufficient enthalpic and entropic optimization.

Why it happens: AI design optimizes for “designability” — the probability that a sequence folds to a given backbone — and “binding mode plausibility.” It does not directly optimize for affinity. Some methods incorporate affinity-correlated proxies, but the proxy is imperfect.

How to filter: dock-and-rescore approaches (Rosetta interface analysis, AlphaFold2-multimer with confidence-on-docked-pose), interface buried surface area, and shape complementarity. Designs with buried surface area below ~600 Å² rarely produce sub-µM affinity binders.
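
As a sketch of the buried-surface-area gate, the snippet below computes interface burial from precomputed solvent-accessible surface areas and applies the ~600 Å² cutoff from the text. Several things here are assumptions: the SASA values and shape complementarity come from an external tool, buried area is taken as the total over both partners (some tools report half of this), and the `min_sc` default of 0.6 is a hypothetical placeholder rather than a recommended value.

```python
from dataclasses import dataclass


@dataclass
class InterfaceMetrics:
    """Precomputed per-design interface metrics (areas in square angstroms)."""
    name: str
    sasa_binder: float            # SASA of the binder alone
    sasa_target: float            # SASA of the target alone
    sasa_complex: float           # SASA of the bound complex
    shape_complementarity: float  # shape complementarity score, 0 to 1


def buried_surface_area(m: InterfaceMetrics) -> float:
    """Total area buried on complex formation, summed over both partners."""
    return m.sasa_binder + m.sasa_target - m.sasa_complex


def passes_interface_filter(m: InterfaceMetrics,
                            min_bsa: float = 600.0,
                            min_sc: float = 0.6) -> bool:
    """Keep designs that clear both the burial and complementarity cutoffs."""
    return (buried_surface_area(m) >= min_bsa
            and m.shape_complementarity >= min_sc)
```

Whatever convention the SASA tool uses, the cutoff has to be expressed in the same convention before the two are compared.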

Failure mode 3: off-target binding (polyspecificity)

Some designs that bind the intended target also bind much else: serum albumin, off-target cell-surface receptors, plate plastic. This appears as “non-specific binding” in flow cytometry and as panreactivity in PSR (poly-specificity reagent) assays. For therapeutic applications, polyspecificity is a developability dealbreaker.

Why it happens: hydrophobic patches and electrostatic asymmetries that drive on-target affinity also drive off-target binding. The design model sees the on-target benefit; it doesn’t penalize the off-target cost.

How to filter: PSR scoring during candidate triage, baculovirus particle (BVP) ELISA before commitment to soluble expression, and aggressive negative-selection rounds during yeast or mammalian display screening. We’ve documented the developability red flags in detail in “5 Developability Red Flags.”

Failure mode 4: binding the wrong epitope

The design binds the target — at a site other than the one specified by the hotspot conditioning. This is a partial success: you have a binder, but it doesn’t do what the customer needs (block a specific receptor interaction, neutralize a specific function). The wet-lab assay reports “yes, binds.” The functional assay reports “no effect.”

Why it happens: hotspot conditioning is a soft constraint, not a hard one. The model can satisfy the hotspot condition partially while finding a stronger binding mode elsewhere on the target. Diffusion-model designs that “almost” hit the right epitope are a surprisingly common output.

How to filter: AlphaFold2-multimer redocking with hotspot-residue contact monitoring, competition assays during display screening (titrate against a known epitope-blocker), and structural biology validation of leads (HDX-MS, cryo-EM if available, or competition with structurally characterized binders).
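
Hotspot-residue contact monitoring on a redocked model can be sketched as a simple distance check. The function below assumes atom coordinates have already been parsed from the predicted complex. The function name, the input layout, and the 5 Å contact cutoff are illustrative assumptions, not a fixed standard.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def hotspot_contact_fraction(hotspot_atoms: Dict[int, List[Coord]],
                             binder_atoms: List[Coord],
                             cutoff: float = 5.0) -> float:
    """Fraction of hotspot residues with any atom within `cutoff` of the binder.

    hotspot_atoms maps a target residue number to that residue's atom
    coordinates; binder_atoms is a flat list of binder atom coordinates.
    """
    if not hotspot_atoms:
        raise ValueError("no hotspot residues supplied")
    contacted = 0
    for atoms in hotspot_atoms.values():
        # A residue counts as contacted if any atom pair is within the cutoff.
        if any(math.dist(a, b) <= cutoff for a in atoms for b in binder_atoms):
            contacted += 1
    return contacted / len(hotspot_atoms)
```

A design whose redocked pose contacts only a minority of the specified hotspots is a candidate for the "binds, but at the wrong site" outcome and is worth routing to a competition assay early.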

In our pipeline, we never let a campaign exit on a binary “binds or does not bind” readout. Every confirmed binder has to clear a downstream functional or competition assay before it counts as a deliverable, because the cases where binding does not translate to function are common enough that the binary gate is a confidence trap.

Failure mode 5: post-translational modification incompatibility

The design assumes a structure that depends on disulfide bonds, glycosylation, or specific PTMs that don’t form correctly in the chosen expression system. Yeast-displayed designs that depend on mammalian-style glycosylation are the most common case. Bacterial expression of designs that need disulfide formation is another.

Why it happens: AI design models don’t model PTM machinery. They predict structure assuming the relevant PTMs are installed correctly. When the expression system doesn’t deliver, the design doesn’t fold.

How to filter: PTM site prediction during the design phase, expression-system fit-checking (don’t display in yeast a design that assumes mammalian glycosylation), and the two-platform approach for designs where PTM dependence is unclear.
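
A rough expression-system fit check can be done from sequence alone. The sketch below looks for the canonical N-glycosylation sequon (N-X-S/T, where X is not proline) and counts cysteines, then raises warnings for the two mismatches described above. The host labels, the warning wording, and the two-cysteine trigger are hypothetical. A sequon or a cysteine pair only marks a site for manual review; it does not show that the fold depends on the modification.

```python
import re

# Canonical N-glycosylation sequon N-X-[S/T] with X != P.
# The lookahead lets overlapping sequons be reported.
SEQUON = re.compile(r"(?=N[^P][ST])")


def n_glyc_sites(seq: str) -> list:
    """1-based positions of Asn residues in N-X-S/T sequons."""
    return [m.start() + 1 for m in SEQUON.finditer(seq)]


def expression_warnings(seq: str, host: str) -> list:
    """Heuristic warnings for a design/host pairing.

    host is one of the hypothetical labels "ecoli_cytoplasm",
    "yeast", or "mammalian".
    """
    warnings = []
    sites = n_glyc_sites(seq)
    n_cys = seq.count("C")
    if host == "ecoli_cytoplasm" and n_cys >= 2:
        warnings.append(f"{n_cys} cysteines: possible disulfides may not form")
    if host == "ecoli_cytoplasm" and sites:
        warnings.append(f"sequons at {sites}: no N-glycosylation in this host")
    if host == "yeast" and sites:
        warnings.append(f"sequons at {sites}: yeast glycans differ from mammalian")
    return warnings
```

Designs that raise a warning in one host but not another are natural candidates for the two-platform approach.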

The closed-loop fix

The pattern that closes the gap between published hit rates and production hit rates is iteration with feedback. Single-shot AI design produces a candidate list with a known hit-rate distribution. The hit rate for the next campaign improves only if the design pipeline learns from the wet-lab outcome.

In practice, “learning” doesn’t mean retraining the diffusion model — that’s not feasible at customer-campaign scale. It means:

Filter rules learned from this campaign feed the next campaign’s pre-screen filters. If a particular hotspot configuration produced 80% aggregators in the last run, deprioritize that configuration in the next.
Sequence patterns that produced binders inform ProteinMPNN sampling temperature and constraints. Designs that worked share statistical signatures the model can be coaxed toward.
Target-specific developability thresholds replace generic ones. A specific target’s binders may tolerate higher hydrophobicity than the generic threshold; a different target may need stricter filters.

The closed loop is what shifts hit rates from “random pass-through of design pipeline” to “candidates already vetted against the target’s specific failure modes.”
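
The first feedback rule above, deprioritizing hotspot configurations that mostly produced aggregators, can be sketched as a small bookkeeping step between rounds. The input format, the function names, and the `max_rate` and `min_tested` defaults are hypothetical; in practice the cutoffs would be set per target.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def aggregation_rates(
    outcomes: Iterable[Tuple[str, bool]]
) -> Dict[str, Tuple[float, int]]:
    """Map each hotspot configuration to (aggregator fraction, designs tested).

    outcomes yields (configuration label, did_aggregate) per tested design.
    """
    totals, aggregated = Counter(), Counter()
    for config, did_aggregate in outcomes:
        totals[config] += 1
        aggregated[config] += int(did_aggregate)
    return {c: (aggregated[c] / n, n) for c, n in totals.items()}


def configs_to_deprioritize(outcomes: Iterable[Tuple[str, bool]],
                            max_rate: float = 0.5,
                            min_tested: int = 20) -> List[str]:
    """Configurations with enough data and an aggregator rate above max_rate."""
    rates = aggregation_rates(outcomes)
    return sorted(c for c, (rate, n) in rates.items()
                  if n >= min_tested and rate > max_rate)
```

The `min_tested` guard keeps a configuration from being dropped on the strength of a handful of designs, where the observed rate is mostly noise.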

Practical recommendations

If you’re running an AI binder design campaign and want to maximize hit rate:

Apply developability filters before wet-lab synthesis, not after. Synthesizing 1,000 candidates and discarding 950 in screening is more expensive than synthesizing 100 pre-filtered candidates and screening all of them. Synthesis cost dominates wet-lab budget for AI campaigns at this scale.
Plan for at least two design rounds. Single-round designs produce leads that are good enough to validate the workflow but rarely good enough to deliver to customers. The second round, with feedback from the first, is where the campaign-quality output happens.
Run yeast or mammalian display, not just ELISA. Display screening surfaces polyspecificity and aggregation issues that ELISA assays miss. The triage value is worth the extra weeks.
Budget for at least 1,000 designs synthesized per round. Below this, the per-round hit rate noise dominates the design-quality signal.
Don’t trust the published hit rate of any AI-design vendor without seeing their last campaign’s numbers. Method-paper rates are computed on a curated denominator, so treat them as a best case; ask for the vendor’s hit rate on real customer projects, counted against designs ordered.
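
The point about per-round noise in the 1,000-design recommendation can be made concrete with a confidence interval on the observed hit rate. The sketch below uses the Wilson score interval for a binomial proportion. The 2% hit rate in the example is a hypothetical figure chosen for illustration, not a number from our campaigns.

```python
import math
from typing import Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a hit rate of hits/n (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Same hypothetical 2% observed hit rate at two campaign sizes.
for hits, n in [(2, 100), (20, 1000)]:
    low, high = wilson_interval(hits, n)
    print(f"n={n:>4}: hit rate {hits / n:.1%}, interval {low:.1%} to {high:.1%}")
```

At the smaller campaign size the interval is several times wider than the estimate itself, so a change in hit rate between rounds cannot be separated from sampling noise.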

In our practice, the underlying mindset matters more than any single filter on this list. We treat AI design outputs the way we treat any other protein design: the developability rules, hydrophobic-patch checks, charge balance, and structural sanity that apply to grafted CDRs and scaffold-designed proteins still apply to model output. The model is the starting point of the design process, not the finished design.

If you’re scoping an AI protein binder design campaign and want help anticipating which failure modes matter for your target, start a Binder Pilot or reach out via the contact page. We design and validate de novo binders end-to-end and publish hit rates without curation.

Frequently asked questions

Why do AI-designed protein binders fail in the wet lab?

The most common failure modes are aggregation and expression failure, low affinity in the high-micromolar range, off-target polyspecificity, binding the wrong epitope, and post-translational modification incompatibility. AI design models are trained on successfully crystallized proteins, so they do not penalize designs that resemble the failure cases, and they optimize for designability rather than directly for affinity or developability.

Why are real hit rates lower than the numbers in AI protein design papers?

Method papers report hit rates after multiple internal curation steps, excluding designs that failed confidence filters, failed expression, or were discarded for structural issues. That denominator is much smaller than the number of candidates a customer actually orders. The published rate is a method-development metric, not a project-planning metric. Aggressive developability filtering before synthesis closes much of the gap.

How can the wet-lab hit rate of AI-designed binders be improved?

Apply developability filters before synthesis rather than after, since synthesis cost dominates the wet-lab budget at scale. Plan for at least two design rounds so feedback from the first informs the second. Screen on yeast or mammalian display rather than ELISA alone, because display surfaces polyspecificity and aggregation issues. Budget for at least 1,000 designs synthesized per round so design-quality signal exceeds noise.

Do AlphaFold confidence metrics predict binder affinity?

Not reliably. Confidence metrics such as ipTM work well as negative filters: low scores reliably rule out high-affinity binders. High scores, however, do not predict affinity. Model confidence outputs are best treated as triage to discard the worst candidates, not as a ranking of the best. The real ranking has to come from downstream display screening and biochemical readouts.
