FlowBank

Query-Adaptive Agentic Workflow Optimization through Precompute-and-Reuse

University of Maryland, College Park; Amazon

Overview


LLM-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow for each query at substantial inference cost.

Our motivating analysis shows that these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating a workflow for every instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time.

To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization: (i) Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool; (ii) Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy; and (iii) Matching casts deployment as edge-value prediction on a query–workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility.

Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.

Figure 1. FlowBank turns workflows from one-shot solutions into reusable assets. Left: rather than committing to a single universal workflow or generating a new workflow for every query, FlowBank builds a compact portfolio of complementary workflows offline and assigns each query to the member with the best predicted utility. Right: this portfolio view recovers query-level adaptivity without the full per-query generation cost, yielding a stronger performance–cost trade-off across five benchmarks. (The plotted cost axis is inverted so that higher values correspond to lower actual cost.)

Contributions

  • We identify two under-exploited forms of workflow complementarity: discarded task-level workflows retain set-level reuse value, and part of the gain from query-level generation can be recovered through adaptive reuse of cheaper precomputed workflows.
  • We propose FlowBank, a three-stage framework that builds a reusable workflow bank through diversified search, coverage-aware curation, and graph-based query-adaptive matching.
  • We show FlowBank achieves the strongest average performance and competitive performance–cost trade-offs across five benchmarks, validating portfolio-based workflow reuse as a stronger alternative to committing to a single workflow or a single paradigm.

Motivating Observations

Given a task dataset $D$, an agentic workflow $\omega(\cdot)$ is a computation graph of LLM calls mapping a query $q$ to a prediction $\omega(q)$. We measure two quantities: the performance $e(\omega, q) \in [0,1]$ and the token cost $c(\omega, q) > 0$. Task-level optimization picks one static workflow $\omega_{\text{stat}} = \arg\max_{\omega}\sum_{q} e(\omega, q)$ for all queries, while query-level optimization trains a meta-generator $G_\phi$ to design a custom workflow $\omega^q_{\text{dyn}} = G_\phi(q)$ per query. Our analysis reveals two under-exploited forms of complementarity.
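The gap between a single static workflow and per-query selection can be made concrete with a toy performance matrix. The sketch below is illustrative only: the workflow names and the 0/1 scores are made-up values, and `static_workflow` and `oracle_per_query` are hypothetical helper names, not part of FlowBank.

```python
# Toy performance matrix e[workflow][query] with scores in [0, 1].
# All names and values here are illustrative assumptions.
e = {
    "w_a": [1, 1, 0, 0],
    "w_b": [0, 1, 1, 1],
    "w_c": [1, 0, 0, 1],
}
n_queries = 4

def static_workflow(e):
    """Task-level choice: the single workflow with the highest total score."""
    return max(e, key=lambda w: sum(e[w]))

def oracle_per_query(e, n_queries):
    """Upper bound when every query may use its best workflow."""
    return [max(e[w][q] for w in e) for q in range(n_queries)]

best = static_workflow(e)
static_acc = sum(e[best]) / n_queries
oracle_acc = sum(oracle_per_query(e, n_queries)) / n_queries
print(best, static_acc, oracle_acc)
```

In this toy case the best single workflow solves three of four queries, while the set of three workflows jointly covers all four, which is the kind of complementarity the observations below examine.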

Observation 1. Task-level search explores many workflows but deploys only one, yet the discarded workflows retain set-level reuse value.

Figure 2. Coverage on MATH for workflow sets built from AFlow candidates. Coverage rises steadily with set size $k$, and the combinatorially best size-$k$ subset (orange) consistently beats the top-$k$ accuracy set (blue), so accuracy-based construction is suboptimal for set coverage.

Observation 2. Part of the gain from expensive query-level generation can be recovered by cheaper precomputed workflows.

Figure 3. Performance–cost comparison on DROP between ScoreFlow and an oracle selector over a static workflow and ScoreFlow. A substantial fraction of queries solved by ScoreFlow can be handled by the cheaper static workflow, so redundant query-level cost can be avoided.
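A minimal sketch of the oracle selector behind Observation 2, under stated assumptions: the per-query (performance, cost) pairs are invented toy numbers, and `oracle_select` is a hypothetical helper that picks the better-performing option and breaks ties toward lower cost.

```python
# (performance, cost) per query for a static workflow and a query-level
# generator. All numbers are illustrative assumptions, not measured results.
static = [(1, 0.2), (0, 0.2), (1, 0.3), (0, 0.2)]
dynamic = [(1, 1.0), (1, 1.2), (1, 0.9), (0, 1.1)]

def oracle_select(options):
    """Pick the highest-performance option; ties go to the cheaper one."""
    return max(options, key=lambda pc: (pc[0], -pc[1]))

def mean(values):
    return sum(values) / len(values)

chosen = [oracle_select([s, d]) for s, d in zip(static, dynamic)]
oracle_perf = mean([p for p, _ in chosen])
oracle_cost = mean([c for _, c in chosen])
dynamic_perf = mean([p for p, _ in dynamic])
dynamic_cost = mean([c for _, c in dynamic])
print(oracle_perf, oracle_cost, dynamic_perf, dynamic_cost)
```

With these toy numbers the oracle matches the query-level generator's performance at less than half its average cost, because it falls back to the static workflow whenever that workflow already solves the query.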

Realizing this potential requires addressing two challenges that directly motivate the design of FlowBank: (1) Workflow pool construction — traditional task-level methods optimize workflows for individual accuracy, which is suboptimal for building a high-coverage pool; and (2) Query-adaptive selection — gains are realized only if the system efficiently routes each query to the right workflow at inference time.

Method

FlowBank adopts a portfolio-based methodology for workflow deployment. Rather than searching for a single universally best workflow, we build, curate, and deploy a compact set of complementary workflows through three stages.

Figure 4. Overview of FlowBank. DiverseFlow builds a diverse raw pool $\Omega_{\text{raw}}$ via performance-oriented warm-up followed by coverage-oriented expansion; CuraFlow selects a compact portfolio $\Omega^*$ that retains most attainable coverage while pruning redundant workflows; and a bipartite query–workflow matcher predicts each portfolio member's utility under the performance–cost trade-off and executes the highest-scoring workflow.

Stage 1 · Diversifying

Complementarity-oriented Workflow Diversification via DiverseFlow

Built on Monte Carlo tree search, DiverseFlow inherits AFlow's four-step search loop (select parent → propose new workflow in code space → evaluate → tag success/failure). It runs in two phases:

Performance-oriented warm-up. For the first $N_0$ rounds, a parent workflow is sampled with weight $\eta(\omega)=\sum_{q} e(\omega, q)$, so strong individual workflows are more likely to be expanded:

$P(\omega_i)= \rho \cdot \frac{1}{n} + (1 - \rho) \cdot \frac{\exp(\alpha(\eta(\omega_i) - \eta_{\max}))}{\sum_{j}\exp(\alpha(\eta(\omega_j) - \eta_{\max}))}$
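The sampling rule above mixes a uniform distribution with a softmax over scores. The sketch below assumes that $n$ is the number of candidate workflows, and the values chosen for $\rho$, $\alpha$, and the scores $\eta$ are placeholders rather than settings reported in this section.

```python
import math

def parent_probs(eta, rho=0.2, alpha=0.4):
    """Mixture of uniform sampling (weight rho) and a softmax over eta.

    Subtracting eta_max inside the exponent leaves the softmax unchanged
    and keeps the exponentials numerically stable.
    """
    n = len(eta)
    eta_max = max(eta)
    exps = [math.exp(alpha * (x - eta_max)) for x in eta]
    z = sum(exps)
    return [rho / n + (1 - rho) * ex / z for ex in exps]

# Hypothetical total scores eta(w) for three candidate parents.
probs = parent_probs([12.0, 15.0, 9.0])
print(probs)
```

The uniform term guarantees every candidate a probability of at least $\rho / n$, so weaker workflows can still be expanded occasionally.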

Complementarity-oriented expansion. After warm-up, the weight switches to a query-weighted score $\sum_{q}\overline{\mu}(q)\,e(\omega, q)$, where the query difficulty weight $\mu(q) = \frac{1}{1+\sum_{i} e(\omega_i, q)}$ peaks at $1$ when no workflow solves $q$. This steers MCTS toward under-covered queries. We finally augment the pool with a ScoreFlow query-level workflow to form $\Omega_{\text{raw}}$.
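To see how the difficulty weight shifts the search, consider the sketch below. It assumes that $\overline{\mu}$ is $\mu$ normalized to sum to one over queries (the section does not define the bar), and the pool and candidate score vectors are toy values.

```python
def difficulty_weights(pool):
    """mu(q) = 1 / (1 + sum_i e(w_i, q)) for each query q."""
    n_queries = len(pool[0])
    return [1.0 / (1.0 + sum(w[q] for w in pool)) for q in range(n_queries)]

def weighted_score(workflow, mu):
    """Query-weighted score with mu normalized to sum to 1 (an assumption)."""
    total = sum(mu)
    return sum(m / total * e for m, e in zip(mu, workflow))

# Two workflows already in the pool; query 3 is solved by neither.
pool = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
mu = difficulty_weights(pool)  # weights: 1/3, 1/3, 1/2, 1

cand_easy = [1, 1, 0, 0]  # solves two already-covered queries
cand_hard = [0, 0, 0, 1]  # solves only the uncovered query
print(weighted_score(cand_easy, mu), weighted_score(cand_hard, mu))
```

Under plain accuracy `cand_easy` would score higher, but under the query-weighted score `cand_hard` wins (6/13 versus 4/13), which is the intended bias toward under-covered queries.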

Stage 2 · Curating

Coverage-aware Combinatorial Curation via CuraFlow

Using the full pool is neither necessary nor desirable: many workflows are redundant and enlarging the pool yields diminishing coverage gains while raising selector burden. CuraFlow compresses $\Omega_{\text{raw}}$ into a compact portfolio. For each cardinality $k$, it finds the size-$k$ subset that maximizes coverage:

$\Omega_k^* = \arg\max_{\Omega \subseteq \Omega_{\mathrm{raw}},\, |\Omega| = k} \mathrm{Coverage}(\Omega), \quad \mathrm{Coverage}(\Omega) = \frac{1}{|D|} \sum_{q\in D} \max_{\omega \in \Omega} e(\omega, q)$
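For small pools the coverage objective can be solved by exhaustive enumeration. The following sketch uses a toy 0/1 performance table and hypothetical helper names; it omits the correlation-based tie-breaking and makes no claim about how CuraFlow performs the search in practice.

```python
from itertools import combinations

def coverage(subset, e):
    """Mean over queries of the best score achieved by any workflow in subset."""
    n = len(next(iter(e.values())))
    return sum(max(e[w][q] for w in subset) for q in range(n)) / n

def best_subset(e, k):
    """Exhaustively find a size-k subset with maximal coverage."""
    return max(combinations(sorted(e), k), key=lambda s: coverage(s, e))

def top_k_accuracy(e, k):
    """Baseline: the k workflows with the highest individual accuracy."""
    return tuple(sorted(e, key=lambda w: sum(e[w]), reverse=True)[:k])

# Toy table: w2 is accurate but redundant with w1; w3 is weak but complementary.
e = {
    "w1": [1, 1, 1, 1, 0, 0],
    "w2": [1, 1, 1, 0, 0, 0],
    "w3": [0, 0, 0, 0, 1, 1],
}
print(top_k_accuracy(e, 2), coverage(top_k_accuracy(e, 2), e))
print(best_subset(e, 2), coverage(best_subset(e, 2), e))
```

Here the top-2 accuracy set is (w1, w2) with coverage 4/6, while the coverage-optimal pair is (w1, w3) with full coverage. Enumeration grows combinatorially with pool size, so this sketch is only practical for small pools.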

Ties are broken toward the subset with the lowest mean pairwise correlation. Because coverage is submodular, marginal gains diminish, so we pick the smallest portfolio reaching a saturation ratio $\tau$ of the full pool's coverage:

$k^*=\min\{k: \max_{|\Omega|=k}\mathrm{Coverage}(\Omega)\geq\tau\cdot\mathrm{Coverage}(\Omega_{\text{raw}})\}$
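The saturation rule can be sketched as a loop over portfolio sizes. The performance table and the values of $\tau$ below are illustrative assumptions, and `smallest_saturating_k` is a hypothetical helper name.

```python
from itertools import combinations

def coverage(subset, e):
    """Mean over queries of the best score achieved by any workflow in subset."""
    n = len(next(iter(e.values())))
    return sum(max(e[w][q] for w in subset) for q in range(n)) / n

def smallest_saturating_k(e, tau):
    """Smallest k whose best size-k coverage reaches tau * full-pool coverage."""
    names = sorted(e)
    full = coverage(names, e)
    for k in range(1, len(names) + 1):
        best = max(coverage(s, e) for s in combinations(names, k))
        if best >= tau * full:
            return k
    return len(names)

# Toy pool over 10 queries; the last query is solved by no workflow.
e = {
    "w1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "w2": [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
    "w3": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    "w4": [1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
}
print(smallest_saturating_k(e, 0.85), smallest_saturating_k(e, 1.0))
```

In this toy pool the full coverage is 0.9; two workflows already reach 0.8, so $\tau = 0.85$ gives $k^* = 2$, while demanding the full pool's coverage ($\tau = 1$) requires a third workflow that adds only one query.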

DiverseFlow improves recall of useful behaviors; CuraFlow improves precision and deployability.

Stage 3 · Matching

Graph-Based Query-Adaptive Matching

We cast workflow selection as edge-value prediction on a query–workflow bipartite graph $G = (D \cup \Omega^*, E)$. Query nodes are initialized with text embeddings and workflow nodes with workflow-description embeddings; a heterogeneous GNN maps both into a shared space. Each edge carries a cost-aware supervision value:

$v_{q,\omega} = (1 - \lambda) \cdot \tilde{e}_{q,\omega} + \lambda \cdot (1 - \tilde{c}_{q,\omega})$

where $\tilde{e}$ and $\tilde{c}$ are normalized performance and cost, and $\lambda$ controls the deployment trade-off. A 2-layer GNN encoder plus MLP decoder is trained with masked edge prediction under a per-edge BCE loss, naturally handling ties. At deployment, a new query is attached to the graph and routed by $\pi_\theta(q) = \arg\max_{\omega \in \Omega^*} f_\theta(q,\omega)$ — a single forward pass whose cost is negligible compared to executing any workflow.
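The effect of $\lambda$ on routing can be illustrated with the edge-value formula alone. This sketch makes two assumptions that the section does not specify: performance and cost are min-max normalized across the portfolio for a single query, and the ground-truth edge values stand in for the GNN predictions $f_\theta(q, \omega)$. The workflow names and numbers are invented.

```python
def min_max(values):
    """Min-max normalize to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def edge_values(perf, cost, lam):
    """v = (1 - lam) * e_norm + lam * (1 - c_norm) for each workflow."""
    e_norm, c_norm = min_max(perf), min_max(cost)
    return [(1 - lam) * e + lam * (1 - c) for e, c in zip(e_norm, c_norm)]

def route(names, values):
    """Return the workflow with the highest (stand-in) predicted utility."""
    return names[max(range(len(values)), key=values.__getitem__)]

# One query, three portfolio members (toy numbers).
names = ["w_a", "w_b", "w_c"]
perf = [0.9, 0.8, 0.2]
cost = [2.0, 0.5, 0.4]

pick_perf_only = route(names, edge_values(perf, cost, lam=0.0))
pick_cost_aware = route(names, edge_values(perf, cost, lam=0.3))
print(pick_perf_only, pick_cost_aware)
```

With $\lambda = 0$ the most accurate workflow `w_a` is chosen; with $\lambda = 0.3$ the router prefers `w_b`, which is slightly less accurate but much cheaper.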

Experiments

We evaluate FlowBank on five public benchmarks spanning four domains: math reasoning (MATH, AMC), code generation (MBPP), question answering (MMLU Pro), and reading comprehension (DROP). Each benchmark is split into training and test sets with a 1:4 ratio. We use Qwen3-8B as the optimizer and GPT-4o mini (temperature 0) as a fixed executor across all workflows, comparing against 7 manually designed and 6 automated baselines.

Main Results

FlowBank achieves the best average performance (73.40) and ranks first on all five benchmarks. It improves over the strongest automated baseline AFlow (GPT-4o, 70.40) by 3.00 absolute points (4.26% relative) — despite using a weaker optimizer — and over the strongest manually designed workflow MultiPersona (63.87) by 9.53 points (14.92% relative).
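For reference, the relative improvements follow directly from the average scores in Table 1:

$\frac{73.40 - 70.40}{70.40} = \frac{3.00}{70.40} \approx 4.26\%, \qquad \frac{73.40 - 63.87}{63.87} = \frac{9.53}{63.87} \approx 14.92\%$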

Table 1. Performance and cost comparison across 5 benchmarks. Bold = best, underline = second best (performance ranked across all methods; cost highlighted within the automated block and FlowBank). Cost unit: $\times 10^{-3}$ USD per 1K tokens.

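| Method | MATH Perf. | MATH Cost | AMC Perf. | AMC Cost | MBPP Perf. | MBPP Cost | DROP Perf. | DROP Cost | MMLU Pro Perf. | MMLU Pro Cost | Average Perf. | Average Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IO | 56.58 | 0.51 | 53.59 | 0.53 | 72.92 | 0.08 | 75.49 | 0.08 | 50.89 | 0.03 | 60.44 | 0.22 |
| Manually Designed Agentic Workflows | | | | | | | | | | | | |
| CoT | 55.76 | 0.53 | 52.47 | 0.54 | 74.29 | 0.08 | 77.96 | 0.11 | 51.35 | 0.04 | 60.98 | 0.23 |
| ComplexCoT | 54.39 | 0.62 | 49.52 | 0.64 | 71.68 | 0.10 | 77.74 | 0.33 | 64.03 | 0.35 | 63.85 | 0.42 |
| Self-Consistency | 57.82 | 3.38 | 53.89 | 3.44 | 73.98 | 0.52 | 78.33 | 0.74 | 50.93 | 0.34 | 61.49 | 1.51 |
| Multi-agent Debate | 58.92 | 7.66 | 53.82 | 7.96 | 74.68 | 0.86 | 78.65 | 1.64 | 57.56 | 0.88 | 63.86 | 3.45 |
| Self-Refine | 55.14 | 1.08 | 52.88 | 1.06 | 71.07 | 0.27 | 76.71 | 0.34 | 43.75 | 0.46 | 57.96 | 0.62 |
| MedPrompt | 56.69 | 5.64 | 55.19 | 5.49 | 71.85 | 0.61 | 81.77 | 1.75 | 52.79 | 0.63 | 62.77 | 2.58 |
| MultiPersona | 57.41 | 0.56 | 53.74 | 0.58 | 70.38 | 0.28 | 76.13 | 0.29 | 61.70 | 0.36 | 63.87 | 0.41 |
| Automated Agentic Workflow Optimization Methods | | | | | | | | | | | | |
| GPTSwarm | 54.94 | 1.69 | 55.42 | 9.04 | 71.26 | 0.25 | 78.37 | 0.51 | 58.37 | 1.14 | 63.43 | 2.54 |
| ADAS | 56.17 | 2.95 | 47.33 | 3.22 | 72.34 | 0.77 | 81.17 | 0.98 | 59.71 | 1.05 | 63.22 | 1.71 |
| AgentSquare | 62.76 | 1.35 | 58.78 | 1.79 | 72.73 | 0.52 | 76.84 | 0.76 | 63.75 | 0.75 | 66.70 | 1.02 |
| AFlow (Qwen3-8B) | 63.24 | 3.23 | 62.75 | 2.51 | 79.37 | 1.02 | 77.08 | 1.05 | 64.62 | 1.47 | 68.56 | 1.79 |
| AFlow (GPT-4o) | 63.99 | 2.77 | 63.77 | 4.86 | 83.77 | 0.57 | 80.26 | 1.02 | 65.61 | 0.89 | 70.40 | 1.95 |
| MaAS | 65.02 | 2.43 | 63.36 | 2.70 | 79.08 | 2.04 | 76.64 | 1.64 | 62.31 | 1.29 | 68.09 | 1.90 |
| ScoreFlow | 62.00 | 2.36 | 60.15 | 3.51 | 83.63 | 1.70 | 82.10 | 1.42 | 64.58 | 2.62 | 69.51 | 2.37 |
| FlowBank | 69.34 | 1.78 | 67.94 | 2.49 | 84.26 | 1.62 | 83.49 | 1.06 | 67.40 | 1.52 | 73.40 | 1.65 |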

Performance–Cost Trade-off

Beyond raw accuracy, FlowBank remains efficient. Its average inference cost of 1.65 is below that of AFlow (GPT-4o, 1.95) and ScoreFlow (2.37), while its average performance is higher than both. AgentSquare is cheaper (1.02) but trails by 6.70 points. As the figure shows, no compared method offers both higher average performance and lower average cost, so FlowBank sits on the Pareto frontier.

Figure 5. Performance–cost trade-off across all methods. FlowBank is on the Pareto frontier.
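The Pareto claim can be checked directly from the average columns of Table 1. The sketch below copies those (performance, cost) pairs and applies a standard dominance test; `dominates` and `pareto_front` are hypothetical helper names.

```python
# (average performance, average cost) copied from Table 1.
methods = {
    "IO": (60.44, 0.22), "CoT": (60.98, 0.23), "ComplexCoT": (63.85, 0.42),
    "Self-Consistency": (61.49, 1.51), "Multi-agent Debate": (63.86, 3.45),
    "Self-Refine": (57.96, 0.62), "MedPrompt": (62.77, 2.58),
    "MultiPersona": (63.87, 0.41), "GPTSwarm": (63.43, 2.54),
    "ADAS": (63.22, 1.71), "AgentSquare": (66.70, 1.02),
    "AFlow (Qwen3-8B)": (68.56, 1.79), "AFlow (GPT-4o)": (70.40, 1.95),
    "MaAS": (68.09, 1.90), "ScoreFlow": (69.51, 2.37),
    "FlowBank": (73.40, 1.65),
}

def dominates(a, b):
    """a dominates b: no worse on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def pareto_front(points):
    """Names of all points that no other point dominates."""
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )

front = pareto_front(methods)
print(front)
```

On these averages the frontier consists of IO, CoT, MultiPersona, AgentSquare, and FlowBank: the first four trade performance for lower cost, and FlowBank is the only automated method besides AgentSquare that is not dominated.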

Ablation Study

Each stage contributes: replacing DiverseFlow with AFlow, CuraFlow with the top-$k$ accuracy set, or the GNN selector with a flat MLP classifier each degrades performance. Training on the full uncurated pool inflates oracle coverage but hurts final performance, showing that redundancy makes selection harder. The best results require all three stages together.

Table 2. Ablation study of FlowBank on MATH and DROP. "Oracle" denotes an oracle selector on test data.

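| Stage 1 | Stage 2 | Stage 3 | MATH Perf. | MATH Oracle | DROP Perf. | DROP Oracle |
|---|---|---|---|---|---|---|
| ✓ | ✓ | ✓ | 69.34 | 83.74 | 83.49 | 90.47 |
| ✗ (AFlow) | ✓ | ✓ | 67.69 | 79.01 | 82.98 | 89.95 |
| ✓ | ✗ (Top-$k$ Acc. set) | ✓ | 67.90 | 80.66 | 82.37 | 86.73 |
| ✓ | ✗ (Full pool) | ✓ | 68.11 | 92.04 | 82.45 | 96.59 |
| ✓ | ✓ | ✗ (MLP Classifier) | 68.72 | 83.74 | 82.83 | 90.47 |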

Hyperparameter Effects

AMC performance rises from 61.07 at $K{=}1$ to 67.94 at $K{=}5$ before dipping at $K{=}6$, supporting a compact, diverse portfolio. On MMLU Pro, larger $\lambda$ steadily lowers cost while performance stays stable for $\lambda \in [0.1, 0.4]$ and drops at $\lambda{=}0.5$. Overall, FlowBank is robust to moderate $\lambda$ and maintains a strong trade-off without heavy tuning.

Figure 6. Left: impact of portfolio size $K$ on AMC. Right: impact of cost-regularization weight $\lambda$ on MMLU Pro.

Conclusion

We revisit agentic workflow optimization from a different perspective: rather than choosing between a single workflow for all queries and synthesizing a new workflow per query, we recover query-level adaptivity through offline computation and reusable workflow assets. FlowBank operationalizes this with three stages — Diversifying, Curating, and Matching — that discover complementary workflows, distill them into a compact portfolio, and assign each query to the right workflow at inference time. Across five benchmarks, this precompute-and-reuse strategy delivers the strongest average performance among the evaluated methods while maintaining competitive cost.

BibTeX

@article{yuan2026flowbank,
  title={FlowBank: Query-Adaptive Agentic Workflow Optimization through Precompute-and-Reuse},
  author={Yuan, Lingzhi and Deng, Chenghao and Yu, Fangxu and Chakraborty, Souradip and Rostami, Mohammad and Huang, Furong},
  journal={arXiv preprint arXiv:2606.11290},
  year={2026}
}