Preprint

PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents

Zhenbang Du^1,* Jun Luo^2,* Zhiwei Zheng³ Xiangchi Yuan¹ Kejing Xia¹ Dachuan Shi¹ Qirui Jin¹ Qijia He² Shaofeng Zou⁴ Yingbin Liang² Wenke Lee¹

¹Georgia Institute of Technology ²Ohio State University ³University of Pennsylvania ⁴Arizona State University

* Equal contribution

arXiv Code HF Model HF Dataset

PACT comparison with hint-guided training and prompt-only rollout optimization — PACT keeps rollouts prompt-only, then uses expert traces as privileged optimization signals for training.

Abstract

TL;DR: PACT uses expert tool-use traces only during optimization, not rollout generation, combining trace-conditioned RL with component-aware SFT.

Multi-turn tool-use agents must reason, call tools, and adapt to observations across several interaction turns. Post-training such agents is challenging, as reinforcement learning often suffers from sparse rewards and weak credit assignment despite matching the prompt-only inference setting, while supervised fine-tuning on expert traces provides dense process supervision but can over-constrain the model to fixed trajectories. To tackle this, we propose PACT, a Privileged trAce Co-Training framework for multi-turn tool-use agents. The key idea is to use expert traces only as training-time optimization signals rather than rollout-time hints. PACT keeps rollout generation prompt-only, then uses expert traces to guide optimization through two complementary signals: a trace-conditioned RL surrogate that evaluates prompt-only rollouts under expert-trace context, and a component-aware SFT loss that supervises reasoning prefixes and tool-calls with annealed strength. To reduce over-reliance on the training-only trace context, PACT further introduces a prompt-only anchoring. We also provide a latent-trace view that connects the two trace-based objectives and explains how expert traces can guide optimization without being used during rollout generation. Experiments on FTRL, BFCL, and ToolHop show that PACT consistently improves over strong SFT- and RL-based baselines, highlighting the value of privileged trace co-training for multi-turn tool-use learning.

Motivation

Full-trace SFT can over-imitate

Expert traces are useful, but multi-turn tool-use tasks admit multiple valid paths. Treating a single trace as a fixed target can overfit teacher-specific decisions.

Prompt-only RL is under-guided

Rollout RL matches inference, but sparse trajectory-level rewards give weak credit assignment for intermediate reasoning, tool choices, and arguments.

Hints create input mismatch

Using expert traces as rollout hints changes the training input condition. PACT instead reserves traces for optimization while sampling rollouts from the original prompt.

Method

PACT decouples exploration from privileged supervision. The policy samples prompt-only rollouts, scores those rollouts under an expert-trace context, and supervises selected model-controllable parts of the expert trace.

Prompt-only rollout

For each prompt, PACT samples rollouts only from the original user prompt, preserving the same input condition used at inference time.

Trace-conditioned RL

The same prompt-only rollouts are evaluated under a privileged expert-trace context, producing a reward-weighted optimization signal without trace-guided generation.

Component-aware SFT

Reasoning prefixes and complete tool-call spans receive supervised loss, while environment tool responses remain context and are never prediction targets.

Results

PACT improves average performance across FTRL, BFCL, and ToolHop for both evaluated Qwen3 model sizes.

Qwen3-1.7B Avg 24.90 vs. 21.70 best baseline

Qwen3-4B Avg 36.90 vs. 33.07 best baseline

Qwen3-4B FTRL Solve-R 42.41 vs. 37.17 best baseline

Qwen3-4B BFCL Avg 21.26 vs. 18.83 best baseline

Method	FTRL			BFCL				ToolHop	Avg
Method	Solve-R	Solve-P	Solve-F1	Multi-Turn	Search	Memory	Avg	AC	Avg
Qwen3-1.7B
Vanilla	21.66	20.78	19.72	13.00	1.00	8.60	7.53	25.63	19.06
SFT	22.36	17.89	17.66	13.38	2.50	8.60	8.16	23.32	17.88
GRPO	21.57	17.37	18.13	11.88	3.50	9.25	8.21	25.13	18.08
FTRL	22.78	22.06	21.06	12.62	6.00	11.18	9.93	26.43	20.45
ToolRL	22.79	17.93	17.99	13.30	2.50	8.39	8.06	25.83	18.52
CHORD	22.68	19.14	18.50	12.62	3.50	6.45	7.52	27.14	19.00
MatchTIR	24.12	21.50	21.43	14.00	4.50	9.03	9.18	32.26	21.70
SFT→MatchTIR	23.39	19.64	19.72	13.63	5.50	11.18	10.10	31.06	20.78
PACT	28.33	26.94	22.93	14.88	7.00	13.63	11.84	34.47	24.90
Qwen3-4B
Vanilla	32.14	31.66	28.60	23.50	12.00	16.34	17.28	31.86	28.31
SFT	28.19	29.32	24.67	22.62	4.00	12.90	13.17	30.25	25.12
GRPO	33.12	31.80	30.84	22.00	9.00	15.48	15.49	34.47	29.14
FTRL	31.98	33.24	31.34	22.38	9.00	16.77	16.05	43.02	31.13
ToolRL	32.59	28.98	28.28	23.00	11.00	17.42	17.14	34.27	28.25
CHORD	32.69	29.84	29.88	22.00	12.50	15.48	16.66	37.69	29.35
MatchTIR	36.10	30.60	31.60	24.38	17.50	14.62	18.83	47.54	32.93
SFT→MatchTIR	37.17	30.40	32.03	23.62	17.85	13.76	18.41	47.34	33.07
PACT	42.41	34.69	36.60	27.50	18.00	18.28	21.26	49.55	36.90

Complete main results from the paper. AC denotes Answer Correctness.

BibTeX

@article{du2026pact,
  title={PACT: Privileged Trace Co-Training for Multi-Turn Tool-Use Agents},
  author={Du, Zhenbang and Luo, Jun and Zheng, Zhiwei and Yuan, Xiangchi and Xia, Kejing and Shi, Dachuan and Jin, Qirui and He, Qijia and Zou, Shaofeng and Liang, Yingbin and Lee, Wenke},
  journal={arXiv preprint arXiv:2606.16215},
  year={2026}
}