Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

Dong Yang1, Yiyi Cai2, Haoyu Zhang1, Yuki Saito1, Hiroshi Saruwatari1
1The University of Tokyo, 2Independent Researcher

Abstract. Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher–Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth.
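To make the idea of traversing a path at constant Fisher–Rao speed concrete, the following is a minimal numerical sketch, not the paper's implementation. It assumes a toy Gibbs-type path p_beta(x) ∝ exp(-beta * dist[x]) over a small vocabulary, where `dist`, `beta_max`, and all function names are hypothetical and chosen only for illustration. For this family, the Fisher information with respect to beta equals the variance of `dist` under p_beta, so the Fisher–Rao arc length can be integrated on a grid and inverted to place the steps at equal arc-length intervals.

```python
import numpy as np

def gibbs_path(beta, dist):
    """Toy path (assumption): p_beta(x) proportional to exp(-beta * dist[x])."""
    logits = -beta * dist
    logits = logits - logits.max()  # stabilize the softmax
    p = np.exp(logits)
    return p / p.sum()

def fisher_rao_speed(beta, dist):
    """Square root of the Fisher information w.r.t. beta.

    For the Gibbs family above, this is the standard deviation of dist under p_beta.
    """
    p = gibbs_path(beta, dist)
    mean = np.dot(p, dist)
    return np.sqrt(np.dot(p, (dist - mean) ** 2))

def constant_speed_schedule(dist, beta_max, num_steps, grid_size=2001):
    """Return num_steps + 1 values of beta spaced at equal Fisher-Rao arc length."""
    betas = np.linspace(0.0, beta_max, grid_size)
    speed = np.array([fisher_rao_speed(b, dist) for b in betas])
    # Cumulative arc length via the trapezoidal rule.
    segments = 0.5 * (speed[1:] + speed[:-1]) * np.diff(betas)
    arc = np.concatenate([[0.0], np.cumsum(segments)])
    # Invert arc(beta) at equally spaced arc-length targets.
    targets = np.linspace(0.0, arc[-1], num_steps + 1)
    return np.interp(targets, arc, betas)

# Hypothetical distances from each token to a target token.
dist = np.array([0.0, 1.0, 2.0, 3.0])
schedule = constant_speed_schedule(dist, beta_max=5.0, num_steps=8)
print(schedule)
```

In this toy setting the schedule takes small steps in beta where the distribution changes quickly and larger steps where it changes slowly, without any scheduler hyperparameter to tune beyond the grid resolution.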

Architecture of the proposed GibbsTTS model.

The samples below are taken from the subjective evaluations.

The model names are kept consistent with our paper. For example, in “MI-DFM (GibbsTTS) + Numerical KO”, “MI-DFM” denotes the generative method, while “Numerical KO” refers to the employed time scheduler. “MaskGCT (original)” indicates the officially released open-source implementation of MaskGCT.

Seed-TTS test sets

CosyVoice 3 test sets