Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

Dong Yang¹, Yiyi Cai², Haoyu Zhang¹, Yuki Saito¹, Hiroshi Saruwatari¹
¹The University of Tokyo, ²Independent Researcher


Abstract. Metric-induced discrete flow matching (MI-DFM) exploits token-latent geometry for discrete generation, but its practical use is limited by two issues: heuristic schedulers requiring hyperparameter search, and finite-step path-tracking error from its first-order continuous-time Markov chain (CTMC) solver. We address both issues. First, we derive a kinetic-optimal scheduler for prescribed scalar-parameterized probability paths, and instantiate it for MI-DFM as a training-free numerical schedule that traverses the path at constant Fisher–Rao speed. Second, we introduce a finite-step moment correction that adjusts the jump probability while preserving the CTMC jump destination distribution. We validate the resulting method, GibbsTTS, on codec-based zero-shot text-to-speech (TTS). Under controlled comparisons with a unified architecture and large-scale dataset, GibbsTTS achieves the best objective naturalness and is preferred in subjective evaluations over masked discrete generative baselines. Additionally, in comparison with the evaluated state-of-the-art TTS systems, GibbsTTS shows strong speaker similarity, achieving the highest similarity on three of four test sets and ranking second on the fourth.
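To make "constant Fisher–Rao speed" concrete, the following is a minimal numerical sketch, not the paper's implementation. It assumes a Gibbs-type path over a single token, p_β(x) ∝ exp(−β·d(x, x₁)), with scalar parameter β and a vector `dist` of distances to the target token; this path form, the function names, and the grid sizes are all illustrative assumptions. For such an exponential family, the Fisher information with respect to β is the variance of d under p_β, so the Fisher–Rao arc length is the integral of its square root, and a constant-speed schedule is obtained by inverting the cumulative arc length.

```python
import numpy as np

def fisher_rao_speed(beta, dist):
    """sqrt of the Fisher information of p_beta(x) ∝ exp(-beta * dist[x]) w.r.t. beta.

    For this one-parameter exponential family the Fisher information
    equals Var_{p_beta}[dist]. (Assumed path form, for illustration only.)
    """
    dist = np.asarray(dist, dtype=float)
    logits = -beta * dist
    logits = logits - logits.max()      # stabilise the softmax
    p = np.exp(logits)
    p /= p.sum()
    mean = (p * dist).sum()
    var = (p * (dist - mean) ** 2).sum()
    return np.sqrt(var)

def constant_speed_schedule(dist, beta_min, beta_max, n_grid=2001, n_steps=16):
    """Return (t, beta(t)) such that the path is traversed at constant Fisher-Rao speed.

    Hypothetical helper: tabulates arc length on a beta grid with the
    trapezoid rule, then inverts it by linear interpolation.
    """
    betas = np.linspace(beta_min, beta_max, n_grid)
    speed = np.array([fisher_rao_speed(b, dist) for b in betas])
    segments = 0.5 * (speed[1:] + speed[:-1]) * np.diff(betas)
    arc = np.concatenate([[0.0], np.cumsum(segments)])   # s(beta), nondecreasing
    t = np.linspace(0.0, 1.0, n_steps + 1)
    # Equal increments in t correspond to equal increments in arc length.
    return t, np.interp(t * arc[-1], arc, betas)

# Toy usage: five tokens with made-up distances to the target token.
toy_dist = np.array([0.0, 1.0, 2.0, 3.5, 5.0])
t_grid, beta_grid = constant_speed_schedule(toy_dist, 0.0, 10.0, n_steps=8)
```

In this sketch the schedule needs no training and no tuned hyperparameters beyond the endpoints of β: it takes small steps in β where the distribution changes quickly and larger steps where it is nearly stationary.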

Architecture of the proposed GibbsTTS model.

The audio samples below are taken from the subjective evaluations.

The model names are kept consistent with our paper. For example, in “MI-DFM (GibbsTTS) + Numerical KO”, “MI-DFM” denotes the generative method, while “Numerical KO” refers to the employed time scheduler. “MaskGCT (original)” indicates the officially released open-source implementation of MaskGCT.


Seed-TTS test sets

CosyVoice 3 test sets