Frame-Wise Breath Detection with Self-Training:
An Exploration of Enhancing Breath Naturalness in Text-to-Speech

Authors: Dong Yang, Tomoki Koriyama, Yuki Saito

arXiv GitHub Poster

Please use headphones to listen to these utterances.

1. VITS-generated synthetic speech

The training speech data includes the breath sounds of these speakers.

Ground truth | VITS | VITS w/ baseline | VITS w/ proposed
1. (speaker id: 1678) Partly, said Margaret sighing, because it is so very different from Helstone.
2. (speaker id: 1649) When he ended they applauded his speech mildly; but it was chiefly for the reason that he had spoken so forcibly and well.
3. (speaker id: 716) As I stared at them, they met my gaze; and then first one and then another turned away from my direct stare, and looked at me in an odd furtive manner.
4. (speaker id: 6139) Number one doesn't sound very inviting said rob with a sour grimace. Who is your number two? Lloyd held out the second envelope.
5. (speaker id: 3118) After the most accurate examination of which I am capable, I venture to affirm, that the rule here holds without any exception.
6. (speaker id: 8718) Josiah he stayed with her, an between him an mord, they helped her along, but I had to git out and scratch for a livin.

2. VITS-generated synthetic speech

The training speech data does not include the breath sounds of the two speakers.

VITS with the baseline model cannot synthesize the breath sounds of speaker 3630.

We use "<" as the breath mark within the text. The texts are taken from the test set of other speakers (breath detection here is performed by the proposed model).

VITS w/ baseline | VITS w/ proposed
1. (speaker id: 3630) At that moment the small window in the lodge opened < a hand passed through < seized the key and the candlestick < and lighted the taper.
2. (speaker id: 1811) But when his mother assures him that the stars always appear so to her < and he turns to look in her face he says < why mother < how beautiful you look.
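
As a small illustration of the "<" breath-mark convention used in the texts above, the following Python sketch splits an annotated sentence into the phrases between breaths and counts the marks. It is only an assumption about how such text could be inspected, not the preprocessing used by the authors; the helper names `split_at_breaths` and `count_breaths` are hypothetical.

```python
# Hypothetical helpers for inspecting breath-annotated text.
# The only detail taken from this page is that "<" marks a breath position.
BREATH_MARK = "<"


def split_at_breaths(text, mark=BREATH_MARK):
    """Return the phrases between breath marks, with surrounding spaces removed."""
    return [phrase.strip() for phrase in text.split(mark) if phrase.strip()]


def count_breaths(text, mark=BREATH_MARK):
    """Return the number of breath marks in the text."""
    return text.count(mark)


if __name__ == "__main__":
    sample = (
        "At that moment the small window in the lodge opened < a hand passed "
        "through < seized the key and the candlestick < and lighted the taper."
    )
    for phrase in split_at_breaths(sample):
        print(phrase)
    print("breath marks:", count_breaths(sample))
```

For the first sample above, this yields four phrases separated by three breath marks.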

3. FastSpeech2-generated synthetic speech

Pre-trained HiFiGAN is used as the vocoder.

FastSpeech2 cannot generate Mel-spectrograms of breath sounds with adequate quality.

Ground truth | FastSpeech2 | FastSpeech2 w/ proposed
1. (speaker id: 14) Isabella corroborated it, my dearest catherine you cannot form an idea of the dirt, come you must go, you cannot refuse going now.
2. (speaker id: 850) What side would be likely to prevail in such a conflict, must depend on the means which the contending parties could employ toward insuring success.