Speech-laugh synthesis

Overview

This page presents examples of synthesized speech-laugh generated by the model described in our research paper. Samples 1 to 4 appear in the paper; the others do not.

Abstract: This study is the first challenge of building a synthetic speech-laugh model via a deep learning technique. To maintain the phonetic intelligibility of synthesized speech-laugh, the model was trained with nonlaughing read speech material for both phones of speech-laugh (SL) and of speech (SP). To control laughing onset in SL, the model was also trained using SL material only for the phones of SL instances. The listening tests revealed that the naturalness score for synthesized female SL was as high as that for human SL and that the laughter-likeness score for synthesized SL was higher than that for synthesized SP in almost all conditions. The dictation test revealed that the training for phonetic intelligibility in SL synthesis was highly effective for synthesized SL. However, the difference between segmented SL onset and correct onset was greater for synthesized SL with phonetic intelligibility training than for that without training.

Index Terms: speech-laugh synthesis, paralinguistic information, laughter onset controllability, naturality, intelligibility

R. Setoguchi and Y. Arimoto, “Assessment of the synthetic quality and controllability of laughing onset in speech-laugh synthesis,” in Proceedings of Interspeech 2025, 2025. (accepted)

Sample 1: synthesized speech-laugh in the closed condition from the female model with pretraining, with a high naturalness score (Figure 5 (a) in the paper)

Naturalness score: 4.15 out of 5
Input text (Japanese): "h i cl k a k a cl t e r u y o" ("It's stuck")

Sample 2: synthesized speech-laugh in the closed condition from the female model with pretraining, with a high laughter-likeness score (Figure 5 (b) in the paper)

Laughter-likeness score: 4.00 out of 5
Input text (Japanese): "d o sh i t a N d a r o n e" ("I wonder what's going on")

Sample 3: synthesized speech-laugh from the model with pretraining, with a low CER (Figure 6 (a) in the paper)

CER: 0
Input text (Japanese): "n a m a e o k a i t e o k i n a s a i" ("Write your name down")

Sample 4: synthesized speech-laugh from the model without pretraining, with a high CER (Figure 6 (b) in the paper)

CER: 0.90
Input text (Japanese): "n a m a e o k a i t e o k i n a s a i" ("Write your name down")
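
For readers unfamiliar with the CER values reported for Samples 3 and 4, the following is a minimal sketch of how such a value is commonly computed. It assumes CER means character error rate in its usual form: the Levenshtein edit distance between a listener's dictation and the reference text, divided by the reference length. The scoring procedure actually used in the paper may differ (for example in text normalization or in the script used for transcription), and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Insertions, deletions, and substitutions each cost 1.
    """
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute r with h, or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length.

    Assumption: both strings are already normalized to the same script.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Illustrative strings only, not data from the listening experiment.
# The kana rendering of the Sample 3 and 4 input text is an assumption.
reference = "なまえをかいておきなさい"
print(cer(reference, reference))                   # identical dictation
print(cer(reference, "なまえをかいておきなさえ"))  # one substituted character
```

Under this definition, a CER of 0 means the dictation matched the reference exactly, while a value near 1 means almost every character had to be edited. Because insertions are counted, the value can exceed 1 when the dictation is much longer than the reference.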

Sample 5: synthesized speech-laugh in the open condition from the female model with pretraining, with a high naturalness score

Naturalness score: 3.50 out of 5
Input text (Japanese): "m a cl t e m a cl t e m a cl t e" ("Wait wait wait")

Sample 6: synthesized speech-laugh in the open condition from the male model with pretraining, with a high laughter-likeness score

Laughter-likeness score: 3.76 out of 5
Input text (Japanese): "u w a" ("Wow")