Sequence-to-Sequence Acoustic Modeling with Semi-Stepwise Monotonic Attention for Speech Synthesis

Authors

Xiao Zhou xiaozh@mail.ustc.edu.cn
Zhen-Hua Ling zhling@ustc.edu.cn
Li-Rong Dai lrdai@ustc.edu.cn
Ya-Jun Hu yjhu@iflytek.com

Abstract

Encoder-decoder models with attention have become a popular architecture for sequence-to-sequence (Seq2Seq) acoustic modeling in speech synthesis. To improve the robustness of the attention mechanism, methods that exploit the monotonic alignment between phone sequences and acoustic feature sequences have been proposed, such as stepwise monotonic attention (SMA). However, the phone sequences derived by grapheme-to-phoneme (G2P) conversion may not contain the pauses at phrase boundaries in utterances, which challenges the assumption of strictly stepwise alignment in SMA. Therefore, this paper proposes semi-stepwise monotonic attention (SSMA) to improve the performance of Seq2Seq speech synthesis when phrase boundaries are unavailable at both the training and synthesis stages. In this method, hidden states are introduced to absorb the pause segments in utterances in an unsupervised way. Thus, the attention at each decoding frame has three options: moving forward to the next phone, staying unmoved, or jumping to a hidden state. Experimental results show that SSMA achieves better naturalness of synthetic speech than SMA when phrase boundaries are not available. In addition, the pause positions derived from the alignment paths of SSMA match the manually labelled phrase boundaries quite well.
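
The three per-frame options described in the abstract can be illustrated with a toy hard-alignment sampler. This is only a sketch under stated assumptions, not the paper's attention formulation: the real SSMA would compute its transition probabilities from decoder states, whereas here they are fixed constants. The names `sample_alignment`, `pause_positions`, `p_stay`, and `p_pause` are hypothetical, and so is the rule that a hidden state can only be left by moving forward to the next phone.

```python
import random


def sample_alignment(num_phones, num_frames, p_stay=0.7, p_pause=0.05, seed=0):
    """Sample one hard alignment path as a list of (phone_index, in_pause).

    Assumption: each phone is followed by one optional hidden pause state.
    The probabilities are fixed here purely for illustration.
    """
    rng = random.Random(seed)
    phone, in_pause = 0, False
    path = []
    for _ in range(num_frames):
        path.append((phone, in_pause))
        r = rng.random()
        if r < p_stay:
            continue  # option 1: stay unmoved (on the phone or in the hidden state)
        if not in_pause and r < p_stay + p_pause:
            in_pause = True  # option 2: jump to the hidden state after this phone
        elif phone < num_phones - 1:
            phone, in_pause = phone + 1, False  # option 3: move forward to the next phone
    return path


def pause_positions(path):
    """Return the phone indices after which the path visited a hidden pause state."""
    return sorted({phone for phone, in_pause in path if in_pause})


if __name__ == "__main__":
    path = sample_alignment(num_phones=8, num_frames=60)
    print(path)
    print("pauses after phones:", pause_positions(path))
```

Reading the visited hidden states off a path, as `pause_positions` does, mirrors the idea of deriving pause positions from the alignment paths, although the paper's exact procedure may differ.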

Audio Samples

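Text	薄弱校经费少条件差待遇低人才留不住，有的中学竟连语文数学课都需外聘教师 (Weak schools have little funding, poor conditions, and low pay, and cannot retain talent; some middle schools even have to hire outside teachers for Chinese and mathematics classes)	作为美国邻国的墨西哥，受到影响的并不仅限于贸易，移民反毒等领域也包括在内 (For Mexico, a neighbor of the United States, the impact is not limited to trade; areas such as immigration and anti-drug efforts are also affected)	教学大楼科研楼美术楼音乐楼图书馆体育馆等，壮观雄伟，气势非凡 (The teaching building, research building, art building, music building, library, gymnasium, and others are grand and magnificent, with an extraordinary air)	四，请看周恩来对艺术的又一关怀之情，四，长征组歌的另一版本，五孙焕英 (Four, see another instance of Zhou Enlai's care for the arts; four, another version of the Long March Suite; five, Sun Huanying)	可我买化肥农药雇用机械的费用，每亩少说也在两百元，一掐算没有多大利 (But what I spend on fertilizer, pesticide, and hired machinery comes to at least two hundred yuan per mu, so when I do the sums there is not much profit)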
SMA-PB
SSMA-PB
SMA+PB