Grant Catalogue | Page 11

Rapid Unsupervised Speaker Adaptation for HMM-based Text-to-Speech Synthesis

2010 National Grants

Asst. Prof. Dr. Cenk Demiroğlu
DEPARTMENT Electrical & Electronics Engineering
CONTACT [email protected]
FUNDING SCHEME TÜBİTAK 3501
START DATE 01.03.2010
DURATION 36 months
OZU BUDGET 108,400.00 TL

ABSTRACT

Concatenative synthesis has been the dominant approach in text-to-speech (TTS) synthesis over the last decade. Despite its success, the concatenative approach has several disadvantages. One is the spurious errors that appear during synthesis, which can significantly distract the listener. A second is the difficulty of modifying voice characteristics, voice style, and emotions.

Recently, the statistical speech synthesis (SSS) approach has been proposed, which can address both of these problems. In the SSS approach, statistical models are used to synthesize the speech sounds. The SSS approach can generate smooth synthetic speech that does not suffer from the spurious errors of concatenative synthesis. Moreover, voice style and characteristics, as well as emotions, can be easily transformed. In the latest Blizzard Challenge TTS evaluations, one instance of the SSS techniques outperformed the concatenative synthesis techniques in Mean Opinion Score (MOS) quality tests. The high quality and intelligibility of the speech it generates, the flexibility it offers in voice, speaker, and emotion conversion, and its small memory requirements make SSS a strong candidate to replace the concatenative systems that are the most popular TTS systems in use today. SSS systems have already started to enable high-quality embedded speech synthesis products because of their small memory footprint. Moreover, SSS technology is receiving increasing attention from companies that offer server-based or PC-based TTS applications because of its competitive voice quality and its flexibility in voice conversion. The success of current SSS systems is expected to open new research avenues, which will lead to new discoveries and potentially make SSS the dominant TTS technology in the next decade.

One of the most exciting research directions in the SSS field is speaker adaptation, where the goal is to adapt the voice model to a target speaker who does not exist in the training data. Maximum a posteriori (MAP) estimation and maximum likelihood linear regression (MLLR) are two of the most commonly used adaptation approaches. MLLR performs better than MAP when the amount of adaptation data is small. Therefore, MLLR adaptation is more suitable for rapid adaptation. Several variations of the MLLR technique, and combinations of the MLLR and MAP techniques, are used in the context of SSS.

Unsupervised adaptation is difficult to achieve with SSS because of the rich context information used in modeling speech sounds. It is very difficult to generate the correct context using speech recognition tools, as is commonly done in unsupervised speech recognition systems. There is only one paper on unsupervised adaptation for SSS [1]. The idea proposed in [1] attempts to extract only the triphone context, ignoring other information such as the syllable and the location of the sound within the syllable.
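
The abstract's claim that MLLR suits small amounts of adaptation data better than MAP can be illustrated with a toy example. The sketch below is not the project's method. It assumes one-dimensional features, a hard assignment of frames to Gaussians, and equal variances, under which the maximum likelihood estimate of a single shared mean transform reduces to ordinary least squares. The function names and the prior weight `tau` are hypothetical and chosen only for illustration.

```python
# Toy 1-D comparison of MAP and MLLR-style mean adaptation.
# Assumptions (not from the source): hard frame-to-Gaussian alignment,
# equal variances, and a single global affine transform for all means.

def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP mean update: interpolate the prior mean with the observed frames.

    tau is a hypothetical prior weight; larger values trust the prior more.
    """
    n = len(frames)
    if n == 0:
        return prior_mean  # a Gaussian with no data is left unchanged
    return (tau * prior_mean + sum(frames)) / (tau + n)


def estimate_shared_transform(prior_means, frames_per_gaussian):
    """Fit one affine map mu' = a * mu + b shared by every Gaussian.

    With equal variances and hard alignment this is a least-squares
    regression of each frame on the prior mean of its Gaussian.
    """
    pairs = [(mu, x)
             for mu, frames in zip(prior_means, frames_per_gaussian)
             for x in frames]
    n = len(pairs)
    if n == 0:
        return 1.0, 0.0  # no adaptation data: identity transform
    s_mu = sum(mu for mu, _ in pairs)
    s_x = sum(x for _, x in pairs)
    s_mumu = sum(mu * mu for mu, _ in pairs)
    s_mux = sum(mu * x for mu, x in pairs)
    denom = n * s_mumu - s_mu * s_mu
    if denom == 0:
        # Only one distinct prior mean was observed: estimate a bias only.
        return 1.0, (s_x - s_mu) / n
    a = (n * s_mux - s_mu * s_x) / denom
    b = (s_x - a * s_mu) / n
    return a, b


def apply_transform(prior_mean, a, b):
    """Apply the shared affine transform to one prior mean."""
    return a * prior_mean + b


# Three Gaussians from the average voice; the third has no adaptation frames.
prior_means = [0.0, 1.0, 2.0]
frames_per_gaussian = [[0.4, 0.6], [1.9, 2.1], []]

a, b = estimate_shared_transform(prior_means, frames_per_gaussian)
for mu, frames in zip(prior_means, frames_per_gaussian):
    map_mean = map_adapt_mean(mu, frames)
    mllr_mean = apply_transform(mu, a, b)
    print(f"prior={mu:.2f} frames={len(frames)} "
          f"MAP={map_mean:.3f} MLLR={mllr_mean:.3f}")
```

In this toy setting the third Gaussian receives no frames, so the MAP update leaves its mean at the prior value, while the shared transform still moves it using what was learned from the other two. Real systems use multivariate transforms, soft state occupancies, and regression classes rather than a single global transform.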