Abstract: Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content. VC methods commonly use an encoder-decoder architecture, where disentangling the speaker's identity from the linguistic information is crucial. However, the disentanglement achieved by these methods is limited, because the speaker features still depend on the phonetic content of the utterance; this dependency is amplified in attention-based methods. To address this, we introduce a novel masking mechanism applied to the input before speaker encoding: discrete speech units that correlate strongly with phoneme classes are masked. Our goal is to reduce the phonetic dependency of the speaker features by restricting access to part of the phonetic information. Since the approach operates at the input level, it is applicable to any encoder-decoder based VC framework. It improves disentanglement and conversion performance across multiple VC methods and is particularly effective for the attention-based method, yielding a 44% relative improvement in objective intelligibility.
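The masking is applied to the speaker-encoder input before any VC-specific processing, which is why it can be plugged into any encoder-decoder framework. Below is a minimal sketch of the idea, assuming frame-aligned discrete units (e.g., from HuBERT features quantized with k-means) and a precomputed set of phoneme-correlated unit IDs are already available; the function name, inputs, and exact masking policy are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of discrete-unit-based time masking for the speaker encoder input.
# Assumptions (not from the paper's code): frame-level discrete units are already
# extracted, and the set of phoneme-correlated unit IDs was identified offline.
import numpy as np

def mask_speaker_encoder_input(features, unit_ids, phonetic_units,
                               mask_ratio=0.2, mask_value=0.0):
    """Zero out frames whose discrete unit correlates strongly with a phoneme class.

    features       : (T, D) array fed to the speaker encoder (e.g., mel frames)
    unit_ids       : (T,)   discrete unit ID per frame
    phonetic_units : set of unit IDs considered highly phoneme-correlated
    mask_ratio     : fraction of frames allowed to be masked (10% / 20% in the paper)
    """
    masked = features.copy()
    # Frames whose unit belongs to the phoneme-correlated set are masking candidates.
    candidates = np.flatnonzero(np.isin(unit_ids, list(phonetic_units)))
    n_mask = min(len(candidates), int(mask_ratio * len(unit_ids)))
    if n_mask > 0:
        chosen = np.random.choice(candidates, size=n_mask, replace=False)
        masked[chosen] = mask_value  # hide some phonetic content from the speaker encoder
    return masked
```

Because the content encoder still sees the unmasked input, only the speaker branch loses access to the masked phonetic frames, which is what pushes the speaker features toward phoneme-independent information.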

---------> Discrete Unit Based Time Masking For Voice Conversion <----------


Figure 1: Standard encoder-decoder based VC training vs VC training with the proposed masking approach

Figure 2: Proposed training for Encoder-Decoder based VC Frameworks

-----------------------------> Speech Samples <---------------------------

Experimental Setup:

The samples are from speakers that are unseen during VC training. For each conversion, a random reference utterance (~3 s) from the target speaker is used (a sketch of this reference selection follows the method list).
Methods:
  • VC with baseline TriAAN-VC [1] and VQMIVC [2]
  • TriAAN-VC with Discrete Unit Based Time Masking at 10% and 20% (proposed)
  • VQMIVC with Discrete Unit Based Time Masking at 10% and 20% (proposed)
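As a hedged sketch of how the ~3 s target reference mentioned above could be drawn, the snippet below crops a random segment from a target-speaker utterance; the 16 kHz sample rate and the cropping policy are assumptions for illustration, not details taken from the paper or demo.

```python
# Illustrative helper for drawing a random ~3 s reference from a target-speaker
# utterance (16 kHz assumed; the demo's actual sampling policy is not specified).
import random

def random_reference(waveform, sample_rate=16000, ref_seconds=3.0):
    """Return a random crop of roughly ref_seconds from a 1-D waveform."""
    ref_len = int(ref_seconds * sample_rate)
    if len(waveform) <= ref_len:
        return waveform  # utterance shorter than ~3 s: use it whole
    start = random.randint(0, len(waveform) - ref_len)
    return waveform[start:start + ref_len]
```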

TriAAN-VC

Ground Truth | TriAAN-VC Baseline | TriAAN-VC w/ Discrete Unit Based Time Masking (10%, proposed) | TriAAN-VC w/ Discrete Unit Based Time Masking (20%, proposed)

Female-to-Female

Source: p343 Target: p249
Source: p277 Target: p249

Female-to-Male

Source: p277 Target: p271
Source: p343 Target: p271

Male-to-Female

Source: p271 Target: p282
Source: p277 Target: p282

Male-to-Male

Source: p245 Target: p285
Source: p271 Target: p285



VQMIVC

Ground Truth | VQMIVC Baseline | VQMIVC w/ Discrete Unit Based Time Masking (10%, proposed) | VQMIVC w/ Discrete Unit Based Time Masking (20%, proposed)

Female-to-Female

Source: p330 Target: p240

Male-to-Female

Source: p256 Target: p228

Female-to-Male

Source: p247 Target: p267

Male-to-Male

Source: p287 Target: p311
[1] Hyun Joon Park, Seok Woo Yang, Jin Sob Kim, Wooseok Shin, and Sung Won Han, "TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion," in Proc. IEEE ICASSP 2023, 2023, pp. 1-5.
[2] Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, and Helen Meng, "VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion," in Proc. Interspeech 2021, 2021, pp. 1344-1348.