Abstract: Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content. Commonly, VC methods use an encoder-decoder architecture, where disentangling the speaker's identity from the linguistic information is crucial. However, the disentanglement approaches used in these methods are limited, as the speaker features depend on the phonetic content of the utterance, compromising disentanglement. This dependency is amplified with attention-based methods. To address this, we introduce a novel masking mechanism applied to the input before speaker encoding, masking certain discrete speech units that correspond highly with phoneme classes. Our work aims to reduce the phonetic dependency of speaker features by restricting access to some phonetic information. Furthermore, since our approach operates at the input level, it is applicable to any encoder-decoder based VC framework. Our approach improves disentanglement and conversion performance across multiple VC methods, and is particularly effective with the attention-based method, where it yields a 44% relative improvement in objective intelligibility.
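The abstract refers to discrete speech units that "correspond highly with phoneme classes", but does not spell out how that correspondence is measured. The sketch below shows one plausible selection criterion, assuming frame-level unit IDs (e.g., k-means clusters of self-supervised features) and frame-level phoneme labels from a forced aligner; the purity-based rule, the threshold value, and the function name are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical selection of "phonetic" discrete units: a unit is kept for
# masking if its occurrences are strongly concentrated on a single phoneme.
# unit_ids and phoneme_ids are assumed to be frame-aligned sequences.
from collections import Counter, defaultdict

def select_phonetic_units(unit_ids, phoneme_ids, purity_threshold=0.6):
    """Return the set of unit IDs whose most frequent phoneme accounts for
    more than `purity_threshold` of that unit's occurrences."""
    per_unit = defaultdict(Counter)
    for u, p in zip(unit_ids, phoneme_ids):
        per_unit[u][p] += 1
    phonetic = set()
    for u, counts in per_unit.items():
        total = sum(counts.values())
        if counts.most_common(1)[0][1] / total > purity_threshold:
            phonetic.add(u)
    return phonetic

# Toy example: unit 3 always co-occurs with phoneme "AA", so it is selected;
# unit 7 is split between "S" and "T" and falls below the threshold.
units    = [3, 3, 7, 7, 3, 9]
phonemes = ["AA", "AA", "S", "T", "AA", "S"]
print(select_phonetic_units(units, phonemes))  # units 3 and 9 pass
```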
---------> Discrete Unit Based Time Masking For Voice Conversion <----------
-----------------------------> Speech Samples <---------------------------
Experimental Setup:
The samples are from speakers unseen during VC training. For each conversion, a random reference utterance (~3 s) from the target speaker is used.
Methods:
- VC with the baselines TriAAN-VC [1] and VQMIVC [2]
- TriAAN-VC with proposed Discrete Unit Based Time Masking: 10% and 20%
- VQMIVC with proposed Discrete Unit Based Time Masking: 10% and 20% (the masking step is sketched below)
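Given a set of phoneme-dependent units, the masking is applied to the speaker-encoder input before encoding, at the 10% or 20% ratios listed above. Below is a minimal sketch under those assumptions; the zero-fill strategy, the frame alignment between features and unit IDs, and the function and variable names are illustrative, not the exact implementation used in the paper.

```python
# Minimal sketch of discrete-unit-based time masking on the speaker-encoder
# input. `phonetic_units` is assumed to be a precomputed set of unit IDs that
# correlate strongly with phoneme classes (e.g., from the selection sketch above).
import numpy as np

def mask_phonetic_frames(features, unit_ids, phonetic_units, mask_ratio=0.2, seed=None):
    """Zero out a fraction of frames whose discrete unit is 'phonetic'.

    features       : (T, D) float array fed to the speaker encoder
    unit_ids       : (T,)   int array of discrete speech-unit IDs per frame
    phonetic_units : set of unit IDs considered highly phoneme-dependent
    mask_ratio     : fraction of the utterance's frames to mask (e.g. 0.1 or 0.2)
    """
    rng = np.random.default_rng(seed)
    candidates = np.flatnonzero(np.isin(unit_ids, list(phonetic_units)))
    n_mask = min(len(candidates), int(round(mask_ratio * len(unit_ids))))
    chosen = rng.choice(candidates, size=n_mask, replace=False)
    masked = features.copy()
    masked[chosen] = 0.0  # one possible choice: replace masked frames with zeros
    return masked

# Example: mask 20% of a 300-frame utterance before speaker encoding.
T, D = 300, 80
feats = np.random.randn(T, D).astype(np.float32)
units = np.random.randint(0, 100, size=T)   # e.g. 100 k-means clusters
phonetic = set(range(0, 40))                # hypothetical "phonetic" unit IDs
masked_feats = mask_phonetic_frames(feats, units, phonetic, mask_ratio=0.2)
```

Because the masking happens on the input itself, the same routine can precede the speaker encoder of any encoder-decoder VC model, which is how it is applied to both TriAAN-VC and VQMIVC here.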
TriAAN-VC
| Conversion | Source | Target | TriAAN-VC Ground Truth | TriAAN-VC Baseline | TriAAN-VC w/ Discrete Unit Based Time Masking (10%) | TriAAN-VC w/ Discrete Unit Based Time Masking (20%) |
|---|---|---|---|---|---|---|
| Female-to-Female | p343 | p249 | (audio) | (audio) | (audio) | (audio) |
| Female-to-Female | p277 | p249 | (audio) | (audio) | (audio) | (audio) |
| Female-to-Male | p277 | p271 | (audio) | (audio) | (audio) | (audio) |
| Female-to-Male | p343 | p271 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Female | p271 | p282 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Female | p277 | p282 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Male | p245 | p285 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Male | p271 | p285 | (audio) | (audio) | (audio) | (audio) |
VQMIVC
| Conversion | Source | Target | VQMIVC Ground Truth | VQMIVC Baseline | VQMIVC w/ Discrete Unit Based Time Masking (10%) | VQMIVC w/ Discrete Unit Based Time Masking (20%) |
|---|---|---|---|---|---|---|
| Female-to-Female | p330 | p240 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Female | p256 | p228 | (audio) | (audio) | (audio) | (audio) |
| Female-to-Male | p247 | p267 | (audio) | (audio) | (audio) | (audio) |
| Male-to-Male | p287 | p311 | (audio) | (audio) | (audio) | (audio) |
[1] Hyun Joon Park, Seok Woo Yang, Jin Sob Kim, Wooseok Shin, and Sung Won Han, “TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion,” in Proc. ICASSP 2023, 2023.
[2] Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, and Helen Meng, “VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion,” in Proc. Interspeech 2021, 2021, pp. 1344-1348.