Phoneme Hallucinator: One-shot Voice Conversion via Set Expansion

Siyuan Shan, Yang Li, Amartya Banerjee, Junier B. Oliva

Department of Computer Science, UNC Chapel Hill, Chapel Hill, NC, USA

Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

[Paper], [Code]

Contribution:



Contents:

Comparing different methods

This section compares Phoneme Hallucinator with several strong baselines such as VQMIVC, YourTTS, kNN-VC and FreeVC.

Source Target VQMIVC YourTTS kNN-VC FreeVC Phoneme Hallucinator

Ablation studies: target audio length

This section presents an ablation study of how target audio length affects the conversion. Specifically, we show the conversion results using the first 1s, 3s, 5s, 7s and 9s of the target audio, respectively.

Source Target VC Results (1s) VC Results (3s) VC Results (5s) VC Results (7s) VC Results (9s)
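
For readers who want to reproduce the truncated targets, the following is a minimal sketch of how the first N seconds of a target waveform could be cut. It assumes the audio is already loaded as a mono sequence of samples; the helper name `first_seconds` and the 16 kHz sample rate are illustrative assumptions, not details taken from the released code.

```python
def first_seconds(samples, sample_rate, seconds):
    """Return the leading `seconds` of a mono waveform (a sequence of samples).

    Hypothetical helper for illustration; if the waveform is shorter than
    requested, the whole waveform is returned.
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    n_samples = int(round(sample_rate * seconds))
    return samples[:n_samples]


# Assumed sample rate; use whatever rate the target audio was loaded at.
sample_rate = 16000
# Placeholder 10-second silent waveform standing in for real target audio.
target = [0.0] * (10 * sample_rate)

# One truncated target per duration used in the ablation.
clips = {s: first_seconds(target, sample_rate, s) for s in (1, 3, 5, 7, 9)}
```

Each entry of `clips` would then serve as the target utterance for one column of the table above.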

Ablation studies: number of hallucinated features

This section presents an ablation study of the number of hallucinated features, ranging from 5,000 to 50,000.

Source Target VC Results (5,000) VC Results (15,000) VC Results (30,000) VC Results (50,000)
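
As a rough intuition for why the size of the feature set can matter, the sketch below assumes a kNN-VC-style matching step: each source feature is replaced by the mean of its nearest neighbors in the target feature set, and hallucinated features are simply appended to that set. The function `knn_convert`, the toy 2-D vectors, and the value of `k` are hypothetical; this is not the released implementation.

```python
import math


def knn_convert(source_feats, target_set, k=4):
    """Replace each source feature by the mean of its k nearest target features.

    Illustrative only: features are plain lists of floats and the distance
    is Euclidean.
    """
    converted = []
    for s in source_feats:
        nearest = sorted(target_set, key=lambda t: math.dist(s, t))[:k]
        converted.append(
            [sum(t[d] for t in nearest) / len(nearest) for d in range(len(s))]
        )
    return converted


# Toy data: two "real" target features plus two made-up hallucinated ones.
real = [[0.0, 0.0], [1.0, 1.0]]
hallucinated = [[0.5, 0.5], [0.4, 0.6]]  # placeholder values, not model output
expanded = real + hallucinated

# With the expanded set, the source feature finds much closer neighbors.
out = knn_convert([[0.45, 0.55]], expanded, k=2)
```

In this toy case the two nearest neighbors both come from the appended features, so the converted feature stays close to the source feature instead of being pulled toward the sparse real set.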

Ablation studies: architecture

This section presents an ablation study of the model architecture. MOD denotes the modulation mechanism, CAT denotes the concatenation mechanism, and PEQ denotes the permutation-equivariant embedding.

Source Target Full model w/o MOD w/o CAT w/o PEQ

Cross-lingual voice conversion

This section shows cross-lingual voice conversion results. Note that our phoneme hallucinator is trained on English data only.

Source Target kNN-VC Phoneme Hallucinator
German
Spanish
Chinese
Spanish