Diff-VS

Diffusion-based vocal separation built on the EDM framework
Achieves results on par with many discriminative models in both subjective and objective evaluations
Reaches its best performance with as few as 7 sampling steps
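To make the 7-step budget concrete, here is a minimal sketch of the noise-level schedule defined in the EDM paper (Karras et al., 2022), evaluated for 7 steps. The values `sigma_min=0.002`, `sigma_max=80.0` and `rho=7.0` are the EDM paper's defaults and are assumptions here; the settings actually used by Diff-VS are not stated on this page, and `karras_sigmas` is a hypothetical helper name.

```python
import numpy as np

def karras_sigmas(num_steps=7, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels of the EDM schedule, ordered from highest to lowest.

    All default values are assumptions taken from the EDM paper,
    not confirmed settings of Diff-VS.
    """
    ramp = np.linspace(0.0, 1.0, num_steps)
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = (max_inv_rho + ramp * (min_inv_rho - max_inv_rho)) ** rho
    # Append sigma = 0 so that the final step lands on the clean estimate.
    return np.append(sigmas, 0.0)

sigmas = karras_sigmas()
# Each pair (current, next) is one denoising step of the sampler.
for i, (s_cur, s_next) in enumerate(zip(sigmas[:-1], sigmas[1:])):
    print(f"step {i + 1}: sigma {s_cur:.4f} -> {s_next:.4f}")
```

With 7 noise levels plus the final zero, the sampler calls the denoising network 7 times when a first-order (Euler) update is used.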

Objective Evaluation

Train and evaluate on MUSDB

Model	Type	Params	cSDR ↑
HDemucs [6]	disc	42 M	8.13
TFC-TDF V3 [7]	disc	70 M	9.59
BSRNN [5]	disc	37 M	10.01
BS-RoFormer-6L [3]	disc	72 M	10.66
SCNet-L [2]	disc	42 M	10.86
Diff-VS	gen	57 M	10.12

Train on MUSDB + MoisesDB and evaluate on MUSDB

Model	Type	Params	cSDR ↑
SCNet-L [2]	disc	42 M	11.11
SGMSEVS [1]	gen	65 M	8.63
Diff-VS	gen	57 M	10.88
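As a rough illustration of the cSDR column, the sketch below computes a chunk-level SDR: the signal-to-distortion ratio is evaluated on non-overlapping 1-second chunks and the median over chunks is reported. This assumes the common music-demixing convention for cSDR and uses a simplified SDR definition on mono signals; it is not the evaluation code behind the tables above, and the function names are hypothetical.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-8):
    """Plain signal-to-distortion ratio in dB (simplified definition)."""
    signal = np.sum(reference ** 2)
    error = np.sum((reference - estimate) ** 2)
    return 10.0 * np.log10((signal + eps) / (error + eps))

def chunk_sdr(reference, estimate, sample_rate=44100, chunk_seconds=1.0):
    """Median SDR over non-overlapping chunks of a single mono track."""
    hop = int(sample_rate * chunk_seconds)
    scores = []
    for start in range(0, len(reference) - hop + 1, hop):
        ref = reference[start:start + hop]
        if not np.any(ref):
            continue  # skip silent reference chunks, where SDR is not meaningful
        scores.append(sdr_db(ref, estimate[start:start + hop]))
    return float(np.median(scores)) if scores else float("nan")

# Toy example: a random "vocal" signal and an estimate with 10% additive noise.
rng = np.random.default_rng(0)
reference = rng.standard_normal(4 * 8000)
estimate = reference + 0.1 * rng.standard_normal(reference.shape)
score = chunk_sdr(reference, estimate, sample_rate=8000)
print(f"chunk-level SDR: {score:.2f} dB")
```

A dataset-level score would then aggregate the per-track values, typically with another median across tracks.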

Subjective Evaluation

We use MERT-MSE as a proxy for subjective quality [1]

Model	Type	Params	MERT-MSE ↓
SCNet-L [2]	disc	42 M	0.096
SGMSEVS [1]	gen	65 M	0.089
Mel-Roformer [4]	disc	105 M	0.071
Diff-VS V2	gen	54 M	0.083
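The sketch below shows the distance part of such a metric, assuming MERT-MSE is the mean squared error between MERT embeddings of the separated vocals and of the clean reference. Extracting the embeddings is out of scope here, so random arrays stand in for real MERT features; the array shape and the helper name `embedding_mse` are illustrative assumptions.

```python
import numpy as np

def embedding_mse(ref_emb, est_emb):
    """Mean squared error between two embedding arrays of equal shape."""
    ref_emb = np.asarray(ref_emb, dtype=np.float64)
    est_emb = np.asarray(est_emb, dtype=np.float64)
    if ref_emb.shape != est_emb.shape:
        raise ValueError("embeddings must have the same shape")
    return float(np.mean((ref_emb - est_emb) ** 2))

# Stand-in features with shape (frames, dims); real MERT embeddings would go here.
rng = np.random.default_rng(0)
ref_emb = rng.standard_normal((50, 768))
est_emb = ref_emb + 0.3 * rng.standard_normal(ref_emb.shape)
mse = embedding_mse(ref_emb, est_emb)
print(f"embedding MSE: {mse:.3f}")
```

Lower values mean the separated vocals sit closer to the reference in embedding space, which matches the ↓ direction in the table.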

Listening Samples

Listening samples are randomly chosen from the MUSDB test set

Little Chicago’s Finest - My Own (SDR 15.41)

Audio clips: Mixture, Clean Vocals, Diff-VS, Demucs, SCNet

PR - Happy Daze (SDR -0.58)

Audio clips: Mixture, Clean Vocals, Diff-VS, Demucs, SCNet

The Long Wait - Dark Horses (SDR 11.62)

Audio clips: Mixture, Clean Vocals, Diff-VS, Demucs, SCNet

Timboz - Pony (SDR 5.2)

Audio clips: Mixture, Clean Vocals, Diff-VS, Demucs, SCNet

We Fell From The Sky - Not You (SDR 8.44)

Audio clips: Mixture, Clean Vocals, Diff-VS, Demucs, SCNet

Teaser

Ongoing work on Diff-VS V2 achieves results on par with BS-RoFormer and SCNet

Train and evaluate on MUSDB

Model	Type	Params	cSDR ↑
SCNet-L [2]	disc	42 M	10.86
BS-RoFormer-12L [3]	disc	93 M	11.49
Mel-Roformer [4]	disc	105 M	12.08
Diff-VS V2	gen	54 M	11.46

Train on MUSDB + MoisesDB and evaluate on MUSDB

Model	Type	Params	cSDR ↑
SCNet-L [2]	disc	42 M	11.11
SGMSEVS [1]	gen	65 M	8.63
Diff-VS V2	gen	54 M	11.85

Little Chicago’s Finest - My Own

PR - Happy Daze

The Long Wait - Dark Horses

Timboz - Pony

We Fell From The Sky - Not You

References

[1] Bereuter, Paul A., et al. “Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models.” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025.

[2] Tong, Weinan, et al. “SCNet: Sparse compression network for music source separation.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

[3] Lu, Wei-Tsung, et al. “Music source separation with band-split RoPE transformer.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

[4] Wang, Ju-Chiang, Wei-Tsung Lu, and Minz Won. “Mel-band RoFormer for music source separation.” arXiv preprint arXiv:2310.01809 (2023).

[5] Luo, Yi, and Jianwei Yu. “Music source separation with band-split RNN.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 1893-1901.

[6] Rouard, Simon, Francisco Massa, and Alexandre Défossez. “Hybrid transformers for music source separation.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

[7] Kim, Minseok, Jun Hyung Lee, and Soonyoung Jung. “Sound demixing challenge 2023 music demixing track technical report: TFC-TDF-UNet v3.” arXiv preprint arXiv:2306.09382 (2023).

Diff-VS

Efficient Audio-Aware Diffusion U-Net for Vocals Separation

Yun-Ning (Amy) Hung, Richard Vogl, Filip Korzeniowski, Igor Pereira

Moises
