
Diff-VS



Objective Evaluation

Train and evaluate on MUSDB

| Model              | Type | Params | cSDR ↑ |
| ------------------ | ---- | ------ | ------ |
| HDemucs [6]        | disc | 42 M   | 8.13   |
| TFC-TDF V3 [7]     | disc | 70 M   | 9.59   |
| BSRNN [5]          | disc | 37 M   | 10.01  |
| BS-RoFormer-6L [3] | disc | 72 M   | 10.66  |
| SCNet-L [2]        | disc | 42 M   | 10.86  |
| Diff-VS            | gen  | 57 M   | 10.12  |
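For context, cSDR is typically computed chunk-wise: the signal-to-distortion ratio is evaluated on short segments and the median is taken. A minimal NumPy sketch of that idea (the function name and chunk length are illustrative; published numbers use museval's BSSEval v4, so exact values will differ):

```python
import numpy as np

def chunk_sdr(reference, estimate, sr=44100, chunk_seconds=1.0):
    """Median SDR (dB) over fixed-length chunks -- a simplified
    stand-in for the chunk-level cSDR convention.  Published results
    use museval (BSSEval v4), which this sketch does not reproduce
    exactly."""
    chunk = int(sr * chunk_seconds)
    sdrs = []
    for start in range(0, len(reference) - chunk + 1, chunk):
        s = reference[start:start + chunk]
        e = estimate[start:start + chunk]
        signal_power = np.sum(s ** 2)
        error_power = np.sum((s - e) ** 2)
        # Skip silent or perfectly reconstructed chunks to avoid log(0).
        if signal_power > 0 and error_power > 0:
            sdrs.append(10 * np.log10(signal_power / error_power))
    return float(np.median(sdrs))
```

Scaling the reference by 0.9, for example, leaves a residual at one tenth of the signal amplitude, giving 20 dB on every chunk.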

Train on MUSDB + MoisesDB and evaluate on MUSDB

| Model       | Type | Params | cSDR ↑ |
| ----------- | ---- | ------ | ------ |
| SCNet-L [2] | disc | 42 M   | 11.11  |
| SGMSEVS [1] | gen  | 65 M   | 8.63   |
| Diff-VS     | gen  | 57 M   | 10.88  |



Subjective Evaluation

We use MERT-MSE as a proxy subjective metric [1].

| Model           | Type | Params | MERT-MSE ↓ |
| --------------- | ---- | ------ | ---------- |
| SCNet-L [2]     | disc | 42 M   | 0.096      |
| SGMSEVS [1]     | gen  | 65 M   | 0.089      |
| Mel-Roformer [4] | disc | 105 M | 0.071      |
| Diff-VS V2      | gen  | 54 M   | 0.083      |
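MERT-MSE compares pretrained-encoder embeddings of the separated vocals against those of the clean reference [1]. A minimal sketch of the embedding-distance idea, with the encoder abstracted as a caller-supplied `embed` function (this parameterisation is ours for illustration; the actual metric uses MERT features):

```python
import numpy as np

def embedding_mse(embed, reference, estimate):
    """Mean squared error between feature embeddings of two waveforms.

    `embed` maps a waveform to a (frames, dims) feature array -- in the
    metric of [1] this is a pretrained MERT encoder; here it is a plain
    parameter so the sketch stays self-contained.  Lower values mean
    the estimate is perceptually closer to the clean reference.
    """
    ref_feats = embed(reference)
    est_feats = embed(estimate)
    return float(np.mean((ref_feats - est_feats) ** 2))
```

With an identity `embed`, the function reduces to waveform MSE; the metric's value comes from plugging in a perceptually meaningful encoder.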



Listening Samples

Listening samples are randomly chosen from the MUSDB test set.

Little Chicago’s Finest - My Own (SDR 15.41)

Mixture
Clean Vocals
Diff-VS
Demucs
SCNet

PR - Happy Daze (SDR -0.58)

Mixture
Clean Vocals
Diff-VS
Demucs
SCNet

The Long Wait - Dark Horses (SDR 11.62)

Mixture
Clean Vocals
Diff-VS
Demucs
SCNet

Timboz - Pony (SDR 5.2)

Mixture
Clean Vocals
Diff-VS
Demucs
SCNet

We Fell From The Sky - Not You (SDR 8.44)

Mixture
Clean Vocals
Diff-VS
Demucs
SCNet



Teaser

Ongoing work on Diff-VS V2 achieves results on par with BS-RoFormer and SCNet.

Train and evaluate on MUSDB

| Model               | Type | Params | cSDR ↑ |
| ------------------- | ---- | ------ | ------ |
| SCNet-L [2]         | disc | 42 M   | 10.86  |
| BS-RoFormer-12L [3] | disc | 93 M   | 11.49  |
| Mel-Roformer [4]    | disc | 105 M  | 12.08  |
| Diff-VS V2          | gen  | 54 M   | 11.46  |

Train on MUSDB + MoisesDB and evaluate on MUSDB

| Model       | Type | Params | cSDR ↑ |
| ----------- | ---- | ------ | ------ |
| SCNet-L [2] | disc | 42 M   | 11.11  |
| SGMSEVS [1] | gen  | 65 M   | 8.63   |
| Diff-VS V2  | gen  | 54 M   | 11.85  |

Little Chicago’s Finest - My Own

PR - Happy Daze

The Long Wait - Dark Horses

Timboz - Pony

We Fell From The Sky - Not You



References

[1] Bereuter, Paul A., et al. “Towards Reliable Objective Evaluation Metrics for Generative Singing Voice Separation Models.” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2025.

[2] Tong, Weinan, et al. “SCNet: Sparse Compression Network for Music Source Separation.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

[3] Lu, Wei-Tsung, et al. “Music Source Separation with Band-Split RoPE Transformer.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

[4] Wang, Ju-Chiang, Wei-Tsung Lu, and Minz Won. “Mel-Band RoFormer for Music Source Separation.” arXiv preprint arXiv:2310.01809 (2023).

[5] Luo, Yi, and Jianwei Yu. “Music Source Separation with Band-Split RNN.” IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023): 1893–1901.

[6] Rouard, Simon, Francisco Massa, and Alexandre Défossez. “Hybrid Transformers for Music Source Separation.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023.

[7] Kim, Minseok, Jun Hyung Lee, and Soonyoung Jung. “Sound Demixing Challenge 2023 Music Demixing Track Technical Report: TFC-TDF-UNet v3.” arXiv preprint arXiv:2306.09382 (2023).