Maurício do V. M. da Costa 1, Eloi Moliner 2
1 MTDML, IMM, University of Osnabrück, Osnabrück, Germany
2 Acoustics Lab, Department of Information and Communications Engineering, Aalto University, Finland
Below you can listen to curated unconditional audio samples generated by different diffusion models. Each row presents selected samples from five approaches: UNet-1D (a 1D U-Net with temporal convolutions), NCSN++ (a 2D U-Net operating on STFT representations), CQTDiff+ (a U-Net operating on a differentiable CQT), MR-CQTDiff (our proposed multi-resolution CQT approach), and LDM (a latent diffusion model). All architectures are configured with around 40 million parameters for a fair comparison.
UNet-1D | NCSN++ | CQTDiff+ | MR-CQTDiff (proposed) | LDM |
---|---|---|---|---|