An Octave-based Multi-Resolution CQT Architecture for Diffusion-based Audio Generation

Maurício do V. M. da Costa 1, Eloi Moliner 2

1 MTDML, IMM, University of Osnabrück, Osnabrück, Germany
2 Acoustics Lab, Department of Information and Communications Engineering, Aalto University, Finland

Listening Examples (unconditional generation)


Below you can listen to curated audio samples generated unconditionally by different diffusion models. Each row presents selected samples from five approaches: UNet-1D (a 1D U-Net with temporal convolutions), NCSN++ (a 2D U-Net operating on STFT representations), CQTDiff+ (a U-Net operating on a differentiable CQT), MR-CQTDiff (our proposed multi-resolution CQT approach), and LDM (a latent diffusion model). All architectures are configured with around 40 million parameters for a fair comparison.

OpenSinger


UNet-1D, NCSN++, CQTDiff+, MR-CQTDiff (proposed), LDM

Free Music Archive


UNet-1D, NCSN++, CQTDiff+, MR-CQTDiff (proposed), LDM