An Octave-based Multi-Resolution CQT Architecture for Diffusion-based Audio Generation

Listening Examples (unconditional generation)

Below are curated unconditional audio samples generated by different diffusion models. Each row presents selected samples from five approaches: UNet-1D (a 1D U-Net with temporal convolutions), NCSN++ (a 2D U-Net on STFT representations), CQTDiff+ (a U-Net with a differentiable CQT), MR-CQTDiff (our proposed multi-resolution CQT approach), and LDM (a latent diffusion model). All architectures are configured with roughly 40 million parameters for a fair comparison.

OpenSinger

UNet-1D	NCSN++	CQTDiff+	MR-CQTDiff (proposed)	LDM

Free Music Archive

UNet-1D	NCSN++	CQTDiff+	MR-CQTDiff (proposed)	LDM