Upsampling layers in neural audio synthesis


 

ICASSP 2021

Article where we describe upsampling artifacts

ArXiv 2021

Article where we study upsampling layers

Code

Toy experiments to understand the artifacts


A number of recent advances in neural audio synthesis rely on upsampling layers, which can introduce undesired artifacts. In computer vision, upsampling artifacts have been studied and are known as checkerboard artifacts (due to their characteristic visual pattern). However, their effect has been overlooked so far in audio processing. Here, we address this gap by studying this problem from the audio signal processing perspective. The main sources of upsampling artifacts are:


Tonal artifacts

Additive periodic noise perceived as a high-frequency buzzing noise.

Filtering artifacts

They attenuate some bands, de-emphasizing high-end frequencies.

Spectral replicas

The spectral replicas of signal offsets can introduce additional artifacts.
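
As a toy illustration of how tonal artifacts can arise, here is a minimal NumPy sketch (not the code from our papers). It assumes a 1D transposed convolution with stride 2 and an all-ones kernel of length 3. The helper name `transposed_conv1d` and these parameter values are hypothetical, chosen only for illustration. Because the kernel copies overlap unevenly, even a constant input produces a periodic output, which shows up as a tone.

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Naive 1D transposed convolution (illustrative helper, not a library API).

    Each input sample adds a scaled copy of the kernel to the output,
    and consecutive copies are shifted by `stride` samples.
    """
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for n, value in enumerate(x):
        out[n * stride : n * stride + len(kernel)] += value * kernel
    return out

x = np.ones(8)          # constant input: no periodicity at all
kernel = np.ones(3)     # assumed kernel, e.g. a constant initialization
y = transposed_conv1d(x, kernel, stride=2)
print(y)                # interior samples alternate between 1 and 2

# The period-2 pattern is a tone at the Nyquist frequency of the output.
interior = y[1:-2]                                   # drop border samples
spectrum = np.abs(np.fft.rfft(interior - interior.mean()))
peak_bin = int(np.argmax(spectrum))
print(peak_bin, len(interior) // 2)                  # peak sits at the last bin
```

With other kernel lengths and strides the overlap pattern changes, so the period of the artifact changes too. When the kernel length is a multiple of the stride the overlap is even, which removes this particular source of tonal artifacts in the toy setting.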


The following video gently introduces those concepts:



We benchmark different upsampling layers for music source separation: transposed and subpixel convolutions, different interpolation upsamplers like nearest neighbor and stretch interpolation, and wavelet-based upsamplers.
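
To make some of these upsamplers concrete, the following NumPy sketch shows the non-learnable part of three of them for a 1D signal. It assumes that nearest neighbor repeats each sample, that stretch interpolation inserts zeros between samples, and that the subpixel upsampler interleaves the channels produced by a preceding convolution. The function names are hypothetical, and the learnable convolutions that would typically surround these operations in a network are omitted.

```python
import numpy as np

def nearest_neighbor(x, factor):
    """Repeat every sample `factor` times."""
    return np.repeat(x, factor)

def stretch(x, factor):
    """Insert `factor - 1` zeros after every sample."""
    out = np.zeros(len(x) * factor, dtype=x.dtype)
    out[::factor] = x
    return out

def subpixel_shuffle(channels):
    """Interleave channels of shape (factor, time) into one signal of length factor * time."""
    return channels.T.reshape(-1)

x = np.array([1.0, 2.0, 3.0])
print(nearest_neighbor(x, 2))   # [1. 1. 2. 2. 3. 3.]
print(stretch(x, 2))            # [1. 0. 2. 0. 3. 0.]

two_channels = np.array([[1.0, 2.0, 3.0],
                         [10.0, 20.0, 30.0]])
print(subpixel_shuffle(two_channels))  # [ 1. 10.  2. 20.  3. 30.]
```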
 
Our results show that filtering artifacts, associated with interpolation upsamplers like nearest neighbor, are perceptually preferable (higher MOS), even though they tend to achieve worse SDR scores. Check our papers for further experimental details.
 
Upsampler                 MOS (↑)     SDR (↑)
transposed convolution    2.70 / 5    5.39 dB
subpixel convolution      2.48 / 5    5.44 dB
nearest neighbor          3.30 / 5    5.17 dB
stretch interpolation     2.83 / 5    5.23 dB
lazy wavelet              2.45 / 5    5.31 dB

Upsampling layers can introduce unpleasant tonal artifacts.
You can listen to them in the following examples of separated vocals!

[Audio examples: Song 1, Song 2, Song 3, Song 4. Rows: original mixture, transposed convolution, subpixel convolution, nearest neighbor, stretch interpolation, lazy wavelet.]

The following spectrogram depicts the kind of tonal artifacts (horizontal lines) one can hear throughout this website. These are particularly noticeable in high-frequency and silent regions. To easily identify and understand them, we recommend using headphones or visualizing the spectrograms!



 

Now note that nearest neighbor does not introduce tonal artifacts.
Observe this behavior in the following examples of separated drums!

[Audio examples: Song 1, Song 2, Song 3, Song 4. Rows: original mixture, transposed convolution, subpixel convolution, nearest neighbor, stretch interpolation, lazy wavelet.]

The following spectrogram depicts the kind of filtering artifacts (the highlighted horizontal valley) one can hear throughout this website. These artifacts attenuate some bands and are not necessarily noticeable unless we visualize the spectrograms.




So far, we have only discussed filtering and tonal artifacts. Filtering artifacts are introduced by the frequency response of the non-learnable components of interpolation and wavelet upsamplers. Tonal artifacts are introduced by the structure (and initialization) of transposed and subpixel convolution architectures. However, spectral replicas can introduce additional upsampling artifacts. For this reason, stretch interpolation and lazy wavelet upsamplers can also introduce tonal artifacts.
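
The link between non-learnable components and filtering artifacts can be checked numerically. The following NumPy sketch assumes an upsampling factor of 2 and models nearest neighbor as zero insertion followed by a fixed filter with kernel [1, 1]. The variable names are hypothetical. The frequency response of that fixed kernel has a null at the highest frequency, which is the kind of attenuated band described above.

```python
import numpy as np

factor = 2
x = np.random.default_rng(0).standard_normal(16)

# Nearest neighbor written as zero insertion followed by a fixed "hold" filter.
zero_stuffed = np.zeros(len(x) * factor)
zero_stuffed[::factor] = x
hold_kernel = np.ones(factor)
via_filter = np.convolve(zero_stuffed, hold_kernel)[: len(x) * factor]

# The direct implementation gives the same result.
via_repeat = np.repeat(x, factor)
print(np.allclose(via_filter, via_repeat))

# Magnitude response of the fixed kernel, from frequency 0 up to Nyquist.
response = np.abs(np.fft.rfft(hold_kernel, n=512))
print(response[0])    # gain at frequency zero
print(response[-1])   # gain at Nyquist: numerically zero, so that band is removed
```

No learnable layer placed before the upsampler can change this response, because the kernel is fixed by the interpolation method itself.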

Note the above-mentioned artifacts in the following examples of separated drums!

[Audio examples: Song 1, Song 2, Song 3, Song 4. Rows: original mixture, transposed convolution, subpixel convolution, nearest neighbor, stretch interpolation, lazy wavelet.]

Spectral replicas of signal offsets

Definition: offsets are constant components of a signal, that is, components at zero frequency. Hence, their frequency transform contains energy at frequency zero. When upsampling, zero-frequency spectral replicas can appear in-band, introducing tonal artifacts.
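
A small NumPy sketch of this definition, assuming an upsampling factor of 4 and zero insertion as the upsampler (the offset value 0.5 and the variable names are arbitrary choices for illustration). A constant signal has energy only at frequency zero, yet after upsampling the spectrum also contains replicas of that zero-frequency component inside the new band.

```python
import numpy as np

factor = 4
offset = np.full(64, 0.5)               # a constant feature map: pure offset

# Zero insertion: keep the original samples, fill the gaps with zeros.
upsampled = np.zeros(len(offset) * factor)
upsampled[::factor] = offset

spectrum = np.abs(np.fft.rfft(upsampled))
peak_bins = np.flatnonzero(spectrum > 1e-9)

# Frequencies of the non-zero components, in cycles per output sample.
print(peak_bins / len(upsampled))       # [0.   0.25 0.5 ]
```

The components at 0.25 and 0.5 cycles per sample are pure tones that were not present in the input, which is how an offset can turn into a tonal artifact.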
 
Key observation: note that the stretch interpolation architecture by itself can only introduce filtering artifacts (not tonal artifacts). Yet, the above stretch separations contain tonal artifacts. This is because the spectral replicas of signal offsets (across feature maps) can appear in-band when upsampling. More details in our paper.
 
Practical advice: the spectral replicas of offsets interact with filtering artifacts. Importantly, interpolation-based upsamplers (like nearest neighbor, which introduce filtering artifacts) can attenuate the exact bands where spectral replicas of signal offsets appear. Hence, filtering artifacts are a powerful tool to combat the spectral replicas of signal offsets. More details in our paper.
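
This interaction can be checked with a toy signal. The NumPy sketch below assumes an upsampling factor of 4 and a feature map made of an offset plus a low-frequency sinusoid (both chosen arbitrarily for illustration). It compares zero insertion, which leaves the offset replicas untouched, with nearest neighbor, whose filtering nulls fall on the frequencies where those replicas sit.

```python
import numpy as np

factor = 4
n = np.arange(64)
feature_map = 0.5 + 0.1 * np.sin(2 * np.pi * 3 * n / 64)   # offset + sinusoid

# Zero insertion (no filtering) versus nearest neighbor (fixed hold filter).
zero_inserted = np.zeros(len(feature_map) * factor)
zero_inserted[::factor] = feature_map
nearest = np.repeat(feature_map, factor)

spec_zero = np.abs(np.fft.rfft(zero_inserted))
spec_nearest = np.abs(np.fft.rfft(nearest))

# Offset replicas appear at multiples of (output length / factor).
replica_bins = [64, 128]
for k in replica_bins:
    print(k, round(spec_zero[k], 6), round(spec_nearest[k], 6))
```

At both replica bins, zero insertion keeps a strong component while nearest neighbor leaves practically nothing, since a hold filter of length 4 has nulls at 1/4 and 1/2 cycles per sample.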
 
Negative result: we tried using normalization layers to mitigate the spectral replicas of signal offsets. However, informal listening reveals that normalization layers did not remove the tonal artifacts as intended. Listen to the subpixel convolution separations (a model which uses normalization layers, see our paper for more details).

Spectral replicas as a source of coherent high-frequency content

Hypothesis definition: we can sort interpolation upsamplers by how strong their filtering artifacts are, for example: sinc → linear → nearest neighbor → stretch. Note that sinc interpolation strongly filters the signal, and stretch interpolation introduces no filtering artifacts. While sinc interpolation is widely used in audio because it removes all spectral replicas, this might not be desirable for deep learning.
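
One possible way to reproduce this ordering numerically is sketched below with NumPy. It assumes an upsampling factor of 2 and describes each interpolator by the fixed kernel applied after zero insertion. The sinc kernel is a Hann-windowed, truncated approximation of the ideal one, and the "image band gain" is a measure made up for this illustration, not the one used in our paper. A lower gain in the upper half of the spectrum means stronger filtering of the spectral replicas.

```python
import numpy as np

taps = np.arange(-64, 65)
kernels = {
    # Truncated, Hann-windowed approximation of ideal sinc interpolation.
    "sinc": np.sinc(taps / 2) * np.hanning(len(taps)),
    "linear": np.array([0.5, 1.0, 0.5]),
    "nearest neighbor": np.array([1.0, 1.0]),
    # Stretch applies no filter after zero insertion.
    "stretch": np.array([1.0]),
}

def image_band_gain(kernel, n_fft=4096):
    """Mean magnitude response in the upper half band, relative to the gain at frequency zero."""
    response = np.abs(np.fft.rfft(kernel, n=n_fft))
    response = response / response[0]
    upper_half = response[len(response) // 2 + 1 :]
    return float(upper_half.mean())

gains = {name: image_band_gain(kernel) for name, kernel in kernels.items()}
ranking = sorted(gains, key=gains.get)   # strongest filtering first

for name in ranking:
    print(f"{name:18s} {gains[name]:.3f}")
```

Under these assumptions the printed order is sinc, linear, nearest neighbor, stretch, with stretch passing the replicas at full strength.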
 
Positive result: we hypothesized that allowing spectral replicas across feature maps is beneficial, as it allows the model to have access to coherent high-frequency feature maps for wide-band synthesis. In our paper we experimentally validate this hypothesis, and we recommend using nearest neighbor or stretch interpolation.

Post-networks to palliate upsampling artifacts

Negative result: previous research explored using post-processing networks, as an “a posteriori” mechanism to palliate upsampling artifacts (link, link). However, we found that post-networks were unable to fully remove tonal artifacts. Listen to the following examples:
[Audio examples: Bass, Other, Drums, Vocals. Rows: transposed convolution + post-network, subpixel convolution + post-network.]