Page 46 - ITU Journal Future and evolving technologies Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks
the derived data. However, the prosodic attributes are difficult and time-consuming to annotate. The prosody of speech is best captured by the pitch, energy and melspectrogram of the audio frames. Such features help the deep learning model incorporate natural and expressive audio cues for end tasks such as the generation of expressive video.
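As a rough illustration of how two of these low-level features can be read off a single waveform frame (this is not the authors' pipeline; the helper name `frame_features`, the frame length, and the lag bounds are all assumptions for the sketch), short-term energy and a naive autocorrelation pitch estimate might look like:

```python
import math

def frame_features(frame, sample_rate):
    """Toy per-frame prosodic features: short-term energy and a naive
    autocorrelation pitch estimate (hypothetical helper, illustration only)."""
    n = len(frame)
    # Short-term energy: mean squared amplitude of the frame.
    energy = sum(s * s for s in frame) / n
    # Naive pitch: search the lags covering the 50-450 Hz speech range and
    # keep the one that maximizes the frame's autocorrelation.
    lo = int(sample_rate / 450)               # shortest period of interest
    hi = min(int(sample_rate / 50), n - 1)    # longest period of interest
    best_lag, best_corr = lo, float("-inf")
    for lag in range(lo, hi + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    pitch_hz = sample_rate / best_lag
    return energy, pitch_hz

# A 100 Hz sine sampled at 8 kHz: the pitch estimate should land near 100 Hz.
sr = 8000
frame = [math.sin(2 * math.pi * 100 * t / sr) for t in range(1024)]
energy, pitch = frame_features(frame, sr)
```

Production systems would instead use a robust pitch tracker and filterbank energies, but the sketch shows why these are "low-level" features: both fall out of simple per-frame arithmetic on the waveform.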
2.3.1 Pitch
Pitch is the fundamental frequency of an audio waveform,
and is an important parameter in the analysis and synthesis of speech and music. Normally only voiced speech and harmonic music have a well-defined pitch, but pitch can still be used as a low-level feature to characterize the fundamental frequency of any audio waveform. The typical pitch frequency for human speech lies between 50 and 450 Hz, whereas the pitch range for music is much wider.

2.3.2 Energy

Energy models the excitation pattern on the basilar membrane by simulating the acoustic signal transformations in the ear according to a perceptual model of the human auditory system. Short-term speech energy is closely related to the activation (arousal) dimension of emotion, and its use among the conventional features contributes to the classification of emotions.

2.3.3 Melspectrogram

A melspectrogram is a spectrogram whose frequencies are converted to the mel scale. The mel scale is constructed so that sounds separated by equal distances on the scale also "sound" equally far apart to human listeners (one common convention is $m = 2595 \log_{10}(1 + f/700)$). This is in contrast to the Hz scale: the difference between 500 and 1000 Hz is obvious, whereas the difference between 7500 and 8000 Hz is barely noticeable.

2.4 Generative adversarial network

The Generative Adversarial Network (GAN) [10] consists of a generative model and a discriminative model, and the GAN framework naturally takes up a game-theoretic approach. The word "adversarial" is chosen because the two networks, i.e., the generator and the discriminator, are in constant conflict and compete with each other. The generative model can be thought of as analogous to a team of counterfeiters trying to create money similar to the real thing, while the discriminator acts as the police, trying to detect the counterfeit currency. Competition in this game drives both teams to improve their methods, constantly exchanging knowledge and feedback, until the counterfeits are indistinguishable from the genuine articles.

The generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We can train both models using only the highly successful backpropagation and dropout algorithms, and sample from the generative model using only forward propagation.

Fig. 4 – Architecture of Generative Adversarial Network (GAN)

Fig. 4 shows the general architecture of a GAN. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$. We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than from $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$, and we simultaneously train $G$ to minimize $\log(1 - D(G(z)))$. In other words, $D$ and $G$ play the following two-player minimax game with value function $V(D, G)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$$

2.5 Normalization

The normalization framework has become an integral part of neural network training. It owes its success to many factors, such as enabling a higher learning rate, faster training, regularization effects, and smoothing of the loss landscape. The variants of normalization are discussed in the following subsections. One of the first normalization architectures proposed was batch normalization [11], which helped the deep learning community understand the effect of normalization.

2.5.1 Batch normalization

In traditional deep networks, a too-high learning rate may result in gradients that explode or vanish, as well as in training getting stuck in poor local minima. Batch normalization [11] helps address such issues. By normalizing activations throughout the network, it prevents small changes to the parameters from amplifying into larger and suboptimal changes in activations and gradients; for instance, it prevents training from getting stuck in the saturated regimes of nonlinearities.
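A minimal sketch of the batch-norm transform itself, for a single feature over one mini-batch (the learned scale and shift, gamma and beta, are shown with default values rather than trained, and the function name is an assumption for illustration):

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize activations across the mini-batch, then apply the
    learned scale (gamma) and shift (beta)."""
    m = len(batch)
    mean = sum(batch) / m
    var = sum((x - mean) ** 2 for x in batch) / m
    # eps keeps the division stable when the batch variance is tiny.
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in batch]
    # gamma and beta let the network undo the normalization if the
    # original scale of the activations turns out to be optimal.
    return [gamma * x + beta for x in x_hat]

# Wildly scaled activations come out roughly zero-mean and unit-variance,
# so a downstream saturating nonlinearity stays in its sensitive range.
activations = [0.1, 100.0, 50.0, 25.0]
normalized = batch_norm(activations)
```

This per-batch standardization is what allows the higher learning rates mentioned above: the scale of each layer's inputs no longer drifts with the parameters upstream of it.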
© International Telecommunication Union, 2021