ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks


          the derived data. However, the prosodic attributes are
          difficult and time-consuming to annotate. The prosody
          of speech is best captured by the pitch, energy, and
          melspectrogram of the audio frames. Such features help a
          deep learning model incorporate natural and expressive
          audio to meet end tasks such as the generation of
          expressive video.
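
          The three prosodic attributes below can be computed directly from
          audio frames. As a minimal numpy sketch (the autocorrelation pitch
          estimator and all function names here are illustrative choices, not
          the paper's method):

          ```python
          import numpy as np

          def pitch_autocorr(frame, sr, fmin=50.0, fmax=450.0):
              """Estimate pitch (F0) of one frame via autocorrelation,
              searching only the typical speech range fmin..fmax Hz."""
              frame = frame - frame.mean()
              ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
              lo, hi = int(sr / fmax), int(sr / fmin)   # candidate lag range
              lag = lo + np.argmax(ac[lo:hi])
              return sr / lag

          def short_term_energy(frame):
              """Mean squared amplitude of a frame."""
              return float(np.mean(frame ** 2))

          def hz_to_mel(f):
              """Convert frequency in Hz to the (HTK-style) mel scale."""
              return 2595.0 * np.log10(1.0 + f / 700.0)

          # A 200 Hz sine should yield a pitch estimate near 200 Hz.
          sr = 16000
          t = np.arange(sr) / sr
          tone = np.sin(2 * np.pi * 200 * t)
          print(round(pitch_autocorr(tone[:1024], sr)))   # ≈ 200
          ```

          The mel conversion above is what warps a spectrogram's frequency
          axis into a melspectrogram (Section 2.3.3).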


          2.3.1  Pitch
          Pitch is the fundamental frequency of an audio waveform,
          and is an important parameter in the analysis and synthesis
          of speech and music. Normally, only voiced speech and
          harmonic music have a well-defined pitch, but we can still
          use pitch as a low-level feature to characterize the
          fundamental frequency of any audio waveform. The typical
          pitch frequency for human speech lies between 50 and 450 Hz,
          whereas the pitch range for music is much wider.

          2.3.2  Energy

          Energy models the excitation pattern on the basilar
          membrane by simulating the acoustic signal transformations
          in the ear according to the perceptual model of the human
          auditory system. Short-term speech energy is closely
          related to the activation (arousal) dimension of emotion,
          and its use as a conventional feature contributes to the
          classification of emotions.

          2.3.3  Melspectrogram

          A melspectrogram is a spectrogram whose frequencies are
          converted to the mel scale. The mel scale is constructed
          such that sounds that are equally spaced on the scale also
          "sound" equally spaced to human listeners. This is in
          contrast to the Hz scale, on which the difference between
          500 and 1000 Hz is obvious, whereas the difference between
          7500 and 8000 Hz is barely noticeable.

          2.4 Generative adversarial network

          The Generative Adversarial Network (GAN) [10] consists of
          a generative model and a discriminative model. The GAN
          framework naturally takes a game-theoretic approach. The
          word "adversarial" is chosen because the two networks,
          i.e., the generator and the discriminator, are in constant
          conflict and compete with each other. The generative model
          can be thought of as analogous to a team of counterfeiters
          trying to create money similar to the real currency, while
          the discriminator acts as the police, trying to detect the
          counterfeit currency. Competition in this game drives both
          teams to improve their methods, through constant knowledge
          and feedback, until the counterfeits are indistinguishable
          from the genuine articles.

          The generative model generates samples by passing random
          noise through a multilayer perceptron, and the
          discriminative model is also a multilayer perceptron. We
          can train both models using only the highly successful
          back-propagation and dropout algorithms, and sample from
          the generative model using only forward propagation.

          Fig. 4 – Architecture of Generative Adversarial Network (GAN)

          Fig. 4 shows the general architecture of a GAN. To learn
          the generator's distribution p_g over data x, we define a
          prior on input noise variables p_z(z), then represent a
          mapping to data space as G(z; θ_g), where G is a
          differentiable function represented by a multilayer
          perceptron with parameters θ_g. We also define a second
          multilayer perceptron D(x; θ_d) that outputs a single
          scalar. D(x) represents the probability that x came from
          the data rather than from p_g. We train D to maximize the
          probability of assigning the correct label to both
          training examples and samples from G, and we
          simultaneously train G to minimize log(1 − D(G(z))). In
          other words, G and D play the following two-player minimax
          game with value function V(D, G):

          min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].

          2.5 Normalization

          Normalization has become an integral part of neural
          network training. It has gained success for many reasons,
          such as permitting a higher learning rate, faster
          training, regularization effects, smoothing of the loss
          landscape, etc. The variants of normalization are
          discussed in the following subsections. One of the first
          normalization architectures proposed was batch
          normalization [11], which helped the deep learning
          community understand the effect of normalization.

          2.5.1  Batch normalization

          In traditional deep networks, a too-high learning rate may
          result in gradients that explode or vanish, as well as in
          the training getting stuck in poor local minima. Batch
          normalization [11] helps address such issues. By
          normalizing activations throughout the network, it
          prevents small changes to the parameters from amplifying
          into larger, suboptimal changes in activations and
          gradients; for instance, it prevents the training from
          getting stuck in the saturated regimes of nonlinearities.
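
          The batch-normalization forward pass can be sketched in a few lines
          of numpy: each feature is standardized using statistics computed over
          the mini-batch, then rescaled by learnable parameters γ and β (the
          function name and shapes below are illustrative):

          ```python
          import numpy as np

          def batch_norm_forward(x, gamma, beta, eps=1e-5):
              """Normalize each feature over the mini-batch, then scale and shift.

              x: (batch, features) activations; gamma, beta: (features,) params.
              """
              mu = x.mean(axis=0)                  # per-feature batch mean
              var = x.var(axis=0)                  # per-feature batch variance
              x_hat = (x - mu) / np.sqrt(var + eps)
              return gamma * x_hat + beta

          rng = np.random.default_rng(0)
          x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # poorly scaled input
          y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
          # Output features now have (approximately) zero mean and unit variance,
          # regardless of the scale of the incoming activations.
          print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
          ```

          Keeping activations in this standardized range is what prevents a
          large learning rate from pushing the network into the saturated
          regimes of its nonlinearities.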


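          Returning to the minimax game of Section 2.4: the value function
          V(D, G) can be evaluated empirically from the discriminator's
          outputs on real and generated samples. A minimal numpy sketch (the
          toy probabilities below are illustrative, not trained models):

          ```python
          import numpy as np

          def value_function(d_real, d_fake):
              """Empirical V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

              d_real: D's outputs on data samples; d_fake: D's outputs on
              generated samples. D maximizes this value, G minimizes it.
              """
              return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

          # A perfect discriminator gives d_real -> 1 and d_fake -> 0, so
          # V -> 0 (its maximum). At the game's equilibrium the generator
          # fools D completely (d_real = d_fake = 0.5) and V = -2 log 2.
          v_fooled = value_function(np.full(8, 0.5), np.full(8, 0.5))
          print(v_fooled)   # ≈ -1.386 = -2 log 2
          ```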


          © International Telecommunication Union, 2021