QoE = −8.97 + 0.056⋅FR + 0.41⋅PSNR − 0.0038⋅PSNR² − 0.001⋅FR² + 0.00079⋅FR⋅PSNR    (15)

Knowing the average PSNR and frame size, we use this model to calculate each receiver's QoE at present and estimate their QoE in the future for different profiles. The total QoE for each receiver, which aims to reflect their satisfaction with the whole video streaming experience, is a function of the individual QoE values corresponding to each player.

The QoE metric has several advantages:

  • Due to erratic network connectivity or low bandwidth, the Quality of Experience (QoE) can be low. With the proposed model we can significantly improve the QoE by sending the audio signal and synthesizing the video at the receiver's end, thus improving the PSNR.

  • The proposed video streaming pipeline helps in dynamically switching to the proposed video generation architecture when the quality of experience drops below the threshold PSNR level. It thus gives the flexibility to control the QoE based on the compute resources, bandwidth availability and importance of the speaker in the video conference.
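As a concrete illustration, the sketch below evaluates the QoE model of Eq. (15) at a few operating points and applies a PSNR-threshold switching rule of the kind described above; the threshold value and the operating points are assumptions for illustration only, not values from this paper.

```python
def qoe_model(fr: float, psnr: float) -> float:
    """QoE regression model of Eq. (15): a polynomial in frame rate (FR) and PSNR."""
    return (-8.97 + 0.056 * fr + 0.41 * psnr
            - 0.0038 * psnr ** 2 - 0.001 * fr ** 2
            + 0.00079 * fr * psnr)

# Assumed threshold: below this PSNR the pipeline would switch to
# synthesizing the receiver-side video from the audio signal.
PSNR_THRESHOLD_DB = 30.0

for fr, psnr in [(25, 38.0), (25, 28.0), (15, 24.0)]:   # illustrative operating points
    q = qoe_model(fr, psnr)
    mode = ("regular streaming" if psnr >= PSNR_THRESHOLD_DB
            else "audio-driven video generation")
    print(f"FR={fr:>2} fps, PSNR={psnr:4.1f} dB -> QoE={q:5.2f}  ({mode})")
```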
7.  EXPERIMENTS

7.1 Implementation details

7.1.1  Data sets

We have used the GRID [68], LOMBARD GRID [69], CREMA‑D [70] and VoxCeleb2 [71] data sets for the experiments and evaluation of different metrics.

GRID: GRID [68] is a large multi‑talker audiovisual sentence corpus to support joint computational‑behavioral studies in speech perception. In brief, the corpus consists of high‑quality audio and video (facial) recordings of 1000 sentences spoken by each of the 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now".

LOMBARD GRID: Lombard GRID [69] is a bi‑view audiovisual Lombard speech corpus that can be used to support joint computational‑behavioral studies in speech perception. The corpus includes 54 talkers, with 100 utterances per talker (50 Lombard and 50 plain utterances). This data set follows the same sentence format as the audiovisual GRID corpus, and can thus be considered an extension of that corpus.

CREMA‑D: CREMA‑D [70] is a data set of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74, coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four different emotion levels (Low, Medium, High, and Unspecified).

VOXCELEB2: VoxCeleb2 [71] is a very large‑scale audio‑visual speaker recognition data set collected from open‑source media. VoxCeleb2 contains over 1 million utterances for over 6,000 celebrities, extracted from videos uploaded to YouTube. The data set is fairly gender balanced, with 61% of the speakers male.

7.1.2  Preprocessing steps

Videos are processed at 25 fps and frames are resized to 256×256, while audio features are processed at 16 kHz. The ground truth optical flow is calculated using the Farneback optical flow algorithm [66]. To extract the keypoint heatmaps, we have used the pretrained hourglass face keypoint detector [65]. Every audio frame is centered around a single video frame. To do that, zero padding is applied before and after the audio signal, and the stride is given by

    stride = audio sampling rate / video frames per second
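A minimal sketch of this alignment step, assuming a 16 kHz waveform, 25 fps video, and an illustrative window length (the window size is not stated in this section):

```python
import numpy as np

def audio_windows_per_frame(wav: np.ndarray, sr: int = 16000, fps: int = 25,
                            window: int = 3200) -> np.ndarray:
    """Center one audio window on every video frame.

    The stride follows the formula above (audio sampling rate / video fps);
    the waveform is zero-padded before and after so every window stays in bounds.
    """
    stride = sr // fps                    # 16000 / 25 = 640 samples per video frame
    pad = window // 2
    padded = np.pad(wav, (pad, pad))      # zero padding before and after the signal
    n_frames = len(wav) // stride         # one audio window per video frame
    centers = np.arange(n_frames) * stride + pad
    return np.stack([padded[c - pad: c + pad] for c in centers])  # (n_frames, window)
```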
          This data set follows the same sentence format as the au‑  Pose [73]. Along with image quality metrics, we also
          diovisual GRID corpus, and can thus be considered as an  calculate Word Error Rate (WER) using pretrained Lip‑
          extension of that corpus.                            Net architecture [74], Blinks/sec using [75] and Land‑
                                                               mark Distance (LMD) [76] to evaluate our performance
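The 256‑level quantization and one‑hot encoding of F0 and energy could be sketched roughly as follows; the binning scheme is an assumption, since the text only fixes the number of levels:

```python
import numpy as np

def quantize_one_hot(values: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Quantize a per-frame scalar sequence (F0 or STFT-frame energy) into
    n_bins levels and encode each frame as a one-hot vector."""
    # Linear bins over the observed range (illustrative choice of bin edges).
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    ids = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)
    return np.eye(n_bins, dtype=np.float32)[ids]      # shape: (n_frames, n_bins)

# p = quantize_one_hot(f0_per_frame)       # pitch sequence from PyWorldVocoder
# e = quantize_one_hot(energy_per_frame)   # L2-norm of each STFT frame
# p, e and the 256-dim mel-spectrogram features are then fed to the
# proposed normalization method.
```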
7.1.3  Metrics

To quantify the quality of the final generated video, we use the following metrics: Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM), Cumulative Probability Blur Detection (CPBD) and Average Content Distance (ACD). PSNR, SSIM, and CPBD measure the quality of the generated image in terms of the presence of noise, perceptual degradation, and blurriness respectively. ACD [44] is used for identification of the speaker from the generated frames using OpenPose [73]. Along with image quality metrics, we also calculate the Word Error Rate (WER) using the pretrained LipNet architecture [74], Blinks/sec using [75] and Landmark Distance (LMD) [76] to evaluate our performance on speech recognition, eye‑blink reconstruction and lip reconstruction respectively.
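For reference, the frame‑level image quality metrics PSNR and SSIM can be computed with scikit‑image as in the sketch below; this is a standard implementation, not necessarily the exact one used for the reported results.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def frame_quality(generated: np.ndarray, reference: np.ndarray) -> dict:
    """PSNR and SSIM for a pair of 256x256 uint8 RGB frames."""
    return {
        "psnr": peak_signal_noise_ratio(reference, generated, data_range=255),
        "ssim": structural_similarity(reference, generated,
                                      channel_axis=-1, data_range=255),
    }
```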



