Page 54 - ITU Journal Future and evolving technologies Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks
QoE = −8.97 + 0.056·FR + 0.41·PSNR − 0.0038·PSNR² − 0.001·FR² + 0.00079·FR·PSNR    (15)

Knowing the average PSNR and frame rate, we use this model to calculate each receiver's QoE at present and estimate their QoE in the future for different profiles. The total QoE for each receiver, which aims to reflect their satisfaction with the whole video streaming experience, will be a function of the individual QoE corresponding to each player.

The QoE metric has several advantages:

• Due to erratic network connectivity or low bandwidth, the Quality of Experience (QoE) can be low. With the proposed model we can significantly improve the QoE by sending only the audio signal and synthesizing the video at the receiver's end, thus improving the PSNR.

• The proposed video streaming pipeline helps in dynamically switching to the proposed video generation architecture when the quality of experience goes below the threshold PSNR level. It thus gives the flexibility to control the QoE based on the compute resources, bandwidth availability, and the importance of the speaker in the video conference.

7. EXPERIMENTS

7.1 Implementation details

7.1.1 Data sets

We have used the GRID [68], Lombard GRID [69], CREMA-D [70], and VoxCeleb2 [71] data sets for the experiments and the evaluation of different metrics.

GRID: GRID [68] is a large multi-talker audiovisual sentence corpus to support joint computational-behavioral studies in speech perception. In brief, the corpus consists of high-quality audio and video (facial) recordings of 1000 sentences spoken by each of the 34 talkers (18 male, 16 female). Sentences are of the form "put red at G9 now".

LOMBARD GRID: Lombard GRID [69] is a bi-view audiovisual Lombard speech corpus that can be used to support joint computational-behavioral studies in speech perception. The corpus includes 54 talkers, with 100 utterances per talker (50 Lombard and 50 plain utterances). This data set follows the same sentence format as the audiovisual GRID corpus, and can thus be considered an extension of that corpus.

CREMA-D: CREMA-D [70] is a data set of 7,442 original clips from 91 actors. These clips were from 48 male and 43 female actors between the ages of 20 and 74, coming from a variety of races and ethnicities (African American, Asian, Caucasian, Hispanic, and Unspecified). Actors spoke from a selection of 12 sentences. The sentences were presented using one of six different emotions (Anger, Disgust, Fear, Happy, Neutral, and Sad) and four different emotion levels (Low, Medium, High, and Unspecified).

VOXCELEB2: VoxCeleb2 [71] is a very large-scale audio-visual speaker recognition data set collected from open-source media. VoxCeleb2 contains over 1 million utterances from over 6,000 celebrities, extracted from videos uploaded to YouTube. The data set is fairly gender balanced, with 61% of the speakers male.

7.1.2 Preprocessing steps

Videos are processed at 25 fps, frames are resized to 256×256, and audio features are processed at 16 kHz. The ground truth optical flow is calculated using the Farnebäck optical flow algorithm [66]. To extract the keypoint heatmaps, we have used the pretrained hourglass face keypoint detector [65]. Every audio frame is centered around a single video frame. To do that, zero padding is added before and after the audio signal, and the following formula is used for the stride:

stride = audio sampling rate / video frames per sec

We extract the pitch, F0, using PyWorldVocoder [72] from the raw waveform, with a frame size of 1024 and a hop size of 256 sampled at 16 kHz, to obtain the pitch of each frame, and compute the L2-norm of the amplitude of each STFT frame as the energy. We quantize the F0 and energy of each frame to 256 possible values, encode them into sequences of one-hot vectors p and e respectively, and then feed the values of p, e, and the 256-dimensional mel-spectrogram features into the proposed normalization method.

7.1.3 Metrics

To quantify the quality of the final generated video, we use the following metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Cumulative Probability Blur Detection (CPBD), and Average Content Distance (ACD). PSNR, SSIM, and CPBD measure the quality of the generated image in terms of the presence of noise, perceptual degradation, and blurriness, respectively. ACD [44] is used for the identification of the speaker from the generated frames by using OpenPose [73]. Along with the image quality metrics, we also calculate the Word Error Rate (WER) using the pretrained LipNet architecture [74], blinks/sec using [75], and Landmark Distance (LMD) [76] to evaluate our performance on speech recognition, eye-blink reconstruction, and lip reconstruction, respectively.
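As a worked illustration, the QoE model of Eq. (15) can be evaluated directly from a receiver's frame rate (FR) and average PSNR. This is a minimal sketch: the function names, the example inputs, and the threshold value used in the fallback check are our assumptions, not values from the paper.

```python
def qoe_model(fr: float, psnr: float) -> float:
    """QoE as a polynomial in frame rate (FR) and average PSNR, per Eq. (15)."""
    return (-8.97 + 0.056 * fr + 0.41 * psnr
            - 0.0038 * psnr ** 2
            - 0.001 * fr ** 2
            + 0.00079 * fr * psnr)

# Hypothetical threshold: when estimated QoE drops below it, switch to
# audio-only transmission with receiver-side video synthesis.
QOE_THRESHOLD = 1.0  # assumed value, for illustration only

def use_generative_pipeline(fr: float, psnr: float) -> bool:
    """Decide whether to fall back to the video generation architecture."""
    return qoe_model(fr, psnr) < QOE_THRESHOLD
```

At 25 fps, a drop in average PSNR from 30 dB to 20 dB moves the model's output below the assumed threshold, which is the kind of event the dynamic pipeline reacts to.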
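The audio-video alignment in Section 7.1.2 (zero-pad the signal, then take one window per video frame with stride = sampling rate / fps) can be sketched as follows. The function name and the use of half-window padding to center each window are our assumptions; the paper only states that each audio frame is centered on a video frame.

```python
import numpy as np

def audio_windows(audio: np.ndarray, sample_rate: int = 16000,
                  fps: int = 25, window: int = 1024) -> np.ndarray:
    """Return one audio window per video frame, centered on that frame."""
    # stride = audio sampling rate / video frames per second (Sec. 7.1.2)
    stride = sample_rate // fps          # 16000 / 25 = 640 samples per frame
    pad = window // 2
    padded = np.pad(audio, (pad, pad))   # zero padding before and after
    n_frames = len(audio) // stride
    # In padded coordinates, the window starting at i*stride is centered
    # on sample i*stride of the original signal.
    return np.stack([padded[i * stride : i * stride + window]
                     for i in range(n_frames)])
```

One second of 16 kHz audio therefore yields 25 windows of 1024 samples, one per video frame at 25 fps.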
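The quantization of per-frame F0 and energy into 256 levels with one-hot encoding (Section 7.1.2) might look like the following. The uniform min-max binning shown here is an assumption; the paper specifies only that each value is quantized to 256 possible values.

```python
import numpy as np

def quantize_onehot(values: np.ndarray, n_bins: int = 256) -> np.ndarray:
    """Quantize per-frame scalars (F0 or energy) and one-hot encode them."""
    lo, hi = values.min(), values.max()
    # Uniform binning over the observed range (assumed scheme).
    idx = np.round((values - lo) / (hi - lo + 1e-8) * (n_bins - 1)).astype(int)
    onehot = np.zeros((len(values), n_bins))
    onehot[np.arange(len(values)), idx] = 1.0
    return onehot
```

The resulting sequences of one-hot vectors play the roles of p (pitch) and e (energy) fed into the proposed normalization method.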
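Of the metrics in Section 7.1.3, PSNR is simple enough to state exactly; the sketch below uses the standard definition over 8-bit frames (the function name and peak value are ours; in practice one would use a library implementation such as scikit-image's).

```python
import numpy as np

def psnr(ref: np.ndarray, gen: np.ndarray, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between a reference and a generated frame."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    # Identical frames have zero MSE, hence infinite PSNR.
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Higher PSNR indicates less pixel-level noise in the generated frame relative to the reference, which is why the streaming pipeline above uses a PSNR threshold as its switching criterion.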
38 © International Telecommunication Union, 2021