Page 89 - ITU Journal, ICT Discoveries, Volume 3, No. 1, June 2020 Special issue: The future of video and immersive media





where $h_s = s * H_s$ is the signal resulting from filtering $s$ with the spatial highpass filter $H_s$, the temporal adaptation can be integrated by adding to $h_s$ the weighted result $h_t = s * H_t$ of a temporal highpass filtering step:

$$\hat{a}_k = \max\left( a_{\min}^2,\ \left( \frac{1}{4N^2} \sum_{[x,y] \in B_k} \Big( \left|h_{s_i}[x,y]\right| + \gamma \left|h_{t_i}[x,y]\right| \Big) \right)^{2} \right). \qquad (8)$$

Note that the lower limit remains at $a_{\min} = 2^{BD-6}$ as in [17]. Two simple highpass filters $H_t$ were found useful. The first one, a first-order FIR applied for frame rates of 30 Hz or lower, is $h_{t_i}[x,y] = s_i[x,y] - s_{i-1}[x,y]$ and the second one, a second-order FIR used, accordingly, for frame rates above 30 and up to 60 Hz, is defined as $h_{t_i}[x,y] = s_i[x,y] - 2\,s_{i-1}[x,y] + s_{i-2}[x,y]$. In other words, one or two previous frames are used to obtain a simple estimate of the temporal activity in each block $B_k$ of each signal $s$ over time. Naturally, for frame rates higher than 60 Hz, a third-order FIR could be specified, but due to a lack of correspondingly recorded content, such operating points have not been examined yet. The dependency of the filter order of $H_t$ on the frame rate is based upon psychovisual considerations: the limited temporal (highpass-like) integration of visual stimuli in human perception [21] implies that a shorter filter impulse response should be employed at relatively low frame rates than at higher ones. Naturally, filters which more accurately model the nonlinear temporal contrast sensitivity of the human visual system could be used, but for the sake of simplicity and low complexity, such an option is not considered here. Note, also, that taking the absolute values of the first-order highpass outputs as above is identical to the “absolute value of temporal information” (ATI) filter described in [22].

The relative weight $\gamma$ is an experimentally determined constant for which $\gamma = 2$ was selected. To compensate for the increased variance in $\hat{a}_k$ relative to $a_k$ after the introduction of term $h_t$ (the sum may increase while $a_{\min}$ remains unchanged), $a_{\mathrm{pic}}$ is modified accordingly [18], resulting in $\hat{a}_{\mathrm{pic}}$ which, in turn, yields the weight

$$w_k = \left( \frac{\hat{a}_{\mathrm{pic}}}{\hat{a}_k} \right)^{\beta} \quad \text{with, again,} \quad \beta = 0.5 \qquad (9)$$

as a spatio-temporal visual sensitivity measure. It must be noted that the temporal activity component of $\hat{a}_k$ is a quite crude, but very low-complexity, approximation of a $B_k$-wise motion estimation operation, as typically performed in all modern video codecs. Evidently, more elaborate, but computationally more complex, activity metrics accounting for block-internal motion between frames $i$, $i-1$ and, for high frame rates, $i-2$ before deriving $h_t$ may be devised [23], [24]. Such designs, which may use neural networks [25] or estimations of multi-scale statistical models [26], are not considered here since one objective is to maintain very low complexity.

Fig. 2 – Effect of different temporal WPSNR averaging methods on coded video with visual quality drop (MarketPlace, HD, 10 s [20]). Plotted curves: frame-wise (3), WPSNR (7), WPSNR′ (10); vertical axis in dB, horizontal axis: frame index i.

4.  TEMPORALLY VARYING VIDEO QUALITY

It was described in Sec. 2 that, for video sequences, the traditional approach is to average the individual frame (W)PSNR values so as to obtain a single measurement value for a complete video. It was observed [18] that, for compressed video content which strongly varies in visual quality over time, such averaging of frame-wise model output may not always correlate well with MOS values provided by human observers. The averaging of logarithmic (W)PSNR values appears to be especially suboptimal on decoded video material of high overall visual quality in which, however, brief passages exhibit relatively low quality. With the growing popularity of rate adaptive video streaming, particularly on mobile devices, such situations actually occur quite frequently. It was discovered experimentally that, under such circumstances, non-expert viewers assign relatively low scores to the tested video (compared with a video with balanced visual quality) during subjective VQA tasks, even if the majority of frames of the compressed video are of excellent quality to their eyes. This observation, which is confirmed by QoE-related feedback of consumers as reported by, e.g., Netflix [12], indicates that the log-domain average WPSNR values of (7) tend to overestimate the subjective quality in such cases.

A simple solution to this problem is to apply a square-mean-root (SMR) approach [27] which takes the arithmetic average of the square roots of the linear-domain frame-wise WPSNR data (before taking their base-10 logarithms) and, only thereafter, applies the logarithm:

$$\mathrm{WPSNR}'_c = 20 \cdot \log_{10}\left( \frac{F \cdot \sqrt{W \cdot H} \cdot \left(2^{BD} - 1\right)}{\sum_{i=1}^{F} \sqrt{\sum_{k \in i} \left( w_k \sum_{[x,y] \in B_k} \left( s_{c,i}[x,y] - s_i[x,y] \right)^2 \right)}} \right). \qquad (10)$$

Note the use of constant 20 in (10) instead of 10 in (3), representing the linear-domain squaring operation. A comparison of WPSNR and WPSNR′ is shown in Fig. 2, with a 0.35 dB lower value of the latter in this example.

5.  VARYING INPUT OR OUTPUT BIT DEPTH

Typically, the input and output bit depths of the color planes of a video presentation are held constant for a specific distribution path. Sometimes, however, it may

© International Telecommunication Union, 2020