Page 88 - ITU Journal, ICT Discoveries, Volume 3, No. 1, June 2020 Special issue: The future of video and immersive media




ment of the PSNR measure, termed WPSNR, which was examined further in JVET-K0206 [14] and finalized in JVET-M0091 [15]. More recently, model-free weighting was also studied [16]. The WPSNR output values were found to correlate with subjective mean opinion score (MOS) data at least as well as (MS-)SSIM; see [17],[18]. One particular advantage of the WPSNR is its backward compatibility with the conventional PSNR. Specifically, by defining the exponent $0 \le \beta \le 1$ [13] controlling the impact of the local visual activity measure on the block-wise distortion weighting parameter $w$ (see Sec. 2) as

$$\beta = 0, \tag{1}$$

all weights $w$ reduce to 1 and, as a result, the WPSNR becomes equivalent to the PSNR [13],[17]. It is shown in [13],[19] that a block-wise perceptual weighting of the local distortion, i.e., the sum of squared errors SSE between the decoded and original picture block signal,

$$D' = w \cdot D = w \cdot \sum_{[x,y] \in B} \left(s_c[x,y] - s[x,y]\right)^2, \tag{2}$$

can readily be utilized to govern the quantization step-size in an image or video encoder's bit allocation unit. In this way, an encoder can optimize its compression result for maximum performance (i.e., minimum mean weighted block SSE and, thus, maximum visual reconstruction quality) according to the WPSNR.

Although, as noted above, the WPSNR proved useful in the context of still-image coding and achieved similar, or even better, subjective performance than MS-SSIM-based visually motivated bit allocation in video coding [19], its use as a general-purpose VQA metric for video material of varying resolution, bit depth, and dynamic range is limited. This is evident from the relatively low correlation between the WPSNR output values and the corresponding MOS data available, e.g., from the study published in [10],[11] or the results of JVET's 2017 Call for Proposals (CfP) on video compression technologies with capability beyond HEVC [20]. In fact, this correlation was found to be worse than that of (MS-)SSIM and VMAF, particularly for ultra-high-definition (UHD) and mixed 8-bit/10-bit video content with a resolution of more than, say, 2048×1280 luma samples.

1.2  Outline of this paper

Given the necessity for an improvement of the WPSNR metric as indicated in Sec. 1.1 above, this paper focuses on and proposes modifications to several details of the WPSNR algorithm. After summarizing the block-wise operation of the WPSNR in Section 2, the paper follows up with descriptions of low-complexity extensions for motion picture processing (Section 3), improved performance in case of varying video quality (Section 4) or input/output bit depth (Section 5), and the handling of videos with very high and low resolutions (Section 6). Section 7 then summarizes the results of experimental evaluation of the respectively extended WPSNR, called XPSNR in this paper, on various MOS annotated video databases and Section 8 concludes the paper. Note that parts of this paper were previously published in [18].

2.  REVIEW OF BLOCK-BASED WPSNR

The WPSNR output for a codec $c$ and a video frame (or still image) stimulus $s$ is defined, similarly to PSNR, as

$$\mathrm{WPSNR}_{c,s} = 10 \cdot \log_{10}\!\left(\frac{W \cdot H \cdot \left(2^{BD} - 1\right)^2}{\sum_{k \in s} D'_k}\right), \tag{3}$$

where $W$ and $H$ are the luma-channel width and height, respectively, of $s$, $BD$ is the coding bit depth per sample,

$$D'_k = w_k \cdot \sum_{[x,y] \in B_k} \left(s_c[x,y] - s[x,y]\right)^2 \tag{4}$$

is the equivalent of (2) for block $B_k$ at index $k$, with $x$, $y$ as the horizontal and vertical sample coordinates, and

$$w_k = \left(\frac{a_{\mathrm{pic}}}{a_k}\right)^{\beta} \quad \text{with exponent } \beta = 0.5 \tag{5}$$

represents the visual sensitivity weight (a scale factor) associated with the $N \times N$ sized $B_k$ and calculated from the block's spatial activity measure $a_k$ and an average overall activity $a_{\mathrm{pic}}$. Details can be found in [17]–[19].

$$N = \mathrm{round}\!\left(128 \cdot \sqrt{\frac{W \cdot H}{3840 \cdot 2160}}\right) \tag{6}$$

was chosen since, for the commonly used HD and UHD resolutions of 1920×1080 and 3840×2160 pixels, this choice conveniently aligns with the largest block size in modern video codecs. $a_{\mathrm{pic}}$ is defined empirically such that, on average, $w_k \approx 1$ over a large set of test images and video frames with a specified resolution $W \cdot H$ and bit depth $BD$ [14]; see also Sec. 5. Hence, as indicated in Sec. 1.1, the WPSNR is a generalization of the PSNR by means of a block-wise weighting of the assessed SSE. For video signals, the frame-wise logarithmic $\mathrm{WPSNR}_{c,s}$ values are averaged arithmetically to get a single result:

$$\mathrm{WPSNR}_c = \frac{1}{F} \cdot \sum_{i=1}^{F} \mathrm{WPSNR}_{c,s_i}, \tag{7}$$

with $F$ denoting the evaluated number of video frames.

3.  EXTENSION FOR MOVING PICTURES

The spatially adaptive WPSNR method of [17],[19] and Sec. 2 can easily be extended to motion picture signals $s_i$, where $i$ is the frame index in the video sequence, by introducing a temporal adaptation into the calculation of the visual activity $a_k$. Given that in our prior studies,

$$a_k = \max\!\left(a_{\min}^2,\ \left(\frac{1}{4N^2} \sum_{[x,y] \in B_k} \left|h_{s_i}[x,y]\right|\right)^{2}\right), \tag{8}$$
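To make the block-wise operation reviewed in Sec. 2 concrete, the following Python sketch computes the weighted SSE per block, the frame-level WPSNR, the resolution-dependent block size, and the arithmetic average over frames. It is only an illustration under stated assumptions, not the reference implementation: the function names, the representation of a frame as a 2-D list of luma samples, and the `activity` callback (a hypothetical stand-in for the computation of the block activity) are all invented here. The average activity is simply passed in as a parameter, and the default exponent of 0.5 and the normalization by 3840·2160 samples are likewise treated as assumptions.

```python
import math


def block_size(width, height):
    """Block size N scaled with resolution (assumed UHD reference of 3840x2160)."""
    n = round(128 * math.sqrt(width * height / (3840 * 2160)))
    return max(1, n)  # guard for very small test pictures (assumption)


def wpsnr_frame(orig, dec, activity, a_pic, bit_depth, beta=0.5):
    """Frame-level WPSNR from original and decoded luma (2-D lists).

    activity(x0, y0, n) is a hypothetical callback returning the positive
    visual activity of the block whose top-left sample is (x0, y0).
    """
    height, width = len(orig), len(orig[0])
    n = block_size(width, height)
    weighted_sse = 0.0
    for y0 in range(0, height, n):
        for x0 in range(0, width, n):
            # Sum of squared errors over the (possibly cropped) block
            sse = sum((dec[y][x] - orig[y][x]) ** 2
                      for y in range(y0, min(y0 + n, height))
                      for x in range(x0, min(x0 + n, width)))
            # Visual sensitivity weight of the block
            w_k = (a_pic / activity(x0, y0, n)) ** beta
            weighted_sse += w_k * sse
    if weighted_sse == 0:
        return math.inf  # identical pictures
    peak = (2 ** bit_depth - 1) ** 2
    return 10 * math.log10(width * height * peak / weighted_sse)


def wpsnr_video(frame_values):
    """Arithmetic mean of the frame-wise WPSNR values."""
    return sum(frame_values) / len(frame_values)
```

Calling `wpsnr_frame` with `beta=0.0` sets every weight to 1, so the result equals the conventional PSNR, which mirrors the backward compatibility noted in Sec. 1.1.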


© International Telecommunication Union, 2020