Page 89 - ITU Journal, ICT Discoveries, Volume 3, No. 1, June 2020 Special issue: The future of video and immersive media
where $h_{s_i} = s_i * H_s$ is the signal resulting from filtering $s_i$ with the spatial highpass filter $H_s$, the temporal adaptation can be integrated by adding to $h_{s_i}$ the weighted result $h_{t_i} = s_i * H_t$ of a temporal highpass filtering step:

$\hat{a}_k = \max\left(a_{\min}^2,\ \left(\frac{1}{4N^2} \sum_{[x,y] \in B_k} \left|h_{s_i}[x,y]\right| + \gamma \left|h_{t_i}[x,y]\right|\right)^2\right)$. (8)

Note that the lower limit $a_{\min}$ remains at $a_{\min} = 2^{BD-6}$ as in [17]. Two simple highpass filters were found useful. The first one, a first-order FIR applied for frame rates of 30 Hz or lower, is $h_{t_i}[x,y] = s_i[x,y] - s_{i-1}[x,y]$, and the second one, a second-order FIR used, accordingly, for frame rates above 30 and up to 60 Hz, is defined as $h_{t_i}[x,y] = s_i[x,y] - 2 s_{i-1}[x,y] + s_{i-2}[x,y]$. In other words, one or two previous frames are used to obtain a simple estimate of the temporal activity in each block $B_k$ of each signal s over time. Naturally, for frame rates higher than 60 Hz, a third-order FIR could be specified, but due to a lack of correspondingly recorded content, such operating points have not been examined yet. The dependency of the filter order of $H_t$ on the frame rate is based upon psychovisual considerations: the limited temporal (highpass-like) integration of visual stimuli in human perception [21] implies that a shorter filter impulse response should be employed at relatively low frame rates than at higher ones. Naturally, filters which more accurately model the nonlinear temporal contrast sensitivity of the human visual system could be used, but for the sake of simplicity and low complexity, such an option is not considered here. Note, also, that taking the absolute values of the first-order highpass outputs as above is identical to the “absolute value of temporal information” (ATI) filter described in [22].

The relative weight $\gamma$ is an experimentally determined constant for which $\gamma = 2$ was selected. To compensate for the increased variance in $\hat{a}_k$ relative to $a_k$ after the introduction of term $|h_{t_i}|$ (the sum may increase while $a_{\min}$ remains unchanged), $a_{pic}$ is modified accordingly [18], resulting in $\hat{a}_{pic}$ which, in turn, yields the weight

$\hat{w}_k = \left(\frac{\hat{a}_{pic}}{\hat{a}_k}\right)^{\beta}$ with, again, $\beta = 0.5$ (9)

as a spatio-temporal visual sensitivity measure. It must be noted that the temporal activity component of $\hat{a}_k$ is a quite crude, but very low-complexity, approximation of a block-wise motion estimation operation, as typically performed in all modern video codecs. Evidently, more elaborate, but computationally more complex, activity metrics accounting for block-internal motion between frames i, i–1 and, for high frame rates, i–2 before deriving $h_{t_i}$ may be devised [23],[24]. Such designs, which may use neural networks [25] or estimations of multi-scale statistical models [26], are not considered here since one objective is to maintain very low complexity.

[Fig. 2 plot: WPSNR in dB over frame index i; curves: frame-wise (3), WPSNR (7), WPSNR′ (10)]
Fig. 2 – Effect of different temporal WPSNR averaging methods on coded video with visual quality drop (MarketPlace, HD, 10s [20]).

4. TEMPORALLY VARYING VIDEO QUALITY

It was described in Sec. 2 that, for video sequences, the traditional approach is to average the individual frame (W)PSNR values so as to obtain a single measurement value for a complete video. It was observed [18] that, for compressed video content which strongly varies in visual quality over time, such averaging of frame-wise model output may not always correlate well with MOS values provided by human observers. The averaging of logarithmic (W)PSNR values appears to be especially suboptimal on decoded video material of high overall visual quality in which, however, brief passages exhibit relatively low quality. With the growing popularity of rate-adaptive video streaming, particularly on mobile devices, such situations actually occur quite frequently. It was discovered experimentally that, under such circumstances, non-expert viewers assign relatively low scores to the tested video (compared with a video with balanced visual quality) during subjective VQA tasks, even if the majority of frames of the compressed video are of excellent quality to their eyes. This observation, which is confirmed by QoE-related feedback of consumers as reported by, e.g., Netflix [12], indicates that the log-domain average WPSNR values of (7) tend to overestimate the subjective quality in such cases.

A simple solution to this problem is to apply a square-mean-root (SMR) approach [27] which takes the arithmetic average of the square roots of the linear-domain frame-wise WPSNR data (before taking their base-10 logarithms) and, only thereafter, applies the logarithm:

$\mathrm{WPSNR}' = 20 \cdot \log_{10}\left(\frac{F \cdot \sqrt{W \cdot H} \cdot (2^{BD} - 1)}{\sum_{i=1}^{F} \sqrt{\sum_{k \in i} w_k \, \mathrm{SSE}_k}}\right)$, (10)

where $F$ is the number of frames and $\mathrm{SSE}_k$ abbreviates the sum of squared errors between the decoded and the original samples in block $B_k$ of frame i. Note the use of constant 20 in (10) instead of 10 in (3), representing the linear-domain squaring operation. A comparison of WPSNR and WPSNR′ is shown in Fig. 2, with a 0.35 dB lower value of the latter in this example.

5. VARYING INPUT OR OUTPUT BIT DEPTH

Typically, the input and output bit depths of the color planes of a video presentation are held constant for a specific distribution path. Sometimes, however, it may
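As an illustration of the two low-complexity operations described on this page, the following minimal Python sketch shows the frame-rate-dependent temporal highpass of Sec. 3 and the SMR averaging of (10) next to a plain log-domain average. It is a sketch under stated assumptions, not the reference implementation: the function names are hypothetical, frames are assumed to be 2-D lists of luma samples, the per-frame weighted squared-error sums (the denominator terms of the frame-wise WPSNR) are assumed to be supplied by the caller, and (7) is assumed to be the arithmetic mean of the frame-wise dB values.

```python
import math


def temporal_highpass_abs(frames, i, fps):
    """Absolute temporal highpass output |h_t| for frame i (2-D list).

    Assumption: frames[j] is a 2-D list of samples of frame j.
    fps <= 30      -> first-order FIR:  s_i - s_{i-1}
    30 < fps <= 60 -> second-order FIR: s_i - 2*s_{i-1} + s_{i-2}
    """
    if fps > 60:
        raise ValueError("frame rates above 60 Hz are not covered in the text")
    order = 1 if fps <= 30 else 2
    if i < order:
        raise ValueError("not enough previous frames for this filter order")
    cur, prev = frames[i], frames[i - 1]
    if order == 1:
        return [[abs(c - p) for c, p in zip(row_c, row_p)]
                for row_c, row_p in zip(cur, prev)]
    prev2 = frames[i - 2]
    return [[abs(c - 2 * p + q) for c, p, q in zip(row_c, row_p, row_q)]
            for row_c, row_p, row_q in zip(cur, prev, prev2)]


def log_average_wpsnr(wsse_per_frame, width, height, bit_depth):
    """Mean of the frame-wise dB values (assumed form of the log-domain average)."""
    peak = width * height * (2 ** bit_depth - 1) ** 2
    values = [10.0 * math.log10(peak / d) for d in wsse_per_frame]
    return sum(values) / len(values)


def smr_wpsnr(wsse_per_frame, width, height, bit_depth):
    """Square-mean-root average: square roots are summed before the logarithm."""
    numerator = len(wsse_per_frame) * math.sqrt(width * height) * (2 ** bit_depth - 1)
    denominator = sum(math.sqrt(d) for d in wsse_per_frame)
    return 20.0 * math.log10(numerator / denominator)


if __name__ == "__main__":
    # Hypothetical data: nine good frames and one frame with a quality drop.
    wsse = [16.0] * 9 + [1600.0]
    print("log-domain average:", log_average_wpsnr(wsse, 4, 4, 8))
    print("SMR average:       ", smr_wpsnr(wsse, 4, 4, 8))
```

For constant per-frame distortion both averages coincide. As soon as the distortion varies over time, the SMR value is the lower of the two, because the arithmetic mean of the root distortions is never smaller than their geometric mean. This is in line with the lower WPSNR′ value reported for Fig. 2.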
© International Telecommunication Union, 2020 67