[Figure: "Model Communication in (Embedded) ML Pipelines" – four panels: Federated Learning, Peer-to-Peer Learning, Distributed Training, On-Device Inference. Legend: C = Client, S = Server; symbols denote Training Data, Test Data, Model(-Update) and Prediction.]
Fig. 2 – Model communication at the training and inference stages of different Embedded ML pipelines. From left to right: (1) Federated learning allows multiple clients to jointly train a neural network on their combined data, without any of the local clients having to compromise the privacy of their data. This is achieved by iteratively exchanging model updates with a centralized server. (2) In scenarios where it is undesirable to have a centralized entity coordinating the collaborative training process, peer-to-peer learning offers a potential solution. In peer-to-peer learning the clients directly exchange parameter updates with their neighbors according to some predefined graph topology. (3) In the data center setting, training speed can be drastically increased by splitting the workload among multiple training devices via distributed training. This however requires frequent communication of model gradients between the learner devices. (4) On-device inference protects user privacy and allows fast and autonomous predictions, but comes at the cost of communicating trained models from the server to the individual users.
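To make the model-update exchange of panel (1) concrete, the following sketch runs one round of federated averaging in Python. The linear-model client update and all names are illustrative assumptions, not the protocol of any particular system.

    # One federated-averaging round: clients train locally on private data;
    # only model parameters (never the data) travel to the server.
    import numpy as np

    def local_update(weights, data, lr=0.1):
        # Stand-in for local training: one gradient step of a linear model.
        x, y = data
        grad = x.T @ (x @ weights - y) / len(y)
        return weights - lr * grad

    def federated_round(server_weights, client_datasets):
        updates = [local_update(server_weights.copy(), d) for d in client_datasets]
        return np.mean(updates, axis=0)   # server aggregates by averaging

    rng = np.random.default_rng(0)
    clients = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(3)]
    w = np.zeros(4)
    for _ in range(10):
        w = federated_round(w, clients)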
proposed which vary with respect to the computational effort of encoding and compression results. We want to stress that neural network compression is a very active field of research and considers issues of communication efficiency, alongside other factors such as memory and computation complexity, energy efficiency and specialized hardware. While we only focus on the communication aspect of neural network compression, a more comprehensive survey can be found e.g. in [19].
In neural network compression it is usually assumed that the sender of the neural network has access to the entire training data and sufficient computational resources to retrain the model. By using training data during the compression process, the harmful effects of compression can be alleviated. The three most popular methods for trained compression are pruning, distillation and trained quantization.
Pruning techniques [40][12][28][84] aim to reduce the entropy of the neural network representation by forcing a large number of elements to zero. This is achieved by modifying the training objective to promote sparsity, typically by adding an ℓ1 or ℓ2 regularization penalty to the weights, although Bayesian approaches [53] have also been proposed. Pruning techniques have been shown to achieve compression rates of more than one order of magnitude, depending on the degree of overparameterization in the network [28].
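As a minimal illustration of sparsity-promoting training followed by pruning, the sketch below adds an ℓ1 penalty to the loss and then zeroes all weights below a threshold; the model, data and hyperparameters are assumptions for illustration, not those of the cited methods.

    # l1-regularized training followed by magnitude pruning (toy example).
    import torch
    import torch.nn as nn

    model = nn.Linear(100, 10)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x, y = torch.randn(64, 100), torch.randint(0, 10, (64,))

    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        l1 = sum(p.abs().sum() for p in model.parameters())  # sparsity penalty
        (loss + 1e-3 * l1).backward()
        opt.step()

    # Force small weights to exactly zero: the sparse representation has
    # lower entropy and can be encoded much more efficiently.
    with torch.no_grad():
        for p in model.parameters():
            p[p.abs() < 1e-2] = 0.0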
Distillation methods [31] can be used to transfer the knowledge of a larger model into a considerably smaller architecture. This is achieved by using the predictions of the larger network as soft labels for the smaller network.
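The soft-label objective can be stated compactly. The sketch below follows the common formulation of [31]; the temperature and mixing weight are chosen purely for illustration.

    # Distillation loss: the student matches the teacher's softened outputs.
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                        F.softmax(teacher_logits / T, dim=1),
                        reduction="batchmean") * T * T   # rescale soft gradients
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard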
Trained quantization methods restrict the bitwidth of the neural network during training, e.g., reducing the precision from 32 bit to 8 bit [77]. Other methods generalize this idea and aim to directly minimize the entropy of the neural network representation during training [80]. It is important to note, however, that all of these methods require re-training of the network; they are therefore computationally expensive and can only be applied if the full training data is available.
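A minimal sketch of such bitwidth-restricted training, assuming a standard "fake quantization" forward pass with a straight-through gradient; the concrete scheme of [77] may differ in its details.

    # Quantization-aware training step: the forward pass sees 8 bit weights,
    # the backward pass sees the identity (straight-through estimator).
    import torch

    def fake_quantize(w, bits=8):
        # Symmetric uniform quantizer over the weight range.
        scale = w.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
        q = torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
        return w + (q * scale - w).detach()  # quantized forward, identity backward

    # Usage inside a model's forward: out = x @ fake_quantize(self.weight).T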
In situations where compression needs to be fast and/or no training data is available at the sending node, trained compression techniques cannot be applied and one has to resort to ordinary lossy compression methods. Among these, (vector) quantization methods [18][17] and efficient matrix decompositions [68][87] are popular.
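As one concrete instance of a training-free matrix decomposition, a weight matrix can be replaced by a rank-r factorization computed via truncated SVD; the matrix size and rank below are illustrative choices.

    # Low-rank compression: store r*(m+n) numbers instead of m*n.
    import numpy as np

    def low_rank_compress(W, r):
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        return U[:, :r] * s[:r], Vt[:r]   # W is approximated by A @ B

    W = np.random.randn(512, 256)
    A, B = low_rank_compress(W, r=32)     # ~5x fewer parameters here
    rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)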
A middle ground between trained and ordinary lossy compression is offered by methods which only require some data to guide the compression process. These approaches use different relevance measures, based e.g. on the diagonal of the Hessian [29], Fisher information [69] or layer-wise relevance [85][4], to determine which parameters of the network are important.
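A sketch of one such data-guided criterion, using squared gradients over a small calibration set as an empirical Fisher-diagonal relevance score; the exact measures of [29][69][85][4] differ in their details, and the setup here is hypothetical.

    # Rank parameters by an empirical Fisher diagonal, prune the least relevant.
    import torch
    import torch.nn as nn

    model = nn.Linear(100, 10)
    calib_x, calib_y = torch.randn(256, 100), torch.randint(0, 10, (256,))

    fisher = [torch.zeros_like(p) for p in model.parameters()]
    for i in range(0, 256, 32):
        model.zero_grad()
        nn.functional.cross_entropy(model(calib_x[i:i+32]), calib_y[i:i+32]).backward()
        for f, p in zip(fisher, model.parameters()):
            f += p.grad ** 2   # accumulate squared gradients

    # Zero out the 90% of weights with the lowest relevance scores.
    threshold = torch.cat([f.flatten() for f in fisher]).quantile(0.9)
    with torch.no_grad():
        for f, p in zip(fisher, model.parameters()):
            p[f < threshold] = 0.0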
Many of the above described techniques are somewhat orthogonal and can be combined. For instance, the seminal "Deep Compression" method [27] combines pruning with learned quantization and Huffman coding to achieve compression rates of up to ×49 on a popular VGG model, without any loss in accuracy. More recently, the DeepCABAC [79] algorithm, developed within the MPEG standardization initiative on neural network compression², makes use of learned quantization and the very efficient CABAC encoder [51] to further increase the compression rate to ×63.6 on the same architecture.
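A toy version of such a combined pipeline, assuming magnitude pruning, a k-means codebook in place of the learned quantizer, and an entropy estimate in place of an actual Huffman coder; the threshold and codebook size are illustrative, not those of [27].

    # Prune -> codebook-quantize -> estimate entropy-coded size.
    import numpy as np

    def kmeans1d(x, k, iters=20):
        centers = np.linspace(x.min(), x.max(), k)   # initial codebook
        for _ in range(iters):
            codes = np.abs(x[:, None] - centers).argmin(axis=1)
            for j in range(k):
                if np.any(codes == j):
                    centers[j] = x[codes == j].mean()
        return centers, codes

    w = np.random.randn(100_000)          # stand-in for trained weights
    w[np.abs(w) < 0.5] = 0.0              # 1) prune small weights
    nz = w[w != 0]
    centers, codes = kmeans1d(nz, k=16)   # 2) 16-entry learned codebook
    p = np.bincount(codes, minlength=16) / len(codes)
    bits = -(p[p > 0] * np.log2(p[p > 0])).sum() * len(codes)  # 3) entropy bound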
² https://mpeg.chiariglione.org/standards/mpeg-7/compression-neural-networks-multimedia-content-description-and-analysis