[Figure 2 graphic, "Model Communication in (Embedded) ML Pipelines": four panels showing Federated Learning, Peer-to-Peer Learning, Distributed Training and On-Device Inference. Legend: C = client, S = server; training data, test data, model(-update), prediction.]

Fig. 2 – Model communication at the training and inference stages of different Embedded ML pipelines. From left to right: (1) Federated learning allows multiple clients to jointly train a neural network on their combined data, without any of the local clients having to compromise the privacy of their data. This is achieved by iteratively exchanging model updates with a centralized server. (2) In scenarios where it is undesirable to have a centralized entity coordinating the collaborative training process, peer-to-peer learning offers a potential solution. In peer-to-peer learning the clients directly exchange parameter updates with their neighbors according to some predefined graph topology. (3) In the data center setting, training speed can be drastically increased by splitting the workload among multiple training devices via distributed training. This, however, requires frequent communication of model gradients between the learner devices. (4) On-device inference protects user privacy and allows fast and autonomous predictions, but comes at the cost of communicating trained models from the server to the individual users.
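As a concrete illustration of the client-server exchange in panel (1), the following is a minimal sketch of one round of federated averaging; the model, the client data loaders, the learning rate and the uniform averaging are illustrative assumptions and not details taken from the figure.

```python
import copy
import torch

def local_update(global_model, data_loader, epochs=1, lr=0.01):
    """One client: train a private copy of the global model on local data only."""
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in data_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
    return model.state_dict()  # only the model update leaves the device

def federated_round(global_model, client_loaders):
    """Server: collect the clients' models and average their parameters."""
    updates = [local_update(global_model, loader) for loader in client_loaders]
    averaged = {key: torch.stack([u[key].float() for u in updates]).mean(dim=0)
                for key in updates[0]}
    global_model.load_state_dict(averaged)
    return global_model
```

In practice the averaging is usually weighted by the size of each client's local data set; the exchanged state dictionaries are the kind of objects that the compression techniques discussed in this section operate on.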


proposed, which vary with respect to the computational effort of encoding and the compression results they achieve. We want to stress that neural network compression is a very active field of research which considers issues of communication efficiency alongside other factors such as memory and computation complexity, energy efficiency and specialized hardware. While we only focus on the communication aspect of neural network compression, a more comprehensive survey can be found e.g. in [19].

In neural network compression it is usually assumed that the sender of the neural network has access to the entire training data and to sufficient computational resources to retrain the model. By using training data during the compression process, the harmful effects of compression can be alleviated. The three most popular methods for trained compression are pruning, distillation and trained quantization.

Pruning techniques [40][12][28][84] aim to reduce the entropy of the neural network representation by forcing a large number of its elements to zero. This is achieved by modifying the training objective in order to promote sparsity, typically by adding an ℓ1 or ℓ2 regularization penalty to the weights, although Bayesian approaches [53] have also been proposed. Pruning has been shown to achieve compression rates of more than one order of magnitude, depending on the degree of overparameterization in the network [28].
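As a rough illustration of such a sparsity-promoting objective, the sketch below adds an ℓ1 penalty to an otherwise standard training loss and afterwards zeroes all weights whose magnitude falls below a threshold; the penalty strength and threshold are arbitrary illustrative values, not taken from the cited works.

```python
import torch

def l1_regularized_loss(model, task_loss, l1_strength=1e-4):
    """Task loss plus an l1 penalty that pushes many weights towards zero."""
    penalty = sum(p.abs().sum() for p in model.parameters())
    return task_loss + l1_strength * penalty

def magnitude_prune(model, threshold=1e-3):
    """After training, force small-magnitude weights to exactly zero."""
    with torch.no_grad():
        for p in model.parameters():
            p[p.abs() < threshold] = 0.0
    zeros = sum((p == 0).sum().item() for p in model.parameters())
    total = sum(p.numel() for p in model.parameters())
    print(f"sparsity: {zeros / total:.1%}")
```

The resulting sparse weight tensors have low entropy and can therefore be encoded much more compactly than the original dense parameters.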
Distillation methods [31] can be used to transfer the knowledge of a larger model into a considerably smaller architecture. This is achieved by using the predictions of the larger network as soft labels for the smaller network.
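A common way to implement the soft-label idea is a temperature-scaled distillation loss; the sketch below is one possible formulation, with the temperature and mixing weight chosen arbitrarily for illustration rather than prescribed by [31].

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Mix the usual hard-label loss with a KL term towards the teacher's
    softened predictions (the 'soft labels')."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2  # common temperature-squared scaling of the soft term
    return alpha * hard + (1 - alpha) * soft
```

Only the small student network is communicated and deployed; the large teacher is needed solely during training.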
Trained quantization methods restrict the bitwidth of the neural network during training, e.g., reducing the precision from 32 bit to 8 bit [77]. Other methods generalize this idea and aim to directly minimize the entropy of the neural network representation during training [80]. It is important to note, however, that all of these methods require re-training of the network; they are thus computationally expensive and can only be applied if the full training data is available.
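One generic way to restrict the bitwidth during training is "fake" quantization with a straight-through gradient estimator, sketched below for uniform 8-bit weights; this is a simplified illustration and not the specific scheme of [77] or [80].

```python
import torch

class FakeQuantize(torch.autograd.Function):
    """Round weights to n-bit levels in the forward pass, pass gradients
    through unchanged in the backward pass (straight-through estimator)."""

    @staticmethod
    def forward(ctx, w, num_bits=8):
        qmax = 2 ** num_bits - 1
        scale = (w.max() - w.min()).clamp(min=1e-8) / qmax
        q = torch.round((w - w.min()) / scale)   # integer grid in [0, qmax]
        return q * scale + w.min()               # dequantized value used forward

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # gradient w.r.t. w only; none for num_bits

def quantize_weights(w, num_bits=8):
    return FakeQuantize.apply(w, num_bits)
```

During training the quantized weights are used in the forward pass while full-precision shadow weights keep receiving the unmodified gradients, so the network learns to tolerate the reduced precision.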
In situations where compression needs to be fast and/or no training data is available at the sending node, trained compression techniques cannot be applied and one has to resort to ordinary lossy compression methods. Among these, (vector) quantization methods [18][17] and efficient matrix decompositions [68][87] are popular.
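As an example of such a post-training, data-free method, the sketch below compresses a single weight matrix with a truncated SVD; NumPy is assumed, and the matrix size and rank of 64 are arbitrary choices that trade accuracy against size.

```python
import numpy as np

def low_rank_compress(W, rank=64):
    """Replace an (m x n) weight matrix by two factors of total size
    (m + n) * rank, much smaller whenever rank << min(m, n)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]   # (m x rank), singular values folded in
    B = Vt[:rank, :]             # (rank x n)
    return A, B                  # W is approximated by A @ B

W = np.random.randn(1024, 1024).astype(np.float32)
A, B = low_rank_compress(W, rank=64)
ratio = W.size / (A.size + B.size)   # here: 1024*1024 / (2*1024*64) = 8x
error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
print(f"compression ratio = {ratio:.1f}x, relative error = {error:.3f}")
```

Because no retraining is involved, the accuracy loss grows quickly with the compression ratio, which is why such methods are reserved for settings where training data or compute is unavailable.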
A middle ground between trained and ordinary lossy compression is offered by methods which only require some data to guide the compression process. These approaches use different relevance measures, based e.g. on the diagonal of the Hessian [29], Fisher information [69] or layer-wise relevance [85][4], to determine which parameters of the network are important.
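To illustrate how a small amount of data can guide compression, the sketch below estimates a per-parameter relevance score from the empirical diagonal Fisher information (squared gradients averaged over a few calibration batches); parameters with the lowest scores would be pruned or quantized most aggressively. This is a generic approximation, not the exact procedure of the cited works.

```python
import torch

def fisher_scores(model, calib_loader, loss_fn, num_batches=8):
    """Empirical diagonal Fisher: average squared gradient per parameter."""
    scores = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    seen = 0
    for x, y in calib_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                scores[name] += p.grad.detach() ** 2
        seen += 1
        if seen >= num_batches:
            break
    return {name: s / max(seen, 1) for name, s in scores.items()}
```

Such scores are cheap to obtain compared to full retraining, yet typically preserve accuracy better than purely data-free compression.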
Many of the techniques described above are somewhat orthogonal and can be combined. For instance, the seminal "Deep Compression" method [27] combines pruning with learned quantization and Huffman coding to achieve compression rates of up to 49× on a popular VGG model without any loss in accuracy. More recently, the DeepCABAC [79] algorithm, developed within the MPEG standardization initiative on neural network compression², makes use of learned quantization and the very efficient CABAC encoder [51] to further increase the compression rate to 63.6× on the same architecture.

² https://mpeg.chiariglione.org/standards/mpeg-7/compression-neural-networks-multimedia-content-description-and-analysis




