niques [35] and biased compression methods in conjunction with error accumulation [44][43].

2.4 Distributed Training in the Data Center

Training modern neural network architectures with millions of parameters on huge datasets such as ImageNet can take a prohibitively long time, even on the latest high-end hardware. In distributed training in the data center, the computation of stochastic mini-batch gradients is parallelized over multiple machines in order to reduce training time. To keep the compute devices synchronized during this process, they need to communicate their locally computed gradient updates after every iteration, which results in very frequent communication of neural data. This communication is time consuming for large neural network architectures and limits the benefits of parallelization according to Amdahl's law [61].
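As a rough, illustrative instance of this bound (the serial fraction s is introduced here for illustration and is not taken from [61]): if a fraction s of each training iteration is spent in communication that cannot be parallelized, Amdahl's law caps the achievable speedup on N workers at 1 / (s + (1 - s)/N) <= 1/s. With s = 0.1, for example, no amount of additional hardware can yield more than a 10x speedup.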
Compression for Training in the Data-Center: A large body of research has been devoted to the development of gradient compression techniques. These methods can be roughly organized into two groups: unbiased and biased compression methods. Unbiased (probabilistic) compression methods like QSGD [3], TernGrad [78] and [75] reduce the bitwidth of the gradient updates in such a way that the expected quantization error is zero. Since these methods can be easily understood within the framework of stochastic gradient-based optimization, establishing convergence is straightforward. However, the compression gains achievable with unbiased quantization are limited, which makes these methods unpopular in practice.
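To make the zero-expected-error property concrete, the following minimal Python sketch shows a TernGrad-style unbiased ternary quantizer. The per-tensor scaling, the function name and the averaging check are illustrative assumptions and do not reproduce the exact schemes of [3], [78] or [75].

    import numpy as np

    def unbiased_ternary_quantize(grad, rng):
        """Map each entry to {-s, 0, +s} by stochastic rounding so that
        the expected value of each quantized entry equals the original one."""
        s = np.max(np.abs(grad))                  # per-tensor scale
        if s == 0.0:
            return np.zeros_like(grad)
        keep_prob = np.abs(grad) / s              # P[entry survives] = |g_i| / s
        mask = rng.random(grad.shape) < keep_prob
        return np.sign(grad) * s * mask           # E[output] = grad, but higher variance

    rng = np.random.default_rng(0)
    g = rng.normal(size=10_000)
    avg = np.mean([unbiased_ternary_quantize(g, rng) for _ in range(500)], axis=0)
    print(np.abs(avg - g).mean())                 # shrinks toward 0 as more samples are averaged

Because the quantizer is unbiased, it can be treated as just another source of stochastic gradient noise, which is why convergence proofs carry over easily; the price is the limited compression gain noted above.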
Biased compression methods, on the other hand, empirically achieve much more aggressive compression rates, at the cost of inflicting a systematic error on the gradients upon quantization, which makes convergence analysis more challenging. An established technique to reduce the impact of biased compression on the convergence speed is error accumulation. In error accumulation, the compute nodes keep track of all quantization errors inflicted during training and add the accumulated errors to every newly computed gradient. This way, the gradient information which would otherwise be destroyed by aggressive quantization is retained and carried over to the next iteration. In a key theoretical contribution it was shown [63][39] that the asymptotic convergence rate of SGD is preserved under the application of all compression operators which satisfy a certain contraction property. These compression operators include random sparsification [63], top-k sparsification [49], low-rank approximations [72], sketching [35] and deterministic binarization methods like signSGD [8].
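The error accumulation mechanism described above can be sketched in a few lines of Python; the top-k compressor and the class and variable names are illustrative assumptions rather than the exact procedures of [49] or [63].

    import numpy as np

    def top_k(x, k):
        """Keep only the k largest-magnitude entries (a biased compressor)."""
        out = np.zeros_like(x)
        idx = np.argpartition(np.abs(x), -k)[-k:]
        out[idx] = x[idx]
        return out

    class ErrorFeedbackWorker:
        """A compute node that carries its quantization error over to the next step."""

        def __init__(self, dim, k):
            self.k = k
            self.residual = np.zeros(dim)         # accumulated quantization error

        def compress(self, grad):
            corrected = grad + self.residual      # re-inject previously lost information
            update = top_k(corrected, self.k)     # aggressive, biased compression
            self.residual = corrected - update    # remember what was dropped this round
            return update                         # only this sparse update is communicated

Each worker transmits only the sparse update; the part of the gradient that was suppressed is not lost but reappears in the corrected gradient of the next iteration, which is the intuition behind the contraction-property results cited above.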
All these methods come with different trade-offs with respect to the achievable compression rate, the computational overhead of encoding and decoding, and their suitability for different model aggregation schemes. For instance, compression methods based on top-k sparsification with error accumulation [49] achieve impressive compression rates of more than ×500 at only a marginal loss of convergence speed in terms of training iterations; however, these methods also have relatively high computational overhead and do not harmonize well with all-reduce-based parameter aggregation protocols [72].

The most typical connectivity structure in distributed training in the data center is an all-to-all connection topology, where all computing devices are directly connected via hard-wire. An all-to-all connection allows for efficient model update aggregation via all-reduce operations [22]. However, to efficiently make use of these primitives, compressed representations need to be summable. This property is satisfied for instance by sketches [35] and low-rank approximations [72].
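The following sketch illustrates why summability matters. A linear sketch, represented here by a plain random projection as an illustrative stand-in for the sketches of [35], commutes with summation, so all-reduce can simply add the compressed messages; per-worker top-k sparsification does not have this property.

    import numpy as np

    rng = np.random.default_rng(1)
    dim, sketch_dim, workers = 1_000, 64, 4

    P = rng.normal(size=(sketch_dim, dim)) / np.sqrt(sketch_dim)   # shared linear sketch
    grads = [rng.normal(size=dim) for _ in range(workers)]

    # Summing the per-worker sketches (what all-reduce computes) equals
    # the sketch of the aggregated gradient.
    print(np.allclose(sum(P @ g for g in grads), P @ sum(grads)))  # True

    def top_k(x, k):
        out = np.zeros_like(x)
        idx = np.argpartition(np.abs(x), -k)[-k:]
        out[idx] = x[idx]
        return out

    # Summing per-worker top-k gradients generally differs from the top-k of the
    # summed gradient, so the aggregate is no longer a valid top-k representation.
    print(np.allclose(sum(top_k(g, 10) for g in grads), top_k(sum(grads), 10)))  # typically False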
3. RELATED CHALLENGES IN EMBEDDED ML

Despite the recent progress made in efficient deep neural network communication, many unresolved issues still remain. Some of the most pressing challenges for Embedded ML include:

Energy Efficiency: Since mobile and IoT devices usually have very limited computational resources, Embedded ML solutions are required to be energy efficient. Although many research works aim to reduce the complexity of models through neural architecture search [82], design energy-efficient neural network representations [81], or tailor energy-efficient hardware components [15], the energy efficiency of on-device inference is still a big challenge.

Convergence: An important theoretical concern when designing compression methods for distributed training schemes is that of convergence. While the convergence properties of vanilla stochastic gradient descent based algorithms and many of their distributed variants are well understood [10][38][46], the statistical non-iid-ness of the clients' data in many Embedded ML applications still poses a set of novel challenges, especially when compression methods are used.

Privacy and Robustness: Embedded ML applications promise to preserve the privacy of the local datasets. However, multiple recent works have demonstrated that in adversarial settings information about the training data can be leaked via the parameter updates [32]. A combination of cryptographic techniques such as Secure Multi-Party Computation [25] and Trusted Execution Environments [64], as well as the quantifiable privacy guarantees provided by differential privacy [23], can help to overcome these issues. However, it is still unclear how these techniques can be effectively combined with methods for compressed communication and what optimal trade-offs can be made between communication efficiency and privacy guarantees.
Since privacy guarantees conceal information about the participating clients and their data, there is also an inherent trade-off between privacy and robustness, which




