niques [35] and biased compression methods in conjunction with error accumulation [44][43].

2.4 Distributed Training in the Data Center
Training modern neural network architectures with millions of parameters on huge datasets such as ImageNet can take a prohibitively long time, even on the latest high-end hardware. In distributed training in the data center, the computation of stochastic mini-batch gradients is parallelized over multiple machines in order to reduce training time. To keep the compute devices synchronized during this process, they need to communicate their locally computed gradient updates after every iteration, which results in very high-frequency communication of neural data. This communication is time consuming for large neural network architectures and limits the benefits of parallelization according to Amdahl's law [61].
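For reference, Amdahl's law in its standard form makes this limit explicit: if a fraction $p$ of the per-iteration work (here, the gradient computation) is parallelized over $n$ machines while the remaining fraction $1-p$ (here, communication and aggregation) stays serial, the achievable speedup is

$$S(n) \;=\; \frac{1}{(1-p) + p/n} \;\le\; \frac{1}{1-p},$$

so the serial communication share caps the benefit of adding machines, no matter how large $n$ becomes.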
Compression for Training in the Data Center: A large body of research has been devoted to the development of gradient compression techniques. These methods can be roughly organized into two groups: unbiased and biased compression methods. Unbiased (probabilistic) compression methods like QSGD [3], TernGrad [78] and [75] reduce the bitwidth of the gradient updates in such a way that the expected quantization error is zero. Since these methods can be easily understood within the framework of stochastic gradient based optimization, establishing convergence is straightforward. However, the compression gains achievable with unbiased quantization are limited, which makes these methods unpopular in practice.
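To make the unbiasedness property concrete, the following is a minimal NumPy sketch of a TernGrad-style stochastic ternarization operator; the function name and the per-tensor max-norm scaling are illustrative choices, not the exact scheme of [78]. Each entry survives with probability proportional to its magnitude, so the quantized tensor equals the input gradient in expectation.

```python
import numpy as np

def ternarize_unbiased(g: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Stochastically quantize a gradient to {-s, 0, +s} so that the
    quantized tensor equals g in expectation (TernGrad-style sketch)."""
    s = np.max(np.abs(g))                  # per-tensor scale
    if s == 0.0:
        return np.zeros_like(g)
    p = np.abs(g) / s                      # keep-probability in [0, 1]
    mask = rng.random(g.shape) < p         # E[mask] = |g| / s
    return np.sign(g) * s * mask           # hence E[output] = g
```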
Biased compression methods, on the other hand, empirically achieve much more aggressive compression rates, at the cost of inflicting a systematic error on the gradients upon quantization, which makes convergence analysis more challenging. An established technique to reduce the impact of biased compression on the convergence speed is error accumulation. In error accumulation, the compute nodes keep track of all quantization errors inflicted during training and add the accumulated errors to every newly computed gradient. This way, the gradient information which would otherwise be destroyed by aggressive quantization is retained and carried over to the next iteration. In a key theoretical contribution it was shown [63][39] that the asymptotic convergence rate of SGD is preserved under the application of all compression operators which satisfy a certain contraction property. These compression operators include random sparsification [63], top-k sparsification [49], low rank approximations [72], sketching [35] and deterministic binarization methods like signSGD [8].
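A minimal single-worker sketch of this mechanism, combining top-k sparsification with error accumulation, might look as follows; the class and function names are hypothetical, and the distributed aggregation step is omitted.

```python
import numpy as np

def top_k(g: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest-magnitude entries of g, zero out the rest."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

class ErrorFeedbackCompressor:
    """One worker's view of gradient compression with error
    accumulation: the residual destroyed by compression is replayed
    into the next step instead of being discarded."""
    def __init__(self, dim: int, k: int):
        self.residual = np.zeros(dim)      # accumulated quantization error
        self.k = k

    def compress_step(self, grad: np.ndarray) -> np.ndarray:
        corrected = grad + self.residual   # add carried-over error
        update = top_k(corrected, self.k)  # biased compression
        self.residual = corrected - update # store what was lost
        return update                      # this is what gets communicated
```

Because the residual is replayed rather than discarded, no gradient coordinate is suppressed forever, which is the intuition behind the contraction-property analyses of [63][39].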
All these methods come with different trade-offs with respect to achievable compression rate, computational overhead of encoding and decoding, and suitability for different model aggregation schemes. For instance, compression methods based on top-k sparsification with error accumulation [49] achieve impressive compression rates of more than ×500 at only a marginal loss of convergence speed in terms of training iterations; however, these methods also have relatively high computational overhead and do not harmonize well with all-reduce based parameter aggregation protocols [72].
The most typical connectivity structure in distributed training in the data center is an all-to-all connection topology, where all computing devices are directly connected via hard wire. An all-to-all connection allows for efficient model update aggregation via all-reduce operations [22]. However, to efficiently make use of these primitives, compressed representations need to be summable. This property is satisfied, for instance, by sketches [35] and low-rank approximations [72].
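The summability requirement can be illustrated with a linear sketch: because the sketching map is linear, the sum of the workers' sketches equals the sketch of the summed gradient, so all-reduce can operate directly on the compressed representation. The following is a toy count-sketch example, under the assumption that all workers share the same hash functions (here, the same random seed):

```python
import numpy as np

def count_sketch(g: np.ndarray, sign: np.ndarray, bucket: np.ndarray,
                 width: int) -> np.ndarray:
    """Project g into `width` buckets with random signs. The map is
    linear in g, so sketch(g1) + sketch(g2) == sketch(g1 + g2)."""
    sk = np.zeros(width)
    np.add.at(sk, bucket, sign * g)        # scatter-add into buckets
    return sk

rng = np.random.default_rng(0)
dim, width = 1000, 64
sign = rng.choice([-1.0, 1.0], size=dim)   # shared hash: same on all workers
bucket = rng.integers(0, width, size=dim)

g1, g2 = rng.normal(size=dim), rng.normal(size=dim)
lhs = count_sketch(g1, sign, bucket, width) + count_sketch(g2, sign, bucket, width)
rhs = count_sketch(g1 + g2, sign, bucket, width)
assert np.allclose(lhs, rhs)               # summable: all-reduce acts on sketches
```

By contrast, the sum of two top-k sparsified tensors is in general not itself k-sparse, which is one reason such methods harmonize poorly with all-reduce.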
3. RELATED CHALLENGES IN EMBEDDED ML

Despite the recent progress made in efficient deep neural network communication, many unresolved issues still remain. Some of the most pressing challenges for Embedded ML include:

Energy Efficiency: Since mobile and IoT devices usually have very limited computational resources, Embedded ML solutions are required to be energy efficient. Although many research works aim to reduce the complexity of models through neural architecture search [82], design energy-efficient neural network representations [81], or tailor energy-efficient hardware components [15], the energy efficiency of on-device inference is still a big challenge.

Convergence: An important theoretical concern when designing compression methods for distributed training schemes is that of convergence. While the convergence properties of vanilla stochastic gradient descent based algorithms and many of their distributed variants are well understood [10][38][46], the statistically non-iid nature of the clients' data in many Embedded ML applications still poses a set of novel challenges, especially when compression methods are used.

Privacy and Robustness: Embedded ML applications promise to preserve the privacy of the local datasets. However, multiple recent works have demonstrated that in adversarial settings information about the training data can be leaked via the parameter updates [32]. A combination of cryptographic techniques such as Secure Multi-Party Computation [25] and Trusted Execution Environments [64], as well as quantifiable privacy guarantees provided by differential privacy [23], can help to overcome these issues. However, it is still unclear how these techniques can be effectively combined with methods for compressed communication and what optimal trade-offs can be made between communication efficiency and privacy guarantees.
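As a rough illustration of the kind of mechanism differential privacy introduces into the communication pipeline, the following sketch clips a model update to a bounded L2 norm and adds Gaussian noise, in the style of DP-SGD; the parameter names are illustrative, and the sketch deliberately omits privacy accounting.

```python
import numpy as np

def privatize_update(update: np.ndarray, clip_norm: float,
                     noise_multiplier: float,
                     rng: np.random.Generator) -> np.ndarray:
    """Clip an update to bounded L2 norm, then add Gaussian noise
    (the Gaussian mechanism underlying DP-SGD-style training)."""
    scale = min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    clipped = update * scale
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise                 # what the client would communicate
```

Any compression operator would then have to act on this clipped, noisy update, which is precisely where the open question of jointly optimizing communication efficiency and privacy arises.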
Since privacy guarantees conceal information about the participating clients and their data, there is also an inherent trade-off between privacy and robustness, which