ple local epochs without any loss in convergence speed if the clients hold iid data (meaning that all clients' data was sampled independently from the same distribution) [52]. Communication delay reduces both the downstream communication from the server to the clients and the upstream communication from the clients to the server equally. It also reduces the total number of communication rounds, which is especially beneficial under the constraints of the federated setting, as it mitigates the impact of network latency and allows the clients to perform computation offline and delay communication until a fast network connection is available.
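As an illustration, the following minimal sketch (in Python, with illustrative names such as local_sgd and federated_round; not taken from [52]) shows how communication delay can be realized as multiple local SGD epochs per round, with the server averaging the resulting models only once per round:

import numpy as np

def local_sgd(w_global, data, epochs=5, lr=0.1):
    # Train offline for several epochs before communicating (communication delay).
    w = w_global.copy()
    for x, y in data * epochs:
        grad = 2 * (w @ x - y) * x   # gradient of a squared loss, as a stand-in
        w -= lr * grad
    return w

def federated_round(w_global, client_datasets, epochs=5):
    # One communication round: every client trains locally, the server averages.
    client_models = [local_sgd(w_global, d, epochs) for d in client_datasets]
    return np.mean(client_models, axis=0)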
However, several recent studies show that communication delay drastically slows down convergence in non-iid settings, where the local clients' data distributions are highly divergent [89][59]. Different methods have been proposed to improve communication delay in the non-iid setting, with varying success: FedProx [56] limits the divergence of the locally trained models by adding a regularization constraint. Other authors [89] propose mixing iid public training data into every local client's data; this, of course, is only possible if such public data is available. The issue of heterogeneity can also be addressed with Multi-Task and Meta-Learning approaches. First steps towards adaptive federated learning schemes have been made [57][36], but the heterogeneity issue is still largely unsolved.
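The regularization constraint used by FedProx augments each local objective with a proximal term (mu/2)*||w - w_global||^2, which penalizes divergence from the current global model. The following lines are a simplified sketch of this idea (grad_local_loss, mu and lr are placeholders, not the reference implementation of [56]):

import numpy as np

def fedprox_local_step(w, w_global, grad_local_loss, mu=0.01, lr=0.1):
    # The proximal term (mu/2) * ||w - w_global||^2 contributes the
    # gradient mu * (w - w_global), pulling the local model back
    # towards the global one.
    grad = grad_local_loss(w) + mu * (w - w_global)
    return w - lr * grad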
Communication delay produces model updates, which can be compressed further before communication, and a variety of techniques have been proposed to this end. In this context it is important to remember the asymmetry between upstream and downstream communication during federated learning: During upstream communication, the server receives model updates from a potentially very large number of clients, which are then aggregated using e.g. an averaging operation. This averaging over the contributions from multiple clients allows for a stronger compression of every individual update. In particular, for unbiased compression techniques it follows directly from the central limit theorem that the individual upstream updates can be made arbitrarily small, while preserving a fixed error, as long as the number of clients is large enough. Compressing the upstream is also made easier by the fact that the server is always up to date with the latest model, which allows the clients to send difference models instead of full models. These difference models contain less information and are thus less sensitive to compression. As clients typically do not participate in every communication round, their local models are often outdated, and thus sending difference models is not possible during downstream communication.
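This averaging effect is easy to verify numerically. The sketch below is our illustration (quantize_unbiased is a generic stochastic-rounding scheme, not a specific method from the cited works): each client's difference model is compressed with an unbiased quantizer, and the error of the server-side average shrinks as the number of clients grows.

import numpy as np

rng = np.random.default_rng(0)

def quantize_unbiased(x, levels=2):
    # Stochastic rounding to a coarse grid, chosen so that E[Q(x)] = x.
    lo = np.floor(x * levels) / levels
    p = (x - lo) * levels                       # probability of rounding up
    return lo + (rng.random(x.shape) < p) / levels

# Each of n clients sends a quantized difference model (w_local - w_global).
updates = [rng.normal(0.0, 0.1, size=1000) for _ in range(100)]
avg_true = np.mean(updates, axis=0)
avg_quant = np.mean([quantize_unbiased(u) for u in updates], axis=0)
# By the central limit theorem the aggregation error decays like O(1/sqrt(n)).
print(np.abs(avg_quant - avg_true).mean())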
For the above reasons, most existing works on improving communication efficiency in federated learning focus only on the upstream communication (see Table 1). One line of research confines the parameter update space of the clients to a lower-dimensional subspace by imposing e.g. a low-rank or sparsity constraint [45]. Federated dropout [11] reduces communication in both upstream and downstream by letting clients train smaller sub-models, which are then assembled into a larger model at the server after every communication round. As the empirical benefits of training-time compression seem to be limited, the majority of methods use post-hoc compression techniques. Probabilistic quantization and subsampling can be used in addition to other techniques such as DeepCABAC [79] or sparse binary compression [58].
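As an example of such a post-hoc technique, the following simplified sketch captures the spirit of sparse binary compression [58] (our condensed illustration, not the full method, which combines further steps such as lossless encoding of the non-zero positions): only the top-k elements of an update are kept, and their values are replaced by a single shared mean magnitude plus a sign bit.

import numpy as np

def sparse_binary_compress(update, sparsity=0.01):
    # Keep only the top-k elements by magnitude ...
    k = max(1, int(sparsity * update.size))
    idx = np.argpartition(np.abs(update), -k)[-k:]
    # ... and replace their values by one shared scalar, so each kept
    # element costs only an index and a sign bit.
    mean_mag = np.abs(update[idx]).mean()
    out = np.zeros_like(update)
    out[idx] = np.sign(update[idx]) * mean_mag
    return out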
Federated learning typically assumes a star-shaped communication topology, where all clients communicate directly with the server. In some situations it might, however, be beneficial to also consider hierarchical communication topologies, where the devices are organized at multiple levels. This communication topology naturally arises, for instance, in massively distributed IoT settings, where geographically proximal devices are connected to the same edge server. In these situations, hierarchical aggregation of client contributions can help to reduce the communication overhead by intelligently adapting the communication to the network constraints [50][1].
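A two-level variant of such an aggregation can be sketched as follows (our illustration, not the schemes of [50] or [1]): each edge server averages the updates of its local clients and forwards a single aggregate to the central server; weighting by client count keeps the result identical to a flat average over all clients.

import numpy as np

def edge_aggregate(client_updates):
    # An edge server averages the updates of its directly connected clients.
    return np.mean(client_updates, axis=0), len(client_updates)

def hierarchical_aggregate(groups):
    # groups: one list of client updates per edge server.
    aggregates = [edge_aggregate(g) for g in groups]
    total = sum(n for _, n in aggregates)
    # Weighted combination of the per-edge averages; each client only
    # communicates with its nearby edge server.
    return sum(avg * (n / total) for avg, n in aggregates)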
2.3 Peer-to-Peer Learning

Training with one centralized server might be undesirable in some scenarios, because it introduces a single point of failure and requires the clients to trust a centralized entity (at least to a certain degree). Fully decentralized peer-to-peer learning [70][67][7][46] overcomes these issues, as it allows clients to communicate directly with one another. In this scenario it is usually assumed that the connectivity structure between the clients is given by a connected graph. Given a certain connectivity structure, peer-to-peer learning is typically realized via a gossip communication protocol, where in each communication round all clients perform one or multiple steps of stochastic gradient descent and then average their local model with those of all their peers. Communication in peer-to-peer learning may thus be highly frequent and involve a large number of clients (see Table 1). As clients are typically embodied by mobile or IoT devices which collect local data, peer-to-peer learning shares many of the properties and constraints of federated learning. In particular, the issues related to non-iid data discussed above apply in a similar fashion. A unique characteristic of peer-to-peer learning is that there is no central entity which orchestrates the training process. Making decisions about training-related meta-parameters may thus require additional consensus mechanisms, which could be realized e.g. via blockchain technology [14].
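One round of such a gossip protocol can be summarized as follows (a minimal sketch with illustrative names; neighbors[i] is assumed to list the graph neighbors of client i, and grad_fns[i] is a placeholder for the gradient of client i's local loss):

import numpy as np

def gossip_round(models, neighbors, grad_fns, lr=0.1, local_steps=1):
    # Phase 1: every client performs local SGD steps on its own data.
    for i in range(len(models)):
        for _ in range(local_steps):
            models[i] = models[i] - lr * grad_fns[i](models[i])
    # Phase 2: gossip averaging of each model with those of its peers.
    return [np.mean([models[j] for j in [i] + neighbors[i]], axis=0)
            for i in range(len(models))]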
Compression for Peer-to-Peer Learning: Communication-efficient peer-to-peer learning of neural networks is a relatively young field of research, and thus the number of proposed compression methods is still limited. However, first promising results have already been achieved with quantization [55], sketching tech-