ple local epochs without any loss in convergence speed if the clients hold iid data (meaning that all clients' data was sampled independently from the same distribution) [52]. Communication delay reduces both the downstream communication from the server to the clients and the upstream communication from the clients to the server equally. It also reduces the total number of communication rounds, which is especially beneficial under the constraints of the federated setting, as it mitigates the impact of network latency and allows the clients to perform computation offline and delay communication until a fast network connection is available.
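The following minimal sketch illustrates communication delay with a FedAvg-style round. It assumes models represented as NumPy parameter vectors and uses a stand-in least-squares gradient, so it illustrates the mechanism rather than the implementation of [52].

```python
import numpy as np

def grad(w, x, y):
    # Stand-in gradient (least squares); a real client would differentiate its own loss.
    return 2 * x * (np.dot(w, x) - y)

def client_update(w_global, data, lr=0.01, local_epochs=5):
    """Communication delay: run several local epochs of SGD before communicating."""
    w = w_global.copy()
    for _ in range(local_epochs):
        for x, y in data:              # one pass over the client's local data
            w -= lr * grad(w, x, y)
    return w

def communication_round(w_global, client_datasets):
    """One round: every participating client trains locally, the server averages."""
    local_models = [client_update(w_global, d) for d in client_datasets]
    return np.mean(local_models, axis=0)
```

With `local_epochs` local passes per round, the same total amount of computation requires roughly `local_epochs` times fewer communication rounds.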
However, several recent studies show that communication delay drastically slows down convergence in non-iid settings, where the local clients' data distributions are highly divergent [89][59]. Different methods have been proposed to mitigate this issue in the non-iid setting, with varying success: FedProx [56] limits the divergence of the locally trained models by adding a regularization constraint. Other authors [89] propose mixing iid public training data into every local client's data. This, of course, is only possible if such public data is available. The issue of heterogeneity can also be addressed with Multi-Task and Meta-Learning approaches. First steps towards adaptive federated learning schemes have been made [57][36], but the heterogeneity issue is still largely unsolved.
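As an illustration of the proximal idea behind FedProx, the sketch below adds a penalty mu/2 * ||w - w_global||^2 to the local objective; the value of mu and the reuse of the stand-in `grad` from the previous sketch are assumptions for illustration, not the reference FedProx implementation.

```python
def fedprox_client_update(w_global, data, lr=0.01, local_epochs=5, mu=0.1):
    """Local SGD with a proximal penalty that keeps the local model close to the
    last global model, limiting client drift under non-iid data (FedProx idea)."""
    w = w_global.copy()
    for _ in range(local_epochs):
        for x, y in data:
            # gradient of the local loss + gradient of (mu/2) * ||w - w_global||^2
            w -= lr * (grad(w, x, y) + mu * (w - w_global))
    return w
```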
Communication delay produces model updates, which can be compressed further before communication, and a variety of techniques have been proposed to this end. In this context it is important to remember the asymmetry between upstream and downstream communication during federated learning: During upstream communication, the server receives model updates from potentially a very large number of clients, which are then aggregated using e.g. an averaging operation. This averaging over the contributions from multiple clients allows for a stronger compression of every individual update. In particular, for unbiased compression techniques it follows directly from the central limit theorem that the individual upstream updates can be made arbitrarily small while preserving a fixed error, as long as the number of clients is large enough. Compressing the upstream is also made easier by the fact that the server is always up-to-date with the latest model, which allows the clients to send difference models instead of full models. These difference models contain less information and are thus less sensitive to compression. As clients typically do not participate in every communication round, their local models are often outdated and thus sending difference models is not possible during downstream.
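The sketch below illustrates why unbiased compression of difference models is attractive on the upstream. The quantizer shown is a generic unbiased stochastic quantizer (an illustrative choice, not the scheme of any specific cited work); because each compressed update is unbiased, the error of the server-side average shrinks roughly with the square root of the number of clients.

```python
import numpy as np

def quantize_unbiased(delta, levels=4):
    """Unbiased stochastic quantization onto a coarse grid: E[output] == delta,
    so the noise added to each client's update cancels out in the server average."""
    scale = np.max(np.abs(delta)) + 1e-12
    normalized = np.abs(delta) / scale * (levels - 1)
    lower = np.floor(normalized)
    round_up = np.random.rand(*delta.shape) < (normalized - lower)
    return np.sign(delta) * (lower + round_up) * scale / (levels - 1)

def upstream_message(w_local, w_global):
    """Send a compressed difference model; the server already knows w_global."""
    return quantize_unbiased(w_local - w_global)

def server_aggregate(w_global, messages):
    """Average the heavily compressed, unbiased updates from many clients."""
    return w_global + np.mean(messages, axis=0)
```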
For the above reasons, most existing works on improving communication efficiency in federated learning only focus on the upstream communication (see Table 1). One line of research confines the parameter update space of the clients to a lower-dimensional subspace, by imposing e.g. a low-rank or sparsity constraint [45]. Federated dropout [11] reduces communication in both upstream and downstream by letting clients train smaller sub-models, which are then assembled into a larger model at the server after every communication round. As the empirical benefits of training-time compression seem to be limited, the majority of methods use post-hoc compression techniques. Probabilistic quantization and subsampling can be used in addition to other techniques such as DeepCABAC [79] or sparse binary compression [58].
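As one concrete post-hoc primitive, the following sketch shows magnitude-based top-k sparsification of an update; the fraction kept and the plain (index, value) encoding are illustrative and do not reproduce the exact sparse binary compression scheme of [58] or the entropy coding of DeepCABAC [79].

```python
import numpy as np

def top_k_sparsify(delta, keep_fraction=0.01):
    """Keep only the largest-magnitude entries of an update vector; only their
    indices and values need to be transmitted, all other entries are treated as zero."""
    k = max(1, int(keep_fraction * delta.size))
    threshold = np.partition(np.abs(delta), -k)[-k]
    mask = np.abs(delta) >= threshold
    return np.where(mask, delta, 0.0), np.flatnonzero(mask)
```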
Federated learning typically assumes a star-shaped communication topology, where all clients directly communicate with the server. In some situations it might, however, be beneficial to also consider hierarchical communication topologies, where the devices are organized at multiple levels. This communication topology naturally arises, for instance, in massively distributed IoT settings, where geographically proximal devices are connected to the same edge server. In these situations, hierarchical aggregation of client contributions can help to reduce the communication overhead by intelligently adapting the communication to the network constraints [50][1].
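The following sketch shows the basic two-level variant of such a hierarchy, in which hypothetical edge servers pre-aggregate the updates of their nearby clients so that only one client-count-weighted aggregate per edge crosses the wide-area link; the grouping of clients per edge is assumed for illustration.

```python
import numpy as np

def hierarchical_aggregate(updates_per_edge):
    """Two-level aggregation: each edge server averages its local clients' updates,
    the central server then combines the per-edge aggregates weighted by client count,
    which reproduces the plain average over all clients."""
    edge_means = [(np.mean(updates, axis=0), len(updates)) for updates in updates_per_edge]
    total_clients = sum(n for _, n in edge_means)
    return sum(mean * n for mean, n in edge_means) / total_clients
```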
2.3 Peer-to-Peer Learning

Training with one centralized server might be undesirable in some scenarios, because it introduces a single point of failure and requires the clients to trust a centralized entity (at least to a certain degree). Fully decentralized peer-to-peer learning [70][67][7][46] overcomes these issues, as it allows clients to directly communicate with one another. In this scenario it is usually assumed that the connectivity structure between the clients is given by a connected graph. Given a certain connectivity structure between the clients, peer-to-peer learning is typically realized via a gossip communication protocol, where in each communication round all clients perform one or multiple steps of stochastic gradient descent and then average their local model with those from all their peers. Communication in peer-to-peer learning may thus be highly frequent and involve a large number of clients (see Table 1). As clients are typically mobile or IoT devices that collect local data, peer-to-peer learning shares many properties and constraints of federated learning. In particular, the issues related to non-iid data discussed above apply in a similar fashion. A unique characteristic of peer-to-peer learning is that there is no central entity which orchestrates the training process. Making decisions about training-related meta-parameters may thus require additional consensus mechanisms, which could be realized e.g. via blockchain technology [14].
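The sketch below shows one round of such a gossip protocol under simplifying assumptions: the graph is given as an adjacency list, every client takes a single stochastic gradient step via a caller-supplied `local_gradient` function (a hypothetical stand-in), and mixing is plain uniform averaging over each client's neighborhood rather than a general doubly-stochastic mixing matrix.

```python
import numpy as np

def gossip_round(models, neighbors, local_gradient, lr=0.01):
    """One peer-to-peer round: each client takes a local SGD step, then replaces its
    model with the average over itself and its graph neighbors (gossip averaging)."""
    stepped = [w - lr * local_gradient(i, w) for i, w in enumerate(models)]
    mixed = []
    for i, w in enumerate(stepped):
        neighborhood = [w] + [stepped[j] for j in neighbors[i]]
        mixed.append(np.mean(neighborhood, axis=0))
    return mixed
```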
Compression for Peer-to-Peer Learning: Communication-efficient peer-to-peer learning of neural networks is a relatively young field of research, and thus the number of proposed compression methods is still limited. However, first promising results have already been achieved with quantization [55], sketching tech-




