2.  RELATED WORKS

We introduce some related traffic identification works that involve the DL domain and categorize them into two major groups.


For feature engineering: These methods still use hand-designed features but utilize DL as a means of feature processing. For example, Niyaz et al. proposed an approach using a deep belief network (DBN) to perform feature selection before the ML classification [4]. Hochst et al. introduced an auto-encoder network to perform dimension reduction on manually extracted flow features [5]. Rezaei et al. applied a pre-training strategy similar to ours [6]. The difference is that their work introduced neural networks to reconstruct time-series features, so its pre-training plays the role of re-processing the hand-designed features; ours instead performs representation learning on the raw traffic.
For representation learning: These works, similar to ours, apply DL to learn an encoding representation from raw traffic bytes without manual feature engineering. They are also considered end-to-end implementations of traffic classification. Wang et al. proposed the first such encrypted traffic classification framework [7]: they transformed payload data into grayscale images and applied convolutional neural networks (CNN) to perform image processing. Afterward, a series of CNN-based works like [8] proved the validity of such end-to-end classification. Lopez-Martin et al. further discussed a possible combination for traffic identification where a CNN is still used for representation learning but a long short-term memory (LSTM) network is introduced to learn the flow behaviors [9]. This inspired the hierarchical spatial-temporal features-based (HAST) models, which obtained state-of-the-art results in the intrusion detection domain [10].
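As a concrete illustration of the grayscale-image strategy these CNN-based works share, the following minimal sketch is our own, not code from [7]; the 28 x 28 image size and the helper name payload_to_image are assumptions for illustration:

    import numpy as np

    def payload_to_image(payload: bytes, side: int = 28) -> np.ndarray:
        """Truncate or zero-pad a payload to side*side bytes, then view the
        byte values (0-255) as grayscale pixel intensities for a CNN input."""
        buf = payload[: side * side].ljust(side * side, b"\x00")
        return np.frombuffer(buf, dtype=np.uint8).reshape(side, side)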
Nevertheless, for end-to-end encrypted traffic classification nowadays, the CNN is still the mainstream, whereas NLP-related networks only serve as an assistance for jobs such as capturing flow information. We can hardly find a full-NLP scheme similar to ours, let alone one that applies current dynamic word embedding techniques.

3.  MODEL ARCHITECTURE

3.1  Payload Tokenization
According to [2], the payload bytes of a packet are likely to expose some visible information, especially in the first few packets of a traffic flow. Thus, most DL-based methods use this byte data to construct traffic images as the inputs of a CNN model. The byte data is ideal for generating pixel images because its values range from 0 to 255, exactly the range of a grayscale pixel. Rather than applying such an image processing strategy, we treat the payload bytes of a packet as a language-like string for introducing NLP processing.

Figure 1 – Comparison of raw payload processing between CNN-based methods and PERT

However, the range of byte values is rather small compared with the size of a common NLP vocabulary. To extend the vocabulary size of traffic bytes, we introduce a tokenization that takes pairs of bytes (with values ranging from 0 to 65535) as basic character units to generate bigram strings, as illustrated in Figure 1. Afterward, NLP-related encoding methods can be directly applied to the tokenized traffic bytes. Thus, the encrypted traffic identification is transformed into an NLP classification task.
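A minimal sketch of this bigram tokenization follows. It is our illustration, not the authors' released code; in particular, we assume non-overlapping byte pairs and zero-padding of odd-length payloads, details the text does not pin down:

    def tokenize_payload(payload: bytes) -> list[int]:
        """Map raw payload bytes to bigram tokens: each pair of adjacent
        bytes becomes one token in the range 0-65535."""
        if len(payload) % 2:                 # assumed: pad odd-length payloads
            payload += b"\x00"
        return [(payload[i] << 8) | payload[i + 1]
                for i in range(0, len(payload), 2)]

    # A 4-byte payload 0x16 0x03 0x01 0x02 yields the tokens [0x1603, 0x0102].
    print(tokenize_payload(bytes([0x16, 0x03, 0x01, 0x02])))   # [5635, 258]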
3.2  Representation Learning

When performing representation learning in an NLP task, word embedding is widely utilized. Recently, a breakthrough was made in this research area as dynamic word embedding techniques overcame the drawback of traditional word embedding methods like Word2Vec [11], which can only map words to fixed vectors. By contrast, vectors produced by dynamic word embedding are adjusted according to their context inputs, making them more powerful for learning detailed contextual information. This is just what we need for extracting complex contextual features from encrypted traffic data.

A currently popular dynamic word embedding model such as BERT can be considered a stack of a certain type of encoding layer. Each encoder takes the outputs of its former layer as inputs and learns a more abstract representation. In other words, the word embedding is dynamically adjusted as it passes through each successive encoding layer.

In our work, we take the tokenized payload string [w1, w2, ..., wk] as the original input. The first group of word embedding vectors [x1, x2, ..., xk] at the bottom of the network is randomly initialized. After N rounds of dynamic encoding, we obtain the final word embedding outputs [h1^N, h2^N, ..., hk^N], which carry highly abstract contextual information of the original payload.
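The sketch below illustrates this stacked dynamic encoding under stated assumptions: we stand in PyTorch's generic nn.TransformerEncoderLayer for the encoder type and use illustrative sizes (d_model = 256, N = 4) rather than the paper's actual configuration, and we omit positional embeddings and pre-training for brevity:

    import torch
    import torch.nn as nn

    VOCAB_SIZE = 65536           # bigram tokens from Section 3.1
    D_MODEL, N_LAYERS = 256, 4   # illustrative sizes, not the paper's settings

    # Randomly initialized bottom embeddings [x1, ..., xk].
    embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
    # A stack of N encoding layers; positional information is omitted here.
    encoders = nn.ModuleList(
        [nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
         for _ in range(N_LAYERS)]
    )

    tokens = torch.tensor([[0x1603, 0x0102, 0x0800]])  # one tokenized payload, k = 3
    h = embed(tokens)                 # [x1, ..., xk]
    for layer in encoders:            # N rounds of dynamic encoding
        h = layer(h)                  # each layer refines its inputs
    # h is now [h1^N, ..., hk^N]: context-dependent embeddings of the payload.
    print(h.shape)                    # torch.Size([1, 3, 256])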





