



range of NLP tasks is their "pre-training + fine-tuning" strategy. To the best of our knowledge, our work is the first to introduce such a strategy to an end-to-end encrypted traffic classification architecture. Pre-training initializes the encoding network and equips it with the ability to extract contextual information before it is applied to downstream tasks. The unsupervised language model (LM) is widely used for word embedding pre-training [14]. BERT, specifically, proposed a masked LM, which hides several words of the original string behind a unique symbol 'unk' and uses the remaining words to predict the hidden ones.

To demonstrate the procedure of the masked LM, consider a masked traffic bigram string w = [w_1, ..., 'unk', ..., 'unk', ..., w_k] and a list msk = [i_1, i_2, ..., i_m] that indicates the positions of the masked bigram units. After encoding, each embedding vector h_i encoded from the i-th position of the original input is passed through a fully connected layer:

$o_i = W'\,\tanh(W h_i + b) + b'$   (5)
where tanh is an activation function, similar to ReLU. The output vector o_i = [o_{i,1}, ..., o_{i,|V|}] has the vocabulary size |V| and stores the likelihood of each candidate traffic bigram at the i-th position.

In the end, the masked LM uses the partial outputs {o_i, i ∈ msk} to perform a large softmax classification whose number of classes equals the vocabulary size. The objective is to maximize the predicted probabilities of all the masked bigrams, which can simply be written as (θ represents the parameters of the entire network):

$\max_{\theta} \sum_{i \in msk} \log P(w_i \mid w, \theta)$   (6)
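To make Equations (5) and (6) concrete, the following is a minimal PyTorch-style sketch of the prediction head and the masked-LM objective; the class and function names, the layer sizes and the use of cross-entropy as the negative log-likelihood are illustrative assumptions rather than details taken from the paper.

    import torch
    import torch.nn as nn

    class MaskedLMHead(nn.Module):
        """Prediction head of Eq. (5): o_i = W' tanh(W h_i + b) + b'."""
        def __init__(self, hidden_size, vocab_size):
            super().__init__()
            self.transform = nn.Linear(hidden_size, hidden_size)  # W, b
            self.decoder = nn.Linear(hidden_size, vocab_size)     # W', b'

        def forward(self, h):                    # h: (batch, seq_len, hidden)
            return self.decoder(torch.tanh(self.transform(h)))    # (batch, seq_len, |V|)

    def masked_lm_loss(logits, target_ids, msk):
        """Eq. (6): maximize log P of the true bigrams at the masked positions,
        i.e. minimize the cross-entropy restricted to the positions in msk."""
        picked = logits[:, msk, :]               # outputs {o_i, i in msk}
        gold = target_ids[:, msk]                # original bigram ids before masking
        return nn.functional.cross_entropy(
            picked.reshape(-1, picked.size(-1)), gold.reshape(-1))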
                                     
                                                 
                                  
                                               
The LM is considered a powerful initialization approach for the encoding network using large-scale unlabeled data, yet it is very time-consuming. Even though we ultimately want to perform flow-level classification of encrypted traffic, we argue that the pre-training should be packet-level, considering the possible computation costs. In particular, we collect raw traffic packets from the Internet regardless of their sources and extract their payload bytes to generate an unsupervised data set. Then, the extracted payload bytes are tokenized as bigram strings and are used to perform the PERT pre-training. After the training converges, we save the adjusted encoding network.
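As an illustration of how such an unsupervised data set could be prepared, the sketch below turns raw payload bytes into bigram strings; the function names, the hexadecimal token format and the overlapping one-byte stride are illustrative assumptions, not details fixed by the paper.

    def payload_to_bigrams(payload: bytes) -> str:
        """Tokenize payload bytes into bigram tokens,
        e.g. bytes 16 03 01 -> '1603 0301' (assuming an overlapping one-byte stride)."""
        return " ".join(f"{payload[i]:02x}{payload[i + 1]:02x}"
                        for i in range(len(payload) - 1))

    def build_pretraining_corpus(payloads, out_path):
        """Write one tokenized payload string per line as unlabeled pre-training data."""
        with open(out_path, "w") as f:
            for payload in payloads:             # payload bytes of collected raw packets
                if len(payload) > 1:             # skip payloads too short to form a bigram
                    f.write(payload_to_bigrams(payload) + "\n")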
3.4    Flow-level Classification

While implementing a specific task such as classification, the pre-trained encoding network is fully reused and further adjusted to learn the real relationship between the inputs and the task objective. This is the concept of "fine-tuning", where a network is trained from a proper initialization to achieve a boosted effect on downstream tasks.

Figure 4 – Illustration of the flow-level encrypted traffic classification initialized with a packet-level pre-training

Figure 4 shows our encrypted traffic classification framework. Below are the detailed descriptions:

Packets extraction: While classifying an encrypted traffic flow, only the first M packets (3, for example, in Figure 4) need to be used. The bigram tokenization is performed on the payload bytes of each packet to generate a list of tokenized payload strings [str_1, str_2, ..., str_M].
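A short sketch of this step, reusing the payload_to_bigrams helper assumed above (M and the flow representation are likewise illustrative):

    def extract_packet_strings(flow_payloads, M=3):
        """Take the payloads of the first M packets of a flow and tokenize each of
        them, producing the list [str_1, str_2, ..., str_M]."""
        return [payload_to_bigrams(p) for p in flow_payloads[:M]]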
Encoding for packets: Before classification, the encoding network of the classifier is initialized with its pre-trained counterpart. As the encoding network is packet-level, each tokenized string is fed to the encoders individually. According to [3], while carrying out a classification with BERT, a unique token 'cls' should be added at the beginning of the input as the classification mark. For the i-th packet, its tokenized string is thus modified as str_i' = [cls, w_{i,1}, w_{i,2}, ..., w_{i,k}]. After encoding, a series of embedding vectors [h^N_{i,CLS}, h^N_{i,1}, ..., h^N_{i,k}] is output, yet only h^N_{i,CLS} is picked as the further classification input. We simply denote h^N_{i,CLS} as emb_i. In order to make use of all of the information extracted from the first M packets, we apply a concatenation to merge the encoded packets:

$emb(f) = emb_1 \oplus emb_2 \oplus \dots \oplus emb_M$   (7)
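A minimal PyTorch-style sketch of this encoding and concatenation step; the encoder interface, the handling of the 'cls' id and the tensor shapes are illustrative assumptions:

    import torch

    def encode_flow(encoder, packet_token_ids, cls_id):
        """Encode each of the first M packets with the shared pre-trained encoder,
        keep the 'cls' embedding of every packet, and concatenate them (Eq. (7))."""
        packet_embs = []
        for ids in packet_token_ids:             # one 1-D tensor of token ids per packet
            ids = torch.cat([torch.tensor([cls_id]), ids])  # prepend the 'cls' mark
            h = encoder(ids.unsqueeze(0))        # (1, k + 1, hidden_size)
            packet_embs.append(h[:, 0, :])       # h^N_{i,CLS}, i.e. emb_i
        return torch.cat(packet_embs, dim=-1)    # emb(f) = emb_1 ⊕ ... ⊕ emb_M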
                                                                                              
                                                                                 
                                                                                    
Final classification: In the end, a softmax classification layer is used to learn the probability distribution of the input flows among the possible traffic classes. The objective of the flow-level classification can be written as follows:

$\max_{\theta} \sum_{f \in R_{flow}} \log P(y_f \mid emb(f), \theta)$   (8)

where R_flow represents the flow-level training set. Given a flow sample f, y_f represents its true label (class) and emb(f)



