



range of NLP tasks is their "pre-training + fine-tuning" strategy. To the best of our knowledge, our work is the first to introduce such a strategy to an end-to-end encrypted traffic classification architecture. Pre-training initializes the encoding network and equips it with the ability to extract contextual information before it is applied to downstream tasks. The unsupervised language model (LM) is widely used for word embedding pre-training [14]. BERT, specifically, proposed a masked LM, which hides several words of the original string behind a unique symbol 'unk' and uses the remaining words to predict the hidden ones.

To demonstrate the procedure of the masked LM, consider a masked traffic bigram string w = [w_1, ..., 'unk', ..., 'unk', ..., w_k] and a list msk = [i_1, i_2, ..., i_m] that indicates the positions of the masked bigram units. After encoding, each embedding vector h_i encoded from the i-th position of the original input is passed through a fully connected layer:

$o_i = W'\,\tanh(W h_i + b) + b'$   (5)
where tanh is an activation function, similar to ReLU. The output vector o_i = [o_{i,1}, ..., o_{i,|V|}] has the vocabulary size |V| and stores the likelihood of each candidate traffic bigram at the i-th position.

In the end, the masked LM uses the partial outputs {o_i, i ∈ msk} to perform a large softmax classification whose number of classes equals the vocabulary size. The objective is to maximize the predicted probabilities of all the masked bigrams, which can simply be written as (θ represents the parameters of the entire network):

$\max_{\theta} \sum_{i \in msk} \log P(w_i \mid w, \theta)$   (6)
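To make Equations (5) and (6) concrete, the following is a minimal PyTorch-style sketch of the prediction head and the masked-LM objective; the class and function names, the layer sizes and the use of cross-entropy as the negative log-likelihood are illustrative assumptions rather than details taken from the paper.

    import torch
    import torch.nn as nn

    class MaskedLMHead(nn.Module):
        """Prediction head of Eq. (5): o_i = W' tanh(W h_i + b) + b'."""
        def __init__(self, hidden_size, vocab_size):
            super().__init__()
            self.transform = nn.Linear(hidden_size, hidden_size)  # W, b
            self.decoder = nn.Linear(hidden_size, vocab_size)     # W', b'

        def forward(self, h):                    # h: (batch, seq_len, hidden)
            return self.decoder(torch.tanh(self.transform(h)))    # (batch, seq_len, |V|)

    def masked_lm_loss(logits, target_ids, msk):
        """Eq. (6): maximize log P of the true bigrams at the masked positions,
        i.e. minimize the cross-entropy restricted to the positions in msk."""
        picked = logits[:, msk, :]               # outputs {o_i, i in msk}
        gold = target_ids[:, msk]                # original bigram ids before masking
        return nn.functional.cross_entropy(
            picked.reshape(-1, picked.size(-1)), gold.reshape(-1))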
                                     
                                                 
                                  
                                               
The LM is considered a powerful initialization approach for the encoding network using large-scale unlabeled data, yet it is very time-consuming. Even though we ultimately want to perform flow-level classification of encrypted traffic, we argue that the pre-training should be packet-level, considering the possible computation costs. In particular, we collect raw traffic packets from the Internet regardless of their sources and extract their payload bytes to generate an unsupervised data set. Then, the extracted payload bytes are tokenized as bigram strings and are used to perform the PERT pre-training. After the training converges, we save the adjusted encoding network.
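As an illustration of how such an unsupervised data set could be prepared, the sketch below turns raw payload bytes into bigram strings; the function names, the hexadecimal token format and the overlapping one-byte stride are illustrative assumptions, not details fixed by the paper.

    def payload_to_bigrams(payload: bytes) -> str:
        """Tokenize payload bytes into bigram tokens,
        e.g. bytes 16 03 01 -> '1603 0301' (assuming an overlapping one-byte stride)."""
        return " ".join(f"{payload[i]:02x}{payload[i + 1]:02x}"
                        for i in range(len(payload) - 1))

    def build_pretraining_corpus(payloads, out_path):
        """Write one tokenized payload string per line as unlabeled pre-training data."""
        with open(out_path, "w") as f:
            for payload in payloads:             # payload bytes of collected raw packets
                if len(payload) > 1:             # skip payloads too short to form a bigram
                    f.write(payload_to_bigrams(payload) + "\n")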
3.4    Flow-level Classification

While implementing a specific task such as classification, the pre-trained encoding network is fully reused and further adjusted to learn the real relationship between the inputs and the task objective. This is the concept of "fine-tuning", where a network is trained from a proper initialization to achieve a boosted effect on downstream tasks.

Figure 4 – Illustration of the flow-level encrypted traffic classification initialized with a packet-level pre-training

Figure 4 shows our encrypted traffic classification framework. Below are the detailed descriptions:

Packets extraction: While classifying an encrypted traffic flow, only the first M packets (3, for example, in Figure 4) need to be used. The bigram tokenization is performed on the payload bytes of each packet to generate a list of tokenized payload strings [str_1, str_2, ..., str_M].
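A short sketch of this step, reusing the payload_to_bigrams helper assumed above (M and the flow representation are likewise illustrative):

    def extract_packet_strings(flow_payloads, M=3):
        """Take the payloads of the first M packets of a flow and tokenize each of
        them, producing the list [str_1, str_2, ..., str_M]."""
        return [payload_to_bigrams(p) for p in flow_payloads[:M]]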
Encoding for packets: Before classification, the encoding network of the classifier is initialized with its pre-trained counterpart. As the encoding network is packet-level, each tokenized string is fed to the encoders individually. According to [3], while carrying out a classification with BERT, a unique token 'cls' should be added at the beginning of the input as the classification mark. For the i-th packet, its tokenized string is thus modified as str_i' = [cls, w_{i,1}, w_{i,2}, ..., w_{i,k}]. After encoding, a series of embedding vectors [h^N_{i,CLS}, h^N_{i,1}, ..., h^N_{i,k}] is output, yet only h^N_{i,CLS} is picked as the further classification input. We simply denote h^N_{i,CLS} as emb_i. In order to make use of all of the information extracted from the first M packets, we apply a concatenation to merge the encoded packets:

$emb(f) = emb_1 \oplus emb_2 \oplus \dots \oplus emb_M$   (7)
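A minimal PyTorch-style sketch of this encoding and concatenation step; the encoder interface, the handling of the 'cls' id and the tensor shapes are illustrative assumptions:

    import torch

    def encode_flow(encoder, packet_token_ids, cls_id):
        """Encode each of the first M packets with the shared pre-trained encoder,
        keep the 'cls' embedding of every packet, and concatenate them (Eq. (7))."""
        packet_embs = []
        for ids in packet_token_ids:             # one 1-D tensor of token ids per packet
            ids = torch.cat([torch.tensor([cls_id]), ids])  # prepend the 'cls' mark
            h = encoder(ids.unsqueeze(0))        # (1, k + 1, hidden_size)
            packet_embs.append(h[:, 0, :])       # h^N_{i,CLS}, i.e. emb_i
        return torch.cat(packet_embs, dim=-1)    # emb(f) = emb_1 ⊕ ... ⊕ emb_M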
                                                                                              
                                                                                 
                                                                                    
Final classification: In the end, a softmax classification layer is used to learn the probability distribution of the input flows among the possible traffic classes. The objective of the flow-level classification can be written as follows:

$\max_{\theta} \sum_{f \in R_{flow}} \log P(y_f \mid emb(f), \theta)$   (8)

where R_flow represents the flow-level training set. Given a flow sample f, y_f represents its true label (class) and emb(f)



