2.  RELATED WORKS

We introduce some related traffic identification works that involve the DL domain and categorize them into two major groups.


For feature engineering: These methods still use hand-designed features but utilize DL as a means of feature processing. For example, Niyaz et al. proposed an approach using a deep belief network (DBN) to perform feature selection before the ML classification [4]. Hochst et al. introduced an auto-encoder network to perform dimension reduction on manually extracted flow features [5]. Rezaei et al. applied a pre-training strategy similar to ours [6]. The difference is that their work introduced neural networks to reconstruct time-series features, so its pre-training plays the role of re-processing the hand-designed features; ours instead performs representation learning on the raw traffic.
For representation learning: These works, similar to ours, apply DL to learn an encoding representation from raw traffic bytes without manual feature engineering. They are also considered end-to-end implementations of traffic classification. Wang et al. proposed the first such encrypted traffic classification framework [7]: they transformed payload data into grayscale images and applied convolutional neural networks (CNN) to perform image processing. Afterward, a series of CNN-based works like [8] proved the validity of such end-to-end classification. Lopez-Martin et al. further discussed a possible combination for traffic identification where a CNN is still used for representation learning but a long short-term memory (LSTM) network is introduced to learn the flow behaviors [9]. This inspired the hierarchical spatial-temporal features-based (HAST) models, which obtained state-of-the-art results in the intrusion detection domain [10].
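As a concrete illustration of the grayscale-image strategy these CNN-based works share, the following minimal sketch is our own, not code from [7]; the 28 x 28 image size and the helper name payload_to_image are assumptions for illustration:

    import numpy as np

    def payload_to_image(payload: bytes, side: int = 28) -> np.ndarray:
        """Truncate or zero-pad a payload to side*side bytes, then view the
        byte values (0-255) as grayscale pixel intensities for a CNN input."""
        buf = payload[: side * side].ljust(side * side, b"\x00")
        return np.frombuffer(buf, dtype=np.uint8).reshape(side, side)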
Nevertheless, for end-to-end encrypted traffic classification nowadays, the CNN is still the mainstream, whereas NLP-related networks only serve as an assistance for jobs such as capturing flow information. We can hardly find a full-NLP scheme similar to ours, let alone one that applies current dynamic word embedding techniques.

3.  MODEL ARCHITECTURE

3.1  Payload Tokenization
According to [2], the payload bytes of a packet are likely to expose some visible information, especially in the first few packets of a traffic flow. Thus, most DL-based methods use this byte data to construct traffic images as the inputs of a CNN model. The byte data is ideal for generating pixel images because its values range from 0 to 255, exactly the range of a grayscale pixel. Rather than applying such an image processing strategy, we treat the payload bytes of a packet as a language-like string for introducing NLP processing.

Figure 1 – Comparison of raw payload processing between CNN-based methods and PERT

However, the range of byte values is rather small compared with the size of a common NLP vocabulary. To extend the vocabulary size of traffic bytes, we introduce a tokenization that takes pairs of bytes (with values ranging from 0 to 65535) as basic character units to generate bigram strings, as illustrated in Figure 1. Afterward, NLP-related encoding methods can be directly applied to the tokenized traffic bytes. Thus, the encrypted traffic identification is transformed into an NLP classification task.
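A minimal sketch of this bigram tokenization follows. It is our illustration, not the authors' released code; in particular, we assume non-overlapping byte pairs and zero-padding of odd-length payloads, details the text does not pin down:

    def tokenize_payload(payload: bytes) -> list[int]:
        """Map raw payload bytes to bigram tokens: each pair of adjacent
        bytes becomes one token in the range 0-65535."""
        if len(payload) % 2:                 # assumed: pad odd-length payloads
            payload += b"\x00"
        return [(payload[i] << 8) | payload[i + 1]
                for i in range(0, len(payload), 2)]

    # A 4-byte payload 0x16 0x03 0x01 0x02 yields the tokens [0x1603, 0x0102].
    print(tokenize_payload(bytes([0x16, 0x03, 0x01, 0x02])))   # [5635, 258]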
3.2  Representation Learning

When performing representation learning in an NLP task, word embedding is widely utilized. Recently, a breakthrough was made in this research area as dynamic word embedding techniques overcame the drawback of traditional word embedding methods like Word2Vec [11], which can only map words to fixed vectors. By contrast, vectors produced by dynamic word embedding are adjusted according to their context inputs, making them more powerful for learning detailed contextual information. This is just what we need for extracting complex contextual features from encrypted traffic data.

A currently popular dynamic word embedding model such as BERT can be considered a stack of a certain type of encoding layer. Each encoder takes the outputs of its former layer as inputs and learns a more abstract representation. In other words, the word embedding is dynamically adjusted as it passes through each successive encoding layer.

In our work, we take the tokenized payload string [w1, w2, ..., wk] as the original input. The first group of word embedding vectors [x1, x2, ..., xk] at the bottom of the network is randomly initialized. After N rounds of dynamic encoding, we obtain the final word embedding outputs [h1^N, h2^N, ..., hk^N], which carry highly abstract contextual information of the original payload.
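The sketch below illustrates this stacked dynamic encoding under stated assumptions: we stand in PyTorch's generic nn.TransformerEncoderLayer for the encoder type and use illustrative sizes (d_model = 256, N = 4) rather than the paper's actual configuration, and we omit positional embeddings and pre-training for brevity:

    import torch
    import torch.nn as nn

    VOCAB_SIZE = 65536           # bigram tokens from Section 3.1
    D_MODEL, N_LAYERS = 256, 4   # illustrative sizes, not the paper's settings

    # Randomly initialized bottom embeddings [x1, ..., xk].
    embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
    # A stack of N encoding layers; positional information is omitted here.
    encoders = nn.ModuleList(
        [nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
         for _ in range(N_LAYERS)]
    )

    tokens = torch.tensor([[0x1603, 0x0102, 0x0800]])  # one tokenized payload, k = 3
    h = embed(tokens)                 # [x1, ..., xk]
    for layer in encoders:            # N rounds of dynamic encoding
        h = layer(h)                  # each layer refines its inputs
    # h is now [h1^N, ..., hk^N]: context-dependent embeddings of the payload.
    print(h.shape)                    # torch.Size([1, 3, 256])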





