indicates its concatenated embedding. P is the conditional probability provided by the softmax layer. In other words, the objective is to maximize the probability that each encoded flow sample is predicted as its corresponding category. The flow-level information is involved in the final softmax classifier, and will thus be used to fine-tune the packet-level encoding network during back propagation. The main point of such a fine-tuning strategy is to separate the learning of packet relationships from the time-consuming pre-training procedures.
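For illustration, a minimal PyTorch sketch of this objective could look as follows; the stand-in encoder, the feature sizes, and all names here are illustrative assumptions, not our released code:

    import torch
    import torch.nn as nn

    # Stand-ins for the real networks (for illustration only): `encoder`
    # plays the role of the pre-trained packet-level encoding network.
    hidden, packet_num, num_classes = 768, 5, 12
    encoder = nn.Sequential(nn.Linear(128, hidden), nn.ReLU())
    classifier = nn.Linear(packet_num * hidden, num_classes)  # final softmax layer

    packets = torch.randn(packet_num, 128)       # one flow's packet features
    flow_emb = torch.cat([encoder(p) for p in packets], dim=-1)  # concatenated embedding
    logits = classifier(flow_emb).unsqueeze(0)
    label = torch.tensor([3])                    # the flow's true category
    loss = nn.functional.cross_entropy(logits, label)  # minimizing -log P maximizes P
    loss.backward()  # gradients reach `encoder`: the packet-level network is fine-tuned

Only the final classifier consumes flow-level information, so the expensive pre-training never has to see flows at all; cross-entropy on the softmax output then pushes gradients back into the packet-level encoder.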
4. EXPERIMENTS

4.1 Experiment Settings

4.1.1 Data sets
Unlabeled Traffic Data: This is the data set utilized for the pre-training of our PERT encoding network. To generate it, we captured a large amount of raw traffic data from different sources, using different devices, through a network sniffer. There is no special requirement for the unlabeled traffic data, except that the collected samples should cover as many of the mainstream protocols as possible.
ISCX Data Set: We chose a popular encrypted traffic data set, "ISCX2016 VPN-nonVPN"¹ [15], to make our classification evaluations more persuasive. However, this data set only marks where its encrypted traffic data was captured from and whether the capture was made through a VPN session, which means further labeling has to be performed. The ISCX data set has been utilized in several works, yet the reported results differ considerably even when the same model is applied [7], [8]. This is mainly due to how the raw data is processed and labeled. We only found that [7] provided their pre-processing and labeling procedures on their GitHub². We therefore follow this open-source project to process the raw ISCX data set and label it with 12 classes.
Android Application Data Set: We find that the ISCX data set is not entirely encrypted, as it also contains data of some unencrypted protocols such as DNS. To make a better evaluation, in this work we manually capture traffic samples from 100 Android applications via Android devices and a network sniffer tool-kit. All of the captured data belongs to the top activated applications of the Chinese Android app markets. Afterward, we exclusively pick the HTTPS flows to ensure that only encrypted data remains.
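We do not detail the flow selection procedure here; as one possible realization, the following sketch keeps only TCP port-443 packets from a capture file (the use of scapy and the port-based heuristic are assumptions for illustration):

    from scapy.all import rdpcap, TCP  # scapy is an illustrative choice of tool

    def https_packets(pcap_path):
        """Yield packets on TCP port 443 - a simple proxy for HTTPS traffic."""
        for pkt in rdpcap(pcap_path):
            if TCP in pkt and 443 in (pkt[TCP].sport, pkt[TCP].dport):
                yield pkt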
4.1.2 Parameters
Pre-training: First of all, to perform the packet-level pre-training on our unlabeled traffic data, we introduce the public Python library transformers³, which provides implementations of the original BERT model and of several recently published modified models. In practice, we chose the optimized BERT implementation named A Lite BERT (ALBERT) [16], which is more efficient and less resource-consuming. However, even when properly optimized, current BERT pre-training is so costly that we use 4 Nvidia Tesla P100 GPU cards.
Table 1 – Pre-training parameter settings

    Parameter            Value  Description
    hidden_size          768    Vector size of the encoding outputs (embedding vectors).
    num_hidden_layers    12     Number of encoders used in the encoding network.
    num_attention_heads  12     Number of attention heads used in the multi-head attention mechanism.
    intermediate_size    3072   Size of the hidden vectors in the feed-forward (FFN) networks.
    input_length         128    Number of tokenized bigrams used in a single packet.

Table 1 shows the settings of our pre-training and a corresponding description of each parameter. These settings follow those commonly used in NLP works with BERT encoding. After sufficient training, we save the encoding network in a PyTorch⁴ format so that it can be reused in our classification networks. All of our other networks are implemented using PyTorch as well.
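As a rough sketch, such a run can be set up with the transformers library using the Table 1 settings; the vocabulary size, the checkpoint path, and the masking strategy are assumptions here, and PERT's actual bigram tokenization is as described earlier:

    import torch
    from transformers import AlbertConfig, AlbertForMaskedLM

    config = AlbertConfig(
        vocab_size=65536,             # assumed: one id per possible byte bigram
        hidden_size=768,              # Table 1: hidden_size
        num_hidden_layers=12,         # Table 1: num_hidden_layers
        num_attention_heads=12,       # Table 1: num_attention_heads
        intermediate_size=3072,       # Table 1: intermediate_size
        max_position_embeddings=128,  # Table 1: input_length (bigrams per packet)
    )
    model = AlbertForMaskedLM(config)

    # Dummy batch standing in for tokenized packets; real inputs are bigram ids,
    # and the MLM masking strategy is omitted in this sketch.
    input_ids = torch.randint(0, config.vocab_size, (8, 128))
    loss = model(input_ids=input_ids, labels=input_ids.clone()).loss
    loss.backward()

    model.albert.save_pretrained("pert_encoder")  # hypothetical checkpoint path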
Table 2 – Classification parameter settings

    Parameter       Value                      Description
    packet_num      adjustable (5 by default)  Number of first packets in a flow that are chosen.
    softmax_hidden  768                        Size of the hidden vectors in the softmax layer.
    dropout         0.5                        Dropout rate of the softmax layer.

Classification: The encoding network used at the classification stage shares strictly the same structure as the pre-trained one. Other settings of the classification layers are shown in Table 2. As fine-tuning the encoding network in a classification task is relatively inexpensive [3], a single GPU card is enough.
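A minimal sketch of this stage, assuming the encoder checkpoint saved above and the Table 2 defaults (the class names, the checkpoint path, and the use of ALBERT's pooled output are assumptions):

    import torch
    import torch.nn as nn
    from transformers import AlbertModel

    class FlowClassifier(nn.Module):
        def __init__(self, packet_num=5, softmax_hidden=768, num_classes=12):
            super().__init__()
            self.encoder = AlbertModel.from_pretrained("pert_encoder")  # hypothetical path
            hidden = self.encoder.config.hidden_size                    # 768
            self.dropout = nn.Dropout(0.5)                              # Table 2: dropout
            self.fc = nn.Linear(packet_num * hidden, softmax_hidden)
            self.out = nn.Linear(softmax_hidden, num_classes)           # softmax layer

        def forward(self, flow_ids):        # flow_ids: (batch, packet_num, 128) token ids
            b, n, length = flow_ids.shape
            emb = self.encoder(flow_ids.view(b * n, length)).pooler_output
            flow_emb = emb.view(b, -1)      # concatenated packet embeddings
            h = self.dropout(torch.relu(self.fc(flow_emb)))
            return self.out(h)              # logits; softmax is applied in the loss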
4.1.3 Baselines

Below are the baseline classification methods we use for the PERT comparison:

¹ https://www.unb.ca/cic/datasets/vpn.htm
² https://github.com/echowei/DeepTraffic
³ https://huggingface.co/transformers/
⁴ https://pytorch.org