The illustration of our representation learning is shown in Figure 2.

Figure 2 – Representation learning network for payload data using the dynamic word embedding architecture

An earlier dynamic word embedding, the Embeddings from Language Models (ELMo) [12], uses a bidirectional LSTM as its encoder unit, which is not suitable for large-scale training since the LSTM supports parallel computation poorly. To solve this problem, [3] replaced the LSTM with the self-attention encoder first applied in the transformer model [13] and named the resulting embedding model BERT. This is also what we use for encoding the encrypted payload, as shown in Figure 3. Taking our first embedding vectors [x1, x2, ..., xk] as an example, the transformer encoding consists of the following steps:

Figure 3 – Detail of the encoding layer

Linear projections: Each embedding vector xi is projected to three vectors by linear transformations:

$$Q_i = x_i W^Q, \quad K_i = x_i W^K, \quad V_i = x_i W^V \tag{1}$$

where $W^Q$, $W^K$ and $W^V$ are the three groups of linear parameters.
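As a concrete illustration of the projection step in Eq. (1), the minimal NumPy sketch below projects one embedding vector into its query, key and value vectors. It is not code from the paper; the sizes d_model and d_k are assumed placeholder values, and the row-vector convention (x W) is used to match Eq. (1).

```python
import numpy as np

d_model, d_k = 64, 16            # assumed sizes, not taken from the paper
rng = np.random.default_rng(0)

# The three groups of linear parameters W^Q, W^K, W^V from Eq. (1).
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

def project(x):
    """Project one embedding vector x_i to its query, key and value vectors."""
    return x @ W_Q, x @ W_K, x @ W_V

x_i = rng.normal(size=d_model)   # one payload embedding vector
Q_i, K_i, V_i = project(x_i)     # each has dimension d_k
```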
Self-attention and optional masking: The purpose of the linear projections is to generate the inputs for the self-attention mechanism. Generally speaking, self-attention computes the compatibility between each input xi and all the other inputs x1~xk via a similarity function, and then calculates a weighted sum for xi that captures its overall contextual information. In detail, our self-attention is calculated as follows:

$$att_i = \sum_{j=1}^{k} \frac{Q_i \cdot K_j}{\sqrt{d_k}\, Z} \times V_j \tag{2}$$

The similarity between xi and xj is computed by a scaled dot-product operator, where d_k is the dimension of Kj and Z is the normalization factor. It should be noted that not every input vector is needed for the self-attention calculation. An optional masking strategy that randomly ignores a few inputs while generating the attention vectors is allowed in order to avoid over-fitting.
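The sketch below illustrates Eq. (2), under two assumptions not spelled out in the paper: Z is taken to be the usual softmax-style normalization of the scaled dot-product scores, and the optional masking is modelled as randomly dropping a few input positions from the weighted sum. Names and sizes are illustrative only.

```python
import numpy as np

def self_attention(Q, K, V, mask_prob=0.0, rng=None):
    """Scaled dot-product self-attention over k inputs, following Eq. (2).

    Q, K, V: arrays of shape (k, d_k) holding the projected vectors of Eq. (1).
    mask_prob: probability of randomly ignoring an input position, sketching the
               optional masking strategy used to avoid over-fitting.
    """
    k, d_k = K.shape
    scores = (Q @ K.T) / np.sqrt(d_k)            # scaled dot-product similarities

    if mask_prob > 0.0:
        rng = rng or np.random.default_rng()
        dropped = rng.random(k) < mask_prob      # randomly chosen inputs to ignore
        scores[:, dropped] = -np.inf             # excluded from the weighted sum
        # (assumes at least one position remains unmasked)

    # Z: softmax-style normalization of the similarity scores (an assumption).
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                           # one attention vector per input

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(5, 16))             # toy projections for 5 inputs
att = self_attention(Q, K, V, mask_prob=0.1, rng=rng)
```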
Multi-head attention: In order to give the encoders the ability to reach more contextual information, the transformer encoder applies a multi-head attention mechanism. Specifically, the linear projections are performed M times for each xi to generate multiple attention vectors [atti,1, atti,2, ..., atti,M]. Afterward, a concatenation operator is used to obtain the final attention vector:

$$att_i = att_{i,1} \oplus att_{i,2} \oplus \dots \oplus att_{i,M} \tag{3}$$
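The following sketch ties the two previous steps together for Eq. (3): M independent sets of projections each produce their own attention vectors, which are then concatenated. The head count M, the sizes, and the random parameters are assumptions for illustration, not values from the paper.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, M=4, d_k=16, rng=None):
    """Run M independent projection/attention heads and concatenate them (Eq. (3)).

    X: array of shape (k, d_model) holding the embedding vectors x_1..x_k.
    Returns an array of shape (k, M * d_k): att_i = att_{i,1} (+) ... (+) att_{i,M}.
    """
    rng = rng or np.random.default_rng(0)
    k, d_model = X.shape
    heads = []
    for _ in range(M):
        # Each head has its own W^Q, W^K, W^V (random here, purely for illustration).
        W_Q, W_K, W_V = [rng.normal(size=(d_model, d_k)) for _ in range(3)]
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        att = softmax((Q @ K.T) / np.sqrt(d_k)) @ V   # per-head attention vectors
        heads.append(att)
    return np.concatenate(heads, axis=-1)             # the concatenation of Eq. (3)
```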
Feed-forward network (FFN): A fully connected network that provides the output of the current encoder. For xi, it is computed as follows:

$$h_i = \max(0,\, att_i W_1 + b_1)\, W_2 + b_2 \tag{4}$$

W1, b1, W2 and b2 are the fully connected parameters, and max(0, x) is a ReLU activation.
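A minimal sketch of Eq. (4) follows: a ReLU layer and a second linear layer applied to an attention vector to produce the encoder output. The layer sizes are assumed for illustration and are not taken from the paper.

```python
import numpy as np

def feed_forward(att, W1, b1, W2, b2):
    """Feed-forward network of Eq. (4): max(0, att·W1 + b1)·W2 + b2."""
    hidden = np.maximum(0.0, att @ W1 + b1)   # max(0, x) is the ReLU activation
    return hidden @ W2 + b2

# Assumed sizes for illustration only.
d_att, d_ff, d_out = 64, 256, 64
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_att, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_out)), np.zeros(d_out)

h_i = feed_forward(rng.normal(size=d_att), W1, b1, W2, b2)   # encoder output for x_i
```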
Finally, we get the dynamic embedding hi, which is encoded from xi. It can be further encoded by the next encoding layer or be directly used in downstream tasks. Similar to the naming of BERT, we name our encoding network the Payload Encoding Representation from Transformer (PERT), considering its application of a transformer encoder.

3.3 Packet-level Pre-training

A key factor that makes BERT and its extensive models continuously achieve state-of-the-art results among a wide