Page 132 - Kaleidoscope Academic Conference Proceedings 2021
2021 ITU Kaleidoscope Academic Conference
phase, where the input data is fed to the bidirectional LSTM, and the hidden states of the forward and backward passes are combined in the output layer. Validation and cost are computed after the output layer, and the weights and biases are updated through back-propagation. 20% of the data set is held out for validation, and cross entropy is used to compute the error on the validation data. Stochastic optimization with a learning rate of 0.001 is used for cost minimization.
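As a rough illustration, the validation split and cross-entropy error described above can be sketched in plain NumPy. The Bi-LSTM itself is omitted here, and the toy data, array shapes, and helper names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def split_train_val(features, labels, val_fraction=0.2, seed=0):
    """Hold out val_fraction of the samples for validation (80/20 split)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_val = int(len(features) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (features[train_idx], labels[train_idx],
            features[val_idx], labels[val_idx])

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood of the true class labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

# Toy feature matrix standing in for the skeleton feature vectors.
X = np.random.default_rng(1).normal(size=(100, 34))
y = np.arange(100) % 4
X_tr, y_tr, X_val, y_val = split_train_val(X, y)
print(len(X_tr), len(X_val))  # 80 20
```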
3. SYSTEM PARAMETERS AND ALGORITHMS

3.1 Feature extraction

Considering the large amount of data generated from the video stream, the initial data set needs to be reduced to more manageable groups for processing in order to minimize the response time. In deep learning-based systems, a pre-trained network is treated as an arbitrary feature extractor: the input image is propagated forward, the forward pass is stopped at a pre-specified layer, and the outputs of that layer are taken as features.
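The feature-extractor idea, forward-propagating the input and stopping at a pre-specified layer, can be sketched as below. The two-layer toy network, its sizes, and the ReLU activations are illustrative assumptions, since the paper does not specify which pre-trained network is used:

```python
import numpy as np

def forward_to_layer(x, layers, stop_at):
    """Propagate x through layers[0:stop_at] and return that layer's outputs."""
    h = x
    for W, b in layers[:stop_at]:
        h = np.maximum(0.0, h @ W + b)  # ReLU layer
    return h

rng = np.random.default_rng(0)
# Toy stand-in for a pre-trained network (random weights, two layers).
layers = [(rng.normal(size=(206, 128)), np.zeros(128)),
          (rng.normal(size=(128, 64)), np.zeros(64))]
frame = rng.normal(size=(1, 206))
features = forward_to_layer(frame, layers, stop_at=1)  # stop early
print(features.shape)  # (1, 128)
```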
The skeleton data of the first N frames are combined in a sliding window of size N. For feature extraction, the skeleton data is preprocessed and then sent to a classifier to obtain the final recognition result. To achieve a real-time recognition framework, the window is slid frame by frame along the time dimension of the video, and a label for the action is output for each video frame.
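The frame-by-frame sliding window can be sketched as follows; the window size, the toy skeleton data, and the dummy classifier are assumptions for illustration:

```python
import numpy as np

def sliding_window_labels(skeletons, n, classify):
    """Slide a window of n frames along the stream, emitting one label per frame."""
    labels = []
    for t in range(n - 1, len(skeletons)):
        window = skeletons[t - n + 1 : t + 1]  # the most recent n frames
        labels.append(classify(np.concatenate(window)))
    return labels

frames = [np.zeros(6) for _ in range(10)]  # toy skeleton data, 10 frames
labels = sliding_window_labels(frames, n=5, classify=lambda w: "stand")
print(len(labels))  # 6 labels for a 10-frame stream with N = 5
```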
The body displacement is a significant parameter in action recognition; it is obtained by dividing the displacement of the neck by the body height. The joint positions are likewise normalized using the body height. A total of 206-dimensional features are extracted from each frame of the video.
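The normalization above can be sketched as follows: joint coordinates are scaled by the body height, and the body displacement is the neck's displacement between frames divided by that same height. The joint values and the height are toy assumptions:

```python
import numpy as np

def normalize_skeleton(joints, height):
    """Scale joint coordinates by the body height."""
    return np.asarray(joints) / height

def body_displacement(neck_prev, neck_curr, height):
    """Neck displacement between two frames, normalized by body height."""
    delta = np.asarray(neck_curr) - np.asarray(neck_prev)
    return float(np.linalg.norm(delta) / height)

# Toy example: the neck moves 0.09 units for a subject of height 1.8.
print(round(body_displacement([0.0, 0.0], [0.0, 0.09], 1.8), 6))  # 0.05
```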
3.2 Feature selection
A total of 13026 feature samples with 206 dimensions are extracted. These features, corresponding to the normalized joint points, include the body displacement, height, and joint locations. To optimize the features, dimensionality reduction or feature selection methods are applied, reducing the vectors to 34 dimensions before classification (Figure 3).
Figure 3 – Feature selection (spatio-temporal key points and body displacement, 13026 samples × 206 dimensions, reduced by LDA to 34 dimensions for the Bi-LSTM classifier)
3.2.1 Linear discriminant analysis

A higher number of features makes the training set harder to process in further steps. However, most of those features are correlated, and hence redundant. The concatenated feature vector from the previous step is the input to the LDA algorithm, and the vector with reduced dimensionality is the output. A range of parameter values for the LDA is examined during the evaluations, and the best value of each parameter is stored for testing. The unimportant components with little variation are removed, which preserves the dominant trends that carry most of the variation in the data [11]. The resultant feature vector is a feature set of reduced dimension 34 over the 13026 samples, ready for classification.
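A minimal sketch of the LDA reduction under the usual scatter-matrix formulation is given below; in practice a library implementation such as scikit-learn's LinearDiscriminantAnalysis would be used. The toy dimensions are assumptions, and note that classical LDA yields at most (number of classes − 1) discriminant directions, so a reduction to 34 dimensions presumes sufficiently many action classes:

```python
import numpy as np

def lda_fit(X, y, n_components):
    """Return a projection matrix from within- and between-class scatter."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # Solve the generalized eigenproblem Sw^-1 Sb; keep the top directions.
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:n_components]]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 1.0, size=(50, 8)) for c in range(3)])
y = np.repeat(np.arange(3), 50)
W = lda_fit(X, y, n_components=2)
X_reduced = X @ W
print(X_reduced.shape)  # (150, 2)
```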
3.3 Skeleton generation

From each video frame, the neural network-based system detects a human skeleton (joint positions). For each frame in the input video stream, the skeleton is used as raw data to extract features: the human subject is detected and a skeleton is generated for the frame, as detailed in Algorithm 1. A continuous sequence of skeleton information is thus generated for a video stream.
Algorithm 1: Skeleton_Generation
Input: Video stream
Initialize numbering for joints
Declare pose_pairs
for each frame in video:
    for i in range(len(BODY_PARTS)):
        Generate heatmap
        Find x, y coordinates
        if multiple persons detected:
            Compute centroid value
    for pair in pose_pairs:
        Draw ellipse for pose_pair coordinates
        Draw lines between pose_pair coordinates
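The joint-detection step of Algorithm 1 can be sketched as follows: for each body part, the network's heatmap is read and the (x, y) location of its peak is taken as the joint position. The random heatmaps stand in for the real network output, and the BODY_PARTS list and confidence threshold are assumptions modeled on common pose-estimation pipelines:

```python
import numpy as np

BODY_PARTS = ["Head", "Neck", "RShoulder", "LShoulder"]

def joints_from_heatmaps(heatmaps, threshold=0.1):
    """Return one (x, y) per body part, or None if the peak is below threshold."""
    joints = []
    for i in range(len(BODY_PARTS)):
        hm = heatmaps[i]
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        joints.append((int(x), int(y)) if hm[y, x] > threshold else None)
    return joints

rng = np.random.default_rng(0)
# Toy heatmaps in place of the network output (one 46x46 map per part).
heatmaps = rng.uniform(0.0, 1.0, size=(len(BODY_PARTS), 46, 46))
skeleton = joints_from_heatmaps(heatmaps)
print(len(skeleton))  # one joint estimate per body part
```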
3.4 Classifier implementation

The Skeleton Activity Forecasting (SAF) and Bi-LSTM network (SAF+Bi-LSTM) is implemented for multi-class classification, i.e., n classifiers are constructed for n classes. The i-th Bi-LSTM is trained with the i-th class data, such that its positive samples are labeled and all the rest become negative samples. The test sample is run against all n Bi-LSTMs, and the multi-class result is analyzed for the recognition of the action. The Bi-LSTM is trained on the concatenated skeleton feature vector, and the decision is based on the maximum value among the n classifiers. The pose sequences of the different classes are trained on the n Bi-LSTM classifiers.
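The one-vs-rest scheme can be sketched as below: one binary scorer per class, with the predicted action being the class whose scorer returns the maximum value. The per-class scoring functions are simple stand-ins for the n trained Bi-LSTM classifiers:

```python
import numpy as np

def one_vs_rest_predict(sample, scorers):
    """Run the sample against every binary classifier; take the argmax."""
    scores = [score(sample) for score in scorers]
    return int(np.argmax(scores))

# Toy scorers: negative distance to each class prototype as confidence.
prototypes = [np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([2.0, 0.0])]
scorers = [lambda s, p=p: -float(np.linalg.norm(s - p)) for p in prototypes]
print(one_vs_rest_predict(np.array([0.9, 1.1]), scorers))  # 1
```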
The deep neural network is implemented with three hidden layers and a dropout layer. The network takes the feature vector X, the concatenation of the extracted skeletal features, along with the target vector Y, to learn the non-linear function for classification. Back propagation is used