phase, where the input data is fed to the bidirectional LSTM, and the hidden states of the forward and backward passes are combined in the output layer. The validation loss and cost are computed after the output layer, and the weights and biases are updated through back-propagation. 20% of the data set is held out for validation, and cross entropy is used to calculate the validation error. Stochastic optimization with a learning rate of 0.001 is used for cost minimization.
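A minimal sketch of how this training setup could look in Keras, assuming Adam as the stochastic optimizer; the window length, hidden size, and class count are illustrative values, not taken from the paper:

import tensorflow as tf

# Illustrative sizes (assumed): a window of N frames with
# 206 features per frame and n action classes.
N, FEATURES, N_CLASSES = 30, 206, 10

model = tf.keras.Sequential([
    # Forward and backward hidden states are concatenated in the output.
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64), input_shape=(N, FEATURES)),
    tf.keras.layers.Dense(N_CLASSES, activation="softmax"),
])

# Cross entropy for the error and Adam (a stochastic optimizer)
# with the stated learning rate of 0.001.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])

# 20% of the data set held out for validation:
# model.fit(x, y, validation_split=0.2, epochs=50)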
3.  SYSTEM PARAMETERS AND ALGORITHMS

3.1   Feature extraction

Considering the large amount of data generated from the video stream, the initial set of data needs to be reduced to more manageable groups for processing in order to minimize the response time. In deep learning-based systems, a pre-trained network is treated as an arbitrary feature extractor: the input image is propagated forward, the propagation is stopped at a pre-specified layer, and the outputs of that layer are taken as features.
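As a rough illustration of this pattern, a pre-trained CNN can be truncated at a pre-specified layer; the backbone and layer name below are assumptions, since the paper does not identify the network:

import tensorflow as tf

# Hypothetical backbone and cut-off layer.
base = tf.keras.applications.VGG16(weights="imagenet", include_top=False)
extractor = tf.keras.Model(inputs=base.input,
                           outputs=base.get_layer("block4_pool").output)

# Propagate an input image batch forward and take that layer's
# activations as the features:
# features = extractor.predict(image_batch)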
The skeleton data of the first N frames are combined into a sliding window of size N. For feature extraction, the skeleton data is preprocessed and then sent to a classifier to obtain the final recognition result. To achieve a real-time recognition framework, the window is slid frame by frame along the time dimension of the video, and an action label is output for each video frame, as sketched below.
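A small sketch of this frame-by-frame sliding window, with extract_features and classify as hypothetical stand-ins for the paper's feature extraction and classifier:

from collections import deque

def classify_stream(frames, extract_features, classify, N=30):
    """Slide a window of the last N skeleton feature vectors along
    the time axis and emit an action label for each new frame."""
    window = deque(maxlen=N)
    for frame in frames:
        window.append(extract_features(frame))
        if len(window) == N:              # window full: classify it
            yield classify(list(window))  # one label per video frame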
The body displacement is a significant parameter in action recognition; it is obtained by dividing the displacement of the neck by the body height. The joint positions are likewise normalized by the body height. In total, 206-dimensional features are extracted from each frame of the video.
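For illustration only, the normalization described above might look like the following; the helper name and array layout are assumptions:

import numpy as np

def frame_features(joints, neck_prev, neck, height):
    """Illustrative helper: `joints` is an array of (x, y) joint
    positions; the body displacement is the neck displacement
    divided by the body height, as described above."""
    displacement = np.linalg.norm(neck - neck_prev) / height
    normalized = joints / height  # joint positions scaled by body height
    return displacement, normalized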
3.2   Feature selection

A total of 13026 feature samples with 206 dimensions are extracted. These features, corresponding to the normalized joint points, include the body displacement, height, and joint locations. To optimize the features, dimensionality reduction or feature selection methods are applied, reducing them to 34 dimensions before classification (Figure 3).

Figure 3 – Feature selection: spatio-temporal key points and body displacement (13026 samples, 206 dimensions) are reduced by LDA to 34 dimensions and passed to the Bi-LSTM classifier.
3.2.1   Linear discriminant analysis

A higher number of features makes the training set harder to process and slows further processing. However, most of those features are correlated and hence redundant. The concatenated feature vector from the previous step is the input to the LDA algorithm, and a vector of reduced dimensionality is the output. A range of parameter values of the LDA is examined during the evaluations, and the best value of each parameter is stored for testing. The unimportant components with less variation are removed, so that the dominant trends carrying more variation of the data are retained [11]. The result is a feature set of reduced dimension 34 over the 13026 samples, ready for classification.
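A possible realization with scikit-learn, assuming its LinearDiscriminantAnalysis as the LDA implementation:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Reduce the 13026 x 206 feature matrix to 34 dimensions.
# Caveat: scikit-learn caps n_components at n_classes - 1, so this
# setting presumes at least 35 action classes.
lda = LinearDiscriminantAnalysis(n_components=34)
# X: (13026, 206) feature matrix, y: the class label of each sample
# X_reduced = lda.fit_transform(X, y)   # -> shape (13026, 34)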
3.3   Skeleton generation

From each video frame, the neural network-based system detects a human skeleton (joint positions). For each frame in the input video stream, this skeleton is used as raw data for feature extraction, and a skeleton is generated per frame as detailed in Algorithm 1. A continuous sequence of skeleton information is thus generated for the video stream.
Algorithm 1: Skeleton_Generation
  Input: Video stream
  Initialize numbering for joints
  Declare pose_pairs
  for each frame in video:
      for i in range(len(BODY_PARTS)):
          Generate heatmap
          Find x, y coordinates
      if multiple persons detected:
          Compute centroid value
      for pair in pose_pairs:
          Draw ellipse at pose_pair coordinates
          Draw lines between pose_pair coordinates
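As a hedged illustration, Algorithm 1 maps naturally onto the OpenCV DNN OpenPose sample; the weights file below is an assumption, and the multi-person centroid step is left out for brevity:

import cv2 as cv

# Joint numbering and pose pairs in the style of the OpenCV OpenPose
# sample; "graph_opt.pb" is an assumed weights file, not from the paper.
BODY_PARTS = ["Nose", "Neck", "RShoulder", "RElbow", "RWrist",
              "LShoulder", "LElbow", "LWrist", "RHip", "RKnee",
              "RAnkle", "LHip", "LKnee", "LAnkle", "REye", "LEye",
              "REar", "LEar"]
POSE_PAIRS = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
              (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13)]

net = cv.dnn.readNetFromTensorflow("graph_opt.pb")

def skeleton(frame, conf_threshold=0.2):
    h, w = frame.shape[:2]
    net.setInput(cv.dnn.blobFromImage(frame, 1.0, (368, 368),
                                      (127.5, 127.5, 127.5), swapRB=True))
    out = net.forward()                  # one heatmap per body part
    points = []
    for i in range(len(BODY_PARTS)):
        heatmap = out[0, i, :, :]                # generate heatmap
        _, prob, _, pt = cv.minMaxLoc(heatmap)   # find x, y coordinates
        points.append((int(w * pt[0] / out.shape[3]),
                       int(h * pt[1] / out.shape[2]))
                      if prob > conf_threshold else None)
    for a, b in POSE_PAIRS:              # draw the detected skeleton
        if points[a] and points[b]:
            cv.line(frame, points[a], points[b], (0, 255, 0), 2)
            cv.ellipse(frame, points[a], (3, 3), 0, 0, 360,
                       (0, 0, 255), cv.FILLED)
    return points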
3.4   Classifier implementation

The Skeleton Activity Forecasting (SAF) and Bi-LSTM network (SAF+Bi-LSTM) is implemented for multi-class classification, i.e., n classifiers are constructed for the n classes. The i-th Bi-LSTM is trained with the i-th class data such that its samples are labeled positive and all the rest become negative samples. A test sample is run against all n Bi-LSTMs, and the multi-class result is analyzed for the recognition of the action. Each Bi-LSTM is trained on the concatenated skeleton feature vector, and the decision is based on the maximum value among the n classifiers. The pose sequences of the different classes are used to train the n Bi-LSTM classifiers.
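A sketch of this one-vs-rest scheme, assuming Keras Bi-LSTMs; the window length, hidden size, and helper names are illustrative:

import numpy as np
import tensorflow as tf

def binary_bilstm(N=30, features=34):
    """One binary Bi-LSTM per class (sizes are illustrative)."""
    m = tf.keras.Sequential([
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64),
                                      input_shape=(N, features)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    m.compile(optimizer="adam", loss="binary_crossentropy")
    return m

# For classifier i, class-i samples are positive, all others negative:
# models = [binary_bilstm() for _ in range(n_classes)]
# for i, m in enumerate(models):
#     m.fit(x, (y == i).astype("float32"), validation_split=0.2)

def predict_action(models, sample):
    # Run the test sample against all n Bi-LSTMs and take the
    # maximum-scoring class as the recognized action.
    scores = [float(m.predict(sample, verbose=0)[0, 0]) for m in models]
    return int(np.argmax(scores))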
The deep neural network is implemented with three hidden layers and a dropout layer. The network takes the feature vector X, the concatenation of the extracted skeletal features, together with the target vector Y, to learn a non-linear function for classification. Back-propagation is used



