Abstract:- The goal of vision-based sign language recognition is to improve communication for the hearing impaired. However, the majority of the available sign language datasets are constrained. Real-time hand sign language identification is a problem in the world of computer vision due to factors including hand occlusion, rapid hand movement, and complicated backgrounds.

In this study, we develop a deep learning-based architecture for effective sign language recognition using a Single Shot Detector (SSD), a 2D Convolutional Neural Network (2DCNN), a 3D Convolutional Neural Network (3DCNN), and Long Short-Term Memory (LSTM) from depth and RGB input videos.

Keywords:- Sign Language Recognition System, Multi Modal Approach, Skeleton Based.

I. INTRODUCTION

Since sign languages have distinctive linguistic patterns, they are largely employed as a means of communication by the deaf community. A loss of socialisation may occur when deaf-mute people are unable to communicate with members of the hearing community. In these circumstances, a caregiver must communicate on behalf of the deaf-mute person. Therefore, it is crucial to create a continuous sign language recognition system that does not require any apparatus on the hands, in order to give these two populations similar communication opportunities.

An extensive amount of study has been done to improve the understanding of hand sign language. However, there are still a number of difficulties, including real-time performance, hand occlusion, quick hand movements, and many others.

Several proposed models combine features in various ways. We expand on the 2D hand skeleton model, allowing different viewpoints to learn the features of the hand, and we leverage the dynamics as local spatio-temporal as well as long-term features from the LSTM model, whereas earlier models included only features relating to 2D views of the hand. Our model makes use of several complementary sources of data, including heatmap features, pixel-level appearance, and hand skeleton views. Utilizing a 3DCNN and an LSTM, we learn to take advantage of the preceding 2DCNN model along with the other feature representations. The model also follows a simple identification procedure: an SSD to concentrate on the region of interest, a 2DCNN to extract spatially discriminative features, a 3DCNN to extract local spatio-temporal dynamics from various feature representations, and an LSTM to recognise the final sign labels. Our model demonstrates consistent performance gains for the suggested methodology across all tested datasets. In order to promote two-way communication between the hearing impaired and the general public, we propose a deep learning based architecture for effective continuous sign language recognition. We perform continuous sign identification over vast sign vocabularies using real-time input from a webcam on a phone, laptop, or other device. First, an SSD architecture is utilised to detect the user's hands in the input. Additionally, a 3D hand skeleton is created utilising the 3D keypoints to provide numerous views of the hand. Then, five 3DCNNs are fed with the cropped hands, the RGB video, and the depth video. Each 3DCNN extracts complementary spatio-temporal information, such as features related to appearance, geometry under various camera angles, and features from hand joint prediction scores. For the final categorization of the hand, the outputs of the five 3DCNNs are concatenated, ensembled, and sent to an LSTM. Finally, we use the expansion of 3D local spatio-temporal features for long-term sign modelling.

II. LITERATURE REVIEW

Sign language recognition (SLR) systems have achieved great progress and high recognition accuracy in recent years through practical deep learning architectures and improved computing power [1, 2, 3, 4, 5, 6, 7, 8, 9]. The remaining task of SLR is to capture all the body movement information and the local arm, hand, and facial expressions at the same time. [1] proposes to use a linear discriminant analysis (LDA) algorithm for hand gesture recognition to convert the recognized hand gestures into text and speech formats. The paper [2] proposes a hand gesture recognition method based on the YCbCr color space, COG, and template matching. In [3], the proposed system utilizes an LMC as a sensor, and SVM and DNN are utilized for data training. Study [4] creates a network that can effectively classify images of static signs into equivalent text with a CNN. [5] presents a glove system that recognizes numbers from sign language; the KNN algorithm is used as a classifier. Model [6] recognizes characters using SVM and FMG algorithms. In [7], an RNN is used to capture the long-term time dependence between the inputs, and a 2D-CNN is used to extract spatial features from the input. The system proposed in [8] converts cross-domain knowledge into message tokens to improve the accuracy of the WSLR model. The proposed system [9] presents a transformer-based learning system for recognizing continuous sign language and translating it to text; this is achieved using connectionist temporal classification (CTC). These methods are not yet effective enough to capture complete motion information.
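Since some of the surveyed systems (e.g., [9]) rely on connectionist temporal classification, a brief sketch may help: CTC aligns unsegmented per-frame scores with a gloss sequence via an extra blank symbol. The following minimal PyTorch example is illustrative only; the shapes, sequence lengths, and vocabulary size are our assumptions, not values from [9].

```python
# Minimal sketch of a CTC training step for continuous SLR, assuming a
# frame-level encoder already produces per-frame class scores.
import torch
import torch.nn as nn

T, N, C = 60, 2, 227                            # frames, batch, 226 signs + 1 CTC blank
logits = torch.randn(T, N, C, requires_grad=True)
log_probs = logits.log_softmax(dim=2)           # (T, N, C) per-frame log scores
targets = torch.randint(1, C, (N, 8))           # gloss label sequences (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 8, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)   # blank=0 is the extra CTC symbol
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()                                 # gradients flow back to the encoder
print(float(loss))
```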
Fig 1:- Our Approach to Sign Language Recognition System Using a Multi-Modal Ensemble
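As a concrete illustration of the pipeline in Fig 1, below is a minimal PyTorch sketch of the multi-modal ensemble: five 3DCNN streams whose spatio-temporal features are concatenated over time and classified by an LSTM. The layer sizes, channel counts, and feature dimensions are illustrative assumptions and do not reproduce our exact configuration.

```python
# Sketch of the Fig 1 ensemble: five 3DCNN streams (e.g., cropped hands,
# RGB video, depth video, skeleton views), concatenated and fed to an LSTM.
import torch
import torch.nn as nn

class Stream3DCNN(nn.Module):
    """One 3DCNN branch extracting spatio-temporal features from a clip."""
    def __init__(self, in_channels, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                 # pool space, keep time
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),      # collapse space, keep time
        )
        self.proj = nn.Linear(64, feat_dim)

    def forward(self, x):                            # x: (B, C, T, H, W)
        f = self.conv(x)                             # (B, 64, T, 1, 1)
        f = f.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 64)
        return self.proj(f)                          # (B, T, feat_dim)

class MultiModalSLR(nn.Module):
    """Concatenate the five stream features per frame, classify with an LSTM."""
    def __init__(self, stream_channels=(3, 3, 1, 3, 3), num_signs=226):
        super().__init__()
        self.streams = nn.ModuleList([Stream3DCNN(c) for c in stream_channels])
        self.lstm = nn.LSTM(128 * len(stream_channels), 256, batch_first=True)
        self.head = nn.Linear(256, num_signs)

    def forward(self, clips):                        # list of five (B, C, T, H, W)
        feats = torch.cat([s(x) for s, x in zip(self.streams, clips)], dim=2)
        _, (h, _) = self.lstm(feats)                 # h: (1, B, 256)
        return self.head(h[-1])                      # sign logits: (B, num_signs)

# Example: five synthetic modality clips of 16 frames at 64x64
model = MultiModalSLR()
clips = [torch.randn(2, c, 16, 64, 64) for c in (3, 3, 1, 3, 3)]
print(model(clips).shape)                            # torch.Size([2, 226])
```

Keeping the time axis intact through the 3DCNN pooling lets the LSTM model long-term sign dynamics on top of the local spatio-temporal features each stream extracts.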
IV. RESULT AND ANALYSIS

In this section, we present an evaluation of our proposed framework on the AUTSL dataset.

Table 1:- Results of single modalities on the AUTSL dataset.

Table 2:- Performance of our ensemble results evaluated on the AUTSL test set.

AUTSL Dataset –
The dataset is collected for general SLR tasks in Turkish sign language. A Kinect V2 sensor is utilized in the collection procedure. Specifically, 43 signers with 20 backgrounds are assigned to perform 226 different sign actions. In total, it contains 38,336 video clips, which are split into training, validation, and testing subsets. The statistical summary of the balanced dataset, which is used in the challenge, is listed in Table .
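For concreteness, a minimal PyTorch Dataset sketch for AUTSL-style paired RGB/depth clips is given below. The CSV format, the file naming scheme, and the read_video_frames callable are hypothetical stand-ins, not the official AUTSL distribution format.

```python
# Minimal loader sketch for AUTSL-style samples, assuming a hypothetical
# layout of paired RGB/depth videos and a CSV of labels per sample.
import csv
from torch.utils.data import Dataset

class AUTSLClips(Dataset):
    def __init__(self, labels_csv, read_video_frames):
        # labels_csv rows: sample_id,label  (label is one of 226 sign classes)
        with open(labels_csv) as f:
            self.items = [(row[0], int(row[1])) for row in csv.reader(f)]
        self.read = read_video_frames       # callable: path -> (C, T, H, W) tensor

    def __len__(self):
        return len(self.items)

    def __getitem__(self, i):
        sample_id, label = self.items[i]
        rgb = self.read(f"{sample_id}_color.mp4")    # Kinect V2 RGB stream
        depth = self.read(f"{sample_id}_depth.mp4")  # Kinect V2 depth stream
        return rgb, depth, label
```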
Evaluation –

Baseline Method –
Along with the AUTSL benchmark, several deep learning based models have been proposed. We take the best model benchmarked in [12] as our baseline. Specifically, the model is mainly constructed using a CNN + LSTM structure, where a 2D-CNN model is used to extract features for each video frame and bidirectional LSTMs (BLSTMs) are adopted on top of these 2D-CNN features to learn their temporal relations. A feature pooling module (FPM) [12] is plugged in after the 2D-CNN model to obtain a multi-scale representation of the features.

When training our models on the training set, we adopt an early stopping technique based on the validation accuracy to obtain our best models. Then we test our best models on the test set and use the hyperparameters tuned on the validation set to obtain our ensemble prediction. To further improve our performance, we finetune our best models on the union of the training and validation sets. We stop training when the training loss in our finetuning experiment is reduced to the same level as that of our best models in the training phase. Our predictions with and without finetuning are evaluated and reported in Table 2. Our proposed SAM-SLR approach surpasses the baseline methods significantly.
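This protocol can be summarised in a short sketch, assuming hypothetical train_epoch (returns the mean training loss) and evaluate (returns accuracy) helpers; the patience and epoch limits are illustrative, not the values used in our experiments.

```python
# Sketch of the training protocol: early stopping on validation accuracy,
# then finetuning on train+val until the training loss matches the best
# training-phase level.
from torch.utils.data import ConcatDataset

def fit_with_early_stopping(model, train_set, val_set, train_epoch, evaluate,
                            patience=5, max_epochs=100):
    """Train until validation accuracy stops improving; keep the best weights."""
    best_acc, best_state, best_loss, stale = -1.0, None, float("inf"), 0
    for _ in range(max_epochs):
        loss = train_epoch(model, train_set)    # mean training loss this epoch
        acc = evaluate(model, val_set)          # validation accuracy
        if acc > best_acc:
            best_acc, best_loss, stale = acc, loss, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            stale += 1
            if stale >= patience:               # early stop
                break
    model.load_state_dict(best_state)
    return best_loss                            # loss level to match when finetuning

def finetune_on_union(model, train_set, val_set, train_epoch, target_loss,
                      max_epochs=20):
    """Finetune on train+val until training loss reaches the training-phase level."""
    union = ConcatDataset([train_set, val_set])
    for _ in range(max_epochs):
        if train_epoch(model, union) <= target_loss:
            break
```

The per-modality predictions from the finetuned models can then be combined, e.g. by averaging their softmax scores, to form the ensemble prediction.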