
Volume 7, Issue 7, July – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Sign Language Recognition System


Likhitha K, Sahana H J, Niharika B R, Abhishek Raju, Prathima M G
Computer Science Department, B.I.T. Bangalore

Abstract:- The goal of vision-based sign language recognition is to improve communication for the hearing impaired. However, the majority of the available sign language datasets are constrained. Real-time hand sign language identification is a challenging problem in computer vision due to factors including hand occlusion, rapid hand movement, and complicated backgrounds.

In this study, we develop a deep learning-based architecture for effective sign language recognition using a Single Shot Detector (SSD), a 2D Convolutional Neural Network (2DCNN), a 3D Convolutional Neural Network (3DCNN), and Long Short-Term Memory (LSTM), operating on depth and RGB input videos.

Keywords:- Sign Language Recognition System, Multi Modal Approach, Skeleton Based.

I. INTRODUCTION

Since sign languages have distinctive linguistic patterns, they are largely employed as a means of communication by the deaf community. A loss of socialisation may occur when deaf-mute people are unable to communicate with members of the hearing community. In these circumstances, a caregiver must communicate on behalf of the deaf-mute person. Therefore, it is crucial to create a continuous sign language recognition system that does not require any apparatus on the hands, in order to give these two populations similar communication opportunities.

An extensive amount of study has been done to improve the understanding of hand sign language. However, there are still a number of difficulties, including real-time performance, hand occlusion, quick hand movements, and many others.

Several proposed models combine features in various ways. We expand on the 2D hand skeleton model, allowing different viewpoints to learn the features of the hand and leveraging the dynamics both as local spatio-temporal features and as long-term features from the LSTM model. However, those models include only the features relating to 2D views of the hand. Our model makes use of several complementary sources of data, including heatmap features, pixel-level appearance, and hand skeleton views. Utilizing 3DCNN and LSTM, we learn to take advantage of the preceding 2DCNN model along with the other feature representations. The model also follows a simple recognition procedure: SSD to concentrate on the region of interest, 2DCNN to extract spatially discriminative features, 3DCNN to extract local spatio-temporal dynamics from the various feature representations, and LSTM to recognise the final sign labels. Our model's output demonstrates consistent performance gains for the suggested methodology across all tested datasets.

In order to promote two-way communication between the hearing impaired and the general public, we propose a deep learning based architecture for effective continuous sign language recognition. We perform continuous sign identification over vast sign vocabularies using real-time input from a webcam on a phone, laptop, or other device. First, an SSD architecture is utilised to detect the user's hands in the input. Additionally, a 3D hand skeleton is created utilising the 3D keypoints to provide numerous views of the hand. Then, five 3DCNNs are fed with the hand crops, the RGB video, and the depth video. Each 3DCNN extracts complementary spatio-temporal information, such as features related to appearance, geometry under various camera angles, and features from hand joint prediction scores. For the final categorization of the hand, the outputs of the five 3DCNNs are concatenated, ensembled, and sent to an LSTM. Finally, we can use the expansion of 3D local spatio-temporal features for long-term sign modelling.
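As an illustration of this recognition procedure, the following minimal PyTorch sketch wires per-stream 3DCNNs into a feature concatenation and an LSTM classifier. It assumes hand regions were already cropped by an upstream SSD detector; the class and parameter names, backbone sizes, and the 226-class head are assumptions for exposition, not the exact configuration of our implementation.

```python
import torch
import torch.nn as nn

class SignPipelineSketch(nn.Module):
    """Illustrative only: five 3DCNN streams -> feature concatenation -> LSTM.
    Assumes hand regions were already cropped by an upstream SSD detector."""

    def __init__(self, num_streams=5, feat_dim=256, hidden=512, num_classes=226):
        super().__init__()
        # One small 3DCNN per modality stream (hand crops, RGB video, depth video, ...).
        self.streams = nn.ModuleList([
            nn.Sequential(
                nn.Conv3d(3, 32, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d((4, 1, 1)),   # keep 4 temporal steps
                nn.Flatten(start_dim=2),           # -> (N, 32, 4)
            )
            for _ in range(num_streams)
        ])
        self.proj = nn.Linear(num_streams * 32, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips):
        # clips: list of num_streams tensors, each shaped (N, 3, T, H, W).
        feats = [s(c).transpose(1, 2) for s, c in zip(self.streams, clips)]  # (N, 4, 32) each
        x = self.proj(torch.cat(feats, dim=2))   # concatenate the stream features
        out, _ = self.lstm(x)                    # long-term temporal modelling
        return self.head(out[:, -1])             # logits for the final sign label
```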
II. LITERATURE REVIEW

Sign language recognition (SLR) has achieved great progress and high recognition accuracy in recent years through practical deep learning architectures and improved computing power [1, 2, 3, 4, 5, 6, 7, 8, 9]. The remaining task of SLR is to capture all the body movement information and the local arm, hand, and facial expressions at the same time. [1] proposes a linear discriminant analysis (LDA) algorithm for hand gesture recognition that converts the recognized hand gestures into text and speech formats. Paper [2] proposes a hand gesture recognition method based on the YCbCr color space, COG, and template matching. In [3], the proposed system utilizes a Leap Motion Controller (LMC) as the sensor, and SVM and DNN are utilized for training. Study [4] creates a network that can effectively classify images of static signs into equivalent text using a CNN. [5] presents a glove system that recognizes numbers from sign language; the KNN algorithm is used as the classifier. Model [6] recognizes characters using SVM and FMG algorithms. In [7], an RNN is used to capture the long-term time dependence between the inputs, and a 2D-CNN is used to extract spatial features from the input. The system proposed in [8] converts cross-domain knowledge into message tokens to improve the accuracy of the WSLR model. The proposed system [9] presents a transformer-based learning approach for recognizing continuous sign language and translating it to text; this is achieved using connectionist temporal classification (CTC). These methods are not yet effective enough to capture complete motion information.



Multi-modal Approach – A multi-modal approach is an end-to-end framework which provides a convolutional architecture to exploit different features captured from an image, word composition, and the matching relation between the two modalities. Different modalities might contain different information related to the hand gesture, which can complement each other and provide a distinctive representation of the action.

Paper [4] proposes an effective method utilizing a super vector to fuse different multi-view representations together.

Skeleton Based Action Recognition – Skeleton based action recognition is the process of recognizing an action using the skeleton data obtained from the image. The skeleton data is the information related to the 2D or 3D coordinates of the human skeletal joints. It can also be used along with other modalities to achieve a multi-modal representation of the action. Recurrent neural networks are usually used to model skeleton data.

[10] proposes a graph-based approach that models the changing patterns of skeleton data using a graph convolutional network (GCN); this method is termed ST-GCN. Still, skeleton-based sign language recognition systems remain under-explored. [12] tried to extend ST-GCN to SLR but was unable to achieve higher accuracy, and used only 20 sign classes.
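For concreteness, a minimal sketch of one spatial graph-convolution step in the ST-GCN style follows; the symmetric adjacency normalization and the tensor layout are common conventions assumed here, not details taken from [10].

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """One spatial graph-convolution step in the ST-GCN style: joint features
    are mixed along skeleton edges through a normalized adjacency matrix."""

    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))                 # add self-loops
        d = A.sum(dim=1).pow(-0.5)
        self.register_buffer("A_hat", d[:, None] * A * d[None, :])   # D^-1/2 (A+I) D^-1/2
        self.lin = nn.Linear(in_ch, out_ch)

    def forward(self, x):
        # x: (batch, frames, joints, channels)
        x = torch.einsum("uv,ntvc->ntuc", self.A_hat, x)  # aggregate neighbouring joints
        return self.lin(x)                                # per-joint linear transform
```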
III. OUR APPROACH

SSTCN – Separable Spatial Temporal Convolution Network. This architecture proposes an SSTCN to further exploit whole-body skeleton features, which can significantly improve the accuracy on whole-body key points compared with traditional 3D convolution. Besides using key point coordinates generated from the whole-body pose network, an SSTCN model that perceives the sign language from whole-body features is also proposed. Features of 33 key points from 60 frames of each video are extracted as the input to our model: 1 landmark on the nose, 2 landmarks on the shoulders, 4 landmarks on the mouth, 2 landmarks on the wrists, 2 landmarks on the elbows, and 22 landmarks on the hands.

Instead of using 3D convolution, the input features are processed separably with 2D convolution layers, which reduces the parameters and makes the network easier to converge.
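The sketch below illustrates this separable idea under the assumption that a k x k x k 3D convolution is factorized into a spatial (1, k, k) step followed by a temporal (k, 1, 1) step; the exact SSTCN layer configuration may differ.

```python
import torch.nn as nn

class SeparableSTConv(nn.Module):
    """Factorizes a k x k x k 3D convolution into a spatial (1, k, k) step
    followed by a temporal (k, 1, 1) step, reducing parameters from roughly
    k^3 to k^2 + k per channel pair."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k), padding=(0, k // 2, k // 2))
        self.temporal = nn.Conv3d(out_ch, out_ch, (k, 1, 1), padding=(k // 2, 0, 0))
        self.act = nn.ReLU()

    def forward(self, x):
        # x: (N, C, T, H, W) stacked key point feature maps
        return self.act(self.temporal(self.act(self.spatial(x))))
```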
Pose – Word-level sign language recognition (WSLR) is the fundamental building block for interpreting sign language sentences. Signing a sign language word requires very subtle body movements, which makes WSLR a particularly challenging problem. The human skeletal motion plays a significant role in conveying which word the person is signing; hence a pose-based model is used to tackle the problem of WSLR. Human pose estimation involves localising key points of human joints from a single image or a video. A pretrained HRNet whole-body pose estimator provided by MMPose is used to estimate 133 whole-body keypoints from the RGB videos, from which the 27-node skeleton graph is constructed. The graph is divided into four streams (joint, bone, joint motion and bone motion). As data augmentations, random sampling, mirroring, rotating, scaling, jittering, and shifting are used.
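A minimal sketch of deriving the four streams from pose keypoints is shown below; the (T, V, C) layout, the parent-index convention for bones, and the use of frame differences for the motion streams are assumptions consistent with common skeleton pipelines.

```python
import numpy as np

def four_streams(joints, parents):
    """joints: (T, V, C) keypoint coordinates for T frames and V graph nodes;
    parents: length-V array giving each joint's parent (the root points to itself).
    Returns the joint, bone, joint-motion, and bone-motion streams."""
    bones = joints - joints[:, parents, :]                       # bone = joint minus its parent
    joint_motion = np.diff(joints, axis=0, prepend=joints[:1])   # frame-to-frame differences
    bone_motion = np.diff(bones, axis=0, prepend=bones[:1])
    return joints, bones, joint_motion, bone_motion
```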
RGB – To facilitate parallel loading and processing, all frames of the RGB videos are extracted and saved as pictures. Based on the key points derived from whole-body pose estimation, the RGB and optical flow frames are cropped and scaled to 256 x 256 pixels.

Optical Flow – The TV-L1 technique, implemented with OpenCV and CUDA, is used to obtain optical flow features. The output flow maps of the x and y directions are concatenated in the channel dimension.
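As a hedged example, TV-L1 flow for a pair of frames can be computed with the OpenCV contrib module roughly as follows; the frame file names and the clipping range used before quantization are placeholders.

```python
import cv2

# Requires the contrib build of OpenCV (opencv-contrib-python).
prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
flow = tvl1.calc(prev, curr, None)   # (H, W, 2): x and y flow maps in the channel dimension

# Common preprocessing before feeding a CNN: clip large motions, rescale to [0, 255].
flow = ((flow.clip(-20, 20) + 20) / 40.0 * 255.0).astype("uint8")
```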
HHA – HHA stands for horizontal disparity, height above the ground, and angle of the local surface normal with the gravity direction. The HHA representation encodes properties of geocentric pose that emphasize complementary discontinuities in the image (depth, surface normal, and height), because of which it works better than raw depth images for learning feature representations with convolutional neural networks.
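A deliberately simplified sketch of an HHA-style encoding is given below. It assumes a level camera with known intrinsics and height, whereas the full HHA algorithm estimates the gravity direction from the data, so treat this only as an illustration of the three channels.

```python
import numpy as np

def hha_encode(depth, fy=525.0, cy=240.0, cam_height=1.2):
    """Simplified HHA sketch: depth is in metres; returns an H x W x 3 uint8 image.
    Intrinsics and camera height are placeholder assumptions."""
    d = np.clip(depth, 0.3, 8.0)

    # Channel 1: horizontal disparity (inverse depth), rescaled to [0, 255].
    disp = 1.0 / d
    c1 = 255.0 * (disp - disp.min()) / (np.ptp(disp) + 1e-6)

    # Channel 2: height above ground. With y pointing down in camera coordinates,
    # world height is the camera height minus the back-projected y coordinate.
    v = np.arange(d.shape[0])[:, None]
    height = cam_height - (v - cy) * d / fy
    c2 = 255.0 * np.clip(height, 0.0, 2.55) / 2.55

    # Channel 3: angle between the surface normal (from depth gradients) and the
    # assumed gravity direction (0, 1, 0), mapped from [0, 180] degrees to [0, 255].
    dzdy, dzdx = np.gradient(d)
    n = np.dstack([-dzdx, -dzdy, np.ones_like(d)])
    n /= np.linalg.norm(n, axis=2, keepdims=True)
    c3 = 255.0 * np.degrees(np.arccos(np.clip(n[..., 1], -1.0, 1.0))) / 180.0

    return np.dstack([c1, c2, c3]).astype(np.uint8)
```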
Depth Flow – The HHA features are augmented the same way as the RGB frames. Besides, the exact procedure used for RGB is applied to extract optical flow from the depth modality (named depth flow). The depth flow is cleaner and captures different information compared with the RGB flow.

RGB Ensemble – Ensembling combines the predictions from multiple neural network models to reduce the variance of the predictions and reduce generalisation error. Here, the skeleton-based technique, which incorporates SL-GCN and SSTCN, outperforms the RGB + Flow and Depth ensemble models.




Fig 1:- Our Approach to Sign Language Recognition System Using a Multi-Modal Ensemble

The ensemble results of RGB All and RGB-D All demonstrate that the whole-body skeleton based approaches are able to collaborate with the other modalities and further improve the final recognition rate.

Multi-ensemble model – We use a simple ensemble method to ensemble all four modalities above. Specifically, we save the output of the last fully-connected layer of each modality before the softmax layer.
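A minimal sketch of this late fusion is shown below; the modality names, file layout, and per-modality weights are placeholders, and each saved array is assumed to hold the pre-softmax outputs described above.

```python
import numpy as np

# Each .npy file is assumed to hold saved pre-softmax outputs, shaped
# (num_clips, num_classes); names and weights are placeholders.
modalities = ["skeleton", "rgb", "flow", "depth"]
weights = {"skeleton": 1.0, "rgb": 0.9, "flow": 0.4, "depth": 0.4}

logits = sum(weights[m] * np.load(f"logits_{m}.npy") for m in modalities)
pred = logits.argmax(axis=1)   # final ensembled sign class per test clip
```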

IV. RESULT AND ANALYSIS

In this section, we present an evaluation of our proposed framework on the AUTSL dataset.

• AUTSL Dataset –
The AUTSL dataset is collected for general SLR tasks in Turkish sign language. A Kinect V2 sensor is utilized in the collection procedure. Specifically, 43 signers with 20 backgrounds are assigned to perform 226 different sign actions. In total, it contains 38,336 video clips, which are split into training, validation, and testing subsets. The statistical summary of the balanced dataset, which is used in the challenge, is listed in the table.

Table 1:- Results of single modalities on the AUTSL dataset.

Table 2:- Performance of our ensemble results evaluated on the AUTSL test set.
• Baseline Method –
Along with the AUTSL benchmark, several deep learning based models have been proposed. We treat the best model benchmarked in [12] as the baseline. Specifically, the model is mainly constructed using a CNN + LSTM structure, where a 2D-CNN model is used to extract features for each video frame and bidirectional LSTMs (BLSTM) are adopted on top of these 2D-CNN features to learn their temporal relations. A feature pooling module (FPM) [12] is plugged in after the 2D-CNN model to obtain a multi-scale representation of the features.

• Evaluation –
When training our models on the training set, we adopt an early-stopping technique based on the validation accuracy to obtain our best models. Then we test our best models on the test set and use the hyperparameters tuned on the validation set to obtain our ensemble prediction. To further improve performance, we finetune our best models on the union of the training and validation sets. We stop training when the training loss in the finetuning experiment is reduced to the same level that our best models reached in the training phase. Our predictions with and without finetuning are evaluated and reported in Table 2. Our proposed SAM-SLR approach surpasses the baseline methods significantly.
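The early-stopping rule can be sketched as follows; the patience value and the helpers (train_one_epoch, evaluate, save_checkpoint) are hypothetical stand-ins for the actual training loop, not code from our implementation.

```python
# Assumes model, train_loader, and val_loader already exist; the helper
# functions below are hypothetical, and patience is a placeholder value.
best_acc, patience, bad_epochs = 0.0, 10, 0
for epoch in range(100):
    train_one_epoch(model, train_loader)
    val_acc = evaluate(model, val_loader)          # validation accuracy
    if val_acc > best_acc:
        best_acc, bad_epochs = val_acc, 0
        save_checkpoint(model, "best.pt")          # keep the best model so far
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # stop once validation plateaus
            break
```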



V. CONCLUSION

In this paper, we propose a new deep learning-based pipeline architecture that efficiently realises real-time automated sign language recognition by combining SSDs, 2DCNNs, 3DCNNs, and LSTMs. The model feeds a new hand skeleton feature representation to the 3DCNN for richer features after projecting the skeleton onto three surfaces. To obtain characteristic features, we also applied pixel-level 3DCNN and heatmap features. The LSTM is given the concatenated output of all 3DCNNs with stacked inputs in order to recognise sign language completely. Additionally, a thorough study of the single-view and multi-view projections of the 2DCNN and 3DCNN models is provided.

REFERENCES

[1]. Himanshu Gupta, Aniruddh Ramjiwal, and Jasmin T. Jose, "Vision Based Approach to Sign Language Recognition", IEEE, 2018.
[2]. Mahesh Kumar N B, "Conversion of Sign Language into Text", Springer Link, 2018.
[3]. Teak-Wei Chong and Boon-Giin Lee, "American Sign Language Recognition Using Leap Motion Controller with Machine Learning Approach", MDPI, 2018.
[4]. Lean Karlo S. Tolentino, Ronnie O. Serfa Juan, August C. Thio-ac, Maria Abigail B. Pamahoy, Joni Rose R. Forteza, and Xavier Jet O. Garcia, "Static Sign Language Recognition Using Deep Learning", International Journal of Machine Learning and Computing, 2019.
[5]. Rim Barioul, Sameh Fakhfakh Ghribi, Houda Ben Jmaa Derbel, and Olfa Kanoun, "Four Sensors Bracelet for American Sign Language Recognition Based on Wrist Force Myography", IEEE Xplore, 2020.
[6]. Paul D. Rosero Montalvo, Pamela Gody Trujillo, Edison Flores Bosemedian, Jorge Carrascal Garcia, Santiago Otero Potosi, and Henry Benitez Pereira, "Sign Language Recognition Based on Intelligent Glove Using Machine Learning Techniques", IEEE, 2020.
[7]. Dongxu Li, Cristian Rodriguez Opazo, Xin Yu, and Hongdong Li, "Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison", Computer Vision Foundation, IEEE Xplore, 2020.
[8]. Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, and Hongdong Li, "Transferring Cross-domain Knowledge for Video Sign Language Recognition", IEEE Xplore, 2020.
[9]. Necati Cihan Camgöz, Oscar Koller, Simon Hadfield, and Richard Bowden, "Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation", IEEE Xplore, 2020.
[10]. Ozge Mercanoglu Sincan, Anil Osman Tur, and Hacer Yalim Keles, "Isolated Sign Language Recognition with Multi-scale Features Using LSTM", in 2019 27th Signal Processing and Communications Applications Conference (SIU), pp. 1–4, IEEE, 2019.
[11]. Songyao Jiang, Bin Sun, Lichen Wang, Yue Bai, Kunpeng Li, and Yun Fu, "Skeleton Aware Multi-modal Sign Language Recognition", IEEE Xplore, 2021.
[12]. Ozge Mercanoglu Sincan and Hacer Yalim Keles, "AUTSL: A Large Scale Multi-modal Turkish Sign Language Dataset and Baseline Methods", IEEE Access, 8:181340–181355, 2020.
[13]. John Bush Idoko, "Deep Learning Based Sign Language Translation System", KSII Transactions on Internet and Information Systems (TIIS), 2020.
[14]. M. E. Al-Ahdal and M. T. Nooritawati, "Review in Sign Language Recognition Systems", in 2012 IEEE Symposium on Computers & Informatics (ISCI), March 2012, pp. 52–57.
[15]. Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh, "OpenPose: Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields", CVPR, Las Vegas, United States, 2017, pp. 7291–7299.
[16]. A. Dadashzadeh, A. Tavakoli Targhi, M. Tahmasbi, and M. Mirmehdi, "HGR-Net: A Fusion Network for Hand Gesture Segmentation and Recognition", 2018.
[17]. A. Elboushaki, R. Hannane, K. Afdel, and L. Koutti, "MultiD-CNN: A Multi-dimensional Feature Learning Approach Based on Deep Convolutional Networks for Gesture Recognition in RGB-D Image Sequences", Expert Systems with Applications, 2020.
[18]. P. M. Ferreira, J. S. Cardoso, and A. Rebelo, "On the Role of Multimodal Learning in the Recognition of Sign Language", Multimedia Tools and Applications, 78(8):10035–10056, 2019.
[19]. Z. Cao, G. Hidalgo, T. Simon, S. E. Wei, and Y. Sheikh, "OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields", 2017.
[20]. A. Dadashzadeh, A. Tavakoli Targhi, M. Tahmasbi, and M. Mirmehdi, "HGR-Net: A Fusion Network for Hand Gesture Segmentation and Recognition", 2018.
[21]. C. Preetham, G. Ramakrishnan, S. Kumar, A. Tamse, and H. Krishnapura, "Hand Talk: Implementation of a Gesture Recognition Glove".
[22]. H. Guo, G. Wang, X. Chen, C. Zhang, F. Qiao, and H. Yang, "Region Ensemble Network: Improving Convolutional Network for Hand Pose Estimation", 2017.
[23]. J. Wu, L. Sun, and R. Jafari, "A Wearable System for Recognizing American Sign Language in Real-Time Using IMU and Surface EMG Sensors", IEEE, 2016.
[24]. M. J. Cheok, Z. Omar, and M. H. Jaward, "A Review of Hand Gestures and Sign Language Recognition Techniques".

