Zafar Ahmed Ansari
Indian Institute of Technology Jodhpur, Rajasthan, India.
Indian Sign Language Gesture Recognition using Kinect

Introduction

Human-computer interaction systems based on human gestures and emotions are not recent inventions; research in this area dates back to the 1980s.
The problems with older systems were the poor quality and high price of the hardware. Most gesture recognition systems relied on one or more RGB cameras, which made them sensitive to ambient lighting conditions and the skin color of the user.
Other systems, based on special sensors attached to the body, were very expensive and not available to everyone. The new generation of depth sensors, such as Microsoft Kinect and Asus Xtion Pro, allows the creation of high-quality, low-cost gesture recognition systems.
The Kinect maps the environment in front of it by producing a depth image. There are at present two major open-source drivers for the Kinect: libfreenect and OpenNI. OpenNI in particular provides a well-documented programming interface.
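The depth image from these drivers holds raw 11-bit sensor readings rather than metric distances. A minimal sketch in Python (the project itself works in Matlab and C) of one widely circulated community approximation for converting a raw reading to metres; the constants come from Stephane Magnenat's published fit and are an assumption here, not part of this project:

```python
import math

def raw_depth_to_metres(raw):
    """Convert an 11-bit Kinect raw depth reading to metres.

    Uses a community-derived approximation (Stephane Magnenat's fit).
    The value 2047 is the sensor's "no reading" sentinel and has no
    valid distance, so it is mapped to NaN.
    """
    if raw >= 2047:  # invalid / no-reading sentinel
        return float('nan')
    return 0.1236 * math.tan(raw / 2842.5 + 1.1863)
```

The exact calibration varies from unit to unit, so this is only good for a rough sanity check on the depth stream.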
Setting up the system
I am using Ubuntu 12.04.
1. C++
Initially C++ felt easier to work with. After some time I found that it was not built for the vectorised calculations that matrix operations usually require.
2. Matlab
I switched from C++ to Matlab because the power and speed it offers for matrix computations were outstanding for my objective.
3. Libfreenect
I am using Matlab's MEX environment to run libfreenect in C as the driver for image acquisition from the Kinect. This driver drops very few frames, and I use it mostly for depth image acquisition.
4. Java/Processing
A lot of Kinect APIs are being built using Processing and Java, so it would be nice to release my code in one of these languages too.
Dataset collection
From the sign dictionaries compiled by FDMSE, Ramakrishna Mission Vivekananda University, Coimbatore (Weblink), I selected a subset of 140 signs. These signs are drawn from an eclectic mix of daily-use, technical, and banking words, and the subset consists mainly of static signs. Signs for the English alphabet are also included for the purpose of fingerspelling. I have collected 5041 depth images and 5041 RGB images from the Kinect sensor, with 18 volunteers each contributing 280 images of each type.
Denoising
The depth image is noisy in both the spatial and time domains. In the time domain the noise manifests as white spots continuously popping in and out of the picture. Some of the noise comes from the IR light being scattered by the object it hits; some comes from shadows cast by objects closer to the Kinect.
In the image the noise appears as jagged edges and pixels with the value 2047 (white spots), the value the sensor outputs where it could not obtain a depth reading.
To get rid of the 2047 values, a two-fold method is used to smooth the depth data: median filtering, followed by mode filtering with a fixed window. Median filtering removes the bigger blobs of noisy white pixels; mode filtering then gets rid of the remaining smaller ones.
It is important to get rid of most 2047 pixel values as they interfere in hand segmentation.
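The two-fold smoothing above can be sketched as follows. This is an illustration in Python/NumPy (the project itself uses Matlab); the window sizes are illustrative parameters, not the settings actually used:

```python
import numpy as np
from scipy.ndimage import median_filter

INVALID = 2047  # Kinect's "no reading" sentinel value

def denoise_depth(depth, median_size=5, mode_size=7):
    """Two-stage smoothing of a Kinect depth frame.

    Stage 1: a median filter knocks out the larger blobs of
    2047-valued pixels. Stage 2: a fixed-window mode filter fills
    in whatever small speckles remain, using the most common valid
    value among each bad pixel's neighbours.
    """
    depth = depth.astype(np.int32)

    # Stage 1: median filter removes the bigger white blobs.
    out = median_filter(depth, size=median_size)

    # Stage 2: mode filter applied only to remaining invalid pixels.
    half = mode_size // 2
    padded = np.pad(out, half, mode='edge')
    ys, xs = np.where(out == INVALID)
    for y, x in zip(ys, xs):
        window = padded[y:y + mode_size, x:x + mode_size].ravel()
        valid = window[window != INVALID]
        if valid.size:
            vals, counts = np.unique(valid, return_counts=True)
            out[y, x] = vals[np.argmax(counts)]  # most common neighbour
    return out
```

Running the mode filter only on pixels still marked 2047 after the median pass keeps the second stage cheap, since by then only small speckles survive.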

I have now segmented the hands out of the image, and I am investigating a number of features that could give good classification results.
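The post does not describe the segmentation method used, but a common approach on depth frames is to assume the signing hands are the objects closest to the camera and keep every pixel within a band of the nearest valid reading. A minimal Python sketch under that assumption; the function name and the `band` parameter are illustrative:

```python
import numpy as np

INVALID = 2047  # Kinect "no reading" value

def segment_hands(depth, band=150):
    """Crude depth-threshold hand segmentation (a sketch, not the
    author's method).

    Assumes the hands are the closest objects to the camera: keep
    every pixel within `band` raw-depth units of the nearest valid
    reading, and discard invalid (2047) pixels outright.
    """
    valid = depth != INVALID
    nearest = depth[valid].min()          # closest valid reading
    mask = valid & (depth <= nearest + band)
    return mask
```

This works only when nothing else (e.g. the torso) enters the chosen depth band, which is why denoising the 2047 values first matters: a stray invalid pixel surviving as a small value would throw off the nearest-reading estimate.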