Pramit Saha
MASc Student, Graduate Research Assistant
Electrical & Computer Engineering
Faculty of Applied Science
Research Themes: The Communicating Mind and Body
Pramit is a MITACS Globalink Graduate Fellow, working as a Graduate Research Assistant under Prof. Sidney Fels in the HCT Lab. He is also the recipient of the Graduate Support Initiative (GSI) Award (2018–2019).
Pramit completed his Bachelor's in Electrical Engineering at Jadavpur University in 2016 and worked as a MITACS Globalink Research Intern in the Faculty of Medicine and Dentistry, University of Alberta, on biomedical image segmentation, before joining the brain-to-speech research team in the Human Communication Technologies Lab as a Master's student (September 2017 – present). He works primarily on finding alternative vocal communication pathways for people with speaking disabilities by bridging deep learning strategies and human-computer interfaces, mapping active thoughts or gestures to the acoustic space for artificial voice synthesis. He is also an executive member of the Electrical and Computer Engineering Graduate Student Association (ECEGSA) and has served as its Treasurer for two consecutive terms (2018-19 and 2019-20). His personal interests include studying Vedic literature in search of answers to the big questions about life, spirituality, science, meditation, culture, the universe and everything else.
Research interests: Pramit’s research interests center on deep learning, brain-computer interfaces, human-computer interaction and control, computer vision, signal and image processing, silent speech interfaces, and biomechanical modeling.
Research projects:
Two of the main research areas he is involved in are:
1. Recognition of imagined speech from EEG
Speech-related brain-computer interface (BCI) technologies provide effective vocal communication strategies for controlling devices through speech commands interpreted from brain signals. In order to infer imagined speech from active thoughts, we implement a novel hierarchical deep learning BCI system for subject-independent classification of 11 speech tokens, including phonemes and words. We encode the complex representation of high-dimensional electroencephalography (EEG) data via a channel cross-covariance matrix that captures the joint variability of the EEG electrodes. We then extract the hidden spatio-temporal information embedded within the matrix using a hierarchical deep neural network architecture composed of a regular convolutional neural network (CNN) and a temporal CNN running in parallel, with their outputs cascaded to a deep autoencoder. Our approach exploits predicted articulatory information for six phonological categories (e.g., ±nasal, ±bilabial) as an intermediate step for classifying the phonemes and words, thereby identifying the discriminative signals responsible for natural speech synthesis. Finally, we concatenate the latent vectors from the bottleneck layers of the deep autoencoders corresponding to the phonological categories and pass the result through another hierarchical architecture. Our results on the KARA database show that the best model achieves an average accuracy of 83.42% across the six binary phonological classification tasks and 53.36% on the individual token identification task, significantly outperforming our baselines. Our work arguably shows that there is a brain-imagery footprint of the underlying articulatory movements associated with different sounds, which can aid in decoding speech tokens.
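The sketch below illustrates this pipeline under simplifying assumptions: a channel cross-covariance matrix is computed per trial, a spatial CNN over that matrix and a temporal CNN over the raw channels run in parallel, and their cascaded features pass through an autoencoder whose bottleneck feeds a binary classifier for one phonological category. EEG dimensions, layer sizes, and pooling choices are illustrative, not the exact published architecture.

```python
# Minimal PyTorch sketch of the hierarchical pipeline described above.
# Shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

def channel_cross_covariance(eeg):
    """eeg: (batch, channels, time) -> (batch, channels, channels)."""
    centered = eeg - eeg.mean(dim=2, keepdim=True)
    return centered @ centered.transpose(1, 2) / (eeg.shape[2] - 1)

class ParallelCNNAutoencoder(nn.Module):
    """Spatial CNN over the covariance matrix and temporal CNN over the raw
    channels run in parallel; their features cascade into a deep autoencoder
    whose bottleneck feeds a binary phonological classifier."""
    def __init__(self, n_channels=62, bottleneck=64):
        super().__init__()
        self.spatial = nn.Sequential(               # operates on the C x C matrix
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten())  # -> 8*8*8 = 512 features
        self.temporal = nn.Sequential(              # 1-D convolutions over time
            nn.Conv1d(n_channels, 16, 7, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8), nn.Flatten())  # -> 16*8 = 128 features
        self.encoder = nn.Sequential(nn.Linear(512 + 128, 256), nn.ReLU(),
                                     nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(nn.Linear(bottleneck, 256), nn.ReLU(),
                                     nn.Linear(256, 512 + 128))
        self.classifier = nn.Linear(bottleneck, 1)  # one +/- phonological category

    def forward(self, eeg):
        cov = channel_cross_covariance(eeg).unsqueeze(1)          # (B, 1, C, C)
        feats = torch.cat([self.spatial(cov), self.temporal(eeg)], dim=1)
        latent = self.encoder(feats)
        return self.classifier(latent), self.decoder(latent), latent

model = ParallelCNNAutoencoder()
logits, recon, latent = model(torch.randn(4, 62, 500))  # 4 dummy EEG trials
```

One such model is trained per phonological category; the bottleneck vectors (`latent`) from the six models can then be concatenated and fed to the second-stage hierarchical classifier for token identification.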
2. Development of gesture/hand-controlled sound synthesis devices
(a) Gesture-controlled formant-based sound synthesizers:
Individuals with speaking disabilities often use Text-To-Speech (TTS) synthesizers for communication. However, users of TTS synthesizers often produce monotonous speech, and the use of such synthesizers often renders lively communication difficult. As a result, hand gestures have been used successfully to generate speech. Fels and Hinton designed Glove-Talk II, which translates hand gestures to spoken English via an adaptive interface. The system allows users to generate an unlimited English vocabulary by controlling ten parameters of the speech synthesizer. Each parameter maps to a different hand gesture or location, allowing the user’s hands to act as an artificial vocal tract. Another hand-gesture-to-sound mapping system, developed by Kunikoshi et al., maps a set of five hand gestures to the five Japanese vowels with smooth transitions. However, both systems have limitations. The first requires extensive hand movements and is less intuitive for an interested layman. The second is designed to synthesize the speech sounds of only one language and uses distinct hand gestures to represent individual speech sounds, so a continuous change between two speech sounds cannot be intuitively represented by a continuous hand movement. The goal of this project is to develop a synthesizer for which hand movement can be used to control and produce a continuous vowel space more easily and intuitively.
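As a rough illustration of the intended mapping, the sketch below maps a normalized 2-D hand position onto the first two formants and drives a simple source-filter vowel synthesizer. The formant ranges, bandwidths, and Klatt-style resonators are assumptions for illustration, not the project's actual synthesizer.

```python
# Hand position -> formants -> vowel-like sound (illustrative sketch only).
import numpy as np
from scipy.signal import lfilter

FS = 16000  # sample rate (Hz)

def hand_to_formants(x, y):
    """Map normalized hand coordinates in [0, 1] to (F1, F2) in Hz."""
    f1 = 250 + x * (850 - 250)    # assumed F1 range (vowel height)
    f2 = 600 + y * (2500 - 600)   # assumed F2 range (vowel backness)
    return f1, f2

def resonator(signal, freq, bandwidth, fs=FS):
    """Second-order (Klatt-style) formant resonator with unity DC gain."""
    r = np.exp(-np.pi * bandwidth / fs)
    theta = 2 * np.pi * freq / fs
    b1, b2 = 2 * r * np.cos(theta), -r ** 2
    a0 = 1 - b1 - b2
    return lfilter([a0], [1, -b1, -b2], signal)

def synthesize_vowel(x, y, f0=120, duration=0.3):
    """Excite two formant resonators with a crude glottal pulse train."""
    n = int(FS * duration)
    source = np.zeros(n)
    source[::int(FS / f0)] = 1.0
    f1, f2 = hand_to_formants(x, y)
    out = resonator(resonator(source, f1, 80), f2, 120)
    return out / np.max(np.abs(out))

audio = synthesize_vowel(0.2, 0.8)  # one point in the continuous vowel space
```

Because the hand-to-formant mapping is continuous, moving the hand smoothly traces a smooth trajectory through the vowel space instead of jumping between discrete gestures.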
(b) Dual-handed articulatory sound synthesis:
This project centers on the development of an interface involving four degrees-of-freedom (DOF) mechanical control of a two-dimensional, mid-sagittal tongue through a biomechanical toolkit called ArtiSynth and a sound synthesis engine called JASS, towards articulatory sound synthesis. As a demonstration of the project, the user will learn to produce a range of JASS vocal sounds by varying the shape and position of the ArtiSynth tongue in 2D space through a set of four force-based sensors. In other words, the user will be able to physically play with these four sensors, thereby virtually controlling the magnitudes of four selected muscle excitations of the tongue to vary the articulatory structure. This variation is computed in terms of area functions in the ArtiSynth environment and communicated to the JASS-based audio synthesizer, coupled with a two-mass glottal excitation model, to complete the end-to-end gesture-to-sound mapping.
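A minimal sketch of the sensor-to-excitation step is shown below, assuming four raw force readings in a generic 10-bit range. The muscle names are hypothetical placeholders, and the ArtiSynth area-function computation and the JASS/two-mass coupling are only indicated in comments rather than reproduced.

```python
# Map four force-sensor readings to tongue-muscle excitation levels in [0, 1].
# Muscle names and the sensor range are illustrative assumptions.
import numpy as np

MUSCLES = ["GGP", "GGA", "STY", "HG"]   # hypothetical choice of four muscles
SENSOR_MIN, SENSOR_MAX = 0.0, 1023.0    # assumed 10-bit force-sensor range

def sensors_to_excitations(raw):
    """Clip and rescale four raw force readings to muscle excitations."""
    raw = np.clip(np.asarray(raw, dtype=float), SENSOR_MIN, SENSOR_MAX)
    levels = (raw - SENSOR_MIN) / (SENSOR_MAX - SENSOR_MIN)
    return dict(zip(MUSCLES, levels))

# Example: pressing mostly on the first and third sensors.
excitations = sensors_to_excitations([900, 50, 600, 10])
# approx. {'GGP': 0.88, 'GGA': 0.05, 'STY': 0.59, 'HG': 0.01}
# These excitation levels would be streamed to the ArtiSynth tongue model,
# which updates the vocal-tract area function consumed by the JASS
# synthesizer and its two-mass glottal source.
```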