Sindhu B. Hegde

Hi! I am a fourth-year PhD student in the Visual Geometry Group (VGG) at the University of Oxford, supervised by Prof. Andrew Zisserman. My research is in Computer Vision, particularly in understanding non-verbal communication (including co-speech gestures and lip-reading), video understanding, and self-supervised learning. I also work as an AI Scientist at Rode Microphones, focusing on multimodal LLM-based research.

Prior to joining Oxford, I worked as a Lead Data Scientist at Verisk Analytics. Before that, I pursued a Master’s by Research (MS) at the Centre for Visual Information Technology (CVIT), IIIT Hyderabad, supervised by Prof. C V Jawahar (IIIT-H) and Prof. Vinay Namboodiri (University of Bath, UK). My Master’s research focused on exploiting the redundancies in vision and speech modalities for cross-modal generation. Earlier, I completed my undergraduate studies at KLE Technological University, advised by Prof. Shankar Gangisetty and Prof. Uma Mudenagudi.

Research interests: Computer Vision, Machine Learning, Deep Learning, Video Understanding, Multi-modal Learning: Vision + Speech/Language

News [Archive]

Nov 2025	I am honoured to have been awarded the 2025 Google PhD Fellowship in Machine Perception.
Jul 2025	JEGAL has been accepted to ICCV 2025 (ORAL). See you in Hawaii 🏝️⛱️ 🌊
Apr 2025	Our paper on Understanding Co-speech Gestures in-the-wild is up on arXiv. Links: Project page, Dataset
Jan 2025	Our paper on Scaling Multilingual Visual Speech Recognition has been accepted to ICASSP 2025 (ORAL). Links: Project page, Dataset
Sep 2023	Our paper on GestSync: Determining who is speaking without a talking head has been accepted to BMVC 2023 (ORAL). Links: Project page, Demo
Jul 2023	Participated in the International Computer Vision Summer School (ICVSS) in Sicily, Italy. It was an incredible experience to learn from some of the most distinguished computer vision experts!

Talks

Oct 2025	Invited talk on “Understanding Co-speech Gestures in Videos” at the Berkeley AI Research Lab (BAIR), University of California, Berkeley. Hosted by Prof. Alyosha Efros.

Recent papers [Full list]

ICCV

Understanding Co-speech Gestures in-the-wild

Hegde, Sindhu, Prajwal, KR, Kwon, Taein, and Zisserman, Andrew

International Conference on Computer Vision (ICCV) 2025

Co-speech gestures play a vital role in non-verbal communication. In this paper, we introduce a new framework for co-speech gesture understanding in the wild. Specifically, we propose three new tasks and benchmarks to evaluate a model’s capability to comprehend gesture-speech-text associations: (i) gesture-based retrieval, (ii) gesture word spotting, and (iii) active speaker detection using gestures. We present a new approach that learns a tri-modal video-gesture-speech-text representation to solve these tasks. By leveraging a combination of global phrase contrastive loss and local gesture-word coupling loss, we demonstrate that a strong gesture representation can be learned in a weakly supervised manner from videos in the wild. Our learned representations outperform previous methods, including large vision-language models (VLMs). Further analysis reveals that speech and text modalities capture distinct gesture-related signals, underscoring the advantages of learning a shared tri-modal embedding space. The dataset, model, and code are available at: https://www.robots.ox.ac.uk/~vgg/research/jegal.
ICASSP

Scaling Multilingual Visual Speech Recognition

Prajwal, KR, Hegde, Sindhu, and Zisserman, Andrew

In International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2025

Visual Speech Recognition (lip-reading) has witnessed tremendous improvements, reaching word error rates (WER) as low as 12.8% in English. However, the performance in other languages is lagging far behind, due to the lack of labeled multilingual video data. In this work, we reduce the performance gap with the help of three key advances: (i) introducing the largest multilingual lip-reading dataset to date, (ii) proposing a single multi-task architecture that can perform two tasks simultaneously: identify the language and transcribe the utterance, and (iii) jointly training this architecture on all the languages together, resulting in large WER improvements as opposed to training monolingual models separately. We achieve state-of-the-art performance in both visual language identification and multilingual lip-reading tasks. Moreover, our pipeline uses zero manual annotations, as all the training transcriptions are obtained using a pre-trained ASR model. We also show that our multilingual model can be readily fine-tuned for new low-resource languages on which models trained from scratch do not converge. Our data, code, and models are available at: www.robots.ox.ac.uk/~vgg/research/multivsr.
BMVC

GestSync: Determining who is speaking without a talking head

Hegde, Sindhu, and Zisserman, Andrew

In British Machine Vision Conference (BMVC) 2023

In this paper, we introduce a new synchronisation task, Gesture-Sync: determining whether a person’s gestures are correlated with their speech. In comparison to Lip-Sync, Gesture-Sync is far more challenging, as the relationship between voice and body movement is much looser than that between voice and lip motion. We introduce a dual-encoder model for this task, and compare a number of input representations including RGB frames, keypoint images, and keypoint vectors, assessing their performance and advantages. We show that the model can be trained using self-supervised learning alone, and evaluate its performance on the LRS3 dataset. Finally, we demonstrate applications of Gesture-Sync for audio-visual synchronisation, and in determining who is the speaker in a crowd, without seeing their faces.