Modeling Word Importance in Conversational Transcripts: Toward improved live captioning for Deaf and hard of hearing viewers

Abstract

Despite recent improvements in automatic speech recognition (ASR) systems, their accuracy remains imperfect in live conversational settings. Classifying the importance of each word in a caption transcription can enable evaluation metrics that better reflect Deaf and Hard of Hearing (DHH) readers’ judgments of caption quality. Prior work has proposed using word embeddings, e.g., word2vec or BERT embeddings, to model word importance in conversational transcripts. Recent work also released a human-annotated word importance dataset. We conducted a word-token-level analysis of this dataset and explored its Part-of-Speech (POS) distribution. We then augmented the dataset with POS tags and reduced the class imbalance by generating 5% additional text using masking. Finally, we investigated how various supervised models learn the importance of words. The best-performing model trained on our augmented dataset outperformed prior models. Our findings can inform the design of a metric for measuring live caption quality from DHH users’ perspectives.
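The masking-based augmentation step can be illustrated with a minimal sketch. This is a simplified, hypothetical helper (the paper's actual pipeline conditions on POS tags and annotated importance labels, which are not reproduced here): it replaces a small fraction of tokens with a `[MASK]` placeholder, from which new training text could be generated.

```python
# Minimal sketch of masking-based text augmentation.
# Hypothetical helper; the paper's pipeline additionally uses POS tags
# and human importance annotations when choosing what to mask.
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.05, seed=0):
    """Replace roughly mask_rate of the tokens with [MASK]."""
    rng = random.Random(seed)  # seeded for reproducible augmentation
    n = max(1, round(len(tokens) * mask_rate))
    positions = set(rng.sample(range(len(tokens)), n))
    return [MASK if i in positions else t for i, t in enumerate(tokens)]

sentence = "the meeting starts at nine tomorrow morning in the main hall".split()
print(" ".join(mask_tokens(sentence, mask_rate=0.1, seed=42)))
```

In practice, the masked positions would be filled by a masked language model (e.g., BERT's fill-mask objective) to produce additional sentences for the underrepresented importance classes.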

Publication
In Proceedings of the 20th International Web for All Conference (W4A ’23)
Akhter Al Amin
Software Engineer at Amazon
Saad Hassan
Assistant Professor
Matt Huenerfauth
Professor and Dean at RIT
Cecilia O. Alm
Professor at RIT