In human-human interaction, communication is regulated by audiovisual feedback provided by the involved parties. There are several channels through which this feedback can be provided, the most common being speech. However, spoken words are highly person- and context-dependent, which makes speech recognition and the extraction of semantic information about the underlying intent a very challenging task for machines. Other channels that provide useful feedback in human-human interaction include facial expressions, head and hand gestures, and non-linguistic vocalizations, e.g., laughter, yawns, cries and sighs. Surprisingly, our knowledge about non-linguistic vocalizations is still incomplete and little empirical information is available.
Non-linguistic vocalizations (or nonverbal vocalizations) are defined as very brief, discrete, nonverbal expressions of affect in both face and voice. People are very good at recognizing emotions just by hearing such vocalizations, which suggests that information related to human emotions is conveyed by these vocalizations. For example, laughter is a very good indicator of amusement and crying is a very good indicator of sadness.
The aim of our work is to recognize non-linguistic vocalizations in social interaction or in human-machine interaction. Unlike most previous work, which considers only the audio signal, we combine audio and visual information. A vocal expression is usually accompanied by a facial expression, which provides complementary information and can improve the performance of the recognition system.
We have mainly focused on laughter, which is the most frequent and one of the most useful vocalizations in social interaction. It is usually perceived as positive feedback, i.e., it signals joy, acceptance or agreement, but it can also convey negative feedback, e.g., irony. There is also evidence that laughter has a strong genetic basis, since babies have the ability to laugh before they can speak, and children who were born both deaf and blind still have the ability to laugh.
The system we have developed so far is capable of discriminating laughter from speech based on shape and cepstral features (MFCCs). The key idea is that the correlation between audio and visual features, and its evolution over time, differ between speech and laughter. The spatial relationship between audio and visual features and the temporal relationship between past and future values of the audio and visual features are explicitly modelled using predictive models. Classification is performed based on the model that best describes these relationships, i.e., produces the lowest prediction error.
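The sketch below illustrates the general idea of prediction-error-based classification, not our exact implementation: one linear predictor per class maps visual (shape) features to audio (MFCC) features, and a test segment is assigned to the class whose predictor yields the lowest prediction error. Feature extraction is assumed to have been done already, and the feature matrices and function names are hypothetical.

```python
import numpy as np

def fit_predictor(visual, audio):
    """Least-squares linear map from visual to audio features.
    visual: (n_frames, n_visual), audio: (n_frames, n_audio)."""
    X = np.hstack([visual, np.ones((visual.shape[0], 1))])  # append bias term
    W, *_ = np.linalg.lstsq(X, audio, rcond=None)
    return W

def prediction_error(W, visual, audio):
    """Mean squared error of audio features predicted from visual features."""
    X = np.hstack([visual, np.ones((visual.shape[0], 1))])
    return np.mean((X @ W - audio) ** 2)

def classify(segment_visual, segment_audio, models):
    """Assign the segment to the class whose predictor best explains it."""
    errors = {label: prediction_error(W, segment_visual, segment_audio)
              for label, W in models.items()}
    return min(errors, key=errors.get)

# Hypothetical usage: one predictor per class, trained on labelled segments.
# models = {"laughter": fit_predictor(V_laugh, A_laugh),
#           "speech":   fit_predictor(V_speech, A_speech)}
# label = classify(V_test, A_test, models)
```

A temporal variant of the same idea would additionally predict future feature values from past ones within each class-specific model; the decision rule (lowest prediction error wins) stays the same.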
We have also investigated the performance of different types of features [2,3,4] and classifiers. Finally, we have studied how different types of laughter (voiced/unvoiced) correlate with the perceived hilarity of multimedia content.
F. Eyben, S. Petridis, B. Schuller, G. Tzimiropoulos, S. Zafeiriou. Proceedings of IEEE Int'l Conf. Acoustics, Speech and Signal Processing (ICASSP'11), Prague, Czech Republic, pp. 5844-5847, May 2011.
B. Reuderink, M. Poel, K. Truong, R. Poppe, M. Pantic. Proceedings of Joint Int'l Workshop on Machine Learning and Multimodal Interaction (MLMI'08), Utrecht, Netherlands, vol. 5237, pp. 137-148, September 2008.