Non-linguistic vocalisation recognition

In human-human interaction, communication is regulated by audiovisual feedback provided by the involved parties. This feedback can be provided through several channels, the most common being speech. However, spoken words are highly person- and context-dependent, which makes speech recognition and the extraction of semantic information about the underlying intent very challenging tasks for machines. Other channels that provide useful feedback in human-human interaction include facial expressions, head and hand gestures, and non-linguistic vocalisations such as laughter, yawns, cries and sighs. Surprisingly, our knowledge of non-linguistic vocalisations is still incomplete and little empirical information is available.

Non-linguistic vocalisations (or nonverbal vocalisations) are defined as very brief, discrete, nonverbal expressions of affect in both face and voice. People are very good at recognising emotions just by hearing such vocalisations, which suggests that these vocalisations convey information about human emotions. For example, laughter is a strong indicator of amusement and crying is a strong indicator of sadness.

The aim of our work is to recognise non-linguistic vocalisations in social interaction or in human-machine interaction. Unlike most previous work, which considers only audio signals, we combine audio and visual information. A vocal expression is usually accompanied by a facial expression, which provides complementary information and can improve the performance of the recognition system.

We have mainly focused on laughter, which is the most frequent and one of the most useful vocalisations in social interaction. It is usually perceived as positive feedback, signalling joy, acceptance or agreement, but it can also serve as negative feedback, e.g., irony. There is also evidence that laughter has a strong genetic basis, since babies are able to laugh before they can speak and children born both deaf and blind still have the ability to laugh.

The system we have developed so far [1] is capable of discriminating laughter from speech based on shape and cepstral (MFCC) features. The key idea is that the correlation between audio and visual features, and its evolution over time, differs between speech and laughter. The spatial relationship between audio and visual features, and the temporal relationship between past and future values of the audio and visual features, are explicitly modelled using predictive models. Classification is performed by selecting the model that best describes this relationship, i.e., the one that produces the lowest prediction error.
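
To make the idea concrete, below is a minimal sketch of prediction-based classification in Python with NumPy. It assumes one least-squares linear predictor per class, mapping audio features to visual features frame by frame; the class labels, feature dimensions and data are purely illustrative, and the actual system in [1] uses richer predictive models that also capture temporal dependencies.

    import numpy as np

    class PredictionBasedClassifier:
        """One linear audio-to-visual predictor per class; a test sequence
        is assigned to the class whose predictor gives the lowest error."""

        def __init__(self):
            self.weights = {}  # class label -> least-squares weight matrix

        def fit(self, data):
            # data: {label: (A, V)}, A is frames x d_audio, V is frames x d_visual
            for label, (A, V) in data.items():
                A1 = np.hstack([A, np.ones((A.shape[0], 1))])  # add bias column
                # Least-squares fit of V ~ A1 @ W for this class
                self.weights[label], *_ = np.linalg.lstsq(A1, V, rcond=None)

        def predict(self, A, V):
            A1 = np.hstack([A, np.ones((A.shape[0], 1))])
            errors = {label: np.mean((V - A1 @ W) ** 2)
                      for label, W in self.weights.items()}
            return min(errors, key=errors.get)  # lowest prediction error wins

    # Toy usage with random stand-ins for MFCC (12-D) and shape (4-D) features
    rng = np.random.default_rng(0)
    train = {"laughter": (rng.normal(size=(200, 12)), rng.normal(size=(200, 4))),
             "speech": (rng.normal(size=(200, 12)), rng.normal(size=(200, 4)))}
    clf = PredictionBasedClassifier()
    clf.fit(train)
    print(clf.predict(rng.normal(size=(50, 12)), rng.normal(size=(50, 4))))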

We have also investigated the performance of different types of features [2,3,4] and classifiers [5]. Finally, we have studied how different types of laughter (voiced/unvoiced) correlate with the perceived hilarity of multimedia content [6].
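
As a concrete illustration of the audio features involved, the sketch below extracts MFCCs and their temporal derivatives from a WAV file using the librosa library. The file name, sampling rate and number of coefficients are illustrative assumptions; the systems in [2,3,4] may use different configurations and toolkits.

    import numpy as np
    import librosa

    # Load the recording as 16 kHz mono; "clip.wav" is a placeholder file name
    audio, sr = librosa.load("clip.wav", sr=16000)

    # 12 MFCCs per frame, a common choice for paralinguistic analysis
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=12)

    # Stack first-order temporal derivatives (delta coefficients) on top
    features = np.vstack([mfcc, librosa.feature.delta(mfcc)])
    print(features.shape)  # (24, number_of_frames)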

Related Publications

  1. Prediction-based Audiovisual Fusion for Classification of Non-linguistic Vocalisations

    S. Petridis, M. Pantic. IEEE Transactions on Affective Computing, accepted for publication, December 2015.

    Bibtex reference
    @article{predBasedAVfusion,
        author = {S. Petridis and M. Pantic},
        journal = {IEEE Transactions on Affective Computing},
        note = {Accepted for publication},
        month = {December},
        title = {Prediction-based Audiovisual Fusion for Classification of Non-linguistic Vocalisations},
        year = {2015},
    }
    Endnote reference
    %0 Journal Article
    %T Prediction-based Audiovisual Fusion for Classification of Non-linguistic Vocalisations
    %A Petridis, S.
    %A Pantic, M.
    %J IEEE Transactions on Affective Computing
    %Z Accepted for publication
    %D 2015
    %8 December
    %F predBasedAVfusion

  2. Audiovisual Vocal Outburst Classification In Noisy Acoustic Conditions

    F. Eyben, S. Petridis, B. Schuller, M. Pantic. Proceedings of IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP’12). Kyoto, Japan, pp. 5097 - 5100, March 2012.

    Bibtex reference
    @inproceedings{eybenPetridis_ICASSP2012,
        author = {F. Eyben and S. Petridis and B. Schuller and M. Pantic},
        pages = {5097--5100},
        address = {Kyoto, Japan},
        booktitle = {Proceedings of IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP’12)},
        month = {March},
        title = {Audiovisual Vocal Outburst Classification In Noisy Acoustic Conditions},
        year = {2012},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Audiovisual Vocal Outburst Classification In Noisy Acoustic Conditions
    %A Eyben, F.
    %A Petridis, S.
    %A Schuller, B.
    %A Pantic, M.
    %B Proceedings of IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP’12)
    %D 2012
    %8 March
    %C Kyoto, Japan
    %F eybenPetridis_ICASSP2012
    %P 5097-5100

  3. Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks

    F. Eyben, S. Petridis, B. Schuller, G. Tzimiropoulos, S. Zafeiriou. Proceedings of IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP’11). Prague, Czech Republic, pp. 5844 - 5847, May 2011.

    Bibtex reference
    @inproceedings{EybenEtAlICASSP2011,
        author = {F. Eyben and S. Petridis and B. Schuller and G. Tzimiropoulos and S. Zafeiriou},
        pages = {5844--5847},
        address = {Prague, Czech Republic},
        booktitle = {Proceedings of IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP’11)},
        month = {May},
        title = {Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks},
        year = {2011},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Audiovisual classification of vocal outbursts in human conversation using long-short-term memory networks
    %A Eyben, F.
    %A Petridis, S.
    %A Schuller, B.
    %A Tzimiropoulos, G.
    %A Zafeiriou, S.
    %B Proceedings of IEEE Int’l Conf. Acoustics, Speech and Signal Processing (ICASSP’11)
    %D 2011
    %8 May
    %C Prague, Czech Republic
    %F EybenEtAlICASSP2011
    %P 5844-5847

  4. Prediction-Based Classification For Audiovisual Discrimination Between Laughter And Speech

    S. Petridis, M. Pantic, J. Cohn. Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (FG'11). Santa Barbara, CA, USA, pp. 619 - 626, March 2011.

    Bibtex reference
    @inproceedings{PetridisEtAlFG2011,
        author = {S. Petridis and M. Pantic and J. Cohn},
        pages = {619--626},
        address = {Santa Barbara, CA, USA},
        booktitle = {Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (FG'11)},
        month = {March},
        title = {Prediction-Based Classification For Audiovisual Discrimination Between Laughter And Speech},
        year = {2011},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Prediction-Based Classification For Audiovisual Discrimination Between Laughter And Speech
    %A Petridis, S.
    %A Pantic, M.
    %A Cohn, J.
    %B Proceedings of IEEE International Conference on Automatic Face and Gesture Recognition (FG'11)
    %D 2011
    %8 March
    %C Santa Barbara, CA, USA
    %F PetridisEtAlFG2011
    %P 619-626

  5. Audiovisual Discrimination Between Speech and Laughter: Why and When Visual Information Might Help

    S. Petridis, M. Pantic. IEEE Transactions on Multimedia. 13(2): pp. 216 - 234, April 2011.

    Bibtex reference
    @article{petridis2011TMM,
        author = {S. Petridis and M. Pantic},
        pages = {216--234},
        journal = {IEEE Transactions on Multimedia},
        month = {April},
        number = {2},
        title = {Audiovisual Discrimination Between Speech and Laughter: Why and When Visual Information Might Help},
        volume = {13},
        year = {2011},
    }
    Endnote reference
    %0 Journal Article
    %T Audiovisual Discrimination Between Speech and Laughter: Why and When Visual Information Might Help
    %A Petridis, S.
    %A Pantic, M.
    %J IEEE Transactions on Multimedia
    %D 2011
    %8 April
    %V 13
    %N 2
    %F petridis2011TMM
    %P 216-234

  6. Classifying laughter and speech using audio-visual feature prediction

    S. Petridis, A. Asghar, M. Pantic. Proceedings of IEEE Int'l Conf. Acoustics, Speech and Signal Processing (ICASSP'10). Dallas, USA, pp. 5254 - 5257, March 2010.

    Bibtex reference
    @inproceedings{Petridis2010clasu,
        author = {S. Petridis and A. Asghar and M. Pantic},
        pages = {5254--5257},
        address = {Dallas, USA},
        booktitle = {Proceedings of IEEE Int'l Conf. Acoustics, Speech and Signal Processing (ICASSP'10)},
        month = {March},
        title = {Classifying laughter and speech using audio-visual feature prediction},
        year = {2010},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Classifying laughter and speech using audio-visual feature prediction
    %A Petridis, S.
    %A Asghar, A.
    %A Pantic, M.
    %B Proceedings of IEEE Int'l Conf. Acoustics, Speech and Signal Processing (ICASSP'10)
    %D 2010
    %8 March
    %C Dallas, USA
    %F Petridis2010clasu
    %P 5254-5257

  7. Static vs. Dynamic Modelling of Human Nonverbal Behaviour from Multiple Cues and Modalities

    S. Petridis, H. Gunes, S. Kaltwang, M. Pantic. Proceedings of ACM Int'l Conf. Multimodal Interfaces (ICMI'09). Cambridge, USA, pp. 23 - 30, November 2009.

    Bibtex reference
    @inproceedings{Petridis2009svdmo,
        author = {S. Petridis and H. Gunes and S. Kaltwang and M. Pantic},
        pages = {23--30},
        address = {Cambridge, USA},
        booktitle = {Proceedings of ACM Int'l Conf. Multimodal Interfaces (ICMI'09)},
        month = {November},
        title = {Static vs. Dynamic Modelling of Human Nonverbal Behaviour from Multiple Cues and Modalities},
        year = {2009},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Static vs. Dynamic Modelling of Human Nonverbal Behaviour from Multiple Cues and Modalities
    %A Petridis, S.
    %A Gunes, H.
    %A Kaltwang, S.
    %A Pantic, M.
    %B Proceedings of ACM Int'l Conf. Multimodal Interfaces (ICMI'09)
    %D 2009
    %8 November
    %C Cambridge, USA
    %F Petridis2009svdmo
    %P 23-30

  8. Is this joke really funny? Judging the mirth by audiovisual laughter analysis

    S. Petridis, M. Pantic. Proceedings of IEEE Int'l Conf. Multimedia and Expo (ICME'09). Cancun, Mexico, pp. 1444 - 1447, July 2009.

    Bibtex reference
    @inproceedings{Petridis2009itjrf,
        author = {S. Petridis and M. Pantic},
        pages = {1444--1447},
        address = {Cancun, Mexico},
        booktitle = {Proceedings of IEEE Int'l Conf. Multimedia and Expo (ICME'09)},
        month = {July},
        title = {Is this joke really funny? Judging the mirth by audiovisual laughter analysis},
        year = {2009},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Is this joke really funny? Judging the mirth by audiovisual laughter analysis
    %A Petridis, S.
    %A Pantic, M.
    %B Proceedings of IEEE Int'l Conf. Multimedia and Expo (ICME'09)
    %D 2009
    %8 July
    %C Cancun, Mexico
    %F Petridis2009itjrf
    %P 1444-1447

  9. Audiovisual laughter detection based on temporal features

    S. Petridis, M. Pantic. Proceedings of ACM Int'l Conf. Multimodal Interfaces (ICMI'08). Chania, Greece, pp. 37 - 44, October 2008.

    Bibtex reference
    @inproceedings{Petridis2008aldbo,
        author = {S. Petridis and M. Pantic},
        pages = {37--44},
        address = {Chania, Greece},
        booktitle = {Proceedings of ACM Int'l Conf. Multimodal Interfaces (ICMI'08)},
        month = {October},
        title = {Audiovisual laughter detection based on temporal features},
        year = {2008},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Audiovisual laughter detection based on temporal features
    %A Petridis, S.
    %A Pantic, M.
    %B Proceedings of ACM Int'l Conf. Multimodal Interfaces (ICMI'08)
    %D 2008
    %8 October
    %C Chania, Greece
    %F Petridis2008aldbo
    %P 37-44

  10. Decision-level fusion for audio-visual laughter detection

    B. Reuderink, M. Poel, K. Truong, R. Poppe, M. Pantic. Proceedings of Joint Int'l Workshop on Machine Learning and Multimodal Interaction (MLMI'08). Utrecht, Netherlands, 5237: pp. 137 - 148, September 2008.

    Bibtex reference
    @inproceedings{Reuderink2008dffal,
        author = {B. Reuderink and M. Poel and K. Truong and R. Poppe and M. Pantic},
        pages = {137--148},
        address = {Utrecht, Netherlands},
        booktitle = {Proceedings of Joint Int'l Workshop on Machine Learning and Multimodal Interaction (MLMI'08)},
        month = {September},
        title = {Decision-level fusion for audio-visual laughter detection},
        volume = {5237},
        year = {2008},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Decision-level fusion for audio-visual laughter detection
    %A Reuderink, B.
    %A Poel, M.
    %A Truong, K.
    %A Poppe, R.
    %A Pantic, M.
    %B Proceedings of Joint Int'l Workshop on Machine Learning and Multimodal Interaction (MLMI'08)
    %D 2008
    %8 September
    %V 5237
    %C Utrecht, Netherlands
    %F Reuderink2008dffal
    %P 137-148

  11. Fusion of audio and visual cues for laughter detection

    S. Petridis, M. Pantic. Proceedings of ACM Int'l Conf. Image and Video Retrieval (CIVR'08). Niagara Falls, Canada, pp. 329 - 337, July 2008.

    Bibtex reference
    @inproceedings{Petridis2008foaav,
        author = {S. Petridis and M. Pantic},
        pages = {329--337},
        address = {Niagara Falls, Canada},
        booktitle = {Proceedings of ACM Int'l Conf. Image and Video Retrieval (CIVR'08)},
        month = {July},
        title = {Fusion of audio and visual cues for laughter detection},
        year = {2008},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Fusion of audio and visual cues for laughter detection
    %A Petridis, S.
    %A Pantic, M.
    %B Proceedings of ACM Int'l Conf. Image and Video Retrieval (CIVR'08)
    %D 2008
    %8 July
    %C Niagara Falls, Canada
    %F Petridis2008foaav
    %P 329-337

  12. Audiovisual discrimination between laughter and speech

    S. Petridis, M. Pantic. Proceedings of IEEE Int'l Conf. Acoustics, Speech and Signal Processing (ICASSP'08). Las Vegas, USA, pp. 5117 - 5120, April 2008.

    Bibtex reference
    @inproceedings{Petridis2008adbla,
        author = {S. Petridis and M. Pantic},
        pages = {5117--5120},
        address = {Las Vegas, USA},
        booktitle = {Proceedings of IEEE Int'l Conf. Acoustics, Speech and Signal Processing (ICASSP'08)},
        month = {April},
        title = {Audiovisual discrimination between laughter and speech},
        year = {2008},
    }
    Endnote reference
    %0 Conference Proceedings
    %T Audiovisual discrimination between laughter and speech
    %A Petridis, S.
    %A Pantic, M.
    %B Proceedings of IEEE Int'l Conf. Acoustics, Speech and Signal Processing (ICASSP'08)
    %D 2008
    %8 April
    %C Las Vegas, USA
    %F Petridis2008adbla
    %P 5117-5120