In this part of our research we propose a voting scheme in the space-time domain that allows both the temporal and the spatial localization of activities. Our method uses an implicit representation of the spatiotemporal shape of an activity, which relies on the spatiotemporal localization of ensembles of spatiotemporal features. These features are localized around spatiotemporal salient points. We compare feature ensembles using a star graph model that compensates for scale changes by using the scales of the features within each ensemble. We use boosting to create codebooks of characteristic ensembles for each class. Subsequently, we match the selected codewords against the training sequences of the respective class and store the spatiotemporal positions at which each codeword is activated. These positions are stored with respect to a set of reference points (e.g., the center of the torso and the lower bound of the subject) and with respect to the start and end of the action instance.
Voting example. (a) During training, the position θ_d and average spatiotemporal scale S_d of the activated ensemble are stored with respect to one or more reference points (e.g., the center of the subject, marked with the blue cross). (b) During testing, votes are cast using the stored θ_d values, normalized by S_q S_d^{-1} in order to account for scale changes.
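The scale-normalized voting described in the caption above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. It makes several assumptions: positions are (x, y, t) triples, the scales are single scalars (the original may use separate spatial and temporal scales), θ_d is taken to be the ensemble position minus the reference position, and S_q is taken to be the scale of the ensemble observed at test time. The names `StoredActivation`, `store_activation` and `cast_votes` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t); an assumed layout


@dataclass
class StoredActivation:
    """One codeword activation recorded during training (hypothetical type)."""
    theta_d: Point  # offset of the ensemble from the reference point
    s_d: float      # average spatiotemporal scale of the ensemble


def store_activation(ensemble_pos: Point, reference_pos: Point,
                     scale: float) -> StoredActivation:
    # Training: keep the ensemble position relative to the reference point.
    theta_d = tuple(e - r for e, r in zip(ensemble_pos, reference_pos))
    return StoredActivation(theta_d, scale)


def cast_votes(query_pos: Point, s_q: float,
               stored: List[StoredActivation]) -> List[Point]:
    # Testing: every stored offset is rescaled by S_q / S_d and subtracted
    # from the position of the activated ensemble, which yields one vote
    # for the location of the reference point.
    votes = []
    for act in stored:
        ratio = s_q / act.s_d
        votes.append(tuple(q - ratio * th
                           for q, th in zip(query_pos, act.theta_d)))
    return votes
```

For instance, an offset of (2, -2, 5) stored at scale 2 and re-activated at position (104, 96, 10) with scale 4 produces a vote at (100, 100, 0): the offset is doubled before it is subtracted, because the subject appears twice as large.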
In this way, we create class-specific spatiotemporal models that encode the spatiotemporal positions at which each codeword is activated in the training set. During testing, each activated codeword casts probabilistic votes for the locations in time where the activity starts and ends, as well as for the locations in space of the reference points used. This yields a set of class-specific voting spaces. We apply Mean Shift in each voting space in order to extract the most probable hypotheses concerning the spatiotemporal extent of the activities.
Overview of the spatiotemporal voting process
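To make the mode-seeking step concrete, the following sketch runs a weighted mean shift with a Gaussian kernel over a small voting space. It is an illustration under stated assumptions, not the procedure used in the original work: the kernel, the bandwidth, the rule for merging nearby modes, and the function names `mean_shift_mode` and `find_modes` are all choices made here for the example.

```python
import math
from typing import List, Sequence, Tuple

Vote = Tuple[float, ...]


def _density(x: Sequence[float], votes: List[Vote],
             weights: List[float], bandwidth: float) -> float:
    # Weighted Gaussian kernel density (unnormalized) at point x.
    return sum(w * math.exp(-0.5 * math.dist(v, x) ** 2 / bandwidth ** 2)
               for v, w in zip(votes, weights))


def mean_shift_mode(votes: List[Vote], weights: List[float],
                    start: Vote, bandwidth: float,
                    tol: float = 1e-6, max_iter: int = 200) -> Vote:
    # Repeatedly move x to the kernel-weighted mean of the votes.
    x = tuple(start)
    for _ in range(max_iter):
        kernel = [w * math.exp(-0.5 * math.dist(v, x) ** 2 / bandwidth ** 2)
                  for v, w in zip(votes, weights)]
        total = sum(kernel)
        if total == 0.0:  # no vote has any influence here
            break
        new_x = tuple(sum(k * v[i] for k, v in zip(kernel, votes)) / total
                      for i in range(len(x)))
        shift = math.dist(new_x, x)
        x = new_x
        if shift < tol:
            break
    return x


def find_modes(votes: List[Vote], weights: List[float],
               bandwidth: float) -> List[Tuple[Vote, float]]:
    # Start one search per vote, merge modes closer than one bandwidth,
    # and return (mode, density) pairs sorted by decreasing density.
    modes: List[Tuple[Vote, float]] = []
    for v in votes:
        m = mean_shift_mode(votes, weights, v, bandwidth)
        if all(math.dist(m, other) >= bandwidth for other, _ in modes):
            modes.append((m, _density(m, votes, weights, bandwidth)))
    modes.sort(key=lambda pair: pair[1], reverse=True)
    return modes
```

Here each vote could be, for example, a (start frame, end frame) pair, with the weight playing the role of the probability attached to the vote. The highest-density mode is then the most probable hypothesis for the temporal extent of the activity.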
Each hypothesis is subsequently verified by performing action category classification with a Relevance Vector Machine (RVM).
Joint localization and recognition: ROC curves corresponding to each class of the KTH dataset.
The proposed method was also used for the detection of social signals in image sequences depicting political debates. More specifically, we performed experiments on hand-raising detection. Hand-raising activities in political debates could be an important cue for agreement/disagreement detection. Here, we consider a single raising and lowering of the speaker's hand as one hand-raising activity instance. We used 10 hand-raising instances to train the corresponding model, and tested the proposed algorithm on 20 test sequences of political debates. The test sequences include viewpoint and scene changes, camera zoom, and videos in which the onset and offset of the action were out of the camera's view. The localization results that we achieved and a still frame of a hand-raising instance are shown in the following figure:
As can be seen, the proposed algorithm was able to localize 90% of the extracted hypotheses within 10 frames of the ground truth annotation.
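A localization figure of this kind can be computed as the fraction of hypotheses whose frame error does not exceed a tolerance. The sketch below assumes that the error is the absolute difference between a predicted and an annotated frame index (for example, the onset of the action); the exact error measure used in the experiments is not stated here, and the function name `fraction_within` is hypothetical.

```python
from typing import Sequence


def fraction_within(predicted: Sequence[int], ground_truth: Sequence[int],
                    tolerance: int) -> float:
    """Fraction of predictions within `tolerance` frames of the annotation."""
    if len(predicted) != len(ground_truth):
        raise ValueError("predicted and ground_truth must have equal length")
    if not predicted:
        raise ValueError("at least one hypothesis is required")
    # Count hypotheses whose absolute frame error is within the tolerance.
    hits = sum(abs(p - g) <= tolerance
               for p, g in zip(predicted, ground_truth))
    return hits / len(predicted)
```

Evaluating the function over a range of tolerances gives a cumulative curve of localization accuracy against allowed frame error.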
A. Oikonomopoulos, I. Patras, M. Pantic, N. Paragios. In: T. Huang, A. Nijholt, A. Pentland, M. Pantic (eds.), Lecture Notes in Artificial Intelligence, Special Volume on Artificial Intelligence for Human Computing, vol. 4451, pp. 133-154, 2007.