In this part of our research, we propose a set of novel visual descriptors derived from a set of detected spatiotemporal salient points. The salient points we extract correspond to areas where independent motion occurs, such as ongoing activities in the scene. Centered at each salient point, we define a spatiotemporal neighborhood whose dimensions are proportional to the point's detected space-time scale. A three-dimensional piecewise polynomial, namely a B-spline, is then fitted to the locations of the salient points that fall within this neighborhood. Our descriptors are subsequently derived as the partial derivatives of the resulting polynomial.
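The fitting-and-differentiation step can be sketched as follows. This is a simplified illustration, not the exact formulation: it treats the salient points in one neighborhood as scattered samples of a surface y = f(x, t), fits a smoothing bicubic B-spline with SciPy, and takes the spline's partial derivatives as descriptors. The synthetic point coordinates and the two-derivative descriptor are assumptions made for the sketch.

```python
import numpy as np
from scipy.interpolate import SmoothBivariateSpline

# Hypothetical salient-point locations inside one spatiotemporal neighborhood:
# spatial coordinate x, temporal coordinate t, and a synthetic surface y = f(x, t).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
t = rng.uniform(0, 10, 200)
y = 0.5 * x + 0.2 * t + 0.05 * x * t

# Fit a smoothing bicubic B-spline surface to the scattered points.
spline = SmoothBivariateSpline(x, t, y, kx=3, ky=3)

# Descriptors: first-order partial derivatives of the fitted polynomial,
# evaluated back at the salient-point locations (grid=False -> pointwise).
d_dx = spline(x, t, dx=1, dy=0, grid=False)
d_dt = spline(x, t, dx=0, dy=1, grid=False)
descriptors = np.stack([d_dx, d_dt], axis=1)
print(descriptors.shape)  # one 2-vector of derivatives per point
```

Because the descriptors are derivatives of the fitted surface rather than raw coordinates, shifting the neighborhood in space or time leaves them unchanged, which is the translation invariance discussed below.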
First derivatives along two directions of a B-spline polynomial, plotted as 3-dimensional vectors.
At the next step, the set of descriptors extracted from each spline is accumulated into a number of histograms; this number depends on the maximum degree of the partial derivatives. Since our descriptors correspond to geometric properties of the spline, they are translation invariant. Furthermore, using the automatically detected space-time scales of the salient points to define the neighborhood ensures invariance to space and time scaling. Subsequently, we create a codebook of visual verbs by clustering our motion descriptors across the whole dataset, where a visual verb corresponds to a combined shape and motion descriptor. Each video in our dataset is then represented as a histogram of visual verbs. We use a kernel-based classifier, namely the Relevance Vector Machine (RVM), to classify test examples into one of the classes present in the training dataset. We evaluate the proposed method on publicly available human action datasets, namely the Weizmann and KTH datasets.
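The codebook construction and histogram representation can be sketched as below. This is a minimal stand-in, not our exact pipeline: it uses plain Lloyd's k-means in place of whatever clustering method is chosen, random vectors in place of real spline-derivative descriptors, and an assumed codebook size of 16; the RVM classification stage is omitted, since no standard RVM implementation is assumed available.

```python
import numpy as np

def kmeans(data, k, iters=50, seed=0):
    """Plain Lloyd's k-means; stands in for the codebook-clustering step."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((data[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean(axis=0)
    return centers

def assign(data, centers):
    """Map each descriptor to its nearest codebook entry (visual verb)."""
    return np.argmin(((data[:, None] - centers) ** 2).sum(-1), axis=1)

# Hypothetical motion descriptors pooled across the whole training set.
rng = np.random.default_rng(1)
all_desc = rng.normal(size=(500, 8))

k = 16  # codebook size, a free parameter in practice
codebook = kmeans(all_desc, k)

# Represent one video as a normalized histogram of visual-verb assignments.
video_desc = rng.normal(size=(60, 8))  # descriptors from a single video
words = assign(video_desc, codebook)
hist = np.bincount(words, minlength=k).astype(float)
hist /= hist.sum()
print(hist.shape)  # (16,) -- the video's bag-of-visual-verbs representation
```

The resulting fixed-length histograms are what the kernel-based classifier consumes, so videos with different numbers of salient points still yield comparable representations.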
Confusion matrices for the KTH and Weizmann datasets.