CN104063721B - Human behavior recognition method based on automatic learning and screening of semantic features - Google Patents
- Granted: Fri Jun 16 2017
Info
Publication number
- CN104063721B
Application number
- CN201410319126.5A
Authority
- CN (China)
Prior art keywords
- features
- spatio-temporal
- video
- level
Prior art date
- 2014-07-04
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Image Analysis (AREA)
Abstract
The invention discloses an efficient human behavior recognition method based on automatic learning and screening of semantic features. The method detects spatio-temporal interest points in motion videos and extracts motion and appearance information around them; builds, on top of the interest point features, low-level features that encode spatio-temporal context, describing all interest point features in a local region together with the relative spatio-temporal positions between the points; generates high-level semantic features from the low-level features with a graph-regularized non-negative matrix factorization algorithm; and applies L2,1-norm group sparsity to select the representative and discriminative high-level semantics of each behavior category. Through optimization of the model, the representative semantic features of each category are retained, and only the optimized semantic features from the same behavior category are used to train the classifier. The invention substantially raises the level of intelligence of human behavior recognition.
Description
Technical Field
The invention relates to the field of computer application technology, and in particular to a behavior recognition method based on automatic learning and screening of semantic features.
Background
Vision is an important way for human beings to observe and understand the world. As computer processing power keeps improving, we hope that computers can acquire part of the human visual function and help, or even replace, the human eye and brain in observing and perceiving the outside world. With the growth of hardware processing power and the emergence of computer vision technology, this expectation may become reality.
The purpose of video-based human behavior analysis is to understand and recognize individual human actions, interactions between people, and interactions between people and their surroundings. It uses computer technology to achieve video-based human detection, human tracking, and understanding of human behavior with little or no human intervention. Although this is a simple, instinctive task for the human cognitive system, it is highly challenging for a computer system: the surrounding environment is complex, and people differ in posture, movement habits, and other respects, which makes it difficult to accurately understand and analyze human behavior in video.
Traditional human behavior recognition methods mainly use low-level video features such as appearance features, shape features, optical flow features, and spatio-temporal interest point features. Among these, spatio-temporal interest point features combined with the bag-of-words model are the most popular: the model is simple, achieves high recognition accuracy, is robust to noise, occlusion, and deformation, and does not require tracking the target.
In "X. Burgos, P. Dollar, D. Lin, D. Anderson, P. Perona, Social behavior recognition in continuous video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012" (Reference 1), normalized pixel intensity, gradient, and optical flow features are used to describe the region around each spatio-temporal interest point and hence the motion behavior; good recognition results are achieved on several motion behavior datasets, with the gradient feature performing best. That method concatenates the feature vectors extracted from the sub-regions around an interest point into a histogram feature; its drawback is sensitivity to changes in illumination and other external factors. In "I. Laptev, T. Lindeberg, Local descriptors for spatio-temporal recognition, Spatial Coherence for Visual Motion Analysis, 2006" (Reference 2), various partitions of the region around an interest point and various feature combinations are tried to improve recognition accuracy; the combination of optical flow and gradient gives the best results. In "A. Klaser, M. Marszalek, C. Schmid, A spatio-temporal descriptor based on 3D-gradients, in: Proceedings of the British Machine Vision Conference" (Reference 3), a stable and computationally simple three-dimensional spatio-temporal feature is built that quantizes space into 20 directions using a regular polyhedron; it still describes local interest point features by building a histogram of gradient directions.
In recent years, it has been found that traditional low-level features have serious limitations for describing motion behavior: they cannot effectively describe the temporal and spatial information of moving targets. Researchers have therefore tried to build mid-level and high-level semantic features on top of the low-level features in order to describe motion behavior more accurately. In "J. Liu, M. Shah, B. Kuipers, S. Savarese, Cross-view action recognition via view knowledge transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3209-3216, 2011" (Reference 4), mutual information maximization is used to learn a compact mid-level dictionary: visual words with similar distributions are merged into one, and spatio-temporal pyramid matching is used to mine temporal information. In "J. Liu, Y. Yang, I. Saleemi, M. Shah, Learning semantic features for action recognition via diffusion map, Computer Vision and Image Understanding, Vol. 116, No. 3, pp. 361-377, 2012" (Reference 5), diffusion maps are used to automatically learn a high-level semantic vocabulary from a large number of mid-level features, each represented as a vector of mutual information. However, the algorithm produces multiple vocabularies corresponding to different categories, so the vocabularies lack generality, which limits the algorithm's practical application.
Compared with traditional low-level features, high-level semantic features describe the temporal and spatial attributes of motion behavior more accurately. They still have shortcomings, however: most learned high-level attributes are built on low-level features, and the low-level features extracted from video contain not only foreground features but also a large number of background features, which degrade the discriminativeness of the automatically learned high-level semantic attributes.
Summary of the Invention
(1) Technical Problem to Be Solved
The purpose of the present invention is to overcome the shortcomings of existing behavior recognition methods in high-level semantic learning, and thus to propose a behavior recognition method based on automatic learning and screening of semantic features.
(2) Technical Solution
The present invention builds spatio-temporal context features on top of traditional interest point features, generates high-level semantic features with a graph-regularized non-negative matrix factorization algorithm, and then designs a group-sparsity-based high-level feature screening algorithm that extracts the high-level semantic features representative of each behavior category.
The human behavior recognition method based on automatic learning and screening of semantic features proposed by the present invention comprises:
Step S1: detecting spatio-temporal interest points in a video;
Step S2: extracting low-level video features from the region around each spatio-temporal interest point;
Step S3: building spatio-temporal context features from the low-level video features;
Step S4: generating high-level semantic features from the low-level video features with a graph-regularized non-negative matrix factorization algorithm;
Step S5: using L2,1-norm group sparsity to select representative and discriminative high-level semantics from the high-level semantic features;
Step S6: training a classifier with the selected high-level semantic features, and classifying videos with the trained classifier.
In one embodiment, step S2 includes: extracting the appearance features of the region around each spatio-temporal interest point with histogram-of-gradient features, and extracting the motion features of that region with histogram-of-optical-flow features.
In one embodiment, step S3 includes: taking a single spatio-temporal interest point as the center, searching for the N interest points nearest to it; designing a spatio-temporal context feature that simultaneously describes the N+1 interest point features in the local region and the relative positions between them; and using a weight vector to constrain the features of the different neighboring interest points, with neighbors closer to the central interest point receiving larger weights.
In one embodiment, step S4 includes: using graph-regularized non-negative matrix factorization to decompose each sample into a linear combination of a set of basis vectors, with all combination coefficients non-negative; the algorithm decomposes human motion behavior into parts-based representations while keeping similar motion behaviors similar under the new basis representation.
In one embodiment, step S5 includes:
using a joint group sparse model over matrices and vectors to encourage motion behaviors of the same category to be reconstructed by similar semantic features; retaining the representative semantic features of each behavior category while suppressing features that appear only in isolated within-class samples; and using the optimized semantic features from the same behavior category to reconstruct test samples.
(3) Beneficial Effects
The present invention builds stable low-level features by designing spatio-temporal context features, learns more descriptive high-level semantic features on that basis with a graph-regularized non-negative matrix factorization algorithm, and then uses group sparsity to select the high-level semantics that are strongly representative and discriminative for each behavior category; these selected semantics are used for classification. This high-level-semantics-based approach better learns the essential attributes of different behavior categories and achieves better recognition results.
Brief Description of the Drawings
Fig. 1 is a flowchart of the human behavior recognition method of the present invention;
Fig. 2A and Fig. 2B are schematic diagrams of high-level semantic features in one embodiment of the present invention.
Detailed Description
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
Fig. 1 is a flowchart of the human behavior recognition method of the present invention. As shown in Fig. 1, the method comprises the following steps.
Step S1: detect spatio-temporal interest points in the video.
A spatio-temporal interest point is a key point in space obtained by corner detection or filtering in three-dimensional space. The spatio-temporal interest points detected in the present invention are key points in the video obtained by applying Gaussian filtering in the spatial domain and Gabor filtering in the temporal domain.
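For concreteness, the following is a minimal sketch of such a separable Gaussian/Gabor detector. The quadrature-pair parameterization, the filter scales sigma and tau, and the response threshold are assumptions of this sketch; the patent names the filter types but does not fix these values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, convolve1d

def detect_stips(video, sigma=2.0, tau=1.5, thresh=2e-4):
    """video: (T, H, W) grayscale array scaled to [0, 1].
    Returns the (t, y, x) coordinates of detected spatio-temporal interest points."""
    # Spatial domain: Gaussian smoothing of each frame.
    v = gaussian_filter(video.astype(np.float64), sigma=(0, sigma, sigma))
    # Temporal domain: a quadrature pair of 1D Gabor filters.
    t = np.arange(-3 * int(np.ceil(tau)), 3 * int(np.ceil(tau)) + 1)
    omega = 4.0 / tau
    env = np.exp(-t ** 2 / tau ** 2)
    h_ev = -np.cos(2 * np.pi * omega * t) * env   # even Gabor component
    h_od = -np.sin(2 * np.pi * omega * t) * env   # odd Gabor component
    # Response: energy of the quadrature pair along the time axis.
    r = convolve1d(v, h_ev, axis=0) ** 2 + convolve1d(v, h_od, axis=0) ** 2
    # Interest points: local maxima of the response above a threshold.
    peaks = (r == maximum_filter(r, size=(5, 9, 9))) & (r > thresh)
    return np.argwhere(peaks)
```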
Step S2: extract low-level video features from the region around each spatio-temporal interest point.
"Around" refers to a cube region centered on the location of the spatio-temporal interest point. The low-level video features extracted in the present invention are motion and appearance features that characterize the region around the interest point.
In a specific embodiment, multi-scale spatio-temporal interest points can be extracted; for example, histogram-of-optical-flow and histogram-of-gradient features are used to describe, respectively, the motion and the appearance of the region around each interest point.
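A minimal sketch of the histogram-of-gradients part follows; a histogram-of-optical-flow descriptor would bin flow vectors (from an external optical-flow routine) in the same way. The bin count and the single-cell layout without block subdivision are simplifying assumptions.

```python
import numpy as np

def hog_descriptor(cuboid, n_bins=8):
    """Histogram of gradient orientations over a (T, H, W) cuboid centered
    on an interest point, weighted by gradient magnitude, L2-normalized."""
    gy, gx = np.gradient(cuboid.astype(np.float64), axis=(1, 2))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)                                    # in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (np.linalg.norm(hist) + 1e-8)
```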
Step S3: build spatio-temporal context features from the low-level video features.
A spatio-temporal context feature is an overall feature jointly formed by several neighboring spatio-temporal interest points, and thus carries more context information.
Step S3 includes taking a single spatio-temporal interest point as the center, finding the N interest points nearest to it, and then designing a spatio-temporal context feature that simultaneously describes the N+1 interest point features in the local region and the relative positions between them.
At the same time, a weight vector is used to constrain the features of the different neighboring interest points; the closer a neighbor is to the central interest point, the larger its weight. In this way, for any spatio-temporal interest point extracted from a motion behavior video, the features and spatial positions of its neighboring interest points can be obtained.
From the Euclidean distances in the three-dimensional spatio-temporal coordinates, the local visual-word context feature of this region is computed as:

$F_p = [h_1, h_2, \ldots, h_K]^T$,  (1)

$h_i = \begin{cases} 1, & \text{if } \operatorname{label}(p) = i \\ \sum_{j=1}^{N-1} \beta \cdot \dfrac{\delta(\operatorname{label}(q_j) - i)}{D_\sigma(p, q_j)}, & \text{otherwise} \end{cases}$  (2)

where label(p) denotes the visual-word label of the central interest point p, the $q_j$ are its neighboring interest points, $\delta(\cdot)$ is the indicator function, $D_\sigma(p, q_j)$ is the σ-scaled spatio-temporal distance between p and $q_j$, and β is a weighting coefficient. By rebuilding the bag-of-words model on the local visual-word context features, each behavior video is represented as a vector of low-level features.
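The following sketch computes this context feature for one interest point under the definition reconstructed above; treating $D_\sigma$ as the plain 3D Euclidean distance and β as a constant weight are assumptions of the sketch.

```python
import numpy as np

def context_feature(center_idx, points, labels, K, N=8, beta=1.0):
    """Local visual-word context feature F_p = [h_1, ..., h_K]^T for one
    interest point. points: (M, 3) array of (t, y, x) coordinates;
    labels: visual-word index of each interest point; K: vocabulary size."""
    d = np.linalg.norm(points - points[center_idx], axis=1)  # 3D Euclidean distances
    neighbors = np.argsort(d)[1:N + 1]                       # N nearest neighbors
    h = np.zeros(K)
    for j in neighbors:
        h[labels[j]] += beta / (d[j] + 1e-8)   # closer neighbors contribute more
    h[labels[center_idx]] = 1.0                # center's own word: h_i = 1
    return h
```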
Step S4: generate high-level semantic features from the low-level video features with a graph-regularized non-negative matrix factorization algorithm.
"High-level semantic features" are high-level features that convey semantic information, as opposed to traditional low-level features.
The graph-regularized non-negative matrix factorization algorithm decomposes each sample into a linear combination of a set of basis vectors, with all combination coefficients non-negative. The algorithm decomposes human motion behavior into parts-based representations while keeping similar motion behaviors similar under the new basis representation.
Let $y_j^i \in \mathbb{R}^d$, $i = 1, \ldots, C$, $j = 1, \ldots, n_i$, denote a d-dimensional low-level feature representation of the j-th video sample of the i-th behavior category. The feature vectors of all videos of category i form a matrix $Y_i = [y_1^i, y_2^i, \ldots, y_{n_i}^i]$. Graph-regularized non-negative matrix factorization minimizes the following objective function:

$O = \|Y_i - UV\|_F^2 + \lambda\, \operatorname{Tr}(V L V^T)$,  (3)

where $U$ and $V$ are two non-negative matrices, $L = D - W$ is the graph Laplacian, and $W$ is a symmetric non-negative similarity matrix. We use heat-kernel weights $W_{jk} = e^{-\|y_j^i - y_k^i\|^2 / t}$; $D$ is a diagonal matrix whose diagonal entries are the sums of the corresponding columns (or rows, since W is symmetric) of W. Each column vector of the matrix U is regarded as a behavior unit.
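A compact multiplicative-update sketch of this factorization follows; it assumes the standard graph-regularized NMF updates (Cai et al.), which match the objective above. The iteration count and initialization are arbitrary choices.

```python
import numpy as np

def gnmf(Y, K, W, lam=1.0, n_iter=200, eps=1e-9):
    """Graph-regularized NMF: minimize ||Y - U V||_F^2 + lam * Tr(V L V^T)
    by multiplicative updates. Y: (d, n) non-negative low-level features;
    W: (n, n) symmetric non-negative similarity (e.g., heat-kernel weights)."""
    rng = np.random.default_rng(0)
    U = rng.random((Y.shape[0], K))
    V = rng.random((K, Y.shape[1]))
    D = np.diag(W.sum(axis=0))        # degree matrix; L = D - W
    for _ in range(n_iter):
        U *= (Y @ V.T) / (U @ V @ V.T + eps)
        V *= (U.T @ Y + lam * V @ W) / (U.T @ U @ V + lam * V @ D + eps)
    return U, V   # each column of U is one learned "behavior unit"
```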
Fig. 2A and Fig. 2B are schematic diagrams of high-level semantic features in one embodiment of the present invention. As shown in Fig. 2A and Fig. 2B, the behavior "walking" is composed of the motion of the trunk and the motion of the limbs, while the behavior "waving" is composed of the motion of the two arms.
Step S5: use L2,1-norm group sparsity to select representative and discriminative high-level semantics from the high-level semantic features.
In one embodiment, a vector-based group sparse algorithm is built on top of the existing matrix group sparse algorithm. The joint group sparse model over matrices and vectors encourages human motion behaviors of the same category to be reconstructed by similar semantic features, and suppresses features that appear only in isolated within-class samples. Through optimization of the model, the representative semantic features of each behavior category are retained, and only the optimized semantic features from the same behavior category are used to reconstruct test samples.
For a vector $b = [b_1, b_2, \ldots, b_m]^T$ whose elements are divided into G groups, with the g-th group containing $m_g$ elements ($\sum_{g=1}^{G} m_g = m$), the L2,1 norm of b is defined as:

$\|b\|_{2,1} = \sum_{g=1}^{G} \|b^{(g)}\|_2$,  (4)

where $b^{(g)}$ denotes the sub-vector formed by the elements of the g-th group.
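As a small illustration, the norm can be computed directly from this definition; the grouping is supplied by the caller.

```python
import numpy as np

def l21_norm(b, groups):
    """L2,1 norm of a vector: the sum of the L2 norms of its element groups.
    groups: a list of index arrays that partition the entries of b."""
    return sum(np.linalg.norm(b[g]) for g in groups)

# Example: two groups of three elements each.
b = np.array([3.0, 4.0, 0.0, 0.0, 5.0, 12.0])
print(l21_norm(b, [np.arange(3), np.arange(3, 6)]))  # 5.0 + 13.0 = 18.0
```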
Suppose the i-th behavior category contains $m_i$ behavior units, $B_i = [b_{i1}, b_{i2}, \ldots, b_{im_i}]$. Initialize the dictionary $B = [B_1, B_2, \ldots, B_C]$, where $b_{ij}$ denotes the j-th behavior unit of the i-th category. The proposed sparse model based on behavior-unit selection is:

$\min_{\{X_i\}} \sum_{i=1}^{C} \left( \|Y_i - B X_i\|_F^2 + \gamma_1 \sum_{j=1}^{n_i} \|x_j^i\|_{2,1} + \gamma_2 \|X_i\|_{2,1} \right)$,  (5)

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix, $\|\cdot\|_{2,1}$ the L2,1 norm, and $x_j^i$ the j-th column of $X_i$. The term $\|x_j^i\|_{2,1}$ penalizes each group of elements of the coefficient vector as a whole, encouraging each motion behavior to be reconstructed by behavior units from the same category. The term $\|X_i\|_{2,1}$ penalizes each row of the matrix $X_i$ as a whole, producing row-wise sparsity and encouraging motion behaviors from the same category to be reconstructed by similar behavior units. $\gamma_1$ and $\gamma_2$ are regularization coefficients.
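The following proximal-gradient sketch solves a reduced form of model (5) that keeps only the row-sparsity term $\gamma_2 \|X_i\|_{2,1}$; the per-vector group term weighted by $\gamma_1$ would add an analogous group-wise shrinkage step. The solver choice and step-size rule are assumptions, not the patent's prescribed optimizer.

```python
import numpy as np

def prox_rows(X, t):
    """Proximal operator of t * ||X||_{2,1}: row-wise group soft-thresholding."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X * np.maximum(0.0, 1.0 - t / (norms + 1e-12))

def select_units(Y, B, gamma2=0.1, n_iter=300):
    """Proximal gradient for min_X 0.5*||Y - B X||_F^2 + gamma2 * ||X||_{2,1}.
    Y: (d, n) videos of one category; B: (d, m) dictionary of behavior units."""
    X = np.zeros((B.shape[1], Y.shape[1]))
    step = 1.0 / (np.linalg.norm(B, 2) ** 2 + 1e-12)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = B.T @ (B @ X - Y)                      # gradient of the smooth term
        X = prox_rows(X - step * grad, step * gamma2)
    return X  # rows of X with large norm mark the behavior units to retain
```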
Step S6: train a classifier with the selected high-level semantic features, and classify video samples with the trained classifier.
In a specific implementation, we wish to train a classification model

$\min_{B, f} \sum_{t=1}^{P} \ell\big(f(\phi(y_t, B)),\, l_t\big)$,  (6)

where $\phi(y_t, B)$ is the sparse model, B is the dictionary in the model, $f(x_t) = f(\phi(y_t, B))$ is the prediction model, $l_t$ is the category label of behavior video $y_t$, $\ell$ is the classification loss function, and P is the number of training samples.
The dictionary optimization procedure is iterative and consists of two parts: solving for the sparse representations of the samples with the dictionary B fixed, and updating the dictionary with the sparse representations of the samples fixed.
Through the above training procedure, the high-level semantic features that are strongly representative and discriminative for each behavior category are learned. These features are used to train a classifier (e.g., an SVM) and obtain its parameters; the trained SVM classifier model then classifies the test videos and outputs the classification results.
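A minimal sketch of this final stage follows, assuming scikit-learn's SVC as the SVM implementation; the kernel and C value are arbitrary choices, not specified by the patent.

```python
import numpy as np
from sklearn.svm import SVC

def train_and_classify(train_feats, train_labels, test_feats):
    """Train an SVM on the selected high-level semantic features and
    classify the test videos. feats: (n_samples, n_selected_units)."""
    clf = SVC(kernel="rbf", C=10.0)
    clf.fit(np.asarray(train_feats), np.asarray(train_labels))
    return clf.predict(np.asarray(test_feats))
```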
A specific example is described below.
As shown in Fig. 2A, for the motion behavior "walking", the traditional way of extracting low-level features can only accumulate a histogram of gradient, optical flow, and similar features to represent the behavior. This ignores the local motions that make up the complete behavior and discriminates poorly between different motion behaviors. The proposed high-level semantic features analyze the complete behavior through its different local motions, such as "motion of the trunk" and "motion of the limbs", and thus have a stronger descriptive ability than traditional low-level features.
The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (3)
1. A human behavior recognition method based on automatic learning and screening of semantic features, characterized in that the method comprises:

Step S1: detecting spatio-temporal interest points in a video, a spatio-temporal interest point being a key point in space obtained by corner detection or filtering in three-dimensional space; the detected spatio-temporal interest points are key points in the video obtained by applying Gaussian filtering in the spatial domain and Gabor filtering in the temporal domain;

Step S2: extracting low-level video features from the region around each spatio-temporal interest point, the region being a cube centered on the location of the interest point; the low-level video features are motion and appearance features characterizing the region around the interest point, with histogram-of-gradient features used to extract the appearance features and histogram-of-optical-flow features used to extract the motion features;

Step S3: building spatio-temporal context features from the low-level video features, a spatio-temporal context feature being an overall feature jointly formed by several neighboring spatio-temporal interest points and thus carrying more context information; taking a single spatio-temporal interest point as the center, the N interest points nearest to it are searched out, and a spatio-temporal context feature is designed that simultaneously describes the N+1 interest point features in the local region and the relative positions between them; a weight vector constrains the features of the different neighboring interest points, with neighbors closer to the central interest point receiving larger weights, so that for any spatio-temporal interest point extracted from a motion behavior video, the features and spatial positions of its neighboring interest points can be obtained;

from the Euclidean distances in the three-dimensional spatio-temporal coordinates, the local visual-word context feature of this region is computed:

$F_p = [h_1, h_2, \ldots, h_K]^T$,

$h_i = \begin{cases} 1, & \text{if } \operatorname{label}(p) = i \\ \sum_{j=1}^{N-1} \beta \cdot \dfrac{\delta(\operatorname{label}(q_j) - i)}{D_\sigma(p, q_j)}, & \text{otherwise} \end{cases}$

where label(p) denotes the visual-word label of interest point p; by rebuilding the bag-of-words model on the local visual-word context features, each behavior video is represented as a vector of low-level features;

Step S4: generating high-level semantic features from the low-level video features with a graph-regularized non-negative matrix factorization algorithm;

Step S5: using L2,1-norm group sparsity to select representative and discriminative high-level semantics from the high-level semantic features;

Step S6: training a classifier with the selected high-level semantic features, and classifying videos with the trained classifier.

2. The human behavior recognition method based on automatic learning and screening of semantic features according to claim 1, characterized in that step S4 comprises:

using graph-regularized non-negative matrix factorization to decompose each sample into a linear combination of a set of basis vectors, with all combination coefficients non-negative; the algorithm decomposes human motion behavior into parts-based representations while keeping similar motion behaviors similar under the new basis representation.

3. The human behavior recognition method based on automatic learning and screening of semantic features according to claim 1, characterized in that step S5 comprises:

using a joint group sparse model over matrices and vectors to encourage motion behaviors of the same category to be reconstructed by similar semantic features; retaining the representative semantic features of each behavior category while suppressing features that appear only in isolated within-class samples; and using the optimized semantic features from the same behavior category to reconstruct test samples.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410319126.5A CN104063721B (en) | 2014-07-04 | 2014-07-04 | A kind of human behavior recognition methods learnt automatically based on semantic feature with screening |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104063721A CN104063721A (en) | 2014-09-24 |
CN104063721B true CN104063721B (en) | 2017-06-16 |
Family
ID=51551423
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410319126.5A Active CN104063721B (en) | 2014-07-04 | 2014-07-04 | A kind of human behavior recognition methods learnt automatically based on semantic feature with screening |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104063721B (en) |
Families Citing this family (8)
* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN104881655B (en) * | 2015-06-03 | 2018-08-28 | 东南大学 | A kind of human behavior recognition methods based on the fusion of multiple features time-space relationship |
CN106529467B (en) * | 2016-11-07 | 2019-08-23 | 南京邮电大学 | Group behavior recognition methods based on multi-feature fusion |
CN109508698B (en) * | 2018-12-19 | 2023-01-10 | 中山大学 | A Human Behavior Recognition Method Based on Binary Tree |
CN111861275B (en) * | 2020-08-03 | 2024-04-02 | 河北冀联人力资源服务集团有限公司 | Household work mode identification method and device |
CN112347879B (en) * | 2020-10-27 | 2021-06-29 | 中国搜索信息科技股份有限公司 | Theme mining and behavior analysis method for video moving target |
CN112560817B (en) * | 2021-02-22 | 2021-07-06 | 西南交通大学 | Human body action recognition method and device, electronic equipment and storage medium |
CN113590971B (en) * | 2021-08-13 | 2023-11-07 | 浙江大学 | A method and system for recommending points of interest based on brain-like spatial and temporal perception representations |
CN117676187B (en) * | 2023-04-18 | 2024-07-26 | 德联易控科技(北京)有限公司 | Video data processing method and device, electronic equipment and storage medium |
Patent Citations (5)
* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
EP1850271B1 (en) * | 2003-01-29 | 2009-09-09 | Sony Deutschland Gmbh | Method for video mode classification |
CN102393910A (en) * | 2011-06-29 | 2012-03-28 | 浙江工业大学 | Human behavior identification method based on non-negative matrix decomposition and hidden Markov model |
CN102324031A (en) * | 2011-09-07 | 2012-01-18 | 江西财经大学 | Implicit Semantic Feature Extraction Method in Multi-Biometric Identity Authentication for Elderly Users |
CN103077535A (en) * | 2012-12-31 | 2013-05-01 | 中国科学院自动化研究所 | Target tracking method on basis of multitask combined sparse representation |
CN103150579A (en) * | 2013-02-25 | 2013-06-12 | 东华大学 | Abnormal human behavior detecting method based on video sequence |
Also Published As
Publication number | Publication date |
---|---|
CN104063721A (en) | 2014-09-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
2014-09-24 | C06 | Publication | |
2014-09-24 | PB01 | Publication | |
2014-10-22 | C10 | Entry into substantive examination | |
2014-10-22 | SE01 | Entry into force of request for substantive examination | |
2017-06-16 | GR01 | Patent grant |