
CN106909938A - Perspective-independent Behavior Recognition Method Based on Deep Learning Network


Info

Publication number
CN106909938A
Authority
CN
China
Prior art keywords
deep learning
viewing angle
learning network
model
method based
Prior art date
2017-02-16
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710082263.5A
Other languages
Chinese (zh)
Other versions
CN106909938B (en)
Inventor
王传旭
胡国锋
刘继超
杨建滨
孙海峰
崔雪红
李辉
刘云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Shengruida Technology Co ltd
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-02-16
Filing date
2017-02-16
Publication date
2017-06-30
2017-02-16 Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
2017-02-16 Priority to CN201710082263.5A priority Critical patent/CN106909938B/en
2017-06-30 Publication of CN106909938A publication Critical patent/CN106909938A/en
2020-02-21 Application granted granted Critical
2020-02-21 Publication of CN106909938B publication Critical patent/CN106909938B/en
Status Active legal-status Critical Current
2037-02-16 Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present invention proposes a view-independent behavior recognition method based on a deep learning network, comprising the following steps: video frame images captured at a given viewing angle are input, and low-level features are extracted and processed by means of deep learning; the obtained low-level features are modeled in chronological order into a cube model; the cube models of all viewing angles are converted into a view-invariant cylinder feature space mapping, which is then fed into a classifier for training, yielding a view-independent classifier for video behaviors. The technical solution of the present invention uses a deep learning network to analyze human behavior under multiple viewing angles and improves the robustness of the classification model; it is particularly suited to training and learning on big data, where its advantages can be fully exploited.

Description

Perspective-independent Behavior Recognition Method Based on Deep Learning Network

Technical Field

The present invention relates to the technical field of computer vision, and in particular to a view-independent behavior recognition method based on a deep learning network.

Background Art

With the rapid development of information technology, computer vision has entered its best period of development alongside the emergence of concepts such as VR, AR, and artificial intelligence, and video behavior analysis, one of the most important topics in computer vision, has attracted growing attention from scholars at home and abroad. Video behavior analysis plays a large role in fields such as video surveillance, human-computer interaction, medical care, and video retrieval; in the currently popular field of self-driving cars, for example, it is highly challenging. Because human actions are complex and diverse, and because of factors such as self-occlusion of the human body, multiple scales, and rotation and translation of the viewing angle across multiple views, video behavior recognition is very difficult. How to accurately recognize and analyze human behavior observed from multiple angles in real life has long been an important research topic, and society's demands on behavior analysis continue to rise.

Traditional research methods include the following:

Methods based on spatio-temporal feature points: the spatio-temporal feature points are extracted from the video frame images, then modeled and analyzed, and finally classified.

Methods based on the human skeleton: human skeleton information is extracted by an algorithm or a depth camera, the skeleton information is described and modeled, and the video behavior is then classified.

Behavior analysis methods based on spatio-temporal feature points and skeleton information have achieved remarkable results in the traditional single-view or single-person setting. However, in places with heavy pedestrian traffic such as streets, airports, and stations, or in the presence of complications such as occlusion of the human body, illumination changes, and changes of viewing angle, simply applying these two kinds of analysis often fails to meet practical requirements, and the robustness of the algorithms can be poor.

Summary of the Invention

To remedy the above defects of the prior art, the present invention proposes a view-independent behavior recognition method based on a deep learning network, which uses a deep learning network to analyze human behavior under multiple viewing angles and improves the robustness of the classification model. In particular, a deep learning network is well suited to training and learning on big data, where its advantages can be brought into full play.

The technical solution of the present invention is realized as follows:

A view-independent behavior recognition method based on a deep learning network, comprising a training process that obtains a classifier from a training sample set, and a recognition process that uses the classifier to identify test samples;

The training process comprises the following steps:

S1) inputting the video frame images Image 1 to Image i captured at a given viewing angle in chronological order;

S2) extracting low-level features from the images input in step S1) using a CNN (Convolutional Neural Network) and pooling them, then strengthening the pooled low-level features with an STN (Spatial Transformer Network);

S3) pooling the feature maps strengthened in step S2) and feeding them into an RNN (Recurrent Neural Network) layer for temporal modeling, obtaining a temporally associated cube model;

S4) repeating steps S1) to S3) to obtain spatial cube models of the same behavior under multiple viewing angles, converting the spatial cube models of all viewing angles into a single view-invariant cylinder feature space mapping, and inputting it into the classifier as a training sample of that behavior class;

S5) repeating the above steps to obtain view-independent classifiers for the various behaviors;

The recognition process comprises the following steps:

S6) inputting video frame images captured at a given viewing angle and applying steps S1) to S3) to extract low-level features and build the spatial cube model for that viewing angle;

S7) converting the spatial cube model obtained in step S6) into a view-invariant cylinder feature space mapping and inputting it into the classifier for recognition to obtain the video behavior category.

In the above technical solution, step S2) preferably uses a three-layer convolution operation to extract the low-level features, and steps S2) and S3) preferably use max pooling to reduce the dimensionality of the feature maps.

In the above technical solution, step S3) yields the spatial cube model of a behavior at one particular viewing angle; repeating steps S1) to S3) yields the spatial cube models of the same behavior at multiple viewing angles.

In the technical solution of the present invention, an LSTM (Long Short-Term Memory) network is preferably used for the temporal modeling: since back-propagation in a deep learning network uses stochastic gradient descent, the special gate operations of the LSTM help prevent the vanishing-gradient problem across layers.

In the above technical solution, step S4) specifically comprises:

S41) repeating steps S1) to S3) to obtain the spatial cube models of the same behavior at each viewing angle, and integrating them into a cylindrical space with x, y, and z as coordinate axes, the cylindrical space representing the trajectory description of the motion features under each viewing angle;

S42) applying a polar coordinate transformation to the model obtained in step S41) to obtain an angle-invariant cylinder space mapping.

The above technical solution further comprises: S0) constructing a data set; the present invention preferably uses the IXMAS data set.

Compared with the prior art, the technical solution of the present invention differs in the following respects:

1. The CNN approach is used to extract the low-level features, yielding global features rather than the key points obtained by traditional methods.

2. The STN is used to strengthen the obtained global features, instead of modeling the obtained features directly.

3. The LSTM network performs temporal modeling of the global features after strengthening and dimensionality reduction, adding the important temporal information that links them across time.

4. A polar coordinate transformation converts the spatial cube models of the same behavior at the individual viewing angles into an angle-invariant cylinder space mapping, after which a CNN performs the training and classification.

The advantages of the present invention are as follows: the CNN approach yields global, high-level features which, after STN strengthening, are robust to real-life video; the RNN then establishes the temporal information; finally, the polar coordinate transformation fuses the different features of the multiple viewing angles, and a CNN is used to train and classify the resulting angle-invariant descriptors, without the traditional skeleton and key-point extraction. The global features are therefore more comprehensive, and the RNN captures inter-frame temporal information, making the behavior description more complete and more widely applicable.

Brief Description of the Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flow chart of the training process of the present invention;

Fig. 2 is a schematic flow chart of the recognition process of the present invention;

Fig. 3 is a schematic diagram of a general human behavior recognition pipeline;

Fig. 4 is a simplified flow chart of low-level feature extraction and modeling;

Fig. 5 is a processing flow chart of a general CNN;

Fig. 6 is a simplified structure diagram of a general RNN;

Fig. 7 is a block diagram of an LSTM;

Fig. 8 is a flow chart of fusing and classifying the individual viewing angles;

Fig. 9 is a schematic diagram of the model obtained after the polar coordinate transformation of the Motion History Volume in Fig. 8.

Detailed Description of the Embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, and not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

As shown in Fig. 1 and Fig. 2, the view-independent behavior recognition method based on a deep learning network of the present invention comprises a training process that obtains a classifier from a training sample set, and a recognition process that uses the classifier to identify test samples.

The training process, shown in Fig. 1, comprises the following steps:

S1) inputting the video frame images Image 1 to Image i captured at a given viewing angle in chronological order;

S2) extracting low-level features from the images input in step S1) using a CNN and pooling them, then strengthening the pooled low-level features with an STN;

S3) pooling the feature maps strengthened in step S2) and feeding them into the RNN for temporal modeling, obtaining a temporally associated cube model;

S4) repeating steps S1) to S3) to obtain spatial cube models of the same behavior under multiple viewing angles, converting the spatial cube models of all viewing angles into a single view-invariant cylinder feature space mapping, and inputting it into the classifier as a training sample of that behavior class;

S5) repeating the above steps to obtain view-independent classifiers for the various behaviors.

The recognition process, shown in Fig. 2, comprises the following steps:

S6) inputting video frame images captured at a given viewing angle and applying steps S1) to S3) to extract low-level features and build the spatial cube model for that viewing angle;

S7) converting the spatial cube model obtained in step S6) into a view-invariant cylinder feature space mapping and inputting it into the classifier for recognition to obtain the video behavior category.

In the above technical solution, step S2) preferably uses a three-layer convolution operation to extract the low-level features, and steps S2) and S3) preferably use max pooling to reduce the dimensionality of the feature maps.

In the above technical solution, step S3) yields the spatial cube model of a behavior at one particular viewing angle; repeating steps S1) to S3) yields the spatial cube models of the same behavior at multiple viewing angles.

In the technical solution of the present invention, an LSTM (Long Short-Term Memory) network is preferably used for the temporal modeling: since back-propagation in a deep learning network uses stochastic gradient descent, the special gate operations of the LSTM help prevent the vanishing-gradient problem across layers.

In the above technical solution, step S4) specifically comprises:

S41) repeating steps S1) to S3) to obtain the spatial cube models of the same behavior at each viewing angle, and integrating them into a cylindrical space with x, y, and z as coordinate axes, the cylindrical space representing the trajectory description of the motion features under each viewing angle;

S42) applying a polar coordinate transformation to the model obtained in step S41) to obtain an angle-invariant cylinder space mapping.

The above technical solution further comprises: S0) constructing a data set.

The present invention preferably uses the IXMAS data set, which contains five different viewing angles and 12 subjects, each performing 14 actions, with every action repeated three times. Eleven of the subjects are used as the training data set and the remaining one as the test data set.
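A minimal sketch of this leave-one-subject-out split follows; the clip dictionary layout and the load_ixmas_clips() loader are hypothetical placeholders, not part of any published IXMAS tooling.

```python
def split_ixmas(clips, test_actor):
    """clips: list of dicts like {"actor": int, "view": int, "action": str, "frames": ...}."""
    train = [c for c in clips if c["actor"] != test_actor]
    test = [c for c in clips if c["actor"] == test_actor]
    return train, test

# Example: hold out subject 11, train on subjects 0-10 across all five views.
# clips = load_ixmas_clips("/path/to/IXMAS")   # hypothetical loader, not a real API
# train_set, test_set = split_ixmas(clips, test_actor=11)
```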

Specifically, to recognize the behavior "running", for example, running videos of the 12 subjects are first collected at the five viewing angles; the running videos of 11 subjects serve as the training data set and the remaining subject as the validation data set. The running video frame images of one subject at one viewing angle are first processed according to steps S1) to S3), which ultimately yields the temporally associated cube model of the "running" video behavior at that viewing angle, i.e., the spatial cube model of the "running" behavior at that angle. Steps S1) to S3) are then repeated to obtain, in turn, the spatial cube models of the "running" behavior at the other four viewing angles. The spatial cube models of the "running" behavior at these five viewing angles are converted into a single view-invariant cylinder feature space mapping, which is used as a training sample of this subject's "running" behavior class and input to the classifier for training. After training on the samples of many different subjects, a view-independent classifier for the "running" behavior is obtained. View-independent classifiers for other video behaviors can be constructed in the same way.

During recognition, steps S6) and S7) are executed: the video frame images of a test subject at a given viewing angle are first processed according to steps S1) to S3) to obtain the spatial cube model of the behavior at that viewing angle, which is then converted by the polar coordinate transformation into a cylinder feature space mapping and input into the classifier to identify the behavior category. The recognition process is the same for the other viewing angles.

To aid the understanding and exposition of the technical solution of the present invention, the relevant techniques involved in it are explained and analyzed in detail below.

The method model of the present invention comprises two main stages: first, extraction and modeling of the low-level features; second, fusion and classification across the viewing angles. The main innovations are as follows.

The general pipeline of human behavior recognition is shown in Fig. 3. The feature extraction and feature representation stages in this figure are the core of behavior recognition; their results ultimately determine the recognition accuracy and the robustness of the algorithm. The present invention adopts deep learning for feature extraction.

Fig. 4 shows a simplified flow chart of low-level feature extraction and modeling.

In the technical solution of the present invention, the deep learning framework used is Caffe. The video frames Image 1 to Image i at one viewing angle in Fig. 4 are fed into the network in chronological order. A CNN first extracts features from the input images; an STN then strengthens those features so that they acquire a degree of robustness to translation, scale change, and angle change; the feature maps are then pooled, here with max pooling; the pooled feature maps are fed into the RNN layer for temporal modeling; and the result is a sequence of feature maps with inter-frame temporal association (Feature Maps Sequences).

The technical solution of the present invention uses a three-layer convolution operation to extract the low-level features, followed by max pooling to reduce their dimensionality. The pooled feature maps are fed into the STN layer, whose role is to make the resulting features robust to translation, rotation, and scale change. The feature maps output by the STN are max-pooled again for further dimensionality reduction and then fed into the RNN to incorporate temporal information; finally, the resulting feature maps are stacked in chronological order into a spatial cube. The RNN used in the present invention is an LSTM network: since back-propagation in a deep learning network uses stochastic gradient descent, the special gate operations of the LSTM prevent the vanishing-gradient problem across layers.
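As a rough illustration only, a minimal PyTorch sketch of the per-frame path and the temporal modeling described in the last two paragraphs is given below; the patent itself uses Caffe, and every concrete choice here (3x3 kernels, channel counts, the 4x4 adaptive pooling, 256 LSTM units) is an assumption rather than a value disclosed in the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Per-frame path of Fig. 4: three conv layers -> max pooling -> STN -> max pooling."""
    def __init__(self):
        super().__init__()
        # Three-layer convolution for low-level feature extraction (step S2).
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # first max pooling
        )
        # STN localization network: regresses a 2x3 affine matrix used to
        # re-sample ("strengthen") the pooled feature maps.
        self.loc = nn.Sequential(
            nn.Conv2d(64, 16, 5), nn.MaxPool2d(2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(32), nn.ReLU(),
        )
        self.loc_head = nn.Linear(32, 6)
        self.loc_head.weight.data.zero_()                     # start from the identity transform
        self.loc_head.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
        self.out_pool = nn.AdaptiveMaxPool2d((4, 4))          # second pooling, fixed output size

    def forward(self, x):                                     # x: (B, 3, H, W)
        f = self.conv(x)
        theta = self.loc_head(self.loc(f)).view(-1, 2, 3)
        grid = F.affine_grid(theta, f.size(), align_corners=False)
        f = F.grid_sample(f, grid, align_corners=False)       # STN re-sampling
        return self.out_pool(f).flatten(1)                    # per-frame feature vector (B, 1024)

class ViewSequenceModel(nn.Module):
    """Encodes one view's frames and keeps the LSTM outputs in time order,
    i.e. the temporally associated 'cube' of features (step S3)."""
    def __init__(self, feat_dim=64 * 4 * 4, hidden=256):
        super().__init__()
        self.encoder = FrameEncoder()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, frames):                                # frames: (B, T, 3, H, W)
        B, T = frames.shape[:2]
        f = self.encoder(frames.reshape(B * T, *frames.shape[2:]))
        out, _ = self.lstm(f.reshape(B, T, -1))
        return out                                            # (B, T, hidden), time-linked features
```

In the patent's scheme the per-view outputs of such a model would be stacked into the spatial cube and, after the cylinder-space mapping described below, passed to a CNN classifier.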

Among the techniques involved, the CNN is an efficient recognition method developed in recent years that has attracted wide attention. In the 1960s, while studying the neurons responsible for local sensitivity and orientation selectivity in the cat's cerebral cortex, Hubel and Wiesel found that their distinctive network structure could effectively reduce the complexity of a feedback neural network, which led to the proposal of the CNN. Today the CNN is a research hotspot in many scientific fields, particularly pattern classification, where it is widely used because it avoids complex image pre-processing and can take raw images directly as input.

In general, the basic structure of a CNN includes two kinds of layer. The first is the feature extraction layer: the input of each neuron is connected to a local receptive field of the previous layer and extracts the corresponding local feature; once that local feature has been extracted, its positional relationship to the other features is also fixed. The second is the feature mapping layer: each computational layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons in a plane share equal weights.

The technical solution of the present invention uses the feature mapping layers to extract the global low-level features of the video frame images, and then processes these low-level features at a deeper level.

The general processing flow of a CNN is shown in Fig. 5.

The layers used by the technical solution of the present invention are the feature maps obtained after convolution; the subsequent pooling and fully connected layers are ignored. A CNN yields the feature information of a single image, whereas the data to be processed here is video, so temporal information must be introduced; a CNN alone therefore cannot meet the requirements of processing video behavior.

In the above technical solution, the RNN, or recurrent neural network, was developed on the basis of feed-forward neural networks (FNNs). Unlike traditional FNNs, the RNN introduces directed cycles, allowing it to handle problems in which successive inputs are related to one another. The RNN contains input units, whose input set is denoted {x0, x1, ..., xt-1, xt, xt+1, ...}, and output units, whose output set is denoted {o0, o1, ..., ot-1, ot, ot+1, ...}. It also contains hidden units, whose output set is denoted {s0, s1, ..., st-1, st, st+1, ...}; these hidden units perform the principal work.

Fig. 6 shows a simplified structure of a general RNN. In Fig. 6, one unidirectional flow of information runs from the input units to the hidden units, while another unidirectional flow runs from the hidden units to the output units. In some cases the RNN relaxes the latter restriction and routes information from the output units back to the hidden units (so-called "back projections"); in addition, the input of the hidden layer also includes the previous hidden-layer state, i.e., the nodes within the hidden layer may be self-connected or interconnected. Temporal information is therefore linked inside the hidden layer, and no extra handling of time is required. This is a major advantage of the RNN for processing video behavior features, which is why processing involving temporal information is generally handed to an RNN in deep learning.
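For reference, the standard simple-RNN recurrence relating these three families of units can be written as below; the weight matrices U, W, V and the activations f, g are conventional notation assumed here, not symbols defined in the patent.

```latex
s_t = f\left(U x_t + W s_{t-1}\right), \qquad o_t = g\left(V s_t\right)
```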

On the basis of the RNN, a newer model for processing temporal information has been developed: the Long Short-Term Memory (LSTM). Because back-propagation in a deep learning network uses stochastic gradient descent, the RNN suffers from a vanishing-gradient problem, meaning that nodes at later time steps become less and less sensitive to nodes at earlier time steps. The LSTM therefore introduces a core element, the cell. A rough block diagram of the LSTM is shown in Fig. 7.
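The "special gate operations" referred to above are, in the standard textbook formulation of the LSTM cell (notation again assumed rather than taken from the patent):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \qquad
f_t  = \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \qquad
\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad
h_t = o_t \odot \tanh(c_t)
\end{aligned}
```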

Fig. 8 shows the flow chart of fusing and classifying the individual viewing angles.

Following the method of Fig. 4, the spatial cube models of the same action at multiple viewing angles are obtained, and the spatial cube models of the individual viewing angles are then integrated into a cylindrical space with x, y, and z as coordinate axes; this cylindrical space represents the trajectory description of the motion features under each viewing angle. A polar coordinate transformation is then applied to map it into the space of the r, θ, and z coordinate axes, as given by the following formula:
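The formula itself appears only as an image in the published document; the standard Cartesian-to-cylindrical transform that the surrounding text describes is presumably

```latex
r = \sqrt{x^{2} + y^{2}}, \qquad \theta = \arctan\!\left(\frac{y}{x}\right), \qquad z = z
```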

This yields an angle-invariant cylinder space mapping (Invariant Cylinder Space Map). Finally, the resulting cylinder space mapping is input into the classifier to obtain the behavior category; classification here is performed with a CNN rather than an SVM classifier, since the CNN was originally designed for classification. The Motion History Volume of Fig. 8 and the model obtained after its polar coordinate transformation are shown in Fig. 9.
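A minimal NumPy sketch of this re-mapping follows, assuming the motion history volume is available as a dense grid V[x, y, z]; the bin counts and the nearest-neighbour sampling are illustrative choices, not values from the patent. After the transform, a rotation of the subject about the vertical axis becomes a circular shift along the θ axis of the map, which is what makes the representation convenient for view-independent classification.

```python
import numpy as np

def to_cylinder_map(volume, n_r=32, n_theta=64):
    """Resample a Cartesian motion-history volume V[x, y, z] onto an (r, theta, z) grid."""
    X, Y, Z = volume.shape
    cx, cy = (X - 1) / 2.0, (Y - 1) / 2.0          # put the cylinder axis at the volume centre
    r = np.linspace(0.0, np.hypot(cx, cy), n_r)
    theta = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(r, theta, indexing="ij")
    # Invert the polar transform: for each (r, theta) bin, look up the Cartesian cell it came from.
    xs = np.clip(np.rint(cx + rr * np.cos(tt)).astype(int), 0, X - 1)
    ys = np.clip(np.rint(cy + rr * np.sin(tt)).astype(int), 0, Y - 1)
    return volume[xs, ys, :]                        # shape (n_r, n_theta, Z)

# Toy usage: a random occupancy volume standing in for the motion history volume.
# cube = (np.random.rand(64, 64, 20) > 0.9).astype(float)
# cyl_map = to_cylinder_map(cube)                   # fed to the CNN classifier in the patent's scheme
```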

The low-level information extracted by the deep learning approach of the technical solution of the present invention is higher-level and more robust than the spatio-temporal feature points and skeleton information used by traditional methods.

The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (6)

1. A view-independent behavior recognition method based on a deep learning network, comprising a training process that obtains a classifier from a training sample set and a recognition process that uses the classifier to identify test samples, characterized in that:

the training process comprises the following steps:

S1) inputting the video frame images Image 1 to Image i captured at a given viewing angle in chronological order;

S2) extracting low-level features from the images input in step S1) using a CNN and pooling them, then strengthening the pooled low-level features with an STN;

S3) pooling the feature maps strengthened in step S2) and feeding them into an RNN for temporal modeling, obtaining a temporally associated cube model;

S4) repeating steps S1) to S3) to obtain spatial cube models of the same behavior under multiple viewing angles, converting the spatial cube models of all viewing angles into a view-invariant cylinder feature space mapping, and inputting it into the classifier as a training sample of that behavior class;

S5) repeating the above steps to obtain view-independent classifiers for the various behaviors;

the recognition process comprises the following steps:

S6) inputting video frame images captured at a given viewing angle and applying steps S1) to S3) to extract low-level features and build the spatial cube model for that viewing angle;

S7) converting the spatial cube model obtained in step S6) into a view-invariant cylinder feature space mapping and inputting it into the classifier for recognition to obtain the video behavior category.

2. The view-independent behavior recognition method based on a deep learning network according to claim 1, characterized in that:

step S2) uses a three-layer convolution operation to extract the low-level features.

3. The view-independent behavior recognition method based on a deep learning network according to claim 2, characterized in that:

steps S2) and S3) use max pooling to reduce the dimensionality of the feature maps.

4. The view-independent behavior recognition method based on a deep learning network according to claim 1, characterized in that:

step S3) performs the temporal modeling using an LSTM network.

5. The view-independent behavior recognition method based on a deep learning network according to claim 1, characterized in that step S4) specifically comprises:

S41) repeating steps S1) to S3) to obtain the spatial cube models of the same behavior at each viewing angle, and integrating them into a cylindrical space with x, y, and z as coordinate axes, the cylindrical space representing the trajectory description of the motion features under each viewing angle;

S42) applying a polar coordinate transformation to the model obtained in step S41) to obtain an angle-invariant cylinder space mapping.

6. The view-independent behavior recognition method based on a deep learning network according to claim 1, characterized in that it further comprises:

S0) constructing a data set.

CN201710082263.5A 2017-02-16 2017-02-16 Perspective-independent behavior recognition method based on deep learning network Active CN106909938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710082263.5A CN106909938B (en) 2017-02-16 2017-02-16 Perspective-independent behavior recognition method based on deep learning network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710082263.5A CN106909938B (en) 2017-02-16 2017-02-16 Perspective-independent behavior recognition method based on deep learning network

Publications (2)

Publication Number Publication Date
CN106909938A true CN106909938A (en) 2017-06-30
CN106909938B CN106909938B (en) 2020-02-21

Family

ID=59208388

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710082263.5A Active CN106909938B (en) 2017-02-16 2017-02-16 Perspective-independent behavior recognition method based on deep learning network

Country Status (1)

Country Link
CN (1) CN106909938B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463878A (en) * 2017-07-05 2017-12-12 成都数联铭品科技有限公司 Human bodys' response system based on deep learning
CN107609541A (en) * 2017-10-17 2018-01-19 哈尔滨理工大学 A kind of estimation method of human posture based on deformable convolutional neural networks
CN107679522A (en) * 2017-10-31 2018-02-09 内江师范学院 Action identification method based on multithread LSTM
CN108121961A (en) * 2017-12-21 2018-06-05 华自科技股份有限公司 Inspection Activity recognition method, apparatus, computer equipment and storage medium
CN108764050A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Skeleton Activity recognition method, system and equipment based on angle independence
CN112287754A (en) * 2020-09-23 2021-01-29 济南浪潮高新科技投资发展有限公司 Violence detection method, device, equipment and medium based on neural network
CN112686111A (en) * 2020-12-23 2021-04-20 中国矿业大学(北京) Attention mechanism-based multi-view adaptive network traffic police gesture recognition method
CN113111721A (en) * 2021-03-17 2021-07-13 同济大学 Human behavior intelligent identification method based on multi-unmanned aerial vehicle visual angle image data driving
CN113239819A (en) * 2021-05-18 2021-08-10 西安电子科技大学广州研究院 Visual angle normalization-based skeleton behavior identification method, device and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1218936A (en) * 1997-09-26 1999-06-09 松下电器产业株式会社 gesture recognition device
CN101216896A (en) * 2008-01-14 2008-07-09 浙江大学 A View-Independent Human Action Recognition Method Based on Template Matching
CN103310233A (en) * 2013-06-28 2013-09-18 青岛科技大学 Similarity mining method of similar behaviors between multiple views and behavior recognition method
CN105956560A (en) * 2016-05-06 2016-09-21 电子科技大学 Vehicle model identification method based on pooling multi-scale depth convolution characteristics
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1218936A (en) * 1997-09-26 1999-06-09 松下电器产业株式会社 gesture recognition device
CN101216896A (en) * 2008-01-14 2008-07-09 浙江大学 A View-Independent Human Action Recognition Method Based on Template Matching
CN103310233A (en) * 2013-06-28 2013-09-18 青岛科技大学 Similarity mining method of similar behaviors between multiple views and behavior recognition method
CN105956560A (en) * 2016-05-06 2016-09-21 电子科技大学 Vehicle model identification method based on pooling multi-scale depth convolution characteristics
CN106203283A (en) * 2016-06-30 2016-12-07 重庆理工大学 Based on Three dimensional convolution deep neural network and the action identification method of deep video

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEXANROS: "View-independent human action recognition base on multi-view action images and discriminant learning", 《IVMSP 2013》 *
JEFF DONAHUE: "Long-term Recurrent Convolutional Networks forVisual Recognition and Description", 《IEEE》 *
MYUNG-CHEOL ROH: "View-independent human action recognition with Volume Motion Template", 《PATTERN RECOGNITION LETTERS》 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463878A (en) * 2017-07-05 2017-12-12 成都数联铭品科技有限公司 Human bodys' response system based on deep learning
CN107609541A (en) * 2017-10-17 2018-01-19 哈尔滨理工大学 A kind of estimation method of human posture based on deformable convolutional neural networks
CN107609541B (en) * 2017-10-17 2020-11-10 哈尔滨理工大学 Human body posture estimation method based on deformable convolution neural network
CN107679522B (en) * 2017-10-31 2020-10-13 内江师范学院 Multi-stream LSTM-based action identification method
CN107679522A (en) * 2017-10-31 2018-02-09 内江师范学院 Action identification method based on multithread LSTM
CN108121961A (en) * 2017-12-21 2018-06-05 华自科技股份有限公司 Inspection Activity recognition method, apparatus, computer equipment and storage medium
CN108764050A (en) * 2018-04-28 2018-11-06 中国科学院自动化研究所 Skeleton Activity recognition method, system and equipment based on angle independence
CN108764050B (en) * 2018-04-28 2021-02-26 中国科学院自动化研究所 Method, system and equipment for recognizing skeleton behavior based on angle independence
CN112287754A (en) * 2020-09-23 2021-01-29 济南浪潮高新科技投资发展有限公司 Violence detection method, device, equipment and medium based on neural network
CN112686111A (en) * 2020-12-23 2021-04-20 中国矿业大学(北京) Attention mechanism-based multi-view adaptive network traffic police gesture recognition method
CN113111721A (en) * 2021-03-17 2021-07-13 同济大学 Human behavior intelligent identification method based on multi-unmanned aerial vehicle visual angle image data driving
CN113111721B (en) * 2021-03-17 2022-07-05 同济大学 Human behavior intelligent identification method based on multi-unmanned aerial vehicle visual angle image data driving
CN113239819A (en) * 2021-05-18 2021-08-10 西安电子科技大学广州研究院 Visual angle normalization-based skeleton behavior identification method, device and equipment
CN113239819B (en) * 2021-05-18 2022-05-03 西安电子科技大学广州研究院 Visual angle normalization-based skeleton behavior identification method, device and equipment

Also Published As

Publication number Publication date
CN106909938B (en) 2020-02-21

Similar Documents

Publication Publication Date Title
Boulahia et al. 2021 Early, intermediate and late fusion strategies for robust deep learning-based multimodal action recognition
Liao et al. 2019 Dynamic sign language recognition based on video sequence with BLSTM-3D residual networks
Zhang et al. 2020 Empowering things with intelligence: a survey of the progress, challenges, and opportunities in artificial intelligence of things
CN106909938B (en) 2020-02-21 Perspective-independent behavior recognition method based on deep learning network
Chen et al. 2019 Automatic social signal analysis: Facial expression recognition using difference convolution neural network
CN108921042B (en) 2019-08-23 A kind of face sequence expression recognition method based on deep learning
CN113673510B (en) 2024-04-26 Target detection method combining feature point and anchor frame joint prediction and regression
CN106650806A (en) 2017-05-10 Cooperative type deep network model method for pedestrian detection
CN108427921A (en) 2018-08-21 A kind of face identification method based on convolutional neural networks
CN107609460A (en) 2018-01-19 A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism
CN104463191A (en) 2015-03-25 Robot visual processing method based on attention mechanism
CN113378676A (en) 2021-09-10 Method for detecting figure interaction in image based on multi-feature fusion
CN104281853A (en) 2015-01-14 Behavior identification method based on 3D convolution neural network
CN109753897B (en) 2022-05-27 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
CN104063721B (en) 2017-06-16 A kind of human behavior recognition methods learnt automatically based on semantic feature with screening
CN113128424A (en) 2021-07-16 Attention mechanism-based graph convolution neural network action identification method
CN114821640A (en) 2022-07-29 Skeleton action identification method based on multi-stream multi-scale expansion space-time diagram convolution network
CN109508686B (en) 2022-06-28 Human behavior recognition method based on hierarchical feature subspace learning
CN116524593A (en) 2023-08-01 A dynamic gesture recognition method, system, device and medium
CN107992854A (en) 2018-05-04 Forest Ecology man-machine interaction method based on machine vision
Fan 2020 Research and realization of video target detection system based on deep learning
CN112906520A (en) 2021-06-04 Gesture coding-based action recognition method and device
CN113283334B (en) 2023-07-21 A classroom concentration analysis method, device and storage medium
Guo et al. 2024 Facial expression recognition: A review
Zhao et al. 2020 Human action recognition based on improved fusion attention CNN and RNN

Legal Events

Date Code Title Description
2017-06-30 PB01 Publication
2017-07-25 SE01 Entry into force of request for substantive examination
2020-02-21 GR01 Patent grant
2022-01-28 TR01 Transfer of patent right

Effective date of registration: 20220114

Address after: 266000 room 403-2, building A2, Qingdao National University Science Park, No. 127, huizhiqiao Road, high tech Zone, Qingdao, Shandong

Patentee after: Qingdao shengruida Technology Co.,Ltd.

Address before: 266000 Laoshan campus, Songling Road, Laoshan District, Qingdao, Shandong, China, 99

Patentee before: QINGDAO University OF SCIENCE AND TECHNOLOGY
