CN112036276B - Artificial intelligent video question-answering method - Google Patents

Artificial intelligent video question-answering method

Info

Publication number
CN112036276B
CN112036276B
Authority
CN
China
Prior art keywords
features
attention mechanism
feature
visual
question
Prior art date
2020-08-19
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010839563.5A
Other languages
Chinese (zh)
Other versions
CN112036276A (en)
Inventor
王田
李嘉锟
李泽贤
张奇鹏
彭泰膺
吕金虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-08-19
Filing date
2020-08-19
Publication date
2023-04-07
2020-08-19 Application filed by Beihang University filed Critical Beihang University
2020-08-19 Priority to CN202010839563.5A priority Critical patent/CN112036276B/en
2020-12-04 Publication of CN112036276A publication Critical patent/CN112036276A/en
2023-04-07 Application granted granted Critical
2023-04-07 Publication of CN112036276B publication Critical patent/CN112036276B/en
Status Active legal-status Critical Current
2040-08-19 Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明公开了一种人工智能视频问答方法,包括以下步骤:S1、获取视觉特征和文字特征;S2、视觉特征提取,对视觉特征和语义特征进行多模态融合,获得融合特征;S3、根据融合特征和语义特征生成答案。本发明公开的人工智能视频问答方法,参数量小,运算速度快,能够正确理解问题和备选答案、各备选答案间的逻辑关系,得到的答案准确率有明显提高。


The invention discloses an artificial intelligence video question-answering method, comprising the following steps: S1, acquiring visual features and text features; S2, visual feature extraction, in which multimodal fusion of the visual features and semantic features is performed to obtain fusion features; S3, generating answers according to the fusion features and the semantic features. The artificial intelligence video question-answering method disclosed by the invention has a small number of parameters and a fast operation speed, can correctly understand the question, the alternative answers and the logical relationships between the alternative answers, and the accuracy of the obtained answers is significantly improved.

Figure 202010839563

Description

一种人工智能视频问答方法An artificial intelligence video question answering method

技术领域Technical Field

本发明涉及一种人工智能视频问答方法,属于人工智能领域。The invention relates to an artificial intelligence video question-answering method and belongs to the field of artificial intelligence.

背景技术Background Art

在计算机硬件技术和互联网技术飞速发展下，产生了大规模的视频数据，如何利用这些数据，对视频的内容进行时空情景分析与理解，已成为日益增长的需求。With the rapid development of computer hardware technology and Internet technology, large-scale video data has been generated. How to use this data to perform spatiotemporal scenario analysis and understanding of video content has become a growing demand.

同时，自然语言是人类社会最重要的工具之一，用自然语言与计算机进行通信，计算机根据给定的视频，视觉与自然语言处理相结合，自动求解并输出相应的答案，将这个过程称之为视频问答，视频问答能够实现快速的对视频内容进行处理。At the same time, natural language is one of the most important tools of human society. Using natural language to communicate with a computer, the computer combines vision and natural language processing to automatically solve the question and output the corresponding answer for a given video; this process is called video question answering. Video question answering makes it possible to process video content quickly.

传统的视频问答,往往是解决某一特定任务,这使得模型只需要具有特定信息的理解能力即可,举例来说,通常情况下针对人体动作识别任务的模型只需要识别出人物并进行时序建模,而不需要识别出车辆、动物等其他物体,但这种模型适应性差,可扩展能力差。Traditional video question-answering often solves a specific task, which means that the model only needs to have the ability to understand specific information. For example, under normal circumstances, the model for human action recognition tasks only needs to recognize people and perform time series modeling, without having to recognize other objects such as vehicles and animals. However, this model has poor adaptability and poor scalability.

视频问答需要模型理解问题和备选答案间的逻辑关系,即一些问题会提供多个备选答案。对于这类问题,现有的研究工作采用将问题和单个答案串接形成设问句,考虑到设问句是常见且合理的语言表述形式,经过预训练的语言模型直接应用便能正确理解问题和备选答案间的逻辑关系。举例来说,模型的输入为“汽车是什么颜色的?黑色”语句和相应的视频片段,然后模型对综合二者信息给出该语句置信度评价,经过对多个备选答案组成的多个设问句评价后,选取置信度最高的作为最终答案即可。Video question answering requires the model to understand the logical relationship between questions and alternative answers, that is, some questions provide multiple alternative answers. For such questions, existing research works use the method of concatenating questions and single answers to form interrogative sentences. Considering that interrogative sentences are a common and reasonable form of language expression, the pre-trained language model can be directly applied to correctly understand the logical relationship between questions and alternative answers. For example, the input of the model is the sentence "What color is the car? Black" and the corresponding video clip. The model then gives a confidence evaluation of the sentence based on the information of the two. After evaluating multiple interrogative sentences composed of multiple alternative answers, the one with the highest confidence is selected as the final answer.

但是，这一普遍的做法忽略了备选答案间的逻辑关系。当备选答案中存在和正确答案十分相似的干扰项时，如果采用上述的置信度评价的算法，很可能对二者都给出较高且接近的置信度，从而很有可能生成错误的答案。However, this common practice ignores the logical relationships among the alternative answers. When the alternative answers contain a distractor that is very similar to the correct answer, the confidence-evaluation algorithm described above is likely to give both of them high and similar confidence scores, and is therefore likely to generate a wrong answer.

因此,亟待设计一种高准确性的人工智能视频问答方法。Therefore, it is urgent to design a highly accurate artificial intelligence video question answering method.

发明内容Summary of the invention

为了克服上述问题,本发明人进行了锐意研究,设计出一种人工智能视频问答方法,包括以下步骤:In order to overcome the above problems, the inventors conducted intensive research and designed an artificial intelligence video question answering method, which includes the following steps:

S1、获取语义特征;S1, obtain semantic features;

S2、视觉特征提取,对视觉特征和语义特征进行多模态融合,获得融合特征;S2, visual feature extraction, multi-modal fusion of visual features and semantic features to obtain fusion features;

S3、根据融合特征和语义特征生成答案。S3. Generate answers based on fusion features and semantic features.

具体地，在步骤S1中，所述采用GLoVe词嵌入模型对问题进行处理获取词向量表达，将词向量表达按语序输入双层LSTM模型，将LSTM模型的最后时刻状态的输出作为语义特征。Specifically, in step S1, the GLoVe word embedding model is used to process the question to obtain the word vector expression; the word vector expression is input into a two-layer LSTM model in word order, and the output of the LSTM model's state at the last time step is used as the semantic feature.

在一个优选的实施方式中,将问题与所有备选答案一起作为GLoVe词嵌入模型的输入,对GLoVe词嵌入模型进行训练。In a preferred embodiment, the question and all candidate answers are used as inputs of the GLoVe word embedding model to train the GLoVe word embedding model.

更优选地,在将问题与所有备选答案一起作为GLoVe词嵌入模型的输入时,对备选答案进行标注,使得将答案和问题在语义层面划分开。More preferably, when the question and all candidate answers are used as inputs of the GLoVe word embedding model, the candidate answers are labeled so that the answer and the question are separated at the semantic level.

在步骤S2中,所述视觉特征提取包括以下子步骤:In step S2, the visual feature extraction includes the following sub-steps:

S21、对视频图像在空间维度建模,获取图像特征;S21, modeling the video image in the spatial dimension to obtain image features;

S22、对图像特征在时间维度建模,提取时序特征,获得视觉特征;S22, modeling image features in the time dimension, extracting temporal features, and obtaining visual features;

所述图像特征为景物在视频一帧图像中的特征,所述时序特征为景物在视频不同帧图像中的特征。The image feature is a feature of a scene in one frame of a video, and the time sequence feature is a feature of a scene in different frames of a video.

进一步地,在步骤S21中,建立FPN模型获取目标级特征,建立ResNet模型获取图像的全局特征;Further, in step S21, an FPN model is established to obtain target-level features, and a ResNet model is established to obtain global features of the image;

在步骤S22中,通过LSTM模型对每一帧图像的图像特征进行时序分析,获得视觉特征。In step S22, the image features of each frame of the image are analyzed in time series through the LSTM model to obtain visual features.

根据本发明,在步骤S2中,通过注意力机制对语义特征和视觉特征信息进行融合,根据语义特征对视觉特征进行加权,完成视觉特征信息融合。According to the present invention, in step S2, semantic features and visual feature information are fused through an attention mechanism, and visual features are weighted according to the semantic features to complete the fusion of visual feature information.

在本发明中,所述注意力机制表示如下:In the present invention, the attention mechanism is expressed as follows:

    α_i = softmax_i( f(T, V_i) )，　Ṽ = Σ_{i=1}^{n} α_i · V_i　　（式一 / Formula 1）

其中，where:

α=[α1、α2、…、αi、…、αn]，α = [α1, α2, …, αi, …, αn],

V=[V1、V2、…、Vi、…、Vn]，V = [V1, V2, …, Vi, …, Vn],

T为语义特征，V为视觉相关特征，Vi表示第i个视觉相关特征，f(*,*)为语义特征和第i个视觉相关特征的融合函数，αi为第i个视觉相关特征的权重，Ṽ为输出的加权结果，n表示视觉相关特征向量个数。T is the semantic feature, V is the visual-related feature, Vi denotes the i-th visual-related feature, f(*,*) is the fusion function of the semantic feature and the i-th visual-related feature, αi is the weight of the i-th visual-related feature, Ṽ is the weighted output result, and n denotes the number of visual-related feature vectors.

进一步地,所述注意力机制包括空间注意力机制和时间注意力机制;Furthermore, the attention mechanism includes a spatial attention mechanism and a temporal attention mechanism;

在获得的图像特征之后,应用空间注意力机制对每一帧图像的图像特征进行加权,After obtaining the image features, the spatial attention mechanism is applied to weight the image features of each frame.

步骤S22中,在LSTM模型中应用时间注意力机制,对不同帧图像进行加权;In step S22, a temporal attention mechanism is applied in the LSTM model to weight different frame images;

在空间注意力机制中,所述视觉相关特征V是指图像特征的不同区域;In the spatial attention mechanism, the visually relevant features V refer to different regions of image features;

在时间注意力机制中,所述视觉相关特征V是指不同帧的图像;In the temporal attention mechanism, the visually relevant features V refer to images of different frames;

更进一步地,采用softmax函数对视觉相关特征的重要程度进行归一化,获得空间注意力机制中和时间注意力机制中的权重αiFurthermore, the softmax function is used to normalize the importance of visual related features to obtain the weights α i in the spatial attention mechanism and the temporal attention mechanism.

在本发明一个优选的实施方式中,采用双层感知机作为注意力机制的融合算法,感知机包括两个全连接层,分别进行空间注意力机制和时间注意力机制。In a preferred embodiment of the present invention, a two-layer perceptron is used as a fusion algorithm of the attention mechanism, and the perceptron includes two fully connected layers, which respectively perform a spatial attention mechanism and a temporal attention mechanism.

本发明所述的人工智能视频问答方法,具有的有益效果包括:The artificial intelligence video question-answering method of the present invention has the following beneficial effects:

(1)根据本发明提供的人工智能视频问答方法,性能良好,准确率较其它方法有明显提高;(1) The artificial intelligence video question-answering method provided by the present invention has good performance and its accuracy is significantly improved compared with other methods;

(2)根据本发明提供的人工智能视频问答方法,能够正确理解问题和备选答案、各备选答案间的逻辑关系;(2) According to the artificial intelligence video question-answering method provided by the present invention, it is possible to correctly understand the logical relationship between questions and alternative answers, and between the alternative answers;

(3)根据本发明提供的人工智能视频问答方法,参数量小,运算速度快。(3) The artificial intelligence video question-answering method provided by the present invention has a small number of parameters and a fast calculation speed.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1示出一种优选实施方式的人工智能视频问答方法流程示意图;FIG1 is a schematic diagram showing a flow chart of an artificial intelligence video question-answering method according to a preferred embodiment;

图2示出实施例1中人工智能视频问答方法流程示意图;FIG2 is a schematic diagram showing a flow chart of the artificial intelligence video question-answering method in Example 1;

图3示出实施例1中的问题形式。FIG. 3 shows the question format in Example 1.

具体实施方式DETAILED DESCRIPTION

下面通过附图对本发明进一步详细说明。通过这些说明,本发明的特点和优点将变得更为清楚明确。The present invention will be further described in detail below with reference to the accompanying drawings. Through these descriptions, the features and advantages of the present invention will become more clear and distinct.

本发明提供了一种人工智能视频问答方法,如图1所示,包括以下步骤:The present invention provides an artificial intelligence video question answering method, as shown in FIG1, comprising the following steps:

S1、获取语义特征;S1, obtain semantic features;

S2、视觉特征提取,对视觉特征和语义特征进行多模态融合,获得融合特征;S2, visual feature extraction, multi-modal fusion of visual features and semantic features to obtain fusion features;

S3、根据融合特征和语义特征生成答案。S3. Generate answers based on fusion features and semantic features.

在步骤S1中,视频问答中的问题是以自然语言表达的,所述语义特征是指能够表征问题的特征,In step S1, the question in the video question answering is expressed in natural language, and the semantic feature refers to a feature that can characterize the question.

在本发明中，采用GLoVe词嵌入模型对问题进行处理获取词向量表达，将词向量表达按语序输入双层LSTM模型，将LSTM模型的最后时刻状态的输出作为语义特征。In the present invention, the GLoVe word embedding model is used to process the question to obtain the word vector expression; the word vector expression is input into a two-layer LSTM model in word order, and the output of the LSTM model's state at the last time step is used as the semantic feature.

相较于ELMo、BERT等模型，GLoVe词嵌入模型具有自身的参数量小、计算速度快等优势，尤其适合自然语言有语句短、单词常见的视频问答中，其只需要分析单个问句、用词常见，训练的规模无需过大即可满足要求。Compared with models such as ELMo and BERT, the GLoVe word embedding model has the advantages of a small number of parameters and fast computation. It is particularly suitable for video question answering, where the natural-language questions are short sentences made of common words: only a single question needs to be analyzed at a time, the vocabulary is common, and the training scale does not need to be large to meet the requirements.
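As an illustration only (this code is not part of the patent), the following minimal PyTorch sketch mirrors the step-S1 text encoder described above: a lookup table that can be initialized from pretrained GLoVe vectors produces word vectors, which are fed in word order to a two-layer LSTM whose final state is taken as the semantic feature. PyTorch itself, the module name QuestionEncoder, and the embedding and hidden dimensions are assumptions made for this sketch.

import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Step-S1 sketch: GLoVe-style embedding lookup + two-layer LSTM, last state as the semantic feature."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=512, glove_weights=None):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        if glove_weights is not None:
            # In practice the rows would be copied from pretrained GLoVe vectors.
            self.embedding.weight.data.copy_(glove_weights)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, token_ids):                 # token_ids: [batch, seq_len]
        vectors = self.embedding(token_ids)       # [batch, seq_len, embed_dim]
        _, (h_n, _) = self.lstm(vectors)          # h_n: [num_layers, batch, hidden_dim]
        return h_n[-1]                            # state of the top layer at the last time step

encoder = QuestionEncoder()
semantic_feature = encoder(torch.randint(0, 10000, (2, 12)))   # a toy batch of two questions
print(semantic_feature.shape)                                  # torch.Size([2, 512])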

在使用GLoVe词嵌入模型之前，需要将问题和答案进行关联，对GLoVe词嵌入模型进行训练，以使GLoVe词嵌入模型能够理解问题的含义。Before the GLoVe word embedding model is used, the question and the answers need to be associated and the GLoVe word embedding model trained, so that the GLoVe word embedding model can understand the meaning of the question.

在一个优选的实施方式中，将问题和单个备选答案串接成一个设问句的形式，由于备选答案有多个，能够形成多个设问句，将所有设问句与对应的视频片段作为模型输入，正确答案作为输出，优选将正确答案的编号作为输出在对GLoVe词嵌入模型进行训练，训练后的GLoVe词嵌入模型可以对多个设问句进行置信度评价，从而选取置信度最高的作为最终答案，从而获取语义特征。In a preferred embodiment, the question and a single candidate answer are concatenated into an interrogative sentence; since there are multiple candidate answers, multiple interrogative sentences can be formed. All the interrogative sentences and the corresponding video clips are used as the model input and the correct answer as the output, preferably with the index of the correct answer as the output, to train the GLoVe word embedding model. The trained GLoVe word embedding model can then evaluate the confidence of the multiple interrogative sentences, and the one with the highest confidence is selected as the final answer, thereby obtaining the semantic features.

发明人发现，采用上述方式，当备选答案中存在和正确答案十分相似的干扰项时，干扰项与正确答案的置信度都较高且接近，导致生成答案错误率高，如何解决干扰项的问题是本发明的难点所在。The inventors found that, with the above approach, when the candidate answers contain a distractor that is very similar to the correct answer, the confidences of the distractor and the correct answer are both high and close to each other, resulting in a high error rate of the generated answers. How to deal with such distractors is the key difficulty addressed by the present invention.

在一个更优选的实施方式中,将问题与所有备选答案一起作为GLoVe词嵌入模型的输入,将正确答案作为输出,对GLoVe词嵌入模型进行训练。In a more preferred embodiment, the question together with all the candidate answers is used as the input of the GLoVe word embedding model, and the correct answer is used as the output to train the GLoVe word embedding model.

进一步地,在将问题与所有备选答案一起作为GLoVe词嵌入模型的输入时,对备选答案进行标注,例如在备选答案前添加“<aa>”(alternative answers)作为标识符,从而在语义层面将答案和问题进行划分。Furthermore, when the question and all alternative answers are used as input to the GLoVe word embedding model, the alternative answers are labeled, for example, "<aa>" (alternative answers) is added as an identifier before the alternative answers, so as to divide the answer and the question at the semantic level.

通过将问题与所有备选答案一起作为输入,并对备选答案进行标注的方式,使得模型能够正确理解问题和备选答案,以及能够正确理解各备选答案间的逻辑关系,从而提高生成答案的正确率。By taking the question and all the alternative answers as input and labeling the alternative answers, the model can correctly understand the question and the alternative answers, as well as the logical relationship between the alternative answers, thereby improving the accuracy of generated answers.
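A hypothetical Python sketch (not from the patent; whitespace tokenization and the example answers are simplifying assumptions) of how the step-S1 text input could be assembled as described above: the question is concatenated with all candidate answers, each answer prefixed with the "<aa>" marker.

def build_question_input(question, candidate_answers):
    """Concatenate the question with all candidate answers, inserting the "<aa>" marker before each answer."""
    tokens = question.strip().split()
    for answer in candidate_answers:
        tokens.append("<aa>")                 # marker that separates answers from the question
        tokens.extend(answer.strip().split())
    return tokens

tokens = build_question_input(
    "What does the man do before walk away",
    ["blink eyes", "wave hand", "turn around"],   # illustrative candidate answers
)
print(tokens)
# ['What', 'does', 'the', 'man', 'do', 'before', 'walk', 'away',
#  '<aa>', 'blink', 'eyes', '<aa>', 'wave', 'hand', '<aa>', 'turn', 'around']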

所述LSTM模型是一种特殊的循环神经网络模型,能够学习长期的规律,是由Hochreiter&Schmidhuber在1997年首先提出的,其在循环神经网络(RNN)的基础上加入了多个门控结构,允许信息持续存在,通过循环网络,实现语义的持续输出。The LSTM model is a special recurrent neural network model that can learn long-term rules. It was first proposed by Hochreiter & Schmidhuber in 1997. It adds multiple gating structures on the basis of the recurrent neural network (RNN), allowing information to persist and achieve continuous semantic output through the recurrent network.

在步骤S2中,所述视觉特征提取过程包括图像特征提取和时序特征提取,所述图像特征为景物在视频一帧图像中的特征,所述时序特征为景物在视频不同帧图像中的特征,在本发明中,所述视觉特征提取包括以下子步骤:In step S2, the visual feature extraction process includes image feature extraction and time sequence feature extraction. The image feature is a feature of a scene in one frame of a video, and the time sequence feature is a feature of a scene in different frames of a video. In the present invention, the visual feature extraction includes the following sub-steps:

S21、对视频图像在空间维度建模,获取图像特征;S21, modeling the video image in the spatial dimension to obtain image features;

所述图像特征包括图像中的目标级特征和全局特征。The image features include object-level features and global features of the image.

不同于传统的视频问答,本发明的图像特征不仅具有目标级特征,还具有全局特征,使得模型能够全面的理解视频的内容,从而获取更加全面、有效的视觉特征,以满足更为复杂的问答内容。Different from traditional video question answering, the image features of the present invention not only have target-level features but also global features, so that the model can fully understand the content of the video, thereby obtaining more comprehensive and effective visual features to meet more complex question-answering content.

在本发明中,针对视频的每一帧图像,在空间维度进行建模,通过模型获得每一帧图像的图像特征。In the present invention, a model is built in the spatial dimension for each frame of the video, and the image features of each frame are obtained through the model.

具体地,建立FPN模型获取目标级特征,建立ResNet模型获取图像的全局特征,进一步地,将包含日常生活场景的数据集作为FPN模型和ResNet模型的训练样本。Specifically, an FPN model is established to obtain target-level features, and a ResNet model is established to obtain global features of the image. Furthermore, a dataset containing daily life scenes is used as training samples for the FPN model and the ResNet model.

优选地,将COCO数据集作为FPN模型的训练样本,所述COCO数据集是由微软发布的一个专为对象检测、分割、人体关键点检测、语义分割和字幕生成而设计大型图像数据库,其包括超过200,000张图像,图像中包括80类生活常见目标。Preferably, the COCO dataset is used as a training sample for the FPN model. The COCO dataset is a large-scale image database released by Microsoft and designed for object detection, segmentation, human key point detection, semantic segmentation and caption generation. The COCO dataset includes more than 200,000 images, including 80 categories of common objects in life.

优选地,将ImageNet数据集作为ResNet模型的训练样本,所述ImageNet数据集是一个用于视觉对象识别软件研究的大型可视化数据库,包含超过14,000,000张图像,图像中包括1000类生活常见对象,如各类动物、各类交通工具等。Preferably, the ImageNet dataset is used as a training sample for the ResNet model. The ImageNet dataset is a large-scale visualization database for visual object recognition software research, containing more than 14,000,000 images, including 1,000 categories of common objects in life, such as various animals, various means of transportation, etc.
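As a rough illustration (not the patent's own code), the following PyTorch/torchvision sketch extracts per-frame global features with an ImageNet-pretrained ResNet-50, as in step S21; the FPN branch for object-level features is omitted here, and the use of torchvision, the frame count and the input size are assumptions. The 7x7 feature map of each frame is flattened into 49 regions, matching the spatial-attention example given further below.

import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-50; drop the average-pooling and classification layers
# so that a spatial feature map is kept for each frame.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone = nn.Sequential(*list(resnet.children())[:-2])
backbone.eval()

frames = torch.randn(60, 3, 224, 224)        # a 60-frame clip (ImageNet-normalized in practice)
with torch.no_grad():
    feature_maps = backbone(frames)          # [60, 2048, 7, 7]
regions = feature_maps.flatten(2).transpose(1, 2)   # [60, 49, 2048]: 49 spatial regions per frame
print(regions.shape)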

S22、对图像特征在时间维度建模,提取时序特征,获得视觉特征。S22. Model the image features in the time dimension, extract the temporal features, and obtain the visual features.

在步骤S21中，图像特征基于单帧获取，只采集到了空间维度的视觉信息，在获得空间维度的信息后，需要对图像的时间维度进行建模，获取视觉特征。In step S21, the image features are acquired from single frames, so only the visual information of the spatial dimension is collected. After the spatial-dimension information is obtained, the time dimension of the images needs to be modeled to obtain the visual features.

优选地,通过LSTM模型对每一帧图像的图像特征进行时序分析,进而获得视觉特征。Preferably, the image features of each frame of the image are analyzed in time series through the LSTM model to obtain the visual features.

所述LSTM模型以当前时刻信息x_t和上一时刻状态h_{t-1}作为输入，输出当前时刻状态h_t，从而可以完成时间序列分析，获得视觉特征。The LSTM model takes the current moment information x_t and the previous moment state h_{t-1} as input, and outputs the current moment state h_t, thereby completing the time series analysis and obtaining the visual features.
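A minimal sketch (PyTorch assumed, dimensions illustrative) of the step-S22 temporal modeling described above: the per-frame features are fed to a two-layer LSTM, which at each step combines the current input x_t with the previous state h_{t-1} to produce h_t.

import torch
import torch.nn as nn

frame_features = torch.randn(1, 60, 2048)     # [batch, frames, per-frame feature dimension]
temporal_lstm = nn.LSTM(2048, 512, num_layers=2, batch_first=True)
hidden_states, (h_n, _) = temporal_lstm(frame_features)
print(hidden_states.shape)                    # [1, 60, 512]: h_t for every frame t
clip_feature = h_n[-1]                        # final state, usable as the clip-level visual feature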

进一步地,在步骤S2中,需要将视觉特征和语义特征进行多模态融合,通过语义特征对视觉特征信息的进一步筛选与理解,去除掉与语义特征关联度较低的视觉特征信息,从而获得融合特征。Furthermore, in step S2, it is necessary to perform multimodal fusion of visual features and semantic features, further screen and understand the visual feature information through semantic features, remove the visual feature information with low correlation with the semantic features, and thus obtain fused features.

如何将视觉特征与语义特征进行融合是难点所在,若直接将视觉特征与语义特征进行拼接,然后采用全连接网络进一步提取特征后输出答案,由于忽略了两种特征向量的差异,得到的结果较差。The difficulty lies in how to fuse visual features with semantic features. If the visual features and semantic features are directly concatenated, and then a fully connected network is used to further extract the features and output the answer, the result will be poor because the difference between the two feature vectors is ignored.

在本发明中,通过注意力机制对语义特征和视觉特征信息进行融合,具体地,根据语义特征对视觉特征进行加权,完成视觉特征信息融合。In the present invention, semantic features and visual feature information are fused through an attention mechanism. Specifically, visual features are weighted according to semantic features to complete the fusion of visual feature information.

根据语义特征对视觉特征进行加权，实现了视觉特征与语义特征的信息交互，同时实现了对视觉信息的筛选，增强了视觉信息中和问题相关的特征，削弱了无关的特征。By weighting the visual features according to the semantic features, the information interaction between the visual features and the semantic features is realized. At the same time, the visual information is screened: the features in the visual information that are relevant to the question are enhanced, and the irrelevant features are weakened.

所述注意力机制是机器学习中的一种数据处理方法,广泛应用在自然语言处理、图像识别及语音识别等各种不同类型的机器学习任务中,其可以根据具体任务目标,对关注的方向和加权模型进行调整,可以附着在多种神经网络模型下。The attention mechanism is a data processing method in machine learning, which is widely used in various types of machine learning tasks such as natural language processing, image recognition and speech recognition. It can adjust the focus direction and weighting model according to specific task objectives, and can be attached to a variety of neural network models.

在本发明中,通过注意力机制将视觉特征信息中与语义特征相关的特征加大权重,与语义特征无关的特征减小权重。In the present invention, the weight of features related to semantic features in visual feature information is increased through the attention mechanism, and the weight of features unrelated to semantic features is reduced.

在一个优选的实施方式中,所述注意力机制可表示如下:In a preferred embodiment, the attention mechanism can be expressed as follows:

    α_i = softmax_i( f(T, V_i) )，　Ṽ = Σ_{i=1}^{n} α_i · V_i　　（式一 / Formula 1）

其中，where:

α=[α1、α2、…、αi、…、αn]，α = [α1, α2, …, αi, …, αn],

V=[V1、V2、…、Vi、…、Vn]，V = [V1, V2, …, Vi, …, Vn],

T为语义特征，V为视觉相关特征，Vi表示第i个视觉相关特征，f(*,*)为语义特征和第i个视觉相关特征的融合函数，αi为第i个视觉相关特征的权重，Ṽ为输出的加权结果，n表示视觉相关特征向量个数。T is the semantic feature, V is the visual-related feature, Vi denotes the i-th visual-related feature, f(*,*) is the fusion function of the semantic feature and the i-th visual-related feature, αi is the weight of the i-th visual-related feature, Ṽ is the weighted output result, and n denotes the number of visual-related feature vectors.

在一个优选的实施方式中，所述注意力机制包括空间注意力机制，步骤S21获得的图像特征之后，应用空间注意力机制对每一帧图像的图像特征进行加权，使得空间维度上和问题相关的图像特征被筛选出。In a preferred embodiment, the attention mechanism includes a spatial attention mechanism. After the image features are obtained in step S21, the spatial attention mechanism is applied to weight the image features of each frame of the image, so that the image features related to the question in the spatial dimension are screened out.

进一步地,在空间注意力机制中,所述视觉相关特征是指图像特征的不同区域,对不同区域进行加权,获得图像特征表达。Furthermore, in the spatial attention mechanism, the visually relevant features refer to different regions of image features, and different regions are weighted to obtain image feature expressions.

例如，若ResNet模型输出的图像特征具有49个区域，则n=49，不同区域分别表示为V1、V2、…、V49，为对这49个区域的重要程度进行打分，分别为α1、α2、…、α49；然后根据式一对49个区域进行加权合并，得到最终的图像特征表达。For example, if the image feature output by the ResNet model has 49 regions, then n = 49, and the different regions are denoted V1, V2, …, V49; the importance of these 49 regions is scored as α1, α2, …, α49; the 49 regions are then combined by weighted summation according to Formula 1 to obtain the final image feature expression.

在一个优选的实施方式中，所述注意力机制包括时间注意力机制，在步骤S22中，在LSTM模型中应用时间注意力机制，对不同帧图像进行加权，使得时间维度上和问题相关的视觉特征被筛选出。In a preferred embodiment, the attention mechanism includes a temporal attention mechanism. In step S22, the temporal attention mechanism is applied in the LSTM model to weight different frame images, so that the visual features related to the question in the temporal dimension are screened out.

进一步地,在时间注意力机制中,所述视觉相关特征是指不同帧的图像,对不同帧图像进行加权,获得视觉特征表达。Furthermore, in the temporal attention mechanism, the visual related features refer to images of different frames, and the images of different frames are weighted to obtain visual feature expressions.

例如，若视频长度为60帧，则n=60，不同帧图像分别表示为V1、V2、…、V60，为对这60帧图像的重要程度进行打分，分别为α1、α2、…、α60；然后根据式一对60帧图像进行加权合并，得到最终的视觉特征表达。For example, if the video is 60 frames long, then n = 60, and the different frame images are denoted V1, V2, …, V60; the importance of these 60 frames is scored as α1, α2, …, α60; the 60 frames are then combined by weighted summation according to Formula 1 to obtain the final visual feature expression.

在一个优选的实施方式中,采用双层感知机作为注意力机制的融合算法,感知机包括两个全连接层,分别进行空间注意力机制和时间注意力机制。In a preferred embodiment, a two-layer perceptron is used as a fusion algorithm of the attention mechanism. The perceptron includes two fully connected layers, which respectively perform a spatial attention mechanism and a temporal attention mechanism.

具体地，语义特征和视觉特征在串联之后，经过两个全连接层进行相关性分析，设置感知机中间层的特征维度为256~1024，优选为512，输出层的特征维度为1，其中，输出层的特征维度代表视觉特征的重要程度。Specifically, after the semantic feature and a visual feature are concatenated, their correlation is analyzed by two fully connected layers; the feature dimension of the middle layer of the perceptron is set to 256~1024, preferably 512, and the feature dimension of the output layer is 1, where the scalar output of the output layer represents the importance of the visual feature.

进一步地,采用softmax函数对视觉相关特征的重要程度进行归一化,即可得到空间注意力机制中和时间注意力机制中的权重αi,根据权重对视觉相关特征进行加权求和,获得融合特征。Furthermore, the importance of visual related features is normalized by using the softmax function to obtain the weights α i in the spatial attention mechanism and the temporal attention mechanism, and the visual related features are weighted and summed according to the weights to obtain the fusion features.
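As an illustrative sketch only (PyTorch assumed; the ReLU activation, feature dimensions and module name are assumptions), the following module implements the fusion described above: the semantic feature is concatenated with each visual-related feature, scored by a two-layer perceptron with a 512-dimensional middle layer and a scalar output, the scores are normalized with softmax to give the weights αi, and the features are combined by the weighted sum of Formula 1. The same module can serve as the spatial attention (n = 49 regions) or the temporal attention (n = number of frames).

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    def __init__(self, semantic_dim=512, visual_dim=2048, hidden_dim=512):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(semantic_dim + visual_dim, hidden_dim),  # first fully connected layer (middle dim 512)
            nn.ReLU(),                                          # activation choice is an assumption
            nn.Linear(hidden_dim, 1),                           # second layer: scalar importance score
        )

    def forward(self, semantic, visual):
        # semantic: [batch, semantic_dim]; visual: [batch, n, visual_dim]
        n = visual.size(1)
        expanded = semantic.unsqueeze(1).expand(-1, n, -1)               # repeat T for every V_i
        scores = self.scorer(torch.cat([expanded, visual], dim=-1))      # [batch, n, 1]
        alpha = F.softmax(scores.squeeze(-1), dim=-1)                    # weights alpha_i
        fused = torch.bmm(alpha.unsqueeze(1), visual).squeeze(1)         # sum_i alpha_i * V_i (Formula 1)
        return fused, alpha

fusion = AttentionFusion()
fused, alpha = fusion(torch.randn(2, 512), torch.randn(2, 49, 2048))     # spatial case: 49 regions
print(fused.shape, alpha.shape)    # torch.Size([2, 2048]) torch.Size([2, 49])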

在步骤S3中,将语义特征与融合特征串联,构建卷积神经网络,优选地,所述卷积神经网络采用双层全连接进行推理,生成最终的问题答案,完成视频问答任务。In step S3, the semantic features are connected in series with the fusion features to construct a convolutional neural network. Preferably, the convolutional neural network uses a double-layer full connection for reasoning to generate the final answer to the question and complete the video question-answering task.
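A minimal sketch (PyTorch assumed; the hidden width and the number of outputs are assumptions) of the step-S3 answer head described above: the semantic feature is concatenated with the fused feature and passed through two fully connected layers, and the highest-scoring output is taken as the answer.

import torch
import torch.nn as nn

class AnswerHead(nn.Module):
    def __init__(self, semantic_dim=512, fused_dim=2048, hidden_dim=512, num_outputs=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(semantic_dim + fused_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_outputs),   # e.g. one score per candidate answer
        )

    def forward(self, semantic, fused):
        return self.mlp(torch.cat([semantic, fused], dim=-1))

head = AnswerHead()
logits = head(torch.randn(2, 512), torch.randn(2, 2048))
answer_index = logits.argmax(dim=-1)              # index of the highest-scoring candidate answer
print(answer_index.shape)                         # torch.Size([2])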

实施例Example

实施例1Example 1

在大型公开视频问答数据集TGIF-QA上进行实验,其中,如图2所示,在步骤S1中,将问题与所有备选答案一起作为GLoVe词嵌入模型的输入,在备选答案前添加“<aa>”作为标识符,对GLoVe词嵌入模型进行训练,获得词向量表达,将词向量表达按语序输入双层LSTM模型,选取LSTM模型的最后时刻状态输出作为最终的语义特征;Experiments were conducted on a large-scale public video question-answering dataset, TGIF-QA. As shown in FIG2 , in step S1, the question and all the candidate answers are used as the input of the GLoVe word embedding model. “<aa>” is added before the candidate answer as an identifier. The GLoVe word embedding model is trained to obtain the word vector expression. The word vector expression is input into a two-layer LSTM model in word order. The state output of the LSTM model at the last moment is selected as the final semantic feature.

在步骤S2中,采用双层感知机作为注意力机制的融合算法,双层感知机具有两个全连接层,其中中间层的特征维度为512,输出层特征维度为1,具体地,使用在ImageNet进行预训练的ResNet-50模型提取单帧图像特征,并将语义特征与图像特征串联组成全连接层,应用空间注意力机制对每一帧图像的图像特征进行加权,将加权后的多帧图像特征输入双层LSTM进一步提取时序特征,并在LSTM模型中应用时间注意力机制,将语义特征与时序特征串联组成全连接层,输出融合特征。In step S2, a two-layer perceptron is used as a fusion algorithm of the attention mechanism. The two-layer perceptron has two fully connected layers, wherein the feature dimension of the middle layer is 512 and the feature dimension of the output layer is 1. Specifically, a ResNet-50 model pre-trained on ImageNet is used to extract single-frame image features, and the semantic features and image features are connected in series to form a fully connected layer. The spatial attention mechanism is applied to weight the image features of each frame of the image, and the weighted multi-frame image features are input into a two-layer LSTM to further extract temporal features. The temporal attention mechanism is applied in the LSTM model, and the semantic features and temporal features are connected in series to form a fully connected layer, and the fusion features are output.

其中,采用softmax函数对视觉相关特征的重要程度进行归一化,获得空间注意力机制权重和时间注意力机制权重。Among them, the softmax function is used to normalize the importance of visual related features to obtain the spatial attention mechanism weights and the temporal attention mechanism weights.

在步骤S3中,将原始语义特征与加权后的视觉特征串联,再采用双层全连接进行推理,从而生成最终的问题答案。In step S3, the original semantic features are concatenated with the weighted visual features, and then a two-layer full connection is used for reasoning to generate the final answer to the question.

数据集TGIF-QA包括约160,000个问题及对应视频,根据问题类型可分为计数问题(Count)、画面问题(Frame)、动作问题(Action)和时序问题(Transition)四类,具体各类问题及用于训练、测试的数据数量如表一所示。The dataset TGIF-QA includes about 160,000 questions and corresponding videos, which can be divided into four categories according to the question type: counting questions (Count), frame questions (Frame), action questions (Action) and timing questions (Transition). The specific types of questions and the amount of data used for training and testing are shown in Table 1.

表一Table 1

                              训练 Train    测试 Test    合计 Total
计数问题 Count questions          26,843        3,554       30,397
画面问题 Frame questions          39,392       13,691       53,083
动作问题 Action questions         20,475        2,274       22,749
时序问题 Transition questions     52,704        6,232       58,936
合计 Total                       139,414       25,751      165,165

在TGIF-QA数据集中，对于一段视频，计数问题所询问的是人物或动物执行某类动作的次数，如“How many times does the man blink eyes?”；画面问题所询问的是可由单帧画面得到的静态信息，包括物体颜色、目标数量等，如“What is the color of the cat”；动作问题所询问的是人物或动物所执行的动作类型，如“What does the girl do 3 times”；时序问题询问的是人物或动物在执行某一动作之前或之后的动作类型，如“What does the man do before walk away”，如图3所示。In the TGIF-QA dataset, for a video, counting questions ask about the number of times a person or animal performs a certain type of action, such as “How many times does the man blink eyes?”; picture questions ask about static information that can be obtained from a single frame, including object color, number of objects, etc., such as “What is the color of the cat”; action questions ask about the type of action performed by a person or animal, such as “What does the girl do 3 times”; timing questions ask about the type of action a person or animal performs before or after performing a certain action, such as “What does the man do before walk away”, as shown in Figure 3.

对比例1Comparative Example 1

采用论文Jang Y, Song Y, Yu Y, et al. Tgif-qa: Toward spatio-temporal reasoning in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766.中的方法替代实施例1中步骤S2中的视觉特征提取方式，其中，图像特征提取采用ResNet模型，时序特征提取采用C3D模型的三维时空特征。The method in the paper Jang Y, Song Y, Yu Y, et al. Tgif-qa: Toward spatio-temporal reasoning in visual question answering [C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766. is used to replace the visual feature extraction method of step S2 in Example 1, wherein the image feature extraction adopts the ResNet model, and the temporal feature extraction adopts the three-dimensional spatiotemporal features of the C3D model.

对比例2Comparative Example 2

采用论文Gao J, Ge R, Chen K, et al. Motion-appearance co-memory networks for video question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6576-6585.中的方法替代实施例1中步骤S2中的视觉特征提取方式，其中，图像特征提取采用ResNet模型，时序特征提取采用光流法（Optical Flow）。The method in the paper Gao J, Ge R, Chen K, et al. Motion-appearance co-memory networks for video question answering [C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6576-6585. is used to replace the visual feature extraction method of step S2 in Example 1, wherein the image feature extraction adopts the ResNet model and the temporal feature extraction adopts the optical flow method (Optical Flow).

对比例3Comparative Example 3

采用论文Gao L, Zeng P, Song J, et al. Structured Two-Stream Attention Network for Video Question Answering[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 6391-6398.中的模型在数据集TGIF-QA上进行与实施例1相同实验。The model in the paper Gao L, Zeng P, Song J, et al. Structured Two-Stream Attention Network for Video Question Answering [C] // Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 6391-6398. was used to conduct the same experiment as Example 1 on the dataset TGIF-QA.

对比例4Comparative Example 4

采用论文Jang Y, Song Y, Yu Y, et al. Tgif-qa: Toward spatio-temporal reasoning in visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766.中的模型在数据集TGIF-QA上进行与实施例1相同实验，即在TGIF-QA数据集中，选择相同的视频进行与实施例1相同的询问，包括计数问题、画面问题、动作问题或时序问题的询问。The model in the paper Jang Y, Song Y, Yu Y, et al. Tgif-qa: Toward spatio-temporal reasoning in visual question answering [C] // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 2758-2766. was used to conduct the same experiment as in Example 1 on the dataset TGIF-QA; that is, the same videos in the TGIF-QA dataset were selected and asked the same questions as in Example 1, including counting questions, picture questions, action questions or timing questions.

对比例5Comparative Example 5

采用实施例1中的方法进行实验,区别在于,在步骤S2中,不应用注意力机制,仅通过ResNet模型和LSTM模型获得视觉特征。The experiment was conducted using the method in Example 1, except that in step S2, the attention mechanism was not applied, and visual features were obtained only through the ResNet model and the LSTM model.

对比例6Comparative Example 6

采用实施例1中的方法进行实验,区别在于,在步骤S2中,不应用空间注意力机制。The experiment is conducted using the method in Example 1, except that in step S2, the spatial attention mechanism is not applied.

对比例7Comparative Example 7

采用实施例1中的方法进行实验,区别在于,在步骤S2中,不应用时间注意力机制。The experiment is conducted using the method in Example 1, except that in step S2, the temporal attention mechanism is not applied.

对比例8Comparative Example 8

采用实施例4中的方法进行实验,区别在于,不应用注意力机制。The experiment was conducted using the method in Example 4, except that the attention mechanism was not applied.

对比例9Comparative Example 9

采用实施例4中的方法进行实验,区别在于,不应用空间注意力机制。The experiment was conducted using the method in Example 4, except that the spatial attention mechanism was not applied.

对比例10Comparative Example 10

采用实施例4中的方法进行实验,区别在于,不应用时间注意力机制。The experiment is conducted using the method in Example 4, except that the temporal attention mechanism is not applied.

实验例1Experimental Example 1

画面问题、动作问题和时序问题都采用正确率作为评价指标,计数问题采用均方误差作为评价指标,对实施例1、对比例1、对比例2、对比例3的实验结果进行分析,正确率越高、均方误差越低说明方法的性能越好。The accuracy rate is used as the evaluation index for picture problems, action problems and timing problems, and the mean square error is used as the evaluation index for counting problems. The experimental results of Example 1, Comparative Example 1, Comparative Example 2 and Comparative Example 3 are analyzed. The higher the accuracy rate and the lower the mean square error, the better the performance of the method.
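For reference, a small Python sketch (not from the patent) of the two evaluation measures used here: accuracy for the multiple-choice question types and mean square error for counting questions.

def accuracy(predicted, ground_truth):
    """Fraction of questions whose predicted answer matches the ground truth."""
    correct = sum(p == g for p, g in zip(predicted, ground_truth))
    return correct / len(ground_truth)

def mean_square_error(predicted_counts, true_counts):
    """Mean squared difference between predicted and true counts."""
    return sum((p - t) ** 2 for p, t in zip(predicted_counts, true_counts)) / len(true_counts)

print(accuracy([1, 0, 2, 2], [1, 0, 1, 2]))        # 0.75
print(mean_square_error([3.0, 5.0], [4.0, 5.0]))   # 0.5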

其中,实施例1、对比例1和对比例2实验结果如表二所示:The experimental results of Example 1, Comparative Example 1 and Comparative Example 2 are shown in Table 2:

表二Table 2

（表二以图片形式给出，此处不再重复。）(Table 2 is provided as an image in the original document.)

从表二可知,实施例1中的方法具有较低的均方误差和较高的准确度说明实施例1中的方法十分有效,性能优异。As can be seen from Table 2, the method in Example 1 has a lower mean square error and a higher accuracy, which shows that the method in Example 1 is very effective and has excellent performance.

实验例2Experimental Example 2

以实施例1中的方法为基础进行消融实验,具体地,在步骤S1中,将自然语言的问题改为文本信息输入,在使用GLoVe词嵌入模型之前,分别采用:An ablation experiment was conducted based on the method in Example 1. Specifically, in step S1, the natural language question was changed to text information input. Before using the GLoVe word embedding model, the following were used:

将问题与所有备选答案一起作为GLoVe词嵌入模型的输入;和Feed the question along with all the possible answers as input to the GLoVe word embedding model; and

将问题和单个备选答案串接成一个设问句的形式,多个备选答案形成多个设问句,将所有设问句与对应的视频片段作为模型输入;The question and a single alternative answer are concatenated into a question sentence, and multiple alternative answers form multiple question sentences. All the question sentences and the corresponding video clips are used as model input;

两种方式进行实验，实验结果如表三所示。Experiments were conducted with these two input forms, and the experimental results are shown in Table 3.

表三Table 3

文本输入形式 Text input form                          动作问题正确率(%) Action accuracy (%)    时序问题正确率(%) Transition accuracy (%)
问题+单一备选答案 Question + single alternative answer              58.85                              73.62
问题+全部备选答案 Question + all alternative answers                86.33                              96.68

根据表三可知,采用问题与所有备选答案一起作为模型输入的方式,能够大幅度提升模型性能。According to Table 3, using the question and all alternative answers as model input can significantly improve model performance.

实验例3Experimental Example 3

对实施例1、对比例5~10的实验结果进行分析,结果如表四所示。The experimental results of Example 1 and Comparative Examples 5 to 10 were analyzed, and the results are shown in Table 4.

表四Table 4

（表四以图片形式给出，此处不再重复。）(Table 4 is provided as an image in the original document.)

根据结果可以看到，对比例8所设计的注意力机制模块在一些情况下反而会降低模型性能，如动作问题上应用空间注意力机制导致准确率降低2.8%（57.33%-60.13%）。而实施例1所设计的注意力机制模块，即采用双层感知机分析视觉信息和语义信息相关性，在所有种类的问题上都全面提升了模型的性能，说明对比例1的注意力机制设计是合理且有效的。From the results it can be seen that the attention mechanism module of Comparative Example 8 can in some cases actually reduce model performance; for example, applying the spatial attention mechanism on action questions reduces the accuracy by 2.8% (from 60.13% to 57.33%). In contrast, the attention mechanism module designed in Example 1, which uses a two-layer perceptron to analyze the correlation between visual information and semantic information, comprehensively improves the performance of the model on all types of questions, indicating that the attention mechanism design of Example 1 is reasonable and effective.

在本发明的描述中,需要说明的是,术语“上”、“下”、“内”和“外”等指示的方位或位置关系为基于本发明工作状态下的方位或位置关系,仅是为了便于描述本发明和简化描述,而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本发明的限制。In the description of the present invention, it should be noted that the terms "upper", "lower", "inside" and "outside" etc. indicate orientations or positional relationships based on the orientations or positional relationships in the working state of the present invention, and are only for the convenience of describing the present invention and simplifying the description, rather than indicating or implying that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and therefore cannot be understood as a limitation on the present invention.

以上结合了优选的实施方式对本发明进行了说明,不过这些实施方式仅是范例性的,仅起到说明性的作用。在此基础上,可以对本发明进行多种替换和改进,这些均落入本发明的保护范围内。The present invention has been described above in conjunction with preferred embodiments, but these embodiments are only exemplary and serve only as an illustration. On this basis, the present invention may be subjected to a variety of substitutions and improvements, all of which fall within the scope of protection of the present invention.

Claims (7)

1.一种人工智能视频问答方法，包括以下步骤：1. An artificial intelligence video question answering method, comprising the following steps:

S1、获取语义特征；S1, obtain semantic features;

S2、视觉特征提取，对视觉特征和语义特征进行多模态融合，获得融合特征；S2, visual feature extraction, multi-modal fusion of visual features and semantic features to obtain fusion features;

S3、根据融合特征和语义特征生成答案；S3, generate answers based on fusion features and semantic features;

在步骤S1中，采用GLoVe词嵌入模型对问题进行处理获取词向量表达，将词向量表达按语序输入双层LSTM模型，将LSTM模型的最后时刻状态的输出作为语义特征；In step S1, the GLoVe word embedding model is used to process the question to obtain the word vector expression, the word vector expression is input into the two-layer LSTM model in word order, and the output of the last moment state of the LSTM model is used as the semantic feature;

将问题与所有备选答案一起作为GLoVe词嵌入模型的输入，对GLoVe词嵌入模型进行训练；The question and all the candidate answers are used together as the input of the GLoVe word embedding model to train the GLoVe word embedding model;

在将问题与所有备选答案一起作为GLoVe词嵌入模型的输入时，对备选答案进行标注，使得将答案和问题在语义层面划分开。When the question and all the candidate answers are used as the input of the GLoVe word embedding model, the candidate answers are labeled so that the answers and the question are separated at the semantic level.

2.根据权利要求1所述的人工智能视频问答方法，其特征在于，2. The artificial intelligence video question-answering method according to claim 1, characterized in that:

在步骤S2中，所述视觉特征提取包括以下子步骤：In step S2, the visual feature extraction includes the following sub-steps:

S21、对视频图像在空间维度建模，获取图像特征；S21, modeling the video image in the spatial dimension to obtain image features;

S22、对图像特征在时间维度建模，提取时序特征，获得视觉特征；S22, modeling image features in the time dimension, extracting temporal features, and obtaining visual features;

所述图像特征为景物在视频一帧图像中的特征，所述时序特征为景物在视频不同帧图像中的特征。The image feature is a feature of a scene in one frame of a video, and the time sequence feature is a feature of a scene in different frames of a video.

3.根据权利要求2所述的人工智能视频问答方法，其特征在于，3. The artificial intelligence video question-answering method according to claim 2, characterized in that:

在步骤S21中，建立FPN模型获取目标级特征，建立ResNet模型获取图像的全局特征；In step S21, an FPN model is established to obtain target-level features, and a ResNet model is established to obtain global features of the image;

在步骤S22中，通过LSTM模型对每一帧图像的图像特征进行时序分析，获得视觉特征。In step S22, the image features of each frame of the image are analyzed in time series through the LSTM model to obtain visual features.

4.根据权利要求1所述的人工智能视频问答方法，其特征在于，4. The artificial intelligence video question-answering method according to claim 1, characterized in that:

在步骤S2中，通过注意力机制对语义特征和视觉特征信息进行融合，根据语义特征对视觉特征进行加权，完成视觉特征信息融合。In step S2, the semantic feature and visual feature information are fused through the attention mechanism, and the visual features are weighted according to the semantic features to complete the fusion of visual feature information.

5.根据权利要求4所述的人工智能视频问答方法，其特征在于，5. The artificial intelligence video question-answering method according to claim 4, characterized in that:

所述注意力机制表示如下：The attention mechanism is expressed as follows:

    α_i = softmax_i( f(T, V_i) )，　Ṽ = Σ_{i=1}^{n} α_i · V_i

其中，where:

α=[α1、α2、…、αi、…、αn]，α = [α1, α2, …, αi, …, αn],

V=[V1、V2、…、Vi、…、Vn]，V = [V1, V2, …, Vi, …, Vn],

T为语义特征，V为视觉相关特征，Vi表示第i个视觉相关特征，f(*,*)为语义特征和第i个视觉相关特征的融合函数，αi为第i个视觉相关特征的权重，Ṽ为输出的加权结果，n表示视觉相关特征向量个数。T is the semantic feature, V is the visual-related feature, Vi denotes the i-th visual-related feature, f(*,*) is the fusion function of the semantic feature and the i-th visual-related feature, αi is the weight of the i-th visual-related feature, Ṽ is the weighted output result, and n represents the number of visual-related feature vectors.

6.根据权利要求5所述的人工智能视频问答方法，其特征在于，6. The artificial intelligence video question-answering method according to claim 5, characterized in that:

所述注意力机制包括空间注意力机制和时间注意力机制；The attention mechanism includes a spatial attention mechanism and a temporal attention mechanism;

在获得的图像特征之后，应用空间注意力机制对每一帧图像的图像特征进行加权；After obtaining the image features, the spatial attention mechanism is applied to weight the image features of each frame;

步骤S22中，在LSTM模型中应用时间注意力机制，对不同帧图像进行加权；In step S22, the temporal attention mechanism is applied in the LSTM model to weight different frame images;

在空间注意力机制中，所述视觉相关特征V是指图像特征的不同区域；In the spatial attention mechanism, the visual-related features V refer to different regions of the image features;

在时间注意力机制中，所述视觉相关特征V是指不同帧的图像；In the temporal attention mechanism, the visual-related features V refer to images of different frames;

采用softmax函数对视觉相关特征的重要程度进行归一化，获得空间注意力机制中和时间注意力机制中的权重αi。The softmax function is used to normalize the importance of the visual-related features to obtain the weights αi in the spatial attention mechanism and the temporal attention mechanism.

7.根据权利要求4所述的人工智能视频问答方法，其特征在于，7. The artificial intelligence video question-answering method according to claim 4, characterized in that:

采用双层感知机作为注意力机制的融合算法，感知机包括两个全连接层，分别进行空间注意力机制和时间注意力机制。A two-layer perceptron is used as the fusion algorithm of the attention mechanism, and the perceptron includes two fully connected layers, which respectively perform the spatial attention mechanism and the temporal attention mechanism.
CN202010839563.5A 2020-08-19 2020-08-19 Artificial intelligent video question-answering method Active CN112036276B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010839563.5A CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010839563.5A CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Publications (2)

Publication Number Publication Date
CN112036276A CN112036276A (en) 2020-12-04
CN112036276B true CN112036276B (en) 2023-04-07

Family

ID=73577605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010839563.5A Active CN112036276B (en) 2020-08-19 2020-08-19 Artificial intelligent video question-answering method

Country Status (1)

Country Link
CN (1) CN112036276B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860847B (en) * 2021-01-19 2022-08-19 中国科学院自动化研究所 Video question-answer interaction method and system
CN113010656B (en) * 2021-03-18 2022-12-20 广东工业大学 Visual question-answering method based on multi-mode fusion and structural control
CN113128415B (en) * 2021-04-22 2023-09-29 合肥工业大学 Environment distinguishing method, system, equipment and storage medium
CN113536952B (en) * 2021-06-22 2023-04-21 电子科技大学 A video question answering method based on motion-captured attention networks
CN113762268A (en) * 2021-09-07 2021-12-07 上海车点信息服务有限公司 Deep learning method for video interview question-answering process
CN113837047B (en) * 2021-09-16 2022-10-28 广州大学 Video quality evaluation method, system, computer equipment and storage medium
CN114707022B (en) * 2022-05-31 2022-09-06 浙江大学 Video question-answer data set labeling method and device, storage medium and electronic equipment
CN117917696A (en) * 2022-10-20 2024-04-23 华为技术有限公司 Video question-answering method and electronic equipment
CN119202720A (en) * 2024-09-09 2024-12-27 国网四川省电力公司电力科学研究院 A high-resolution reconstruction method for the spatiotemporal distribution of XCO2 and related products

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965705B2 (en) * 2015-11-03 2018-05-08 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering
CN108549658B (en) * 2018-03-12 2021-11-30 浙江大学 Deep learning video question-answering method and system based on attention mechanism on syntax analysis tree
CN111008293A (en) * 2018-10-06 2020-04-14 上海交通大学 Visual question-answering method based on structured semantic representation
CN110727824B (en) * 2019-10-11 2022-04-01 浙江大学 Method for solving question-answering task of object relationship in video by using multiple interaction attention mechanism
CN110889340A (en) * 2019-11-12 2020-03-17 哈尔滨工程大学 A Visual Question Answering Model Based on Iterative Attention Mechanism

Also Published As

Publication number Publication date
CN112036276A (en) 2020-12-04

Similar Documents

Publication Publication Date Title
CN112036276B (en) 2023-04-07 Artificial intelligent video question-answering method
CN110717431B (en) 2023-03-24 Fine-grained visual question and answer method combined with multi-view attention mechanism
CN110532900B (en) 2021-07-27 Facial Expression Recognition Method Based on U-Net and LS-CNN
CN111563452B (en) 2023-04-21 A Multi-Human Pose Detection and State Discrimination Method Based on Instance Segmentation
CN114936623B (en) 2024-02-27 Aspect-level emotion analysis method integrating multi-mode data
CN113609896B (en) 2023-09-01 Object-level Remote Sensing Change Detection Method and System Based on Dual Correlation Attention
CN110021051A (en) 2019-07-16 One kind passing through text Conrad object image generation method based on confrontation network is generated
CN108765383B (en) 2022-03-18 Video description method based on deep migration learning
CN110390363A (en) 2019-10-29 An image description method
CN114549850B (en) 2023-08-08 A multimodal image aesthetic quality assessment method to solve the missing modality problem
CN112487949B (en) 2023-05-16 Learner behavior recognition method based on multi-mode data fusion
Jing et al. 2019 Recognizing american sign language manual signs from rgb-d videos
CN110705490B (en) 2022-09-02 Visual emotion recognition method
CN114912512B (en) 2024-07-23 Method for automatically evaluating image description result
CN112949622A (en) 2021-06-11 Bimodal character classification method and device fusing text and image
CN110490189A (en) 2019-11-22 A kind of detection method of the conspicuousness object based on two-way news link convolutional network
CN115482595B (en) 2023-04-07 Specific character visual sense counterfeiting detection and identification method based on semantic segmentation
CN110502655A (en) 2019-11-26 A method for generating natural image description sentences embedded in scene text information
CN117149944A (en) 2023-12-01 Multi-mode situation emotion recognition method and system based on wide time range
CN113780350B (en) 2023-12-19 ViLBERT and BiLSTM-based image description method
CN118446292A (en) 2024-08-06 Knowledge graph construction method, model, detection device and method for household behaviors
CN117975216A (en) 2024-05-03 Salient object detection method based on multi-modal feature refinement and fusion
CN117373058A (en) 2024-01-09 Identification method for small-difference classroom behaviors
CN118155119B (en) 2024-09-10 Video classification method and system for intelligent elevator passenger intention analysis
Ling et al. 2021 A facial expression recognition system for smart learning based on yolo and vision transformer

Legal Events

Date Code Title Description
2020-12-04 PB01 Publication
2020-12-22 SE01 Entry into force of request for substantive examination
2023-04-07 GR01 Patent grant