
CN117690180A - Eye gaze recognition method and electronic device - Google Patents

Eye gaze recognition method and electronic device

Info

Publication number
CN117690180A
Authority
CN
China
Prior art keywords
image, feature, layer, downsampler, eye
Prior art date
2023-06-29
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310793444.4A
Other languages
Chinese (zh)
Inventor
龚少庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-06-29
Filing date
2023-06-29
Publication date
2024-03-12
2023-06-29 Application filed by Honor Device Co Ltd
2023-06-29 Priority to CN202310793444.4A
2024-03-12 Publication of CN117690180A
Status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • G06V40/193 - Preprocessing; Feature extraction
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 - Feature extraction; Face representation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • G06V40/197 - Matching; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Ophthalmology & Optometry (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

This application provides an eye gaze recognition method and an electronic device. The method can be applied to electronic devices such as mobile phones and tablet computers. Implementing the method, the electronic device can first crop an image containing a portrait captured by the camera to obtain sub-images such as a left-eye image, a right-eye image, and a face image; then use a preset feature extractor such as CNN-A, skip-CNN, or a hybrid transformer to obtain the features of each sub-image; then splice the features of the sub-images; and use the spliced features to determine the gaze point of the user's eyes on the screen, thereby improving the accuracy of the recognized gaze point and improving the eye gaze recognition effect.

Description

Eye gaze recognition method and electronic device

Technical field

This application relates to the field of terminals, and in particular to an eye gaze recognition method and an electronic device.

Background

With the rise of mobile terminals and the maturing of communication technology, people have begun to explore new human-computer interaction methods that do away with the mouse and keyboard, such as voice control, gesture recognition control, and eye gaze control, so as to offer users more diverse and convenient interaction experiences. Eye gaze control means identifying the point on the screen at which the user's eyes are gazing and performing the corresponding interactive operation based on the position of that gaze point on the screen. Eye gaze control is fast and convenient to operate and can support interactive control in any scenario. At present, however, insufficient extraction of eye features leads to low accuracy of the recognized gaze point and a poor eye gaze control effect.

Summary of the invention

Embodiments of this application provide an eye gaze recognition method and an electronic device. An electronic device implementing the method can crop an image containing a portrait captured by the camera, obtain the features of each cropped sub-image separately, splice the features of the sub-images, and use the spliced features to determine the gaze point of the user's eyes on the screen, thereby improving the accuracy of the recognized gaze point and improving the eye gaze recognition effect.

In a first aspect, this application provides an eye gaze recognition method. The method is applied to an electronic device that includes a screen. The method includes: acquiring a first image; obtaining a first left-eye image, a first right-eye image, and a first face image from the first image; obtaining a first left-eye feature from the first left-eye image, a first right-eye feature from the first right-eye image, and a first face feature from the first face image; combining the first left-eye feature, the first right-eye feature, and the first face feature to obtain a first feature; and determining a target gaze point on the screen, the target gaze point being obtained based on the first feature.

By implementing the method provided in the first aspect, the electronic device can crop an image containing a portrait captured by the camera to obtain sub-images such as a left-eye image, a right-eye image, and a face image, obtain the features of these cropped sub-images separately, splice the features of the sub-images, and use the spliced features to determine the gaze point of the user's eyes on the screen. Compared with existing methods that extract features directly from the whole image, this approach of cropping first, extracting features separately, and then using the spliced features to determine the eye gaze point improves the recognition accuracy of the gaze point, that is, improves the eye gaze recognition effect.
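For illustration only, the flow of the first aspect can be summarized as the following Python skeleton. The helper names and their stub bodies are assumptions introduced for readability and are not defined by this application; they merely stand in for the cropping, feature extraction, splicing, and prediction steps described above.

```python
from typing import List, Tuple

def crop_subimages(image):
    """Stand-in for cropping: return (left_eye_image, right_eye_image, face_image)."""
    return image, image, image  # placeholder; real code would crop by detected positions

def extract_features(sub_image) -> List[float]:
    """Stand-in for a feature extractor applied to one sub-image."""
    return [0.0, 0.0]  # placeholder feature vector

def predict_gaze(query_feature: List[float]) -> Tuple[float, float]:
    """Stand-in for the prediction step mapping the spliced feature to a gaze point (x, y)."""
    return 0.0, 0.0

def gaze_point_for_frame(image) -> Tuple[float, float]:
    left_img, right_img, face_img = crop_subimages(image)   # crop sub-images from the frame
    left_f = extract_features(left_img)                     # left-eye feature
    right_f = extract_features(right_img)                   # right-eye feature
    face_f = extract_features(face_img)                     # face feature
    query_feature = left_f + right_f + face_f               # splice (combine) the features
    return predict_gaze(query_feature)                      # target gaze point on the screen
```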

With reference to the method provided in the first aspect, in some embodiments the method further includes: encoding the first left-eye feature and the first right-eye feature using a transformer encoder. In this case, combining the first left-eye feature, the first right-eye feature, and the first face feature to obtain the first feature specifically includes: combining the encoded first left-eye feature, the encoded first right-eye feature, and the first face feature to obtain the first feature.

By implementing the method provided in the above embodiment, after obtaining the left-eye feature and the right-eye feature, the electronic device can use a transformer encoder to encode these eye features (also referred to as reconstructing them), and then use the encoded eye features together with the face feature to determine the user's eye gaze point. Performing eye gaze point recognition with the eye features reconstructed by the transformer encoder can further improve the eye gaze recognition effect.
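As an illustrative sketch only (this application does not fix the encoder's dimensions, layer count, or how feature maps are arranged into tokens), encoding the eye features with a transformer encoder could look like the following PyTorch snippet; d_model, nhead, num_layers, and the token layout are all assumptions.

```python
import torch
import torch.nn as nn

d_model, nhead, num_layers = 64, 4, 2   # assumed sizes, not taken from this application
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
eye_encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

# Assumed layout: each eye feature is a sequence of 16 tokens of width d_model.
left_eye_feature = torch.randn(1, 16, d_model)
right_eye_feature = torch.randn(1, 16, d_model)

# "Encoding"/reconstruction of the eye features before splicing with the face feature.
left_encoded = eye_encoder(left_eye_feature)
right_encoded = eye_encoder(right_eye_feature)
```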

With reference to the method provided in the first aspect, in some embodiments the electronic device is preset with a first feature extractor and a second feature extractor. In this case, obtaining the first left-eye feature from the first left-eye image, the first right-eye feature from the first right-eye image, and the first face feature from the first face image specifically includes: using the first feature extractor to obtain the first left-eye feature from the first left-eye image and the first right-eye feature from the first right-eye image; and using the second feature extractor to obtain the first face feature from the first face image.

Preferably, the first feature extractor is different from the second feature extractor. That is, the electronic device can build separate feature extractors for eye images and for face images according to the differences between them, so as to maximize the quality of the extracted eye features and face features and improve the eye gaze recognition effect.

In some embodiments, the first feature extractor and the second feature extractor can also be set to the same feature extractor. This reduces the size of the eye gaze recognition algorithm and saves storage space.

In some embodiments, the first feature extractor is built based on a convolutional neural network.

Preferably, the first feature extractor includes a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, and a third convolutional layer, where the convolution kernel of the first convolutional layer is 7×7 with a stride of 1, the convolution kernel of the second convolutional layer is 5×5 with a stride of 1, the convolution kernel of the third convolutional layer is 3×3 with a stride of 1, and the pooling kernels of the first pooling layer and the second pooling layer are 2×2 with a stride of 2.

In some embodiments, the first feature extractor includes a plurality of processing layers and one or more downsamplers, where the processing layers include convolutional layers and pooling layers.

The plurality of processing layers include a first processing layer, and the one or more downsamplers include a downsampler i. Taking the first processing layer and downsampler i as an example, in the first feature extractor the processing layers and downsamplers structurally satisfy the following: the input of downsampler i is the same as the input of the first processing layer, and the output of downsampler i is spliced with the output of a second processing layer, where the second processing layer is the first processing layer itself or a processing layer after the first processing layer.

By implementing the method provided in the above embodiment, the first feature extractor combining convolutional layers, pooling layers, and downsamplers can obtain more eye features and enrich the feature space, thereby improving the eye gaze recognition effect.

With reference to the methods provided in the above embodiments, in some embodiments the plurality of processing layers include a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, and a third convolutional layer.

On the basis of including the first convolutional layer, the first pooling layer, the second convolutional layer, the second pooling layer, and the third convolutional layer, in some embodiments the plurality of downsamplers include a first downsampler, a second downsampler, and a third downsampler. The input of the first downsampler is the same as the input of the first convolutional layer, and its output is spliced with the output of the first convolutional layer and fed into the first pooling layer; the input of the second downsampler is the same as the input of the first pooling layer, and its output is spliced with the output of the second convolutional layer and fed into the second pooling layer; the input of the third downsampler is the same as the input of the second convolutional layer, and its output is spliced with the output of the third convolutional layer and output.

On the same basis, in some embodiments the plurality of downsamplers include, or further include, a fourth downsampler and a fifth downsampler. The input of the fourth downsampler is the same as the input of the first convolutional layer, and its output is spliced with the output of the second convolutional layer and fed into the second pooling layer; the input of the fifth downsampler is the same as the input of the first pooling layer, and its output is spliced with the output of the third convolutional layer and output.

On the same basis, in some embodiments the plurality of downsamplers include, or further include, a sixth downsampler. The input of the sixth downsampler is the same as the input of the first convolutional layer, and its output is spliced with the output of the third convolutional layer and output.

In some embodiments, the first feature extractor has 1 input channel and 168 output channels, or the first feature extractor has 3 input channels and 184 output channels.

With reference to the method provided in the first aspect, in some embodiments, before the first image is acquired, the method further includes: acquiring a second image; obtaining a second left-eye image, a second right-eye image, and a second face image from the second image; obtaining a second left-eye feature from the second left-eye image, a second right-eye feature from the second right-eye image, and a second face feature from the second face image; and combining the second left-eye feature, the second right-eye feature, and the second face feature to obtain a second feature, where the target gaze point is also obtained based on the second feature.

The previously acquired second image is also called a calibrated reference image, where the calibration is a known gaze point. That the target gaze point is also obtained based on the second feature specifically includes: the target gaze point is obtained based on the calibration and a gaze-point distance, where the gaze-point distance is determined based on the first feature and the second feature.

By implementing the method provided in the above embodiment, the electronic device can use the features of the two images, in particular the difference between the current image (that is, the first image) and the reference image, to improve the recognition of the eye gaze point.
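A minimal sketch of this calibrated, differential prediction, assuming the gaze-point distance is regressed from the concatenation of the two query features by a small fully connected head; the feature size, the head's architecture, and the example calibration coordinates are assumptions, since this passage does not specify them.

```python
import torch
import torch.nn as nn

feat_dim = 128  # assumed size of each query feature

# Assumed regressor mapping [current feature, reference feature] to a 2-D offset (dx, dy).
offset_head = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

first_feature = torch.randn(1, feat_dim)        # from the current (first) image
second_feature = torch.randn(1, feat_dim)       # from the calibrated reference (second) image
calibration = torch.tensor([[540.0, 1200.0]])   # known gaze point of the reference image, in pixels

gaze_point_distance = offset_head(torch.cat([first_feature, second_feature], dim=1))
target_gaze_point = calibration + gaze_point_distance   # predicted gaze point on the screen
```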

Preferably, the electronic device also uses the first feature extractor when obtaining the second left-eye feature from the second left-eye image and the second right-eye feature from the second right-eye image.

With reference to the method provided in the first aspect, in some embodiments the method further includes: determining a hot zone corresponding to the target gaze point; and performing a first action corresponding to the hot zone.

After recognizing the eye gaze point, the electronic device can determine the hot zone in which the gaze point falls according to the position of the gaze point on the display, and then perform the interactive control operation matched with that hot zone, providing the user with an eye-gaze-controlled interaction method. For example, after determining that the eye gaze point is within the hot zone of the camera application's icon, the electronic device can open the camera application and display its main interface, providing the user with various shooting services. For another example, after determining that the user's eye gaze point is within the notification bar hot zone, the electronic device 100 can display a notification interface for the user to browse notifications.

In a second aspect, this application provides an electronic device that includes one or more processors and one or more memories, where the one or more memories are coupled to the one or more processors and are used to store computer program code, the computer program code includes computer instructions, and when the one or more processors execute the computer instructions, the electronic device is caused to perform the method described in the first aspect or any possible implementation of the first aspect.

In a third aspect, embodiments of this application provide a chip system applied to an electronic device. The chip system includes one or more processors configured to invoke computer instructions to cause the electronic device to perform the method described in the first aspect or any possible implementation of the first aspect.

In a fourth aspect, this application provides a computer-readable storage medium including instructions that, when run on an electronic device, cause the electronic device to perform the method described in the first aspect or any possible implementation of the first aspect.

In a fifth aspect, this application provides a computer program product containing instructions that, when run on an electronic device, cause the electronic device to perform the method described in the first aspect or any possible implementation of the first aspect.

It can be understood that the electronic device provided in the second aspect, the chip system provided in the third aspect, the computer storage medium provided in the fourth aspect, and the computer program product provided in the fifth aspect are all used to perform the method provided by this application. Therefore, for the beneficial effects they can achieve, reference may be made to the beneficial effects of the corresponding method, which are not repeated here.

Brief description of the drawings

Figure 1 is a schematic diagram of an eye gaze recognition algorithm provided by an embodiment of this application;

Figures 2A-2B are a set of schematic diagrams of obtaining a left-eye image, a right-eye image, and a face image provided by an embodiment of this application;

Figure 3 is a schematic diagram of the network structure of a CNN-A provided by an embodiment of this application;

Figure 4 is a schematic diagram of feature splicing provided by an embodiment of this application;

Figures 5A-5B are a set of user interfaces for interaction based on eye gaze recognition provided by an embodiment of this application;

Figures 6A-6B are a set of user interfaces for interaction based on eye gaze recognition provided by an embodiment of this application;

Figure 7 is a schematic diagram of the network structure of skip-CNN provided by an embodiment of this application;

Figure 8 is a flowchart of the electronic device 100 reconstructing eye features with a transformer encoder, provided by an embodiment of this application;

Figure 9 is a schematic structural diagram of a transformer encoder provided by an embodiment of this application;

Figure 10 is a schematic diagram of another eye gaze recognition algorithm provided by an embodiment of this application;

Figures 11A-11B are schematic diagrams of a set of cumulative distribution functions (CDFs) provided by an embodiment of this application;

Figure 12 is a schematic diagram of the hardware structure of the electronic device 100.

Detailed description of embodiments

The terms used in the following embodiments of this application are only for the purpose of describing specific embodiments and are not intended to limit this application.

Electronic devices 100 such as mobile phones and tablet computers are generally equipped with one or more front-facing cameras. The electronic device 100 can capture portraits through these front-facing cameras for services such as selfies and face unlocking.

In the embodiments of this application, the electronic device 100 can also use the portraits captured by the one or more front-facing cameras to perform eye gaze recognition and determine the gaze point of the current user's gaze on the display of the electronic device 100, then determine the corresponding hot zone according to the position of that gaze point on the display and perform the interactive control operation matched with the hot zone, thereby providing the user with an eye-gaze-controlled interaction method.

The electronic device 100 is not limited to a mobile phone or tablet computer; it may also be a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, an artificial intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, and/or a smart city device. The embodiments of this application place no particular restriction on the specific type of the terminal.

Figure 1 is a schematic diagram of an eye gaze recognition algorithm provided by an embodiment of this application.

First, in S101, the electronic device 100 acquires a frame of image, denoted image X.

The electronic device 100 can acquire image X through a front-facing camera. The embodiments of this application do not limit the specific moment at which the electronic device 100 acquires image X, that is, they do not limit the scenario in which image X is acquired and eye gaze recognition is performed. In some embodiments, the electronic device 100 can turn on the front-facing camera and acquire a frame of image X containing a face while displaying the lock screen. In other embodiments, the electronic device 100 can also do so while displaying the home screen. In still other embodiments, the electronic device 100 can also do so after a particular application has been opened.

The electronic device 100 can be configured with different types of front-facing cameras, including but not limited to a wide-angle camera, an ultra-wide-angle camera, a telephoto camera, a structured-light depth camera, an infrared camera, and so on. The wide-angle, ultra-wide-angle, and telephoto cameras capture and output three-channel RGB images; the structured-light depth camera can output four-channel RGBD images; the infrared camera can output single-channel infrared (IR) images reflecting infrared light intensity. Accordingly, depending on the type of camera used, image X can be a single-channel, three-channel, or other multi-channel image. An image of any type captured by any type of camera can serve as the input image of the eye gaze recognition algorithm of this application, that is, the above image X; the embodiments of this application place no restriction on this.

After image X is obtained, in S102 the electronic device 100 can obtain a left-eye image, a right-eye image, and a face image from image X.

The electronic device 100 can determine the position of the face in image X using a face detection algorithm and then obtain the face image based on that position. The electronic device 100 can determine the positions of the two eyes in image X using a facial key-point detection algorithm, where the eye positions include a left-eye position and a right-eye position. The electronic device 100 can then obtain the left-eye image based on the left-eye position and the right-eye image based on the right-eye position.

Figures 2A-2B are a set of schematic diagrams of obtaining a left-eye image, a right-eye image, and a face image provided by an embodiment of this application.

First, Figure 2A is a schematic diagram of obtaining a face image provided by an embodiment of this application.

The electronic device 100 can input the acquired image X into a face detection algorithm to perform face detection. The face detection algorithm identifies the face in the image and outputs a face box. As shown in Figure 2A, the face box marks the position of the face in image X. The image in the region enclosed by the face box is the face image. The electronic device 100 can then crop image X along the face box to obtain a standalone face image.

Figure 2B is a schematic diagram of obtaining a left-eye image and a right-eye image provided by an embodiment of this application.

The electronic device 100 can input the acquired image X into a facial key-point detection algorithm to perform facial key-point detection. The facial key-point detection algorithm identifies facial key points in the image, including but not limited to a left-eye point, a right-eye point, a nose point, a left lip-corner point, and a right lip-corner point. As shown in Figure 2B, the electronic device 100 can obtain the facial key points: left-eye point a, right-eye point b, nose point c, left lip-corner point d, and right lip-corner point e. The electronic device 100 can determine a rectangular region centered on left-eye point a; the image corresponding to this rectangular region is the left-eye image. The electronic device 100 can then crop image X along this rectangular region to obtain the left-eye image. Similarly, the electronic device 100 can determine a rectangular region centered on right-eye point b, whose corresponding image is the right-eye image, and crop image X along that region to obtain the right-eye image.
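As a simple illustration of the cropping just described (the 64×64 patch size and the clamping to the image border are assumptions; the application does not specify the rectangle's dimensions), an eye patch centered on a detected key point could be cut out as follows:

```python
import numpy as np

def crop_centered(image: np.ndarray, center_xy, half_size: int = 32) -> np.ndarray:
    """Crop a rectangular patch centered on (x, y), clamped to the image border."""
    h, w = image.shape[:2]
    cx, cy = int(center_xy[0]), int(center_xy[1])
    x0, x1 = max(cx - half_size, 0), min(cx + half_size, w)
    y0, y1 = max(cy - half_size, 0), min(cy + half_size, h)
    return image[y0:y1, x0:x1]

frame = np.zeros((480, 640), dtype=np.uint8)   # stand-in for image X (single-channel IR)
left_eye_point = (220, 180)                    # key point a from the key-point detector
right_eye_point = (420, 180)                   # key point b

left_eye_image = crop_centered(frame, left_eye_point)
right_eye_image = crop_centered(frame, right_eye_point)
```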

It can be understood that the electronic device 100 can perform face detection and facial key-point detection in parallel, and therefore obtain the left-eye image, the right-eye image, and the face image in parallel. That is, obtaining the face image does not affect the electronic device 100 obtaining the left-eye image and the right-eye image, and obtaining the left-eye and right-eye images does not affect the electronic device 100 obtaining the face image.

In some embodiments, before obtaining the left-eye image, the right-eye image, and the face image, the electronic device 100 can also apply rotation correction to image X to ensure that the user's face is upright, thereby improving the quality of the left-eye image, right-eye image, and face image and improving the recognition effect of eye gaze recognition.

After the left-eye image, the right-eye image, and the face image are obtained, in S103 the electronic device 100 can input the left-eye image, the right-eye image, and the face image into feature extractors respectively to obtain the corresponding left-eye feature, right-eye feature, and face feature.

The left-eye image and the right-eye image can be collectively referred to as eye images. Correspondingly, the left-eye feature obtained from the left-eye image and the right-eye feature obtained from the right-eye image can be collectively referred to as eye features. As shown in Figure 1, in this embodiment of the application the electronic device 100 can be preset with a feature extractor A and a feature extractor B, where feature extractor A is used to process the eye images and obtain the eye features, and feature extractor B is used to process the face image and obtain the face feature.

In one implementation, feature extractor A can be a neural network model built on a convolutional neural network (CNN); feature extractor A can therefore also be denoted CNN-A. The embodiments of this application do not limit the feature extraction algorithm used by feature extractor B. In the scenario where feature extractor B also uses a CNN, preferably the network structure and parameter settings of feature extractor B (also called CNN-B) differ from those of CNN-A, that is, CNN-B is different from CNN-A. Optionally, CNN-B can also use the same network structure and parameter settings as CNN-A, that is, CNN-B is the same as CNN-A.

Figure 3 is a schematic diagram of the network structure of a CNN-A provided by an embodiment of this application.

As shown in Figure 3, the network structure of CNN-A can include three convolutional layers C1-C3 and two pooling layers P1-P2, with convolutional and pooling layers arranged alternately. In one specific implementation, the convolution kernel of convolutional layer C1 is 7×7 with a stride of 1; the convolution kernel of convolutional layer C2 is 5×5 with a stride of 1; the convolution kernel of convolutional layer C3 is 3×3 with a stride of 1; and the pooling kernels of pooling layers P1 and P2 are both 2×2 with a stride of 2.

After the left-eye image is input into CNN-A and processed by the above convolutional and pooling layers, CNN-A obtains the left-eye feature. After the right-eye image is input into CNN-A and processed by the same layers, CNN-A obtains the right-eye feature.
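The CNN-A structure described above can be sketched in PyTorch as follows. The kernel sizes and strides follow the description; the intermediate channel count of 32, the use of max pooling, the absence of activation functions, and the 60 output channels (taken from the example given later in the text) are assumptions made only to keep the sketch concrete.

```python
import torch
import torch.nn as nn

class CNNA(nn.Module):
    """Sketch of CNN-A: C1 -> P1 -> C2 -> P2 -> C3 with the kernel sizes and strides above."""
    def __init__(self, in_channels: int = 1, out_channels: int = 60):
        super().__init__()
        self.c1 = nn.Conv2d(in_channels, 32, kernel_size=7, stride=1)   # C1: 7x7, stride 1
        self.p1 = nn.MaxPool2d(kernel_size=2, stride=2)                 # P1: 2x2, stride 2
        self.c2 = nn.Conv2d(32, 32, kernel_size=5, stride=1)            # C2: 5x5, stride 1
        self.p2 = nn.MaxPool2d(kernel_size=2, stride=2)                 # P2: 2x2, stride 2
        self.c3 = nn.Conv2d(32, out_channels, kernel_size=3, stride=1)  # C3: 3x3, stride 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.c3(self.p2(self.c2(self.p1(self.c1(x)))))

eye_patch = torch.randn(1, 1, 64, 64)     # e.g. a 64x64 single-channel eye image
eye_feature = CNNA()(eye_patch)           # feature maps used as the left-/right-eye feature
```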

The size of the features output by CNN-A (for example the left-eye feature and the right-eye feature above) can be determined from the size of the input image (for example the left-eye image and the right-eye image above) and the kernel sizes and strides of the layers in the network structure. Taking an H1×W1 input image as an example, after processing by the above convolutional and pooling layers, CNN-A can output features of size H2×W2. Without edge padding, and with the kernel sizes and strides listed above, the relationship between H1×W1 and H2×W2 is as follows:

H2 = (H1-25)/4;  W2 = (W1-25)/4.

The size of the features output by each layer in the CNN-A network structure can likewise be determined from the kernel sizes and strides of that layer and of the convolutional and pooling layers before it, which is not detailed further here.

CNN-A also has input channels and output channels.

The number of input channels usually corresponds to the color structure of the input image. For example, when the input image is a three-channel RGB image, CNN-A has 3 input channels; when the input image is a single-channel IR image, CNN-A has 1 input channel. Specifically, CNN-A can first identify the color structure of the input image and determine the number of its color channels, and then set the number of input channels to match. For example, when the input image is a three-channel RGB image, CNN-A determines that the input image has 3 color channels and accordingly sets the number of input channels to 3.

The number of output channels is the number of output features and is usually greater than the number of input channels. For example, CNN-A can have 60 output channels, in which case CNN-A extracts and outputs 60 groups of features from the input image. Specifically, CNN-A can apply multiple sets of convolution or pooling kernel values in each layer to obtain far more features than the number of input channels. The more output channels there are, the more features CNN-A outputs and the greater its computational cost. The number of output channels can be set empirically so as to obtain as many features as possible while keeping the computational cost under control.

In the scenario where CNN-A has N1 output channels and CNN-B has N2 output channels, after inputting the left-eye image and the right-eye image into CNN-A, the electronic device 100 obtains N1 groups of left-eye features and N1 groups of right-eye features; after inputting the face image into CNN-B, the electronic device 100 obtains N2 groups of face features. Then, in S104, the electronic device 100 can splice the left-eye features and the right-eye features to obtain the eye features, and further splice the eye features and the face features to obtain the query feature.

The splicing operation is a combination operation: the spliced feature set is the sum of the two feature sets before splicing. Therefore, the eye features obtained after splicing are all of the features obtained from the eye images of image X, and the query feature obtained after splicing is all of the features obtained from image X.

Figure 4 is a schematic diagram of feature splicing provided by an embodiment of this application. As shown in Figure 4, the gray three-layer grid at the upper left represents the N1 groups of left-eye features output by CNN-A for the left-eye image, and the white three-layer grid at the upper right represents the N1 groups of right-eye features output by CNN-A for the right-eye image; the six-layer grid below represents the eye features obtained after splicing: 2×N1 groups, comprising the above N1 groups of left-eye features and N1 groups of right-eye features.

Further, after splicing the 2×N1 groups of eye features and the N2 groups of face features, the electronic device 100 obtains the query feature: (2×N1+N2) groups.

After obtaining the query feature, in S105 the electronic device 100 can input the query feature into a fully connected (FC) layer, and the FC prediction yields the gaze point (x, y). The gaze point (x, y) is the gaze point determined by the eye gaze recognition algorithm from image X.
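A minimal sketch of S104-S105, splicing the feature groups along the channel dimension and predicting (x, y) with an FC layer; N1, N2, the 10×10 feature-map size, and flattening before the FC layer are assumptions, since this passage does not specify how the spliced feature is fed to the FC layer.

```python
import torch
import torch.nn as nn

N1, N2, h, w = 60, 60, 10, 10                 # assumed group counts and feature-map size

left_eye_feat = torch.randn(1, N1, h, w)      # N1 groups of left-eye features
right_eye_feat = torch.randn(1, N1, h, w)     # N1 groups of right-eye features
face_feat = torch.randn(1, N2, h, w)          # N2 groups of face features

eye_feat = torch.cat([left_eye_feat, right_eye_feat], dim=1)   # S104: 2*N1 groups of eye features
query_feat = torch.cat([eye_feat, face_feat], dim=1)           # S104: (2*N1 + N2) groups

fc = nn.Linear((2 * N1 + N2) * h * w, 2)                        # FC layer predicting (x, y)
gaze_point = fc(torch.flatten(query_feat, start_dim=1))         # S105: predicted gaze point
```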

After obtaining the gaze point (x, y), the electronic device 100 determines the hot zone matching the gaze point (x, y) and triggers the function corresponding to that hot zone.

A hot zone is a region of fixed shape and size on the display. For example, in the scenario of displaying the home screen, the electronic device 100 can be provided with application hot zones: the display region corresponding to any application icon on the home screen can be called an application hot zone. After detecting a user operation acting on an application hot zone, the electronic device 100 can open or configure the application corresponding to that hot zone. The area of an application hot zone is greater than or equal to the display area of the corresponding application icon and completely covers that icon. The application hot zones corresponding to different application icons do not overlap each other.

Figure 5A shows a home screen of the electronic device 100 provided by an embodiment of this application. As shown in Figure 5A, one or more application icons can be displayed on the home screen of the electronic device 100, such as a phone application icon, a browser application icon, a camera application icon, a weather application icon, and so on. Taking the camera application as an example, region 501 can represent the camera hot zone, which completely covers the camera application icon. After determining the user's eye gaze point (x, y) by the method shown in S101-S105, the electronic device 100 can determine whether the gaze point (x, y) is within the camera hot zone, that is, within region 501. After determining that the gaze point (x, y) is within the camera hot zone and that the gaze duration has reached a preset duration, the electronic device 100 can open the camera application. Referring to Figure 5B, the electronic device 100 can display the shooting interface of the corresponding application.

Referring to Figure 6A, the electronic device 100 can also be provided with a region 601, which can be called the notification bar hot zone. After determining the user's eye gaze point (x, y) by the method shown in S101-S105, the electronic device 100 can determine whether the gaze point (x, y) is within the notification bar hot zone, that is, within region 601. After determining that the gaze point (x, y) is within the notification bar hot zone and that the gaze duration has reached a preset duration, the electronic device 100 can display a notification interface, as shown in Figure 6B.
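For illustration, a hot-zone lookup with a dwell-time check could be implemented along the following lines; the zone rectangles, the 1-second dwell threshold, and the printed actions are assumptions, not values taken from this application.

```python
import time

# Hypothetical hot zones: name -> (x0, y0, x1, y1) in screen pixels.
HOT_ZONES = {
    "camera_icon": (40, 300, 200, 460),       # e.g. region 501
    "notification_bar": (0, 0, 1080, 120),    # e.g. region 601
}
DWELL_SECONDS = 1.0                           # assumed preset gaze duration

def hit_zone(x: float, y: float):
    for name, (x0, y0, x1, y1) in HOT_ZONES.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def update(gaze_xy, state: dict) -> dict:
    """Call once per recognized gaze point; fires the zone's action after a sustained dwell."""
    zone, now = hit_zone(*gaze_xy), time.monotonic()
    if zone != state.get("zone"):
        state.update(zone=zone, since=now)                 # gaze moved to a new zone (or none)
    elif zone is not None and now - state["since"] >= DWELL_SECONDS:
        print(f"trigger action for {zone}")                # e.g. open the camera app or notifications
        state.update(zone=None, since=now)
    return state

state = update((120.0, 380.0), {})   # example call with a gaze point inside the camera hot zone
```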

In this way, the user can use eye gaze operations instead of touch operations on the screen to interactively control the electronic device 100.

Further, to improve the accuracy of the eye gaze recognition algorithm and reduce the error between the gaze point predicted by the algorithm and the point the user is actually gazing at, the electronic device 100 can use a skip-connection feature extractor in place of the original CNN-A.

The skip-connection feature extractor builds on CNN-A by introducing multiple downsamplers between the convolutional and pooling layers; it can therefore also be called skip-CNN. A downsampler downsamples the input image and reduces its size. The compressed image matrix output by a downsampler can also be regarded as a group of features describing that image. skip-CNN can combine the features output by the downsamplers with the features output after processing by the convolutional and pooling layers, enriching the feature space and thereby improving the gaze point recognition effect.

Figure 7 is a schematic diagram of the network structure of skip-CNN provided by an embodiment of this application.

As shown in Figure 7, skip-CNN also includes convolutional layers C1-C3 and pooling layers P1-P2. The connections between convolutional layers C1-C3 and pooling layers P1-P2 in skip-CNN, as well as the kernel sizes and strides of these layers, are the same as in CNN-A and are not repeated here. This embodiment of the application describes in detail the positions of the downsamplers introduced between the convolutional and pooling layers of skip-CNN, and how the convolutional layers, pooling layers, and downsamplers are combined.

The electronic device 100 can divide the network structure of convolutional and pooling layers shown in Figure 3 into three stages, with the convolutional layers as boundaries, as shown in Table 1:

Table 1

Stage 1: convolutional layer C1
Stage 2: pooling layer P1, convolutional layer C2
Stage 3: pooling layer P2, convolutional layer C3

As shown in Figure 7, the electronic device 100 can provide a downsampler corresponding to each stage, for example downsamplers 1-3. Further, the electronic device 100 can also provide downsamplers spanning two stages, for example downsamplers 4-5, and a downsampler spanning three stages, for example downsampler 6. It can be understood that when the original CNN-A includes more convolutional and pooling layers, the corresponding skip-CNN can also include more downsamplers.

(1) Input size and output size configuration of the downsamplers.

The input size and output size of each downsampler differ according to the stage to which the downsampler belongs. The input size refers to the size of the input data (input image or input features), and the output size refers to the size of the output data (output image or output features). In this embodiment of the application, the input size of a downsampler equals the input size of the stage corresponding to that downsampler, and its output size equals the output size of that stage.

First, Table 2 gives the input sizes and output sizes of the convolutional and pooling layers in the CNN-A provided by this embodiment of the application:

Table 2

Layer    Input size    Output size
C1       H1×W1         H11×W11
P1       H11×W11       H12×W12
C2       H12×W12       H13×W13
P2       H13×W13       H14×W14
C3       H14×W14       H2×W2

When H1×W1 is known, H11×W11 can be determined from H1×W1 and the kernel size and stride of convolutional layer C1, H12×W12 can be determined from H11×W11 and the kernel size and stride of pooling layer P1, and so on; H13×W13, H14×W14, and H2×W2 can all be determined in the same way.
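As a small illustration of how the per-layer sizes follow from the kernel sizes and strides, assuming the common no-padding convention floor((in - kernel)/stride) + 1 for both convolution and pooling (the application does not spell out its rounding convention), the sizes in Table 2 can be computed as follows:

```python
def out_size(in_size: int, kernel: int, stride: int) -> int:
    # Common no-padding convention: floor((in - kernel) / stride) + 1
    return (in_size - kernel) // stride + 1

H1 = 64                                    # example input height; the width is analogous
H11 = out_size(H1, kernel=7, stride=1)     # after C1
H12 = out_size(H11, kernel=2, stride=2)    # after P1
H13 = out_size(H12, kernel=5, stride=1)    # after C2
H14 = out_size(H13, kernel=2, stride=2)    # after P2
H2 = out_size(H14, kernel=3, stride=1)     # after C3
print(H11, H12, H13, H14, H2)              # 58 29 25 12 10 for a 64-pixel input
```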

On the basis of the input and output sizes of the convolutional and pooling layers shown in Table 2, the input and output sizes of downsamplers 1-6 are correspondingly given in Table 3:

Table 3

Downsampler      Stage        Input size    Output size
Downsampler 1    1            H1×W1         H11×W11
Downsampler 2    2            H11×W11       H13×W13
Downsampler 3    3            H13×W13       H2×W2
Downsampler 4    1 & 2        H1×W1         H13×W13
Downsampler 5    2 & 3        H11×W11       H2×W2
Downsampler 6    1 & 2 & 3    H1×W1         H2×W2

In this way, since the output sizes are the same, after each stage the skip-CNN can concatenate the features output by that stage with the features output by the corresponding downsampler.

(2) Channel configuration of the downsampler-based skip-CNN.

As with CNN-A, the number of input channels of the skip-CNN is determined by the color structure of the input image, which is not described again here.

Table 4 shows the configuration of the number of output channels in a skip-CNN for a single-channel image (for example, an IR image) provided by an embodiment of this application.

Table 4

Here, [A/B] indicates that an entity (a convolutional layer, pooling layer, or downsampler) has A input channels and B output channels. [C] indicates that an entity has C input channels and also C output channels.

As shown in Table 4, for a single-channel input image, the number of input channels of the skip-CNN is 1; correspondingly, the numbers of input channels of convolutional layer C1, downsampler 1, downsampler 4, and downsampler 6 are all 1. The number of output channels of convolutional layer C1 can be set to 32. For each downsampler, the number of output channels equals the number of input channels. Convolutional layer C1 then outputs 32 groups of features. After concatenating the 32 groups of features output by convolutional layer C1 with the 1 group of features output by downsampler 1, the skip-CNN obtains 33 groups of features. Thus, the numbers of input channels of pooling layer P1, downsampler 2, and downsampler 5 are all 33. The number of output channels of convolutional layer C2 can be set equal to its number of input channels, i.e., 33. At this point, convolutional layer C2, downsampler 2, and downsampler 5 each output 33 groups of features. After concatenating the 33 groups of features output by convolutional layer C2, the 33 groups of features output by downsampler 2, and the 1 group of features output by downsampler 4, the skip-CNN obtains 67 groups of features. Thus, the numbers of input channels of pooling layer P2 and downsampler 3 are both 67. The number of output channels of convolutional layer C3 can be set to 67. At this point, convolutional layer C3 and downsampler 3 each output 67 groups of features. After concatenating the 67 groups of features output by convolutional layer C3, the 67 groups of features output by downsampler 3, the 33 groups of features output by downsampler 5, and the 1 group of features output by downsampler 6, the skip-CNN obtains 168 groups of features.
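The following is a minimal sketch of the forward pass just described, for a single-channel input. The internal operation of the downsamplers is not disclosed in this passage, so adaptive average pooling is used here purely as a stand-in to make the spatial sizes match Table 3; the kernel sizes follow CNN-A, and the class and function names are illustrative only. With in_ch=3 the same sketch yields the 184 output channels described for Table 5.

```python
# Hypothetical sketch of the skip-CNN channel flow (not the patent's actual implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipCNNSketch(nn.Module):
    def __init__(self, in_ch: int = 1):
        super().__init__()
        self.c1 = nn.Conv2d(in_ch, 32, kernel_size=7, stride=1)       # stage 1
        self.p1 = nn.MaxPool2d(kernel_size=2, stride=2)
        self.c2 = nn.Conv2d(32 + in_ch, 32 + in_ch, kernel_size=5)    # stage 2, 33 channels for IR
        self.p2 = nn.MaxPool2d(kernel_size=2, stride=2)
        c3_ch = 2 * (32 + in_ch) + in_ch                               # 67 channels for IR
        self.c3 = nn.Conv2d(c3_ch, c3_ch, kernel_size=3)               # stage 3

    @staticmethod
    def _skip(x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # Assumed downsampler: shrink the skipped tensor to the stage's output spatial size.
        return F.adaptive_avg_pool2d(x, ref.shape[-2:])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        s1_in = x
        f1 = self.c1(s1_in)
        f1 = torch.cat([f1, self._skip(s1_in, f1)], dim=1)            # downsampler 1 -> 33 ch
        s2_in = f1                                                     # input of pooling layer P1
        f2 = self.c2(self.p1(s2_in))
        f2 = torch.cat([f2,
                        self._skip(s2_in, f2),                         # downsampler 2
                        self._skip(s1_in, f2)], dim=1)                 # downsampler 4 -> 67 ch
        s3_in = f2                                                     # input of pooling layer P2
        f3 = self.c3(self.p2(s3_in))
        f3 = torch.cat([f3,
                        self._skip(s3_in, f3),                         # downsampler 3
                        self._skip(s2_in, f3),                         # downsampler 5
                        self._skip(s1_in, f3)], dim=1)                 # downsampler 6 -> 168 ch
        return f3

# Example: a 64x64 single-channel eye image yields a 168-channel feature map.
print(SkipCNNSketch(1)(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 168, 10, 10])
```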

Table 5 shows the configuration of the number of output channels in a skip-CNN for a three-channel image (for example, an RGB image) provided by an embodiment of this application.

Table 5

The number of output channels of each entity for the three-channel image, and the composition of the final number of output channels, can be understood with reference to the description of Table 4 above and are not repeated here.

Further, after using the skip-CNN to obtain the left-eye features and right-eye features, the electronic device 100 can also use a Transformer encoder to reconstruct the left-eye features and right-eye features output by the skip-CNN, and finally output the reconstructed left-eye features and right-eye features. The electronic device 100 can then obtain reconstructed eye features based on the reconstructed left-eye and right-eye features, and use the reconstructed eye features together with the facial features for prediction to determine the gaze point, thereby further improving the accuracy of gaze point recognition. The combination of the skip-CNN and the Transformer encoder is also referred to as a hybrid transformer.

Figure 8 is a flow chart of the electronic device 100 reconstructing eye features using the Transformer encoder, provided by an embodiment of this application.

S201. Convert the eye features from image format into sequence format.

S202. Perform position encoding on the image-format eye features during the conversion, and obtain a position tensor of the eye features.

S203. Input the sequence-format eye features and the corresponding position tensor into the Transformer encoder to obtain the reconstructed eye features.

A tensor indicates how data is organized. In the tensor [b, c, H, W], b (batch size) indicates the number of samples, c (channel) indicates the number of channels, and H and W indicate the height and width of the data, respectively.

First, the tensor of the original eye image input to the skip-CNN can be expressed as image[b, c, H, W]. For example, the tensor of a 64×64 RGB image can be expressed as image[1, 3, 64, 64]. After processing by the skip-CNN, the tensor of the eye features output by the skip-CNN can be expressed as feature_map[b, c, H, W]. Features in the feature_map[b, c, H, W] format are also called image-format features.

When executing S201, the electronic device 100 can first convert feature_map[b, c, H, W] into feature_map1[b, c, H*W], and then rearrange the dimensions of feature_map1 to obtain feature_map2[H*W, b, c]. feature_map2 is also called a sequence-format feature. Meanwhile, the electronic device 100 can randomly generate a special token cls_token[1, b, c]. The electronic device 100 can then concatenate feature_map2 and cls_token to obtain feature_maps3[H*W+1, b, c].

Based on the H*W+1 dimension of feature_maps3, the electronic device 100 can perform ordered position encoding from 0 to H*W+1 to obtain the position tensor pos_feature[H*W+1, c].

The electronic device 100 can then input the sequence-format feature feature_maps3 and the position tensor pos_feature into the Transformer encoder. After processing by the Transformer encoder, the electronic device 100 obtains the reconstructed eye features, whose tensor can be expressed as feature_seq[H*W+1, b, c]. The electronic device 100 can then rearrange the dimensions of the reconstructed eye features; the rearranged tensor can be expressed as feature_out[b, c, H*W+1].
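A minimal sketch of the tensor manipulations in S201–S203 is given below. The tensor names follow the description above; whether cls_token is prepended or appended, and the exact form of the ordered position encoding, are not stated in the text and are assumptions here, and the encoder call itself is only a placeholder for the Transformer encoder sketched after Table 7.

```python
# Hypothetical sketch of steps S201-S203 (shapes only; not the patent's implementation).
import torch

b, c, H, W = 1, 168, 10, 10                        # e.g. skip-CNN output for a 64x64 IR image
feature_map = torch.randn(b, c, H, W)               # image-format feature

# S201: image format -> sequence format, then concatenate a randomly generated cls_token.
feature_map1 = feature_map.reshape(b, c, H * W)      # [b, c, H*W]
feature_map2 = feature_map1.permute(2, 0, 1)         # [H*W, b, c]
cls_token = torch.randn(1, b, c)
feature_maps3 = torch.cat([cls_token, feature_map2], dim=0)   # [H*W+1, b, c]

# S202: ordered position encoding over the H*W+1 sequence positions.
pos_feature = torch.arange(H * W + 1).unsqueeze(1).repeat(1, c).float()   # [H*W+1, c]

# S203: feed sequence + positions to the Transformer encoder, then rearrange the output.
feature_seq = feature_maps3 + pos_feature.unsqueeze(1)   # [H*W+1, b, c], stand-in for the encoder
feature_out = feature_seq.permute(1, 2, 0)               # [b, c, H*W+1]
print(feature_out.shape)                                  # torch.Size([1, 168, 101])
```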

Table 6 exemplarily shows how the tensors of a 64×64 IR image and a 64×64 RGB image change during the feature reconstruction process.

Table 6

Figure 9 is a schematic structural diagram of the Transformer encoder provided by an embodiment of this application.

As shown in Figure 9, the Transformer encoder includes Norm, multi-head attention, and MLP blocks. Norm uses the LayerNorm method for data normalization. Multi-head attention comprises multiple (N) scaled dot-product attention heads, which enrich the feature space and increase feature diversity. The MLP consists of two FC layers and is used to reduce the dimensionality expanded by multi-head attention. The Transformer encoder stacks L layers of this Norm, multi-head attention, and MLP structure. The Transformer encoder is an existing component, and the functions and data-processing procedures of its constituent methods are not described further in this application.

In this embodiment of the application, the main parameters of the Transformer encoder are shown in the following table:

Table 7

Layers L    Heads N    dim of Q,K,V            dim of hidden linear layers
4           2          168 (IR) / 184 (RGB)    512
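As a rough sketch, the parameters of Table 7 can be instantiated with a stock Transformer encoder as shown below. Whether the patent's encoder maps one-to-one onto torch.nn.TransformerEncoder's internals is an assumption; only the layer count, head count, model dimension, and hidden FC dimension are taken from Table 7.

```python
# Hypothetical instantiation of the Table 7 parameters (4 layers, 2 heads, d_model 168/184, FC dim 512).
import torch
import torch.nn as nn

d_model = 168                                      # 184 for RGB features
encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=2, dim_feedforward=512,
    norm_first=True)                               # Norm -> attention -> MLP ordering, as in Figure 9
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)

seq = torch.randn(101, 1, d_model)                 # [H*W+1, b, c] sequence-format eye feature
print(encoder(seq).shape)                          # torch.Size([101, 1, 168])
```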

Here, the scaled dot-product self-attention formula is as follows:
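The formula itself is not reproduced in this text; the standard scaled dot-product attention used in Transformer encoders, which this passage appears to reference, is:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where Q, K, and V are the query, key, and value matrices and d_k is their dimension (168 for IR features or 184 for RGB features, per Table 7).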

Figure 10 is a schematic diagram of another eye gaze recognition algorithm provided by an embodiment of this application.

As shown in Figure 10, before the electronic device 100 acquires image X, the electronic device 100 can also acquire a calibrated image Y. The calibration is the gaze point (x0, y0) corresponding to image Y, also called the reference gaze point.

The electronic device 100 can use any of the feature extractors introduced above (CNN-A, skip-CNN, or the hybrid transformer) to process the left-eye image and right-eye image in image Y to obtain the left-eye features and right-eye features of image Y, and further concatenate them with the acquired facial features to obtain the reference features. The reference features are all the features obtained from image Y.

The query features of image X are obtained by following the same procedure: the electronic device 100 acquires the left-eye features, right-eye features, and facial features of image X. In particular, when obtaining the left-eye and right-eye features of image X, the electronic device 100 must use the same feature extractor as was used for image Y. For example, if the electronic device 100 uses CNN-A to obtain the left-eye and right-eye features of image Y, then after acquiring image X it must also use CNN-A to obtain the left-eye and right-eye features of image X; if it uses the skip-CNN for image Y, then after acquiring image X it must also use the skip-CNN to obtain the left-eye and right-eye features of image X.

The electronic device 100 can then concatenate the query features with the reference features and input the concatenated features into the FC, obtaining the gaze point distance (Δx, Δy) between the target gaze point and the reference gaze point. Here, the target gaze point is the gaze point corresponding to image X that the eye gaze recognition algorithm needs to predict. Combining the reference gaze point and the gaze point distance, the electronic device 100 can determine the target gaze point:

(x, y) = (x0, y0) + (Δx, Δy).

The FC shown in Figure 10 and the FC shown in Figure 1 are trained on different training data; correspondingly, their inputs and outputs differ at inference time. Specifically, in the method shown in Figure 1, the input of the FC is the query features of image X and the output is the target gaze point; in the method shown in Figure 10, the input of the FC is the query features of image X together with the reference features of image Y, and the output is the gaze point distance. Combining the reference gaze point corresponding to image Y with this gaze point distance, the electronic device 100 further determines the target gaze point.
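A minimal sketch of this calibration-based prediction is shown below. The feature dimension and the single-linear-layer FC head are assumptions; only the overall flow (concatenate query and reference features, regress a delta, add it to the reference gaze point) follows the description above.

```python
# Hypothetical sketch of the Figure 10 flow (illustrative values and dimensions).
import torch
import torch.nn as nn

feat_dim = 512                                    # assumed size of the query/reference features
fc = nn.Linear(2 * feat_dim, 2)                   # outputs (dx, dy)

query_feat = torch.randn(1, feat_dim)             # features of image X
ref_feat = torch.randn(1, feat_dim)               # features of the calibrated image Y
ref_point = torch.tensor([[3.0, 5.0]])            # (x0, y0), the reference gaze point

delta = fc(torch.cat([query_feat, ref_feat], dim=1))   # (dx, dy)
target_point = ref_point + delta                        # (x, y) = (x0, y0) + (dx, dy)
print(target_point)
```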

Figures 11A-11B are schematic diagrams of a set of cumulative distribution functions (CDFs) provided by an embodiment of this application.

Figure 11A shows the error CDFs of the eye gaze recognition algorithms using CNN-A and using the skip-CNN, provided by an embodiment of this application.

In Figure 11A, the horizontal axis represents the error between the recognized gaze point and the actual gaze point, and the vertical axis represents the cumulative distribution percentage. For example, the cumulative distribution percentage P corresponding to an error X indicates that the proportion of prediction results whose error is within X, among all prediction results, is P. A higher P indicates a better gaze point recognition effect for that method.
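For illustration, the cumulative distribution percentage at a given error threshold can be computed as follows; the error values in this snippet are made up for the example and are not taken from the patent's evaluation.

```python
# Illustrative computation of the cumulative distribution percentage at an error threshold.
import numpy as np

errors_cm = np.array([0.4, 1.2, 1.8, 2.5, 0.9, 3.1, 1.5, 2.0])   # made-up per-sample gaze errors
threshold = 1.9
p = np.mean(errors_cm <= threshold)    # fraction of predictions within 1.9 cm
print(p)                               # 0.625 for this made-up data
```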

Gaze points with an error within 1.9 cm are acceptable. As shown in Figure 11A, at an error of 1.9 cm, the curves in order of percentage from high to low are:

the skip-CNN (RGB) curve (i.e., delta_contrast_skip_connection_all_data_0215-RGB);

the CNN-A (RGB) curve (i.e., delta_contrast_all_data_0215-RGB);

the skip-CNN (IR) curve (i.e., delta_contrast_skip_connection_all_data_0215-IR);

the CNN-A (IR) curve (i.e., delta_contrast_all_data_0215-IR).

The corresponding cumulative distribution percentages are 0.567, 0.547, 0.513, and 0.503, respectively.

It can be seen that, for both IR images and RGB images, the eye gaze recognition algorithm using the skip-CNN has a higher cumulative distribution percentage than the one using CNN-A, i.e., a better gaze point recognition effect.

Figure 11B shows the error CDFs of the eye gaze recognition algorithm using the skip-CNN alone and the one using the skip-CNN together with the Transformer encoder, provided by an embodiment of this application.

Again taking an error of 1.9 cm as an example, at an error of 1.9 cm the curves in order of percentage from high to low are:

the skip-CNN + Transformer (RGB) curve (i.e., delta_contrast_hybrid_transformer_all_data_0215-RGB);

the skip-CNN (RGB) curve (i.e., delta_contrast_skip_connection_all_data_0215-RGB);

the skip-CNN + Transformer (IR) curve (i.e., delta_contrast_hybrid_transformer_all_data_0215-IR);

the skip-CNN (IR) curve (i.e., delta_contrast_skip_connection_all_data_0215-IR).

The corresponding cumulative distribution percentages are 0.591, 0.567, 0.544, and 0.513, respectively.

It can be seen that, for both IR images and RGB images, the eye gaze recognition algorithm using the hybrid transformer has a higher cumulative distribution percentage than the method using the skip-CNN alone, i.e., a better gaze point recognition effect.

In the above embodiments:

Image X may be called the first image; the left-eye image cropped from image X may be called the first left-eye image, the right-eye image the first right-eye image, and the face image the first face image. The left-eye feature obtained by processing the left-eye image of image X using CNN-A, the skip-CNN, or the hybrid transformer may be called the first left-eye feature; the right-eye feature obtained by processing the right-eye image of image X may be called the first right-eye feature; and the facial feature obtained from the face image of image X may be called the first facial feature. The query feature obtained by concatenating the left-eye feature, right-eye feature, and facial feature of image X may be called the first feature;

Image Y may be called the second image; the left-eye image cropped from image Y may be called the second left-eye image, the right-eye image the second right-eye image, and the face image the second face image. The left-eye feature obtained by processing the left-eye image of image Y using CNN-A, the skip-CNN, or the hybrid transformer may be called the second left-eye feature; the right-eye feature obtained by processing the right-eye image of image Y may be called the second right-eye feature; and the facial feature obtained from the face image of image Y may be called the second facial feature. The reference feature obtained by concatenating the left-eye feature, right-eye feature, and facial feature of image Y may be called the second feature;

Here, the CNN-A, skip-CNN, and hybrid transformer used for the eye features may all be called the first feature extractor; the CNN-B used to process the face image may be called the second feature extractor;

The convolutional layer C1 shown in Figure 3 may be called the first convolutional layer, convolutional layer C2 the second convolutional layer, convolutional layer C3 the third convolutional layer, pooling layer P1 the first pooling layer, and pooling layer P2 the second pooling layer; any one of the above convolutional or pooling layers may be called a processing layer;

Downsampler 1 shown in Figure 7 may be called the first downsampler, downsampler 2 the second downsampler, downsampler 3 the third downsampler, downsampler 4 the fourth downsampler, downsampler 5 the fifth downsampler, and downsampler 6 the sixth downsampler;

In the application scenario shown in Figures 5A-5B, the action of opening the camera application and displaying the main interface of the camera application shown in Figure 5B may be called the first action; in the application scenario shown in Figures 6A-6B, the action of displaying the notification bar shown in Figure 6B may also be called the first action.

Figure 12 shows a schematic diagram of the hardware structure of the electronic device 100.

The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, antenna 1, antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headphone interface 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and so on.

It can be understood that the structure illustrated in this embodiment of the invention does not constitute a specific limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc. The different processing units may be independent devices or may be integrated in one or more processors.

The controller can generate operation control signals according to instruction operation codes and timing signals, thereby controlling instruction fetching and execution. The processor 110 may also be provided with a memory for storing instructions and data.

The charging management module 140 is used to receive charging input from a charger. The power management module 141 is used to connect the battery 142, the charging management module 140, and the processor 110.

The wireless communication function of the electronic device 100 can be implemented through antenna 1, antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and so on.

The electronic device 100 implements display functions through the GPU, the display screen 194, the application processor, and so on. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform the mathematical and geometric calculations required for graphics rendering. The display screen 194 is used to display images, videos, and the like. In this embodiment of the application, the electronic device 100 displays interactive interfaces, such as the user interfaces shown in Figures 5A-5B and 6A-6B, through the display functions provided by the GPU, the display screen 194, and the application processor.

The electronic device 100 can implement shooting functions through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and so on. The camera 193 is used to capture still images or video. In this embodiment of the application, the camera 193 includes but is not limited to a wide-angle camera, an ultra-wide-angle camera, a telephoto camera, a structured-light depth-sensing camera, an infrared camera, and so on. Based on these different types of cameras, the electronic device 100 can acquire input images with different color structures.

The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, it can process input information quickly and can also learn continuously. Applications such as intelligent cognition of the electronic device 100 can be realized through the NPU. In this embodiment of the application, the electronic device 100 can execute the eye gaze recognition algorithm through the NPU, and then recognize the user's eye gaze position from the collected facial images of the user.

The internal memory 121 may include one or more random access memories (RAM) and one or more non-volatile memories (NVM).

The RAM can be read and written directly by the processor 110; it can be used to store executable programs (for example, machine instructions) of the operating system or other running programs, and can also be used to store data of users and applications. The NVM can also store executable programs and data of users and applications, which can be loaded into the RAM in advance for the processor 110 to read and write directly. In this embodiment of the application, the application code corresponding to the eye gaze recognition algorithm can be stored in the NVM. When the eye gaze recognition algorithm is run to identify the eye gaze point, the application code corresponding to the eye gaze recognition algorithm can be loaded into the RAM.

The external memory interface 120 can be used to connect an external non-volatile memory to expand the storage capacity of the electronic device 100.

The electronic device 100 can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, the application processor, and so on.

The pressure sensor 180A is used to sense pressure signals and can convert pressure signals into electrical signals. The gyroscope sensor 180B can be used to determine the angular velocity of the electronic device 100 around three axes (i.e., the x, y, and z axes). The acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally three axes). The air pressure sensor 180C is used to measure air pressure. The magnetic sensor 180D includes a Hall sensor and can be used to detect the opening and closing of a flip leather case. The distance sensor 180F is used to measure distance. The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector, and can be used to detect scenarios in which the user holds the electronic device 100 close to the body. The ambient light sensor 180L is used to sense ambient light brightness. The fingerprint sensor 180H is used to collect fingerprints. The temperature sensor 180J is used to detect temperature. The bone conduction sensor 180M can acquire vibration signals. The touch sensor 180K is also called a "touch device". The touch sensor 180K can be arranged on the display screen 194; the touch sensor 180K and the display screen 194 form a touch screen, also called a "touchscreen". The buttons 190 include a power button, volume buttons, and the like. The motor 191 can generate vibration prompts. The indicator 192 may be an indicator light. The SIM card interface 195 is used to connect a SIM card.

The term "user interface (UI)" in the specification, claims, and drawings of this application is a medium interface for interaction and information exchange between an application or the operating system and the user; it realizes the conversion between an internal form of information and a form acceptable to the user. The user interface of an application is source code written in specific computer languages such as Java and extensible markup language (XML); the interface source code is parsed and rendered on the terminal device and is finally presented as content the user can recognize, such as pictures, text, buttons, and other controls. A control, also called a widget, is a basic element of a user interface; typical controls include toolbars, menu bars, text boxes, buttons, scroll bars, pictures, and text. The attributes and contents of the controls in an interface are defined through tags or nodes; for example, XML specifies the controls contained in an interface through nodes such as <Textview>, <ImgView>, and <VideoView>. A node corresponds to a control or attribute in the interface; after parsing and rendering, the node is presented as content visible to the user. In addition, the interfaces of many applications, such as hybrid applications, usually also contain web pages. A web page, also called a page, can be understood as a special control embedded in an application interface. A web page is source code written in specific computer languages, such as hypertext markup language (HTML), cascading style sheets (CSS), and JavaScript (JS); the web page source code can be loaded and displayed as user-recognizable content by a browser or a web page display component with functionality similar to a browser. The specific content contained in a web page is also defined through tags or nodes in the web page source code; for example, HTML defines the elements and attributes of a web page through <p>, <img>, <video>, and <canvas>.

A commonly used form of user interface is the graphical user interface (GUI), which refers to a user interface related to computer operations that is displayed graphically. It can be an icon, window, control, or other interface element displayed on the display screen of the terminal device, where the controls can include visual interface elements such as icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, and widgets.

As used in the specification and appended claims of this application, the singular forms "a", "an", "the", "the above", "said", and "this" are intended to also include plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this application refers to and encompasses any and all possible combinations of one or more of the listed items. As used in the above embodiments, the term "when" may be interpreted, depending on the context, to mean "if", "after", "in response to determining", or "in response to detecting". Similarly, depending on the context, the phrases "when determining" or "if (a stated condition or event) is detected" may be interpreted to mean "if it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)".

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center by wired (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless (for example, infrared, radio, or microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, DVDs), semiconductor media (for example, solid-state drives), and the like.

Those of ordinary skill in the art can understand that all or part of the procedures in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program. The program can be stored in a computer-readable storage medium, and when executed, the program may include the procedures of the above method embodiments. The aforementioned storage media include ROM, random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Claims (16)

1. An eye gaze recognition method applied to an electronic device comprising a screen, the method comprising:

acquiring a first image;

acquiring a first left eye image, a first right eye image and a first face image from the first image;

acquiring a first left-eye feature from the first left-eye image, acquiring a first right-eye feature from the first right-eye image, and acquiring a first face feature from the first face image;

combining the first left eye feature, the first right eye feature and the first face feature to obtain a first feature;

and determining a target fixation point on the screen, wherein the target fixation point is obtained according to the first feature.

2. The method according to claim 1, wherein the method further comprises:

encoding the first left-eye feature and the first right-eye feature with a Transformer encoder;

the combining the first left eye feature, the first right eye feature and the first face feature to obtain a first feature specifically includes:

and combining the encoded first left-eye feature, the encoded first right-eye feature and the first face feature to obtain the first feature.

3. The method according to claim 1 or 2, wherein the electronic device is preset with a first feature extractor and a second feature extractor, the acquiring a first left-eye feature from the first left-eye image, acquiring a first right-eye feature from the first right-eye image, and acquiring a first face feature from the first face image specifically includes:

acquiring the first left-eye feature from the first left-eye image and the first right-eye feature from the first right-eye image by using the first feature extractor; and acquiring the first face features from the first face image by using the second feature extractor.

4. The method of claim 3, wherein the first feature extractor comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, a third convolution layer;

wherein the size of the convolution kernel of the first convolution layer is 7×7 and the step size is 1; the size of the convolution kernel of the second convolution layer is 5×5 and the step size is 1; the size of the convolution kernel of the third convolution layer is 3×3 and the step size is 1; and the size of the pooling kernel of the first pooling layer and the second pooling layer is 2×2 and the step size is 2.

5. The method of claim 3, wherein the first feature extractor comprises a plurality of processing layers and one or more downsamplers, the processing layers comprising a convolution layer and a pooling layer.

6. The method of claim 5, wherein the plurality of processing layers comprises a first processing layer, wherein the one or more downsamplers comprises a downsampler i having an input identical to the input of the first processing layer, and wherein an output of the downsampler i is used for stitching with an output of a second processing layer; the second processing layer is the same as the first processing layer, or the second processing layer is a processing layer after the first processing layer.

7. The method of claim 5, wherein the plurality of processing layers comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a third convolution layer.

8. The method of claim 7, wherein the plurality of downsamplers comprises a first downsampler, a second downsampler, a third downsampler;

the input of the first downsampler is the same as the input of the first convolution layer, and the output of the first downsampler is used to be spliced with the output of the first convolution layer and input to the first pooling layer; the input of the second downsampler is the same as the input of the first pooling layer, and the output of the second downsampler is used to be spliced with the output of the second convolution layer and input to the second pooling layer; the input of the third downsampler is the same as the input of the second pooling layer, and the output of the third downsampler is used to be spliced with the output of the third convolution layer and output.

9. The method of claim 7 or 8, wherein the plurality of downsamplers further comprises a fourth downsampler, a fifth downsampler;

the input of the fourth downsampler is the same as the input of the first convolution layer, and the output of the fourth downsampler is used to be spliced with the output of the second convolution layer and input to the second pooling layer; the input of the fifth downsampler is the same as the input of the first pooling layer, and the output of the fifth downsampler is used to be spliced with the output of the third convolution layer and output.

10. The method of any of claims 7-9, wherein the plurality of downsamplers further comprises a sixth downsampler having an input identical to the input of the first convolution layer, and wherein the output of the sixth downsampler is used to be spliced with the output of the third convolution layer and output.

11. The method according to any one of claims 3 to 10, wherein,

the number of input channels of the first feature extractor is 1, and the number of output channels is 168;

or, the number of input channels of the first feature extractor is 3, and the number of output channels is 184.

12. The method of any one of claims 1-11, wherein prior to the acquiring the first image, the method further comprises:

acquiring a second image;

acquiring a second left eye image, a second right eye image and a second face image from the second image;

acquiring a second left eye feature from the second left eye image, acquiring a second right eye feature from the second right eye image, and acquiring a second face feature from the second face image;

combining the second left eye feature, the second right eye feature and the second face feature to obtain a second feature;

wherein the target fixation point is further derived from the second feature.

13. The method according to any one of claims 1-12, further comprising:

determining a hot zone corresponding to the target fixation point;

and executing a first action corresponding to the hot zone.

14. An electronic device comprising one or more processors and one or more memories; wherein the one or more memories are coupled to the one or more processors, the one or more memories for storing computer program code comprising computer instructions that, when executed by the one or more processors, cause the method of any of claims 1-13 to be performed.

15. A chip system for application to an electronic device, the chip system comprising one or more processors configured to invoke computer instructions to cause performance of the method of any of claims 1-13.

16. A computer readable storage medium comprising instructions which, when run on an electronic device, cause the method of any one of claims 1-13 to be performed.

CN202310793444.4A 2023-06-29 2023-06-29 Eyeball fixation recognition method and electronic equipment Pending CN117690180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310793444.4A CN117690180A (en) 2023-06-29 2023-06-29 Eyeball fixation recognition method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117690180A true CN117690180A (en) 2024-03-12

Family

ID=90127252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310793444.4A Pending CN117690180A (en) 2023-06-29 2023-06-29 Eyeball fixation recognition method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117690180A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020199593A1 (en) * 2019-04-04 2020-10-08 平安科技(深圳)有限公司 Image segmentation model training method and apparatus, image segmentation method and apparatus, and device and medium
CN112183200A (en) * 2020-08-25 2021-01-05 中电海康集团有限公司 Eye movement tracking method and system based on video image
CN114743277A (en) * 2022-04-22 2022-07-12 南京亚信软件有限公司 Liveness detection method, device, electronic device, storage medium and program product
CN115424318A (en) * 2022-08-09 2022-12-02 华为技术有限公司 Image identification method and device
CN115209057A (en) * 2022-08-19 2022-10-18 荣耀终端有限公司 Shooting focusing method and related electronic equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
GAO HUANG ET AL: "Densely Connected Convolutional Networks", 《ARXIV:1608.06993V5》, 28 January 2018 (2018-01-28), pages 1 - 9 *
RATHEESH KALAROT ET AL: "Component Attention Guided Face Super-Resolution Network: CAGFace", 《IEEE XPLORE》, 31 December 2020 (2020-12-31), pages 370 - 380 *
珺毅同学: "DenseNet: 密集连接堆叠型网络" [DenseNet: densely connected stacked network], pages 1, Retrieved from the Internet <URL:https://zhuanlan.zhihu.com/p/509160021> *

Similar Documents

Publication Publication Date Title
CN113453040B (en) 2023-03-10 Short video generation method and device, related equipment and medium
CN111738122B (en) 2023-08-22 Image processing method and related device
WO2021052458A1 (en) 2021-03-25 Machine translation method and electronic device
WO2023160170A1 (en) 2023-08-31 Photographing method and electronic device
KR20160055337A (en) 2016-05-18 Method for displaying text and electronic device thereof
CN113642359A (en) 2021-11-12 Face image generation method and device, electronic equipment and storage medium
CN117011156A (en) 2023-11-07 Image processing method, device, equipment and storage medium
WO2024179101A1 (en) 2024-09-06 Photographing method
CN115643485B (en) 2023-10-24 Photography methods and electronic equipment
WO2022143314A1 (en) 2022-07-07 Object registration method and apparatus
CN118312035A (en) 2024-07-09 Display method and electronic equipment
CN112528760B (en) 2024-01-09 Image processing method, device, computer equipment and medium
CN117690180A (en) 2024-03-12 Eyeball fixation recognition method and electronic equipment
CN116055867B (en) 2023-11-24 A photographing method and electronic device
CN117710541A (en) 2024-03-15 Method, device and equipment for generating audio-driven three-dimensional face animation model
CN117472256B (en) 2024-08-23 Image processing method and electronic device
CN117974711B (en) 2024-07-26 Video frame insertion method and related equipment
CN117764853B (en) 2024-07-05 Face image enhancement method and electronic equipment
CN116055861B (en) 2023-10-20 Video editing method and electronic equipment
CN117690130B (en) 2024-09-06 Image title generation method and related device
CN117076702B (en) 2023-12-15 Image search method and electronic device
EP4372518A1 (en) 2024-05-22 System, song list generation method, and electronic device
CN117692714B (en) 2024-11-08 Video display method, electronic device, computer program product, and storage medium
WO2022100602A1 (en) 2022-05-19 Method for displaying information on electronic device, and electronic device
WO2024027570A1 (en) 2024-02-08 Interface display method and related apparatus

Legal Events

Date Code Title Description
2024-03-12 PB01 Publication
2024-03-29 SE01 Entry into force of request for substantive examination
2025-02-25 CB02 Change of applicant information

Country or region after: China

Address after: Unit 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong 518040

Applicant after: Honor Terminal Co.,Ltd.

Address before: 3401, unit a, building 6, Shenye Zhongcheng, No. 8089, Hongli West Road, Donghai community, Xiangmihu street, Futian District, Shenzhen, Guangdong

Applicant before: Honor Device Co.,Ltd.

Country or region before: China