CN107330027B - Weak supervision depth station caption detection method - Google Patents
Weak supervision depth station caption detection method

Publication number
- CN107330027B (application CN201710485397.1A)

Authority
- CN (China)

Prior art keywords
- station caption
- station
- network
- caption
- picture

Prior art date
- 2017-06-23

Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G06F16/743 — Information retrieval of video data; browsing or visualisation of a collection of video files or sequences
- G06F16/7335 — Graphical querying, e.g. query-by-region, query-by-sketch, query-by-trajectory, GUIs for designating a person/face/object as a query predicate
- G06F18/23213 — Non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G06F18/24 — Classification techniques
- G06V30/194 — Character recognition using electronic means with references adjustable by an adaptive method, e.g. learning
Abstract
The invention provides a weakly supervised deep station caption detection method. Its steps are: preprocess massive network video data files to obtain a large data set only marking the station caption category and a small data set only marking the station caption position; input the small data set into a station caption positioning network for training, obtaining a positioning network that can predict station caption regions; input the large data set into the trained positioning network to obtain several predicted station caption regions for each picture in the large data set, and input these predicted regions into a station caption classifying network for training, obtaining a classifying network that can classify station captions; apply the same partial preprocessing to the video to be detected, and input the resulting pictures into the trained positioning network to obtain their predicted station caption regions; input the predicted station caption regions into the trained classifying network to obtain the station caption position and category of each picture.
Description
Technical Field
The invention relates to the field of deep learning, and in particular to a weakly supervised deep station caption detection method.
Background
With the development of the Internet and the rise of multimedia technology, network video carries more and more content and has become a main content carrier of the big data era. Different video sources tend to present different content information. By detecting video station captions, network video data can be managed more effectively, the sources and content of videos can be grasped in advance, and videos containing harmful information can be supervised. Therefore, video station caption detection has strong practical significance and research value.
Station caption data exist widely in network videos, and station caption detection operates on key frames extracted from these videos. Compared with general object detection, station caption detection has its own particularities: the detection target appears at a relatively fixed position and occupies a small proportion of each frame. Most existing object detection data sets do not have these characteristics, so the particularity of station caption data requires collecting and preprocessing a large amount of data before detection. For such particular data, detecting quickly, efficiently, and accurately is the key to completing the station caption detection task.
At present, station caption detection is mainly realized by the following methods:
(1) Template matching based methods
Template matching is an intuitive method: it judges the similarity between a local area of an image frame and a station caption template according to a certain similarity criterion, and thereby judges whether the area contains a station caption. Because the matching degree between the template and every candidate area must be computed one by one, template matching involves a large amount of computation.
(2) Feature matching based methods
Extracting image features to measure similarity is one of the mainstream approaches. Moment features, also called invariant moments, have rotation, scale, and translation invariance that common features lack, and are widely applied in the field of image analysis. SIFT and SURF features are also widely used. However, such hand-crafted feature extraction does not yield good detection results.
(3) Neural network based methods
Station caption identification based on neural networks is now popular and mainstream. In detection tasks, feature extraction with a neural network achieves better results than hand-crafted feature extraction. However, most classical neural-network-based detection models require a large number of labels, which is labor- and time-consuming.
In summary, the above station caption detection methods all have various disadvantages. Therefore, researching a weakly supervised station caption detection method has important research value and application prospects.
Disclosure of Invention
Aiming at the defects of existing station caption detection methods, the invention provides a weakly supervised deep station caption detection method. The method effectively remedies the low efficiency and poor results of existing station caption detection methods, and greatly reduces the time and labor required for labeling.
To address these defects, the technical scheme adopted by the invention is as follows:
A weak supervision depth station caption detection method comprises the following steps:
1) preprocessing massive network video data files to obtain a large data set only marking the station caption category and a small data set only marking the station caption position (ground truth);
2) organizing a station caption positioning network and a station caption classifying network according to a weakly supervised framework, inputting the small data set into the station caption positioning network for training to obtain a station caption positioning network capable of predicting a station caption area;
3) inputting the large data set into the trained station caption positioning network to obtain a plurality of predicted station caption areas of each picture in the large data set, and inputting the plurality of predicted station caption areas of each picture into a station caption classifying network for training to obtain a station caption classifying network capable of classifying the station captions;
4) performing partial preprocessing on the video to be detected, which is the same as the preprocessing in the step 1), and inputting the picture obtained after preprocessing into the station caption positioning network trained in the step 2) to obtain a predicted station caption area of the picture;
5) inputting the predicted station caption area of the picture into the trained station caption classification network in the step 3) to obtain the station caption position and the station caption category of the picture.
Further, the step 1) specifically comprises:
1-1) deduplicating the massive network video data files according to their MD5 (Message-Digest Algorithm 5) codes;
1-2) extracting a plurality of key frames from the network video after the duplication removal by using a key frame extraction method;
1-3) performing M-grid segmentation on each network video key frame and keeping only the 1/M picture blocks located at the corners, wherein M takes its value according to the size and position of the station caption;
1-4) classifying all the pictures to obtain pictures with station captions;
1-5) performing data enhancement (data augmentation) on the pictures with station captions using traditional methods;
1-6) carrying out balanced distribution on the image with the station caption after the data enhancement according to the station caption category to obtain a big data set only marking the station caption category and a small data set only marking the station caption position.
Further, the MD5 code deduplication in step 1-1) means: judging whether network video data files are duplicates by comparing their MD5 values; among several network videos with the same MD5 value, only one is kept and the rest are removed.
Further, in step 1-3), nine-grid segmentation is performed on each network video key frame, and only the one-ninth picture blocks located at the four corners are retained.
Further, a classifier based on a Convolutional Neural Network (CNN) is used for the classification in step 1-4); if N classes of station captions are to be detected, the classifier performs (N+1)-way classification, which includes a background class. The traditional methods in step 1-5) include, but are not limited to, geometric transformation, smoothing filtering, JPEG compression, and contrast and brightness adjustment.
Further, the partial preprocessing in step 4) includes only steps 1-2) and 1-3) of step 1).
Further, the station caption positioning network in step 2) is based on the RPN (Region Proposal Network); the sizes and aspect ratios of the anchor boxes are selected in step 2) with a K-means clustering method; the anchor boxes are the initial prediction boxes generated by the station caption positioning network at each localization center point.
Further, in step 2), after each picture in the small data set has been input into the station caption positioning network and trained for several rounds, the newly generated positioning network is used to obtain the confidence corresponding to each predicted station caption region; the IOU (intersection over union) between each predicted region and the labeled station caption position is calculated and compared with that region's confidence to identify the hard examples in the current state; in the following rounds of training, a bootstrap hard example mining method preferentially selects these hard examples for training; this is repeated until the positioning network converges, yielding a station caption positioning network capable of predicting station caption regions.
Further, the station caption classifying network in the step 3) is a station caption classifying network based on Fast RCNN.
Further, in step 3), inputting each picture in the big data set into a trained station caption positioning network to obtain a plurality of predicted station caption areas of each picture, inputting the plurality of predicted station caption areas of each picture into a station caption classifying network for training to obtain the confidence coefficient of each predicted station caption area, and dividing the predicted station caption areas into N types of foreground areas (namely N types of station captions) and background areas according to the confidence coefficient of each predicted station caption area to obtain a station caption classifying network capable of classifying the station captions; and the category of the foreground area is marked according to the category of the station caption on the corresponding picture.
The invention has the advantages that:
1) The invention uses an object detection method based on convolutional neural networks; with a trained model, it can accurately and quickly detect whether a frame contains station caption data.
2) CNN-based object detection methods are usually trained and evaluated on public data sets such as PASCAL VOC and ILSVRC, whose target objects do not have the characteristics of station caption data. To improve detection on station caption data, the invention collects a large amount of station caption sample data (i.e., network video data files) and, during model training, performs a series of data preprocessing steps according to the characteristics of this data, improving data processing efficiency and station caption detection performance.
3) When training the station caption positioning network, a clustering method and a hard example mining method are used, improving the accuracy and recall of station caption localization.
4) Classical object detection methods often require a large number of manual labels. To address this problem, the present invention uses a weakly supervised framework. The framework only requires station caption positions to be labeled on a small portion of the data; the remaining massive data only needs station caption category labels, and station caption positions can be generated by the trained station caption positioning network. This greatly reduces the amount of data labeling, saving the time and labor required for annotation.
Drawings
FIG. 1 is a flow chart of the training of a station caption detection model provided by the present invention;
FIG. 2 is a flow chart of a method for detecting a weakly supervised depth station caption according to the present invention;
FIG. 3 is a structural diagram of the station caption positioning network provided by the present invention;
FIG. 4 is a comparison graph of station caption positioning effects;
FIG. 5 is a comparison graph of station caption detection effects.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
The invention provides a weakly supervised deep station caption detection method. The training flow of the station caption detection model is shown in FIG. 1, and the flow chart of the method is shown in FIG. 2. The method comprises a training stage and a detection stage. The training stage mainly comprises the following steps:
(1) Deduplicate the massive network video data files according to MD5 codes and retain the valid data, so as to facilitate later data processing and ensure effective training.
(2) Extract several key frames from the deduplicated network videos with a key frame extraction method, perform M-grid segmentation on each key frame, and keep only the four 1/M picture blocks located at the corners. Subsequent steps operate on these corner blocks of the key frames, which are simply referred to as "pictures" below. M takes its value according to the size and position of the station caption; in general, M = 9 gives good results. Depending on the specific situation, M generally takes the value a², where a = 1, 2, 3, 4, 5, 6.
(3) Classify all the pictures to obtain those with station captions: classify the pictures with a simple classifier or network and remove pictures that do not contain a station caption.
(4) Perform data augmentation on the pictures with station captions using traditional methods, avoiding the adverse influence on training caused by imbalanced numbers of samples across station caption categories.
(5) Distribute the augmented pictures with station captions evenly by station caption category into a large data set and a small data set. Only the small data set is labeled with station caption positions and used as training data for the station caption positioning network. The large data set is labeled only with station caption categories and is input into the trained station caption positioning network.
(6) Organize a station caption positioning network and a station caption classifying network according to the weakly supervised framework, and train the RPN-based station caption positioning network on the small data set to obtain a positioning network capable of predicting station caption regions. A K-means clustering method and a hard example mining method are used during training to improve the training effect.
(7) Input the large data set into the trained station caption positioning network to obtain several predicted station caption regions for each category-labeled picture, and use these predicted regions as training data to train a Fast RCNN based station caption classifying network, obtaining a classifying network capable of classifying station captions.
In step (1), MD5 is a message digest algorithm that generates distinct digests for different network video data files. Whether two network video data files are duplicates is judged by comparing their MD5 values; among network videos with the same MD5 value, only one is kept and the rest are discarded.
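To make the deduplication step concrete, the following is a minimal Python sketch of MD5-based deduplication over a directory of video files; the directory layout, file extension, and helper names are illustrative assumptions rather than part of the patent.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate_videos(video_dir):
    """Keep only the first video seen for each MD5 value; return retained paths."""
    seen = {}
    for path in sorted(Path(video_dir).glob("*.mp4")):  # file extension is an assumption
        digest = md5_of_file(path)
        if digest not in seen:
            seen[digest] = path
    return list(seen.values())
```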
In step (2), the invention uses an open-source key frame extraction method that accurately extracts network video key frames; usually about 100 key frames are extracted from one network video. In these key frames, the station caption has the obvious characteristic of appearing at relatively fixed positions (the four corners), and each corner usually contains at most one station caption that occupies a small proportion of the whole picture. Therefore, in this embodiment, each key frame is divided into nine parts, and only the one-ninth picture blocks located at the corners are selected as network input.
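As an illustration of the nine-grid corner cropping described above, here is a small sketch (using OpenCV-style NumPy arrays; function and variable names are hypothetical) that splits a key frame into an M = 9 grid and keeps only the four corner blocks:

```python
import cv2  # assumed available for reading key frame images

def corner_tiles(frame, cells_per_side=3):
    """Split a key frame into a cells_per_side x cells_per_side grid (M = 9 when
    cells_per_side = 3) and return only the four corner tiles, where station
    captions are expected to appear."""
    h, w = frame.shape[:2]
    th, tw = h // cells_per_side, w // cells_per_side
    last = cells_per_side - 1
    corners = [(0, 0), (0, last), (last, 0), (last, last)]
    return [frame[r * th:(r + 1) * th, c * tw:(c + 1) * tw] for r, c in corners]

# Usage sketch: tiles = corner_tiles(cv2.imread("keyframe_0001.jpg"))
```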
In step (3), the invention uses a CNN-based classifier to classify the pictures. If N classes of station captions are to be detected, the classifier performs (N+1)-way classification (including a background class).
In step (4), the invention augments the data by means of geometric transformation, smoothing filtering, JPEG compression, and contrast and brightness adjustment, thereby avoiding excessive differences in the amount of data across station caption categories.
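A possible Pillow-based sketch of the augmentation transforms listed above (a small rotation as the geometric transform, Gaussian smoothing, JPEG re-compression, and contrast/brightness adjustment); the exact parameters are illustrative assumptions:

```python
import io
from PIL import Image, ImageEnhance, ImageFilter  # Pillow assumed available

def augment(img):
    """Return augmented variants of a station caption picture."""
    img = img.convert("RGB")
    variants = [
        img.rotate(5, expand=False),                      # geometric transformation
        img.filter(ImageFilter.GaussianBlur(radius=1)),   # smoothing filtering
        ImageEnhance.Contrast(img).enhance(1.3),          # contrast adjustment
        ImageEnhance.Brightness(img).enhance(0.8),        # brightness adjustment
    ]
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=40)              # JPEG compression artefacts
    buf.seek(0)
    variants.append(Image.open(buf).convert("RGB"))
    return variants
```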
In step (5), much less data is used to train the station caption positioning network than the station caption classifying network. In the invention, 20 pictures containing station captions are selected per category to train the positioning network, and 1000 pictures containing station captions are selected per category to train the classifying network. When labeling the data, only station caption positions are annotated on the small data set and only station caption categories on the large data set, which reduces the labeling workload by roughly a factor of 50.
In step (6), according to the characteristics of the station caption data, the invention improves the station caption positioning network with a clustering method. During training of the positioning network, a sliding window moves over the convolutional feature map and generates many anchor boxes. The sizes, aspect ratios, and number of these anchor boxes are usually chosen through extensive experiments, which takes a lot of time. Instead, the method takes the widths, heights, and aspect ratios of the station caption positions labeled in the small training data set, normalizes them to the range 0-1, and feeds them into a K-means clustering algorithm, which automatically selects the sizes and aspect ratios of the anchor boxes according to the station caption data and thereby yields the predicted station caption regions.
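The anchor selection can be sketched as follows: the labeled box widths and heights are normalized to 0-1 and clustered with K-means, and the cluster centers are used as anchor box shapes. The use of scikit-learn and the choice of k = 9 clusters are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans  # scikit-learn assumed available

def select_anchor_shapes(boxes_wh, img_wh, k=9):
    """Cluster the normalized (width, height) of labeled station caption boxes
    and return k anchor shapes in pixels."""
    wh = np.asarray(boxes_wh, dtype=float) / np.asarray(img_wh, dtype=float)  # 0-1 normalization
    centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wh).cluster_centers_
    return centers * np.asarray(img_wh, dtype=float)
```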
Also in step (6), to train effectively on hard examples and achieve better generalization, the invention improves the station caption positioning network with a hard example mining method. After each position-labeled picture in the small data set has been input into the positioning network and trained for several rounds, the newly generated positioning network is used to obtain the confidence of each predicted station caption region (i.e., the confidence that the predicted region is a true station caption position). The IOU between each predicted region and the labeled station caption position is compared with the corresponding confidence; hard positives, hard negatives, and normal examples are distinguished according to a threshold, and in the next several rounds of training the hard examples are selected preferentially. After several more rounds, hard examples are selected again with the latest positioning network, training continues, and this is repeated until the positioning network converges, yielding a station caption positioning network capable of predicting station caption regions. The threshold is a specific value selected according to the relationship between confidence and IOU, generally chosen in the range 0.5-1.
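The hard example selection can be illustrated with the following sketch, which compares each predicted region's confidence with its IOU against the labeled station caption position and flags the disagreeing cases (confident but badly located, or well located but low-confidence); the box format and the single shared threshold are simplifying assumptions:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def mark_hard_examples(pred_boxes, confidences, gt_box, thr=0.5):
    """Flag hard negatives (high confidence, low IOU) and hard positives
    (low confidence, high IOU) for preferential training in the next rounds."""
    flags = []
    for box, conf in zip(pred_boxes, confidences):
        overlap = iou(box, gt_box)
        flags.append((conf >= thr and overlap < thr) or (conf < thr and overlap >= thr))
    return np.array(flags)
```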
In step (7), several predicted station caption regions are generated for each picture in the large data set using the positioning network trained in step (6). These predicted regions of each category-labeled picture are input into the station caption classifying network for training; the confidence of each predicted region on the picture is obtained, and the predicted regions are divided into N classes of foreground regions and a background class according to their confidences, yielding a station caption classifying network capable of classifying station captions. The foreground class of a region is assigned according to the station caption category labeled on the corresponding picture.
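The weak-supervision step that turns image-level category labels into region-level training labels can be sketched as below; the confidence threshold and function names are illustrative assumptions:

```python
def assign_weak_labels(region_scores, image_logo_class, n_classes, fg_threshold=0.7):
    """Assign a class to every predicted region of a picture that carries only an
    image-level station caption label: confident regions inherit the picture's
    class, the rest become background (class index n_classes)."""
    background = n_classes  # classes 0..n_classes-1 are station captions
    return [image_logo_class if score >= fg_threshold else background
            for score in region_scores]
```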
The procedure for detecting the station caption in the station caption detection model is as follows:
1) processing the video to be detected in the same way as the step (2);
2) inputting the processed picture into a trained station caption positioning network to obtain a predicted station caption area of the picture;
3) inputting the predicted station caption regions into the trained station caption classification network to obtain the station caption position and category of the picture.
The station caption detection model mainly comprises three parts: a data preprocessing module, a station caption positioning network, and a station caption classifying network.
(1) Data pre-processing
Data preprocessing mainly comprises deduplicating the network video data files, extracting key frames, performing M-grid segmentation, and data enhancement. The data preprocessing module takes massive network video data files as input and outputs a number of 1/M-sized key frame picture blocks. It is the initial module of the whole station caption detection model.
(2) Station caption positioning network
The station caption positioning network is based on the RPN; its structure is shown in FIG. 3. A preprocessed picture labeled only with the station caption position is input into the positioning network; the convolutional layers extract features and produce convolutional feature maps, and a sliding window scans the last convolutional feature map (at each window center position, k anchor boxes with different scales and aspect ratios are predicted). The last convolutional layer has 256 feature maps, producing a 256-dimensional fully connected feature vector. Two sibling fully connected layers, for regression and classification, then output the bounding-box regression coordinates and the corresponding confidences of the anchor boxes on each picture. The anchor boxes are regressed according to the bounding-box regression coordinates to obtain predicted station caption regions, and each predicted region is judged to be foreground or background according to its confidence. When the station caption classifying network is trained and tested, the positioning network is used to generate predicted station caption regions, which are then input into the classifying network.
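A minimal PyTorch sketch of the sliding-window head described above (a 3x3 convolution producing a 256-dimensional feature at each position, followed by two sibling 1x1 convolutions for classification and box regression of k anchors); this is an illustrative reading of FIG. 3, not the patented implementation itself:

```python
import torch.nn as nn  # PyTorch assumed as the implementation framework

class RPNHead(nn.Module):
    """Sliding-window RPN head: a 256-d feature per position, then two sibling
    outputs giving 2k objectness scores and 4k box-regression values."""
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # station caption / background per anchor
        self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)  # box deltas per anchor

    def forward(self, feature_map):
        x = self.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)
```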
(3) Station caption classification network
The station caption classifying network is based on Fast RCNN. For N station caption classes (N = 168 in the experiments, i.e., logo_1 ... logo_168 in FIG. 1), its input is the predicted station caption regions generated by the positioning network, and its output is the confidence that each predicted region belongs to each of the N station caption classes (the N foreground classes) or to the background class, thereby determining the station caption class of each predicted region.
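The classifying network can be sketched as a Fast R-CNN style head that RoI-pools each predicted station caption region from the shared feature map and scores it against N station caption classes plus background; the layer sizes here are assumptions chosen for illustration:

```python
import torch.nn as nn
from torchvision.ops import roi_pool  # torchvision assumed available

class LogoClassifierHead(nn.Module):
    """RoI-pool each predicted region, then classify it into N logo classes + background."""
    def __init__(self, in_channels=256, n_logo_classes=168, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * pool_size * pool_size, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, n_logo_classes + 1),  # +1 for the background class
        )

    def forward(self, feature_map, rois, spatial_scale):
        # rois: (num_rois, 5) tensor of (batch_index, x1, y1, x2, y2) in image coordinates
        pooled = roi_pool(feature_map, rois,
                          (self.pool_size, self.pool_size), spatial_scale)
        return self.fc(pooled)  # per-region scores over N + 1 classes
```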
The test environment and experimental data for the weakly supervised deep station caption detection method provided by the invention are as follows:
(1) test environment
System environment: Ubuntu 14.04 LTS
Detection framework: Faster R-CNN
Feature extraction model: ZF
(2) Experimental data
For a network video station caption detection scenario with 168 station caption classes, the invention collects real Twitter network data and augments the samples to construct a station caption positioning data set (i.e., the small data set only marking station caption positions) and a station caption classification data set (i.e., the large data set only marking station caption categories). After augmentation, the positioning data set contains 20 training pictures per station caption class and 336 test pictures; the classification data set contains 1000 training pictures per station caption class and 10000 test pictures. When labeling the samples, only the positioning data set is annotated with station caption positions, while the other data only needs station caption category labels, saving a large amount of labeling cost.
In order to illustrate the effect of station caption positioning, the following methods are respectively adopted to train the station caption positioning network, and the test is carried out on the test set:
1) directly training the station caption positioning network with the RPN;
2) adding a clustering method on the basis of 1);
3) adding a hard example mining method on the basis of 1);
4) using the station caption positioning method of the invention.
The station caption positioning networks trained with these four methods are tested, and the recall (R), precision (P), and accuracy (A) are calculated; a comparison of the station caption positioning effects is shown in FIG. 4. As the figure shows, the RPN + clustering method of 2) and the RPN + hard example mining method of 3) both improve recall, precision, and accuracy over directly training the positioning network with the RPN as in 1). Compared with the RPN + clustering method of 2), the station caption positioning method of the invention further improves recall, precision, and accuracy by a small margin, while also enhancing the generalization ability of the network, focusing training on hard examples, and accelerating training.
In order to illustrate the overall effect of station caption detection, the following methods are respectively adopted to train the station caption detection model, and the test is carried out on the test set:
1) training the station caption positioning network directly with the RPN and training the station caption classifying network with Fast RCNN (based on Fast R-CNN);
2) using the station caption detection method of the invention.
The station caption detection models trained with these two methods are tested, and the mean average precision (mAP) and the mean AUC under the ROC curve (mAP AUC) are calculated; a comparison of the station caption detection effects is shown in FIG. 5. As FIG. 5 clearly shows, both mAP and mAP AUC are greatly improved with the station caption detection method provided by the invention, demonstrating the effectiveness and usability of the method.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and a person skilled in the art can make modifications or equivalent substitutions to the technical solution of the present invention without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.
Claims (7)
1. A weak supervision depth station caption detection method comprises the following steps:
1) preprocessing a mass network video data file to obtain a large data set only marking the station caption category and a small data set only marking the station caption position; the method specifically comprises the following steps:
1-1) removing the duplication of the massive network video data files according to MD5 codes;
1-2) extracting a plurality of key frames from the network video after the duplication removal by using a key frame extraction method;
1-3) performing M-grid segmentation on each network video key frame and keeping only the 1/M picture blocks located at the corners, wherein M takes its value according to the size and position of the station caption;
1-4) classifying all the pictures to obtain pictures with station captions;
1-5) performing data enhancement on the picture with the station caption by using a traditional method;
1-6) carrying out balanced distribution on the picture with the station caption after the data enhancement according to the station caption category to obtain a large data set only marking the station caption category and a small data set only marking the station caption position;
2) organizing a station caption positioning network and a station caption classifying network according to a weakly supervised framework, inputting the small data set into the station caption positioning network for training to obtain a station caption positioning network capable of predicting a station caption area;
the station caption positioning network is a station caption positioning network based on RPN; selecting the sizes and the aspect ratios of anchors by using a K-means clustering method; the anchors are initial prediction frames generated by the station positioning network at each positioning central point;
inputting each picture in the small data set into the station caption positioning network for training for several rounds, then using the newly generated station caption positioning network to obtain the confidence corresponding to each predicted station caption region, calculating the IOU (intersection over union) between each predicted station caption region and the station caption position, comparing the IOU with the confidence corresponding to that predicted region to obtain the hard examples in the current state, preferentially selecting the hard examples to enter the station caption positioning network for training in the subsequent rounds by using a bootstrap hard example mining method, and repeating these steps until the station caption positioning network converges, obtaining a station caption positioning network capable of predicting station caption regions;
3) inputting the large data set into a trained station caption positioning network to obtain a plurality of predicted station caption areas of each picture in the large data set, and inputting the plurality of predicted station caption areas of each picture into a station caption classifying network for training to obtain a station caption classifying network capable of classifying the station captions;
4) performing partial preprocessing on the video to be detected, which is the same as the preprocessing in the step 1), and inputting the picture obtained after preprocessing into the station caption positioning network trained in the step 2) to obtain a predicted station caption area of the picture;
5) inputting the predicted station caption area of the picture into the trained station caption classification network in the step 3) to obtain the station caption position and the category of the picture.
2. The method of claim 1, wherein the MD5 code deduplication in step 1-1) means: judging whether network video data files are duplicates by comparing their MD5 values, keeping only one of several network videos with the same MD5 value, and removing the remaining network video data files.
3. The method as claimed in claim 1, wherein in step 1-3) nine-grid segmentation is performed on each network video key frame, and only the one-ninth picture blocks located at the four corners are retained.
4. The method according to claim 1, wherein in step 1-4) a classifier based on a convolutional neural network is used for classification, and if N classes of station captions are to be detected, the classifier performs (N+1)-way classification, which includes a background class; the traditional methods in step 1-5) include, but are not limited to, geometric transformation, smoothing filtering, JPEG compression, and contrast and brightness adjustment.
5. The method of claim 1, wherein the partial preprocessing in step 4) includes only steps 1-2) and 1-3) of step 1).
6. The method according to claim 1, wherein the station caption classifying network in step 3) is a station caption classifying network based on Fast RCNN.
7. The method according to claim 1, wherein in step 3), each picture in the big data set is input into a trained station caption positioning network to obtain a plurality of predicted station caption regions of each picture, the plurality of predicted station caption regions of each picture are input into a station caption classifying network to be trained to obtain the confidence coefficient of each predicted station caption region, and the predicted station caption regions are divided into N types of foreground regions and background regions according to the confidence coefficient of each predicted station caption region to obtain a station caption classifying network capable of classifying the station captions; and the category of the foreground area is marked according to the category of the station caption on the corresponding picture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710485397.1A CN107330027B (en) | 2017-06-23 | 2017-06-23 | Weak supervision depth station caption detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710485397.1A CN107330027B (en) | 2017-06-23 | 2017-06-23 | Weak supervision depth station caption detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107330027A CN107330027A (en) | 2017-11-07 |
CN107330027B true CN107330027B (en) | 2020-05-22 |
Family
ID=60194741
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710485397.1A Expired - Fee Related CN107330027B (en) | 2017-06-23 | 2017-06-23 | Weak supervision depth station caption detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107330027B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110619255B (en) * | 2018-06-19 | 2022-08-26 | 杭州海康威视数字技术股份有限公司 | Target detection method and device |
CN109345529B (en) * | 2018-09-30 | 2021-09-24 | 福州大学 | Fault identification method based on improved secondary target detection network for wire clip and voltage equalizing ring |
CN110147462A (en) * | 2019-05-20 | 2019-08-20 | 新联智慧信息技术(深圳)有限公司 | The verification method and Related product of the short-sighted frequency of religion |
CN110210362A (en) * | 2019-05-27 | 2019-09-06 | 中国科学技术大学 | A kind of method for traffic sign detection based on convolutional neural networks |
CN110287888A (en) * | 2019-06-26 | 2019-09-27 | 中科软科技股份有限公司 | A kind of TV station symbol recognition method and system |
CN111275044A (en) * | 2020-02-21 | 2020-06-12 | 西北工业大学 | Weak supervision target detection method based on sample selection and self-adaptive hard case mining |
CN111368682B (en) * | 2020-02-27 | 2023-12-12 | 上海电力大学 | Method and system for detecting and identifying station caption based on master RCNN |
CN112215252B (en) * | 2020-08-12 | 2023-05-30 | 南强智视(厦门)科技有限公司 | Weak supervision target detection method based on-line difficult sample mining |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8626697B1 (en) * | 2010-03-01 | 2014-01-07 | magnify360, Inc. | Website user profiling using anonymously collected data |
CN106599892A (en) * | 2016-12-14 | 2017-04-26 | 四川长虹电器股份有限公司 | Television station logo identification system based on deep learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102436575A (en) * | 2011-09-22 | 2012-05-02 | Tcl集团股份有限公司 | Method for automatically detecting and classifying station captions |
CN103336954B (en) * | 2013-07-08 | 2016-09-07 | 北京捷成世纪科技股份有限公司 | A kind of TV station symbol recognition method and apparatus in video |
CN106845442A (en) * | 2017-02-15 | 2017-06-13 | 杭州当虹科技有限公司 | A kind of station caption detection method based on deep learning |
2017
- 2017-06-23: CN CN201710485397.1A, patent CN107330027B (en), not active, Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8626697B1 (en) * | 2010-03-01 | 2014-01-07 | magnify360, Inc. | Website user profiling using anonymously collected data |
CN106599892A (en) * | 2016-12-14 | 2017-04-26 | 四川长虹电器股份有限公司 | Television station logo identification system based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN107330027A (en) | 2017-11-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
2017-11-07 | PB01 | Publication | |
2017-12-01 | SE01 | Entry into force of request for substantive examination | |
2020-05-22 | GR01 | Patent grant | |
2024-06-21 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200522 |