CN107330027B - Weak supervision depth station caption detection method - Google Patents
Weak supervision depth station caption detection method

Publication number
- CN107330027B (application CN201710485397.1A)

Authority
- CN (China)

Prior art keywords
- station caption
- station
- network
- caption
- picture

Prior art date
- 2017-06-23

Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G06F16/743 — Information retrieval of video data; browsing or visualisation of a collection of video files or sequences
- G06F16/7335 — Graphical querying, e.g. query-by-region, query-by-sketch, query-by-trajectory, GUIs for designating a person/face/object as a query predicate
- G06F18/23213 — Non-hierarchical clustering techniques using statistics or function optimisation with a fixed number of clusters, e.g. K-means clustering
- G06F18/24 — Classification techniques
- G06V30/194 — Character recognition using electronic means with references adjustable by an adaptive method, e.g. learning
Abstract
The invention provides a weakly supervised deep station caption detection method. Its steps are: preprocess massive network video data files to obtain a large data set only marking the station caption category and a small data set only marking the station caption position; input the small data set into a station caption positioning network for training, obtaining a positioning network that can predict station caption regions; input the large data set into the trained positioning network to obtain several predicted station caption regions for each picture in the large data set, and input these predicted regions into a station caption classifying network for training, obtaining a classifying network that can classify station captions; apply the same partial preprocessing to the video to be detected, and input the resulting pictures into the trained positioning network to obtain their predicted station caption regions; input the predicted station caption regions into the trained classifying network to obtain the station caption position and category of each picture.
Description
Technical Field
The invention relates to the field of deep learning, and in particular to a weakly supervised deep station caption detection method.
Background
With the development of the Internet and the rise of multimedia technology, network video carries more and more content and has become a main content carrier of the big data era. Different video sources tend to present different content information. By detecting video station captions, network video data can be managed more effectively, the sources and content of videos can be grasped in advance, and videos containing harmful information can be supervised. Therefore, video station caption detection has strong practical significance and research value.
Station caption data exist widely in network videos, and station caption detection operates on key frames extracted from these videos. Compared with general object detection, station caption detection has its own particularities: the detection target appears at a relatively fixed position and occupies a small proportion of each frame. Most existing object detection data sets do not have these characteristics, so the particularity of station caption data requires collecting and preprocessing a large amount of data before detection. For such particular data, detecting quickly, efficiently, and accurately is the key to completing the station caption detection task.
At present, station caption detection is mainly realized by the following methods:
(1) Template matching based methods
Template matching is an intuitive method: it judges the similarity between a local area of an image frame and a station caption template according to a certain similarity criterion, and thereby judges whether the area contains a station caption. Because the matching degree between the template and every candidate area must be computed one by one, template matching involves a large amount of computation.
(2) Feature matching based methods
Extracting image features to measure similarity is one of the mainstream approaches. Moment features, also called invariant moments, have rotation, scale, and translation invariance that common features lack, and are widely applied in the field of image analysis. SIFT and SURF features are also widely used. However, such hand-crafted feature extraction does not yield good detection results.
(3) Neural network based methods
Station caption identification based on neural networks is now popular and mainstream. In detection tasks, feature extraction with a neural network achieves better results than hand-crafted feature extraction. However, most classical neural-network-based detection models require a large number of labels, which is labor- and time-consuming.
In summary, the above station caption detection methods all have various disadvantages. Therefore, researching a weakly supervised station caption detection method has important research value and application prospects.
Disclosure of Invention
Aiming at the defects of existing station caption detection methods, the invention provides a weakly supervised deep station caption detection method. The method effectively remedies the low efficiency and poor results of existing station caption detection methods, and greatly reduces the time and labor required for labeling.
To address these defects, the technical scheme adopted by the invention is as follows:
A weak supervision depth station caption detection method comprises the following steps:
1) preprocessing massive network video data files to obtain a large data set only marking the station caption category and a small data set only marking the station caption position (ground truth);
2) organizing a station caption positioning network and a station caption classifying network according to a weakly supervised framework, inputting the small data set into the station caption positioning network for training to obtain a station caption positioning network capable of predicting a station caption area;
3) inputting the large data set into the trained station caption positioning network to obtain a plurality of predicted station caption areas of each picture in the large data set, and inputting the plurality of predicted station caption areas of each picture into a station caption classifying network for training to obtain a station caption classifying network capable of classifying the station captions;
4) performing partial preprocessing on the video to be detected, which is the same as the preprocessing in the step 1), and inputting the picture obtained after preprocessing into the station caption positioning network trained in the step 2) to obtain a predicted station caption area of the picture;
5) inputting the predicted station caption area of the picture into the trained station caption classification network in the step 3) to obtain the station caption position and the station caption category of the picture.
Further, the step 1) specifically comprises:
1-1) deduplicating the massive network video data files according to their MD5 (Message-Digest Algorithm 5) codes;
1-2) extracting a plurality of key frames from the network video after the duplication removal by using a key frame extraction method;
1-3) performing M-grid segmentation on each network video key frame and keeping only the 1/M picture blocks located at the corners, wherein M takes its value according to the size and position of the station caption;
1-4) classifying all the pictures to obtain pictures with station captions;
1-5) performing data enhancement (data augmentation) on the pictures with station captions using traditional methods;
1-6) carrying out balanced distribution on the image with the station caption after the data enhancement according to the station caption category to obtain a big data set only marking the station caption category and a small data set only marking the station caption position.
Further, the MD5 code deduplication in step 1-1) means: judging whether network video data files are duplicates by comparing their MD5 values; among several network videos with the same MD5 value, only one is kept and the rest are removed.
Further, in step 1-3), nine-grid segmentation is performed on each network video key frame, and only the one-ninth picture blocks located at the four corners are retained.
Further, a classifier based on a Convolutional Neural Network (CNN) is used for the classification in step 1-4); if N classes of station captions are to be detected, the classifier performs (N+1)-way classification, which includes a background class. The traditional methods in step 1-5) include, but are not limited to, geometric transformation, smoothing filtering, JPEG compression, and contrast and brightness adjustment.
Further, the partial preprocessing in step 4) includes only steps 1-2) and 1-3) of step 1).
Further, the station caption positioning network in step 2) is based on the RPN (Region Proposal Network); the sizes and aspect ratios of the anchor boxes are selected in step 2) with a K-means clustering method; the anchor boxes are the initial prediction boxes generated by the station caption positioning network at each localization center point.
Further, in step 2), after each picture in the small data set has been input into the station caption positioning network and trained for several rounds, the newly generated positioning network is used to obtain the confidence corresponding to each predicted station caption region; the IOU (intersection over union) between each predicted region and the labeled station caption position is calculated and compared with that region's confidence to identify the hard examples in the current state; in the following rounds of training, a bootstrap hard example mining method preferentially selects these hard examples for training; this is repeated until the positioning network converges, yielding a station caption positioning network capable of predicting station caption regions.
Further, the station caption classifying network in the step 3) is a station caption classifying network based on Fast RCNN.
Further, in step 3), inputting each picture in the big data set into a trained station caption positioning network to obtain a plurality of predicted station caption areas of each picture, inputting the plurality of predicted station caption areas of each picture into a station caption classifying network for training to obtain the confidence coefficient of each predicted station caption area, and dividing the predicted station caption areas into N types of foreground areas (namely N types of station captions) and background areas according to the confidence coefficient of each predicted station caption area to obtain a station caption classifying network capable of classifying the station captions; and the category of the foreground area is marked according to the category of the station caption on the corresponding picture.
The invention has the advantages that:
1) The invention uses an object detection method based on convolutional neural networks; with a trained model, it can accurately and quickly detect whether a frame contains station caption data.
2) CNN-based object detection methods are usually trained and evaluated on public data sets such as PASCAL VOC and ILSVRC, whose target objects do not have the characteristics of station caption data. To improve detection on station caption data, the invention collects a large amount of station caption sample data (i.e., network video data files) and, during model training, performs a series of data preprocessing steps according to the characteristics of this data, improving data processing efficiency and station caption detection performance.
3) When training the station caption positioning network, a clustering method and a hard example mining method are used, improving the accuracy and recall of station caption localization.
4) Classical object detection methods often require a large number of manual labels. To address this problem, the present invention uses a weakly supervised framework. The framework only requires station caption positions to be labeled on a small portion of the data; the remaining massive data only needs station caption category labels, and station caption positions can be generated by the trained station caption positioning network. This greatly reduces the amount of data labeling, saving the time and labor required for annotation.
Drawings
FIG. 1 is a flow chart of the training of a station caption detection model provided by the present invention;
FIG. 2 is a flow chart of a method for detecting a weakly supervised depth station caption according to the present invention;
FIG. 3 is a structural diagram of the station caption positioning network provided by the present invention;
FIG. 4 is a comparison graph of station caption positioning effects;
FIG. 5 is a comparison graph of station caption detection effects.
Detailed Description
In order to make the aforementioned and other features and advantages of the invention more comprehensible, embodiments accompanied with figures are described in detail below.
The invention provides a weakly supervised deep station caption detection method. The training flow of the station caption detection model is shown in FIG. 1, and the flow chart of the method is shown in FIG. 2. The method comprises a training stage and a detection stage. The training stage mainly comprises the following steps:
(1) Deduplicate the massive network video data files according to MD5 codes and retain the valid data, so as to facilitate later data processing and ensure effective training.
(2) Extract several key frames from the deduplicated network videos with a key frame extraction method, perform M-grid segmentation on each key frame, and keep only the four 1/M picture blocks located at the corners. Subsequent steps operate on these corner blocks of the key frames, which are simply referred to as "pictures" below. M takes its value according to the size and position of the station caption; in general, M = 9 gives good results. Depending on the specific situation, M generally takes the value a², where a = 1, 2, 3, 4, 5, 6.
(3) Classify all the pictures to obtain those with station captions: classify the pictures with a simple classifier or network and remove pictures that do not contain a station caption.
(4) Perform data augmentation on the pictures with station captions using traditional methods, avoiding the adverse influence on training caused by imbalanced numbers of samples across station caption categories.
(5) Distribute the augmented pictures with station captions evenly by station caption category into a large data set and a small data set. Only the small data set is labeled with station caption positions and used as training data for the station caption positioning network. The large data set is labeled only with station caption categories and is input into the trained station caption positioning network.
(6) Organize a station caption positioning network and a station caption classifying network according to the weakly supervised framework, and train the RPN-based station caption positioning network on the small data set to obtain a positioning network capable of predicting station caption regions. A K-means clustering method and a hard example mining method are used during training to improve the training effect.
(7) Input the large data set into the trained station caption positioning network to obtain several predicted station caption regions for each category-labeled picture, and use these predicted regions as training data to train a Fast RCNN based station caption classifying network, obtaining a classifying network capable of classifying station captions.
In step (1), MD5 is a message digest algorithm that generates distinct digests for different network video data files. Whether two network video data files are duplicates is judged by comparing their MD5 values; among network videos with the same MD5 value, only one is kept and the rest are discarded.
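To make the deduplication step concrete, the following is a minimal Python sketch of MD5-based deduplication over a directory of video files; the directory layout, file extension, and helper names are illustrative assumptions rather than part of the patent.

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 digest of a file, reading it in 1 MB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate_videos(video_dir):
    """Keep only the first video seen for each MD5 value; return retained paths."""
    seen = {}
    for path in sorted(Path(video_dir).glob("*.mp4")):  # file extension is an assumption
        digest = md5_of_file(path)
        if digest not in seen:
            seen[digest] = path
    return list(seen.values())
```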
In step (2), the invention uses an open-source key frame extraction method that accurately extracts network video key frames; usually about 100 key frames are extracted from one network video. In these key frames, the station caption has the obvious characteristic of appearing at relatively fixed positions (the four corners), and each corner usually contains at most one station caption that occupies a small proportion of the whole picture. Therefore, in this embodiment, each key frame is divided into nine parts, and only the one-ninth picture blocks located at the corners are selected as network input.
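As an illustration of the nine-grid corner cropping described above, here is a small sketch (using OpenCV-style NumPy arrays; function and variable names are hypothetical) that splits a key frame into an M = 9 grid and keeps only the four corner blocks:

```python
import cv2  # assumed available for reading key frame images

def corner_tiles(frame, cells_per_side=3):
    """Split a key frame into a cells_per_side x cells_per_side grid (M = 9 when
    cells_per_side = 3) and return only the four corner tiles, where station
    captions are expected to appear."""
    h, w = frame.shape[:2]
    th, tw = h // cells_per_side, w // cells_per_side
    last = cells_per_side - 1
    corners = [(0, 0), (0, last), (last, 0), (last, last)]
    return [frame[r * th:(r + 1) * th, c * tw:(c + 1) * tw] for r, c in corners]

# Usage sketch: tiles = corner_tiles(cv2.imread("keyframe_0001.jpg"))
```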
In step (3), the invention uses a CNN-based classifier to classify the pictures. If N classes of station captions are to be detected, the classifier performs (N+1)-way classification (including a background class).
In step (4), the invention augments the data by means of geometric transformation, smoothing filtering, JPEG compression, and contrast and brightness adjustment, thereby avoiding excessive differences in the amount of data across station caption categories.
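A possible Pillow-based sketch of the augmentation transforms listed above (a small rotation as the geometric transform, Gaussian smoothing, JPEG re-compression, and contrast/brightness adjustment); the exact parameters are illustrative assumptions:

```python
import io
from PIL import Image, ImageEnhance, ImageFilter  # Pillow assumed available

def augment(img):
    """Return augmented variants of a station caption picture."""
    img = img.convert("RGB")
    variants = [
        img.rotate(5, expand=False),                      # geometric transformation
        img.filter(ImageFilter.GaussianBlur(radius=1)),   # smoothing filtering
        ImageEnhance.Contrast(img).enhance(1.3),          # contrast adjustment
        ImageEnhance.Brightness(img).enhance(0.8),        # brightness adjustment
    ]
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=40)              # JPEG compression artefacts
    buf.seek(0)
    variants.append(Image.open(buf).convert("RGB"))
    return variants
```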
In step (5), much less data is used to train the station caption positioning network than the station caption classifying network. In the invention, 20 pictures containing station captions are selected per category to train the positioning network, and 1000 pictures containing station captions are selected per category to train the classifying network. When labeling the data, only station caption positions are annotated on the small data set and only station caption categories on the large data set, which reduces the labeling workload by roughly a factor of 50.
In step (6), according to the characteristics of the station caption data, the invention improves the station caption positioning network with a clustering method. During training of the positioning network, a sliding window moves over the convolutional feature map and generates many anchor boxes. The sizes, aspect ratios, and number of these anchor boxes are usually chosen through extensive experiments, which takes a lot of time. Instead, the method takes the widths, heights, and aspect ratios of the station caption positions labeled in the small training data set, normalizes them to the range 0-1, and feeds them into a K-means clustering algorithm, which automatically selects the sizes and aspect ratios of the anchor boxes according to the station caption data and thereby yields the predicted station caption regions.
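The anchor selection can be sketched as follows: the labeled box widths and heights are normalized to 0-1 and clustered with K-means, and the cluster centers are used as anchor box shapes. The use of scikit-learn and the choice of k = 9 clusters are assumptions made for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans  # scikit-learn assumed available

def select_anchor_shapes(boxes_wh, img_wh, k=9):
    """Cluster the normalized (width, height) of labeled station caption boxes
    and return k anchor shapes in pixels."""
    wh = np.asarray(boxes_wh, dtype=float) / np.asarray(img_wh, dtype=float)  # 0-1 normalization
    centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(wh).cluster_centers_
    return centers * np.asarray(img_wh, dtype=float)
```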
Also in step (6), to train effectively on hard examples and achieve better generalization, the invention improves the station caption positioning network with a hard example mining method. After each position-labeled picture in the small data set has been input into the positioning network and trained for several rounds, the newly generated positioning network is used to obtain the confidence of each predicted station caption region (i.e., the confidence that the predicted region is a true station caption position). The IOU between each predicted region and the labeled station caption position is compared with the corresponding confidence; hard positives, hard negatives, and normal examples are distinguished according to a threshold, and in the next several rounds of training the hard examples are selected preferentially. After several more rounds, hard examples are selected again with the latest positioning network, training continues, and this is repeated until the positioning network converges, yielding a station caption positioning network capable of predicting station caption regions. The threshold is a specific value selected according to the relationship between confidence and IOU, generally chosen in the range 0.5-1.
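The hard example selection can be illustrated with the following sketch, which compares each predicted region's confidence with its IOU against the labeled station caption position and flags the disagreeing cases (confident but badly located, or well located but low-confidence); the box format and the single shared threshold are simplifying assumptions:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def mark_hard_examples(pred_boxes, confidences, gt_box, thr=0.5):
    """Flag hard negatives (high confidence, low IOU) and hard positives
    (low confidence, high IOU) for preferential training in the next rounds."""
    flags = []
    for box, conf in zip(pred_boxes, confidences):
        overlap = iou(box, gt_box)
        flags.append((conf >= thr and overlap < thr) or (conf < thr and overlap >= thr))
    return np.array(flags)
```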
In step (7), several predicted station caption regions are generated for each picture in the large data set using the positioning network trained in step (6). These predicted regions of each category-labeled picture are input into the station caption classifying network for training; the confidence of each predicted region on the picture is obtained, and the predicted regions are divided into N classes of foreground regions and a background class according to their confidences, yielding a station caption classifying network capable of classifying station captions. The foreground class of a region is assigned according to the station caption category labeled on the corresponding picture.
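The weak-supervision step that turns image-level category labels into region-level training labels can be sketched as below; the confidence threshold and function names are illustrative assumptions:

```python
def assign_weak_labels(region_scores, image_logo_class, n_classes, fg_threshold=0.7):
    """Assign a class to every predicted region of a picture that carries only an
    image-level station caption label: confident regions inherit the picture's
    class, the rest become background (class index n_classes)."""
    background = n_classes  # classes 0..n_classes-1 are station captions
    return [image_logo_class if score >= fg_threshold else background
            for score in region_scores]
```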
The procedure for detecting the station caption in the station caption detection model is as follows:
1) processing the video to be detected in the same way as the step (2);
2) inputting the processed picture into a trained station caption positioning network to obtain a predicted station caption area of the picture;
3) inputting the predicted station caption regions into the trained station caption classification network to obtain the station caption position and category of the picture.
The station caption detection model mainly comprises three parts: a data preprocessing module, a station caption positioning network, and a station caption classifying network.
(1) Data pre-processing
Data preprocessing mainly comprises deduplicating the network video data files, extracting key frames, performing M-grid segmentation, and data enhancement. The data preprocessing module takes massive network video data files as input and outputs a number of 1/M-sized key frame picture blocks. It is the initial module of the whole station caption detection model.
(2) Station caption positioning network
The station caption positioning network is based on the RPN; its structure is shown in FIG. 3. A preprocessed picture labeled only with the station caption position is input into the positioning network; the convolutional layers extract features and produce convolutional feature maps, and a sliding window scans the last convolutional feature map (at each window center position, k anchor boxes with different scales and aspect ratios are predicted). The last convolutional layer has 256 feature maps, producing a 256-dimensional fully connected feature vector. Two sibling fully connected layers, for regression and classification, then output the bounding-box regression coordinates and the corresponding confidences of the anchor boxes on each picture. The anchor boxes are regressed according to the bounding-box regression coordinates to obtain predicted station caption regions, and each predicted region is judged to be foreground or background according to its confidence. When the station caption classifying network is trained and tested, the positioning network is used to generate predicted station caption regions, which are then input into the classifying network.
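A minimal PyTorch sketch of the sliding-window head described above (a 3x3 convolution producing a 256-dimensional feature at each position, followed by two sibling 1x1 convolutions for classification and box regression of k anchors); this is an illustrative reading of FIG. 3, not the patented implementation itself:

```python
import torch.nn as nn  # PyTorch assumed as the implementation framework

class RPNHead(nn.Module):
    """Sliding-window RPN head: a 256-d feature per position, then two sibling
    outputs giving 2k objectness scores and 4k box-regression values."""
    def __init__(self, in_channels=256, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 256, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(256, 2 * k, kernel_size=1)  # station caption / background per anchor
        self.reg = nn.Conv2d(256, 4 * k, kernel_size=1)  # box deltas per anchor

    def forward(self, feature_map):
        x = self.relu(self.conv(feature_map))
        return self.cls(x), self.reg(x)
```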
(3) Station caption classification network
The station caption classifying network is based on Fast RCNN. For N station caption classes (N = 168 in the experiments, i.e., logo_1 ... logo_168 in FIG. 1), its input is the predicted station caption regions generated by the positioning network, and its output is the confidence that each predicted region belongs to each of the N station caption classes (the N foreground classes) or to the background class, thereby determining the station caption class of each predicted region.
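The classifying network can be sketched as a Fast R-CNN style head that RoI-pools each predicted station caption region from the shared feature map and scores it against N station caption classes plus background; the layer sizes here are assumptions chosen for illustration:

```python
import torch.nn as nn
from torchvision.ops import roi_pool  # torchvision assumed available

class LogoClassifierHead(nn.Module):
    """RoI-pool each predicted region, then classify it into N logo classes + background."""
    def __init__(self, in_channels=256, n_logo_classes=168, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * pool_size * pool_size, 1024),
            nn.ReLU(inplace=True),
            nn.Linear(1024, n_logo_classes + 1),  # +1 for the background class
        )

    def forward(self, feature_map, rois, spatial_scale):
        # rois: (num_rois, 5) tensor of (batch_index, x1, y1, x2, y2) in image coordinates
        pooled = roi_pool(feature_map, rois,
                          (self.pool_size, self.pool_size), spatial_scale)
        return self.fc(pooled)  # per-region scores over N + 1 classes
```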
The test environment and experimental data for the weakly supervised deep station caption detection method provided by the invention are as follows:
(1) test environment
System environment: Ubuntu 14.04 LTS
Detection framework: Faster R-CNN
Feature extraction model: ZF
(2) Experimental data
For a network video station caption detection scenario with 168 station caption classes, the invention collects real Twitter network data and augments the samples to construct a station caption positioning data set (i.e., the small data set only marking station caption positions) and a station caption classification data set (i.e., the large data set only marking station caption categories). After augmentation, the positioning data set contains 20 training pictures per station caption class and 336 test pictures; the classification data set contains 1000 training pictures per station caption class and 10000 test pictures. When labeling the samples, only the positioning data set is annotated with station caption positions, while the other data only needs station caption category labels, saving a large amount of labeling cost.
In order to illustrate the effect of station caption positioning, the following methods are respectively adopted to train the station caption positioning network, and the test is carried out on the test set:
1) directly training the station caption positioning network with the RPN;
2) adding a clustering method on the basis of 1);
3) adding a hard example mining method on the basis of 1);
4) using the station caption positioning method of the invention.
The station caption positioning networks trained with these four methods are tested, and the recall (R), precision (P), and accuracy (A) are calculated; a comparison of the station caption positioning effects is shown in FIG. 4. As the figure shows, the RPN + clustering method of 2) and the RPN + hard example mining method of 3) both improve recall, precision, and accuracy over directly training the positioning network with the RPN as in 1). Compared with the RPN + clustering method of 2), the station caption positioning method of the invention further improves recall, precision, and accuracy by a small margin, while also enhancing the generalization ability of the network, focusing training on hard examples, and accelerating training.
In order to illustrate the overall effect of station caption detection, the following methods are respectively adopted to train the station caption detection model, and the test is carried out on the test set:
1) training the station caption positioning network directly with the RPN and training the station caption classifying network with Fast RCNN (based on Fast R-CNN);
2) using the station caption detection method of the invention.
The station caption detection models trained with these two methods are tested, and the mean average precision (mAP) and the mean AUC under the ROC curve (mAP AUC) are calculated; a comparison of the station caption detection effects is shown in FIG. 5. As FIG. 5 clearly shows, both mAP and mAP AUC are greatly improved with the station caption detection method provided by the invention, demonstrating the effectiveness and usability of the method.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and a person skilled in the art can make modifications or equivalent substitutions to the technical solution of the present invention without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.
Claims (7)
1. A weak supervision depth station caption detection method comprises the following steps:
1) preprocessing a mass network video data file to obtain a large data set only marking the station caption category and a small data set only marking the station caption position; the method specifically comprises the following steps:
1-1) removing the duplication of the massive network video data files according to MD5 codes;
1-2) extracting a plurality of key frames from the network video after the duplication removal by using a key frame extraction method;
1-3) performing M-grid segmentation on each network video key frame and keeping only the 1/M picture blocks located at the corners, wherein M takes its value according to the size and position of the station caption;
1-4) classifying all the pictures to obtain pictures with station captions;
1-5) performing data enhancement on the picture with the station caption by using a traditional method;
1-6) carrying out balanced distribution on the picture with the station caption after the data enhancement according to the station caption category to obtain a large data set only marking the station caption category and a small data set only marking the station caption position;
2) organizing a station caption positioning network and a station caption classifying network according to a weakly supervised framework, inputting the small data set into the station caption positioning network for training to obtain a station caption positioning network capable of predicting a station caption area;
the station caption positioning network is a station caption positioning network based on RPN; selecting the sizes and the aspect ratios of anchors by using a K-means clustering method; the anchors are initial prediction frames generated by the station positioning network at each positioning central point;
inputting each picture in the small data set into the station caption positioning network for training for several rounds, then using the newly generated station caption positioning network to obtain the confidence corresponding to each predicted station caption region, calculating the IOU (intersection over union) between each predicted station caption region and the station caption position, comparing the IOU with the confidence corresponding to that predicted region to obtain the hard examples in the current state, preferentially selecting the hard examples to enter the station caption positioning network for training in the subsequent rounds by using a bootstrap hard example mining method, and repeating these steps until the station caption positioning network converges, obtaining a station caption positioning network capable of predicting station caption regions;
3) inputting the large data set into a trained station caption positioning network to obtain a plurality of predicted station caption areas of each picture in the large data set, and inputting the plurality of predicted station caption areas of each picture into a station caption classifying network for training to obtain a station caption classifying network capable of classifying the station captions;
4) performing partial preprocessing on the video to be detected, which is the same as the preprocessing in the step 1), and inputting the picture obtained after preprocessing into the station caption positioning network trained in the step 2) to obtain a predicted station caption area of the picture;
5) inputting the predicted station caption area of the picture into the trained station caption classification network in the step 3) to obtain the station caption position and the category of the picture.
2. The method of claim 1, wherein the MD5 code deduplication in step 1-1) means: judging whether network video data files are duplicates by comparing their MD5 values, keeping only one of several network videos with the same MD5 value, and removing the remaining network video data files.
3. The method as claimed in claim 1, wherein in step 1-3) nine-grid segmentation is performed on each network video key frame, and only the one-ninth picture blocks located at the four corners are retained.
4. The method according to claim 1, wherein in step 1-4) a classifier based on a convolutional neural network is used for classification, and if N classes of station captions are to be detected, the classifier performs (N+1)-way classification, which includes a background class; the traditional methods in step 1-5) include, but are not limited to, geometric transformation, smoothing filtering, JPEG compression, and contrast and brightness adjustment.
5. The method of claim 1, wherein the partial preprocessing in step 4) includes only steps 1-2) and 1-3) of step 1).
6. The method according to claim 1, wherein the station caption classifying network in step 3) is a station caption classifying network based on Fast RCNN.
7. The method according to claim 1, wherein in step 3), each picture in the big data set is input into a trained station caption positioning network to obtain a plurality of predicted station caption regions of each picture, the plurality of predicted station caption regions of each picture are input into a station caption classifying network to be trained to obtain the confidence coefficient of each predicted station caption region, and the predicted station caption regions are divided into N types of foreground regions and background regions according to the confidence coefficient of each predicted station caption region to obtain a station caption classifying network capable of classifying the station captions; and the category of the foreground area is marked according to the category of the station caption on the corresponding picture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710485397.1A CN107330027B (en) | 2017-06-23 | 2017-06-23 | Weak supervision depth station caption detection method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710485397.1A CN107330027B (en) | 2017-06-23 | 2017-06-23 | Weak supervision depth station caption detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107330027A CN107330027A (en) | 2017-11-07 |
CN107330027B true CN107330027B (en) | 2020-05-22 |
Family
ID=60194741
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710485397.1A Expired - Fee Related CN107330027B (en) | 2017-06-23 | 2017-06-23 | Weak supervision depth station caption detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107330027B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110619255B (en) * | 2018-06-19 | 2022-08-26 | 杭州海康威视数字技术股份有限公司 | Target detection method and device |
CN109345529B (en) * | 2018-09-30 | 2021-09-24 | 福州大学 | Fault identification method based on improved secondary target detection network for wire clip and voltage equalizing ring |
CN110147462A (en) * | 2019-05-20 | 2019-08-20 | 新联智慧信息技术(深圳)有限公司 | The verification method and Related product of the short-sighted frequency of religion |
CN110210362A (en) * | 2019-05-27 | 2019-09-06 | 中国科学技术大学 | A kind of method for traffic sign detection based on convolutional neural networks |
CN110287888A (en) * | 2019-06-26 | 2019-09-27 | 中科软科技股份有限公司 | A kind of TV station symbol recognition method and system |
CN111275044A (en) * | 2020-02-21 | 2020-06-12 | 西北工业大学 | Weak supervision target detection method based on sample selection and self-adaptive hard case mining |
CN111368682B (en) * | 2020-02-27 | 2023-12-12 | 上海电力大学 | Method and system for detecting and identifying station caption based on master RCNN |
CN112215252B (en) * | 2020-08-12 | 2023-05-30 | 南强智视(厦门)科技有限公司 | Weak supervision target detection method based on-line difficult sample mining |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8626697B1 (en) * | 2010-03-01 | 2014-01-07 | magnify360, Inc. | Website user profiling using anonymously collected data |
CN106599892A (en) * | 2016-12-14 | 2017-04-26 | 四川长虹电器股份有限公司 | Television station logo identification system based on deep learning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102436575A (en) * | 2011-09-22 | 2012-05-02 | Tcl集团股份有限公司 | Method for automatically detecting and classifying station captions |
CN103336954B (en) * | 2013-07-08 | 2016-09-07 | 北京捷成世纪科技股份有限公司 | A kind of TV station symbol recognition method and apparatus in video |
CN106845442A (en) * | 2017-02-15 | 2017-06-13 | 杭州当虹科技有限公司 | A kind of station caption detection method based on deep learning |
2017
- 2017-06-23: CN CN201710485397.1A, patent CN107330027B (en), not active, Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8626697B1 (en) * | 2010-03-01 | 2014-01-07 | magnify360, Inc. | Website user profiling using anonymously collected data |
CN106599892A (en) * | 2016-12-14 | 2017-04-26 | 四川长虹电器股份有限公司 | Television station logo identification system based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN107330027A (en) | 2017-11-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
2017-11-07 | PB01 | Publication | |
2017-12-01 | SE01 | Entry into force of request for substantive examination | |
2020-05-22 | GR01 | Patent grant | |
2024-06-21 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20200522 |