CN113343012B - News matching method, device, equipment and storage medium - Google Patents
- ️Fri Mar 04 2022
Disclosure of Invention
In view of the above problems in the prior art, an object of the present invention is to provide a news picture matching method, apparatus, device and storage medium, so as to improve the efficiency and quality of matching pictures to news that has no pictures.
In order to solve the technical problems, the specific technical scheme is as follows:
in one aspect, provided herein is a method of matching news images, the method comprising:
inputting the news to be matched into a trained neural network model to obtain a text vector of the news to be matched;
determining a plurality of candidate historical text vectors according to the text vector of the news to be matched and a historical text vector library, wherein the historical text vector library is a set of text vectors obtained by passing known news with pictures through the trained neural network model;
determining a plurality of candidate historical pictures corresponding to the plurality of candidate historical text vectors;
and determining the score value of each candidate historical picture according to the candidate historical pictures and a preset score rule, and taking the candidate historical picture with the highest score value as the target picture of the news to be matched.
Further, the neural network model is obtained by training through the following steps:
initializing training model parameters to obtain an initial neural network model;
acquiring historical news data, and preprocessing the historical news data to obtain data to be trained, wherein the data to be trained comprises training input data and target output data;
inputting the training input data into an initial neural network model for training to obtain prediction data;
training the initial neural network model according to the prediction data and the target output data and a training rule to obtain fixed training model parameters;
and bringing the fixed training model parameters into an initial neural network model to obtain a trained neural network model.
Further, the trained neural network model comprises:
the input layer is used for receiving an initial text vector and an input word vector of a text;
the aggregation layer is used for aggregating the initial text vector and the input word vector to form an aggregated vector;
the hidden layer is used for processing the aggregation vector to generate an output word vector for a preset word;
and the prediction function is used for updating the initial text vector according to the output word vector and the preset word to obtain the text vector of the text.
Further, the determining a plurality of historical candidate texts according to the text vector of the news to be matched and a historical text vector library includes:
calculating the similarity between the text vector and each historical text vector according to the text vector of the news to be matched and the historical text vector library;
determining a plurality of historical text vectors corresponding to the similarity meeting the specified conditions;
and taking the historical texts corresponding to the plurality of historical text vectors as a plurality of historical candidate texts.
Further, the calculating obtains a similarity between the text vector and each historical text vector, including:
determining the text vector and the vector length of each historical text vector;
and calculating to obtain a vector inner product of the text vector and each historical text vector according to the vector length of the text vector and the vector length of the historical text vector, and taking the vector inner product as the similarity between the text vector and each historical text vector.
Further, the determining a plurality of historical text vectors corresponding to the similarity satisfying the specified condition includes:
sorting the similarity according to size to obtain a similarity sequence;
and determining the historical text vectors corresponding to a specified number of the largest similarities in the similarity sequence as the plurality of historical text vectors.
Further, the historical text vector library is established by the following steps:
acquiring historical news data, wherein the historical news data comprises historical news texts and picture URL addresses corresponding to the historical news texts;
classifying the historical news data to obtain a plurality of news data sets;
sequentially inputting each historical news text in each news data set into a trained neural network model to obtain a historical news text vector set aiming at different news types;
according to the historical news text vector set and the picture URL address corresponding to the historical news text, establishing a mapping relation between the historical news text vector and the picture URL address;
and establishing the historical text vector library according to the historical news text vector sets of different news types, the picture URL addresses corresponding to the historical news texts and the mapping relation.
Further, the determining a plurality of candidate history pictures corresponding to the plurality of candidate history text vectors comprises:
determining a picture URL address corresponding to each candidate historical text vector according to the candidate historical text vectors and the mapping relation;
and extracting to obtain a plurality of candidate historical pictures according to each picture URL address.
Further, after extracting a plurality of candidate historical pictures according to each picture URL address, the method further includes:
and according to the news to be matched, removing the candidate historical pictures meeting preset conditions from the plurality of candidate historical pictures to obtain the plurality of candidate historical pictures remaining after removal.
Further, the removing, according to the news to be matched, candidate historical pictures meeting preset conditions from the plurality of candidate historical pictures to obtain a plurality of removed candidate historical pictures includes:
determining keyword information of the news to be matched according to the news to be matched, wherein the keyword information comprises at least one of the following elements: time, place, people, and event;
identifying keyword information in each candidate historical picture;
matching the keyword information of the news to be matched with the keyword information in the candidate historical pictures to determine the candidate historical pictures with inconsistent keyword information;
and removing candidate historical pictures with inconsistent keyword information to obtain a plurality of updated candidate historical pictures so as to determine a target picture from the plurality of updated candidate historical pictures.
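The keyword-based removal described above can be sketched as follows. This is only an illustration under stated assumptions: the `Candidate` record, the `is_consistent` rule (a picture is treated as inconsistent only when it and the news both specify an element and the values differ) and the function names are hypothetical and do not come from the original text. How keywords are recognised in a picture is outside the sketch.

```python
from dataclasses import dataclass, field

# The four keyword elements named in the text.
ELEMENTS = ("time", "place", "people", "event")


@dataclass
class Candidate:
    """Hypothetical record for one candidate historical picture."""
    url: str
    keywords: dict = field(default_factory=dict)  # element -> value recognised in the picture


def is_consistent(news_keywords, picture_keywords):
    """Assumed rule: conflict only if both sides give an element and the values differ."""
    for element in ELEMENTS:
        news_value = news_keywords.get(element)
        picture_value = picture_keywords.get(element)
        if news_value and picture_value and news_value != picture_value:
            return False
    return True


def remove_inconsistent(news_keywords, candidates):
    """Keep only candidates whose keyword information does not conflict with the news."""
    return [c for c in candidates if is_consistent(news_keywords, c.keywords)]
```

A stricter rule (for example, requiring every element of the news to be present in the picture) would fit the same interface; the text does not say which rule is intended.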
Further, the determining the score value of each candidate history picture according to the candidate history pictures and a preset score rule includes:
sequentially determining the size parameter and the quality parameter of each candidate historical picture;
calculating to obtain scoring parameters of the candidate historical picture according to the size parameters and the quality parameters, wherein the scoring parameters comprise picture tidiness, picture definition, picture size proportion and picture pixel proportion;
and calculating the score value of each candidate historical picture according to the score parameters.
Further, after the candidate history picture with the highest score value is used as the target picture of the news to be matched, the method further comprises the following steps:
and obtaining the use authorization information of the target picture, and determining the matched picture of the news to be matched according to the use authorization information and a plurality of candidate historical pictures.
Further, the obtaining the usage authorization information of the target picture and determining the matching picture of the news to be matched according to the usage authorization information and the plurality of candidate historical pictures includes:
extracting copyright information of the picture from the target picture and/or determining the copyright information of the picture from the picture URL address of the target picture;
generating a picture authorization request according to the copyright information, and sending the picture authorization request to a picture authorization mechanism to obtain the use authorization information of the target picture;
receiving the use authorization information sent by the picture authorization mechanism, and determining whether to adopt the target picture according to the use authorization information;
if so, taking the target picture as the matching picture of the news to be matched;
and if not, updating the target picture in descending order of score value and repeating the step of determining whether to adopt the target picture, until the matching picture of the news to be matched is determined.
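The fallback loop in the last two steps can be sketched as below. The callable `request_authorization` stands in for the exchange with the picture authorization mechanism and is a hypothetical placeholder, as is the choice to return `None` when no candidate is authorized; the text does not say what happens in that case.

```python
def choose_authorized_picture(scored_candidates, request_authorization):
    """Walk the candidates from highest to lowest score and return the first authorized one.

    scored_candidates: iterable of (score, picture_url) pairs.
    request_authorization: hypothetical callable(picture_url) -> bool.
    """
    ranked = sorted(scored_candidates, key=lambda item: item[0], reverse=True)
    for _score, picture_url in ranked:
        if request_authorization(picture_url):
            return picture_url
    return None  # assumption: no authorized picture was found
```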
In another aspect, this document also provides a news picture matching apparatus, including:
the text vector acquisition module is used for inputting the news to be matched into the trained neural network model to obtain a text vector of the news to be matched;
the candidate historical text vector determining module is used for determining a plurality of candidate historical text vectors according to the text vector of the news to be matched and a historical text vector library, wherein the historical text vector library is a text vector set obtained by the known news with pictures through a trained neural network model;
a candidate history picture determination module for determining a plurality of candidate history pictures corresponding to the plurality of candidate history text vectors;
and the target picture determining module is used for determining the score value of each candidate historical picture according to the candidate historical pictures and a preset score rule, and taking the candidate historical picture with the highest score value as the target picture of the news to be matched.
In another aspect, a computer device is also provided herein, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method as described above when executing the computer program.
Finally, a computer-readable storage medium is also provided herein, which stores a computer program that, when executed by a processor, implements the method as described above.
With the above technical scheme, the news picture matching method, apparatus, device and storage medium described herein train in advance a neural network model that generates text vectors. The news to be matched is input into the trained neural network model to obtain its text vector, and a plurality of candidate historical text vectors are determined by text vector comparison, the historical text vectors having been obtained by passing known news with pictures through the trained neural network model. The candidate historical pictures corresponding to the candidate historical text vectors are then determined and scored, and the candidate historical picture with the highest score value is taken as the target picture of the news to be matched. Comparing text vectors in this way can improve both the efficiency and the quality of matching pictures to news without pictures.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments herein without making any creative effort, shall fall within the scope of protection.
It should be noted that the terms "first," "second," and the like in the description and claims herein and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments herein described are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or device.
In the prior art, several methods exist for matching pictures to news. Manual picture matching is inefficient and uneven in quality, so it is difficult to meet today's demand for matching pictures in large volumes. Automatic matching has also been performed by building an image-text matching model: a multi-modal model is trained on image-label pairs or image-text descriptions, and the degree of matching between the news without pictures and the pictures in a picture library is then calculated based on the model.
In order to solve the above problems, an embodiment of this specification provides a news picture matching method. Fig. 1 is a schematic diagram of an implementation environment of the method, which may include a news matching device 20 and a server 10. The server 10 is configured to collect and store historical news data. The news matching device 20 obtains the historical news data from the server 10 and uses it to train a neural network model, yielding a trained neural network model capable of generating text vectors; it then runs the historical news data in the server 10 through the trained model to obtain a text vector library corresponding to the historical news data, and sends the text vector library to the server 10 for storage. The server 10 also stores the picture information in the historical news data. In the matching process, the news matching device 20 receives the news to be matched input by a user 30 and obtains its text vector through the trained neural network model. Using this text vector, the news matching device 20 extracts candidate text vectors with high similarity from the text vector library in the server 10, determines the candidate historical pictures corresponding to the candidate text vectors, and further screens the candidate historical pictures to obtain the target picture corresponding to the news to be matched.
The server 10 can acquire and store historical news data in real time and provide storage space for the text vector library. The server may be an independent server or a distributed server, which is not limited in the embodiments of this specification.
Optionally, embodiments herein provide a news picture matching method that can improve the efficiency and quality of matching pictures to news without pictures. Fig. 2 is a schematic diagram of the steps of a news picture matching method provided in an embodiment herein. This specification provides the method operation steps as described in the embodiment or the flowchart, but more or fewer operation steps may be included based on conventional or non-creative labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution. When an actual system or apparatus product executes, the steps may be executed sequentially or in parallel according to the method shown in the embodiment or the figures. Specifically, as shown in fig. 2, the method may include:
S101: inputting the news to be matched into a trained neural network model to obtain a text vector of the news to be matched;
S102: determining a plurality of candidate historical text vectors according to the text vector of the news to be matched and a historical text vector library, wherein the historical text vector library is a set of text vectors obtained by passing known news with pictures through the trained neural network model;
S103: determining a plurality of candidate historical pictures corresponding to the plurality of candidate historical text vectors;
S104: and determining the score value of each candidate historical picture according to the candidate historical pictures and a preset score rule, and taking the candidate historical picture with the highest score value as the target picture of the news to be matched.
It can be understood that, in the embodiments of this specification, a neural network model is trained in advance, for example on historical news data, to obtain a neural network model capable of generating text vectors. The text vector of the news to be matched is obtained through the trained neural network model; similarity matching against the historical text vector library then yields a plurality of candidate historical text vectors with higher similarity; the candidate historical pictures corresponding to these vectors are determined; and, by scoring the candidate historical pictures, the one with the highest score is determined as the target picture.
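The four steps S101 to S104 can be sketched end to end as follows. This is a minimal outline, not the claimed implementation: `embed`, `score_picture`, the flat `library` list of (vector, URL) pairs and the default `top_k=10` are hypothetical stand-ins for the trained model, the preset score rule and the historical text vector library.

```python
def match_picture(news_text, embed, library, score_picture, top_k=10):
    """Return the best-scoring picture URL for a news text, or None if the library is empty.

    embed: hypothetical callable(text) -> list of floats (the trained model).
    library: list of (history_vector, picture_url) pairs.
    score_picture: hypothetical callable(picture_url) -> float (the preset score rule).
    """
    # S101: text vector of the news to be matched.
    query = embed(news_text)

    # S102: rank historical vectors by inner-product similarity and keep the top_k.
    def similarity(entry):
        return sum(q * d for q, d in zip(query, entry[0]))

    ranked = sorted(library, key=similarity, reverse=True)

    # S103: candidate pictures are the pictures mapped to the selected vectors.
    candidates = [picture_url for _vector, picture_url in ranked[:top_k]]
    if not candidates:
        return None

    # S104: the candidate with the highest score becomes the target picture.
    return max(candidates, key=score_picture)
```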
The neural network model may generate a text vector according to known historical news data (i.e., text of historical news), and optionally, as shown in fig. 3, the neural network model is obtained by training through the following steps:
S201: initializing training model parameters to obtain an initial neural network model;
S202: acquiring historical news data, and preprocessing the historical news data to obtain data to be trained, wherein the data to be trained comprises training input data and target output data;
S203: inputting the training input data into the initial neural network model for training to obtain prediction data;
S204: training the initial neural network model according to the prediction data, the target output data and a training rule to obtain fixed training model parameters;
S205: and bringing the fixed training model parameters into the initial neural network model to obtain the trained neural network model.
In practical operation, the preprocessing of the historical news data may consist of word segmentation and stop word filtering of the historical news texts, so as to determine the useful word information in them. Specifically, each historical news text is segmented with a word segmentation component (such as the Jieba tool) to obtain useful words and stop words (such as "yes" and punctuation marks), and the stop words are filtered out to form a word sequence containing only useful words. The word sequence is the training input data, and the target output data is a target word in the word sequence. Feeding the word sequence into the neural network model yields a predicted word for the target word, and the neural network model is then trained and updated according to the target word and the predicted word until the similarity (i.e., distance) between the predicted word and the target word meets a specified requirement, at which point the trained neural network model is obtained.
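The preprocessing can be sketched as below. The tokenizer is passed in as a parameter so the sketch stays self-contained; for Chinese text one would presumably pass a Jieba segmentation function, while the example relies on plain whitespace splitting. The function names, the stop word set and the fixed window of 2 are illustrative assumptions.

```python
def to_word_sequence(text, tokenize, stop_words):
    """Segment a text and drop stop words and empty tokens."""
    return [word for word in tokenize(text) if word.strip() and word not in stop_words]


def target_context_pairs(words, window=2):
    """Pair each target word with up to `window` words on either side of it."""
    pairs = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs
```

Each pair gives the training input (the context words) and the target output (the word to be predicted) for one position in the word sequence.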
In the embodiments of this specification, the neural network model may be a natural language processing model: the neural network model is used to generate text vectors, and text similarity is then calculated from the text vectors. Text similarity can be compared quickly and accurately through text vectors, that is, through vector models, which include the Word2vec model, the doc2vec model and the deep learning model BERT. The Word2vec model is simple but produces only word vectors; a text vector can be obtained by summing the word vectors and averaging, but this discards the order of the words in the text and cannot accurately express its meaning. The BERT model represents text vectors well, but it has a length limit (at most 512 characters) and slow CPU inference, which does not suit the news field, where text length is unpredictable and distribution is time-sensitive. The doc2vec model is therefore preferably used to represent news text vectors in the embodiments of this specification.
Further, the structure of the doc2vec model may include:
the input layer is used for receiving an initial text vector and an input word vector of a text;
the aggregation layer is used for aggregating the initial text vector and the input word vector to form an aggregated vector;
the hidden layer is used for processing the aggregation vector to generate an output word vector for a preset word;
and the prediction function is used for updating the initial text vector according to the output word vector and the preset word to obtain the text vector of the text.
It can be understood that the input layer mainly receives a text vector D (paragraph vector) and input word vectors (wv), which are added in the aggregation layer to obtain the aggregation vector neu1. The hidden layer (synneg) is one of the parameters to be learned by the model, and the doc2vec model is essentially trained once the parameters in the hidden layer are fixed. The prediction function mainly measures the similarity between the output of the hidden layer (the predicted word) and the vector of the word to be predicted (the target word). If the two are similar, the parameters do not need to be updated; if not, the synneg vector, the wv vectors and the D vector are updated by back-propagation with gradient descent until fixed synneg and wv parameters are obtained, which gives the trained doc2vec model. The trained doc2vec model is then used for inference on subsequent texts to obtain the corresponding text vector D, which serves as the news vector in later steps.
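One update of this structure can be sketched in plain Python as follows, using the update rule that the detailed steps below give as formulas (4) to (8). Several points are assumptions rather than statements from the text: the label `y` is taken to be 1 for the target word and 0 for a negative example, the input-side gradient is accumulated before the synneg vector is modified (the usual order in word2vec-style implementations), and the function and variable names are hypothetical.

```python
import math

MAX_EXP = 6.0  # samples whose dot product lies outside (-6, 6) are skipped


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def train_step(paragraph_vec, context_vecs, synneg, samples, alpha):
    """Apply one update in place and return the aggregation vector neu1.

    paragraph_vec: list of floats (text vector D).
    context_vecs: list of word vectors (wv) inside the current window.
    synneg: dict word_id -> list of floats (hidden-layer parameters).
    samples: list of (word_id, y); y = 1 for the target word, 0 for a negative example.
    alpha: learning rate.
    """
    dim = len(paragraph_vec)

    # Aggregation layer: paragraph vector plus the context word vectors.
    neu1 = list(paragraph_vec)
    for vec in context_vecs:
        for k in range(dim):
            neu1[k] += vec[k]

    input_grad = [0.0] * dim  # gradient to pass back to wv and D
    for word_id, y in samples:
        out = synneg[word_id]
        f = dot(neu1, out)
        if f <= -MAX_EXP or f >= MAX_EXP:
            continue
        g = (y - math.exp(f) / (math.exp(f) + 1.0)) * alpha
        for k in range(dim):
            input_grad[k] += g * out[k]  # uses synneg before it is updated
            out[k] += g * neu1[k]        # synneg = synneg + g * neu1

    # Propagate the accumulated gradient to the word vectors and the paragraph vector.
    for vec in context_vecs:
        for k in range(dim):
            vec[k] += input_grad[k]
    for k in range(dim):
        paragraph_vec[k] += input_grad[k]
    return neu1
```

At inference time the same loop would presumably be run with `synneg` and the word vectors frozen, so that only the paragraph vector of the new text changes.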
Illustratively, the detailed process of training the doc2vec model is as follows:
Step 1: Data preprocessing. A large amount of historical news data is collected; to speed up training, news items with a length of 100 to 500 characters are selected. The text is segmented with the jieba tool, and stop words such as "yes" and punctuation marks are filtered out. After segmentation, each text becomes a sequence of words, referred to below as the word sequence;
Step 2: The seed of the MT19937 random number generation algorithm is initialized to 0;
Step 3: A 256-dimensional vector is randomly initialized as the initial text vector (paragraph vector) of the text;
Step 4: The word vectors of all words are randomly initialized and denoted wv;
Step 5: A 256-dimensional aggregation layer vector is initialized to 0.0 and denoted neu1;
Step 6: next_random is initialized using the MT19937 random number algorithm, as in formula (1):
next_random = 2^24 * randint(0, 2^24) + randint(0, 2^24) (1);
Step 7: The word sequence is traversed, and next_random >> 16 is compared with 4227327 to determine whether to retain the current word. next_random is then updated as in formula (2):
next_random = (next_random * 25214903917 + 11) & 281474976710655 (2);
Step 8: For the sampled word sequence, a window smaller than 5 is generated for each word. The window size is calculated as in formula (3):
window = (next_random >> 16) % 5 (3);
and next_random is updated according to formula (2);
where the window is the number of words taken before and after the current word. For example, suppose the word sequence after segmentation is [construction, human, and, nature, harmony, symbiosis, beauty, China] and the word currently being processed is "nature". With a window size of 2, the sampled word sequence is [human, and, nature, harmony, symbiosis], as shown in Table 1 below:
TABLE 1
Step 9: For each word after sampling, neu1 is updated based on the corresponding window. For example, if the window size is 2, neu1 is calculated according to formula (4):
neu1 = wv[i-2] + wv[i-1] + wv[i+1] + wv[i+2] (4);
Step 10: The paragraph vector is added to neu1, as shown in formula (5):
neu1 = neu1 + paragraph vector (5);
Step 11: Five words are randomly sampled from the word sequence of the historical news text as negative examples. The dot product of the word vector corresponding to each negative example and neu1 is calculated and denoted f. If f ≤ -6 or f ≥ 6, the current sample is skipped; otherwise, the gradient is calculated according to formula (6):
g = (y - exp(f)/(exp(f)+1)) * alpha (6);
Step 12: The hidden layer parameter vector synneg is updated as shown in formula (7):
synneg = synneg + g * neu1 (7);
Step 13: The wv vector is updated as shown in formula (8):
wv = wv + g * synneg (8);
Step 14: alpha (i.e., the learning rate) is adjusted, the process returns to step 6, and training is repeated for 40 rounds;
Step 15: The wv vector and the synneg vector obtained by training are saved, giving the trained doc2vec model.
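The random-number bookkeeping of steps 6 to 8 can be sketched as follows. Python's `random.Random` is a Mersenne Twister (MT19937) generator, which is why it is used here; `randrange` is chosen on the assumption that the upper bound in formula (1) is exclusive, and the function names and the `context_indices` helper are illustrative only.

```python
import random

MULTIPLIER = 25214903917
INCREMENT = 11
MASK_48 = 281474976710655  # equals 2**48 - 1, keeps the state within 48 bits


def init_next_random(rng):
    """Formula (1): build the initial state from two 24-bit draws."""
    return (2 ** 24) * rng.randrange(0, 2 ** 24) + rng.randrange(0, 2 ** 24)


def advance(next_random):
    """Formula (2): one linear congruential step."""
    return (next_random * MULTIPLIER + INCREMENT) & MASK_48


def window_size(next_random, max_window=5):
    """Formula (3): shift first, then reduce modulo max_window (result is 0 to 4)."""
    return (next_random >> 16) % max_window


def context_indices(position, length, window):
    """Indices of up to `window` words before and after `position`."""
    start = max(0, position - window)
    end = min(length, position + window + 1)
    return [i for i in range(start, end) if i != position]


rng = random.Random(0)  # seed 0, as in step 2
state = init_next_random(rng)
window = window_size(state)
state = advance(state)
```

Note that in Python `>>` binds more loosely than `%`, so the parentheses in `window_size` are needed to get the shift-then-modulo reading.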
Training on the collected historical news data through the above steps yields a trained neural network model (namely, the doc2vec model). News text vectors can then be generated by inference with this model, and running inference over the historical news data gives the historical text vector library.
In this embodiment of the specification, the similarity between texts is determined through text vector comparison. Optionally, as shown in fig. 4, the determining a plurality of candidate historical text vectors according to the text vector of the news to be matched and a historical text vector library includes:
S301: calculating the similarity between the text vector and each historical text vector according to the text vector of the news to be matched and the historical text vector library;
S302: and determining a plurality of historical text vectors corresponding to the similarities meeting the specified conditions, and taking the plurality of historical text vectors as candidate historical text vectors.
It can be understood that the similarity between different texts is determined herein through the similarity between their text vectors, so as to find the historical news most similar to the news to be matched. The pictures used in the historical news found in this way are therefore likely to suit the news to be matched as well, so comparing text vectors allows similar historical news to be confirmed quickly and improves picture matching efficiency.
In a further embodiment, the similarity may be calculated as a vector distance between the two texts, wherein the calculating to obtain the similarity between the text vector and each historical text vector includes:
determining the text vector and the vector length of each historical text vector;
and calculating to obtain a vector inner product of the text vector and each historical text vector according to the vector length of the text vector and the vector length of the historical text vector, and taking the vector inner product as the similarity between the text vector and each historical text vector.
Illustratively, the doc2vec model yields a 256-dimensional vector for the news to be matched, which is then used in the similarity comparison. The text vector of the news to be matched is denoted Q, and the vectors in the historical text vector library are denoted D_i, 0 <= i < N, where N is the number of news items in the historical text vector library and D_i is the vector corresponding to the i-th news item. The similarity (similarity_score) of two vectors is measured by the vector inner product, that is, the values at corresponding positions of the vectors are multiplied and then summed, as shown in formula (9):
$S = \sum_{j=0}^{255} Q_j \, D_{i,j}$ (9)
where S is the similarity, Q_j is the j-th component of the 256-dimensional text vector of the news to be matched, and D_{i,j} is the j-th component of the i-th vector in the historical text vector library.
The similarity between the text vector of the news to be matched and each historical text vector in the historical text vector library can be calculated through formula (9). With the dot product operation, the more similar two vectors are, the larger the result, so by comparing the similarities, the historical texts most similar to the news to be matched can be determined in the historical text vector library, and the corresponding historical pictures in those historical texts (namely the historical news) can be further determined.
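Formula (9) amounts to a plain dot product, which can be sketched as follows. The function names are illustrative, and the length check is an added safeguard rather than something the text requires.

```python
def inner_product(q, d):
    """Multiply the values at corresponding positions and sum them."""
    if len(q) != len(d):
        raise ValueError("vectors must have the same length")
    return sum(q_j * d_j for q_j, d_j in zip(q, d))


def similarities(query, library):
    """Similarity between the query vector and every historical vector D_i."""
    return [inner_product(query, d_i) for d_i in library]
```

Because the inner product is not normalised, the magnitude of a vector also affects its score; cosine similarity would be a different design choice from the one described here.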
Therefore, on the basis of calculating the similarity, optionally, the determining a plurality of history text vectors corresponding to the similarity meeting the specified condition includes:
sorting the similarity according to size to obtain a similarity sequence;
and determining the historical text vectors corresponding to a specified number of the largest similarities in the similarity sequence as the plurality of historical text vectors.
For example, the decreasing sequence of the similarity may be obtained by sorting from large to small, the specified number may be 10, 20, 30, and the like, and without limitation, a specified number of history text vectors with high similarity are selected from the beginning in the sequence to serve as candidate history text vectors. In some other embodiments, the historical text vectors may also be sorted in order from small to large, and the historical text vectors with high similarity are sequentially selected from the tail to the head of the sequence, and a specific sorting manner is not limited in this specification embodiment.
In some other embodiments, the sorting step may be omitted and a selection threshold for candidate historical text vectors may be set directly: when a similarity exceeds the selection threshold, the corresponding historical text vector is taken as a candidate historical text vector. With a selection threshold, candidates can be determined while the similarities are being calculated, which reduces the processing steps and improves the efficiency of determining the candidate historical text vectors. The selection threshold is set according to the actual situation and is not limited in the embodiments of this specification.
Further, a quantity threshold may additionally be set: when the number of candidate historical text vectors determined according to the selection threshold reaches the quantity threshold, the calculation of similarities and the determination of further candidate historical text vectors are stopped. The quantity threshold may also be set according to the actual situation and is not limited in the embodiments of this specification. Using the selection threshold and the quantity threshold together can therefore further improve the speed and efficiency of determining the candidate historical text vectors, improving overall picture matching efficiency while maintaining picture matching quality.
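The two selection strategies can be sketched side by side as follows. `top_k_candidates` follows the sort-and-take-the-largest approach, here through `heapq.nlargest`, which avoids a full sort; `threshold_candidates` follows the selection threshold plus quantity threshold variant. The function names and parameters are illustrative assumptions.

```python
import heapq


def top_k_candidates(scores, k):
    """Indices of the k largest similarity values, highest first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


def threshold_candidates(query, library, similarity, threshold, max_count):
    """Scan the library once, keep vectors above the threshold, stop at max_count."""
    selected = []
    for index, vector in enumerate(library):
        if similarity(query, vector) > threshold:
            selected.append(index)
            if len(selected) >= max_count:
                break  # quantity threshold reached: stop computing similarities
    return selected
```

The threshold variant returns the first sufficiently similar vectors in library order, not necessarily the most similar ones, which is the trade-off for stopping early.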
News is updated quickly, the data volume of historical news is generally large, and in practical applications news generally covers multiple fields such as society, current affairs, finance and sports. To improve the efficiency and accuracy of picture matching, classification may therefore be performed while the historical text vector library is generated. Optionally, as shown in fig. 5, the historical text vector library is established through the following steps:
S401: acquiring historical news data, wherein the historical news data comprises historical news texts and picture URL addresses corresponding to the historical news texts;
S402: classifying the historical news data to obtain a plurality of news data sets;
S403: sequentially inputting each historical news text in each news data set into the trained neural network model to obtain historical news text vector sets for different news types;
S404: establishing a mapping relation between the historical news text vectors and the picture URL addresses according to the historical news text vector sets and the picture URL addresses corresponding to the historical news texts;
S405: establishing the historical text vector library according to the historical news text vector sets of the different news types, the picture URL addresses corresponding to the historical news texts, and the mapping relation.
It can be understood that the historical news data is historical news with pictures. To avoid collecting an extremely large amount of news, which would increase the processing and storage difficulty, only news with pictures from the most recent specified time period (such as three months or half a year) may be collected. The data may be obtained from major news websites through internet crawler technology; optionally, the sources may include Chinese websites, foreign websites and the like, wherein news texts from foreign websites may be translated into Chinese by translation software before being stored, so that text vectors can be inferred from them.
During classification, the news may be classified by news type, such as social news, financial news and sports news, so that historical news text vector sets for different news types, namely different historical news text vector sub-libraries, are formed. Before the similarity is calculated, the type of the news to be matched may be determined and the corresponding historical news text vector sub-library selected according to that type, so that pictures are determined from historical news of the same type, which improves the accuracy and reliability of picture matching.
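Steps S401 to S405 can be illustrated with the following minimal sketch, which groups historical news by type and keeps each text vector together with its picture URL addresses. The record layout, the function name `build_vector_library` and the placeholder `toy_embed` are assumptions made for illustration; in particular, `toy_embed` merely stands in for the trained neural network model and is not that model.

```python
from collections import defaultdict


def toy_embed(text):
    """Placeholder embedding (NOT the trained model): two crude text features."""
    return [float(len(text)), float(text.count(" "))]


def build_vector_library(history, embed):
    """Build per-type sub-libraries mapping text vectors to picture URLs.

    history is an iterable of dicts with the assumed keys 'text',
    'category' and 'picture_urls'. The result maps each news type to a list
    of entries, each holding a text vector and the picture URL addresses of
    the news it came from (the mapping relation of step S404).
    """
    library = defaultdict(list)
    for item in history:
        entry = {
            "vector": embed(item["text"]),
            "picture_urls": list(item["picture_urls"]),
        }
        library[item["category"]].append(entry)
    return dict(library)


# Hypothetical historical news records.
history = [
    {"text": "team wins final", "category": "sports",
     "picture_urls": ["http://example.com/a.jpg"]},
    {"text": "stocks rise", "category": "finance",
     "picture_urls": ["http://example.com/b.jpg"]},
]
sub_libraries = build_vector_library(history, toy_embed)
```

At query time, only the sub-library whose key equals the type of the news to be matched would be searched.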
In a further embodiment, if a single piece of news with pictures contains a plurality of pictures, one picture may be randomly selected and stored as the picture of that news, or a designated picture may be selected and stored; of course, all of the pictures may also be stored, so that a plurality of pictures can be extracted at the same time as a group of candidate pictures.
Exemplarily, news with pictures from the last 3 months in the social, current affairs, financial and sports fields may be used to construct a material library (historical news library) containing 580,000 pieces of news with pictures. To improve the relevance of the matched news, the news classification is introduced as a filtering tag, so this scenario needs to support tag filtering and storage of picture URL addresses while also providing similarity calculation. The traditional search engine ElasticSearch supports tag filtering and text relevance retrieval based on a bag-of-words model well, but does not support large-scale vector retrieval. Facebook's Faiss library only supports similar-vector queries, has no tag filtering function, and runs on a single node without distributed support. A distributed engine that supports string and vector storage, tag filtering and similar-vector retrieval is therefore needed. The Vearch tool meets these requirements, so in the embodiments of this specification the historical news data is processed with the Vearch tool; the main steps are described below.
1) Defining the attributes of the table, including the number of partitions, the number of replicas, the indexing mode, the corresponding indexing parameters and the similarity calculation model, and setting the required fields, such as the news classification, the picture URL address array and the text vector.
2) For each piece of news in the material library, obtaining a 256-dimensional vector (the historical text vector) using the trained doc2vec model, and storing the text vector, the classification and the picture URL addresses in the table, thereby establishing the historical news text vector library.
3) For the news to be matched that is input by a user, obtaining its text vector using the trained doc2vec model, calculating the similarity through the vector retrieval engine to obtain the candidate historical text vectors with high similarity, and finally obtaining the stored pictures.
Therefore, in the embodiments of this specification, the determining a plurality of candidate historical pictures corresponding to the plurality of candidate historical text vectors includes:
determining a picture URL address corresponding to each candidate historical text vector according to the candidate historical text vectors and the mapping relation;
and extracting to obtain a plurality of candidate historical pictures according to each picture URL address.
Through the steps, a plurality of candidate historical pictures of the news to be matched can be quickly obtained, and then the candidate historical pictures are further screened to obtain a final target picture.
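The retrieval flow above (tag filtering, similar-vector retrieval, then looking up the picture URL addresses through the mapping relation) can be sketched with a plain in-memory table. This is a stand-in for illustration only and does not use or reproduce the Vearch interface; the row layout and the names `cosine` and `candidate_picture_urls` are hypothetical, and cosine similarity is an assumed measure.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def candidate_picture_urls(query_vector, category, table, top_k=3):
    """Return picture URLs of the top_k most similar rows of one news type.

    Each row of table is a dict with the assumed fields 'category',
    'vector' and 'picture_urls'. Rows of other types are filtered out
    first (tag filtering); duplicate URLs are returned only once.
    """
    rows = [row for row in table if row["category"] == category]
    rows.sort(key=lambda row: cosine(query_vector, row["vector"]), reverse=True)
    urls = []
    for row in rows[:top_k]:
        for url in row["picture_urls"]:
            if url not in urls:
                urls.append(url)
    return urls


# Hypothetical table rows; "u1" etc. stand for picture URL addresses.
table = [
    {"category": "sports", "vector": [1.0, 0.0], "picture_urls": ["u1"]},
    {"category": "finance", "vector": [1.0, 0.0], "picture_urls": ["u3"]},
    {"category": "sports", "vector": [0.9, 0.1], "picture_urls": ["u1", "u4"]},
]
urls = candidate_picture_urls([1.0, 0.0], "sports", table, top_k=2)
```

The pictures themselves would then be fetched from the returned URL addresses before further screening.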
In the embodiments of this specification, to avoid the situation in which the information expressed in a candidate historical picture is obviously inconsistent with the news to be matched, the candidate historical pictures may be processed in advance and the inconsistent pictures removed. Optionally, the extracting to obtain a plurality of candidate historical pictures according to each picture URL address further includes:
removing, according to the news to be matched, the candidate historical pictures that meet a preset condition from the plurality of candidate historical pictures to obtain the remaining candidate historical pictures.
The preset condition may be that the information expressed in the news to be matched is inconsistent with the information expressed by the candidate historical picture, where inconsistency can be understood as expressing different information. For example, if the news to be matched concerns city A, but both the text information and the picture information in a candidate historical picture concern city B, the two are inconsistent and the candidate historical picture may be removed.
Further, as shown in fig. 6, the removing, according to the news to be matched, the candidate historical pictures that meet a preset condition from the plurality of candidate historical pictures to obtain the remaining candidate historical pictures includes:
S501: determining keyword information of the news to be matched according to the news to be matched, wherein the keyword information comprises at least one of the following elements: time, place, people and event;
S502: identifying keyword information in each candidate historical picture;
S503: matching the keyword information of the news to be matched against the keyword information in the candidate historical pictures to determine the candidate historical pictures with inconsistent keyword information;
S504: removing the candidate historical pictures with inconsistent keyword information to obtain a plurality of updated candidate historical pictures, so that the target picture is determined from the plurality of updated candidate historical pictures.
The keyword information of the news to be matched may be obtained by text recognition, and the keyword information in the candidate historical pictures may be obtained by character recognition or picture recognition. For example, the text in a picture is obtained by optical character recognition (OCR), and description information of the picture, such as people, landscapes and buildings, is obtained by picture contour recognition.
When the keyword information of the news to be matched does not appear in a candidate historical picture and the candidate historical picture contains no other keyword information of the corresponding kind, the candidate historical picture may be retained. For example, when the keyword city A appears in the news to be matched, but neither city A nor any other city appears in the candidate historical picture, the candidate historical picture may be retained.
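One possible reading of steps S501 to S504, together with the retention rule above, is sketched below. It assumes that keyword information has already been extracted into a dictionary per element (time, place, people, event) and treats a picture as inconsistent only when both sides have keywords for the same element and none of them agree. The data layout and the names `is_inconsistent` and `remove_inconsistent` are hypothetical.

```python
ELEMENTS = ("time", "place", "people", "event")


def is_inconsistent(news_keywords, picture_keywords):
    """True if some element has keywords on both sides with no overlap.

    A picture that carries no keyword for an element (for example, no city
    at all) is not considered inconsistent for that element.
    """
    for element in ELEMENTS:
        wanted = set(news_keywords.get(element, ()))
        found = set(picture_keywords.get(element, ()))
        if wanted and found and wanted.isdisjoint(found):
            return True
    return False


def remove_inconsistent(news_keywords, candidates):
    """Keep only the candidates whose keywords do not conflict with the news."""
    return [c for c in candidates
            if not is_inconsistent(news_keywords, c["keywords"])]


# Hypothetical keyword information for the news and three candidates.
news_keywords = {"place": {"city A"}}
candidates = [
    {"url": "u1", "keywords": {"place": {"city B"}}},  # conflicting city
    {"url": "u2", "keywords": {}},                      # no city mentioned
    {"url": "u3", "keywords": {"place": {"city A"}}},  # same city
]
kept = remove_inconsistent(news_keywords, candidates)
```

Under this reading, the picture mentioning only city B is removed, while the picture with no place keyword and the picture mentioning city A are both retained.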
Furthermore, the characters in the candidate historical pictures may be recognized to determine whether the text contains sensitive words (for example, words relating to crime, pornography, religion or politics); if it does, the corresponding candidate historical picture may be removed, which improves the quality of picture matching for the news to be matched.
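The sensitive-word screening can be sketched as a simple substring check over text that is assumed to have been recognized from each picture beforehand. The field name `ocr_text`, the function name `remove_sensitive` and the word list are hypothetical, and a practical system would likely need a more careful matching strategy.

```python
def remove_sensitive(candidates, sensitive_words):
    """Drop candidates whose recognized text contains any sensitive word.

    Matching is case-insensitive substring matching on the assumed
    'ocr_text' field of each candidate.
    """
    lowered = [word.lower() for word in sensitive_words]
    return [
        c for c in candidates
        if not any(word in c["ocr_text"].lower() for word in lowered)
    ]


# Hypothetical candidates with text already recognized from the pictures.
candidates = [
    {"url": "u1", "ocr_text": "Annual marathon 2021"},
    {"url": "u2", "ocr_text": "CRIME scene, do not cross"},
]
clean = remove_sensitive(candidates, ["crime"])
```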
The preliminary screening of the plurality of candidate historical pictures ensures the reliability of the retained candidate historical pictures, from which the target picture can then be determined. Optionally, as shown in fig. 7, the determining the score value of each candidate historical picture according to the plurality of candidate historical pictures and the preset score rule includes:
S601: sequentially determining the size parameter and the quality parameter of each candidate historical picture;
S602: calculating the scoring parameters of the candidate historical picture according to the size parameter and the quality parameter, wherein the scoring parameters comprise picture tidiness, picture definition, picture size proportion and picture pixel proportion;
S603: calculating the score value of each candidate historical picture according to the scoring parameters.
It can be understood that the quality of each candidate historical picture, namely its score value, can be determined through the above steps, and selecting high-quality pictures improves the overall quality of the news to be matched. For example, a picture that contains text is more likely to contradict, or be irrelevant to, the content of the news without pictures, so a sorting/scoring mechanism needs to be designed to sort the selected candidate historical pictures, placing pictures with high quality and strong relevance at the front and pictures with low quality and weak relevance at the back. Picture quality is measured in several respects, such as image resolution, image aspect ratio and the number of characters in the image, so the picture score needs to be calculated by combining multiple dimensions.
Illustratively, first, an OCR model is used to recognize the number of characters appearing in the picture to obtain the tidiness of the picture (i.e., the OCR score). The more characters, the lower the score, so F_OCR = 1/(count + 1), where F_OCR is the tidiness of the picture and count is the number of characters in the picture.
Secondly, an image quality model is used to obtain the definition score F_Quality of the picture; optionally, it may be obtained with Google's image quality assessment model NIMA.
Then, the picture size proportion is the ratio of the long side to the short side of the picture. In practical applications, square pictures are used more often in news, so the higher the proportion, the lower the size score of the picture. A correspondence between the size score and the picture size proportion, such as a functional relation or a mapping relation, may therefore be set such that the larger the proportion, the lower the size score; optionally, F_Size = 1/P_Size, where F_Size is the size score and P_Size is the picture size proportion. In some other embodiments, other arrangements are also possible, and the specific arrangement is not limited in the embodiments of this specification.
The picture pixel proportion is the product of the pixel height and the pixel width of the picture; the larger the product, the better the picture quality and the higher the corresponding pixel score. A correspondence between the pixel score and the picture pixel proportion, such as a functional relation or a mapping relation, may therefore also be set such that the larger the proportion, the higher the pixel score; optionally, F_Ratio = P_Ratio/Q_Ratio, where F_Ratio is the pixel score, P_Ratio is the picture pixel proportion of the current candidate historical picture, and Q_Ratio is the sum of the picture pixel proportions of all candidate historical pictures. In some other embodiments, other arrangements are also possible, and the specific arrangement is not limited in the embodiments of this specification.
Finally, the composite score is calculated according to formula (10):
Score = F_OCR + F_Quality + F_Size + F_Ratio, (10)
Through the above steps, pictures with high quality are ranked at the front and pictures with low quality at the back, which makes it convenient for auditors to review the pictures quickly and determine the picture with the highest quality.
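The composite score of formula (10) can be worked through with the following minimal sketch. It assumes that the character count and the definition score of each picture are already available (for example, from an OCR model and an image quality model) and that the definition score has been scaled to a range comparable with the other terms; the record fields and the function name `score_candidates` are hypothetical.

```python
def score_candidates(candidates):
    """Score candidates with formula (10) and rank them from high to low.

    Each candidate is a dict with the assumed fields 'url', 'width',
    'height' (positive pixel sizes), 'char_count' (characters found by OCR)
    and 'quality' (definition score supplied by an image quality model).
    Returns a list of (score, url) pairs in descending order of score.
    """
    # Q_Ratio: sum of the pixel products of all candidate pictures.
    q_ratio = sum(c["width"] * c["height"] for c in candidates)
    scored = []
    for c in candidates:
        f_ocr = 1.0 / (c["char_count"] + 1)           # F_OCR = 1/(count + 1)
        p_size = max(c["width"], c["height"]) / min(c["width"], c["height"])
        f_size = 1.0 / p_size                          # F_Size = 1/P_Size
        f_ratio = c["width"] * c["height"] / q_ratio   # F_Ratio = P_Ratio/Q_Ratio
        score = f_ocr + c["quality"] + f_size + f_ratio
        scored.append((score, c["url"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical candidates: a clean square picture and a wide one with text.
candidates = [
    {"url": "wide", "width": 200, "height": 100, "char_count": 1, "quality": 0.5},
    {"url": "square", "width": 100, "height": 100, "char_count": 0, "quality": 0.5},
]
ranking = score_candidates(candidates)
```

In this example the square picture without text scores 1 + 0.5 + 1 + 1/3, while the wide picture scores 0.5 + 0.5 + 0.5 + 2/3, so the square picture is ranked first even though it has fewer pixels.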
Further, to make the picture quality sorting more targeted, different scoring weights may be set for different scoring parameters so that pictures can be selected as needed. For example, some news texts (such as political news) place higher requirements on the tidiness of pictures, so the weight of F_OCR may be increased so that tidier pictures are ranked higher; as another example, some news texts (such as sports news) place higher requirements on picture definition, so the weights of F_Quality and F_Ratio may be increased to obtain pictures with higher definition. Setting weights for the scoring parameters improves the convenience and speed of picture matching.
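A weighted variant of the composite score can be sketched as a weighted sum of the scoring parameters. The specification does not fix the form of the weighting, so the linear combination, the default weight of 1.0 and the name `weighted_score` are assumptions made for illustration.

```python
def weighted_score(params, weights=None):
    """Weighted sum of scoring parameters; missing weights default to 1.0.

    params maps a parameter name (for example 'F_OCR') to its value. With
    no weights the result equals the unweighted composite score.
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in params.items())


# Hypothetical parameter values for one candidate picture.
params = {"F_OCR": 0.5, "F_Quality": 0.6, "F_Size": 1.0, "F_Ratio": 0.25}
plain = weighted_score(params)
# Emphasize tidiness, e.g. for political news (the weight 2.0 is illustrative).
tidy_first = weighted_score(params, {"F_OCR": 2.0})
```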
After the target picture is determined, using it directly may constitute infringement and cause unnecessary trouble and loss. Therefore, to ensure that the matched picture can be used normally, optionally, the taking the candidate historical picture with the highest score value as the target picture of the news to be matched further includes:
obtaining use authorization information of the target picture, and determining the matching picture of the news to be matched according to the use authorization information and the plurality of candidate historical pictures.
It can be understood that acquiring the use authorization information of the target picture establishes the legality of the target picture and the right to use it, so that through this step the target picture can actually be determined as the matching picture of the news to be matched.
Optionally, as shown in fig. 8, the obtaining use authorization information of the target picture, and determining the matching picture of the news to be matched according to the use authorization information and the plurality of candidate historical pictures includes:
S701: extracting copyright information of the picture from the target picture and/or determining the copyright information of the picture from the picture URL address of the target picture;
S702: generating a picture authorization request according to the copyright information, and sending the picture authorization request to a picture authorization mechanism to obtain the use authorization information of the target picture;
S703: receiving the use authorization information sent by the picture authorization mechanism, and determining whether to adopt the target picture according to the use authorization information;
S704: if so, taking the target picture as the matching picture of the news to be matched;
S705: if not, updating the target picture in descending order of score value, and repeating the step of determining whether to adopt the target picture until the matching picture of the news to be matched is determined.
The copyright information may be the author or the copyright owner of the picture. It may be obtained by recognizing the characters in the target picture through OCR, or directly from the picture URL address, where the picture URL address may be the source address of the target picture, from which the copyright owner can be determined directly.
The picture authorization mechanism may be a copyright organization or the author. The user sends a picture authorization request containing an authorization agreement to the picture authorization mechanism, and the picture authorization mechanism generates or signs corresponding authorization information for the user according to the authorization agreement; the authorization information may include the authorization fee, time, usage and the like. When the user receives the authorization information, the user may pay the corresponding fee according to the agreement to obtain the right to use the target picture. When the authorization information is not received, the target picture may be removed, and pictures are selected in turn according to the determined picture quality order and their authorization information checked, until a picture is determined to be used as the matching picture of the news to be matched.
Through the above steps, the right to use the matching picture can be obtained legally and reasonably, which avoids subsequent unnecessary trouble and improves the reliability of picture matching.
In addition, when the copyright information of the target picture cannot be directly obtained or the target picture is a public picture without the possibility of infringement, the target picture can be directly used as a matching picture of the news to be matched.
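The fallback logic of steps S701 to S705, including the case in which no copyright information can be obtained, can be sketched as follows. The callables `get_copyright` and `request_authorization` are hypothetical stand-ins for copyright extraction and for the exchange with the picture authorization mechanism; how those are actually carried out is outside this sketch.

```python
def choose_matching_picture(ranked_pictures, get_copyright, request_authorization):
    """Walk the pictures from the highest score down and pick the first usable one.

    ranked_pictures is ordered from the highest score value to the lowest.
    get_copyright(picture) returns copyright information, or None when none
    can be obtained or the picture is public. request_authorization(
    copyright_info, picture) returns True when use authorization is received.
    Returns the chosen picture, or None if no candidate can be used.
    """
    for picture in ranked_pictures:
        copyright_info = get_copyright(picture)
        if copyright_info is None:
            return picture  # no copyright information: used directly
        if request_authorization(copyright_info, picture):
            return picture  # authorization received: adopt this picture
        # Otherwise fall through to the next-best picture.
    return None


# Hypothetical data: p1 and p2 belong to an agency, p3 is a public picture.
owners = {"p1": "agency", "p2": "agency", "p3": None}
granted = {"p2"}
chosen = choose_matching_picture(
    ["p1", "p2", "p3"],
    owners.get,
    lambda owner, picture: picture in granted,
)
```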
According to the news picture matching method provided by the embodiments of this specification, training the neural network model allows the correlation between news items to be calculated quickly and accurately, so that related pictures are quickly recommended for news without pictures. Meanwhile, ranking the pictures by quality improves the quality of picture matching and reduces the probability of a mismatch between the matched picture and the news. Finally, determining the right to use the target picture ensures that the picture is used legally and reasonably and improves the reliability of picture matching.
Based on the same inventive concept, an embodiment of this specification further provides a news picture matching device. As shown in fig. 9, the device includes:
a text vector acquisition module 100, configured to input the news to be matched into a trained neural network model to obtain a text vector of the news to be matched;
a candidate historical text vector determining module 200, configured to determine a plurality of candidate historical text vectors according to the text vector of the news to be matched and a historical text vector library, where the historical text vector library is a set of text vectors obtained by passing known news with pictures through the trained neural network model;
a candidate historical picture determining module 300, configured to determine a plurality of candidate historical pictures corresponding to the plurality of candidate historical text vectors;
and a target picture determining module 400, configured to determine a score value of each candidate historical picture according to the plurality of candidate historical pictures and a preset score rule, and to take the candidate historical picture with the highest score value as the target picture of the news to be matched.
The advantages achieved by the device are consistent with those achieved by the method and are not described again in the embodiments of this specification.
As shown in fig. 10, a computer device is provided in an embodiment of this specification; the news picture matching apparatus in this specification may be the computer device in this embodiment and performs the method described in this specification. The computer device 1002 may include one or more processors 1004, such as one or more central processing units (CPUs), each of which may implement one or more hardware threads. The computer device 1002 may also include any memory 1006 for storing any kind of information, such as code, settings, data, etc. For example, and without limitation, the memory 1006 may include any one or more of the following in combination: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may use any technology to store information. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of the computer device 1002. In one case, when the processor 1004 executes associated instructions stored in any memory or combination of memories, the computer device 1002 can perform any of the operations of the associated instructions. The computer device 1002 also includes one or more drive mechanisms 1008, such as a hard disk drive mechanism, an optical disk drive mechanism, or the like, for interacting with any memory.
The computer device 1002 may also include an input/output module 1010 (I/O) for receiving various inputs (via input device 1012) and for providing various outputs (via output device 1014). One particular output mechanism may include a presentation device 1016 and an associated graphical user interface (GUI) 1018. In other embodiments, the input/output module 1010 (I/O), the input device 1012 and the output device 1014 may be omitted, with the computer device acting only as one computer device in a network. The computer device 1002 can also include one or more network interfaces 1020 for exchanging data with other devices via one or more communication links 1022. One or more communication buses 1024 couple the above-described components together.
The communication link 1022 may be implemented in any manner, such as over a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. The communication link 1022 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
Corresponding to the methods in figs. 2-8, the embodiments herein also provide a computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of the above-described method.
Embodiments herein also provide computer-readable instructions which, when executed by a processor, cause the processor to perform the method shown in figs. 2-8.
It should be understood that, in various embodiments herein, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments herein.
It should also be understood that, in the embodiments herein, the term "and/or" merely describes an association relation between associated objects and means that three kinds of relations may exist. For example, A and/or B may represent: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above generally in terms of their functions in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided herein, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is only a logical division, and other divisions are possible in practice; a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purposes of the embodiments herein.
In addition, functional units in the embodiments herein may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present invention may be implemented in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.
The principles and embodiments of this document are explained herein using specific examples, which are presented only to aid in understanding the method and its core concept. Meanwhile, for a person of ordinary skill in the art, the specific implementation and the scope of application may vary according to the idea of this document. In summary, this description should not be construed as limiting this document.