CN110400251A - Video processing method, apparatus, terminal device and storage medium - Google Patents
- Fri Nov 01 2019
Specific embodiment
The technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
At present, mobile terminal devices such as mobile phones are increasingly widespread, and the smartphone has become an indispensable personal item when people go out. With the rapid development of the mobile Internet, a wide variety of applications have appeared on mobile terminals, many of which provide a customer service function that allows users to conduct business such as product consultation through customer service.
With the development of science and technology, people's demand for a humanized experience when using intelligent products is also gradually increasing. When communicating with customer service, users hope not merely to receive a reply in text or voice, but to communicate through a more natural interactive mode, similar to interpersonal communication in real life.
The inventors found in the course of research that the friendliness of customer service can be enhanced by making a customer service robot imitate a real person speaking. For example, when the customer service robot converses with the user, the reply to the user's inquiry can be expressed as voice coming from the mouth of a virtual figure, so that the user can intuitively see a customer service robot with a virtual human image "speaking" on the user interface, giving the communication between the user and the customer service robot a "face-to-face" quality.
However, in actual research, the inventors found that people perceive the consistency between the sight and sound of a face very sensitively. When the customer service robot "speaks", even a slight deviation between the virtual figure's facial expression or mouth shape and the voice may feel unnatural to the user and harm the user experience.
To address the above problems, the inventors studied the difficulties of personifying a customer service robot, comprehensively considered the usage requirements of practical interaction scenarios, and propose the video processing method, apparatus, electronic device, and storage medium of the embodiments of the present application.
To facilitate a better understanding of the video processing method, apparatus, electronic device, and storage medium provided by the embodiments of the present application, the application environment suitable for the embodiments of the present application is described first.
Referring to Fig. 1, Fig. 1 shows a schematic diagram of an application environment suitable for the embodiments of the present application. The video processing method provided by the embodiments of the present application can be applied to the polymorphic interactive system 100 shown in Fig. 1. The polymorphic interactive system 100 includes a terminal device 101 and a server 102, the server 102 being communicatively connected to the terminal device 101. The server 102 may be a traditional server or a cloud server, which is not specifically limited here.
The terminal device 101 may be any of various electronic devices that have a display screen and support data input, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and wearable electronic devices. Specifically, data input may be voice input based on a voice module provided on the terminal device 101, character input based on a character input module, image input based on an image input module, and so on; a gesture recognition module installed on the terminal device 101 may also enable interactive modes such as gesture input.
A client application may be installed on the terminal device 101, and the user can communicate with the server 102 through the client application (for example an APP, a WeChat mini-program, and the like). Specifically, a corresponding server-side application runs on the server 102. The user can register a user account with the server 102 through the client application and communicate with the server 102 on the basis of that account; for example, the user logs into the account in the client application and, under that account, inputs text information, voice information, image information, or the like through the client application. After the client application receives the information input by the user, it sends the information to the server 102, so that the server 102 can receive, process, and store it; the server 102 may also return corresponding output information to the terminal device 101 according to the received information.
In some embodiments, the client application may be used to provide customer service to the user, communicating with the user on customer service matters, and it may interact with the user through a virtual robot. Specifically, the client application can receive the information input by the user and respond to it through the virtual robot. The virtual robot is a software program based on visual graphics which, when executed, presents to the user a robot form that simulates biological behaviors or thoughts. The virtual robot may be a robot that simulates a real person, for example a lifelike robot built from the appearance of the user or of another person, or a robot in an animated style, for example in the form of an animal or a cartoon character.
In some embodiments, after obtaining the reply information corresponding to the information input by the user, the terminal device 101 can display the virtual robot image corresponding to that reply information on its display screen or on another image output device connected to it. As one approach, while the virtual robot image is played, audio corresponding to the virtual robot image can be played through the loudspeaker of the terminal device 101 or another audio output device connected to it, and text or graphics corresponding to the reply information can be displayed on the display screen of the terminal device 101, realizing polymorphic interaction with the user across the image, voice, and text modalities.
In some embodiments, the apparatus that processes the information input by the user may also be provided on the terminal device 101 itself, so that the terminal device 101 can interact with the user without relying on a connection to the server 102; in that case, the polymorphic interactive system 100 may include only the terminal device 101.
The above application environment is only an example given for ease of understanding; it can be understood that the embodiments of the present application are not limited to this application environment.
The video processing method, apparatus, terminal device, and storage medium provided by the embodiments of the present application are described in detail below through specific embodiments.
Referring to Fig. 2, Fig. 2 shows a schematic flowchart of the video processing method provided by one embodiment of the present application. The video processing method provided by this embodiment is applicable to a terminal device having a display screen or another image output device; the terminal device may be an electronic device such as a smartphone, a tablet computer, or a wearable intelligent terminal. The video processing method first obtains the interactive information input by the user, then identifies the interactive information and obtains specific audio information corresponding to it. The specific audio information is input to a first machine learning model to obtain facial feature points corresponding to the specific audio information, and the facial feature points are input to a second machine learning model to obtain a simulated facial image corresponding to them, where the second machine learning model is a Generative Adversarial Networks (GAN) model and the simulated facial image is a two-dimensional facial image. The preset facial image in a preset video is then replaced with the simulated facial image, yielding a reply video containing the simulated facial image; the preset facial image is likewise a two-dimensional facial image, so the replacement is image replacement based on two-dimensional image processing. Finally, the reply video containing the specific audio information is output for the interactive information. In this way multi-modal interaction is achieved, the robot can appear before the user with a more lifelike and natural image, the quality of robot customer service is optimized, and the user experience is improved.
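Before the individual steps are described, the overall flow can be summarized as a pipeline. The following is a minimal Python sketch of that pipeline; all function names are hypothetical placeholders for the components detailed in steps S201 to S206 below, not an implementation disclosed by the patent.

```python
from typing import Any, List

# Hypothetical stage stubs; each corresponds to one step described below.
def recognize(interactive_info: Any) -> bytes:          # step S202
    raise NotImplementedError("identify input, return specific audio")

def audio_to_landmarks(audio: bytes) -> List[Any]:      # step S203, model 1
    raise NotImplementedError("first machine learning model")

def landmarks_to_face(landmarks: List[Any]) -> Any:     # step S204, model 2
    raise NotImplementedError("second machine learning model (GAN)")

def replace_face(preset_video: Any, face: Any) -> Any:  # step S205
    raise NotImplementedError("2-D image replacement in the preset video")

def mux_audio(video: Any, audio: bytes) -> Any:         # step S206
    raise NotImplementedError("merge specific audio into the reply video")

def reply_video_for(interactive_info: Any, preset_video: Any) -> Any:
    """Chain the stages: interactive information in, reply video out."""
    audio = recognize(interactive_info)
    landmarks = audio_to_landmarks(audio)
    face = landmarks_to_face(landmarks)
    video = replace_face(preset_video, face)
    return mux_audio(video, audio)
```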
In a particular embodiment, the video processing method can be applied to the video processing apparatus 500 shown in Fig. 8 and the terminal device 600 shown in Fig. 9. The flow shown in Fig. 2 is explained in detail below. The above video processing method may specifically include the following steps:
Step S201: obtain the interactive information input by the user.
In this embodiment, the interactive information input by the user can be obtained through the various information input modules integrated in the terminal device, or through various information input devices connected to the terminal device.
In some embodiments, the interactive information includes but is not limited to various types of information such as voice information, text information, image information, and action information. Voice information may include linguistic audio information (such as Chinese or English audio) and non-linguistic audio information (such as music audio); text information may include textual character information (such as Chinese or English) and non-textual character information (such as special symbols and emoticons); image information may include static image information (such as still pictures and photographs) and dynamic image information (such as animated pictures and video images); action information may include user action information (such as user gestures, body movements, and facial expressions) and terminal action information (such as the position and posture of the terminal device and motion states like shaking or rotation).
It can be understood that different types of interactive information can be collected through correspondingly different types of information input modules on the terminal device. For example, the user's voice information can be collected by an audio input device such as a microphone; the text information input by the user can be collected through a touch screen or physical keys; image information can be collected by a camera; and action information can be collected by an optical sensor, a gravity sensor, and the like.
As one approach, while the application program corresponding to the customer service robot runs in the system foreground of the terminal device, each hardware module of the terminal device can be called to obtain the interactive information the user inputs through the application interface of the customer service robot.
As one approach, the interactive information can be used to characterize the interaction intention the user raises to the customer service robot. It may be an explicit question, such as "When will the goods I bought be shipped?"; it may be a request, such as "Please help me check the logistics information of the goods I purchased"; or it may be a greeting expressing a wish to interact, such as "Hello, I have a question to consult about", and so on.
It can be understood that the same question may correspond to different types of interactive information. For example, when the user wants to input the request "Please help me check the logistics information of the goods I purchased", the user may input the corresponding audio by voice, upload a picture corresponding to "the goods I bought" or input the corresponding text information, or directly select the virtual icon corresponding to "the goods I bought" on the application interface to trigger the input of the interactive information. It can be understood that, for the same question, only one type of interactive information may be input, or several types may be input at the same time, making the user's consultation request more definite and easier for the customer service robot to identify.
In this embodiment, different types of interactive information are obtained in several ways, so that the user's various interaction modes can be responded to freely; interaction is no longer limited to traditional mechanical human-computer interaction means, polymorphic human-machine interaction is realized, and more interaction scenarios are satisfied.
Step S202: identify the interactive information, and obtain specific audio information corresponding to the interactive information.
In this embodiment, after the interactive information input by the user is obtained, the interactive information can be identified to parse out the user intention it contains.
In this embodiment, the specific audio information may be audio information obtained by the customer service robot in response to the interactive information input by the user, used to answer the user. For example, when the interactive information input by the user contains the user intention "Please help me check when these goods will be shipped", the specific audio information corresponding to that interactive information may be audio information expressing "These goods are expected to be shipped within 3 days".
As one approach, after the terminal device obtains the interactive information, the interactive information can be identified locally on the terminal device and the corresponding specific audio information generated there. It can be understood that the apparatus that converts interactive information into specific audio information can be deployed locally on the terminal device, so that the customer service robot can still operate in an offline environment.
As another approach, when the terminal device has established a communication connection with the server, after the terminal device obtains the interactive information it can also send the interactive information to the server; the server identifies the interactive information, generates the corresponding specific audio information, and then sends the specific audio information to the terminal device, which obtains it. It can be understood that the apparatus that converts interactive information into specific audio information can also be deployed in a cloud server, relieving the local computation and storage pressure of the terminal device.
In some embodiments, after the interactive information is obtained, it can also be input, according to its type, into a recognition apparatus corresponding to that type; the recognition apparatus identifies and converts the interactive information to obtain the reply audio information corresponding to it.
Step S203: input the specific audio information to the first machine learning model, and obtain facial feature points corresponding to the specific audio information.
In this embodiment, the first machine learning model may be obtained by neural network training, using as training samples a large number of videos of real people speaking (containing the image of the real person speaking and the corresponding speaking audio) together with the facial feature points of the real person while speaking. It can be understood that the first machine learning model is a model for converting audio into the corresponding facial feature points. By inputting the previously obtained specific audio information into the first machine learning model, the first machine learning model outputs the facial feature points corresponding to the specific audio information.
In this embodiment, the facial feature points may be a set of feature points describing all or part of the form of the face, recording the positional information and depth information in space of each feature point on the face; by obtaining the facial feature points, an image of part or all of the face can be reconstructed. As one approach, the facial feature points may be selected in advance; for example, to describe the shape of a person's lips, the contour line of the mouth can be extracted and multiple points spaced along that contour line chosen as needed to serve as the facial feature points describing the lip shape.
It can be understood that the face changes when a person speaks, and the positional information and depth information of each facial feature point change accordingly. That is, each pronunciation while speaking (corresponding to the speaking audio) corresponds to at least one facial image, and each facial image corresponds to one group of facial feature points. By extracting the facial images of a real person corresponding to the audio in a video of that person speaking, and extracting the facial feature points from those facial images, the correspondence between facial feature points and audio can be inferred.
In some embodiments, as shown in Fig. 3, the facial feature points may include at least one of lip shape feature points, facial contour feature points, and facial detail feature points. It can be understood that, depending on user requirements and the application environment, the facial feature points may also be other feature points, presented in any manner, that describe all or part of the form of the face.
It can be understood that in this embodiment the obtained facial feature points correspond in time to the specific audio information. For example, suppose 30 groups of facial feature points are required per second (each group containing the positional information and depth information in space of every feature point); if the audio corresponding to the specific audio information lasts 10 seconds, the total number of groups of facial feature points required is 300, and these 300 groups of facial feature points are aligned in time with the 10 seconds of specific audio information.
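As a small worked illustration of this timing relationship (assuming the example rate of 30 landmark groups per second; real systems may use other frame rates):

```python
LANDMARK_GROUPS_PER_SECOND = 30  # example rate used in this embodiment

def landmark_groups_needed(audio_seconds: float) -> int:
    # One group per video frame, holding position and depth for every
    # facial feature point; groups are time-aligned with the audio.
    return round(audio_seconds * LANDMARK_GROUPS_PER_SECOND)

assert landmark_groups_needed(10) == 300  # the 10-second example above
```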
In some embodiments, the first machine learning model can run on the server, and the server converts the input specific audio information into the corresponding facial feature points through the first machine learning model. As one approach, after the terminal device obtains the interactive information it can send the interactive information to the server, which identifies it, generates the specific audio information, and converts that specific audio information into facial feature points; that is, both the generation of the specific audio information and the conversion into facial feature points can be completed by the server. As another approach, the terminal device can obtain the specific audio information locally and send it to the server, and the server obtains the corresponding facial feature points from the specific audio information the terminal device sent. Deploying the first machine learning model on the server reduces the occupation of the storage capacity and computing resources of the terminal device, and the server only needs to receive a small amount of data (the interactive information or the specific audio information, both small in volume), which greatly reduces the pressure of data transmission and improves its efficiency. In this way, terminal devices with relatively little storage capacity and computing power can easily implement the method provided by this embodiment, lowering the entry threshold for users and improving market adaptability, while also improving the response speed of the terminal device and the user experience.
In other embodiments, the first machine learning model can also run locally on the terminal device, so that the customer service robot can provide service in an offline environment.
As one approach, the first machine learning model can adopt an RNN (Recurrent Neural Network) model, which can use its internal memory to process input sequences of arbitrary length, giving it better computational efficiency and accuracy than other machine learning models in speech recognition and related processing.
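A minimal PyTorch sketch of such a first machine learning model is given below, assuming mel-spectrogram frames as input and one group of landmark coordinates per frame as output; the 68-point landmark set and the layer sizes are illustrative assumptions, not figures from the patent.

```python
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """RNN mapping audio features to facial feature points (step S203)."""
    def __init__(self, n_mels: int = 80, hidden: int = 256, n_points: int = 68):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_points * 3)  # x, y and depth per point

    def forward(self, mel):                  # mel: (batch, frames, n_mels)
        out, _ = self.rnn(mel)
        return self.head(out)                # (batch, frames, n_points * 3)

model = AudioToLandmarks()
mel = torch.randn(1, 300, 80)                # 10 s of audio at 30 frames/s
landmarks = model(mel)                       # (1, 300, 204)
```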
Step S204: input the facial feature points to the second machine learning model, and obtain a simulated facial image corresponding to the facial feature points.
In this embodiment, the second machine learning model may be obtained by neural network training, using as training samples a large number of facial images of real people speaking together with the facial feature points extracted from those facial images. It can be understood that the second machine learning model is a model for constructing, from the facial feature points of a face, the simulated facial image corresponding to those feature points. By inputting the facial feature points output by the first machine learning model into the second machine learning model, the second machine learning model outputs the simulated facial image corresponding to the face.
It can be understood that, since the obtained facial feature points correspond to the specific audio information, the simulated facial image obtained from the facial feature points also corresponds to the specific audio information.
In some embodiments, like the first machine learning model, the second machine learning model can run on the server or locally on the terminal device, each with its own advantages in different application scenarios; the choice can be made according to actual requirements.
In this embodiment, the second machine learning model can output, from the input facial feature points, a simulated facial image approximating the face of a real person, for example in facial contour, spatial form, and skin. After a certain degree of training, it can output a simulated facial image that is visually difficult to distinguish from a real face. It can be understood that, as the number of training samples and the training time accumulate, the accuracy with which the second machine learning model simulates a facial image from facial feature points can steadily improve.
As one approach, the second machine learning model can be a GAN (Generative Adversarial Networks) model, in which a generator (Generator) and a discriminator (Discriminator) learn from each other in a mutual game and continuously optimize their outputs. With a sufficiently large number of training samples, the GAN model can produce simulated facial images that approach a real face without limit, achieving an effect of "passing the fake off as real". Further, the simulated facial image is a two-dimensional facial image; that is, inputting the facial feature points into the GAN model yields a two-dimensional simulated facial image corresponding to those feature points.
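A minimal PyTorch sketch of the generator/discriminator pair follows; the fully connected architecture and 64x64 patch size are illustrative assumptions chosen only to show the adversarial pairing named above.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps one group of facial feature points to a simulated face patch."""
    def __init__(self, n_points: int = 68, img: int = 64):
        super().__init__()
        self.img = img
        self.net = nn.Sequential(
            nn.Linear(n_points * 3, 512), nn.ReLU(),
            nn.Linear(512, 3 * img * img), nn.Tanh())

    def forward(self, pts):                        # pts: (batch, n_points * 3)
        return self.net(pts).view(-1, 3, self.img, self.img)

class Discriminator(nn.Module):
    """Scores whether a face patch looks like a real speaking face."""
    def __init__(self, img: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * img * img, 512),
            nn.LeakyReLU(0.2), nn.Linear(512, 1))

    def forward(self, x):                          # x: (batch, 3, img, img)
        return self.net(x)

g, d = Generator(), Discriminator()
fake = g(torch.randn(4, 68 * 3))                   # (4, 3, 64, 64)
score = d(fake)                                    # (4, 1)
```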
Step S205: replace the preset facial image in the preset video with the simulated facial image, and obtain the reply video containing the simulated facial image after the facial image is replaced.
In this embodiment, the preset video may be a video prepared in advance for giving feedback to the user in response to the interactive information the user inputs. The preset facial image may be the facial image contained in the preset video. After the simulated facial image is obtained, the preset facial image in the preset video can be replaced with the simulated facial image, yielding the reply video with the face replaced.
In one embodiment, both the preset facial image and the simulated facial image are two-dimensional facial images, so replacing the preset facial image in the preset video with the simulated facial image is image replacement based on two-dimensional image processing. Since the simulated facial image is obtained from a GAN model, the picture quality can be greatly improved thanks to the characteristics of GAN models, improving the fidelity of the simulated speaking face.
As one approach, depending on which region of the real face the simulated facial image corresponds to, the replacement of the preset facial image in the preset video can be a whole replacement or a partial replacement.
As shown in Fig. 4, if the simulated facial image simulates only the lip shape, only the region near the mouth in the preset facial image needs to be replaced with the simulated facial image; the resulting image is a facial image in which only the region near the mouth is replaced, while the other regions of the preset video outside that partial region retain the original imagery of the preset video. It can be understood that, compared with replacing the whole face, partial facial replacement needs fewer facial feature points, the amount of data processing is lower, and the reply video can be obtained more efficiently. Moreover, when a person speaks, apart from the relatively obvious changes near the mouth, the changes in other facial regions such as the forehead, cheeks, eyes, ears, and nose are not obvious; therefore replacing only the mouth image improves video processing efficiency while having minimal impact on the realism of the replaced face, providing an optimized experience for the user, as the sketch below illustrates.
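A hedged NumPy sketch of such a partial replacement, pasting a simulated mouth patch over the corresponding rectangle of a preset-video frame while leaving the rest untouched (the coordinates are arbitrary examples):

```python
import numpy as np

def paste_region(frame: np.ndarray, patch: np.ndarray,
                 top: int, left: int) -> np.ndarray:
    """Replace only a rectangular sub-region of an (H, W, 3) frame."""
    out = frame.copy()
    h, w = patch.shape[:2]
    out[top:top + h, left:left + w] = patch
    return out

frame = np.zeros((256, 256, 3), dtype=np.uint8)    # one preset-video frame
mouth = np.full((40, 80, 3), 128, dtype=np.uint8)  # simulated mouth patch
composited = paste_region(frame, mouth, top=170, left=88)
```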
Step S206: output the reply video for the interactive information, the reply video containing the specific audio information.
In this embodiment, after the reply video with the facial image replaced is obtained, the original audio from the preset video can be partially or wholly replaced with the specific audio information, and the reply video containing the specific audio information and the simulated facial image is then output, presenting to the user a customer service robot whose appearance and voice both resemble a real person.
In some embodiments, the preset video may contain only images and no audio; in that case, the specific audio information only needs to be mixed into the reply video after the face replacement before output.
In other embodiments, if the preset video contains an original voice audio and an original background audio (which may be real ambient sound or background sound such as music), the original voice audio in the preset video can be replaced with the specific audio information. As one approach, the original voice audio in the preset video can first be eliminated while the original background audio is retained, and the specific audio information is then mixed with the original background audio to obtain the reply video containing the specific audio information, as sketched below. It can be understood that, depending on the application scenario and user requirements, the background audio in the reply video can also be replaced or deleted. As one approach, the specific audio information may include not only the reply voice audio fed back for the interactive information but also other background audio; for example, it may include background music, so that when the obtained reply video is played, the background music plays together with the customer service robot's voice, improving the user experience.
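The voice-replacement logic of this paragraph can be sketched as follows, assuming the original voice has already been eliminated and all tracks are float arrays at one sample rate; separating voice from background is itself non-trivial and is not shown.

```python
import numpy as np

def mix_reply_audio(background: np.ndarray,
                    specific_audio: np.ndarray) -> np.ndarray:
    """Keep the retained background track and mix in the reply voice."""
    n = max(len(background), len(specific_audio))
    mixed = np.zeros(n, dtype=np.float32)
    mixed[:len(background)] += background          # original background audio
    mixed[:len(specific_audio)] += specific_audio  # synthesized reply voice
    return np.clip(mixed, -1.0, 1.0)               # avoid clipping overflow
```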
In some embodiments, the replacement of the facial image and the synthesis of the audio can be performed in the server. As one approach, starting from receiving the interactive information sent by the terminal device, the server can successively generate the specific audio information for the interactive information, obtain the facial feature points from the specific audio information, generate the simulated facial image from the facial feature points, replace the preset facial image in the preset video with the simulated facial image, and output the reply video containing the specific audio information to the terminal device. The terminal device then only needs to send the interactive information input by the user to the server and obtain the reply video fed back by the server, which greatly reduces the local computation and storage pressure of the terminal device, improves the efficiency of obtaining the reply video, allows the customer service robot to respond to the user's interaction in a timely manner, and makes the simulated-real-person customer service experience more natural.
In a specific application scenario, as shown in Fig. 5, the user can enter the interactive interface with the customer service robot by opening the application client (such as a WeChat mini-program or a standalone APP); the interactive interface includes a video interface and a chat interface. When the user inputs the text interactive information "Hello" in the input box of the chat interface, the client of the customer service robot application obtains the interactive information and sends it to the server. The application server identifies it and generates the specific audio information "Hello, I am customer service robot Xiao Yi" (synthesized voice) for the interactive information, then obtains the reply video according to the specific audio information and returns the reply video (containing the specific audio information) to the user terminal. After receiving the reply video issued by the server, the user terminal can play it in the video interface of the interactive interface (the figure in the reply video shown in Fig. 5 is the simulated-real-person customer service robot image after the face replacement), and can synchronously display on the chat interface the text information corresponding to the customer service robot's specific audio information: "Hello, I am Xiao Yi~".
Continuing to refer to Fig. 5, after the first round of "greeting" interaction, the user goes on to express to the customer service robot a demand to look up merchandise information: "I want the merchandise information of the clothes in this picture". The user can input the corresponding voice interactive information by voice through the voice input button below the chat interface, or input the corresponding text interactive information through the input box. Since the interactive information previously input by the user contains "in this picture", a reference that cannot be resolved from voice or text alone, the user further uploads a picture of the goods through the picture upload function. At this point, once the terminal device can determine a clear interaction intention from the information input by the user, it forwards the two successively input pieces of interactive information, "I want the merchandise information of the clothes in this picture" and the goods picture, to the server. For these two pieces of interactive information the server generates the corresponding specific audio information "One moment please, looking that up for you", outputs the corresponding reply video to the terminal device to be played in the video interface of the interactive interface, and synchronously displays on the chat interface the text corresponding to the specific audio information: "One moment please, looking that up for you~". Meanwhile, after identifying the interactive information, the server can immediately search the network for the merchandise information the user needs and, once found, send the merchandise information to the terminal device to be presented on the interactive interface (not shown in Fig. 5).
It can be understood that each of the foregoing steps in this embodiment can be performed locally by the terminal device, or divided between the terminal device and the server; tasks can be allocated as needed according to the actual application scenario, so as to realize an optimized simulated-real-person robot customer service experience.
With the video processing method provided by one embodiment of the present application, when the user converses with the robot, corresponding specific audio information can be matched to the interactive information input by the user, a lifelike simulated facial image corresponding to that specific audio information is generated by machine learning models, and finally the reply video synthesized from the simulated facial image and the specific audio is output and presented to the user. Multi-modal interaction is realized, the robot can appear before the user with a more lifelike and natural image, the quality of robot customer service is optimized, and the user experience is improved.
Referring to Fig. 6, Fig. 6 shows a schematic flowchart of the video processing method provided by another embodiment of the present application. The flow shown in Fig. 6 is explained in detail below. The above video processing method may specifically include the following steps:
Step S301: obtain the interactive information input by the user.
In this embodiment, for the specific description of step S301, reference may be made to step S201 in the previous embodiment, which is not repeated here.
Step S302: identify the interactive information, and obtain an interactive text corresponding to the interactive information.
In this embodiment, according to the type of the interactive information, the interactive information can be input into the recognition model corresponding to that type, and the recognition model identifies the interactive information to obtain the corresponding interactive text.
As one approach, when the interactive information input by the user is voice information, the interactive information can be identified by a speech recognition model to obtain the corresponding interactive text; when the interactive information is text information, no recognition model is needed and the interactive information is used directly as the interactive text; when the interactive information is image information, it can be identified by an image recognition model to obtain the corresponding interactive text; when the interactive information is action information, it can be identified by a body language recognition model, a terminal posture recognition model, or a gesture recognition model to obtain the corresponding interactive text. A dispatch sketch follows.
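A minimal dispatch sketch for this per-type recognition; the recognizer functions are hypothetical stand-ins for the speech, image, and action recognition models mentioned above.

```python
def speech_to_text(audio):   raise NotImplementedError  # speech recognition model
def image_to_text(image):    raise NotImplementedError  # image recognition model
def action_to_text(action):  raise NotImplementedError  # body/gesture/posture model

def to_interactive_text(info_type: str, info) -> str:
    # Route the interactive information to the model matching its type;
    # plain text needs no model and is used as-is.
    recognizers = {
        "voice": speech_to_text,
        "text": lambda t: t,
        "image": image_to_text,
        "action": action_to_text,
    }
    return recognizers[info_type](info)
```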
It can be understood that by separately identifying the various types of interactive information and obtaining interactive text from each, the different types of interactive information can be normalized, reducing the complexity of the whole video processing system and improving the efficiency of information processing.
As one approach, the model that identifies the interactive information and obtains the corresponding interactive text, such as the speech recognition model, can adopt a recurrent neural network model such as an LSTM (Long Short Term Memory) network model.
Step S303: query the question-and-answer library, and obtain the reply text corresponding to the interactive text.
In this embodiment, the question-and-answer library may be a preconfigured database containing multiple question-answer pairs, where each question-answer pair comprises a prestored interactive text and the prestored reply text corresponding to it; each interactive text is thus matched with a corresponding reply text. Based on the interactive text, the reply text corresponding to it can be queried and obtained in the question-and-answer library, so that an accurate answer can be given for any user interaction intention covered by the question-and-answer library.
In some embodiments, if the interactive text does not directly match any prestored interactive text in the question-and-answer library, the interactive text can be processed approximately at the semantic level by a semantic recognition and analysis method to find the prestored interactive text it may correspond to, and the matched reply text is then obtained from that prestored interactive text. For example, suppose the interactive text generated by identifying the user's input is "This is the best robot customer service I have ever experienced", and this interactive text has no directly corresponding prestored interactive text in the question-and-answer library; through semantic analysis, the prestored interactive text "Thumbs up to you" corresponding to the semantics of the interactive information can be found, and the corresponding reply text "Thank you" obtained.
In some embodiments, a question-answering model (which may be a machine learning model) can also be established on the basis of the question-and-answer library. The question-answering model can be trained from a large number of question-answer pairs; for example, massive question-answer pairs obtained from the communication records of human customer service agents can be used as training samples, with the information from the user side as input and the replies from the customer service side as the desired output, and the question-answering model trained by machine learning methods. The reply text corresponding to the interactive text is then obtained through the question-answering model, so that interactive texts not prestored in the question-and-answer library can also be answered appropriately, making the application of the solution more intelligent. A lookup sketch covering both the direct hit and the semantic fallback follows.
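A minimal sketch of the question-and-answer library lookup with the semantic fallback of the preceding paragraphs; the token-overlap similarity is a crude stand-in for real semantic analysis, and the stored pairs are invented examples.

```python
# Invented example pairs standing in for the prestored question-answer library.
QA_LIBRARY = {
    "hello": "Hello, I am customer service robot Xiao Yi.",
    "thumbs up to you": "Thank you!",
}

def similarity(a: str, b: str) -> float:
    # Crude token-overlap ratio standing in for semantic analysis.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def reply_text_for(interactive_text: str) -> str:
    if interactive_text in QA_LIBRARY:            # direct hit in the library
        return QA_LIBRARY[interactive_text]
    # Fall back to the semantically closest prestored interactive text.
    best = max(QA_LIBRARY, key=lambda q: similarity(q, interactive_text))
    return QA_LIBRARY[best]

print(reply_text_for("a big thumbs up to you all"))  # -> "Thank you!"
```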
Step S304: synthesize the specific audio information corresponding to the reply text.
In this embodiment, based on a pre-trained speech synthesis model, the reply text can be input into the speech synthesis model to obtain the specific audio information corresponding to the reply text.
As one approach, the speech synthesis model can be a CNN (Convolutional Neural Networks) model, which can extract features through convolution kernels and put each phoneme in the phoneme sequence corresponding to the reply text into one-to-one correspondence with spectrum information and fundamental frequency information, thereby generating the specific audio information corresponding to the reply text.
In some embodiments, the speech synthesis model can also be an RNN model, such as WaveRNN.
In this embodiment, the above question-and-answer library, question-answering model, speech synthesis model, and so on can run on the terminal device or on the server; no limitation is imposed here.
Step S305: input the specific audio information to the first machine learning model, and obtain facial feature points corresponding to the specific audio information.
Step S306: input the facial feature points to the second machine learning model, and obtain a simulated facial image corresponding to the facial feature points.
In this embodiment, for the specific descriptions of steps S305 and S306, reference may be made to steps S203 and S204 in the previous embodiment, which are not repeated here.
Further, in some alternative embodiments, when training the question-answering model from the communication records of massive human customer service, the emoticons or punctuation used by the human customer service can simultaneously be annotated for emotion, so that the reply text output by the question-answering model carries an emotion tag; emotion tags include but are not limited to statement, question, exclamation, laughter, and grievance. Specific audio information containing the corresponding emotion can then be generated from the reply text carrying the emotion tag, so that the simulated facial image output from that specific audio information not only corresponds to the specific audio information but also better matches its tone, making the facial expression of the customer service robot while speaking more vivid, natural, and emotionally rich.
Step S307: determine the image replacement region corresponding to the simulated facial image in the preset video, the image replacement region being a partial region or the whole region of the preset facial image in the preset video.
In this embodiment, after the simulated facial image is obtained, the image replacement region corresponding to the simulated facial image in the preset video can first be determined according to the preset size, shape, and coordinates of the simulated facial image.
For example, suppose the preset simulated facial image is a rectangular image extending 20 units horizontally and 10 units vertically on each side of its center, and the center coordinate of the replacement position is (0, 50); it can then be determined that the image replacement region corresponding to the simulated facial image in the preset video is the rectangular region from (-20, 40) to (20, 60), which may correspond exactly to the mouth of the preset facial image.
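The coordinate arithmetic of this example, written out (treating the stated size as half-extents on each side of the center, consistent with the region quoted above):

```python
def replacement_region(cx: int, cy: int, half_w: int, half_h: int):
    # Rectangle spanned by the patch around the replacement center.
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

# The example from this paragraph: center (0, 50), half-extents 20 x 10.
assert replacement_region(0, 50, 20, 10) == (-20, 40, 20, 60)
```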
Step S308: cover the image replacement region with the simulated facial image.
In this embodiment, after the image replacement region is determined, the simulated facial image can be laid over the image replacement region, replacing the partial region or the whole region of the preset facial image in the preset video, thereby obtaining the reply video with the face replaced.
Step S309: output the reply video for the interactive information, the reply video containing the specific audio information.
In this embodiment, for the specific description of step S309, reference may be made to step S206 in the previous embodiment, which is not repeated here.
Referring to Fig. 7, in some embodiments, the first machine learning model and the second machine learning model can be trained through the following steps; a minimal training-loop sketch is given after the steps.
Step S401: obtain a first training sample set, the first training sample set including the facial feature points of the facial images extracted from a first pre-training video and the audio corresponding to those facial feature points.
Step S402: input the first training sample set into the first machine learning model, and train the first machine learning model.
Step S403: obtain a second training sample set, the second training sample set including the facial images extracted from a second pre-training video and the facial feature points corresponding to those facial images.
Step S404: input the second training sample set into the second machine learning model, and train the second machine learning model.
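The following is a minimal training-loop sketch for steps S401-S402, supervising the first model with mean squared error on (audio feature, landmark) pairs; steps S403-S404 would instead run the adversarial generator/discriminator loop over (landmark, facial image) pairs. Optimizer, loss, and hyperparameters are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

def train_first_model(model: nn.Module, loader, epochs: int = 10,
                      lr: float = 1e-4) -> None:
    """Fit audio features to the extracted facial feature points (MSE)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for audio_feats, landmarks in loader:  # the first training sample set
            opt.zero_grad()
            loss = loss_fn(model(audio_feats), landmarks)
            loss.backward()
            opt.step()
```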
In some embodiments, the facial images in the second pre-training video used to train the second machine learning model and the facial image in the preset video may be facial images of the same person. It can be understood that in the reply video after face replacement, besides the replaced facial region, other parts of the person's body may also appear. To keep the simulated face in the face-replaced reply video consistent in skin color and physique with the unreplaced parts of the body, the facial images of the real person in the preset video and the corresponding facial feature points can be used as the second training sample set for training the second machine learning model; this makes the simulated facial image generated by the second machine learning model consistent with the body in the background of the preset video, so that the customer service robot image after face replacement looks more natural. It can be understood that it suffices for the facial image extracted from the second pre-training video and the preset facial image in the preset video to belong to the same person; that is, the second pre-training video can be any video other than the preset video that contains facial images of the same real person.
In other embodiments, provided that the differences in skin color and physique are small and the output simulated facial image can be kept consistent with the body in the background, the facial images in the second pre-training video and the preset facial image in the preset video may also belong to different people.
As one approach, the facial images in the first pre-training video and the second pre-training video may be facial images of the same person, or of different people. It can be understood that, since the first machine learning model is used to convert audio information into the corresponding facial feature points, in some cases the same group of facial feature points can describe different faces. When the facial images in the first pre-training video and the second pre-training video belong to the same person, the expression of the simulated facial image finally output stays consistent with the facial expression of the real person in the preset video; when they belong to different people, the simulated face finally output can make facial expressions that the real person never makes in the preset video, enabling more diverse and surprising applications.
It can be understood that, depending on user requirements and application scenarios, the training methods for the first machine learning model and the second machine learning model can vary widely; this embodiment imposes no limitation on them.
Compared with the method shown in Fig. 2, the video processing method provided by this further embodiment of the present application additionally converts the polymorphic interactive information into text, obtains the corresponding specific audio through a pre-established question-and-answer library and speech synthesis, and can replace the preset video either partially or wholly with the simulated facial image. This extends the application scenarios of the solution: depending on user requirements, different types of robot customer service modes can be freely chosen, more intelligent polymorphic interaction is realized, and the user experience can be effectively improved.
Referring to Fig. 8, Fig. 8 shows a module block diagram of the video processing apparatus 500 provided by one embodiment of the present application. The video processing apparatus 500 is applied to a terminal device having a display screen or another image output device; the terminal device may be an electronic device such as a smartphone, a tablet computer, or a wearable intelligent terminal. Explained with reference to the module block diagram shown in Fig. 8, the video processing apparatus 500 includes: an information input module 510, an audio obtaining module 520, a feature point obtaining module 530, a face generation module 540, a face replacement module 550, and a video output module 560, wherein:
The information input module 510 is configured to obtain the interactive information input by the user.
The audio obtaining module 520 is configured to identify the interactive information and obtain the specific audio information corresponding to the interactive information. Further, the interactive information includes at least one of voice information, text information, and image information, and the audio obtaining module 520 includes:
a recognition unit, configured to identify the interactive information and obtain the interactive text corresponding to the interactive information;
a query unit, configured to query the question-and-answer library and obtain the reply text corresponding to the interactive text;
a synthesis unit, configured to synthesize the specific audio information corresponding to the reply text.
The feature point obtaining module 530 is configured to input the specific audio information to the first machine learning model and obtain the facial feature points corresponding to the specific audio information. In some embodiments, the facial feature points include at least one of lip shape feature points, facial contour feature points, and facial detail feature points.
The face generation module 540 is configured to input the facial feature points to the second machine learning model and obtain the simulated facial image corresponding to the facial feature points.
The face replacement module 550 is configured to replace the preset facial image in the preset video with the simulated facial image, and obtain the reply video containing the simulated facial image after the facial image is replaced. Further, the face replacement module 550 includes:
a region unit, configured to determine the image replacement region corresponding to the simulated facial image in the preset video, the image replacement region being a partial region or the whole region of the preset facial image in the preset video;
a replacement unit, configured to cover the image replacement region with the simulated facial image.
The video output module 560 is configured to output the reply video for the interactive information, the reply video containing the specific audio information.
In some embodiments, the video processing apparatus 500 further includes:
a first sample obtaining module, configured to obtain the first training sample set, the first training sample set including the facial feature points of the facial images extracted from the first pre-training video and the audio corresponding to those facial feature points;
a first training module, configured to input the first training sample set into the first machine learning model and train the first machine learning model;
a second sample obtaining module, configured to obtain the second training sample set, the second training sample set including the facial images extracted from the second pre-training video and the facial feature points corresponding to those facial images (in some embodiments, the facial images in the second pre-training video and in the preset video are facial images of the same person);
a second training module, configured to input the second training sample set into the second machine learning model and train the second machine learning model.
In some embodiments, the facial images in the first pre-training video and the second pre-training video are facial images of the same person or of different people.
With the video processing apparatus provided by one embodiment of the present application, when the user converses with the robot, corresponding specific audio information can be matched to the interactive information input by the user, a lifelike simulated facial image corresponding to that specific audio information is generated by machine learning models, and the reply video synthesized from the simulated facial image and the specific audio is finally output and presented to the user. Multi-modal interaction is realized, the robot can appear before the user with a more lifelike and natural image, the quality of robot customer service is optimized, and the user experience is improved.
The video processing apparatus provided by the embodiments of the present application is used to implement the corresponding video processing methods in the foregoing method embodiments and has the beneficial effects of the corresponding method embodiments, which are not repeated here.
It is clear to those skilled in the art that the video processing apparatus provided by the embodiments of the present application can implement each process in the foregoing method embodiments. For convenience and brevity of description, for the specific working processes of the apparatus and modules described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the embodiments provided in the present application, the coupling or direct coupling or communication connection between the modules shown or discussed may be through some interfaces, and the indirect coupling or communication connection between apparatuses or modules may be electrical, mechanical, or in other forms.
In addition, each functional module in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The above integrated module may be implemented either in the form of hardware or in the form of a software functional module.
Referring to Fig. 9, it shows a structural block diagram of a terminal device 600 provided by an embodiment of the present application. The terminal device 600 may be a terminal device capable of running application programs, such as a smartphone, a tablet computer, or an e-book reader. The terminal device 600 in the present application may include one or more of the following components: a processor 610, a memory 620, and one or more application programs, where the one or more application programs may be stored in the memory 620 and configured to be executed by the one or more processors 610, and the one or more programs are configured to perform the methods described in the foregoing method embodiments.
The processor 610 may include one or more processing cores. The processor 610 uses various interfaces and lines to connect the various parts of the whole terminal device 600, and performs the various functions of the terminal device 600 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 620 and calling the data stored in the memory 620. Optionally, the processor 610 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 610 may integrate a combination of one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and so on; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It can be understood that the above modem may also not be integrated into the processor 610 and may be implemented separately through a communication chip.
The memory 620 may include Random Access Memory (RAM) or Read-Only Memory. The memory 620 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 620 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing each of the foregoing method embodiments, and the like; the data storage area may store data created by the terminal device 600 in use (such as a phone book, audio and video data, and chat records) and the like.
Referring to Fig. 10, it shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application. Program code is stored in the computer-readable storage medium 700, and the program code can be called by a processor to execute the methods described in the foregoing method embodiments.
The computer-readable storage medium 700 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 700 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 700 has storage space for the program code 710 that performs any method step in the above methods. The program code can be read from or written into one or more computer program products, and the program code 710 can, for example, be compressed in a suitable form.
In conclusion method for processing video frequency provided by the embodiments of the present application, device, terminal device and storage medium, it can be first The interactive information for obtaining user's input, then identifies interactive information, obtains specific audio letter corresponding with interactive information Breath, then audio information specific is input to the first machine learning model, face feature point corresponding with audio information specific is obtained, Face feature point is input to the second machine learning model, simulation facial image corresponding with face feature point is obtained, will preset Default facial image in video replaces with simulation facial image, and obtaining after replacement facial image includes simulation facial image Answer video, finally output be directed to interactive information answer video, reply video in include audio information specific.The application is real Applying example, for the interactive information of user's input, can match corresponding audio information specific by when user and robot talk with, And based on machine learning model generate it is corresponding with the audio information specific intend really simulate facial image, synthesis is finally had into mould The answer video of anthropomorphic face image and specific audio is exported to show user, is realized multi-modal interaction, is allowed the robot to It is presented in front of the user with natural image more true to nature, optimizes the quality of robot customer service, promote the usage experience of user.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements for some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.