CN117352000A - Speech classification method, device, electronic equipment and computer readable medium - Google Patents
- Fri Jan 05 2024
Detailed Description
In order to better understand the embodiments of the present application, the following clearly and completely describes the embodiments with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. The components of the embodiments of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
With the rapid development of artificial intelligence technology and the continuous rise of requirements for interaction experience, intelligent interaction has gradually begun to replace some traditional man-machine interaction modes. Current research primarily focuses on voice tasks based on LLMs or large voice models.
However, the inventor found in research that Prompt words currently show great potential on LLMs and large voice models; for example, they can enable perceptual understanding and reasoning over speech and images, and can also support multiple speech classification tasks. However, the input mode of the Prompt word has so far been relatively limited: Prompts are still designed as text, which significantly limits their expressiveness.
Therefore, in order to overcome the above-mentioned drawbacks, the embodiments of the present application provide a voice processing method, which enables the output voice content to go beyond the information contained in the voice itself through the prompt information. As shown in fig. 1, the method includes: S101 to S103.
S101: acquiring a voice feature vector corresponding to the voice data to be processed.
As an embodiment, the voice data to be processed may be voice data input by a user in real time. The voice processing method is applied to an electronic device in which a target client is installed; the user can input voice data through the target client, and the target client can obtain target voice data based on the voice processing method and play it through the electronic device.
For example, the target client can provide a voice conversion function, i.e., the user inputs voice and the client outputs the converted target voice. For example, the target client provides a voice input interface in which a voice input control is arranged; the control may be a virtual button, and when the user inputs voice data while pressing the voice input control, the input voice data is used as the voice data to be processed.
As another embodiment, the voice data to be processed may be pre-stored or downloaded voice data, which is not limited.
A speech feature vector is a numerical representation of a speech signal. The speech signal is a waveform in the time domain, but directly using the raw waveform as a feature vector is often inefficient, so a feature extraction algorithm is generally used to extract relevant features of the speech and represent them as feature vectors. Common speech feature vectors include: short-term energy (Short-term Energy), which measures the energy change of the speech signal over a short period and represents the intensity of the speech; zero crossing rate (Zero Crossing Rate), which counts how many times the speech signal crosses zero in a short period and reflects its frequency characteristics; cepstral coefficients (Cepstral Coefficients), obtained by applying a Fourier transform and cepstrum operation to the speech signal to yield a set of coefficients representing its spectral features; Mel-frequency cepstral coefficients (Mel Frequency Cepstral Coefficients, MFCC), similar to cepstral coefficients but using Mel filter banks in the frequency domain to simulate human perception of sound, which better matches human auditory characteristics; linear predictive coding (Linear Predictive Coding, LPC) coefficients, which model the speech signal with a linear prediction model to obtain a set of prediction coefficients representing its formant characteristics; and the spectrogram, obtained by applying a short-time Fourier transform to the speech signal to yield a time-varying spectrum that represents the spectral characteristics of the speech.
In the embodiment of the application, the voice feature vector can be used in the fields of voice recognition, speaker recognition, emotion analysis and the like. In practical applications, it is often necessary to select an appropriate combination of feature vectors according to the specific task and characteristics of the data set, and to train and classify the feature vectors using machine learning or deep learning methods.
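For concreteness, the following is a minimal sketch of how several of the above features could be extracted; the use of the librosa library, the 16 kHz sampling rate, and the file path are illustrative assumptions and are not specified in this application.

    import librosa
    import numpy as np

    def extract_features(wav_path: str, sr: int = 16000) -> dict:
        # Load the waveform at the assumed sampling rate.
        y, sr = librosa.load(wav_path, sr=sr)
        # Short-term energy proxy: RMS energy per frame.
        energy = librosa.feature.rms(y=y)                   # shape (1, T)
        # Zero crossing rate per frame.
        zcr = librosa.feature.zero_crossing_rate(y)         # shape (1, T)
        # Mel-frequency cepstral coefficients.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape (13, T)
        # Spectrogram magnitude via short-time Fourier transform.
        spec = np.abs(librosa.stft(y))                      # shape (freq, T)
        return {"energy": energy, "zcr": zcr, "mfcc": mfcc, "spectrogram": spec}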
S102: acquiring the prompt information obtained by the recognition model based on the target data.
It should be noted that, the types of the target data include text types and preset types, and the prompt information includes voice classification task information.
As an embodiment, the prompt information can determine the direction or purpose of voice conversion of the voice feature vector, and can also provide auxiliary support for voice conversion operation during voice conversion.
For example, task information may typically be provided via the prompt information so that a subsequent model can determine the speech conversion operation to be performed based on the task. In this embodiment of the present application, the task information may be indication information of a voice conversion task, where the voice conversion task may include tasks such as speech translation, speech completion, speech transcription, and speech continuation, which can generally determine the voice content of the voice data to be processed after conversion. Speech translation (Speech Translation) refers to converting the voice data to be processed into text or speech output in a target language, i.e. converting the initial language of the voice data into the target language, e.g. converting Chinese into Japanese. Speech completion refers to improving the completeness of the recognition result in an automatic speech recognition (Automatic Speech Recognition, ASR) task. Speech recognition systems may produce erroneous or missing recognition results due to various factors (e.g., noise, accent, ambiguous sounds); in this case, speech completion aims to predict and insert possibly missing speech parts using context information, grammar models, acoustic models, etc., so as to improve the completeness and accuracy of the recognition result. Speech transcription (Speech Transcription) is the process of converting a speaker's speech content into text form. Speech continuation (Speech Continuation) refers to generating a consistent and meaningful continuation given a piece of speech or text; it can also be understood as generating the next speech content or text given a context.
In this embodiment of the present application, the task information in the prompt information may be voice classification task information, which indicates that the prompt information can direct the subsequent model to perform voice classification operation on the voice data to be processed, that is, classify the voice feature vector.
Illustratively, the voice classification task information may include at least one of keyword recognition, intent recognition, language recognition, false voice detection, emotion recognition, accent classification, sarcasm detection, gender recognition, voice activity detection, audio classification, dysarthria classification, and voice command recognition. Among them, keyword recognition (Keyword Spotting, KS) is a voice recognition technology for detecting and recognizing specific keywords or phrases in speech; it can distinguish and locate specified keywords in real time in a continuous voice stream. Intent classification (Intent Classification, IC) aims to classify the input speech feature vectors to determine the intent or purpose of the user. Language identification (Language Identification, LID) aims to automatically identify the language used in a given audio. False speech detection (Fake Speech Detection, FSD) aims to identify and distinguish false or counterfeit speech. Emotion recognition (Emotion Recognition, ER) aims to automatically analyze information such as human speech, text, and facial expressions to determine the emotion and emotional state they convey. Accent classification (Accent Classification, AcC) aims to identify and classify the accent of a speaker. Sarcasm detection (Sarcasm Detection, SD) aims to automatically recognize sarcastic expressions in text or speech. Gender identification (Gender Identification, GID) refers to automatically identifying the gender associated with data such as text, voice, or images by analyzing the data; in the text and speech fields, gender identification typically determines gender based on characteristics of the speaker's language, pronunciation, tone, etc., while in the image field it is generally based on the features of a face image. Voice activity detection (Voice Activity Detection, VAD) is a technique for automatically identifying active (voice) and inactive (silence) portions of a speech signal. Audio classification (Audio Classification, AuC) refers to automatically categorizing audio signals into different predefined categories or tags by analyzing and processing them. Dysarthria classification (Dysarthria Classification, DYC) refers to the process of classifying different types of dysarthria; dysarthria is a speech disorder caused by neuromuscular conditions, manifested as difficulty in speaking, unclear sounds, or abnormal prosody and tone, and can be caused by a variety of diseases or causes, such as stroke, Parkinson's disease, cerebral palsy, and craniocerebral trauma. Voice command recognition (Speech Command Recognition, SCR) refers to converting voice commands into recognizable text or operations by analyzing and processing voice signals.
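As a small illustration only, the task identifiers listed above could be organized as follows; the Python enum and the string codes simply restate the abbreviations given in this paragraph and are not a required structure.

    from enum import Enum

    class SpeechClassificationTask(Enum):
        KEYWORD_RECOGNITION = "KS"
        INTENT_CLASSIFICATION = "IC"
        LANGUAGE_IDENTIFICATION = "LID"
        FAKE_SPEECH_DETECTION = "FSD"
        EMOTION_RECOGNITION = "ER"
        ACCENT_CLASSIFICATION = "AcC"
        SARCASM_DETECTION = "SD"
        GENDER_IDENTIFICATION = "GID"
        VOICE_ACTIVITY_DETECTION = "VAD"
        AUDIO_CLASSIFICATION = "AuC"
        DYSARTHRIA_CLASSIFICATION = "DYC"
        SPEECH_COMMAND_RECOGNITION = "SCR"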
In an embodiment, the prompt information may be input by inputting text. Illustratively, the prompt information is a Prompt, i.e. a piece of text or a question provided to the target model to direct it to generate a corresponding answer. Prompts can be used to specify tasks, set contexts, or constrain generation results. A conversational model will typically initiate a conversation with a prompt and, in subsequent interactions, generate replies based on the user's input and the previous conversation history. For a generative model, the Prompt may be one or several sentences expressing the desired output style, asking the model to answer a question, providing an explanation, etc.
In addition, in the embodiment of the application, besides being input as text, the prompt information can be determined by recognition of target data of a preset type. The preset type refers to a type other than the text type; for example, the preset type may include at least one of video, voice, image and thermodynamic diagram, and the recognition model can recognize the video, voice, image or thermodynamic diagram to obtain the prompt information, that is, the prompt information corresponding to the target data is determined by recognizing target data of the video, voice, image or thermodynamic diagram type.
Illustratively, the recognition model has the ability to recognize multi-modal data; for example, the recognition model is a large language model (LLM). In particular, in an application embodiment, the recognition model may be a ChatBridge model, which is a language generation model typically designed for natural language dialogue between humans and machines. The ChatBridge model can generate consistent, logical replies from user input and exhibits a degree of semantic understanding and context awareness. The ChatBridge model can uniformly feed different types of data, such as video, voice, text, images, and thermodynamic diagrams, to the model for processing, and generate features related to the data according to the relevance among the data. The main advantage of the ChatBridge model is that multiple sources of data can be processed simultaneously, resulting in a more comprehensive and accurate analysis result. For example, in the field of speech recognition, the ChatBridge model can improve speech understanding according to data of other modalities, such as the expression and gestures of the speaker, so as to achieve a more accurate speech recognition result.
The multi-modal data may be recognized by the ChatBridge model to obtain the prompt information as follows. First, the data is preprocessed, i.e. the different types of data are preprocessed appropriately: for example, video is subjected to frame extraction and encoding, speech is recognized and converted into text, features are extracted from images, and thermodynamic diagrams are analyzed. The goal of this step is to convert the different types of data into a unified format that the model can process. Second, the preprocessed types of data are integrated together for input into the ChatBridge model; for example, the different types of data may be encoded into fixed-length vectors, or an attention mechanism (e.g., a Transformer model) may be used to weight and integrate the data. The goal of this step is to preserve the relevance and context information between the data. Finally, the integrated multi-modal input is fed into the trained ChatBridge model; the model processes and reasons over the input data and generates the corresponding prompt as output.
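A minimal sketch of these three steps is given below; the class name, the per-modality encoder callables, and the trivial stand-in model are hypothetical, since the application does not prescribe a concrete API for the ChatBridge model.

    from typing import Any, Callable, Dict, List

    class MultimodalPromptGenerator:
        def __init__(self, encoders: Dict[str, Callable[[Any], str]],
                     model: Callable[[str], str]):
            # encoders: one preprocessing function per modality (ASR for speech,
            # frame description for video, feature extraction for images, ...).
            # model: a callable wrapping the multimodal model (ChatBridge in this text).
            self.encoders = encoders
            self.model = model

        def generate_prompt(self, inputs: Dict[str, Any]) -> str:
            # Step 1: preprocess each modality into a model-readable representation.
            parts: List[str] = [self.encoders[m](d) for m, d in inputs.items()]
            # Step 2: integrate the preprocessed data (simple concatenation here;
            # attention-based weighting, e.g. a Transformer, is another option).
            unified = "\n".join(parts)
            # Step 3: the model maps the unified input to prompt information,
            # i.e. the speech classification task to perform.
            return self.model(unified)

    # Usage with trivial stand-in callables:
    gen = MultimodalPromptGenerator(
        encoders={"speech": lambda wav: "transcript of " + str(wav)},
        model=lambda text: "task: language identification",
    )
    print(gen.generate_prompt({"speech": "hello.wav"}))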
It should be noted that the ChatBridge model may be trained using the integrated multi-modal data as a training set. The goal of the training is to enable the model to understand the relationships between the different types of data and to generate corresponding prompts from these data. Therefore, through training on different types of sample data such as video, voice, images, and thermodynamic diagrams, the recognition model can map the multi-modal data to different voice classification tasks.
For example, for data of the video type, the recognition model may use speech recognition techniques to transcribe the audio in the video into text, so that a textual representation of the audio content in the video can be obtained. The video frames are then processed using video analysis techniques (e.g., computer vision, image processing), which may include object recognition, scene understanding, face detection, etc. The transcribed audio text and the video content analysis results are then integrated into one or more complete prompts. For other types, such as voice, image, and thermodynamic diagram data, feature values can likewise be obtained through feature extraction, and by training the recognition model in advance, the voice classification task corresponding to the feature values can be recognized, so as to obtain the prompt.
The target data may be data generated during the task related to performing the voice classification operation, or may be data input by a user for indicating the task related to performing the voice classification operation. For example, the target data for the video type may be a request from a user to make a voice classification by way of recording the video. Therefore, through training the recognition model, the recognition model can have the capability of mapping multi-mode data to voice classification tasks, so that prompt information can be obtained.
S103: performing a classification operation on the voice data based on the voice feature vector and the prompt information.
As shown in fig. 2, the recognition model recognizes the multi-modal target data to obtain the prompt information, i.e. determines the voice classification information; the voice feature vector corresponding to the voice data to be processed and the prompt information are then input into the target model together, and the target model performs the classification operation.
As an embodiment, the target model may perform the classification operation on the voice data by determining the type of the voice data as a whole, i.e. after obtaining the voice feature vector and the prompt information, determining what type the voice data belongs to. As another embodiment, the target model may instead, after acquiring the prompt information, classify the voice feature vector based on the prompt information and obtain the type of each unit of the voice feature vector of the voice data.
Therefore, in the voice classification method provided by the embodiment of the application, the voice feature vector corresponding to the voice data to be processed is obtained, and the prompt information obtained by the recognition model based on the target data is obtained, where the type of the target data includes a text type and a preset type and the prompt information includes voice classification task information; a classification operation is then performed on the voice data based on the voice feature vector and the prompt information. The prompt information can be determined through recognition of the target data, and since the target data includes types other than text, a way of inputting prompt information through data of multiple modalities is provided.
Referring to fig. 3, fig. 3 shows a voice processing method provided in an embodiment of the present application, where the method includes: s301 to S305.
S301: acquiring a voice feature vector corresponding to the voice data to be processed.
As an implementation manner, in the embodiment of the present application, the system for executing the speech processing method is a speech processing system, where the speech processing system includes a speech encoder, a recognition model, a speech language model, and a speech decoder. As shown in fig. 4, the prompt information (Prompt) is output by the recognition model and is input into the speech language model together with the speech feature vector output by the speech encoder.
As one implementation, the speech encoder may be a Speech Encoder module: the voice data to be processed is input to the Speech Encoder module, and the speech tokens output by the Speech Encoder module are used as the voice feature vector corresponding to the voice data to be processed. For example, the Speech Encoder module may be HuBERT. HuBERT is a speech representation learning model trained with a self-supervised learning approach that can convert a speech signal into a series of speech feature vectors. In HuBERT, the speech signal is cut into small segments, each of which is called a token.
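The following sketch shows one way to obtain HuBERT frame features and discretize them into speech tokens; the Hugging Face checkpoint name, the file path, and the tiny per-utterance k-means codebook are illustrative assumptions (a real system would use a codebook trained on a large corpus).

    import torch
    import torchaudio
    from sklearn.cluster import KMeans
    from transformers import HubertModel, Wav2Vec2FeatureExtractor

    # Checkpoint name is an assumption; the text only states the encoder may be HuBERT.
    model = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()
    extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                         padding_value=0.0, do_normalize=True)

    waveform, sr = torchaudio.load("speech.wav")
    waveform = torchaudio.functional.resample(waveform, sr, 16000)

    inputs = extractor(waveform.squeeze(0).numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        frames = model(inputs.input_values).last_hidden_state.squeeze(0)  # (T, 768)

    # Quantize frame features into discrete units ("speech tokens"); fitting
    # k-means on a single utterance is purely illustrative.
    codebook = KMeans(n_clusters=50, n_init=10).fit(frames.numpy())
    speech_tokens = codebook.labels_  # one unit id per frame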
S302: acquiring the prompt information obtained by the recognition model based on the target data.
As an implementation manner, multi-modal data such as video, voice, text, images, and thermodynamic diagrams can be input and fed uniformly into the LLM (ChatBridge is taken as an example here) to generate features, and the features output by ChatBridge are mapped to different voice classification tasks. It should be noted that the embodiment of the present application only lists some speech classification tasks, and the present application is not limited to the listed classification tasks.
S303: inputting the voice feature vector and the prompt information into a voice language model together.
In one embodiment, the task token (i.e., the prompt information) output after mapping is input into the Speech LM along with the speech tokens output by the Speech Encoder.
For the Speech LM, only the Prompt needs to be trained; the other parameters of the Speech LM may be frozen.
As one embodiment, the speech language model may be a Speech LM (Speech Language Model), which is a language model for processing speech data. Its function is to generate a corresponding consecutive text sequence from the input speech signal or speech features. The Speech LM parameters remain unchanged during training, which focuses on learning the Prompt vectors corresponding to the prompt information, e.g., task-specific prompt vectors, emotion-specific prompt vectors, and speaker-specific prompt vectors.
For example, a model may be preset in which the other parameters are kept unchanged and only the parameters of the prompt vector are trained; the preset model may be, for example, Unit mBART, GSLM, pGSLM, or the like.
It can be appreciated that this training method is a transfer learning method whose main idea is to keep the parameters of the Speech LM unchanged and train only the parameters related to the prompt vector. Unit mBART, GSLM, pGSLM, and the like are pre-trained language models that contain a great deal of language knowledge and rules that can help the Speech LM understand the language input better. These pre-trained models may be used, for example, to initialize the Speech LM and lock all of its parameters, enabling the Speech LM to make full use of the knowledge of the pre-trained models while only requiring training of specific prompt vectors. The advantage of this is that the number of parameters to be trained is greatly reduced, thereby reducing the training cost and improving the training speed.
In addition, cross entropy loss may be employed as the objective function for all generation tasks, with the loss calculated by comparing the model's predicted results to the target discrete unit labels. This objective function can be used to train various tasks, such as speech recognition, speech synthesis, and speech command recognition, and may be adjusted according to the particular task.
In this process, all parameters other than the prompt vector are kept unchanged during the training of the speech language model; that is, the prompt vector is the only parameter to be trained, and keeping the parameters of the Speech LM fixed during training ensures the consistency of the model behavior. By inserting the prompt vector, the Speech LM is guided to extract task-specific information from the input, increasing the likelihood of producing an output that meets a specific speech generation task.
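A minimal PyTorch sketch of this prompt-tuning setup follows, under the assumption that the frozen Speech LM is available as a module mapping embedded unit sequences to logits over the unit vocabulary (the prompt length, initialization scale, and module interface are illustrative):

    import torch
    import torch.nn as nn

    class PromptTunedSpeechLM(nn.Module):
        def __init__(self, speech_lm: nn.Module, embed: nn.Embedding, prompt_len: int = 10):
            super().__init__()
            self.speech_lm = speech_lm          # pre-trained unit language model
            self.embed = embed                  # unit embedding table of that model
            for p in self.speech_lm.parameters():
                p.requires_grad = False         # freeze the Speech LM
            for p in self.embed.parameters():
                p.requires_grad = False
            # The prompt vectors are the only trainable parameters.
            self.prompt = nn.Parameter(0.02 * torch.randn(prompt_len, embed.embedding_dim))
            self.prompt_len = prompt_len

        def forward(self, unit_ids: torch.Tensor) -> torch.Tensor:
            # unit_ids: (batch, T) discrete speech units from the speech encoder.
            x = self.embed(unit_ids)                                  # (batch, T, D)
            prompt = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
            x = torch.cat([prompt, x], dim=1)                         # prepend the prompt
            return self.speech_lm(x)                                  # (batch, P+T, vocab)

    def prompt_tuning_loss(logits: torch.Tensor, target_units: torch.Tensor,
                           prompt_len: int = 10) -> torch.Tensor:
        # Cross-entropy against the target discrete unit labels, as described above.
        logits = logits[:, prompt_len:, :]
        return nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                           target_units.reshape(-1))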
S304: acquiring a plurality of voice units output by the voice language model based on the voice feature vector and the prompt information.
As an embodiment, the Speech LM may output speech tokens. A speech token generally refers to a basic unit used to divide and represent a speech signal in the field of speech processing; it is a representation obtained by preprocessing the speech signal and extracting its features. In speech recognition and speech emotion recognition tasks, high-dimensional speech features can be extracted using deep neural networks, and these features can be regarded as high-dimensional speech tokens. Specifically, the speech tokens output by the Speech Encoder module serve as the voice feature vector corresponding to the voice data to be processed and are input to the Speech LM; the Speech LM processes them based on the prompt information (Prompt) to obtain speech tokens matching the prompt information. It should be noted that the speech tokens output by the Speech LM are not only converted based on the task information but also carry acoustic characteristics related to emotion description information and speaker description information. For example, if the speaker is a sweet-voiced female and the emotion is pleasant, the tone of the speech tokens output by the Speech LM matches a sweet-voiced female, and the tokens also correspond to a pleasant characteristic, e.g. the content corresponding to the speech tokens is expressed in a pleasant mood.
That is, the speech encoder takes the waveform (the voice data to be processed) as input and converts it into a sequence of units drawn from a finite vocabulary. To shorten the sequence length, repeated consecutive units are removed to generate a compressed unit sequence. The Speech LM is then used as a language model over the unit sequence, optimizing the likelihood of each unit given the preceding units. The Speech LM is prompt-tuned through the prompt information so as to guide it to generate units appropriate to the task, i.e. speech tokens.
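The removal of repeated consecutive units mentioned above amounts to a simple run-length compression; a minimal sketch:

    def compress_units(units):
        # Remove repeated consecutive units, e.g. [7, 7, 7, 3, 3, 9] -> [7, 3, 9].
        compressed = []
        for u in units:
            if not compressed or compressed[-1] != u:
                compressed.append(u)
        return compressed

    assert compress_units([7, 7, 7, 3, 3, 9]) == [7, 3, 9]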
While the Speech LM outputs speech tokens, Spoken LM Units are also generated. Spoken LM Units are the smallest phonetic units in the speech language model and generally correspond to phonemes, sub-syllables, or larger units. The Spoken LM Units are used to model the low-level features of the speech signal and can be understood as the basic units obtained by segmenting and slicing the speech signal; for example, for English, the Spoken LM Units may correspond to phonemes such as /d/, /p/, /g/. Speech tokens, in contrast, are discrete symbol representations obtained by preprocessing and specifically encoding the speech signal. Speech tokens may be discretized representations based on the Spoken LM Units, or other forms of symbol encoding. Speech tokens are discrete symbols, similar to words in text, used for modeling and generating speech signals; in a speech language model, speech tokens play a role similar to the vocabulary in a text language model.
That is, the Spoken LM Units are the segments and slices of the speech signal, and the speech tokens are symbolic representations obtained by further discretizing and encoding the Spoken LM Units. In other words, the speech tokens are higher-level abstractions and representations of the Spoken LM Units.
In general, the Spoken LM Units may contain some characteristics related to phonetic pronunciation as well as other task-related characteristics, such as speaker information, language, and emotion, which may be extracted from the input audio or its transcription.
Thus, each speech unit comprises speech unit content and information describing the speech characteristics corresponding to that content, where the information describing the speech characteristics is related to the speech classification task. The speech units are the above-mentioned Spoken LM Units, the speech unit content is the above-mentioned basic unit obtained by segmenting and slicing the speech signal, and the information describing the speech characteristics may be the above-mentioned other task-related features. The information describing the speech characteristics is related to the speech classification task; for example, if the speech classification task is language classification, the information describing the speech characteristics may be a language label.
S305: classifying, by a speech analysis model, each speech unit based on information describing characteristics of speech of the speech unit output by the speech language model.
Since each speech unit includes speech unit content and information describing speech characteristics corresponding to the speech unit content, wherein the information describing speech characteristics is related to the speech classification task, the respective speech units can be classified by the information describing speech characteristics of each speech unit.
As an embodiment, the speech analysis model may be a language analyzer (Verbalizer), which may be a linear model or a neural network (NN) model for mapping discrete units to the categories of each task, where the discrete units are the aforementioned speech units.
As shown in fig. 4, multi-modal target data is input and can be perceived, understood, and reasoned over through the ChatBridge model to obtain the prompt information, i.e. to determine which classification task applies. For example, the ChatBridge model distinguishes between language recognition and emotion recognition tasks, i.e., the tasks to be performed are to determine what language the input speech belongs to and what emotion it conveys.
Then, the voice feature vector and the prompt information are input into the Speech LM model together; the Speech LM model inputs the Spoken LM Units into the Verbalizer module, and the Verbalizer module can determine the type of each unit by identifying the information describing the speech characteristics. As shown in fig. 5, the Verbalizer module recognizes that the input speech belongs to the Japanese language and that the emotion of the speech is happy.
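A minimal sketch of the linear-model variant of the Verbalizer follows; the vocabulary size, embedding dimension, class count, and mean pooling are illustrative assumptions, since the text only states that the Verbalizer maps discrete units to the categories of each task.

    import torch
    import torch.nn as nn

    class Verbalizer(nn.Module):
        def __init__(self, unit_vocab: int = 100, embed_dim: int = 64, num_classes: int = 4):
            super().__init__()
            self.unit_embed = nn.Embedding(unit_vocab, embed_dim)
            self.classifier = nn.Linear(embed_dim, num_classes)

        def forward(self, unit_ids: torch.Tensor) -> torch.Tensor:
            # unit_ids: (batch, T) discrete units output by the Speech LM.
            pooled = self.unit_embed(unit_ids).mean(dim=1)   # utterance-level pooling
            return self.classifier(pooled)                   # logits over task categories

    # e.g. language identification over four illustrative classes
    lid_verbalizer = Verbalizer(num_classes=4)
    logits = lid_verbalizer(torch.randint(0, 100, (1, 50)))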
As an implementation manner, the voice language model is further adjusted by a low-rank adaptation (Low-Rank Adaptation, LoRA) method, so that the adjusted voice language model can handle tasks other than the voice classification task.
In the above-described speech language model, low-rank adaptation may be used to optimize and improve the language model. By applying low-rank adaptation to language model fine tuning, the generalization ability of the model can be improved, overfitting reduced, and adaptability to specific fields or tasks enhanced.
Illustratively, the speech language model is first pre-trained on a large, generic corpus. Such pre-training enables the model to learn a wide range of linguistic knowledge and context. The pre-trained language model is then fine-tuned with labeled data or data for a specific task, for example when applying the model to a particular text generation task such as machine translation, text summarization, or dialog systems. During fine-tuning, low-rank adaptation techniques may be introduced to further optimize the model. Specifically, the parameter space of the model can be limited by introducing a low-rank constraint or adding a low-rank regularization term, giving the model a low-rank characteristic. This helps reduce the complexity of the model and improves its generalization ability, especially when the fine-tuning data are limited.
That is, the voice language model is adjusted by the low-rank adaptation method so that it can also be used for other tasks based on the voice feature vector, the prompt information, and the plurality of speech tokens it outputs.
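A minimal sketch of low-rank adaptation applied to a single frozen linear layer follows (rank, scaling, and initialization are illustrative); the effective weight becomes W + (alpha/r) * B A while only the low-rank factors A and B are trained.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False              # keep the original weights frozen
            # Low-rank factors: A is (r, in_features), B is (out_features, r).
            self.lora_a = nn.Parameter(0.01 * torch.randn(r, base.in_features))
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
            self.scale = alpha / r

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Frozen path plus trainable low-rank update.
            return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

    # Usage: wrap a projection inside the speech language model.
    adapted = LoRALinear(nn.Linear(768, 768))
    out = adapted(torch.randn(2, 10, 768))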
As shown in fig. 6, a prompt information generation interface is provided, in which a user may input target data; after the "generate prompt" control is clicked, the operation of obtaining the prompt information from the recognition model based on the target data is automatically executed. The interface includes a text input area and a data input area 601 of a preset type; the user clicks the data input area 601 to select and input video, voice, image, thermodynamic diagram, or other types of files as the target data. Therefore, obtaining the prompt information from the recognition model based on the target data may be implemented by obtaining the target data input by the user through the prompt information generation interface and obtaining the prompt information corresponding to the target data based on the recognition model.
In one embodiment, the prompt information generation interface further includes a recommendation control 602, and when the user operates the recommendation control 602, the client can determine auxiliary classification task information in a recommendation mode, and the voice classification task information obtained by the recognition model is fused with the auxiliary classification task information to obtain the prompt information. Specifically, the voice classification task information obtained by the recognition model is used as first classification task information, the auxiliary classification task information determined in the recommendation mode is used as second classification task information, and the voice classification task information is obtained based on the first classification task information and the second classification task information so as to obtain prompt information.
As an embodiment, the second classification task information may include accent classification. Illustratively, the current geographic location information of the electronic device is obtained, and the second classification task information is determined based on it. It can be appreciated that people in different areas may speak in different ways, so the current geographic location of the electronic device can be obtained, the speaking style of the population corresponding to that location can be determined, and speaker description information matching that speaking style can be determined.
For example, the local speaking style may be an accent, also referred to as a dialect, characterized by the pronunciation of residents in the area of the current geographic location. Determining the speaker description information that matches this speaking style is performed by taking the dialect of the current geographic location as the target accent of the accent classification; that is, the accent classification determines the category of the target accent among the plurality of voice units, or determines whether the voice data to be processed includes the target accent.
In one embodiment, the second classification task information may include speaker classification: it is determined whether the current geographic location information is within a specified location range; if so, the target speaker identity corresponding to that specified location range is determined and taken as the target speaker of the speaker classification. That is, the speaker classification determines the category belonging to the target speaker among the plurality of voice units, or determines whether the voice data to be processed includes the target speaker.
The specified location range may be predetermined and correspond to a business or residence associated with a preset identity; for example, the preset identity may be a medical worker, a salesperson, etc. A correspondence table may therefore be pre-established, containing a plurality of specified location ranges and the preset identity corresponding to each range. After the electronic device obtains the current geographic location information, it matches the current geographic location against each specified location range in the correspondence table, takes the matched range as the target location range, takes the preset identity corresponding to the target location range in the correspondence table as the target identity, and uses the target identity as the speaker description information, so that the subsequently generated target voice data can imitate the speaking tone of the target identity.
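A minimal sketch of such a correspondence table lookup follows; the rectangular ranges, the coordinates, and the identities in the table are hypothetical placeholders.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class LocationRange:
        lat_min: float
        lat_max: float
        lon_min: float
        lon_max: float

        def contains(self, lat: float, lon: float) -> bool:
            return (self.lat_min <= lat <= self.lat_max
                    and self.lon_min <= lon <= self.lon_max)

    # Correspondence table: specified location range -> preset identity.
    IDENTITY_TABLE = [
        (LocationRange(39.90, 39.92, 116.39, 116.41), "medical worker"),
        (LocationRange(31.22, 31.24, 121.46, 121.48), "salesperson"),
    ]

    def target_speaker_identity(lat: float, lon: float) -> Optional[str]:
        # Return the preset identity whose range contains the current position;
        # this identity becomes the target speaker of the speaker classification.
        for rng, identity in IDENTITY_TABLE:
            if rng.contains(lat, lon):
                return identity
        return None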
In addition, information about the people currently around the electronic device can be acquired, and the speaker classification information can be determined based on it. For example, the current surrounding crowd information may include identity information of nearby people, such as occupation, age, gender, and race. The electronic device may capture images of the surrounding people through its camera and obtain the crowd information by analyzing the captured images; for example, occupation may be identified from the clothing worn, and age, gender, and race may be estimated from appearance features. After the identity information of the surrounding people is obtained, it is taken as the aforementioned target speaker. For example, if the current surrounding crowd information includes the age and gender of the surrounding people, the age and gender are taken as the speaker description information; if the surrounding people are 40-year-old women, the target speaker may be a middle-aged female.
As another embodiment, the second classification task information may include an emotion classification. In this case, historical data of the electronic device within a preset time period is acquired, a target emotion is determined based on the historical data, and the target emotion is used as the target emotion of the emotion classification; that is, the emotion classification determines the category of the target emotion among the plurality of voice units, or determines whether the voice data to be processed includes speech with the target emotion.
For example, the historical data may include voice data input through the electronic device in a preset period of time, and the embodiment of determining the target emotion based on the historical data may be to obtain the voice data input through the electronic device in the preset period of time as the historical voice data, analyze the historical voice data to obtain an emotion tag as the historical emotion, and determine the target emotion based on the historical emotion, for example, the historical emotion may be taken as the target emotion.
In one embodiment, the historical voice data may be voice data input by the user of the electronic device or voice data input by a dialogue partner of that user. In this embodiment of the present application, in order to accommodate the emotion of the other party, the historical voice data is the voice data input by the dialogue partner of the user of the electronic device. The voice data input by the dialogue partner may be determined as follows: the identity information of the user of the electronic device and the preset voiceprint features corresponding to that identity are obtained in advance, where the identity information may be the account with which the user is logged into the electronic device or the target client; then, for all voice data collected by the electronic device within the preset time period, the voiceprint features corresponding to each piece of voice data are determined, the voiceprint features that do not match the preset voiceprint features of the user are taken as target voiceprint features, and the voice data corresponding to the target voiceprint features are used as the historical data.
In addition, the historical data may be operation data generated when the user operates the electronic device within the preset time period. The operation data may include data generated when operating a specific application program, which may be a multimedia application; for example, the operation data may be the content browsed in the multimedia application. The emotion corresponding to the content is determined by identifying the browsed content and taken as the historical emotion, and the target emotion is determined based on the historical emotion. The multimedia application may include a video playing application, an audio playing application, a social media application, etc., and the content browsed in the multimedia application may include played videos, played audio, read articles, etc. For example, if the historical emotion is a positive emotion, it is taken as the target emotion; if the historical emotion is a negative emotion, the positive emotion corresponding to it is taken as the target emotion.
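A minimal sketch of the mapping from historical emotion to target emotion described above; the emotion labels and the negative-to-positive counterparts are illustrative assumptions.

    POSITIVE_EMOTIONS = {"happy", "calm", "relaxed"}
    POSITIVE_COUNTERPART = {"sad": "happy", "angry": "calm", "anxious": "relaxed"}

    def target_emotion(historical_emotion: str) -> str:
        # A positive historical emotion is kept as the target emotion; a negative
        # one is mapped to a corresponding positive emotion.
        if historical_emotion in POSITIVE_EMOTIONS:
            return historical_emotion
        return POSITIVE_COUNTERPART.get(historical_emotion, "happy")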
The operation data may also be performance information of a game application, which may include winning results, level-passing results, and the like; such information can affect the emotion of the user, so the historical emotion of the user can be determined by recognizing it.
As an embodiment, the embodiment of obtaining the voice classification task information based on the first classification task information and the second classification task information may be that the first classification task information and the second classification task information are used together as the voice classification task information, that is, the voice classification task information for the operation of voice classification includes both the first classification task information and the second classification task information.
Referring to fig. 7, a block diagram illustrating a voice classification apparatus 700 according to an embodiment of the present application is shown, where the apparatus may include: an acquisition unit 701, a determination unit 702, and a processing unit 703.
An obtaining unit 701, configured to obtain a speech feature vector corresponding to speech data to be processed.
The determining unit 702 is configured to obtain a prompt message obtained by the recognition model based on the target data, where the type of the target data includes a text type and a preset type, and the prompt message includes voice classification task information.
Further, the preset type includes at least one of video, voice, image and thermodynamic diagram.
A processing unit 703, configured to perform a classification operation on the voice data based on the voice feature vector and the prompt information.
Further, the processing unit 703 is further configured to input the speech feature vector and the prompt information together into a speech language model; acquiring a plurality of voice units output by the voice language model based on the voice feature vector and the prompt information, wherein each voice unit comprises voice unit content and information describing voice characteristics corresponding to the voice unit content, and the information describing the voice characteristics is related to the voice classification task; classifying, by a speech analysis model, each speech unit based on information describing characteristics of speech of the speech unit output by the speech language model.
Further, the processing unit 703 is further configured to adjust the speech language model by using a low-rank adaptation method, so that the adjusted speech language model has a function of processing tasks other than the speech classification task.
Further, the voice language model is a Speech LM model, and the voice analysis model is a Verbalizer model.
Further, the identification model is a ChatBridge model.
Further, the speech classification task includes at least one of keyword recognition, intent recognition, language recognition, false speech detection, emotion recognition, accent classification, irony detection, gender recognition, voice activity detection, audio classification, dysarthria classification, and voice command recognition.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In several embodiments provided herein, the coupling of the modules to each other may be electrical, mechanical, or other.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
Referring to fig. 8, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 100 may be a smart phone, a tablet computer, an electronic book, or the like capable of running an application program. The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, wherein the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more program(s) configured to perform the method as described in the foregoing method embodiments.
Processor 110 may include one or more processing cores. The processor 110 uses various interfaces and lines to connect the various portions of the electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and invoking data stored in the memory 120. Alternatively, the processor 110 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 110 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; the modem is used to handle wireless communications. It will be appreciated that the modem may not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The Memory 120 may include a random access Memory (Random Access Memory, RAM) or a Read-Only Memory (Read-Only Memory). Memory 120 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 120 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described below, etc. The storage data area may also store data created by the terminal 100 in use (such as phonebook, audio-video data, chat-record data), etc.
Referring to fig. 9, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable medium 900 has stored therein program code which can be invoked by a processor to perform the methods described in the method embodiments described above.
The computer readable storage medium 900 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, computer readable storage medium 900 includes non-volatile computer readable media (non-transitory computer-readable storage medium). The computer readable storage medium 900 has storage space for program code 910 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. Program code 910 may be compressed, for example, in a suitable form.
In summary, according to the voice classification method, apparatus, electronic device, and computer readable medium provided by the application, the voice feature vector corresponding to the voice data to be processed is obtained, and the prompt information obtained by the recognition model based on the target data is obtained, where the type of the target data includes a text type and a preset type and the prompt information includes voice classification task information; a classification operation is then performed on the voice data based on the voice feature vector and the prompt information. The prompt information can be determined through recognition of the target data, and since the target data includes types other than text, a way of inputting prompt information through data of multiple modalities is provided.
In addition, by training and optimizing Prompt words over multi-modal data, the model can accept Prompts in multiple modalities, realizing perception and understanding of input Prompt words of any modality; by training on multiple voice tasks, the input voice can be recognized under multiple voice classification tasks, which can greatly improve the richness and fineness of voice interaction.
Finally, it should be noted that: the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that: the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.