CN109473091B - Voice sample generation method and device - Google Patents

Info

Publication number
CN109473091B
Authority
CN
China
Prior art keywords
voice
variable
mel
speech
value
Prior art date
2018-12-25
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811593971.6A
Other languages
Chinese (zh)
Other versions
CN109473091A (en)
Inventor
魏华强
李锐
彭凝多
唐博
彭恒进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Hongwei Technology Co Ltd
Original Assignee
Sichuan Hongwei Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-12-25
Filing date
2018-12-25
Publication date
2021-08-10
2018-12-25 Application filed by Sichuan Hongwei Technology Co Ltd
2018-12-25 Priority to CN201811593971.6A
2019-03-15 Publication of CN109473091A
2021-08-10 Application granted
2021-08-10 Publication of CN109473091B
Status: Active
2038-12-25 Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0635 Training updating or merging of old and new templates; Mean values; Weighting
    • G10L2015/0636 Threshold criteria for the updating

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a voice sample generation method and device. The method comprises the following steps: after a first voice variable is obtained, extracting a Mel frequency feature value of the first voice variable; calculating, with a neural network, a loss function between the Mel frequency feature value of the first voice variable and the Mel frequency feature value of a target voice; and optimizing the loss function by adjusting the values of the sampling points in the first voice variable with an optimization algorithm in the neural network, until the optimized value of the loss function is smaller than a preset threshold value, and taking the voice variable whose loss function value is smaller than the preset threshold value as a target voice sample. In this way, the inverse Mel transformation of the voice variable is solved by the neural network: the error between the Mel frequency feature value of the voice variable and that of the target voice is optimized until it is smaller than the preset threshold value, and the resulting voice variable is used as an adversarial sample, thereby enriching the voice sample set of the speech recognition system.

Description

Voice sample generation method and device

Technical Field

The invention relates to the technical field of voice recognition, in particular to a voice sample generation method and device.

Background

In existing speech recognition systems based on deep learning models, incomplete corpora and scarce voice sample sets leave the systems insufficiently robust and easily disturbed by adversarial examples.

Disclosure of Invention

The invention provides a voice sample generation method and a voice sample generation device, which aim to solve the problem of the scarcity of voice sample sets for speech recognition systems.

In order to achieve the above object, the technical solutions provided by the embodiments of the present invention are as follows:

In a first aspect, an embodiment of the present invention provides a method for generating a speech sample, including: after a first voice variable is obtained, extracting a Mel frequency feature value of the first voice variable, wherein the feature parameters of the first voice variable are the same as the feature parameters of the target voice, and the feature parameters include length, sampling rate, and channel; calculating, with a neural network, a loss function between the Mel frequency feature value of the first voice variable and the Mel frequency feature value of the target voice; and optimizing the loss function by adjusting the values of the sampling points in the first voice variable with an optimization algorithm in the neural network, until the optimized value of the loss function is smaller than a preset threshold value, the voice variable whose loss function value is smaller than the preset threshold value being the target voice sample. In this way, the inverse Mel transformation of the voice variable is solved by the neural network: the error between the Mel frequency feature value of the voice variable and that of the target voice is optimized until it is smaller than the preset threshold value, and the resulting voice variable is used as an adversarial sample, thereby enriching the voice sample set of the speech recognition system.

In an optional embodiment of the present invention, the extracting the mel-frequency feature value of the first speech variable includes: carrying out Fourier transform on each frame of the first voice variable to obtain a second voice variable; carrying out Mel filtering on the second voice variable to obtain a third voice variable; and performing discrete cosine transform on the third voice variable to obtain a Mel scale cepstrum, and taking the Mel scale cepstrum as the Mel frequency feature value of the first voice variable. Therefore, the mel-frequency feature value of a speech variable may be extracted by Fourier transform, Mel filtering, and discrete cosine transform, so that the resulting Mel scale cepstrum serves as the Mel frequency feature value and represents the speech variable well.

In an optional embodiment of the present invention, after the discrete cosine transforming the third speech variable to obtain a mel scale cepstrum, the method further includes: carrying out difference operation on the Mel scale cepstrum; the taking the mel scale cepstrum as the mel frequency characteristic value of the first voice variable comprises: and inserting the result of the differential operation into the Mel scale cepstrum to obtain the Mel frequency characteristic value of the first voice variable. Therefore, the difference value of the Mel scale cepstrum extracted from the previous frame and the next frame in the voice variable is used as a parameter representing the inter-frame dynamic information of the voice variable, and is supplemented into the Mel scale cepstrum to be used as the Mel frequency characteristic value of the voice variable, so that the voice recognition system has a wider application range after being trained by using the voice variable.

In an optional embodiment of the present invention, before fourier transforming each frame of the first speech variable to obtain a second speech variable, the method further comprises: and carrying out high-pass filtering processing on the first voice variable, dividing the first voice variable after filtering processing into continuous frames, and carrying out windowing processing on each frame. Therefore, before solving the mel-frequency characteristic value of the voice variable, pre-emphasis processing such as filtering, framing, windowing and the like can be firstly carried out on the voice variable, so that the voice variable obtained by processing is more beneficial to solving the mel-frequency characteristic value.

In an alternative embodiment of the present invention, the obtaining the first speech variable includes: generating a voice segment; and formatting the voice segment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are the same as the characteristic parameters of the target voice. Therefore, the speech variable may be a segment of randomly generated speech, and its length, sampling rate, and channel should be the same as those of the target speech, so as to ensure that the finally optimized speech variable can be used as a sample for the speech recognition system.

In an optional embodiment of the invention, before the extracting the mel-frequency feature value of the first speech variable, the method further comprises: obtaining the target voice; and extracting the Mel frequency characteristic value of the target voice. Therefore, before processing the voice variable, a target voice, which is a target for optimizing the voice variable, may be obtained first.

In an optional embodiment of the present invention, after the speech variable satisfying the loss function and having a value smaller than the preset threshold is a target speech sample, the method further comprises: and training a voice recognition system by using the neural network by taking the target voice sample as a sample. Therefore, after the neural network is used for obtaining the speech variable meeting the standard, the speech variable can be used as a training sample to train the speech recognition system, and the robustness of the speech system is improved.

In a second aspect, an embodiment of the present invention provides a speech sample generation apparatus, including: a first extraction module, configured to extract a Mel frequency feature value of a first voice variable after the first voice variable is obtained, wherein the feature parameters of the first voice variable are the same as the feature parameters of the target voice, and the feature parameters include: length, sampling rate, and channel; a first calculation module, configured to calculate a loss function between the Mel frequency feature value of the first speech variable and the Mel frequency feature value of the target speech by using a neural network; and an optimization module, configured to optimize the loss function by adjusting the values of sampling points in the first voice variable with an optimization algorithm in the neural network until the optimized value of the loss function is smaller than a preset threshold value, the voice variable whose loss function value is smaller than the preset threshold value being the target voice sample. Therefore, the first extraction module solves the inverse Mel transformation of the voice variable based on the neural network, and the optimization module optimizes the error between the Mel frequency feature value of the voice variable and that of the target voice through the neural network, so as to obtain the voice variable whose error is smaller than the preset threshold; this voice variable is used as an adversarial sample, thereby enriching the voice sample set of the speech recognition system.

In an optional embodiment of the invention, the first extraction module comprises: the first transformation module is used for carrying out Fourier transformation on each frame of the first voice variable to obtain a second voice variable; the first filtering module is used for carrying out Mel filtering on the second voice variable to obtain a third voice variable; and the second transformation module is used for performing discrete cosine transformation on the third voice variable to obtain a Mel scale cepstrum, and taking the Mel scale cepstrum as a Mel frequency characteristic value of the first voice variable. Therefore, the process of extracting the mel-frequency feature values of the speech variables by the first extraction module may be as follows: the first transformation module is used for Fourier transformation, the first filtering module is used for Mel filtering, and the second transformation module is used for discrete cosine transformation, so that the obtained Mel scale cepstrum is used as a Mel frequency characteristic value of a voice variable, and the voice variable can be better represented.

In an alternative embodiment of the invention, the apparatus further comprises: a second calculation module, used for carrying out difference operation on the Mel scale cepstrum; the second transformation module comprises: an inserting module, used for inserting the result of the differential operation into the Mel scale cepstrum to obtain the Mel frequency feature value of the first voice variable. Therefore, the difference of the Mel scale cepstra extracted from the previous frame and the next frame of the voice variable, calculated in the second calculation module, is used as a parameter representing the inter-frame dynamic information of the voice variable and is supplemented into the Mel scale cepstrum by the insertion module; together they serve as the Mel frequency feature value of the voice variable, so that the voice recognition system has a wider application range after being trained with the voice variable.

In an alternative embodiment of the invention, the apparatus further comprises: and the third filtering module is used for carrying out high-pass filtering processing on the first voice variable, dividing the first voice variable after filtering processing into continuous frames and carrying out windowing processing on each frame. Therefore, before the first extraction module is used for solving the mel-frequency characteristic value of the voice variable, the third filtering module can be used for carrying out pre-emphasis processing such as filtering, framing and windowing on the voice variable, so that the processed voice variable is more beneficial to solving the mel-frequency characteristic value.

In an optional embodiment of the invention, the first extraction module comprises: a generating module, used for generating a voice segment; and a formatting module, used for formatting the voice segment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are the same as the characteristic parameters of the target voice. Therefore, the voice variable can be a section of voice randomly generated by the generation module, and its length, sampling rate, and channel should be the same as those of the target voice, so as to ensure that the finally optimized voice variable can be used as a sample for the voice recognition system.

In an alternative embodiment of the invention, the apparatus further comprises: an obtaining module, configured to obtain the target voice; and the second extraction module is used for extracting the Mel frequency characteristic value of the target voice. Therefore, before processing the voice variable, a target voice, which is a target for optimizing the voice variable, may be obtained by the obtaining module.

In an alternative embodiment of the invention, the apparatus further comprises: and the training module is used for training the voice recognition system by using the target voice sample as a sample and utilizing the neural network. Therefore, after the neural network is used for obtaining the speech variable meeting the standard, the training module can be used for training the speech recognition system by taking the speech variable as a training sample, so that the robustness of the speech system is improved.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the method of any of the first aspects.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the method described in any one of the optional implementation manners of the first aspect.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the embodiments will be briefly described below. It should be appreciated that the following drawings depict only some embodiments of the invention and are therefore not to be considered limiting of its scope; those skilled in the art will be able to derive additional related drawings from them without inventive effort.

Fig. 1 is a flowchart of a method for generating a speech sample according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for generating a speech sample according to an embodiment of the present invention;

FIG. 3 is a flow chart of another method for generating a speech sample according to an embodiment of the present invention;

FIG. 4 is a flow chart of another method for generating a speech sample according to an embodiment of the present invention;

FIG. 5 is a flow chart of another method for generating a speech sample according to an embodiment of the present invention;

fig. 6 is a block diagram of a speech sample generation apparatus according to an embodiment of the present invention.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

In the description of the present invention, it should be noted that the terms "middle", "upper", "lower", "horizontal", "inner", "outer", and the like indicate orientations or positional relationships based on orientations or positional relationships shown in the drawings or orientations or positional relationships conventionally laid out when products of the present invention are used, and are only used for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like are used merely to distinguish one description from another, and are not to be construed as indicating or implying relative importance.

Furthermore, the terms "horizontal", "vertical" and the like do not mean that the components are required to be absolutely horizontal or vertical, but rather may be slightly inclined. For example, "horizontal" merely means that the direction is more nearly horizontal than "vertical"; it does not mean that the structure must be perfectly horizontal, and the structure may be slightly inclined.

In the description of the present invention, it should be noted that the terms "disposed," "connected," and "connected" are to be construed broadly and may be, for example, fixedly connected, detachably connected, or integrally connected unless otherwise explicitly stated or limited. Either mechanically or electrically. They may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.

Some embodiments of the invention are described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.

First embodiment

Referring to fig. 1, fig. 1 is a flowchart of a method for generating a speech sample according to an embodiment of the present invention, where the method includes the following steps:

step S100: after obtaining a first voice variable, extracting a Mel frequency characteristic value of the first voice variable.

Specifically, in the field of sound processing, the mel-frequency cepstrum is a linear transformation of the log-energy spectrum based on a non-linear mel scale of sound frequency. This scale approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal log-cepstrum, so the non-linear representation can describe speech signals better in multiple domains. Extracting the Mel Frequency Cepstrum Coefficient (MFCC) feature values of a section of speech is a common means for those skilled in the art and can be done in many ways; the embodiment of the present invention is not specifically limited.

For example, the speech is pre-emphasized, framed, and windowed, and then for each short analysis window a Fast Fourier Transform (FFT) is used to obtain the corresponding spectrum. The obtained spectrum is passed through a Mel filter bank to obtain a Mel spectrum, and finally cepstral analysis is performed on the Mel spectrum to obtain the Mel frequency cepstrum coefficients, which are the features of that frame of speech. The cepstral analysis may include taking the logarithm and performing an inverse transformation; the inverse transformation is usually implemented by a Discrete Cosine Transform (DCT).

It should be noted that, in the embodiment of the present invention, the first speech variable used for mel-frequency feature value extraction may be a randomly generated or sampled section of speech, and may be noise, silence, or any speech. However, after such a section of speech is obtained, it needs to be formatted so that its feature parameters are the same as those of a section of target speech, where the feature parameters may include length, sampling rate, channel, and so on. The target speech is the reference object for the speech variable, which ultimately needs to be as close to the target speech as possible. Alternatively, instead of formatting, a section of speech whose feature parameters already equal those of the target speech can be generated directly, which simplifies the formatting step.
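As a concrete illustration, the whole extraction chain can be exercised with the open-source librosa library; this is a minimal sketch under the assumption of a 16 kHz mono recording, 13 cepstral coefficients, and an illustrative file name, none of which the patent prescribes:

```python
import librosa

# Load a speech segment; the sampling rate and the file name are assumptions.
signal, sr = librosa.load("speech.wav", sr=16000)

# Pre-emphasis, framing, windowing, FFT, mel filtering and DCT are all
# folded into librosa's MFCC routine.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```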

Step S200: calculating a loss function of the Mel frequency eigenvalue of the first speech variable and the Mel frequency eigenvalue of the target speech using a neural network.

Specifically, after the mel-frequency feature value of the first speech variable is calculated in step S100, the error between the first speech variable and the target speech can be represented by a loss function between the mel-frequency feature value of the first speech variable and the mel-frequency feature value of the target speech. This process can be realized by a neural network: the error between the two feature values is solved by using a logarithmic loss function, a square loss function, an exponential loss function, or another loss function in the neural network.

It should be noted that the method for solving the error by using the loss function in the neural network is a common means of those skilled in the art, and may be obtained in many ways, and the embodiment of the present invention is not particularly limited.
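For example, a squared loss over the two feature matrices can be written in a few lines of PyTorch; this is only one of the loss functions the text permits, not a choice mandated by the patent:

```python
import torch

def mfcc_loss(mfcc_variable: torch.Tensor, mfcc_target: torch.Tensor) -> torch.Tensor:
    # Mean squared error between the Mel frequency feature values of the
    # first voice variable and of the target voice.
    return torch.mean((mfcc_variable - mfcc_target) ** 2)
```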

Step S300: and optimizing the loss function by adjusting the values of sampling points in the first voice variable by utilizing an optimization algorithm in the neural network until the optimized value of the loss function is smaller than a preset threshold value, wherein the voice variable with the value of the loss function smaller than the preset threshold value is a target voice sample.

Specifically, after the error between the mel-frequency feature value of the first voice variable and that of the target voice is expressed by the loss function in step S200, the loss function can be optimized continuously by an optimization algorithm in the neural network. Each time an error value is calculated, it is compared with a preset threshold. When the error value is greater than the preset threshold, the values of a number of sampling points in the voice variable are changed, the mel-frequency feature value of the new voice variable and its loss function against the mel-frequency feature value of the target voice are calculated, and the optimized error value is compared with the preset threshold again; this process repeats while the error value remains greater than the preset threshold. When the error value is smaller than the preset threshold, the optimization ends, and the voice variable corresponding to that error value is output; this voice variable is the voice sample that is closest to the target voice and meets the requirement in the embodiment of the present invention. The preset threshold is a fixed value set in advance by the user according to the actual situation: in the process of optimizing the loss function, once the optimization result is smaller than this fixed value, the loss function is considered to have reached a minimum, which avoids an overly long optimization process that never yields a voice variable meeting the requirement.

It should be noted that, the method of optimizing the loss function by using the neural network to obtain the optimal solution satisfying the condition that the loss function is sufficiently small is a common means of those skilled in the art, and the loss function may be optimized in many ways, such as gradient descent, least square method, etc., and the embodiment of the present invention is not limited specifically.
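A minimal sketch of this loop, assuming gradient descent (here the Adam variant) as the optimization algorithm, a squared loss, and torchaudio's differentiable MFCC transform as one realization of the step S100 pipeline; the threshold, learning rate, and step budget are illustrative values:

```python
import torch
import torchaudio

# One differentiable realization of the feature extraction of step S100.
extract_mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)

def generate_sample(target: torch.Tensor, threshold: float = 1e-4,
                    lr: float = 1e-3, max_steps: int = 10000) -> torch.Tensor:
    # First voice variable: same length (and hence shape) as the target voice.
    variable = torch.randn_like(target, requires_grad=True)
    target_mfcc = extract_mfcc(target).detach()
    optimizer = torch.optim.Adam([variable], lr=lr)
    for _ in range(max_steps):
        optimizer.zero_grad()
        loss = torch.mean((extract_mfcc(variable) - target_mfcc) ** 2)
        if loss.item() < threshold:  # optimized loss below the preset threshold
            break
        loss.backward()   # gradients with respect to the sampling-point values
        optimizer.step()  # adjust the values of the sampling points
    return variable.detach()  # the target voice sample (adversarial sample)
```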

In the embodiment of the invention, the inverse Mel transformation of the voice variable is solved based on the neural network, and the error between the Mel frequency feature value of the voice variable and that of the target voice is optimized through the neural network, so as to obtain the voice variable whose error is smaller than the preset threshold value; this voice variable is used as an adversarial sample, thereby enriching the voice sample set of the speech recognition system.

Further, referring to fig. 2, fig. 2 is a flowchart of another speech sample generating method according to an embodiment of the present invention, and step S100 includes the following steps:

step S110: and carrying out Fourier transform on each frame of the first voice variable to obtain a second voice variable.

Specifically, in the process of solving the mel-frequency feature value of the first voice variable, a short-time Fourier transform may be performed on each frame of the first voice variable and the power spectrum of each frame calculated, yielding the second voice variable; in this way, the information of the first voice variable is converted from the time domain to the frequency domain.

Step S120: and carrying out Mel filtering on the second voice variable to obtain a third voice variable.

Specifically, the Fourier-transformed second speech variable is mel-filtered through a mel filter. In an embodiment of the present invention, n triangular band-pass filters, distributed evenly on the mel frequency scale, may be used as the mel filter. After the second speech variable is multiplied by the n triangular band-pass filters, the logarithmic energy output by each filter is obtained; these logarithmic energies are the third speech variable.

Step S130: and performing discrete cosine transform on the third voice variable to obtain a Mel scale cepstrum, and taking the Mel scale cepstrum as a Mel frequency characteristic value of the first voice variable.

Specifically, the third speech variable, that is, the n logarithmic energies, is subjected to a discrete cosine transform to obtain an L-order mel scale cepstrum, which is the mel-frequency feature value of the first speech variable.

In the embodiment of the present invention, the mel-frequency feature value of a speech variable may thus be extracted by Fourier transform, Mel filtering, and discrete cosine transform, so that the resulting Mel scale cepstrum serves as the Mel frequency feature value and represents the speech variable well.
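The three steps can be sketched directly in NumPy, with librosa supplying the triangular mel filter bank; the FFT size, the filter count n = 26, and the cepstrum order L = 13 are conventional assumptions rather than values fixed by the patent:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mel_cepstrum(frames: np.ndarray, sr: int = 16000, n_fft: int = 512,
                 n_mels: int = 26, n_ceps: int = 13) -> np.ndarray:
    # frames: (num_frames, frame_length) windowed frames of the first voice variable.
    # Step S110: Fourier transform of each frame, kept as a power spectrum
    # (the second voice variable).
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # Step S120: n triangular band-pass filters evenly spaced on the mel scale;
    # the log energy of each filter output is the third voice variable.
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    log_energy = np.log(spectrum @ mel_fb.T + 1e-10)
    # Step S130: discrete cosine transform, keeping an L-order mel scale cepstrum.
    return dct(log_energy, type=2, axis=1, norm="ortho")[:, :n_ceps]
```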

Further, referring to fig. 3, fig. 3 is a flowchart of another speech sample generation method according to an embodiment of the present invention, and steps S110 to S130 may be replaced with the following steps:

step S110: and carrying out Fourier transform on each frame of the first voice variable to obtain a second voice variable.

Step S120: and carrying out Mel filtering on the second voice variable to obtain a third voice variable.

Step S131: and performing discrete cosine transform on the third voice variable to obtain a Mel scale cepstrum.

Step S140: and carrying out difference operation on the Mel scale cepstrum.

Specifically, after the mel scale cepstrum of the first speech variable is obtained through discrete cosine transform in step S131, discrete difference operation may be performed on the mel scale cepstrum, that is, discrete first-order difference calculation may be performed, discrete second-order difference calculation may be performed, or both the discrete first-order difference calculation and the discrete second-order difference calculation may be performed, so as to obtain a difference calculated value.

Step S150: and inserting the result of the differential operation into the Mel scale cepstrum to obtain the Mel frequency characteristic value of the first voice variable.

Specifically, the value obtained by the discrete difference calculation in step S140 is inserted into the mel scale cepstrum of the first speech variable to obtain the mel frequency characteristic value of the first speech variable, which is used as the dynamic information between frames of the first speech variable. It should be noted that only the value obtained by the discrete first order difference calculation may be inserted, the value obtained by the discrete second order difference calculation may be inserted, or the values obtained by the discrete first order difference calculation and the discrete second order difference calculation may be inserted at the same time.

In the embodiment of the invention, the difference of the Mel scale cepstra extracted from the previous frame and the next frame of the voice variable is taken as a parameter representing the inter-frame dynamic information of the voice variable and is supplemented into the Mel scale cepstrum; together they serve as the Mel frequency feature value of the voice variable, so that the voice recognition system has a wider application range after being trained with the voice variable.
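A sketch of this supplementation, assuming librosa's delta routine for the first- and second-order differences (the patent allows either or both):

```python
import numpy as np
import librosa

def add_deltas(ceps: np.ndarray) -> np.ndarray:
    # ceps: (n_ceps, num_frames) mel scale cepstrum, coefficients along axis 0.
    delta1 = librosa.feature.delta(ceps, order=1)  # discrete first-order difference
    delta2 = librosa.feature.delta(ceps, order=2)  # discrete second-order difference
    # Insert the difference results alongside the cepstrum itself to form the
    # Mel frequency feature value carrying inter-frame dynamic information.
    return np.vstack([ceps, delta1, delta2])
```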

Further, before step S110, the method further includes the following steps:

step S160: and carrying out high-pass filtering processing on the first voice variable, dividing the first voice variable after filtering processing into continuous frames, and carrying out windowing processing on each frame.

Specifically, before solving the mel-frequency feature value of the first speech variable, a series of processing may be performed on it. First, pre-emphasis processing is performed on the first voice variable, that is, the first voice signal is passed through a high-pass filter to eliminate the influence of the vocal cords and lips on the voice signal during its generation, thereby compensating the high-frequency part of the first voice signal that is suppressed by the pronunciation system.

Secondly, the pre-emphasized first speech signal is subjected to framing processing, that is, the continuous first speech signal is divided into a plurality of continuous frames, the length of each frame can be controlled within the range of 20-50 milliseconds, and the number of corresponding sampling points is equal to the product of the sampling rate of the first speech signal and the length of each frame.

Finally, in order to keep the two end points of each frame of the framed first speech signal smooth and continuous, windowing may be performed on each frame. This is because the subsequent Fourier transform of the first speech signal assumes that the signal within a frame represents a periodic signal; if this periodicity does not hold, the Fourier transform introduces energy distributions absent from the original signal in order to fit the discontinuities at the left and right ends, which results in analysis error. In the embodiment of the invention, each frame of the framed first voice signal is multiplied by a Hamming window of the same length so as to keep the left and right ends of the speech frame continuous.
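These three pre-emphasis steps might look as follows in NumPy; the pre-emphasis coefficient 0.97, the 25 ms frame length, and the non-overlapping framing are simplifying assumptions (practical pipelines typically overlap frames):

```python
import numpy as np

def preprocess(signal: np.ndarray, sr: int = 16000,
               frame_ms: int = 25, alpha: float = 0.97) -> np.ndarray:
    # Pre-emphasis: a first-order high-pass filter that compensates the
    # high-frequency part suppressed by the pronunciation system.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: frame length in the 20-50 ms range; samples per frame equal
    # the sampling rate times the frame length.
    frame_len = int(sr * frame_ms / 1000)
    num_frames = len(emphasized) // frame_len
    frames = emphasized[: num_frames * frame_len].reshape(num_frames, frame_len)
    # Windowing: multiply each frame by a Hamming window of the same length
    # so the left and right ends of each frame stay continuous.
    return frames * np.hamming(frame_len)
```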

It should be noted that the specific processing of the first speech signal and the figures given above are only some of the schemes provided by the embodiment of the present invention; those skilled in the art can easily conceive of other ways of processing the first speech signal on this basis, and these are also schemes protected by the embodiment of the present invention.

In the embodiment of the invention, before the mel frequency characteristic value of the voice variable is solved, pre-emphasis processing such as filtering, framing, windowing and the like can be firstly carried out on the voice variable, so that the voice variable obtained by processing is more beneficial to solving the mel frequency characteristic value.

Further, referring to fig. 4, fig. 4 is a flowchart of another speech sample generating method according to an embodiment of the present invention, and step S100 further includes the following steps:

step S170: a speech segment is generated.

Specifically, the generation of the voice segment is a random process; it may be done by recording a segment of audio, downloading a segment of speech, and so on, and the generated voice segment may be a segment of noise, silence, or any speech.

Step S180: and formatting the voice fragment to obtain the first voice variable, so that the characteristic parameter of the first voice variable is the same as the characteristic parameter of the target voice.

Specifically, after the voice segment is generated in step S170, the voice variable may be formatted, so that the formatted first voice variable has the same characteristic parameters as the target voice, where the characteristic parameters may be length, sampling rate, channel, and so on. The target voice is the most original voice segment trained as a sample in the neural network.
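A sketch of such formatting with librosa, matching channel count, sampling rate, and length to the target voice; the trim-or-zero-pad policy is an assumption, since the patent only requires that the parameters match:

```python
import numpy as np
import librosa

def format_segment(segment: np.ndarray, seg_sr: int,
                   target_len: int, target_sr: int) -> np.ndarray:
    mono = librosa.to_mono(segment)  # match the target's (mono) channel count
    resampled = librosa.resample(mono, orig_sr=seg_sr, target_sr=target_sr)
    # Match the target length by trimming or zero-padding.
    if len(resampled) >= target_len:
        return resampled[:target_len]
    return np.pad(resampled, (0, target_len - len(resampled)))
```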

In the embodiment of the present invention, the speech variable may be a segment of randomly generated speech, and its characteristic parameters such as length, sampling rate, and channel should be the same as those of the target speech, so that the speech variable obtained through the final optimization can be used as a sample for the speech recognition system.

Further, referring to fig. 5, fig. 5 is a flowchart of another speech sample generating method according to an embodiment of the present invention, before step S100, the method further includes the following steps:

step S400: and obtaining the target voice.

Step S500: and extracting the Mel frequency characteristic value of the target voice.

Specifically, the target speech is the most original speech segment trained in the neural network as a sample, and its mel frequency feature value may be extracted in the same manner as in step S100. The optimization algorithm of the neural network then makes the first speech variable approach the target speech as closely as possible, ensuring that the finally optimized first speech variable can be trained as a sample of the neural network and achieving the purpose of increasing the robustness of the speech recognition system.

In the embodiment of the present invention, before processing the voice variable, a section of target voice may be obtained first, where the target voice is a target for optimizing the voice variable.

Further, after step S300, the following steps are also included:

step S600: and training a voice recognition system by using the neural network by taking the target voice sample as a sample.

Specifically, after the suitable first speech variable is obtained, it is the new adversarial sample of the speech recognition system. The first speech variable can be saved as speech with the same length, sampling rate, channel, and other characteristics as the target speech, with the amplitude of the speech waveform generally taking the normal range, i.e., -2^15 to 2^15 - 1. This voice can then be added into the original voice recognition system for adversarial training, thereby enhancing the robustness of the existing voice recognition system.
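For instance, the optimized variable can be clipped to that 16-bit range and written out as audio; the soundfile library and the file name are illustrative assumptions, as the patent names no I/O mechanism:

```python
import numpy as np
import soundfile as sf

def save_sample(variable: np.ndarray, sr: int, path: str = "adversarial_sample.wav") -> None:
    # Keep the waveform amplitude in the normal 16-bit range, -2^15 to 2^15 - 1.
    clipped = np.clip(np.round(variable), -2**15, 2**15 - 1).astype(np.int16)
    sf.write(path, clipped, sr, subtype="PCM_16")
```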

In the embodiment of the invention, after the speech variable meeting the standard is obtained by utilizing the neural network, the speech variable can be used as a training sample to train the speech recognition system, so that the robustness of the speech system is improved.

Second embodiment

Referring to fig. 6, fig. 6 is a block diagram illustrating a voice sample generating apparatus 600 according to an embodiment of the present invention, where the voice sample generating apparatus 600 includes: a first extraction module 610, configured to extract a mel-frequency feature value of a first voice variable after the first voice variable is obtained, wherein the feature parameters of the first voice variable are the same as the feature parameters of the target voice, and the feature parameters include: length, sampling rate, and channel; a first calculating module 620, configured to calculate a loss function of the mel-frequency feature value of the first voice variable and the mel-frequency feature value of the target voice by using a neural network; and an optimizing module 630, configured to optimize the loss function by adjusting values of sampling points in the first voice variable through an optimizing algorithm in the neural network until the optimized value of the loss function is smaller than a preset threshold, where a voice variable whose loss function value is smaller than the preset threshold is the target voice sample.

In the embodiment of the present invention, the first extraction module 610 is used to solve the inverse mel transformation of the speech variable based on the neural network, and the optimization module 630 is used to optimize the error between the mel-frequency feature value of the speech variable and that of the target speech through the neural network, so as to obtain the speech variable whose error is smaller than the preset threshold; this speech variable is used as an adversarial sample, thereby enriching the speech sample set of the speech recognition system.

Further, the first extraction module 610 includes: the first transformation module, used for carrying out Fourier transformation on each frame of the first voice variable to obtain a second voice variable; the first filtering module, used for carrying out Mel filtering on the second voice variable to obtain a third voice variable; and the second transformation module, used for performing discrete cosine transformation on the third voice variable to obtain a Mel scale cepstrum, and taking the Mel scale cepstrum as the Mel frequency feature value of the first voice variable.

In this embodiment of the present invention, the process of extracting the mel-frequency feature value of the speech variable by the first extraction module 610 may be: the first transformation module performs the Fourier transformation, the first filtering module performs the Mel filtering, and the second transformation module performs the discrete cosine transformation, so that the obtained Mel scale cepstrum is used as the Mel frequency feature value of the voice variable and represents the voice variable well.

Further, the apparatus further comprises: the second calculation module is used for carrying out difference operation on the Mel scale cepstrum; the second transformation module comprises: and the inserting module is used for inserting the result of the differential operation into the Mel scale cepstrum to obtain the Mel frequency characteristic value of the first voice variable.

In the embodiment of the invention, the difference of the mel scale cepstra extracted from the previous frame and the next frame of the voice variable, calculated in the second calculation module, is used as a parameter representing the inter-frame dynamic information of the voice variable and is supplemented into the mel scale cepstrum by the insertion module; together they are used as the mel frequency feature value of the voice variable, so that the voice recognition system has a wider application range after being trained with the voice variable.

Further, the apparatus further comprises: and the third filtering module is used for carrying out high-pass filtering processing on the first voice variable, dividing the first voice variable after filtering processing into continuous frames and carrying out windowing processing on each frame.

In the embodiment of the present invention, before the first extraction module 610 is used to solve the mel-frequency feature value of the speech variable, the third filtering module may perform pre-emphasis processing such as filtering, framing, and windowing on the speech variable, so that the processed speech variable is more favorable for solving the mel-frequency feature value.

Further, the first extraction module 610 includes: the generating module, used for generating a voice segment; and the formatting module, used for formatting the voice segment to obtain the first voice variable, so that the characteristic parameters of the first voice variable are the same as the characteristic parameters of the target voice.

In the embodiment of the present invention, the voice variable may be a segment of voice randomly generated by the generation module, and its length, sampling rate, and channel should be the same as those of the target voice, so that the finally optimized voice variable can be used as a sample for the voice recognition system.

Further, the apparatus further comprises: an obtaining module, configured to obtain the target voice; and the second extraction module is used for extracting the Mel frequency characteristic value of the target voice.

In the embodiment of the present invention, before processing the voice variable, a section of target voice, which is a target for optimizing the voice variable, may be obtained by using the obtaining module.

Further, the apparatus further comprises: and the training module is used for training the voice recognition system by using the target voice sample as a sample and utilizing the neural network.

In the embodiment of the invention, after the neural network is used for obtaining the speech variable meeting the standard, the training module can be used for training the speech recognition system by taking the speech variable as the training sample, so that the robustness of the speech system is improved.

Third embodiment

An embodiment of the present invention provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the method of any of the first embodiments.

The Memory may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.

The processor may be an integrated circuit chip having signal processing capabilities. The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and it may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

Fourth embodiment

An embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program performs the method described in any optional implementation part of the first embodiment.

In summary, the present invention provides a method and an apparatus for generating a voice sample, the method comprising: after a first voice variable is obtained, extracting a Mel frequency feature value of the first voice variable, wherein the feature parameters of the first voice variable are the same as the feature parameters of the target voice, and the feature parameters include length, sampling rate, and channel; calculating, with a neural network, a loss function between the Mel frequency feature value of the first speech variable and the Mel frequency feature value of the target speech; and optimizing the loss function by adjusting the values of sampling points in the first voice variable with an optimization algorithm in the neural network until the optimized value of the loss function is smaller than a preset threshold value, the voice variable whose loss function value is smaller than the preset threshold value being the target voice sample. In this way, the inverse Mel transformation of the voice variable is solved based on the neural network, and the error between the Mel frequency feature value of the voice variable and that of the target voice is optimized through the neural network, so as to obtain the voice variable whose error is smaller than the preset threshold value; this voice variable is used as an adversarial sample, thereby enriching the voice sample set of the speech recognition system.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims (9)

1. A method for generating a speech sample, comprising:

after a first voice variable is obtained, extracting a Mel frequency characteristic value of the first voice variable; wherein the feature parameters of the first voice variable are the same as the feature parameters of the target voice, and the feature parameters include: length, sampling rate, and channel;

calculating a loss function of the Mel frequency feature value of the first speech variable and the Mel frequency feature value of the target speech using a neural network;

optimizing the loss function by adjusting the values of sampling points in the first voice variable by utilizing an optimization algorithm in the neural network until the optimized value of the loss function is smaller than a preset threshold value, and taking the voice variable with the value of the loss function smaller than the preset threshold value as a target voice sample;

the obtaining a first speech variable includes:

generating a voice segment;

and formatting the voice fragment to obtain the first voice variable, so that the characteristic parameter of the first voice variable is the same as the characteristic parameter of the target voice.

2. The method of generating a speech sample according to claim 1, wherein said extracting the mel-frequency feature value of the first speech variable comprises:

carrying out Fourier transform on each frame of the first voice variable to obtain a second voice variable;

carrying out Mel filtering on the second voice variable to obtain a third voice variable;

and performing discrete cosine transform on the third voice variable to obtain a Mel scale cepstrum, and taking the Mel scale cepstrum as a Mel frequency characteristic value of the first voice variable.

3. The method of claim 2, wherein after the discrete cosine transforming the third speech variable to obtain a mel scale cepstrum, the method further comprises:

carrying out difference operation on the Mel scale cepstrum;

the taking the mel scale cepstrum as the mel frequency characteristic value of the first voice variable comprises:

and inserting the result of the differential operation into the Mel scale cepstrum to obtain the Mel frequency characteristic value of the first voice variable.

4. The method of generating speech samples according to claim 2, wherein before fourier transforming each frame of the first speech variable to obtain a second speech variable, the method further comprises:

and carrying out high-pass filtering processing on the first voice variable, dividing the first voice variable after filtering processing into continuous frames, and carrying out windowing processing on each frame.

5. The method for generating a speech sample according to claim 1, wherein, prior to said extracting the Mel-frequency feature value of the first speech variable, the method further comprises:

obtaining the target speech;

and extracting the Mel-frequency feature value of the target speech.

6. The method according to any one of claims 1 to 5, wherein, after the speech variable whose loss-function value is smaller than the preset threshold is taken as the target speech sample, the method further comprises:

training a speech recognition system with the neural network, using the target speech sample as a training sample.
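As a toy illustration of claim 6, the generated target speech samples can simply be added to the training set of a recognition model; the model, class labels, loss, and optimizer below are placeholders assumed for the sketch:

import torch

def train_recognizer(model, waves, labels, epochs=10, lr=1e-3):
    # waves: list of waveform tensors, including generated target speech samples
    # labels: list of scalar class-index tensors (hypothetical transcript classes)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for wave, label in zip(waves, labels):
            opt.zero_grad()
            logits = model(wave.unsqueeze(0))           # model maps a waveform to class logits
            loss = loss_fn(logits, label.unsqueeze(0))
            loss.backward()
            opt.step()
    return model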

7. A speech sample generation apparatus, comprising:

a first extraction module, configured to extract a Mel-frequency feature value of a first speech variable after the first speech variable is obtained; wherein the feature parameters of the first speech variable are the same as the feature parameters of the target speech, and the feature parameters include: length, sampling rate, and channel count;

a first calculation module, configured to calculate, using a neural network, a loss function between the Mel-frequency feature value of the first speech variable and the Mel-frequency feature value of the target speech;

and an optimization module, configured to optimize the loss function with an optimization algorithm of the neural network by adjusting the values of the sampling points in the first speech variable, until the optimized value of the loss function is smaller than a preset threshold, and to take the speech variable whose loss-function value is smaller than the preset threshold as the target speech sample;

wherein the first extraction module comprises:

a generation module, configured to generate a speech segment;

and a formatting module, configured to format the speech segment to obtain the first speech variable, so that the feature parameters of the first speech variable are the same as the feature parameters of the target speech.

8. The speech sample generation apparatus of claim 7, wherein the first extraction module comprises:

a first transformation module, configured to perform a Fourier transform on each frame of the first speech variable to obtain a second speech variable;

a first filtering module, configured to perform Mel filtering on the second speech variable to obtain a third speech variable;

and a second transformation module, configured to perform a discrete cosine transform on the third speech variable to obtain a Mel-scale cepstrum, and to take the Mel-scale cepstrum as the Mel-frequency feature value of the first speech variable.

9. The speech sample generation apparatus of claim 8, further comprising:

a second calculation module, configured to perform a difference operation on the Mel-scale cepstrum;

wherein the second transformation module comprises:

an insertion module, configured to insert the result of the difference operation into the Mel-scale cepstrum to obtain the Mel-frequency feature value of the first speech variable.


Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811593971.6A CN109473091B (en) 2018-12-25 2018-12-25 Voice sample generation method and device


Publications (2)

Publication Number Publication Date
CN109473091A CN109473091A (en) 2019-03-15
CN109473091B true CN109473091B (en) 2021-08-10

Family

ID=65676987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811593971.6A Active CN109473091B (en) 2018-12-25 2018-12-25 Voice sample generation method and device

Country Status (1)

Country Link
CN (1) CN109473091B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110136690B (en) * 2019-05-22 2023-07-14 平安科技(深圳)有限公司 Speech synthesis method, device and computer readable storage medium
WO2021137754A1 (en) * 2019-12-31 2021-07-08 National University Of Singapore Feedback-controlled voice conversion
CN111292766B (en) * 2020-02-07 2023-08-08 抖音视界有限公司 Method, apparatus, electronic device and medium for generating voice samples
CN111477247B (en) * 2020-04-01 2023-08-11 宁波大学 Speech adversarial sample generation method based on GAN
CN112216296B (en) * 2020-09-25 2023-09-22 脸萌有限公司 Testing method, device, and storage medium for audio adversarial perturbations
CN112201227B (en) * 2020-09-28 2024-06-28 海尔优家智能科技(北京)有限公司 Speech sample generation method and device, storage medium and electronic device
CN112466298B (en) * 2020-11-24 2023-08-11 杭州网易智企科技有限公司 Voice detection method, device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107293289A (en) * 2017-06-13 2017-10-24 南京医科大学 Speech generation method based on deep convolutional generative adversarial networks
CN108182936A (en) * 2018-03-14 2018-06-19 百度在线网络技术(北京)有限公司 Voice signal generation method and device
CN108597496A (en) * 2018-05-07 2018-09-28 广州势必可赢网络科技有限公司 Voice generation method and device based on a generative adversarial network
CN108899032A (en) * 2018-06-06 2018-11-27 平安科技(深圳)有限公司 Voiceprint recognition method, device, computer equipment and storage medium
US20180342258A1 (en) * 2017-05-24 2018-11-29 Modulate, LLC System and Method for Creating Timbres
CN109036389A (en) * 2018-08-28 2018-12-18 出门问问信息科技有限公司 Adversarial sample generation method and device



Similar Documents

Publication Publication Date Title
CN109473091B (en) 2021-08-10 Voice sample generation method and device
CN106935248B (en) 2021-02-05 Voice similarity detection method and device
JP5127754B2 (en) 2013-01-23 Signal processing device
DE112010003461B4 (en) 2019-09-05 Speech feature extraction apparatus, speech feature extraction method and speech feature extraction program
JP4818335B2 (en) 2011-11-16 Signal band expander
CN103943104A (en) 2014-07-23 Voice information recognition method and terminal equipment
CN107851444A (en) 2018-03-27 Method and system for decomposing an acoustic signal into target voices, target voice and its use
Alku et al. 2009 Closed phase covariance analysis based on constrained linear prediction for glottal inverse filtering
Edraki et al. 2020 Speech intelligibility prediction using spectro-temporal modulation analysis
CN108847253B (en) 2023-06-13 Vehicle model identification method, device, computer equipment and storage medium
JP4516157B2 (en) 2010-08-04 Speech analysis device, speech analysis / synthesis device, correction rule information generation device, speech analysis system, speech analysis method, correction rule information generation method, and program
EP1995723B1 (en) 2010-06-16 Neuroevolution training system
Singh et al. 2019 Usefulness of linear prediction residual for replay attack detection
CN103258537A (en) 2013-08-21 Method utilizing characteristic combination to identify speech emotions and device thereof
JP5443547B2 (en) 2014-03-19 Signal processing device
Zouhir et al. 2014 A bio-inspired feature extraction for robust speech recognition
Martin et al. 2009 Cepstral modulation ratio regression (CMRARE) parameters for audio signal analysis and classification
CN109741761B (en) 2020-09-25 Sound processing method and device
KR101674597B1 (en) 2016-11-22 System and method for recognizing voice
CN114302301B (en) 2023-08-04 Frequency response correction method and related product
Montalvão et al. 2012 Is masking a relevant aspect lacking in MFCC? A speaker verification perspective
Kolokolov et al. 2019 Measuring the pitch of a speech signal using the autocorrelation function
CN112233693B (en) 2023-12-01 Sound quality evaluation method, device and equipment
JP2002507776A (en) 2002-03-12 Signal processing method for analyzing transients in audio signals
Mallidi et al. 2013 Robust speaker recognition using spectro-temporal autoregressive models.

Legal Events

Date Code Title Description
2019-03-15 PB01 Publication
2019-04-09 SE01 Entry into force of request for substantive examination
2021-08-10 GR01 Patent grant