TWI855595B - Dialogue-based speech recognition system and method therefor

Info

Publication number
TWI855595B
TWI855595B
Authority
TW
Taiwan
Prior art keywords
conversational
speech
speaker
track
data
Prior art date
2023-03-16
Application number
TW112109659A
Other languages
Chinese (zh)
Other versions
TW202439299A (en)
Inventor
郭世展
鄭俊彥
陳瑞河
林其翰
林仙琪
許安廷
Original Assignee
玉山商業銀行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-03-16
Filing date
2023-03-16
Publication date
2024-09-11
2023-03-16 Application filed by 玉山商業銀行股份有限公司
2023-03-16 Priority to TW112109659A
2024-09-11 Application granted
2024-09-11 Publication of TWI855595B
2024-10-01 Publication of TW202439299A

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

A dialogue-based speech recognition system and a method therefor are provided. The system provides a server that performs the method. In the method, the server receives audio data from a database or a live incoming call. The audio data can be single-channel or multi-channel audio recorded from a multi-person conversation. Speech recognition is applied to the audio of each of multiple speakers, who can be identified and separated when necessary. Speech-to-text technology converts each speaker's audio into a conversational text. The conversational texts of the speakers, whether separated from single-channel audio or obtained directly from multi-channel audio, can be automatically annotated with punctuation marks so as to word-segment and/or paragraph-segment the conversational text.

Description

Conversational speech recognition system and method

This specification discloses a method for processing conversational speech, and in particular a system and method for performing speech recognition on conversational speech data produced by multiple speakers.

In calls in which an institution provides customer service, in order to evaluate the effectiveness of its customer service center and to retain records of disputed calls, the institution informs the customer that the call will be recorded and records it while the call is in progress.

Generally, the recorded speech takes the form of a multi-party dialogue, and speech recognition is performed as needed; in a conversational scenario, recognition results should be obtained for each of the multiple speakers. However, if the recordings come from several different recording systems and environments, or the environment is relatively complex, the difficulty of speech recognition increases.

In order to perform text recognition on speech data produced by multiple speakers in a conversational context, and in particular to generate conversational text, this disclosure proposes a conversational speech recognition system and method.

The conversational speech recognition system provides a server whose processing capability implements various speech-processing function modules and whose processing unit executes the conversational speech recognition method. In the method, speech data is obtained from a database or from a live incoming call. The speech data can be single-track or multi-track speech data, and can be an audio file created by recording a multi-person conversation.

Speech recognition is then performed on the multiple speakers in the speech data, including converting each speaker's speech into conversational text with speech-to-text technology and performing speaker separation when necessary. After examining the form of the speech data and its text: if it is single-track speech data, speaker separation is performed to identify the multiple speakers within it; otherwise, the conversational texts of the different speakers can be obtained directly. The conversational text of each speaker is thus obtained either by speaker separation of single-track speech data or directly from multi-track speech data, after which the dialogue texts are integrated.
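To make this branching concrete, the following minimal Python sketch illustrates the decision flow described above; the helper functions (separate_speakers, split_channels, transcribe) are hypothetical stand-ins, not the actual modules disclosed in this specification.

```python
# Minimal, runnable sketch of the track-count branching described above.
# The helper functions are hypothetical stand-ins for the real
# speaker-separation and speech-to-text stages.

from typing import Dict

def separate_speakers(audio_path: str) -> Dict[str, str]:
    """Hypothetical stand-in: split a single-track file into per-speaker audio."""
    return {"speaker_1": audio_path + "#spk1", "speaker_2": audio_path + "#spk2"}

def split_channels(audio_path: str) -> Dict[str, str]:
    """Hypothetical stand-in: map each recorded channel to one speaker."""
    return {"speaker_1": audio_path + "#ch0", "speaker_2": audio_path + "#ch1"}

def transcribe(track_ref: str) -> str:
    """Hypothetical stand-in for the speech-to-text model."""
    return f"<transcript of {track_ref}>"

def recognize_dialogue(audio_path: str, num_tracks: int) -> Dict[str, str]:
    # Single-track audio mixes all speakers on one channel, so a
    # speaker-separation model must split them first; multi-track audio
    # already has one speaker per channel.
    tracks = separate_speakers(audio_path) if num_tracks == 1 else split_channels(audio_path)
    return {speaker: transcribe(t) for speaker, t in tracks.items()}

print(recognize_dialogue("call.wav", num_tracks=1))
```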

In one implementation, the server obtains the speech data through an application programming interface and performs speech recognition on it through a speech recognition unit run by the processing unit. The process includes converting the audio file format of the voice file, determining the number of audio tracks, and then performing the speech recognition step.

Furthermore, the server can also execute a traffic-handling procedure that uses a queue data structure to order and distribute each incoming voice line, so that the lines enter the multiple parallel computing units of the server's processing unit in sequence.

Furthermore, if examining the number of audio tracks shows that the speech data is single-track, a speaker track separation model is applied to identify the multiple speakers and obtain the individual speech data of each speaker. Once each speaker's conversational text is obtained, a punctuation model of a dialogue integration component can automatically annotate punctuation marks, and the conversational text can also be word-segmented and/or paragraph-segmented.

Furthermore, once the multiple speakers in the speech data have been identified, a computer program can assign a different identifier to each speaker, and each speaker's identifier is linked to the speech-to-text output to obtain the conversational text corresponding to that speaker.

For a further understanding of the features and technical content of the present invention, please refer to the following detailed description and drawings of the present invention. However, the drawings are provided for reference and illustration only and are not intended to limit the present invention.

The following specific embodiments illustrate the implementation of the present invention, and those skilled in the art can understand its advantages and effects from the content disclosed in this specification. The present invention can be implemented or applied through other, different specific embodiments, and the details in this specification can be modified and changed in various ways based on different viewpoints and applications without departing from the concept of the present invention. In addition, the drawings of the present invention are merely schematic illustrations and are not drawn to actual scale, which is stated in advance. The following embodiments describe the related technical content of the present invention in further detail, but the disclosed content is not intended to limit its scope of protection.

It should be understood that although the terms "first", "second", "third", etc. may be used herein to describe various elements or signals, these elements or signals should not be limited by those terms. The terms are mainly used to distinguish one element from another, or one signal from another. In addition, the term "or" as used herein may, depending on the actual situation, include any one of the associated listed items or any combination of more than one of them.

The disclosure describes a conversational speech recognition system and method. The proposed conversational speech recognition system supports multiple input audio file formats and provides adaptive handling for single-track and dual-track audio files. One of its main purposes is to produce, for a single conversation, an individual conversational transcript for each of the multiple speakers.

For a system embodiment, refer to FIG. 1, a schematic diagram of an architecture embodiment of the conversational speech recognition system. It shows the proposed server 110, which can be implemented as a computer system; the data-processing capability of the computer system, such as its processing unit 111 and memory 112, implements the various function modules that process speech data, and the server can serve end users through a network 10.

For example, the conversational speech recognition system can operate in a customer service center. The center records the speech of every call in which a customer, using a user-end device 101 or 103, connects through the network 10 (which can be the Internet or a public switched telephone network (PSTN)) via the server-side voice switch 105 and talks with a customer service agent; after processing by software implemented with circuit elements such as the processing unit 111 and memory 112 of the computer system, the speech is stored as a voice file in the database 130. It is worth mentioning that the conversational speech recognition system proposed in this disclosure can be applied to all kinds of inbound customer service calls, or to various in-person services conducted as dialogues, including information inquiries, service applications and other service needs; it can also be applied to outbound confirmation calls made by an institution, such as purchase confirmations or telemarketing.

According to an embodiment of the conversational speech recognition system, in the server 110, speech data is obtained through an application programming interface (API), and speech recognition is performed on it by a speech recognition unit 113 run by the processing unit 111, with the purpose of converting the speech into text. The unit also provides a function for converting the audio file format of the voice file, so that speech data in various audio formats can be handled, determines the number of audio tracks in the speech data (single-track, dual-track, or more generally multi-track), and performs the speech recognition step.

According to the embodiment, the speech recognition unit 113, implemented in software, performs audio file format conversion, track-count determination and speech recognition, and uses a specific speech recognition model to perform sampling, dialogue recognition and transcription. It ultimately derives the audio information of the voice file, including the audio sampling rate, the audio format (mp3, wav, vox, etc.) and the number of audio tracks (channels).
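This probing step can be illustrated with the short sketch below, which uses the open-source soundfile library as one possible tool for reading the sampling rate, format and channel count; the specification does not name a particular library for this step.

```python
# Sketch of the audio-information probing step, using the open-source
# `soundfile` library as one possible tool (pip install soundfile).
# The specification does not say which library its system uses.

import soundfile as sf

def probe_audio(path: str) -> dict:
    info = sf.info(path)  # reads the file header only
    return {
        "sampling_rate": info.samplerate,  # e.g. 8000 or 16000 Hz
        "format": info.format,             # e.g. "WAV", "FLAC"
        "channels": info.channels,         # 1 = single-track, >1 = multi-track
    }

meta = probe_audio("call.wav")
# Route single-track audio to speaker separation, and multi-track audio
# directly to transcription, as described above.
needs_separation = meta["channels"] == 1
```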

Furthermore, after determining the number of audio tracks of the speech data, the speech data can be forwarded to the text integration unit 117 or the speaker separation unit 115 according to the track count. If the speech data was recorded on a single track, the track-count check identifies it as single-track speech data; speaker separation is then performed to identify the multiple speakers, after which speech recognition is performed on them. According to an embodiment, a speaker track separation model such as SpeechBrain may be used. Such a model is an artificial intelligence model for processing speech data, obtained by deep learning on the speech features of multiple people, and comprises software functions such as speech recognition, speaker recognition, speech enhancement, speech separation, language identification and multi-microphone signal processing.
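As a sketch of how such a separation model might be invoked, the example below uses SpeechBrain's pretrained SepFormer; the specific checkpoint and the two-speaker assumption are illustrative choices, since the specification names SpeechBrain only as one example of a speaker track separation model.

```python
# Illustrative sketch: separating a single-track recording into
# per-speaker audio with SpeechBrain's pretrained SepFormer model
# (pip install speechbrain torchaudio). The checkpoint and the
# two-speaker assumption are illustrative, not from the specification.

import torchaudio
from speechbrain.pretrained import SepformerSeparation

model = SepformerSeparation.from_hparams(
    source="speechbrain/sepformer-wsj02mix",  # pretrained 2-speaker model, 8 kHz
    savedir="pretrained_models/sepformer",
)

# Returns a tensor shaped (batch, time, n_speakers).
est_sources = model.separate_file(path="single_track_call.wav")

# Save each estimated speaker track for the downstream speech-to-text step.
for i in range(est_sources.shape[2]):
    torchaudio.save(f"speaker_{i + 1}.wav",
                    est_sources[:, :, i].detach().cpu(), 8000)
```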

According to an embodiment of the conversational speech recognition method, when the speech data is determined to be single-track, the speaker track separation model can perform speech recognition and speaker identification based on the voiceprint features of the multiple speakers in the speech data, yielding individual speech data for each speaker. A computer program can then assign a different identifier to each speaker, and each identifier is linked to the speech-to-text output to obtain the conversational text corresponding to that speaker. In this way, whether from the speaker separation result of single-track speech data, or because the speech data was originally recorded on multiple tracks and is therefore already separated into multiple speakers, the individual conversational text of each speaker can then be recognized.
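The identifier-linking step might look like the small sketch below; the identifier scheme (spk_1, spk_2, ...) and the utterance structure are illustrative assumptions rather than details of the disclosed system.

```python
# Sketch of linking speaker identifiers to transcribed utterances.
# The identifier scheme ("spk_1", ...) and utterance structure are
# illustrative assumptions, not details from the specification.

from dataclasses import dataclass

@dataclass
class Utterance:
    speaker_id: str   # identifier assigned after speaker separation
    start: float      # seconds from the start of the recording
    text: str         # speech-to-text output for this segment

def label_speakers(segments: list[tuple[int, float, str]]) -> list[Utterance]:
    # Each segment is (separated-track index, start time, transcribed text);
    # the track index is mapped to a stable speaker identifier.
    return [Utterance(f"spk_{idx + 1}", start, text)
            for idx, start, text in segments]

dialogue = label_speakers([(0, 0.0, "Hello, how may I help you?"),
                           (1, 2.3, "I'd like to check my account balance.")])
for u in sorted(dialogue, key=lambda x: x.start):
    print(f"{u.speaker_id}: {u.text}")
```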

According to the embodiment, the server 110 integrates each speaker's text through the text integration unit 117. When each speaker's conversational text has been obtained, the unit refers to the speech recognition results and speaker information and uses the punctuation model of the dialogue integration component to annotate punctuation marks automatically; adding punctuation to the verbatim transcript allows the conversational text to be word-segmented and/or paragraph-segmented, improving readability. The punctuation model of the dialogue integration component can, for example, be based on bidirectional encoder representations from transformers (BERT), a pre-trained model proposed by Google™. The conversational speech recognition method takes this model, pre-trained on a large amount of data, sets the task and model specifications, and then optimizes and trains it by adjusting the punctuation marks annotated on text, turning it into a model that can annotate punctuation automatically.
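One common way to realize such a punctuation model is to treat punctuation restoration as token classification on top of a pre-trained BERT encoder. The sketch below, built on the Hugging Face transformers library with a generic Chinese BERT checkpoint, shows the shape of that setup; the checkpoint, label set and training details are assumptions, not the model actually trained here.

```python
# Illustrative setup: punctuation restoration as token classification on
# a pre-trained Chinese BERT (pip install transformers torch). The
# checkpoint, label set and fine-tuning loop are assumptions; the
# specification only states that a BERT-based model is trained to
# annotate punctuation.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "，", "。", "？"]  # punctuation to insert after each token

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS)
)
# ... fine-tune `model` here on transcripts whose punctuation has been
# stripped, using the removed marks as per-token labels ...

def punctuate(text: str) -> str:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        pred = model(**enc).logits.argmax(-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    out = []
    for tok, label_id in zip(tokens, pred):
        if tok in ("[CLS]", "[SEP]"):
            continue
        out.append(tok)
        if LABELS[int(label_id)] != "O":
            out.append(LABELS[int(label_id)])
    return "".join(out)

# With an untrained classification head the output is only illustrative.
print(punctuate("您好請問需要什麼服務"))
```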

FIG. 2 shows a flowchart of an embodiment of the conversational speech recognition method implemented by software in the server described above.

After receiving speech data stored in the database, or receiving speech data from a live incoming call (step S201), automatic speech recognition is performed on the multiple speakers in the speech data, including audio file format conversion, track-count determination and speech recognition, converting the speech into text (step S203). Then, in step S205, it is determined whether the data is single-track. If not (No), the speech data is multi-track (e.g., dual-track) data in which the speakers are already separated, and the conversational texts of the multiple speakers can be obtained directly. If it is single-track speech data (Yes), the multiple speakers within it are identified, i.e., speaker separation is performed (step S207), so that each speaker's conversational text can be derived from the individual speech data of the multiple speakers.

After each speaker's conversational text has been obtained, punctuation marks can be annotated automatically (step S209), and the conversational text can also be word-segmented and/or paragraph-segmented. According to the embodiment, punctuation is added to the verbatim transcript through the punctuation model of the dialogue integration component described above; once this is done, the text files of the multiple speakers in the same speech context can be integrated and archived in the system's database (step S211).

FIG. 3 is a schematic diagram of an embodiment of the operation flow of the conversational speech recognition system.

In the operation flow shown in the figure, a user first issues a speech-processing request and submits a voice file 301, which, depending on how it was recorded, contains single-track or multi-track speech data.

During voice file processing, if requests to process multiple voice files are received at the same time, the server also executes a traffic-handling procedure, regulating traffic through the traffic processing unit 303. In one implementation, the traffic processing unit 303 can adopt a Kafka system, characterized by high throughput and low latency; alternatives such as Redis or RabbitMQ can also be chosen. When multiple voice files are waiting in the database to be processed, a queue data structure is used to order and distribute the processing flow of each voice file (illustrated by the multiple connecting arrows in the figure), so that the files enter the multiple parallel computing units of the server's processing unit in sequence, improving computing performance and practicality.
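As an illustration of this queuing idea, the sketch below uses the kafka-python client to publish one job per voice file and lets parallel workers consume them in order; the topic name, broker address and message format are illustrative assumptions, since the specification names Kafka, Redis and RabbitMQ only as possible choices.

```python
# Sketch of queue-based traffic handling with the kafka-python client
# (pip install kafka-python). The topic name, broker address and message
# format are illustrative assumptions.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: enqueue one job per submitted voice file.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("voice-jobs", {"file": "call_0001.wav"})
producer.flush()

# Worker side: each parallel computing unit runs one consumer in the
# same consumer group, so Kafka distributes the queued jobs among them.
consumer = KafkaConsumer(
    "voice-jobs",
    bootstrap_servers="localhost:9092",
    group_id="asr-workers",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    job = message.value
    print("processing", job["file"])  # hand off to the recognition pipeline
```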

Next, the speech recognition unit 305 performs speech recognition on the speech data produced by each voice conversation, including converting the audio file format of the voice file and determining the number of audio tracks; this establishes whether the speech data is single-track or multi-track. The related data file 32 can be processed immediately or first stored in the database 313. If the track-count check yields multi-track speech data 31, the speech of different speakers was recorded on different tracks, and the text integration unit 309 can directly integrate the dialogue content of the different speakers in the same conversational context, converting it into conversational text output that can be stored in the database 313. If the speech data is determined to be single-track, speaker separation is then performed by the speech separation unit 307, which can adopt the speaker track separation model of the embodiments above to identify the multiple speakers and obtain individual speech data for each; the text integration unit 309 then integrates the dialogue content of the different speakers in the same conversational context to produce a conversational verbatim transcript integrating the multiple speakers, which is output in text form to the database 313.

It is worth mentioning that when the system receives a voice file, the database 313 stores not only each voice file but also a related log, from which the latest status of the voice file at each processing stage can be obtained; the processing progress of a voice file can thus be queried through its log. For example, if the log shows that processing stopped after speaker separation and the conversational text was never integrated, then the content stored in the database 313 is the file as completed after transcription and speaker separation.

According to one embodiment, the conversational speech recognition system processes the received voice files and stores in the database 313 the latest status of each file after every processing stage. The system can periodically check (for example with a scheduled scanning program such as an ETL job) whether the speech data in the database 313 has completed speech recognition, transcription and integration; if some speech data has not yet completed conversational speech recognition, the resend unit 311 finds the unprocessed speech data by scanning and re-enters it into the speech-processing procedure. It is worth mentioning that the resend unit 311 improves the recognition stability of the system.
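The resend behavior can be pictured as a periodic scan over a job-status store; in the runnable sketch below, the status table, status values and scan logic are illustrative assumptions, not details from the specification.

```python
# Sketch of the resend unit as a scan over a job-status table (here
# SQLite via the standard library). The table layout and status values
# are illustrative assumptions.

import sqlite3

def rescan_and_resend(conn: sqlite3.Connection, enqueue) -> int:
    # Find every voice file whose log has not reached the final
    # "integrated" stage and push it back into the processing queue.
    rows = conn.execute(
        "SELECT file_id FROM voice_jobs WHERE status != 'integrated'"
    ).fetchall()
    for (file_id,) in rows:
        enqueue(file_id)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE voice_jobs (file_id TEXT, status TEXT)")
conn.executemany("INSERT INTO voice_jobs VALUES (?, ?)",
                 [("call_0001.wav", "separated"),
                  ("call_0002.wav", "integrated")])

# A scheduler (e.g. an ETL-style scan) would call this periodically.
n = rescan_and_resend(conn, enqueue=lambda f: print("resend", f))
print(f"resent {n} unfinished job(s)")  # -> resent 1 unfinished job(s)
```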

When processing of a voice file is complete, the conversational text corresponding to each of the multiple speakers is obtained, possibly as a file that has undergone text integration; besides being stored in the database 313, it can also be provided to other systems through an application programming interface (API).

Notably, a voice file input to the conversational speech recognition system can have its task traffic allocated by the traffic processing unit 303 and then undergo automatic speech recognition by the speech recognition unit 305, including converting the voice file format, determining the number of audio tracks, and converting speech into text; single-track and dual-track (or multi-track) speech data can then be handled separately by subsequent circuits or software, and the text integration unit can build conversational text for the different speakers, forming a file that integrates the conversational texts of the multiple speakers.

In summary, according to the conversational speech recognition system and method described in the above embodiments, the proposed system realizes a platform that adapts to audio input with different numbers of tracks, so that inputs from different recording environments can share the same conversational speech recognition platform. The system is compatible with various audio formats; for each voice conversation, the dialogue speech can be merged into a single track for input, after which the system can convert among multiple audio formats and split out the speakers according to the number of tracks. To improve readability, natural-language semantic analysis and recognition are used to form a text file, and punctuation marks can be annotated automatically in the conversational transcript, so that the resulting conversational verbatim transcript matches ordinary reading habits. Thus, according to the method embodiments, because a conversational transcript can be formed from a multi-party conversation, the method is applicable to all kinds of natural-language analysis applications, such as services provided through dialogue.

The contents disclosed above are merely preferred feasible embodiments of the present invention and do not thereby limit the scope of its claims; all equivalent technical changes made using the contents of the specification and drawings of the present invention are therefore included within the scope of the claims of the present invention.

10: Network

101, 103: User-end devices

105: Voice switch

110: Server

111: Processing unit

112: Memory

113: Speech recognition unit

115: Speaker separation unit

117: Text integration unit

130: Database

301: Voice file

303: Traffic processing unit

305: Speech recognition unit

307: Speech separation unit

309: Text integration unit

311: Resend unit

313: Database

31: Multi-track speech data

32: Data file

Steps S201-S211: Conversational speech recognition process

FIG. 1 is a schematic diagram of an architecture embodiment of the conversational speech recognition system;

FIG. 2 is a flowchart of an embodiment of the conversational speech recognition method; and

FIG. 3 is a schematic diagram of an embodiment of the operation flow of the conversational speech recognition system.


Claims (8)

1. A conversational speech recognition method, comprising: receiving speech data; performing speech recognition on multiple speakers in the speech data, including converting the audio file format and determining the number of audio tracks, and converting the speech into text to obtain a conversational text corresponding to each speaker, wherein, if the speech data is determined according to the number of audio tracks to be single-track speech data, a speaker track separation model is used to perform speaker separation on the single-track speech data and identify the multiple speakers therein, so as to obtain individual speech data of the different speakers; and recognizing the individual conversational texts of the multiple speakers according to the speaker separation result of the single-track speech data, or recognizing the conversational text corresponding to each speaker from the already-separated speech data of the multiple speakers in multi-track speech data in which the speech of different speakers is recorded on different tracks.

2. The conversational speech recognition method as claimed in claim 1, wherein the speech data is the single-track speech data or the multi-track speech data and is a voice file created by recording a multi-person conversation.

3. The conversational speech recognition method as claimed in claim 1, wherein a punctuation model of a dialogue integration component automatically annotates punctuation marks to obtain each speaker's conversational text, and the conversational text is word-segmented and/or paragraph-segmented.

4. The conversational speech recognition method as claimed in any one of claims 1 to 3, wherein, after the multiple speakers in the speech data are identified, a computer program assigns a different identifier to each speaker, and each speaker's identifier is linked to the speech-to-text output to obtain the conversational text corresponding to that speaker.

5. A conversational speech recognition system, comprising: a server that executes a conversational speech recognition method through a processing unit thereof, the method comprising: obtaining speech data of a multi-person conversation from a database or a live incoming call, wherein the speech data is single-track speech data or multi-track speech data; performing speech recognition on multiple speakers in the speech data, including converting the audio file format and determining the number of audio tracks, and converting the speech into text to obtain a conversational text corresponding to each speaker, wherein, if the speech data is determined according to the number of audio tracks to be the single-track speech data, a speaker track separation model is used to perform speaker separation on the single-track speech data and identify the multiple speakers therein, so as to obtain individual speech data of the different speakers; and recognizing the individual conversational texts of the multiple speakers according to the speaker separation result of the single-track speech data, or recognizing the conversational text corresponding to each speaker from the already-separated speech data of the multiple speakers in the multi-track speech data in which the speech of different speakers is recorded on different tracks.

6. The conversational speech recognition system as claimed in claim 5, wherein the server obtains the speech data through an application programming interface and performs speech recognition on the speech data through a speech recognition unit run by the processing unit.

7. The conversational speech recognition system as claimed in claim 5, wherein the server further executes a traffic-handling procedure that uses a queue data structure to order and distribute each incoming voice line, so that the lines sequentially enter multiple parallel computing units of the processing unit.

8. The conversational speech recognition system as claimed in any one of claims 5 to 7, wherein, when each speaker's conversational text is obtained, a punctuation model of a dialogue integration component automatically annotates punctuation marks to obtain each speaker's conversational text, and the conversational text is word-segmented and/or paragraph-segmented.

TW112109659A 2023-03-16 2023-03-16 Dialogue-based speech recognition system and method therefor TWI855595B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW112109659A TWI855595B (en) 2023-03-16 2023-03-16 Dialogue-based speech recognition system and method therefor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW112109659A TWI855595B (en) 2023-03-16 2023-03-16 Dialogue-based speech recognition system and method therefor

Publications (2)

Publication Number Publication Date
TWI855595B 2024-09-11
TW202439299A TW202439299A (en) 2024-10-01

Family ID: 93649039

Family Applications (1)

Application Number Title Priority Date Filing Date
TW112109659A TWI855595B (en) 2023-03-16 2023-03-16 Dialogue-based speech recognition system and method therefor

Country Status (1)

Country Link
TW (1) TWI855595B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200907929A (en) * 2007-08-10 2009-02-16 Sonicjam Inc Interactive music training and entertainment system
CN103222283A (en) * 2010-11-19 2013-07-24 Jacoti有限公司 Personal communication device with hearing support and method for providing the same
US20160343373A1 (en) * 2012-09-07 2016-11-24 Verint Systems Ltd. Speaker separation in diarization
US20190018570A1 (en) * 2017-07-12 2019-01-17 Facebook, Inc. Interfaces for a messaging inbox

Also Published As

Publication number Publication date
TW202439299A (en) 2024-10-01
