patents.google.com

JP6499537B2 - Connection expression structure analysis apparatus, method, and program - Google Patents

️Wed Apr 10 2019

Embodiments of the present invention are described in detail below with reference to the drawings.

<Configuration of the connection expression term structure analysis apparatus according to the first embodiment of the present invention>

First, the configuration of the connection expression term structure analysis apparatus according to the first embodiment of the present invention is described. The apparatus according to the first embodiment extracts, from a document, the connection expressions, terms, and semantic labels associated with explicit connection expressions.

As shown in FIG. 1, the connection expression term structure analysis apparatus 100 according to the first embodiment of the present invention can be configured as a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the connection expression term structure analysis processing routine described later. Functionally, as shown in FIG. 1, the apparatus 100 comprises an input unit 10, a computing unit 20, and an output unit 50.

The input unit 10 receives the document to be analyzed.

The computing unit 20 comprises a sentence division unit 30, a discourse structure analysis unit 32, a syntax analysis unit 34, a connection expression extraction unit 36, a term positional relationship determination unit 38, an intra-sentence term extraction unit 40, an inter-sentence term extraction unit 42, and a semantic class classification unit 44.

The sentence division unit 30 acquires the document received by the input unit 10 and marks the sentence boundaries in it. Sentence boundaries are identified with an existing sentence splitter; alternatively, the sentence-final period alone may be used as the cue. A document that has already been divided into sentences may instead be received by the input unit 10, in which case the processing of the sentence division unit 30 may be omitted.
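The simpler option, using only the period as the cue, might look like the following minimal sketch. The function name `split_sentences` and the exact boundary rule (a boundary directly after "。", or after "." followed by whitespace) are assumptions made for illustration; the source does not specify them.

```python
import re

# Assumed boundary rule: right after the Japanese period, or after an
# ASCII period that is followed by whitespace.
_BOUNDARY = re.compile(r"(?<=。)|(?<=\.)\s+")


def split_sentences(text):
    """Split text into sentences using only the sentence-final period as the cue."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


print(split_sentences("I like him. He is honest."))
```

A rule this naive will also split after abbreviations such as "e.g. ", which is one reason the text names an existing sentence splitter as the primary option.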

Based on the document with sentence boundaries marked by the sentence division unit 30, the discourse structure analysis unit 32 generates a discourse structure tree in which each sentence of the document is represented by a node, based on the rhetorical structure of the sentences. The discourse structure tree expresses parent-child relationships between sentence nodes. It can be obtained by generating an RST tree with a rhetorical structure analyzer such as that of Non-Patent Document 2 and then applying the rules described in Non-Patent Document 3 to determine the parent-child relationships between sentence nodes. Generating an RST tree is not strictly necessary: the parent-child relationships can also be obtained with an analyzer trained on rhetorical structure tree data that expresses parent-child relationships between sentence nodes.

[Non-Patent Document 2]: duVerle, D. and Prendinger, H. 'A Novel Discourse Parser Based on Support Vector Machine Classification'. Proc. of the 47th ACL, pp. 665-675 (2009).

[Non-Patent Document 3]: Tsutomu Hirao, Yasuhisa Yoshida, Masaaki Nishino, Norihito Yasuda and Masaaki Nagata. 'Single-Document Summarization as a Tree Knapsack Problem'. Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1515-1520 (2013).

The syntax analysis unit 34 parses each sentence of the document with sentence boundaries marked by the sentence division unit 30 and generates a syntax tree for it. Since a variety of parsing software is available, existing software may be used to generate the syntax tree of each sentence.

The connection expression extraction unit 36 extracts connection expressions that take terms, based on the syntax trees generated for the sentences by the syntax analysis unit 34. Specifically, it first looks up the words appearing in the document against the dictionary entry expressions of a manually prepared connection expression candidate dictionary (not shown) and extracts the words that match a dictionary entry expression. It then determines whether each matched word is a connection expression that takes terms, using a binary classifier such as an SVM or logistic regression trained on learning data annotated with whether each dictionary entry expression takes terms, and extracts the connection expressions that do. Features such as (1) to (5) below may be used for this determination.

(1) The dictionary entry expression and its part of speech
(2) The five words before and after the dictionary entry expression and their parts of speech
(3) The depth of the dictionary entry expression in the syntax tree
(4) The parent, left sibling, and right sibling of the dictionary entry expression in the syntax tree
(5) The path from the dictionary entry expression to the root of the syntax tree
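As a rough illustration of features (1) to (5), the sketch below computes them for a single-word candidate in a toy constituency tree. The `Node` class, the helper `W`, the function `connective_features`, and the feature names are all hypothetical. It also assumes that the "node" of the dictionary entry expression is its part-of-speech (preterminal) node, which the source does not state.

```python
class Node:
    """Minimal constituency-tree node; leaves carry words, inner nodes carry labels."""

    def __init__(self, label, *children):
        self.label, self.children, self.parent = label, list(children), None
        for child in self.children:
            child.parent = self

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def W(tag, word):
    """Build a preterminal node (part of speech) over a single word."""
    return Node(tag, Node(word))


def tagged(leaves):
    return [(leaf.label, leaf.parent.label) for leaf in leaves]


def connective_features(root, word, window=5):
    """Features (1)-(5) for a single-word candidate given as a leaf node."""
    leaves = root.leaves()
    i = leaves.index(word)
    pos = word.parent  # assumed: the preterminal stands for the expression
    siblings = pos.parent.children if pos.parent else [pos]
    j = siblings.index(pos)
    path, node = [], pos
    while node is not None:  # walk from the preterminal up to the root
        path.append(node.label)
        node = node.parent
    return {
        "entry": (word.label, pos.label),                    # (1)
        "before": tagged(leaves[max(0, i - window):i]),      # (2)
        "after": tagged(leaves[i + 1:i + 1 + window]),       # (2)
        "depth": len(path) - 1,                              # (3)
        "parent": pos.parent.label if pos.parent else None,  # (4)
        "left_sibling": siblings[j - 1].label if j > 0 else None,
        "right_sibling": siblings[j + 1].label if j + 1 < len(siblings) else None,
        "path": ">".join(path),                              # (5)
    }


because = Node("because")
tree = Node(
    "S",
    Node("NP", W("PRP", "I")),
    Node(
        "VP",
        W("VBP", "like"),
        Node("NP", W("PRP", "him")),
        Node(
            "SBAR",
            Node("IN", because),
            Node(
                "S",
                Node("NP", W("PRP", "he")),
                Node("VP", W("VBZ", "is"), Node("ADJP", W("JJ", "honest"))),
            ),
        ),
    ),
)
features = connective_features(tree, because)
print(features["path"], features["depth"])
```

In practice the resulting dictionary would be vectorized and passed to the binary classifier (SVM or logistic regression) named in the text; that step is omitted here.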

For each connection expression extracted by the connection expression extraction unit 36, the term positional relationship determination unit 38 determines whether the two terms linked by the connection expression both appear within the sentence containing it. Specifically, as in the connection expression extraction unit 36, it makes this determination with a binary classifier such as an SVM or logistic regression trained in advance on learning data. In addition to features (1) to (5) used by the connection expression extraction unit 36, the position at which the connection expression appears (first part, middle, or last part of the sentence, and so on) is also used as a feature.

When the term positional relationship determination unit 38 determines that the two terms do not both appear within the sentence containing the connection expression, it designates that sentence as the sentence from which term 2 is to be extracted and, in the discourse structure tree generated by the discourse structure analysis unit 32, designates the sentence corresponding to the parent node or a sibling node of that sentence as the sentence from which term 1 is to be extracted.

When the term positional relationship determination unit 38 determines that the two terms linked by the connection expression appear within the sentence containing it, the intra-sentence term extraction unit 40 extracts term 1 and term 2 from that sentence and outputs them to the output unit 50.

Specifically, the intra-sentence term extraction unit 40 receives, from among the syntax trees generated by the syntax analysis unit 34, the syntax tree of the sentence containing the connection expression, and extracts term 1 and term 2 by applying the rules below according to whether the connection expression is a subordinating conjunction or a coordinating conjunction. The mapping from each connection expression to subordinating or coordinating is supplied manually in advance.

First, the method for extracting term 1 and term 2 when the connection expression is a subordinating conjunction is described.

When the connection expression is a subordinating conjunction, term 2 is extracted by steps (1) and (2) below.

(1) Assign the node representing the last word of the target connection expression to a node variable x, which holds a node of the syntax tree.
(2) Assign the parent node of x to x. Repeat this operation until the node assigned to x has the label SBAR or S; the text span dominated by x at that point is term 2.

FIG. 2 shows an example of the extraction. In the example of FIG. 2, "because" is first assigned to x. Since "because" is neither S nor SBAR, its parent node IN is assigned to x. Since IN is neither S nor SBAR, its parent node SBAR is assigned to x. Now that x is SBAR, the procedure ends, and the span "because he is honest" dominated by that SBAR is term 2.

Next, when the connection expression is a subordinating conjunction, term 1 is extracted by steps (1) and (2) below. Here x carries over the value it had when the procedure for term 2 finished.

(1) Assign the parent node of x to x.
(2) Repeat (1) until the node assigned to x has the label SBAR or S, take the text span dominated by x at that point, and remove the span of term 2 from it; the remainder is term 1.

In the example of FIG. 2, when term 2 has been determined, x holds the SBAR that dominates "because he is honest", so its parent node VP is assigned to x. Since VP is neither S nor SBAR, its parent node S is then assigned to x. Now that x is S, the procedure ends: the span "I like him because he is honest" dominated by S is taken, and removing the term 2 span "because he is honest" from it leaves "I like him", which is term 1.
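The two procedures for a subordinating conjunction can be sketched as follows. This is an illustration under assumptions: the `Node` class, the helpers `W` and `text`, and the function names are hypothetical, and the toy tree is reconstructed from the walk described for FIG. 2 (because, IN, SBAR, VP, S), since the figure itself is not reproduced in the text.

```python
class Node:
    """Minimal constituency-tree node; leaves carry words, inner nodes carry labels."""

    def __init__(self, label, *children):
        self.label, self.children, self.parent = label, list(children), None
        for child in self.children:
            child.parent = self

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def W(tag, word):
    """Build a preterminal node (part of speech) over a single word."""
    return Node(tag, Node(word))


def text(leaves):
    return " ".join(leaf.label for leaf in leaves)


def climb_to_clause(node):
    """Assign the parent to the variable, repeating until it is labelled S or SBAR."""
    node = node.parent
    while node is not None and node.label not in ("S", "SBAR"):
        node = node.parent
    if node is None:
        raise ValueError("no enclosing S or SBAR node")
    return node


def extract_subordinate(last_word):
    """Return (term 1, term 2) for a subordinating connection expression."""
    x = climb_to_clause(last_word)  # term 2: first S/SBAR above the last word
    term2 = x.leaves()
    x = climb_to_clause(x)          # term 1: next S/SBAR above, minus term 2
    term1 = [leaf for leaf in x.leaves() if leaf not in term2]
    return text(term1), text(term2)


because = Node("because")
tree = Node(
    "S",
    Node("NP", W("PRP", "I")),
    Node(
        "VP",
        W("VBP", "like"),
        Node("NP", W("PRP", "him")),
        Node(
            "SBAR",
            Node("IN", because),
            Node(
                "S",
                Node("NP", W("PRP", "he")),
                Node("VP", W("VBZ", "is"), Node("ADJP", W("JJ", "honest"))),
            ),
        ),
    ),
)
print(extract_subordinate(because))
```

Membership tests such as `leaf not in term2` compare node objects by identity, so repeated words in a sentence are not confused with one another.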

Next, the method for extracting term 1 and term 2 when the connection expression is a coordinating conjunction is described.

When the connection expression is a coordinating conjunction, term 2 is extracted by steps (1) to (3) below.

(1) Assign the node representing the last word of the target connection expression to a node variable x, and assign the parent node of x to a node variable y.
(2) Assign to x and y their respective parent nodes.
(3) Repeat (2) until the leftmost words of span(x) and span(y), the spans dominated by x and y, no longer match. At that point, the part of the span dominated by y that runs from the word immediately after the connection expression to the last word of the span is term 2.

FIG. 3 shows a first example of a syntax tree to be processed. In the example of FIG. 3, "and" is first assigned to x and CC to y. Since the leftmost words of span(x) and span(y) are both "and", CC is assigned to x and S to y. The leftmost word of span(x) is now "and" and that of span(y) is "He"; they do not match, so the procedure ends. Term 2 is then the part of span(y), "He became a student and he received a grant", that follows "and", namely "he received a grant".

FIG. 4 shows a second example of a syntax tree to be processed. In the example of FIG. 4, "but" is first assigned to x and CC to y. Since the leftmost words of span(x) and span(y) are both "but", CC is assigned to x and VP to y. The leftmost words of span(x) and span(y) are now "but" and "were" respectively; they do not match, so the procedure ends. Term 2 is the part of the span dominated by y that follows "but", namely "were not adjusted for inflation".

Next, when the connection expression is a coordinating conjunction, term 1 is extracted by steps (1) and (2) below. Here x and y carry over the values they had when the procedure for term 2 finished.

(1) If an S or SBAR node exists among the child nodes of y to the left of x (the rightmost one is chosen if there are several), the span dominated by that node is term 1.
(2) If (1) does not apply, assign the parent of y to y and climb the syntax tree until y has the label SBAR or S. The span dominated by y at that point, with the connection expression and term 2 removed, is term 1.

In the example of extraction from the tree of FIG. 3, x is CC and y is S when term 2 has been determined. Since there is an S among the child nodes of y to the left of x, the span "He became a student" dominated by that S is term 1.

In the example of extraction from the tree of FIG. 4, x is CC and y is VP when term 2 has been determined. Since neither S nor SBAR exists among the child nodes of y to the left of x, the parent of y is assigned to y. y then becomes S, so the procedure ends. Removing "but were not adjusted for inflation" from the span dominated by y, "The figures were adjusted for deflation, but were not adjusted for inflation", leaves "The figures were adjusted for deflation", which is term 1.
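The coordinating-conjunction procedures can be sketched in the same style. The names (`Node`, `W`, `text`, `find`, `extract_coordinate`) are hypothetical, and both toy trees are reconstructed from the walks described for FIG. 3 and FIG. 4 rather than copied from the figures. Two simplifications are assumed: the comma of the FIG. 4 sentence is left out of the tree, and "leftmost words match" is tested on the token itself (object identity) rather than on its spelling.

```python
class Node:
    """Minimal constituency-tree node; leaves carry words, inner nodes carry labels."""

    def __init__(self, label, *children):
        self.label, self.children, self.parent = label, list(children), None
        for child in self.children:
            child.parent = self

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def W(tag, word):
    return Node(tag, Node(word))


def text(leaves):
    return " ".join(leaf.label for leaf in leaves)


def find(root, word):
    return next(leaf for leaf in root.leaves() if leaf.label == word)


def extract_coordinate(connective):
    """Return (term 1, term 2); `connective` is the list of its leaf nodes."""
    last = connective[-1]
    x, y = last, last.parent
    # Term 2: climb in lockstep until the leftmost tokens of the spans differ.
    while y is not None and x.leaves()[0] is y.leaves()[0]:
        x, y = x.parent, y.parent
    if y is None:
        raise ValueError("the connective starts every enclosing span")
    span_y = y.leaves()
    term2 = span_y[span_y.index(last) + 1:]
    # Term 1, step (1): rightmost S/SBAR child of y to the left of x.
    left = [c for c in y.children[:y.children.index(x)] if c.label in ("S", "SBAR")]
    if left:
        return text(left[-1].leaves()), text(term2)
    # Term 1, step (2): climb from y to an S/SBAR, drop connective and term 2.
    y = y.parent
    while y is not None and y.label not in ("S", "SBAR"):
        y = y.parent
    if y is None:
        raise ValueError("no enclosing S or SBAR node")
    term1 = [l for l in y.leaves() if l not in connective and l not in term2]
    return text(term1), text(term2)


fig3 = Node(
    "S",
    Node("S", Node("NP", W("PRP", "He")),
         Node("VP", W("VBD", "became"), Node("NP", W("DT", "a"), W("NN", "student")))),
    W("CC", "and"),
    Node("S", Node("NP", W("PRP", "he")),
         Node("VP", W("VBD", "received"), Node("NP", W("DT", "a"), W("NN", "grant")))),
)
fig4 = Node(
    "S",
    Node("NP", W("DT", "The"), W("NNS", "figures")),
    Node(
        "VP",
        Node("VP", W("VBD", "were"),
             Node("VP", W("VBN", "adjusted"),
                  Node("PP", W("IN", "for"), Node("NP", W("NN", "deflation"))))),
        W("CC", "but"),
        Node("VP", W("VBD", "were"), W("RB", "not"),
             Node("VP", W("VBN", "adjusted"),
                  Node("PP", W("IN", "for"), Node("NP", W("NN", "inflation"))))),
    ),
)
print(extract_coordinate([find(fig3, "and")]))
print(extract_coordinate([find(fig4, "but")]))
```

Passing the connection expression as a list of leaves lets step (2) remove a multi-word expression as a whole, while the climb itself starts from its last word as the text specifies.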

When the term positional relationship determination unit 38 determines that the two terms linked by the connection expression do not both appear within the sentence containing it, the inter-sentence term extraction unit 42 extracts term 2 from the sentence containing the connection expression, which unit 38 designated as the sentence for extracting term 2, and extracts term 1 from the sentence corresponding to the parent node or a sibling node of that sentence, which unit 38 designated as the sentence for extracting term 1. It then outputs the two extracted terms to the output unit 50.

The semantic class classification unit 44 classifies the semantic class of each connection expression extracted by the connection expression extraction unit 36 and outputs the connection expression together with its semantic class to the output unit 50. Specifically, taking the extracted connection expression and the words around it as input, it assigns a semantic class by solving a multi-class classification problem with a classifier trained in advance on learning data. Because this is a multi-class classification problem, the learning data is resampled so that the class distribution is as uniform as possible.
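The text does not say how the resampling is done. One common reading is random undersampling, in which every class is cut down to the size of the smallest one; the sketch below assumes that reading. The function `resample_uniform`, the seed parameter, and the class names in the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_uniform(examples, seed=0):
    """Undersample (features, label) pairs so that every label is equally frequent."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append((features, label))
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        # Draw without replacement; the smallest class is kept in full.
        balanced.extend(rng.sample(by_label[label], smallest))
    rng.shuffle(balanced)
    return balanced


# Toy learning data with a skewed class distribution (labels are made up).
data = (
    [({"conn": "because"}, "Contingency")] * 6
    + [({"conn": "but"}, "Comparison")] * 3
    + [({"conn": "then"}, "Temporal")] * 2
)
balanced = resample_uniform(data)
print(Counter(label for _, label in balanced))
```

Oversampling the rare classes with replacement would be an equally valid interpretation when discarding learning data is too costly.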

<Operation of the connection expression term structure analysis apparatus according to the first embodiment of the present invention>

Next, the operation of the connection expression term structure analysis apparatus 100 according to the first embodiment of the present invention is described. When the input unit 10 receives a document, the apparatus 100 executes the connection expression term structure analysis processing routine shown in FIG. 5.

First, in step S100, the document received by the input unit 10 is acquired and its sentence boundaries are marked.

Next, in step S102, based on the document with sentence boundaries marked in step S100, a discourse structure tree in which each sentence of the document is represented by a node is generated based on the rhetorical structure of the sentences.

In step S104, each sentence of the document with sentence boundaries marked in step S100 is parsed to generate a syntax tree.

In step S106, connection expressions that take terms are extracted based on the syntax trees generated for the sentences in step S104.

In step S108, for each connection expression extracted in step S106, it is determined whether the two terms linked by the connection expression both appear within the sentence containing it. If it is determined that they do not, the sentence containing the connection expression is designated as the sentence from which term 2 is to be extracted, and, in the discourse structure tree generated in step S102, the sentence corresponding to the parent node or a sibling node of that sentence is designated as the sentence from which term 1 is to be extracted.

In step S110, if it was determined in step S108 that the two terms appear within the sentence containing the connection expression, term 1 and term 2 are extracted from that sentence and output to the output unit 50.

In step S112, if it was determined in step S108 that the two terms do not both appear within the sentence containing the connection expression, term 2 is extracted from the sentence containing the connection expression, designated in step S108 as the sentence for extracting term 2, and term 1 is extracted from the sentence corresponding to the parent node or a sibling node of that sentence, designated in step S108 as the sentence for extracting term 1. The two extracted terms are output to the output unit 50.

In step S114, the semantic class of each connection expression extracted in step S106 is classified, the connection expression and its semantic class are output to the output unit 50, and the connection expression term structure analysis processing routine ends.

As described above, the connection expression term structure analysis apparatus according to the first embodiment generates a discourse structure tree based on the rhetorical structure of the sentences of a document, parses the sentences to generate syntax trees, extracts connection expressions that take terms, and determines whether the two terms linked by each connection expression both appear within the sentence containing it. If they do, the two terms are extracted from that sentence. If they do not, term 2 is extracted from the sentence containing the connection expression and term 1 is extracted from the sentence corresponding to the parent node or a sibling node of that sentence in the discourse structure tree. The semantic class of the connection expression is also classified. In this way, terms linked by a connection expression can be extracted even from sentences that are not adjacent.

<Configuration of the connection expression term structure analysis apparatus according to the second embodiment of the present invention>

Next, the configuration of the connection expression term structure analysis apparatus according to the second embodiment of the present invention is described. Parts configured in the same way as in the first embodiment are given the same reference numerals, and their description is omitted. The apparatus according to the second embodiment extracts, from a document, the terms and semantic labels associated with implicit connection expressions.

As shown in FIG. 6, the connection expression term structure analysis apparatus 200 according to the second embodiment of the present invention can be configured as a computer that includes a CPU, a RAM, and a ROM storing various data and a program for executing the connection expression term structure analysis processing routine described later. Functionally, as shown in FIG. 6, the apparatus 200 comprises an input unit 10, a computing unit 220, and an output unit 50.

The computing unit 220 comprises a sentence division unit 30, a discourse structure analysis unit 32, a related sentence pair extraction unit 238, an inter-sentence term extraction unit 242, and a semantic class classification unit 244.

By the same processing as in the first embodiment, the discourse structure analysis unit 32 generates, based on the document with sentence boundaries marked by the sentence division unit 30, a discourse structure tree in which each sentence of the document is represented by a node, based on the rhetorical structure of the sentences.

Based on the discourse structure tree generated by the discourse structure analysis unit 32, the related sentence pair extraction unit 238 takes the sentence pairs corresponding to parent-child nodes and the sentence pairs corresponding to sibling nodes as candidate sentence pairs having a connection relationship, and determines for each candidate whether a connection relationship exists.

Specifically, the related sentence pair extraction unit 238 receives the discourse structure tree as input, takes the sentence pairs that are parent-child nodes or sibling nodes in the tree as candidate sentence pairs having a connection relationship, and decides for each candidate whether the pair has a connection relationship by using a binary classifier trained in advance. The binary classifier is trained on sentence pairs S_i, S_j prepared as learning data, using features (1) to (5) below.

(1) The first word of sentence S_i and of sentence S_j
(2) The last word of sentence S_i and of sentence S_j
(3) The first three words of sentence S_i and of sentence S_j
(4) All pairs of a word in sentence S_i and a word in sentence S_j
(5) All pairs of the semantic class of a word in sentence S_i and the semantic class of a word in sentence S_j
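A minimal sketch of how features (1) to (5) might be generated for one sentence pair is shown below. The function `pair_features`, the string encoding of the features, and the `word_class` lookup table are assumptions for illustration. The sketch expects both sentences to be non-empty token lists and skips class pairs for words that have no entry in the table.

```python
from itertools import product


def pair_features(s_i, s_j, word_class):
    """Return features (1)-(5) for tokenized sentences s_i and s_j as a set of strings."""
    feats = {
        f"first_i={s_i[0]}", f"first_j={s_j[0]}",  # (1) first word of each sentence
        f"last_i={s_i[-1]}", f"last_j={s_j[-1]}",  # (2) last word of each sentence
        "first3_i=" + "_".join(s_i[:3]),           # (3) first three words
        "first3_j=" + "_".join(s_j[:3]),
    }
    for a, b in product(s_i, s_j):
        feats.add(f"wp={a}|{b}")                   # (4) every cross-sentence word pair
        class_a, class_b = word_class.get(a), word_class.get(b)
        if class_a and class_b:
            feats.add(f"cp={class_a}|{class_b}")   # (5) every semantic-class pair
    return feats


# Hypothetical semantic classes, e.g. taken from a thesaurus or word clusters.
word_class = {"rained": "WEATHER", "cancelled": "EVENT"}
feats = pair_features(
    ["It", "rained", "heavily"],
    ["The", "game", "was", "cancelled"],
    word_class,
)
print(sorted(feats))
```

Feature (4) grows with the product of the two sentence lengths, so a real system would probably prune rare pairs before training; the source does not discuss this.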

The word semantic classes used in feature (5) can be obtained from an existing thesaurus or from the results of word clustering. In addition, for each candidate sentence pair determined to have a connection relationship, the related sentence pair extraction unit 238 uses the modifier-modified relationships expressed by the discourse structure tree to decide which sentence term 1 is extracted from and which sentence term 2 is extracted from. For example, if sentence S_i is a child node of sentence S_j, S_i is the sentence for extracting term 2 and S_j is the sentence for extracting term 1. If S_i and S_j are sibling nodes, the sentence with the smaller sentence number is the sentence for extracting term 1 and the one with the larger number is the sentence for extracting term 2.
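Candidate generation and role assignment can be sketched together as follows. The discourse structure tree is represented here as a dictionary from sentence number to parent sentence number, which is an assumed encoding, and the function name `candidate_pairs` is hypothetical. The binary classifier that filters the candidates is left out, so every candidate is returned.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(parent):
    """Return (term 1 sentence, term 2 sentence) pairs from a discourse tree.

    `parent` maps each sentence number to its parent's number (None for the root).
    """
    pairs = []
    children = defaultdict(list)
    for child, par in parent.items():
        if par is not None:
            pairs.append((par, child))  # parent gives term 1, child gives term 2
            children[par].append(child)
    for siblings in children.values():
        # Among siblings, the smaller sentence number gives term 1.
        pairs.extend(combinations(sorted(siblings), 2))
    return sorted(pairs)


# Toy tree: sentence 1 is the root, 2 and 3 are its children, 4 is a child of 3.
tree = {1: None, 2: 1, 3: 1, 4: 3}
pairs = candidate_pairs(tree)
print(pairs)
```

Because the pairs come from the tree rather than from text order, a pair such as (1, 3) links sentences that are not adjacent in the document.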

For each candidate sentence pair that the related sentence pair extraction unit 238 determined to have a connection relationship, the inter-sentence term extraction unit 242 extracts from the pair the two terms linked by an implicit connection expression. Since the related sentence pair extraction unit 238 has already decided which sentence term 1 and term 2 are each extracted from, only the terms themselves are taken out here, by operations (1) and (2) below.

（１）文中に含まれる記号のうち、「。」、「！」、「？」の文末表現を削除する。
（２）文頭、文末における「“”」等の括弧表現を削除する。
(1) Among the symbols included in the sentence, delete the sentence-ending expressions "。", "！", and "？".
(2) Delete bracket expressions such as “ ” at the beginning and end of the sentence.

文間項抽出部２４２では、上記の（１）及び（２）の操作を変化がなくなるまで繰り返し、暗示的な接続関係を有する２つの項を出力部５０に出力する。 The inter-sentence term extraction unit 242 repeats the operations (1) and (2) until there is no change, and outputs two terms having an implicit connection relationship to the output unit 50.
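Repeating operations (1) and (2) until nothing changes amounts to a fixed-point loop, which might look like the Python sketch below. The exact character sets are assumptions: the source names only 「。」, 「！」, 「？」 and “ ” "and the like", so the ASCII equivalents and extra bracket types are added purely for illustration. Operation (1) is interpreted here as removing marks at the end of the sentence, which is also an assumption.

```python
# Sentence-ending marks named in the text, plus ASCII equivalents (assumption).
SENTENCE_END = "。！？.!?"
# Quotation marks named in the text, plus other bracket types (assumption).
BRACKETS = "“”\"「」『』（）()"


def extract_term(sentence):
    """Strip sentence-ending marks and enclosing brackets until stable."""
    text = sentence.strip()
    while True:
        cleaned = text.rstrip(SENTENCE_END)            # operation (1)
        cleaned = cleaned.strip(BRACKETS).strip()      # operation (2)
        if cleaned == text:                            # no change: fixed point
            return cleaned
        text = cleaned
```

The loop is needed because each operation can expose material for the other: in `“雨が降った。”` the closing quotation mark hides the 「。」 until operation (2) has run once.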

意味クラス分類部２４４は、関連文ペア抽出部２３８によって接続関係があると判定された接続関係を持つ文のペアの候補の各々について、接続関係を持つ文のペアの候補に基づいて、暗示的な接続表現の意味クラスを分類し、出力部５０に出力する。意味クラス分類部２４４は、文ペアの候補の各々を入力として、予め学習データにより学習した多クラスの分類問題を解くことにより、文ペアの候補の各々の文同士をつなぐ接続関係の意味クラスを決定する。学習及び分類に用いる特徴は、上記関連文ペア抽出部２３８で利用した（１）〜（５）の特徴を利用する。さらに、多クラス分類問題であるため、学習データ中のクラス分布がなるべく均一になるようにデータを学習データから再サンプリングする。 For each sentence pair candidate that the related sentence pair extraction unit 238 has determined to have a connection relationship, the semantic class classification unit 244 classifies the semantic class of the implicit connection expression based on that sentence pair candidate and outputs the result to the output unit 50. Taking each sentence pair candidate as input, the semantic class classification unit 244 solves a multi-class classification problem learned in advance from learning data, and thereby determines the semantic class of the connection relationship that links the two sentences of the candidate. The features used for learning and classification are the features (1) to (5) used in the related sentence pair extraction unit 238. Furthermore, because this is a multi-class classification problem, data is resampled from the learning data so that the class distribution in the learning data becomes as uniform as possible.
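The source does not say how the resampling is performed. One simple possibility, shown here only as a sketch, is to downsample every class to the size of the smallest class. The function name `resample_balanced`, the `(features, label)` example format, and the choice of downsampling rather than oversampling are all assumptions of this example.

```python
import random
from collections import defaultdict


def resample_balanced(examples, seed=0):
    """Downsample so that every label occurs equally often.

    examples : list of (features, label) tuples; labels must be sortable
               (for instance strings). The format is hypothetical.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in examples:
        by_label[example[1]].append(example)
    if not by_label:
        return []
    n = min(len(group) for group in by_label.values())  # smallest class size
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], n))  # sample without replacement
    rng.shuffle(balanced)
    return balanced
```

Downsampling discards data from frequent classes; sampling with replacement up to the largest class size would be an alternative way to reach the same uniform distribution.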

＜本発明の第２の実施の形態に係る接続表現項構造解析装置の作用＞ <Operation of the connection expression term structure analyzing apparatus according to the second embodiment of the present invention>

次に、本発明の第２の実施の形態に係る接続表現項構造解析装置２００の作用について説明する。入力部１０において文書を受け付けると、接続表現項構造解析装置２００は、図７に示す接続表現項構造解析処理ルーチンを実行する。なお、第１の実施の形態と同様の作用となる箇所については同一符号を付して説明を省略する。 Next, the operation of the connection expression term structure analysis apparatus 200 according to the second embodiment of the present invention will be described. When the input unit 10 receives a document, the connection expression term structure analysis apparatus 200 executes the connection expression term structure analysis processing routine shown in FIG. 7. Steps that operate in the same way as in the first embodiment are given the same reference numerals, and their description is omitted.

ステップＳ２００では、ステップＳ１０２で生成された談話構造木に基づいて、親子ノードに対応する文のペア、及び兄弟ノードに対応する文のペアを、接続関係を持つ文のペアの候補とし、接続関係を持つ文のペアの候補の各々について、接続関係があるか否かを判定する。また、ステップ２００では、接続関係があると判定された文のペアの候補の各々について、談話構造木が表現する修飾、被修飾関係を利用して、項２を抽出するための文、及び項１を抽出するための文を決定する。 In step S200, based on the discourse structure tree generated in step S102, sentence pairs corresponding to parent-child nodes and sentence pairs corresponding to sibling nodes are taken as sentence pair candidates having a connection relationship, and for each candidate it is determined whether or not a connection relationship actually exists. Also in step S200, for each sentence pair candidate determined to have a connection relationship, the modifier-modified relationships expressed by the discourse structure tree are used to determine the sentence from which term 2 is extracted and the sentence from which term 1 is extracted.
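The candidate generation part of step S200 can be illustrated with the following Python sketch, which enumerates parent-child and sibling sentence pairs from a tree. The `parent` dictionary encoding of the discourse structure tree and the function name `candidate_pairs` are assumptions of this example; the subsequent judgment of whether each candidate really has a connection relationship is a separate classifier and is not shown.

```python
from itertools import combinations


def candidate_pairs(parent):
    """Enumerate sentence pair candidates from a discourse structure tree.

    parent : dict mapping a sentence number to its parent's sentence number
             (None for the root). This tree encoding is hypothetical.
    Returns parent-child pairs followed by sibling pairs.
    """
    children = {}
    pairs = []
    for node in sorted(parent):
        p = parent[node]
        if p is not None:
            pairs.append((p, node))                  # parent-child pair
            children.setdefault(p, []).append(node)
    for p in sorted(children):
        pairs.extend(combinations(children[p], 2))   # sibling pairs
    return pairs
```

With `parent = {1: None, 2: 1, 3: 1, 4: 3, 5: 1}`, the result includes the sibling pair `(2, 5)`, two sentences that are not adjacent in the document, while `(2, 4)` is excluded because those sentences are neither parent and child nor siblings.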

次に、ステップＳ２０２では、ステップＳ２００で接続関係があると判定された接続関係を持つ文のペアの候補の各々について、当該接続関係を持つ文のペアの候補から、暗示的な接続表現によって結ばれる２つの項を抽出し、出力部５０に出力する。 Next, in step S202, for each sentence pair candidate determined in step S200 to have a connection relationship, the two terms connected by an implicit connection expression are extracted from that sentence pair candidate and output to the output unit 50.

そして、ステップＳ２０４では、ステップＳ２００で接続関係があると判定された接続関係を持つ文のペアの候補の各々について、接続関係を持つ文のペアの候補に基づいて、暗示的な接続表現の意味クラスを分類し、出力部５０に出力し、接続表現項構造解析処理ルーチンを終了する。 Then, in step S204, for each sentence pair candidate determined in step S200 to have a connection relationship, the semantic class of the implicit connection expression is classified based on that sentence pair candidate and output to the output unit 50, and the connection expression term structure analysis processing routine ends.

なお、第２の実施の形態に係る接続表現項構造解析装置２００の他の構成及び作用については、第１の実施の形態と同様であるため、説明を省略する。 The other configurations and operations of the connection expression term structure analysis apparatus 200 according to the second embodiment are the same as those of the first embodiment, and their description is therefore omitted.

以上説明したように、第２の実施の形態に係る接続表現項構造解析装置によれば、文書に基づいて、修辞構造に基づく、文の各々を各ノードで表わした談話構造木を生成し、談話構造木に基づいて、親子ノードに対応する文のペア、及び兄弟ノードに対応する文のペアを、接続関係を持つ文のペアの候補とし、接続関係を持つ文のペアの候補の各々について、接続関係があるか否かを判定し、接続関係を持つ文のペアの候補から、暗示的な接続表現によって結ばれる２つの項を抽出し、暗示的な接続表現の意味クラスを分類することにより、隣接しない文間からも、接続関係を持つ意味的に結ばれた項を抽出することができる。 As described above, the connection expression term structure analysis apparatus according to the second embodiment generates, from a document, a discourse structure tree based on rhetorical structure in which each sentence is represented by a node. Based on the discourse structure tree, it takes sentence pairs corresponding to parent-child nodes and sentence pairs corresponding to sibling nodes as sentence pair candidates having a connection relationship, and determines for each candidate whether or not a connection relationship exists. It then extracts, from the sentence pair candidates having a connection relationship, the two terms connected by an implicit connection expression, and classifies the semantic class of the implicit connection expression. As a result, semantically connected terms having a connection relationship can be extracted even from sentences that are not adjacent.

なお、本発明は、上述した実施の形態に限定されるものではなく、この発明の要旨を逸脱しない範囲内で様々な変形や応用が可能である。 The present invention is not limited to the above-described embodiment, and various modifications and applications can be made without departing from the gist of the present invention.

例えば、上述した実施の形態では、第１の実施の形態に係る接続表現項構造解析装置によって、文書から明示的接続表現に関する接続表現、項、及び意味ラベルを抽出し、第２の実施の形態に係る接続表現項構造解析装置によって、文書から暗示的接続表現に関する接続表現、項、及び意味ラベルを抽出する場合を例に説明したが、これに限定されるものではなく、一つの接続表現項構造解析装置によって、文書から明示的接続表現に関する接続表現、項、及び意味ラベル、並びに暗示的接続表現に関する項、及び意味ラベルを抽出するようにしてもよい。 For example, the above embodiments have been described using the case in which the connection expression term structure analysis apparatus according to the first embodiment extracts connection expressions, terms, and semantic labels relating to explicit connection expressions from a document, and the connection expression term structure analysis apparatus according to the second embodiment extracts connection expressions, terms, and semantic labels relating to implicit connection expressions from a document. However, the present invention is not limited to this. A single connection expression term structure analysis apparatus may extract from a document the connection expressions, terms, and semantic labels relating to explicit connection expressions, together with the terms and semantic labels relating to implicit connection expressions.