JP7580593B2

JP7580593B2 - Method for training an end-to-end sensitive text recall model, and sensitive text recall method

Info

Publication number: JP7580593B2
Application number: JP2023524462A
Authority: JP
Inventors: ウェイルリウ，
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-06-06
Filing date: 2022-10-10
Publication date: 2024-11-11
Anticipated expiration: 2042-10-10
Also published as: WO2023236405A1; CN114943228A; CN114943228B; JP2024526395A

Description

Priority information

This application is filed on the basis of the Chinese patent application with application number "2022106332414" and a filing date of June 6, 2022, and claims priority to that Chinese patent application, the entire contents of which are incorporated herein by reference.

The present disclosure relates to the field of data processing technology, specifically to the field of artificial intelligence technology such as deep learning, and more specifically to a method for training an end-to-end sensitive text recall model and a sensitive text recall method.

Text within an application is one of the main ways of conveying information to users, but sensitive text containing harmful, rule-violating information gives users a poor experience, creates regulatory risk, damages the social climate, and ultimately leads users to abandon the application product. Word list recall can recall sensitive text in text information in a timely manner, thereby ensuring product safety and improving the user experience.

The present disclosure provides a method for training an end-to-end sensitive text recall model, a sensitive text recall method, an apparatus, a device, and a storage medium.

According to an embodiment of a first aspect of the present disclosure, a method for training an end-to-end sensitive text recall model is provided. The method includes: obtaining a preset word list and a first random text corpus in a sensitive text blocking scenario, where the text corresponding to each term in the preset word list is sensitive text; constructing positive sample data based on the preset word list, and constructing negative sample data based on the first random text corpus; performing iterative training on an initial text classification model based on the positive sample data and the negative sample data, using human evaluation and a multi-sample splice sampling scheme, to obtain a text classification model whose model indicator reaches a target standard after training; and generating the end-to-end sensitive text recall model based on the model parameters of the text classification model whose model indicator reaches the target standard, where the end-to-end sensitive text recall model has acquired word list recall capability through learning.

In some embodiments, performing iterative training on the initial text classification model based on the positive sample data and the negative sample data, using human evaluation and the multi-sample splice sampling scheme, includes: dividing the positive sample data and the negative sample data, as training samples, into a training set and a validation set; training a text classification model based on the training set and the validation set to obtain an optimal model; obtaining a test set and evaluating the optimal model on the test set to obtain a model evaluation result; updating the training samples based on the model evaluation result and the test set, using human evaluation and the multi-sample splice sampling scheme; and re-dividing the updated training samples into a training set and a validation set and repeating the step of training the text classification model on the training set and the validation set to obtain an optimal model, until the model indicator reaches the target standard after training.
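
The split, train, evaluate, update cycle described above can be sketched as a small driver loop. This is a minimal illustration, not the patent's implementation: `train_fn`, `evaluate_fn`, and `update_fn` are hypothetical callables standing in for model training, test-set evaluation, and the human-evaluation and splice update, and `val_fraction` and `max_rounds` are assumed parameters that the source does not specify.

```python
import random

def split_train_val(samples, val_fraction=0.1, seed=0):
    """Shuffle the samples and split them into (training set, validation set)."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

def iterate_training(samples, test_set, train_fn, evaluate_fn, update_fn,
                     target_recall, target_precision, max_rounds=10):
    """Repeat split -> train -> evaluate -> update until the targets are met."""
    model = None
    for round_idx in range(max_rounds):
        train_set, val_set = split_train_val(samples, seed=round_idx)
        model = train_fn(train_set, val_set)          # "optimal model" of this round
        recall, precision = evaluate_fn(model, test_set)
        if recall >= target_recall and precision >= target_precision:
            break                                     # model indicator reached the target
        samples = update_fn(samples, model, test_set) # add spliced hard samples
    return model
```

The loop stops either when both indicators reach their targets or after `max_rounds` rounds; the round cap is a safety assumption added for the sketch.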

In some embodiments, the test set includes recall samples and a second random text corpus, and evaluating the optimal model on the test set to obtain a model evaluation result includes: inputting the recall samples of the test set into the optimal model and obtaining a first prediction result output by the optimal model; determining a recall rate of the optimal model based on the first prediction result and the actual label information corresponding to the recall samples; inputting the second random text corpus of the test set into the optimal model and obtaining a second prediction result output by the optimal model; and determining a precision rate of the optimal model based on the second prediction result and the actual label information corresponding to the second random text corpus.
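
As a minimal sketch of the two indicators, assuming binary labels where 1 means "sensitive" and 0 means "normal" (an encoding chosen here for illustration; the source does not fix one), the recall rate can be computed over the recall samples and the precision rate over the second random text corpus using the standard definitions:

```python
def recall_rate(predictions, labels):
    """TP / (TP + FN): share of truly sensitive samples the model recalled."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision_rate(predictions, labels):
    """TP / (TP + FP): share of predicted-sensitive samples that truly are."""
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    return tp / (tp + fp) if (tp + fp) else 0.0

# First prediction result: model output on the recall samples (all truly sensitive).
print(recall_rate([1, 1, 0, 1], [1, 1, 1, 1]))           # 0.75
# Second prediction result: model output on the second random text corpus.
print(precision_rate([1, 0, 1, 0, 0], [1, 0, 0, 0, 0]))  # 0.5
```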

In some embodiments, updating the training samples based on the model evaluation result and the test set, using human evaluation and the multi-sample splice sampling scheme, includes: in response to the recall rate being less than a first threshold, obtaining a first human evaluation result for the examples predicted as negative in the first prediction result, and, based on the first human evaluation result, adding the recall samples that were wrongly predicted as negative to a sample set to be updated; and/or, in response to the precision rate being less than a second threshold, obtaining a second human evaluation result for the examples predicted as positive in the second prediction result, and, based on the second human evaluation result, adding the texts of the second random text corpus that were wrongly predicted as positive to the sample set to be updated; and splicing every N samples in the sample set to be updated into one sample and updating the training samples with the spliced samples, where N is an integer greater than 1.
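
The update step can be sketched as two helpers: one that collects mispredicted texts given human labels, and one that splices every N collected texts into a single sample. This is an illustrative sketch under assumptions: labels are encoded as 1 (sensitive) and 0 (normal), the separator `sep` and the handling of a final group shorter than N are choices made here, and how labels are assigned to spliced samples is not specified in this passage, so only the texts are spliced.

```python
def mispredicted(texts, predictions, human_labels, predicted_as):
    """Texts the model predicted as `predicted_as` but a human judged otherwise."""
    return [t for t, p, y in zip(texts, predictions, human_labels)
            if p == predicted_as and y != predicted_as]

def splice_every_n(samples, n=3, sep=""):
    """Join every n consecutive samples into one spliced sample.

    A trailing group with fewer than n samples is kept as-is (an assumption).
    """
    if n <= 1:
        raise ValueError("n must be an integer greater than 1")
    return [sep.join(samples[i:i + n]) for i in range(0, len(samples), n)]

# Recall samples wrongly predicted as negative (0) go into the update set.
to_update = mispredicted(["t1", "t2", "t3"], [0, 0, 1], [1, 0, 1], predicted_as=0)
print(to_update)                                   # ['t1']
print(splice_every_n(["a", "b", "c", "d", "e", "f", "g"], n=3))
# ['abc', 'def', 'g']
```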

In some embodiments, N is 3.

In some embodiments, the text classification model includes a first long short-term memory (LSTM) layer, an average pooling layer, a second LSTM layer, a max pooling layer, a concatenation (Concat) layer, a dropout layer, and a classification layer. The first LSTM layer extracts text features of a sample; the average pooling layer pools the text features to obtain a first path feature; the second LSTM layer performs feature extraction on the last hidden-layer output of the first LSTM layer and inputs the extracted features to the max pooling layer; the max pooling layer pools the output of the second LSTM layer to obtain a second path feature; the Concat layer splices the first path feature and the second path feature to obtain a spliced feature; the dropout layer applies a dropout operation to the spliced feature; and the classification layer classifies the features output by the dropout layer to obtain a classification prediction value.
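
The two-path pooling and concatenation can be illustrated without a deep learning framework. In the sketch below the two LSTM layers are stubbed out as precomputed sequences of per-step feature vectors, pooling is assumed to run over the time-step axis (the source does not state the axis), and the dropout and classification layers are omitted; all function names are hypothetical.

```python
def mean_pool(seq):
    """Average pooling over time steps: list of T vectors -> one vector."""
    steps = len(seq)
    return [sum(col) / steps for col in zip(*seq)]

def max_pool(seq):
    """Max pooling over time steps: list of T vectors -> one vector."""
    return [max(col) for col in zip(*seq)]

def two_path_features(first_lstm_out, second_lstm_out):
    """Concatenate the average-pooled first path and the max-pooled second path."""
    path1 = mean_pool(first_lstm_out)   # first path feature
    path2 = max_pool(second_lstm_out)   # second path feature
    return path1 + path2                # list concatenation = Concat layer

# Stand-ins for the outputs of the first and second LSTM layers (T=2, dim=2).
h1 = [[1.0, 2.0], [3.0, 4.0]]
h2 = [[0.5, -1.0], [0.2, 2.0]]
print(two_path_features(h1, h2))        # [2.0, 3.0, 0.5, 2.0]
```

In a full model the spliced vector would then pass through dropout and a classifier to produce the prediction value.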

According to an embodiment of a second aspect of the present disclosure, a sensitive text recall method is provided. The method includes: obtaining a text to be processed; and predicting the text to be processed based on a pre-trained end-to-end sensitive text recall model to determine whether to recall the text to be processed, where the end-to-end sensitive text recall model has acquired word list recall capability through learning and is trained using the method of the first aspect.

In some embodiments, predicting the text to be processed based on the pre-trained end-to-end sensitive text recall model to determine whether to recall the text to be processed includes: extracting text features of the text to be processed through the first long short-term memory (LSTM) layer; pooling the text features through the average pooling layer to obtain a first path feature; performing feature extraction on the last hidden-layer output of the first LSTM layer through the second LSTM layer, and inputting the extracted features to the max pooling layer; pooling the output of the second LSTM layer through the max pooling layer to obtain a second path feature; splicing the first path feature and the second path feature to obtain a spliced feature, and applying a dropout operation to the spliced feature through the dropout layer; classifying the features output by the dropout layer through the classification layer to obtain a classification prediction value; and determining, based on the prediction value, whether to recall the text to be processed.

According to an embodiment of a third aspect of the present disclosure, an apparatus for training an end-to-end sensitive text recall model is provided. The apparatus includes: an acquisition module for obtaining a preset word list and a first random text corpus in a sensitive text blocking scenario, where the text corresponding to each term in the preset word list is sensitive text; a construction module for constructing positive sample data based on the preset word list and constructing negative sample data based on the first random text corpus; and a processing module for performing iterative training on an initial text classification model based on the positive sample data and the negative sample data, using human evaluation and a multi-sample splice sampling scheme, to obtain a text classification model whose model indicator reaches a target standard after training, and for generating the end-to-end sensitive text recall model based on the model parameters of that text classification model, where the end-to-end sensitive text recall model has acquired word list recall capability through learning.

In some embodiments, the processing module is specifically configured to: divide the positive sample data and the negative sample data, as training samples, into a training set and a validation set; train a text classification model based on the training set and the validation set to obtain an optimal model; obtain a test set and evaluate the optimal model on the test set to obtain a model evaluation result; update the training samples based on the model evaluation result and the test set, using human evaluation and the multi-sample splice sampling scheme; and re-divide the updated training samples into a training set and a validation set and repeat the step of training the text classification model on the training set and the validation set to obtain an optimal model, until the model indicator reaches the target standard after training.

In some embodiments, the test set includes recall samples and a second random text corpus, and the processing module is specifically configured to: input the recall samples of the test set into the optimal model and obtain a first prediction result output by the optimal model; determine a recall rate of the optimal model based on the first prediction result and the actual label information corresponding to the recall samples; input the second random text corpus of the test set into the optimal model and obtain a second prediction result output by the optimal model; and determine a precision rate of the optimal model based on the second prediction result and the actual label information corresponding to the second random text corpus.

In some embodiments, the processing module is specifically configured to: in response to the recall rate being less than a first threshold, obtain a first human evaluation result for the examples predicted as negative in the first prediction result, and, based on the first human evaluation result, add the recall samples that were wrongly predicted as negative to a sample set to be updated; and/or, in response to the precision rate being less than a second threshold, obtain a second human evaluation result for the examples predicted as positive in the second prediction result, and, based on the second human evaluation result, add the texts of the second random text corpus that were wrongly predicted as positive to the sample set to be updated; and splice every N samples in the sample set to be updated into one sample and update the training samples with the spliced samples, where N is an integer greater than 1.

In some embodiments, the text classification model includes a first long short-term memory (LSTM) layer, an average pooling layer, a second LSTM layer, a max pooling layer, a concatenation (Concat) layer, a dropout layer, and a classification layer. The first LSTM layer extracts text features of a sample; the average pooling layer pools the text features to obtain a first path feature; the second LSTM layer performs feature extraction on the last hidden-layer output of the first LSTM layer and inputs the extracted features to the max pooling layer; the max pooling layer pools the output of the second LSTM layer to obtain a second path feature; the Concat layer splices the first path feature and the second path feature to obtain a spliced feature; the dropout layer applies a dropout operation to the spliced feature; and the classification layer classifies the features output by the dropout layer to obtain a classification prediction value.

According to an embodiment of a fourth aspect of the present disclosure, a sensitive text recall apparatus is provided. The apparatus includes: an acquisition module for obtaining a text to be processed; and a prediction module for predicting the text to be processed based on a pre-trained end-to-end sensitive text recall model to determine whether to recall the text to be processed, where the end-to-end sensitive text recall model has acquired word list recall capability through learning and is trained using the method according to any embodiment of the first aspect of the present disclosure.

In some embodiments, the prediction module is specifically configured to: extract text features of the text to be processed through the first long short-term memory (LSTM) layer; pool the text features through the average pooling layer to obtain a first path feature; perform feature extraction on the last hidden-layer output of the first LSTM layer through the second LSTM layer, and input the extracted features to the max pooling layer; pool the output of the second LSTM layer through the max pooling layer to obtain a second path feature; splice the first path feature and the second path feature to obtain a spliced feature; apply a dropout operation to the spliced feature through the dropout layer; classify the features output by the dropout layer through the classification layer to obtain a classification prediction value; and determine, based on the prediction value, whether to recall the text to be processed.

According to an embodiment of a fifth aspect of the present disclosure, an electronic device is provided, including at least one processor and a memory communicatively connected to the at least one processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the method according to any embodiment of the first aspect or the second aspect.

According to an embodiment of a sixth aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, where the computer instructions cause a computer to perform the method according to any embodiment of the first aspect or the second aspect.

According to an embodiment of a seventh aspect of the present disclosure, a computer program is provided which, when executed by a computer, causes the computer to perform the method according to any embodiment of the first aspect or the second aspect.

According to the technology of the present disclosure, positive sample data and negative sample data are constructed based on a word list and a large amount of real data, and a text classification model is iteratively trained on the constructed positive and negative sample data to generate an end-to-end sensitive text recall model that learns word list recall capability. This improves the knowledge generalization capability of the end-to-end sensitive text recall model and its ability to recall sensitive text, so that word list recall can be performed with the end-to-end sensitive text recall model and word list generalization is improved.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood from the following description.

The drawings are used for a better understanding of the present technical solution and are not intended to limit the present disclosure. In order, the drawings are: a schematic diagram according to a first embodiment of the present disclosure (FIG. 1); a schematic flowchart of model training provided by an embodiment of the present disclosure; a schematic diagram according to a second embodiment of the present disclosure; a schematic diagram of the architecture of a text classification model provided by an embodiment of the present disclosure; a schematic diagram according to a third embodiment of the present disclosure; a schematic diagram of an apparatus for training an end-to-end sensitive text recall model provided by an embodiment of the present disclosure; a schematic diagram of a sensitive text recall apparatus provided by an embodiment of the present disclosure; and a block diagram of an electronic device provided by an embodiment of the present disclosure.

Exemplary embodiments of the present disclosure are described below in conjunction with the drawings. Various details of the embodiments are included to facilitate understanding and should be regarded as merely exemplary. Those skilled in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description. In the description of the present disclosure, unless otherwise specified, "/" means "or"; for example, A/B can represent A or B. "And/or" in this specification describes an association between related objects and indicates that three relationships are possible; for example, A and/or B can represent three situations: A alone, both A and B, and B alone.

Blocking sensitive information in text (such as information that violates relevant laws and regulations) by means of a word list is an important way to eliminate harmful information, but the word list policies used in the related art generalize poorly. For example, with "We love XX" as a term, the text "We love XX, we love YY" can be recalled, but the text "We really love XX", whose meaning is very similar, cannot be recalled. The embodiments of the present disclosure therefore propose a method for training an end-to-end sensitive text recall model, which makes it possible to build a sensitive text recall model with strong knowledge generalization capability, to perform end-to-end text prediction and recall based on that model, and to effectively recall semantically similar expressions.
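
The limitation of a plain word list can be reproduced with a few lines of code. The sketch below assumes that word list recall means exact substring matching (the source does not describe the matching rule in detail) and uses a hypothetical one-entry word list; the second text inserts an intensifier inside the term, as in the original-language example, so the literal term no longer appears.

```python
def word_list_recall(text: str, word_list: list[str]) -> bool:
    """Return True if any term in the word list occurs verbatim in the text."""
    return any(term in text for term in word_list)

word_list = ["we love XX"]  # hypothetical single-entry word list

print(word_list_recall("we love XX, we love YY", word_list))  # True
print(word_list_recall("we really love XX", word_list))       # False
```

A learned classifier is meant to close exactly this gap by recalling the second text as well.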

Referring to FIG. 1, FIG. 1 is a schematic diagram of a method for training an end-to-end sensitive text recall model according to a first embodiment of the present disclosure. As shown in FIG. 1, the method may include, but is not limited to, the following steps S101 to S104.

Step S101: obtain a preset word list and a first random text corpus in a sensitive text blocking scenario.

In the embodiments of the present disclosure, the preset word list contains terms corresponding to sensitive text, and these terms may include multi-word terms and single-word terms. The first random text corpus includes, but is not limited to, text obtained via the preset word list or text that human evaluation has judged to be normal.

For example, a preset word list consisting of the terms corresponding to the sensitive text that needs to be blocked in the sensitive text blocking scenario is obtained, and a first random text corpus is obtained according to the actual situation.

Step S102: construct positive sample data based on the preset word list, and construct negative sample data based on the first random text corpus.

For example, the delimiters of the multi-word terms and the single-word terms in the preset word list are removed, and the remaining text is used as the positive samples for the model. The proportion of harmful text in the first random text corpus is then estimated. If that proportion is less than or equal to a preset threshold (for example, 1%), texts are randomly drawn directly from the first random text corpus as negative samples. If the proportion is greater than the preset threshold, the first random text corpus is reviewed and harmful text is removed from it so that the proportion falls to or below the preset threshold, and texts are then randomly drawn from the processed corpus as negative samples.
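
A minimal sketch of this sample construction follows. Several details are assumptions made for illustration: multi-word terms are taken to be joined by a `DELIMITER` character ("&" is a placeholder, since the source does not name the delimiter), stripping the delimiter joins the words directly, and `is_harmful` is a hypothetical callable standing in for the manual estimation and review of harmful text.

```python
import random

DELIMITER = "&"            # assumed delimiter; not specified in the source
HARMFUL_THRESHOLD = 0.01   # the 1% example threshold from the text

def build_positive_samples(word_list):
    """Drop single-word terms and strip delimiters from multi-word terms."""
    positives = []
    for term in word_list:
        if DELIMITER not in term:
            continue                      # single-word term: removed
        positives.append(term.replace(DELIMITER, ""))
    return positives

def build_negative_samples(corpus, is_harmful, k, seed=0):
    """Draw k negatives, screening the corpus first if too much of it is harmful."""
    if not corpus:
        return []
    harmful_ratio = sum(1 for t in corpus if is_harmful(t)) / len(corpus)
    if harmful_ratio > HARMFUL_THRESHOLD:
        # Stand-in for manual review: remove the harmful texts.
        corpus = [t for t in corpus if not is_harmful(t)]
    rng = random.Random(seed)
    return rng.sample(corpus, min(k, len(corpus)))

print(build_positive_samples(["a&b", "c"]))   # ['ab']
```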

Step S103: based on the positive sample data and the negative sample data, perform iterative training on the initial text classification model using human evaluation and the multi-sample splice sampling scheme, to obtain a text classification model whose model indicator reaches the target standard after training.

本開示の実施例では、テキスト分類モデルは、ＴｅｘｔＣＮＮ（ＴｅｘｔＣｏｎｖｏｌｕｔｉｏｎａｌＮｅｕｒａｌＮｅｔｗｏｒｋｓ、テキスト畳み込みニューラルネットワーク）、高速テキスト分類ＦａｓｔＴｅｘｔ、ＢＥＲＴ（ＢｉｄｉｒｅｃｔｉｏｎａｌＥｎｃｏｄｅｒＲｅｐｒｅｓｅｎｔａｔｉｏｎｓｆｒｏｍＴｒａｎｓｆｏｒｍｅｒｓ、コンバータベースの双方向符号化表示）を含むが、これに限定されない。マルチサンプルスプライスのサンプリング方式とは、人間によって評価された複数のマークアップされたテキストのうち、予め選択された数（例えば、３つ）のテキストを１つのサンプルにスプライスすることを指す。例えば、人間によって評価された複数のマークアップされたテキストのうちの３つずつのテキストを１つのサンプルにスプライスする。
In the embodiments of the present disclosure, the text classification model includes, but is not limited to, TextCNN (Text Convolutional Neural Networks), the fast text classifier FastText, and BERT (Bidirectional Encoder Representations from Transformers). The multi-sample splice sampling method refers to splicing a preselected number (e.g., three) of texts, taken from a plurality of human-evaluated labeled texts, into one sample. For example, every three of the human-evaluated labeled texts are spliced into one sample.
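A minimal sketch of the multi-sample splice step is given below. The separator `sep` and the handling of a final group with fewer than `n` texts (kept as a shorter sample) are assumptions; the description does not specify either.

```python
def splice_samples(texts, n=3, sep=""):
    """Join every `n` consecutive texts into one training sample."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    # A trailing group with fewer than n texts is kept as a shorter sample.
    return [sep.join(texts[i:i + n]) for i in range(0, len(texts), n)]
```

For instance, six reviewed texts with `n=3` yield two spliced samples.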

For example, iterative training is performed on the initial text classification model based on the positive and negative sample data. During this training, every preset number of training steps (e.g., 100 steps), the metrics of the current model are computed to evaluate its performance, and the training samples are updated using the multi-sample splice sampling method. The loss value is computed from the loss function, gradients are backpropagated from the loss value to optimize the model parameters, and the parameter-optimized model is trained on the updated training samples. These steps are repeated until the model metrics reach the target standard, and the resulting model is taken as the text classification model.

本開示の実施例では、ターゲット指標とは、モデルがターゲット效果を達成しているか否かを判定するための予め設定された指標を指し、モデルの指標は、モデルの適合率とモデルの再現率を含むが、これに限定されない。 In the embodiments of the present disclosure, the target indicator refers to a preset indicator for determining whether the model achieves the target effect, and the model indicator includes, but is not limited to, the model precision and model recall.

In the embodiments of the present disclosure, the accuracy can be calculated as follows:

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where accuracy is the accuracy rate, TP is the number of positive samples that the model predicts as positive, FN is the number of positive samples that the model predicts as negative, FP is the number of negative samples that the model predicts as positive, and TN is the number of negative samples that the model predicts as negative.

In the embodiments of the present disclosure, the recall can be calculated as follows:

$$\mathrm{recall} = \frac{TP}{TP + FN}$$

where recall is the recall rate, TP is the number of positive samples that the model predicts as positive, and FN is the number of positive samples that the model predicts as negative.
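As an illustration, the four counts TP, FN, FP and TN and the two metrics can be computed from label lists as sketched below, using the standard definitions of accuracy and recall. The function names are hypothetical, and returning 0.0 for an empty denominator is an assumed convention.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn


def accuracy(tp, fn, fp, tn):
    """Fraction of all samples that were predicted correctly."""
    total = tp + fn + fp + tn
    return (tp + tn) / total if total else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0
```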

In the embodiments of the present disclosure, the loss value can be calculated as follows:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\right]$$

where L is the loss value, N is the number of samples, i denotes the i-th sample, y_i is the label of sample i (1 for a positive sample, 0 for a negative sample), and p_i is the predicted probability that sample i is a positive sample.
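With labels y_i in {0, 1} and predicted probabilities p_i, a loss of this kind is commonly computed as the binary cross-entropy. The sketch below assumes the loss is averaged over the samples; the clipping constant `eps` is an added safeguard against log(0) and is not part of the description.

```python
import math


def binary_cross_entropy(labels, probs, eps=1e-12):
    """Mean binary cross-entropy for labels in {0, 1} and probabilities in [0, 1]."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)
```

Confident correct predictions give a loss near zero, while confident wrong predictions are penalized heavily.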

ステップＳ１０４、モデル指標がターゲット標準に達するテキスト分類モデルのモデルパラメータに基づいて、エンドツーエンドセンシティブテキストリコールモデルを生成する。 Step S104: Generate an end-to-end sensitive text recall model based on the model parameters of the text classification model whose model indicators reach the target standard.

本開示の実施例では、エンドツーエンドセンシティブテキストリコールモデルは既にワードリストリコール能力を学習した。 In the embodiment of the present disclosure, the end-to-end sensitive text recall model has already learned the word list recall capability.

例えば、モデル指標がターゲット標準に達するテキスト分類モデルのモデルパラメータに基づいて、予め設定されたニューラルネットワークモデル構造を使用して、エンドツーエンドセンシティブテキストリコールモデルを生成する。 For example, an end-to-end sensitive text recall model is generated using a pre-configured neural network model structure based on the model parameters of a text classification model whose model metrics reach a target standard.

本開示の実施例では、予め設定されたニューラルネットワークモデル構造は前記のテキスト分類モデルと同じであってもよい。 In an embodiment of the present disclosure, the pre-configured neural network model structure may be the same as the text classification model described above.

本開示の実施例を実施することにより、構築されたポジティブサンプルデータとネガティブサンプルデータに基づいてテキスト分類モデルに対してイテレーション処理トレーニングを行って、エンドツーエンドセンシティブテキストリコールモデルを生成することができ、これにより、エンドツーエンドセンシティブテキストリコールモデルの知識汎化能力を向上させて、このモデルのセンシティブテキストに対するリコール能力を向上させる。 By implementing the embodiments of the present disclosure, an end-to-end sensitive text recall model can be generated by performing iterative training on a text classification model based on the constructed positive and negative sample data, thereby improving the knowledge generalization ability of the end-to-end sensitive text recall model and improving the recall ability of the model for sensitive text.

本開示の実施例は、ポジティブ．ネガティブサンプルを構築することにより、テキスト分類モデルがワードリストリコール能力を学習可能である。しかしながら、ニューラルネットワークの汎化能力が高く、ワードリストリコールニーズに合致しない多くのテキストをリコールする可能性があるため、本開示の実施例は、モデルが適切な汎化能力をより正確に学習できるように確保するために、オフライントレーニング環境でモデルトレーニングフローを設計した。一例として、図２を参照すると、図２は、本開示の実施例によって提供されるモデルトレーニングの概略フローチャートであり、図２に示すように、本開示の実施例は、ポジティブ．ネガティブサンプルを構築することにより分類モデルをトレーニングし、モデルのイテレーション過程において、モデルの適合率、再現率の指標に基づいて、人間による評価方式で、モデルが誤リコールするサンプルをポジティブ/ネガティブサンプルに加え、これによってモデル指標を向上させる。 In the embodiment of the present disclosure, the text classification model can learn the word list recall ability by constructing positive and negative samples. However, since the generalization ability of the neural network is high and it may recall many texts that do not meet the word list recall needs, the embodiment of the present disclosure designed a model training flow in an offline training environment to ensure that the model can more accurately learn the appropriate generalization ability. As an example, refer to FIG. 2, which is a schematic flowchart of model training provided by the embodiment of the present disclosure. As shown in FIG. 2, the embodiment of the present disclosure trains the classification model by constructing positive and negative samples, and in the model iteration process, samples that the model misrecalls are added to the positive/negative samples in a human evaluation manner based on the model's precision and recall indicators, thereby improving the model indicator.

一例として、図３を参照すると、図３は本開示の第２の実施例に係るモデルトレーニング方法の概略図である。図３に示すように、前記ポジティブサンプルデータとネガティブサンプルデータに基づいて、人間による評価方式とマルチサンプルスプライスのサンプリング方式によって初期のテキスト分類モデルに対してイテレーション処理トレーニングを実行する実現過程は、以下のステップＳ３０１～Ｓ３０５を含むことができるが、これに限定されない。 As an example, refer to FIG. 3, which is a schematic diagram of a model training method according to a second embodiment of the present disclosure. As shown in FIG. 3, the implementation process of performing iterative training on an initial text classification model based on the positive sample data and the negative sample data through a human evaluation method and a multi-sample splice sampling method may include, but is not limited to, the following steps S301 to S305.

ステップＳ３０１、ポジティブサンプルデータとネガティブサンプルデータをトレーニングサンプルとしてトレーニングセットと検証セットに分割する。 Step S301: Divide the positive sample data and negative sample data into a training set and a validation set as training samples.

例えば、ポジティブサンプルデータとネガティブサンプルデータをそれぞれ予め設定された割合でランダムに分割して、トレーニングセットと検証セットを得る。 For example, the positive sample data and the negative sample data are randomly divided in a predetermined ratio to obtain a training set and a validation set.

いくつかの実施例では、トレーニングセットと検証セット中のサンプルデータの数の割合が９：１であることを例として、ポジティブサンプルとネガティブサンプルをそれぞれ上記の割合でランダムに分割し、９０％のポジティブサンプルデータと９０％のネガティブサンプルデータをトレーニングセットとして、残りの１０％のポジティブサンプルデータと１０％のネガティブサンプルデータを検証セットとする。 In some embodiments, for example, the ratio of the number of sample data in the training set and the validation set is 9:1, and the positive samples and negative samples are randomly divided in the above ratios, with 90% of the positive sample data and 90% of the negative sample data being the training set, and the remaining 10% of the positive sample data and 10% of the negative sample data being the validation set.
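The 9:1 split, applied to positive and negative samples separately, can be sketched as follows. The function name and the fixed seed are illustrative assumptions.

```python
import random


def stratified_split(positives, negatives, train_ratio=0.9, seed=0):
    """Split positives and negatives separately so both sets keep the class mix."""
    rng = random.Random(seed)
    train, valid = [], []
    for samples, label in ((positives, 1), (negatives, 0)):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_ratio)
        train += [(text, label) for text in shuffled[:cut]]
        valid += [(text, label) for text in shuffled[cut:]]
    return train, valid
```

Splitting each class on its own keeps the positive-to-negative ratio the same in the training set and the validation set.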

ステップＳ３０２、トレーニングセットと検証セットに基づいて、テキスト分類モデルをトレーニングして、最適なモデルを得る。 Step S302: Train a text classification model based on the training set and validation set to obtain an optimal model.

For example, the text classification model is trained on the training set. Every preset number of training steps (e.g., 100 steps), the validation set is used to test the precision and recall of the model at the current training step. The precision and recall of the models at different training steps are then compared to obtain the optimal model for the current training set and validation set.

ステップＳ３０３、テストセットを取得し、テストセットに基づいて最適なモデルを評価し、モデル評価結果を得る。 Step S303: Obtain a test set, evaluate the optimal model based on the test set, and obtain a model evaluation result.

例えば、リコールセットとランダムデータを含むテストセットを取得し、前のステップで得られた最適なモデルに基づいてテストセットをリコールし、モデル出力に基づいて、モデルの現在指標を計算し、この指標を最適なモデルの評価結果とする。 For example, obtain a test set containing a recall set and random data, recall the test set based on the optimal model obtained in the previous step, calculate the current index of the model based on the model output, and take this index as the evaluation result of the optimal model.

いくつかの実施例では、上記のテストセットにはリコールサンプルと第２のランダムテキストコーパスが含まれることができ、前記テストセットに基づいて最適なモデルを評価し、モデル評価結果を得るステップは、テストセットのうちのリコールサンプルを最適なモデルに入力し、最適なモデルから出力された第１の予測結果を取得するステップと、第１の予測結果とリコールサンプルに対応する実際のラベル情報に基づいて、最適なモデルの再現率を決定するステップと、テストセットのうちの第２のランダムテキストコーパスを最適なモデルに入力し、最適なモデルから出力された第２の予測結果を取得するステップと、第２の予測結果と第２のランダムテキストコーパスに対応する実際のラベル情報に基づいて、最適なモデルの適合率を決定するステップと、を含むことができる。 In some embodiments, the test set may include a recall sample and a second random text corpus, and the step of evaluating the optimal model based on the test set and obtaining a model evaluation result may include the steps of: inputting the recall sample from the test set into the optimal model and obtaining a first prediction result output from the optimal model; determining a recall rate of the optimal model based on the first prediction result and actual label information corresponding to the recall sample; inputting the second random text corpus from the test set into the optimal model and obtaining a second prediction result output from the optimal model; and determining a precision rate of the optimal model based on the second prediction result and actual label information corresponding to the second random text corpus.

本開示の実施例では、リコールサンプルは、モデルのリコール能力をテストするために予め取得されたものであり、第２のランダムテキストコーパスの取得方式は第１のランダムテキストコーパスの取得方式と同じであってもよい。 In an embodiment of the present disclosure, the recall samples are pre-acquired to test the recall ability of the model, and the acquisition method of the second random text corpus may be the same as the acquisition method of the first random text corpus.

なお、第１のランダムテキストコーパスのうちのテキストと第２のランダムテキストコーパスのうちのテキストとが異なる。 Note that the text in the first random text corpus is different from the text in the second random text corpus.

As an example, text obtained using the preset word list, or text whose manual review result is normal, can be randomly split into two parts, with one part used as the first random text corpus and the other as the second random text corpus, which ensures that the two corpora contain different text.

例えば、テストセットのうちのリコールサンプルを入力データとして最適なモデルに入力して、リコールサンプル中の各サンプルのラベル情報を予測し、各サンプルの予測ラベル情報を第１の予測結果として取得し、第１の予測結果とリコールサンプルに対応する実際のラベル情報に基づいて、前記の再現率の計算式で、最適なモデルの再現率を計算し、テストセットのうちの第２のランダムテキストコーパスを入力データとして最適なモデルに入力して、第２のランダムテキストコーパスの各サンプルのラベル情報を予測し、各サンプルの予測ラベル情報を第２の予測結果として取得し、第２の予測結果と第２のランダムテキストコーパスに対応する実際のラベル情報に基づいて、前記の適合率の計算式で、最適なモデルの適合率を計算する。 For example, a recall sample from the test set is input as input data into the optimal model to predict label information for each sample in the recall sample, the predicted label information for each sample is obtained as a first prediction result, and the recall of the optimal model is calculated based on the first prediction result and the actual label information corresponding to the recall sample using the above-mentioned recall calculation formula; a second random text corpus from the test set is input as input data into the optimal model to predict label information for each sample in the second random text corpus, the predicted label information for each sample is obtained as a second prediction result, and the precision of the optimal model is calculated based on the second prediction result and the actual label information corresponding to the second random text corpus using the above-mentioned precision calculation formula.

The precision can be calculated as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

where Precision is the precision rate, TP is the number of positive samples predicted as positive, and FP is the number of negative samples predicted as positive.
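The two-part evaluation, with recall measured on the recall samples and precision measured on the second random text corpus, might look like the sketch below. `predict` is a hypothetical callable returning 1 or 0 for a text, both inputs are assumed to be lists of (text, label) pairs, and returning 0.0 for an empty denominator is an assumed convention.

```python
def evaluate_model(predict, recall_samples, random_corpus):
    """Return (recall, precision) measured on the two parts of the test set."""
    # Recall: share of actual positives in the recall samples that the model catches.
    tp = fn = 0
    for text, label in recall_samples:
        if label == 1:
            if predict(text) == 1:
                tp += 1
            else:
                fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Precision: share of texts flagged in the random corpus that are truly positive.
    tp_r = fp_r = 0
    for text, label in random_corpus:
        if predict(text) == 1:
            if label == 1:
                tp_r += 1
            else:
                fp_r += 1
    precision = tp_r / (tp_r + fp_r) if (tp_r + fp_r) else 0.0
    return recall, precision
```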

ステップＳ３０４、モデル評価結果とテストセットに基づいて、人間による評価方式とマルチサンプルスプライスのサンプリング方式によってトレーニングサンプルを更新する。 Step S304: Based on the model evaluation results and the test set, update the training samples using the human evaluation method and the multi-sample splice sampling method.

For example, if the model evaluation result does not reach the expected result, the data that the model predicts to be positive examples is evaluated by humans, and the data among it whose actual class is negative is added to the training samples, thereby updating the training samples.

In some embodiments, the step of updating the training samples by the human evaluation method and the multi-sample splice sampling method based on the model evaluation result and the test set may include: in response to the recall being less than a first threshold, obtaining a first human evaluation result for the samples predicted to be negative examples in the first prediction result, and, based on the first human evaluation result, adding the recall samples that were mispredicted as negative examples to a sample set to be updated; and/or, in response to the precision being less than a second threshold, obtaining a second human evaluation result for the texts predicted to be positive examples in the second prediction result, and, based on the second human evaluation result, adding the texts of the second random text corpus that were mispredicted as positive examples to the sample set to be updated; and splicing every N samples in the sample set to be updated into one sample and updating the training samples with the spliced samples, where N is an integer greater than 1.

In some embodiments, N is 3. In some embodiments, in response to the recall of the current model being less than the first threshold, the first prediction result is obtained, the samples predicted to be negative examples in the first prediction result are evaluated by humans, and the positive samples among them that were mispredicted as negative examples are selected and added to the sample set to be updated; every three samples in the sample set to be updated are spliced into one sample, and the training samples are updated with the spliced samples. In response to the precision of the current model being greater than or equal to the second threshold, the second prediction result is not processed.

In some other embodiments, in response to the recall of the current model being greater than or equal to the first threshold, the first prediction result is not processed. In response to the precision of the current model being less than the second threshold, the second prediction result is obtained, the samples predicted to be positive examples in the second prediction result are evaluated manually, and the negative samples among them that were mispredicted as positive examples are selected and added to the sample set to be updated; every three samples in the sample set to be updated are spliced into one sample, and the training samples are updated with the spliced samples.

In some other embodiments, in response to the recall of the current model being less than the first threshold, the first prediction result is obtained, the samples predicted to be negative examples in the first prediction result are evaluated by humans, and the positive samples among them that were mispredicted as negative examples are selected and added to the sample set to be updated. In response to the precision of the current model being less than the second threshold, the second prediction result is obtained, the samples predicted to be positive examples in the second prediction result are evaluated by humans, and the negative samples among them that were mispredicted as positive examples are selected and added to the sample set to be updated. Every three samples in the sample set to be updated are then spliced into one sample, and the training samples are updated with the spliced samples.
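The threshold-driven update can be sketched as follows. `human_label` is a hypothetical callable standing in for the human evaluation, and the prediction inputs are assumed to be lists of (text, predicted_label) pairs. The description does not say how positive and negative texts are combined when splicing, so this sketch splices each class separately so that every spliced sample keeps a single label.

```python
def build_update_samples(recall_preds, random_preds, human_label,
                         recall, precision,
                         first_threshold, second_threshold, n=3):
    """Collect mispredicted samples and splice every `n` of them into one sample."""
    positives, negatives = [], []
    if recall < first_threshold:
        # Missed positives: predicted negative, but the reviewer says positive.
        positives = [t for t, p in recall_preds if p == 0 and human_label(t) == 1]
    if precision < second_threshold:
        # False alarms: predicted positive, but the reviewer says negative.
        negatives = [t for t, p in random_preds if p == 1 and human_label(t) == 0]

    def splice(texts):
        return ["".join(texts[i:i + n]) for i in range(0, len(texts), n)]

    return [(s, 1) for s in splice(positives)] + [(s, 0) for s in splice(negatives)]
```

When both metrics meet their thresholds, the function returns an empty list, which corresponds to the case in which no update is needed.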

In some other embodiments, when the recall of the current model is greater than or equal to the first threshold and the precision of the current model is greater than or equal to the second threshold, the metrics of the current model are determined to have reached the target standard.

Step S305: Re-divide the updated training samples into a training set and a validation set, and repeat the step of training the text classification model based on the training set and the validation set to obtain an optimal model, until the model metrics after training reach the target standard.

例えば、更新後のトレーニングサンプルを予め設定された割合で再分割して、新たなトレーニングセットと検証セットを得て、新たなトレーニングセット及び検証セットを使用してステップＳ３０２の実行に戻り、実際の状況に基づいてその後のステップを実行して、トレーニング終了後のモデル指標がターゲット標準に達するまで、テキスト分類モデルを再トレーニングする。 For example, the updated training samples are re-divided at a preset ratio to obtain new training sets and validation sets, and the new training sets and validation sets are used to execute step S302 again, and the subsequent steps are executed based on the actual situation to retrain the text classification model until the model indicators after training reach the target standard.
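Taken together, steps S301 to S305 form a loop that can be outlined as below. All four callables (`split`, `train`, `evaluate`, `update`) are hypothetical placeholders for the corresponding steps, and `max_rounds` is an added safety limit that the description does not mention.

```python
def iterate_training(samples, split, train, evaluate, update,
                     first_threshold, second_threshold, max_rounds=10):
    """Repeat split, train, evaluate and update until both metrics reach the target."""
    model = None
    for round_no in range(1, max_rounds + 1):
        train_set, valid_set = split(samples)            # S301 / S305
        model = train(train_set, valid_set)              # S302
        recall, precision = evaluate(model)              # S303
        if recall >= first_threshold and precision >= second_threshold:
            return model, round_no                       # target standard reached
        samples = samples + update(model, recall, precision)  # S304
    return model, max_rounds
```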

なお、本開示の実施例によれば、モデルバージョンイテレーションを利用してテキスト分類モデルをオフラインでトレーニングして、エンドツーエンドセンシティブテキストリコールモデルを得ることができる。当該エンドツーエンドセンシティブテキストリコールモデルをサーバに配置すると、リンクされたアプリケーション内のテキストを直接認識してセンシティブテキストをリコールし、これによってエンドツーエンドのセンシティブテキストリコールを実現することができる。 Note that, according to an embodiment of the present disclosure, a text classification model can be trained offline using model version iteration to obtain an end-to-end sensitive text recall model. The end-to-end sensitive text recall model can be deployed on a server to directly recognize text in linked applications to recall sensitive text, thereby achieving end-to-end sensitive text recall.

In the embodiments of the present disclosure, the text classification model may include a first long short-term memory (LSTM) layer, an average pooling layer, a second LSTM layer, a max pooling layer, a concatenation (Concat) layer, a Dropout layer, and a classification layer. As an example, refer to FIG. 4, which is a schematic diagram of the architecture of the text classification model provided by the embodiments of the present disclosure. As shown in FIG. 4, the first LSTM layer extracts the text features of a sample, and the average pooling (mean-pooling) layer pools the text features to obtain a first path feature. The second LSTM layer extracts features from the output of the last hidden layer of the first LSTM layer (i.e., hn shown in FIG. 4) and inputs the extracted features to the max pooling (max-pooling) layer, which pools the output of the second LSTM layer to obtain a second path feature. The Concat layer concatenates the first path feature and the second path feature to obtain a spliced feature, the Dropout layer performs a Dropout operation on the spliced feature, and the classification layer classifies the features output by the Dropout layer to obtain a classification prediction value.
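The pooling and concatenation stage of this architecture can be illustrated in plain Python. The sketch omits the LSTM layers themselves and assumes their outputs are already available as lists of per-time-step feature vectors; the function names are hypothetical.

```python
def mean_pool(sequence):
    """Average each feature dimension over the time steps."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


def max_pool(sequence):
    """Take the maximum of each feature dimension over the time steps."""
    return [max(column) for column in zip(*sequence)]


def two_path_features(first_lstm_out, second_lstm_out):
    """Concatenate the mean-pooled first path with the max-pooled second path."""
    return mean_pool(first_lstm_out) + max_pool(second_lstm_out)
```

The concatenated vector is what the Dropout layer and then the classification layer would receive.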

なお、Ｄｒｏｐｏｕｔ層により、オーバーフィット現象の発生を効果的に予防することができ、Ｄｒｏｐｏｕｔ関数は特殊なアクティブ化関数であり、テキスト分類モデルのトレーニング段階では、Ｄｒｏｐｏｕｔ層がアクティブ化された重みの数とこのＤｒｏｐｏｕｔ層の総重みの数との比が保持確率ｋｅｅｐ＿ｐｒｏｂ（一般的に０．５をとる）と確保すべきであり、予測段階ではｋｅｅｐ＿ｐｒｏｂ＝１をとる。 In addition, the Dropout layer can effectively prevent the occurrence of overfitting. The Dropout function is a special activation function. During the training phase of the text classification model, the ratio of the number of weights activated by the Dropout layer to the total number of weights in the Dropout layer should be kept at the retention probability keep_prob (generally 0.5), and during the prediction phase, keep_prob = 1.
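The role of keep_prob can be illustrated with a small sketch. Rescaling the kept values by 1/keep_prob (so-called inverted dropout) is an assumption made here to keep the expected activation unchanged; the description only fixes the keep ratio during training and keep_prob = 1 during prediction.

```python
import random


def dropout(features, keep_prob=0.5, training=True, rng=None):
    """Keep each feature with probability keep_prob during training; pass through otherwise."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if not training or keep_prob == 1.0:
        return list(features)  # prediction stage: keep_prob = 1
    rng = rng or random.Random()
    # Kept values are rescaled (assumption: inverted dropout); dropped values become 0.
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in features]
```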

図５を参照すると、図５は、本開示の第３の実施例に係るセンシティブテキストリコール方法の概略図である。図５に示すように、この方法は以下のステップＳ５０１～Ｓ５０２を含むことができるが、これに限定されない。 Referring to FIG. 5, FIG. 5 is a schematic diagram of a sensitive text recall method according to a third embodiment of the present disclosure. As shown in FIG. 5, the method may include, but is not limited to, the following steps S501 to S502.

ステップＳ５０１、処理対象テキストを取得する。 Step S501: Obtain the text to be processed.

例えば、関連するアプリケーション内のテキスト情報を処理対象テキストとして取得することができる。 For example, text information within a related application can be obtained as the text to be processed.

ステップＳ５０２、事前トレーニングされたエンドツーエンドセンシティブテキストリコールモデルに基づいて処理対象テキストを予測して、処理対象テキストをリコールするか否かを決定する。 Step S502: Predict the text to be processed based on the pre-trained end-to-end sensitive text recall model, and determine whether to recall the text to be processed.

本開示の実施例では、エンドツーエンドセンシティブテキストリコールモデルは既に学習によりワードリストリコール能力を得たものであり、エンドツーエンドセンシティブテキストリコールモデルは本開示の実施例のいずれかによって提供される方法でトレーニングされる。 In an embodiment of the present disclosure, the end-to-end sensitive text recall model has already been trained to have word list recall capability, and the end-to-end sensitive text recall model is trained in a manner provided by any of the embodiments of the present disclosure.

例えば、処理対象テキストを事前トレーニングされたエンドツーエンドセンシティブテキストリコールモデルに入力して、このテキストを予測し、テキストにセンシティブテキストが含まれるか否かを判断し、これによって処理対象テキストをリコールするか否かを決定する。 For example, the text to be processed is input into a pre-trained end-to-end sensitive text recall model to predict the text and determine whether the text contains sensitive text, thereby determining whether the text to be processed should be recalled.
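At prediction time, the decision reduces to comparing the model output with a cut-off. In the sketch below, `predict_proba` is a hypothetical callable returning the predicted probability that a text is sensitive, and the 0.5 threshold is an assumption; the description does not state a decision threshold.

```python
def recall_sensitive_texts(texts, predict_proba, threshold=0.5):
    """Return the texts whose predicted sensitive probability reaches the threshold."""
    return [text for text in texts if predict_proba(text) >= threshold]
```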

本開示の実施例を実施することにより、事前トレーニングされたエンドツーエンドセンシティブテキストリコールモデルに基づいて処理対象テキストを予測して、処理対象テキストをリコールするか否かを決定することができ、これによってセンシティブテキストに対するリコールを向上させる。 By implementing an embodiment of the present disclosure, it is possible to predict the text to be processed based on a pre-trained end-to-end sensitive text recall model and determine whether to recall the text to be processed, thereby improving the recall for sensitive text.

図６を参照すると、図６は、本開示の実施例によって提供されるエンドツーエンドセンシティブテキストリコールモデルのトレーニング装置の概略図である。図６に示すように、この装置は、取得モジュール６０１、構築モジュール６０２及び処理モジュール６０３を備える。 Referring to FIG. 6, FIG. 6 is a schematic diagram of a training apparatus for an end-to-end sensitive text recall model provided by an embodiment of the present disclosure. As shown in FIG. 6, the apparatus includes an acquisition module 601, a construction module 602, and a processing module 603.

取得モジュール６０１は、センシティブテキストブロックシーンにおける予め設定されたワードリストと第１のランダムテキストコーパスを取得し、予め設定されたワードリスト中の用語に対応するテキストがセンシティブテキストであり、構築モジュール６０２は、予め設定されたワードリストに基づいてポジティブサンプルデータを構築し、第１のランダムテキストコーパスに基づいてネガティブサンプルデータを構築し、処理モジュール６０３は、ポジティブサンプルデータとネガティブサンプルデータに基づいて、人間による評価方式とマルチサンプルスプライスのサンプリング方式によって初期のテキスト分類モデルに対してイテレーション処理トレーニングを実行して、トレーニング終了後にモデル指標がターゲット標準に達するテキスト分類モデルを得て、及びモデル指標がターゲット標準に達するテキスト分類モデルのモデルパラメータに基づいて、エンドツーエンドセンシティブテキストリコールモデルを生成し、エンドツーエンドセンシティブテキストリコールモデルは既に学習によりワードリストリコール能力を得たものである。 The acquisition module 601 acquires a preset word list and a first random text corpus in a sensitive text block scene, and the text corresponding to the terms in the preset word list is the sensitive text. The construction module 602 constructs positive sample data according to the preset word list and constructs negative sample data according to the first random text corpus. The processing module 603 performs iterative processing training on the initial text classification model based on the positive sample data and the negative sample data by a human evaluation method and a multi-sample splice sampling method, to obtain a text classification model whose model indicator reaches the target standard after the training is completed, and generates an end-to-end sensitive text recall model based on the model parameters of the text classification model whose model indicator reaches the target standard, and the end-to-end sensitive text recall model has already obtained the word list recall ability through learning.

いくつかの実施例では、処理モジュール６０３は、具体的に、ポジティブサンプルデータとネガティブサンプルデータをトレーニングサンプルとしてトレーニングセットと検証セットに分割し、トレーニングセットと検証セットに基づいて、テキスト分類モデルをトレーニングして、最適なモデルを得て、テストセットを取得し、テストセットに基づいて最適なモデルを評価し、モデル評価結果を得て、モデル評価結果とテストセットに基づいて、人間による評価方式とマルチサンプルスプライスのサンプリング方式によってトレーニングサンプルを更新し、更新後のトレーニングサンプルをトレーニングセットと検証セットに再分割し、トレーニングセットと検証セットに基づいて、トレーニング終了後にモデル指標がターゲット標準に達するまで、テキスト分類モデルをトレーニングして、最適なモデルを得るステップを実行する。 In some embodiments, the processing module 603 specifically performs the steps of dividing the positive sample data and the negative sample data into a training set and a validation set as training samples, training a text classification model based on the training set and the validation set to obtain an optimal model, obtaining a test set, evaluating the optimal model based on the test set, obtaining a model evaluation result, updating the training samples based on the model evaluation result and the test set through a human evaluation method and a multi-sample splice sampling method, redividing the updated training samples into a training set and a validation set, and training a text classification model based on the training set and the validation set until the model index reaches the target standard after training is completed to obtain an optimal model.

いくつかの実施例では、テストセットにはリコールサンプルと第２のランダムテキストコーパスが含まれ、処理モジュール６０３は、具体的に、テストセットのうちのリコールサンプルを最適なモデルに入力し、最適なモデルから出力された第１の予測結果を取得し、第１の予測結果とリコールサンプルに対応する実際のラベル情報に基づいて、最適なモデルの再現率を決定し、テストセットのうちの第２のランダムテキストコーパスを最適なモデルに入力し、最適なモデルから出力された第２の予測結果を取得し、第２の予測結果と第２のランダムテキストコーパスに対応する実際のラベル情報に基づいて、最適なモデルの適合率を決定する。 In some embodiments, the test set includes a recall sample and a second random text corpus, and the processing module 603 specifically inputs the recall sample from the test set into the optimal model, obtains a first prediction result output from the optimal model, determines a recall rate of the optimal model based on the first prediction result and actual label information corresponding to the recall sample, inputs the second random text corpus from the test set into the optimal model, obtains a second prediction result output from the optimal model, and determines a precision rate of the optimal model based on the second prediction result and actual label information corresponding to the second random text corpus.

In some embodiments, the processing module 603 is specifically configured to: in response to the recall being less than a first threshold, obtain a first human evaluation result for the samples predicted to be negative examples in the first prediction result, and, based on the first human evaluation result, add the recall samples that were mispredicted as negative examples to a sample set to be updated; and/or, in response to the precision being less than a second threshold, obtain a second human evaluation result for the texts predicted to be positive examples in the second prediction result, and, based on the second human evaluation result, add the texts of the second random text corpus that were mispredicted as positive examples to the sample set to be updated; and splice every N samples in the sample set to be updated into one sample and update the training samples with the spliced samples, where N is an integer greater than 1.

いくつかの実施例では、Ｎは３である。 In some embodiments, N is 3.

In some embodiments, the text classification model includes a first long short-term memory (LSTM) layer, an average pooling layer, a second LSTM layer, a max pooling layer, a concatenation (Concat) layer, a Dropout layer, and a classification layer. The first LSTM layer extracts the text features of a sample; the average pooling layer pools the text features to obtain a first path feature; the second LSTM layer extracts features from the output of the last hidden layer of the first LSTM layer and inputs the extracted features to the max pooling layer; the max pooling layer pools the output of the second LSTM layer to obtain a second path feature; the Concat layer concatenates the first path feature and the second path feature to obtain a spliced feature; the Dropout layer performs a Dropout operation on the spliced feature; and the classification layer classifies the features output by the Dropout layer to obtain a classification prediction value.

According to the apparatus of the embodiments of the present disclosure, the text classification model can be iteratively trained on the constructed positive sample data and negative sample data to obtain a text recall model, which improves the knowledge generalization ability of the text recall model and thereby its ability to recall sensitive text.
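The iterative training mentioned above can be pictured as a loop that trains, evaluates, and updates the samples until the metrics reach the target. The following Python sketch is only a schematic: `train`, `evaluate`, and `update` are hypothetical callables supplied by the caller, and the `max_rounds` safety limit is an assumption that does not appear in the source.

```python
from typing import Any, Callable, List, Tuple


def train_until_target(
    samples: List[str],
    train: Callable[[List[str]], Any],
    evaluate: Callable[[Any], Tuple[float, float]],
    update: Callable[[List[str], Any, float, float], List[str]],
    target_recall: float,
    target_precision: float,
    max_rounds: int = 10,
) -> Any:
    """Retrain on updated samples until recall and precision reach the targets."""
    model = train(samples)
    for _ in range(max_rounds):
        recall, precision = evaluate(model)
        if recall >= target_recall and precision >= target_precision:
            break  # the model metrics have reached the target standard
        # Human evaluation and multi-sample splice sampling would happen here.
        samples = update(samples, model, recall, precision)
        model = train(samples)
    return model
```

Passing the three steps in as callables keeps the loop independent of any particular model or evaluation tooling.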

FIG. 7 is a schematic diagram of a sensitive text recall apparatus provided by an embodiment of the present disclosure. As shown in FIG. 7, the apparatus includes an acquisition module 701 and a prediction module 702. The acquisition module 701 acquires a text to be processed. The prediction module 702 predicts the text to be processed based on a pre-trained end-to-end sensitive text recall model to determine whether to recall the text to be processed. The end-to-end sensitive text recall model has acquired word list recall ability through learning and is trained by the method described in any of the embodiments of the present disclosure.

In some embodiments, the prediction module 702 is specifically configured to: extract text features of the text to be processed through a first long short-term memory (LSTM) layer; pool the text features through an average pooling layer to obtain a first path feature; perform feature extraction on the output of the last hidden layer of the first LSTM layer through a second LSTM layer and input the extracted features to a max pooling layer; pool the output of the second LSTM layer through the max pooling layer to obtain a second path feature; splice the first path feature and the second path feature to obtain a spliced feature; perform a Dropout operation on the spliced feature through a Dropout layer; classify the features output by the Dropout layer through a classification layer to obtain a classification prediction value; and determine, based on the prediction value, whether to recall the text to be processed.

According to the apparatus of the embodiments of the present disclosure, the text to be processed can be predicted based on a pre-trained end-to-end sensitive text recall model to determine whether to recall it, which improves the recall of sensitive text.
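The final decision step, turning a prediction value into a recall decision, might look like the following Python sketch. The 0.5 threshold, the function names, and the `score` callable standing in for the trained model are assumptions for illustration; the source only states that the decision is based on the prediction value.

```python
from typing import Callable, Iterable, List


def should_recall(predicted_value: float, threshold: float = 0.5) -> bool:
    """Recall the text when the predicted value reaches the threshold."""
    return predicted_value >= threshold


def filter_recalled(texts: Iterable[str],
                    score: Callable[[str], float],
                    threshold: float = 0.5) -> List[str]:
    """Return the texts that the model would recall.

    `score` is a hypothetical stand-in for the trained recall model and
    is expected to return a value between 0 and 1 for each text.
    """
    return [text for text in texts if should_recall(score(text), threshold)]
```

In practice the threshold would be tuned against the recall rate and precision targets rather than fixed in advance.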

For the apparatus of the above embodiments, the specific manner in which each module performs its operations has already been described in detail in the method embodiments and is not repeated here.

According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a computer-readable storage medium, and a computer program.

FIG. 8 is a block diagram of an electronic device according to an embodiment of the present disclosure. The electronic device can be used to implement the method for training an end-to-end sensitive text recall model, or the sensitive text recall method, of any of the embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, mobile phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.

As shown in FIG. 8, the electronic device includes one or more processors 801, a memory 802, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as needed. The processor can process instructions executed within the electronic device, including instructions stored in the memory for displaying graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if necessary. Similarly, multiple electronic devices may be connected, each providing some of the necessary operations (for example, as a server array, a group of blade servers, or a multi-processor system). FIG. 8 takes one processor 801 as an example.

The memory 802 is the non-transitory computer-readable storage medium provided by the present disclosure. The memory stores instructions executable by at least one processor, so that the at least one processor executes the method for training an end-to-end sensitive text recall model, or the sensitive text recall method, provided by the present disclosure. The non-transitory computer-readable storage medium of the present disclosure stores computer instructions for causing a computer to execute the method for training an end-to-end sensitive text recall model, or the sensitive text recall method, provided by the present disclosure.

As a non-transitory computer-readable storage medium, the memory 802 stores non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for training an end-to-end sensitive text recall model in the embodiments of the present disclosure (for example, the acquisition module 601, the construction module 602, and the processing module 603 shown in FIG. 6), or the program instructions/modules corresponding to the sensitive text recall method (for example, the acquisition module 701 and the prediction module 702 shown in FIG. 7). By running the non-transitory software programs, instructions, and modules stored in the memory 802, the processor 801 executes the various functional applications and data processing of the server, that is, implements the method for training an end-to-end sensitive text recall model, or the sensitive text recall method, of the above method embodiments.

The memory 802 may include a program storage area and a data storage area. The program storage area may store an operating system and an application required by at least one function. The data storage area may store data created through use of the electronic device for the method for training an end-to-end sensitive text recall model or for the sensitive text recall method. The memory 802 may include high-speed random access memory and may further include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 802 may optionally include memory located remotely from the processor 801, and such remote memory may be connected via a network to the electronic device for the method for training an end-to-end sensitive text recall model or for the sensitive text recall method. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The electronic device for the method for training an end-to-end sensitive text recall model, or for the sensitive text recall method, may further include an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or in other manners; FIG. 8 takes connection by a bus as an example.

The input device 803 can receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the method for training an end-to-end sensitive text recall model or for the sensitive text recall method. Examples of the input device include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 804 may include a display device, an auxiliary lighting device (for example, an LED), a haptic feedback device (for example, a vibration motor), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some embodiments, the display device may be a touch screen.

Various embodiments of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These embodiments may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor, and it can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

These computing programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, a magnetic disk, an optical disk, a memory, or a programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide interaction with a user, the systems and techniques described herein can be implemented on a computer that has a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also provide interaction with the user. For example, the feedback provided to the user can be any form of sensory feedback (for example, visual, auditory, or haptic feedback), and input from the user can be received in any form (including acoustic, speech, or tactile input).

The systems and techniques described herein can be implemented in a computing system that includes a back-end component (for example, a data server), a computing system that includes a middleware component (for example, an application server), a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with embodiments of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, and front-end components. The components of the system can be interconnected by any form or medium of digital data communication (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), the Internet, and a blockchain network.

A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the difficult management and weak business scalability of traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.

According to the technical solutions of the embodiments of the present disclosure, the text classification model can be iteratively trained on the constructed positive sample data and negative sample data to obtain a text recall model, which improves the knowledge generalization ability of the text recall model and thereby its ability to recall sensitive text.

The above description of the method for training an end-to-end sensitive text recall model also applies to the apparatus, electronic device, computer-readable storage medium, and computer program of the embodiments of the present disclosure, and is not repeated here.

It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, and no limitation is imposed herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved. The specific embodiments described above do not limit the scope of protection of the present disclosure.

Those skilled in the art should understand that various modifications, combinations, subcombinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims

1. A method for training an end-to-end sensitive text recall model, comprising:
obtaining a preset word list and a first random text corpus in a sensitive text blocking scenario, wherein text corresponding to a term in the preset word list is sensitive text;
constructing positive sample data based on the preset word list and constructing negative sample data based on the first random text corpus;
iteratively training an initial text classification model according to the positive sample data and the negative sample data through a human evaluation scheme and a multi-sample splice sampling scheme, to obtain a text classification model whose model metrics reach a target standard after training; and
generating an end-to-end sensitive text recall model based on the model parameters of the text classification model whose model metrics reach the target standard, wherein the end-to-end sensitive text recall model has acquired word list recall ability through learning.

2. The method for training an end-to-end sensitive text recall model of claim 1, wherein iteratively training the initial text classification model according to the positive sample data and the negative sample data through the human evaluation scheme and the multi-sample splice sampling scheme comprises:
dividing the positive sample data and the negative sample data, as training samples, into a training set and a validation set;
training a text classification model based on the training set and the validation set to obtain an optimal model;
obtaining a test set and evaluating the optimal model based on the test set to obtain a model evaluation result;
updating the training samples according to the model evaluation result and the test set through the human evaluation scheme and the multi-sample splice sampling scheme; and
re-dividing the updated training samples into a training set and a validation set, and training the text classification model based on the training set and the validation set until the model metrics reach the target standard after training, thereby obtaining the optimal model.

3. The method for training an end-to-end sensitive text recall model of claim 2, wherein the test set includes recall samples and a second random text corpus, and evaluating the optimal model based on the test set to obtain the model evaluation result comprises:
inputting the recall samples of the test set into the optimal model and obtaining a first prediction result output by the optimal model;
determining a recall rate of the optimal model based on the first prediction result and actual label information corresponding to the recall samples;
inputting the second random text corpus of the test set into the optimal model and obtaining a second prediction result output by the optimal model; and
determining a precision of the optimal model based on the second prediction result and actual label information corresponding to the second random text corpus.

4. The method for training an end-to-end sensitive text recall model of claim 3, wherein updating the training samples according to the model evaluation result and the test set through the human evaluation scheme and the multi-sample splice sampling scheme comprises:
in response to the recall rate being less than a first threshold, obtaining a first human evaluation result for the items in the first prediction result that were predicted to be negative examples, and adding, based on the first human evaluation result, the recall samples that were mispredicted as negative examples to a sample set to be updated; and/or
in response to the precision being less than a second threshold, obtaining a second human evaluation result for the items in the second prediction result that were predicted to be positive examples, and adding, based on the second human evaluation result, the text corpora of the second random text corpus that were mispredicted as positive examples to the sample set to be updated; and
splicing every N samples in the sample set to be updated into one sample, and updating the training samples with the samples obtained by the splicing, where N is an integer greater than 1.

5. The method for training an end-to-end sensitive text recall model according to claim 4, wherein N is 3.

6. The method for training an end-to-end sensitive text recall model according to claim 1, wherein the text classification model includes a first long short-term memory (LSTM) layer, an average pooling layer, a second LSTM layer, a max pooling layer, a Concat (splice) layer, a Dropout layer, and a classification layer, and wherein:
the first LSTM layer extracts text features of the samples;
the average pooling layer pools the text features to obtain a first path feature;
the second LSTM layer performs feature extraction on the output of the last hidden layer of the first LSTM layer and inputs the extracted features to the max pooling layer;
the max pooling layer pools the output of the second LSTM layer to obtain a second path feature;
the Concat layer splices the first path feature and the second path feature to obtain a spliced feature;
the Dropout layer performs a Dropout operation on the spliced feature; and
the classification layer classifies the features output by the Dropout layer to obtain a classification prediction value.

7. A sensitive text recall method, comprising:
obtaining a text to be processed; and
predicting the text to be processed based on a pre-trained end-to-end sensitive text recall model to determine whether to recall the text to be processed,
wherein the end-to-end sensitive text recall model has acquired word list recall ability through learning, and the end-to-end sensitive text recall model is trained according to the method of claim 1.

8. The method of claim 7, wherein predicting the text to be processed based on the pre-trained end-to-end sensitive text recall model to determine whether to recall the text to be processed comprises:
extracting text features of the text to be processed through a first LSTM layer;
pooling the text features through an average pooling layer to obtain a first path feature;
performing feature extraction on the output of the last hidden layer of the first LSTM layer through a second LSTM layer, and inputting the extracted features into a max pooling layer;
pooling the output of the second LSTM layer through the max pooling layer to obtain a second path feature;
splicing the first path feature and the second path feature to obtain a spliced feature, and performing a Dropout operation on the spliced feature through a Dropout layer;
classifying the features output by the Dropout layer through a classification layer to obtain a classification prediction value; and
determining whether to recall the text to be processed based on the prediction value.

9. An apparatus for training an end-to-end sensitive text recall model, comprising:
an acquisition module for acquiring a preset word list and a first random text corpus in a sensitive text blocking scenario, wherein text corresponding to a term in the preset word list is sensitive text;
a construction module for constructing positive sample data based on the preset word list and constructing negative sample data based on the first random text corpus; and
a processing module for iteratively training an initial text classification model according to the positive sample data and the negative sample data through a human evaluation scheme and a multi-sample splice sampling scheme to obtain a text classification model whose model metrics reach a target standard after training, and for generating an end-to-end sensitive text recall model based on the model parameters of the text classification model whose model metrics reach the target standard, wherein the end-to-end sensitive text recall model has acquired word list recall ability through learning.

10. A sensitive text recall apparatus, comprising:
an acquisition module for acquiring a text to be processed; and
a prediction module for predicting the text to be processed based on a pre-trained end-to-end sensitive text recall model to determine whether to recall the text to be processed,
wherein the end-to-end sensitive text recall model has acquired word list recall ability through learning, and the end-to-end sensitive text recall model is trained according to the method of any one of claims 1 to 6.

11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor,
wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, cause the at least one processor to perform the method for training an end-to-end sensitive text recall model or the sensitive text recall method.

12. A non-transitory computer-readable storage medium having computer instructions stored thereon, wherein the computer instructions cause a computer to perform the method of any one of claims 1 to 6 or the method of claim 7 or 8.

13. A computer program which, when executed on a computer, causes the computer to carry out the method according to any one of claims 1 to 6 or the method according to claim 7 or 8.