
WO2019228016A1 - Intelligent writing method and apparatus - Google Patents


Info

Publication number
WO2019228016A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
written
writing
texts
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2019/077443
Other languages
French (fr)
Chinese (zh)
Inventor
龙翀
秦昊
王雅芳
张晓彤
姚琳琳
韩非吾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Publication of WO2019228016A1
Anticipated expiration
Legal status: Ceased

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • G06F40/186Templates
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the embodiments of the present specification relate to the field of machine learning technology, and more particularly, to a method and device for training a writing model, and a writing method and device.
  • the embodiments of the present specification aim to provide a more effective intelligent writing method to solve the deficiencies in the prior art.
  • an aspect of the present specification provides a method for training a writing model
  • the writing model includes a word segmentation unit, a conversion unit, and a recurrent neural network
  • the method includes: obtaining a sample for training the writing model, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is smooth; and training the writing model with the sample so that, after the text is input to the writing model, the absolute value of the difference between the model's output value and the calibration value is smaller after training than before training
  • after the text is input to the writing model, the word segmentation unit obtains from the text a plurality of word segments arranged in order according to their positions in the text, the conversion unit converts the ordered word segments into a correspondingly ordered plurality of word vectors, and the recurrent neural network outputs the output value of the writing model based on the ordered word vectors
  • in the method for training a writing model, the text includes two non-adjacent texts from one existing article, and the calibration value is 0
  • in the method for training a writing model, the text includes two texts belonging to two different existing articles, and the calibration value is 0
  • in the method for training a writing model, the text includes two adjacently arranged texts from one existing article in their original order, and the calibration value is 1
  • the existing article is an existing article belonging to a selected field
  • the writing model is used, after the training, for writing articles belonging to the selected field
  • training the writing model using the samples includes using the samples to train the writing model by a back propagation algorithm.
  • the recurrent neural network includes one of the following networks: RNN, LSTM, and GRU.
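As a concrete illustration of the three-part model just described (word segmentation unit, conversion unit, recurrent network), the following is a minimal Python sketch assuming a PyTorch-style LSTM; the `segment` tokenizer and the vocabulary handling are illustrative placeholders, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

def segment(text):
    # Placeholder word segmentation unit; a real system would use a
    # dictionary-based or deep-learning segmenter (e.g., for Chinese).
    return text.split()

class WritingModel(nn.Module):
    """Word segments -> ordered word vectors -> recurrent net -> score."""
    def __init__(self, vocab, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.vocab = vocab                                  # word -> id map
        self.embed = nn.Embedding(len(vocab), embed_dim)    # conversion unit
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, text):
        # OOV words fall back to id 0 in this sketch.
        ids = [self.vocab.get(w, 0) for w in segment(text)]
        vectors = self.embed(torch.tensor([ids]))           # (1, seq, embed)
        _, (h_n, _) = self.rnn(vectors)                     # final hidden state
        return torch.sigmoid(self.head(h_n[-1]))            # smoothness in (0, 1)
```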
  • another aspect of the present specification provides a writing method, including: obtaining n first candidate texts for a first to-be-written text of an article to be written and m second candidate texts for a second to-be-written text, where n and m are predetermined numbers and, in the article to be written, the first to-be-written text is located in front of and adjacent to the second to-be-written text; combining each of the first candidate texts with each of the second candidate texts to obtain n × m input texts, where in each input text the first candidate text is located in front of the second candidate text; inputting the n × m input texts into a writing model trained by the above training method to obtain n × m model output values corresponding to the n × m input texts, the model output values predicting the smoothness of the corresponding input texts; and, based on the n × m model output values, determining the selected text of the first to-be-written text from the n first candidate texts and the selected text of the second to-be-written text from the m second candidate texts
  • another aspect of the present specification provides a writing method, including: obtaining q candidate texts for a to-be-written text in an article to be completed, where q is a predetermined number; obtaining from the article to be completed an existing text of a predetermined length that is adjacent to and in front of the to-be-written text; combining the existing text with each of the q candidate texts to obtain q input texts, where in each input text the existing text is located in front of the candidate text; inputting the q input texts into a writing model trained by the above training method to obtain q model output values corresponding to the q input texts, the model output values predicting the smoothness of the corresponding input texts; and, based on the q model output values, determining the selected text of the to-be-written text from the q candidate texts
  • the writing method further includes, after determining the selected text, performing synonymous replacement of words and/or sentences in the selected text
  • obtaining the q candidate texts of the to-be-written text in the article to be completed includes obtaining at least one keyword of the to-be-written text, and obtaining the q candidate texts of the to-be-written text based on the keywords
  • the writing method further includes, before obtaining at least one keyword of the to-be-written text, obtaining keywords of the topic of the article to be completed
  • obtaining the q candidate texts of the to-be-written text based on the keywords includes searching a corpus corresponding to the to-be-written text based on the keywords according to a predetermined search and ranking method, thereby obtaining the q candidate texts
  • a corpus corresponding to the to-be-written text includes at least one of the following: a corpus corresponding to the field of the to-be-written text, and a corpus corresponding to the form of the to-be-written text
  • the writing model includes a word segmentation unit, a conversion unit, and a recurrent neural network.
  • the device includes: an obtaining unit configured to obtain a sample for training the writing model, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is smooth; and a training unit configured to train the writing model with the sample so that, after the text is input to the writing model, the absolute value of the difference between the model's output value and the calibration value is smaller after training than before training
  • after the text is input to the writing model, the word segmentation unit obtains from the text a plurality of word segments arranged in order according to their positions in the text, the conversion unit converts the ordered word segments into a correspondingly ordered plurality of word vectors, and the recurrent neural network outputs the output value of the writing model based on the ordered word vectors
  • the training unit is further configured to use the samples to train the writing model by a back propagation algorithm.
  • a writing device including: an obtaining unit configured to obtain n first candidate texts for a first to-be-written text of an article to be written and m second candidate texts for a second to-be-written text, where n and m are predetermined numbers and, in the article to be written, the first to-be-written text is located in front of and adjacent to the second to-be-written text; a combining unit configured to combine each of the first candidate texts with each of the second candidate texts to obtain n × m input texts, where in each input text the first candidate text is located in front of the second candidate text; an input unit configured to input the n × m input texts into a writing model trained by the above training method to obtain n × m model output values corresponding to the n × m input texts, the model output values predicting the smoothness of the corresponding input texts; and a determining unit configured to determine, based on the n × m model output values, the selected text of the first to-be-written text from the n first candidate texts and the selected text of the second to-be-written text from the m second candidate texts
  • a writing device including: a first obtaining unit configured to obtain q candidate texts for a to-be-written text in an article to be completed, where q is a predetermined number; a second obtaining unit configured to obtain from the article to be completed an existing text of a predetermined length that is adjacent to and in front of the to-be-written text; a combining unit configured to combine the existing text with each of the q candidate texts to obtain q input texts, where in each input text the existing text is located in front of the candidate text; an input unit configured to input the q input texts into a writing model trained by the above training method to obtain q model output values corresponding to the q input texts, the model output values predicting the smoothness of the corresponding input texts; and a determining unit configured to determine, based on the q model output values, the selected text of the to-be-written text from the q candidate texts
  • the writing device further includes a replacement unit configured to, after determining the selected text, perform synonymous replacement of words and/or sentences in the selected text
  • the first obtaining unit includes a first obtaining subunit configured to obtain at least one keyword of the to-be-written text, and a second obtaining subunit configured to obtain the q candidate texts of the to-be-written text based on the keywords
  • the writing device further includes a third obtaining unit configured to obtain keywords of the topic of the article to be completed before at least one keyword of the to-be-written text is obtained
  • the second obtaining subunit is further configured to search a corpus corresponding to the to-be-written text based on the keywords according to a predetermined search and ranking method, thereby obtaining the q candidate texts
  • the intelligent manuscript-writing scheme proposes an algorithmic framework that can efficiently solve the intelligent writing problem; it solves the problem of corpus reuse through synonymous replacement and enriches the content of the article by inserting sentences after paragraphs. Compared with overly automated manuscript-writing methods, this scheme adds manual intervention, which greatly improves practicability and accuracy; compared with purely template-based methods, it greatly enhances flexibility and extensibility
  • FIG. 1 shows a schematic diagram of a system 100 according to an embodiment of the present specification
  • FIG. 2 shows a method for training a writing model according to an embodiment of the present specification
  • FIG. 3 illustrates a writing method according to an embodiment of the present specification
  • FIG. 4 shows a writing method according to an embodiment of the present specification
  • FIG. 5 illustrates an apparatus 500 for training a writing model according to an embodiment of the present specification
  • FIG. 6 illustrates a writing device 600 according to an embodiment of the present specification.
  • FIG. 7 illustrates a writing device 700 according to an embodiment of the present specification.
  • FIG. 1 shows a schematic diagram of a system 100 according to an embodiment of the present specification.
  • the system 100 includes a writing model 11 and a selection unit 12.
  • the writing model 11 further includes a word segmentation unit 111, a conversion unit 112, and a recurrent neural network 113.
  • a training sample is obtained, the training sample includes a text and a calibration value for the text, and the calibration value indicates whether the text is smooth.
  • then, using the training sample, the writing model 11 is trained by, for example, the BPTT algorithm (back propagation through time), so that, after the text is input to the writing model, the absolute value of the difference between the model's output value after training and the calibration value is smaller than before training
  • the word segmentation unit 111 performs word segmentation on the text in the sample to obtain a plurality of sequenced word segmentations.
  • the conversion unit 112 converts the word segments into correspondingly ordered word vectors and inputs them in sequence to the recurrent neural network 113
  • the recurrent neural network 113 calculates the word vectors that are sequentially input, thereby outputting a calculation result.
  • in the process of training the model, the parameters of the recurrent neural network 113 are adjusted, based on the sequentially input word vectors and the calibration value, according to, for example, the BPTT algorithm, so that after the plurality of word vectors are sequentially input to the network, its output value is closer to the calibration value
  • multiple training samples may be obtained to train the writing model multiple times, that is, to adjust the parameters of the recurrent neural network 113 multiple times, so that the output value of the writing model for an input sample approaches that sample's calibration value as closely as possible
  • in the writing phase, n first candidate paragraphs for the first paragraph and m second candidate paragraphs for the second paragraph can be obtained by searching; they are combined pairwise to obtain n × m input texts, and the n × m input texts are input to the writing model 11 respectively
  • for each input text, the word segmentation unit 111 first performs word segmentation on the text to obtain sequentially arranged word segments, and the conversion unit 112 then converts this group of word segments into correspondingly ordered word vectors and inputs them in sequence to the recurrent neural network 113
  • the recurrent neural network 113 is a network trained by the above training process; it computes on the sequentially input word vectors, outputs the corresponding output values, and sends them to the selection unit 12. After obtaining the n × m output values corresponding to the n × m input texts, the selection unit 12 may select the input text with the largest output value as the first and second paragraphs of the article to be written
  • after that, q third candidate paragraphs for the third paragraph can be obtained by searching. The text of the selected second paragraph and the q third candidate paragraphs are then respectively combined to obtain q input texts. Similarly, by inputting the q input texts to the writing model 11 respectively, q model output values corresponding to the q input texts can be obtained. Finally, the selection unit 12 selects, from the q third candidate paragraphs, the paragraph corresponding to the largest model output value as the third paragraph. Subsequent paragraphs of the article to be written can be obtained in the same way
  • the system 100 according to the embodiment of the present specification shown in FIG. 1 is merely exemplary.
  • the system according to the embodiment of the present specification is not limited to the system 100.
  • the system 100 may further include a search unit for searching for texts and a combination unit for combining texts
  • the system 100 may also include other units, such as a replacement unit for synonymous replacement of words and/or sentences in the selected text of the article to be written, and so on
  • the writing model 11 is not limited to processing the paragraphs of an article to be written; for example, based on a corresponding corpus, sentences can be inserted into existing articles to supplement their content, and the like
  • FIG. 2 illustrates a method for training a writing model according to an embodiment of the present specification.
  • the writing model includes a word segmentation unit, a transformation unit, and a recurrent neural network.
  • the method includes:
  • in step S21, a sample for training the writing model is obtained, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is smooth;
  • in step S22, the sample is used to train the writing model, so that after the text is input to the writing model, the absolute value of the difference between the model's output value after training and the calibration value is smaller than before training, where, after the text is input to the writing model, the word segmentation unit obtains from the text a plurality of word segments arranged in order according to their positions in the text, the conversion unit converts the ordered word segments into correspondingly ordered word vectors, and the recurrent neural network outputs the output value of the writing model based on the ordered word vectors.
  • a sample for training the writing model is obtained, the sample includes a text and a calibration value for the text, and the calibration value indicates whether the text is smooth.
  • the writing model includes a word segmentation unit, a conversion unit, and a recurrent neural network.
  • the recurrent neural network is, for example, RNN, LSTM, GRU, or the like.
  • the common feature of these recurrent neural networks is that the final output value is obtained by sequentially inputting a set of vectors, where the calculation for the vector input at a given time step takes into account the vectors input before that time step; that is, the recurrent neural network characterizes the dependency between the current output of a sequence and the preceding information
  • two non-contiguous texts from one existing article can be combined into a sample text; in this case, since the two texts are non-contiguous in the original article, their combination is not smooth, so the sample text is given a calibration value of 0, meaning not smooth
  • alternatively, one text may be obtained from each of two different existing articles and the two combined into a sample text; similarly, the sample text is not smooth and its calibration value is also 0
  • or, two adjacently arranged texts are obtained from one existing article and combined into a sample text in their original order; since the two texts are adjacent in the existing article and arranged in the original order, the sample text is smooth, so it is given a calibration value of 1, meaning smooth
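A hedged sketch of how such labeled samples could be assembled from existing articles, following the three rules above; the paragraph-level granularity and the sampling ratios are assumptions made for the example.

```python
import random

def make_samples(articles):
    """articles: list of articles, each a list of paragraph strings.
    Returns (text, calibration value) pairs per the three rules above."""
    samples = []
    for art in articles:
        # Adjacent paragraphs in original order -> smooth, label 1.
        for i in range(len(art) - 1):
            samples.append((art[i] + art[i + 1], 1))
        # Two non-adjacent paragraphs from the same article -> label 0.
        if len(art) > 2:
            i, j = sorted(random.sample(range(len(art)), 2))
            if j - i > 1:
                samples.append((art[i] + art[j], 0))
    # Paragraphs taken from two different articles -> label 0
    # (assumes at least two articles are available).
    if len(articles) >= 2:
        for _ in range(len(articles)):
            a, b = random.sample(articles, 2)
            samples.append((random.choice(a) + random.choice(b), 0))
    return samples
```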
  • the text selected from the existing article can be a sentence or a paragraph, which can be used to train a writing model.
  • the writing model is trained repeatedly by obtaining samples on the order of 100,000 to one million, that is, the method shown in FIG. 2 is performed repeatedly, thereby training a more accurate writing model
  • the existing articles can be obtained from a website, can be obtained from a search, can be obtained from a database, and the like.
  • existing articles in a selected field may be obtained to train a writing model for writing articles in the selected field.
  • the field can be divided in different dimensional spaces. For example, in the dimensional space of content, the field can include entertainment, sports, technology, finance and other categories.
  • in the dimensional space of media form, the field can include Weibo, WeChat, Zhihu, Toutiao, and many other categories
  • multiple existing articles can be obtained from existing entertainment news, and training samples can be obtained from them to train a writing model.
  • the trained writing model can be used to write articles in the field of entertainment news.
  • in step S22, the sample is used to train the writing model, so that after the text is input to the writing model, the absolute value of the difference between the model's output value after training and the calibration value is smaller than before training
  • the segmentation unit obtains a plurality of segmented words arranged sequentially from the text, wherein the plurality of segmented words are sequentially arranged according to their forward and backward positions in the text.
  • word segmentation may be performed, for example, by comparison against a word dictionary, by a deep learning model, and the like
  • the conversion unit converts the sequentially arranged word segments into correspondingly ordered word vectors; the segments can be converted into the corresponding word vectors, that is, dense vectors that a computer can process, by a specific model, for example a word2vec (word-to-vector) model
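For instance, the conversion could be done with a word2vec model along the following lines; this is a sketch using gensim 4.x parameter names, and the tiny corpus is a placeholder.

```python
from gensim.models import Word2Vec

# Placeholder tokenized corpus; in practice this would be the segmented
# articles of the selected field.
tokenized_corpus = [["今天", "天气", "很", "好"],
                    ["明天", "天气", "也", "好"]]
w2v = Word2Vec(tokenized_corpus, vector_size=100, window=5, min_count=1)

def to_word_vectors(tokens):
    # Ordered word segments -> correspondingly ordered dense vectors.
    return [w2v.wv[t] for t in tokens if t in w2v.wv]
```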
  • the recurrent neural network outputs the output value of the writing model based on the sequentially arranged word vectors; that is, the recurrent neural network performs a calculation for each input word vector, and this calculation is based not only on the features of that word vector but also on the word vectors before it. Therefore, after performing the corresponding calculations on the multiple sequentially input word vectors, the output result reflects the dependencies among them, which can be used to measure the smoothness of the text
  • in one embodiment, the model is trained by a back propagation algorithm (the BPTT algorithm)
  • the training process can be divided into three steps: forward calculation, calculation of loss value and backward calculation.
  • in the forward calculation, the existing model is used to compute the output value after the plurality of word vectors are sequentially input to the model, together with the intermediate result of each step of the calculation
  • the loss value at each step is then calculated, and the per-step loss values are accumulated and averaged; for example, for a sample with a calibration value of 1, subtracting the model output value from 1 gives the loss value of the last step of the model calculation
  • from the parameters of the neurons in the last layer, the loss value of each neuron in the preceding layer is derived layer by layer
  • the backward calculation is performed based on the results of the forward calculation, and the gradient value of each step is accumulated
  • finally, the gradient descent method is used to adjust the parameters of the existing model, so that after the adjustment the model's output on the multiple word vectors is closer to the calibration value (for example, "1")
  • the training method is not limited to the above back propagation algorithm; the model can also be trained by existing improved variants of BPTT, or by using deep learning libraries such as TensorFlow, Torch, Theano, and the like
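A minimal training loop consistent with the three phases above might look as follows, reusing the `WritingModel` sketch from earlier; binary cross-entropy and the Adam optimizer are stand-ins for the loss and gradient-descent step, since the text does not fix these choices.

```python
import torch

def train(model, samples, epochs=5, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.BCELoss()
    for _ in range(epochs):
        for text, label in samples:
            out = model(text).view(1)                     # forward calculation
            loss = loss_fn(out, torch.tensor([float(label)]))
            opt.zero_grad()
            loss.backward()   # backward pass through the unrolled RNN (BPTT)
            opt.step()        # parameter update toward the calibration value
```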
  • FIG. 3 illustrates a writing method according to an embodiment of the present specification, including:
  • in step S31, n first candidate texts for the first to-be-written text of the article to be written and m second candidate texts for the second to-be-written text are obtained, where n and m are predetermined numbers, and where, in the article to be written, the first to-be-written text is located in front of and adjacent to the second to-be-written text;
  • in step S32, each of the first candidate texts is combined with each of the second candidate texts, thereby obtaining n × m input texts, where, in each input text, the first candidate text is located in front of the second candidate text;
  • in step S33, the n × m input texts are respectively input into a writing model trained by the model training method described above, to obtain n × m model output values corresponding to the n × m input texts, the model output values predicting the smoothness of the corresponding input texts;
  • in step S34, based on the n × m model output values, the selected text of the first to-be-written text is determined from the n first candidate texts, and the selected text of the second to-be-written text is determined from the m second candidate texts.
  • first, in step S31, n first candidate texts for the first to-be-written text of the article to be written and m second candidate texts for the second to-be-written text are obtained, where n and m are predetermined numbers, and the first to-be-written text is located in front of and adjacent to the second to-be-written text.
  • the writing method shown in FIG. 3 is implemented with a writing model trained by the training method shown in FIG. 2; for example, during the training phase the writing model is trained on a corpus (training texts) in the field of entertainment news, that is, the writing model is a model for writing entertainment news
  • n candidate texts corresponding to the first paragraph and m candidate texts corresponding to the second paragraph may be searched in a corpus corresponding to the entertainment news field and corresponding to the paragraphs.
  • n and m are predetermined numbers, which may be on the order of tens or hundreds, for example.
  • a corpus of a specific field and a specific form may be formed by obtaining articles in the specific field (for example, the entertainment news field) and extracting texts of a specific form (for example, paragraphs or sentences) from those articles
  • the above method for obtaining candidate text is only schematic.
  • the first to-be-written text and the second to-be-written text are not limited to paragraphs, and are not limited to the first and second paragraphs of the article to be written, and may be, for example, two sentences or two parts of text in a paragraph.
  • the method for obtaining candidate text is not limited to search and acquisition through keywords, for example, it can be obtained through manual finishing, or candidate text can be obtained through machine learning, and so on.
  • the corpus is not limited to the above-mentioned corpus in a specific field and a specific form.
  • the corpus may be a corpus corresponding only to a domain, and correspondingly, the candidate text searched by the corpus is not limited to a specific form, for example, it may be a sentence or a paragraph.
  • the corpus may be a corpus corresponding only to the form of the text to be written.
  • the candidate text searched by the corpus is not limited to a specific field, for example, it may be in the entertainment field or the financial field.
  • the corpus included in the corpus can be set according to the needs of specific scenarios.
  • the field of the corpus is selected according to the applicable field of the writing model.
  • for example, if the writing model is a model for writing articles in the field of entertainment news, then when writing with this model, the corpus of the entertainment news field is selected to search for candidate texts for the text to be written
  • in step S32, each of the first candidate texts is combined with each of the second candidate texts, thereby obtaining n × m input texts, where, in each input text, the first candidate text is located in front of the second candidate text
  • for example, suppose a candidate text for the first paragraph is "a, b, c." and a candidate text for the second paragraph is "d, e."; the candidate text for the first paragraph is placed in front of the candidate text for the second paragraph, giving an input text "a, b, c. d, e."
  • in this way, n × m input texts can be obtained
  • in step S33, the n × m input texts are respectively input into a writing model trained by the model training method described above, to obtain n × m model output values corresponding to the n × m input texts, the model output values predicting the smoothness of the corresponding input texts
  • for each of the n × m input texts, after it is input into the writing model trained by the above model training method, the writing model outputs a calculation result through the process described above with reference to FIG. 2: the word segmentation unit in the writing model obtains from the text a plurality of word segments arranged in order according to their positions in the text
  • the conversion unit converts the sequentially arranged word segments into correspondingly ordered word vectors, for example through a word2vec model
  • the recurrent neural network outputs an output value of the writing model based on the sequentially arranged plurality of word vectors.
  • the output value of the writing model for the input text indicates the smoothness of the input text: the closer the output value is to 1, the smoother the input text; the closer the output value is to 0, the less smooth the input text
  • in step S34, based on the n × m model output values, the selected text of the first to-be-written text is determined from the n first candidate texts, and the selected text of the second to-be-written text is determined from the m second candidate texts
  • for example, the input text corresponding to the maximum of the n × m model output values contains the selected texts: the first-paragraph candidate it contains is the selected text of the first paragraph, and the second-paragraph candidate it contains is the selected text of the second paragraph. That is, the smoothest of the pairwise combinations of the n first candidate texts and the m second candidate texts is used as the selected text of the article to be written
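Steps S31-S34 could be sketched as follows; `model` is assumed to return a smoothness score in (0, 1) as in the earlier sketch, and the candidate lists come from the corpus search described above.

```python
from itertools import product

def pick_first_two(model, first_candidates, second_candidates):
    """Combine every first candidate with every second candidate
    (n x m input texts, first text in front), score each with the
    writing model, and return the smoothest pair."""
    best_pair, best_score = None, float("-inf")
    for a, b in product(first_candidates, second_candidates):
        score = float(model(a + b))        # model output ~ smoothness
        if score > best_score:
            best_pair, best_score = (a, b), score
    return best_pair
```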
  • FIG. 4 illustrates a writing method according to an embodiment of the present specification, including:
  • in step S41, q candidate texts for the to-be-written text in the article to be completed are obtained, where q is a predetermined number;
  • in step S42, an existing text of a predetermined length that is adjacent to and in front of the to-be-written text is obtained from the article to be completed;
  • in step S43, the existing text is combined with each of the q candidate texts to obtain q input texts, where, in each input text, the existing text is located in front of the candidate text;
  • in step S44, the q input texts are respectively input into a writing model trained by the above model training method to obtain q model output values corresponding to the q input texts, the model output values predicting the smoothness of the corresponding input texts;
  • in step S45, based on the q model output values, the selected text of the to-be-written text is determined from the q candidate texts.
  • first, in step S41, q candidate texts for the to-be-written text in the article to be completed are obtained, where q is a predetermined number.
  • at least one keyword of the third paragraph may be provided by the user, for example, keywords related to event impact and public opinion.
  • then, q candidate texts corresponding to the third paragraph are searched for in a corpus corresponding to entertainment news and to the paragraph form, where q is a predetermined number that can be, for example, on the order of dozens or hundreds
  • the method for obtaining the candidate text is only schematic.
  • the text to be written is not limited to a paragraph, and it may be, for example, a sentence or a part of text in a paragraph.
  • the method for obtaining candidate text is not limited to search and acquisition through keywords, for example, it can be obtained through manual finishing, or candidate text can be obtained through machine learning, and so on.
  • a predetermined length of existing text that is adjacent to the text to be written and located in front of the text to be written is obtained from the to-be-completed article.
  • for example, if the text to be written is the third paragraph of the article, the existing text of a predetermined length adjacent to and in front of the text to be written may be the second paragraph of the article
  • the predetermined length may be determined according to the length of the text to be written. For example, if the text to be written is a sentence, a sentence before the sentence to be written may be obtained in the article to be completed as the existing text.
  • in step S43, the existing text is combined with each of the q candidate texts to obtain q input texts, where, in each input text, the existing text is located in front of the candidate text
  • for example, in the above example, the selected text of the second paragraph and the q candidate texts of the third paragraph are respectively combined to obtain q input texts
  • Step S44 is basically the same as the description of step S33 shown in FIG. 3, and is not repeated here.
  • a selected text of the text to be written is determined from the q candidate texts. For example, in the above example, after obtaining q model output values corresponding to q input texts respectively from the writing model, it may be determined that the candidate text included in the input text corresponding to the maximum value among the q model output values is The selected text, that is, the candidate text of the third paragraph included therein is the selected text of the third paragraph. That is, the candidate text of the third paragraph that is most smoothly combined with the selected text of the second paragraph is used as the selected text of the third paragraph.
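The step of FIG. 4 can be sketched in the same style; the function below assumes the same scoring model and simply extends the article one unit at a time.

```python
def extend_article(model, existing_text, candidates):
    """Prepend the adjacent existing text to each of the q candidates,
    score the q input texts, and return the candidate whose combination
    with the existing text is smoothest."""
    scored = [(float(model(existing_text + c)), c) for c in candidates]
    return max(scored, key=lambda sc: sc[0])[1]
```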
  • the text to be written can be a sentence, a paragraph, or a part of a paragraph.
  • after the selected text is determined, the words and sentences in it may be replaced synonymously; for example, words in the selected text may be replaced with synonyms, and sentences in the selected text may be replaced with other sentences having the same meaning but a different expression, and so on
  • the above replacement can be performed using a synonym dictionary and a synonymous-sentence corpus
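By way of illustration only, word-level synonymous replacement against a synonym dictionary could look like this; the tiny thesaurus and the replacement probability are invented for the example.

```python
import random

THESAURUS = {  # placeholder synonym dictionary
    "improve": ["enhance", "boost"],
    "big": ["large", "huge"],
}

def replace_synonyms(tokens, prob=0.3):
    out = []
    for t in tokens:
        if t in THESAURUS and random.random() < prob:
            out.append(random.choice(THESAURUS[t]))  # synonymous replacement
        else:
            out.append(t)
    return out
```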
  • in one embodiment, keywords of the topic of the article to be written may also be obtained, and a search strategy is set based on the topic keywords and the keywords of the text to be written; for example, the corpus can first be filtered using the topic keywords, and the candidate texts are then searched for in the filtered corpus using the keywords of the text to be written
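A simple two-stage retrieval consistent with this strategy is sketched below; a real system would use a proper search-and-ranking method, and substring matching here is only a stand-in.

```python
def search_candidates(corpus, topic_keywords, text_keywords, q):
    """Filter the corpus by the topic keywords, then rank the surviving
    texts by how many keywords of the text to be written they contain,
    returning the top q as candidate texts."""
    filtered = [doc for doc in corpus
                if any(k in doc for k in topic_keywords)]
    ranked = sorted(filtered,
                    key=lambda doc: sum(k in doc for k in text_keywords),
                    reverse=True)
    return ranked[:q]
```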
  • FIG. 5 illustrates an apparatus 500 for training a writing model according to an embodiment of the present specification.
  • the writing model includes a word segmentation unit, a transformation unit, and a recurrent neural network.
  • the apparatus 500 includes:
  • the obtaining unit 51 is configured to obtain a sample for training the writing model, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is smooth or not;
  • the training unit 52 is configured to train the writing model with the sample so that, after the text is input to the writing model, the absolute value of the difference between the model's output value after training and the calibration value is smaller than before training, where, after the text is input to the writing model, the word segmentation unit obtains from the text a plurality of word segments arranged in order according to their positions in the text, the conversion unit converts the ordered word segments into correspondingly ordered word vectors, and the recurrent neural network outputs the output value of the writing model based on the ordered word vectors.
  • the training unit is further configured to use the samples to train the writing model by a back propagation algorithm.
  • FIG. 6 illustrates a writing device 600 according to an embodiment of the present specification, including:
  • the obtaining unit 61 is configured to obtain n first candidate texts for the first to-be-written text of the article to be written and m second candidate texts for the second to-be-written text, where n and m are predetermined numbers and, in the article to be written, the first to-be-written text is located in front of and adjacent to the second to-be-written text;
  • the combining unit 62 is configured to combine each of the first candidate texts with each of the second candidate texts, thereby obtaining n × m input texts, where, in each input text, the first candidate text is located in front of the second candidate text;
  • the input unit 63 is configured to input the n × m input texts respectively into a writing model trained by the training method, to obtain n × m model output values corresponding to the n × m input texts, the model output values predicting the smoothness of the corresponding input texts; and
  • the determining unit 64 is configured to determine the selected text of the first to-be-written text from the n first candidate texts based on the n ⁇ m model output values, and from the m second candidates The text determines the selected text of the second text to be written.
  • FIG. 7 illustrates a writing device 700 according to an embodiment of the present specification, including:
  • the first obtaining unit 71 is configured to obtain q candidate texts of a text to be written in a to-be-completed article, where q is a predetermined number;
  • a second obtaining unit 72 configured to obtain, from the to-be-completed article, an existing text of a predetermined length that is adjacent to the to-be-written text and is located in front of the to-be-written text;
  • the combining unit 73 is configured to combine the existing text with each of the q candidate texts to obtain q input texts, where, in each input text, the existing text is located in front of the candidate text;
  • the input unit 74 is configured to input the q input texts respectively into a writing model trained by the training method, to obtain q model output values corresponding to the q input texts, the model output values predicting the smoothness of the corresponding input texts; and
  • the determining unit 75 is configured to determine the selected text of the text to be written from the q candidate texts based on the q model output values.
  • in one embodiment, the writing device 600 (700) further includes a replacement unit 65 (76) configured to, after the selected text is determined, perform synonymous replacement of the words and/or sentences in the selected text.
  • in one embodiment, the first obtaining unit 71 includes a first obtaining subunit 711 configured to obtain at least one keyword of the text to be written, and a second obtaining subunit 712 configured to obtain q candidate texts of the text to be written based on the keywords.
  • the writing device 700 further includes a third obtaining unit 77 configured to obtain keywords of a topic of the article to be completed before obtaining at least one keyword of the text to be written.
  • the second obtaining subunit 712 is further configured to search, according to a predetermined search and ranking method, a corpus corresponding to the text to be written based on the keywords, to obtain the q candidate texts.
  • the intelligent manuscript-writing scheme proposes an algorithmic framework capable of efficiently solving the intelligent writing problem; it solves the problem of corpus reuse through synonymous replacement and enriches the content of the article by inserting sentences after paragraphs.
  • compared with overly automated manuscript-writing methods, this scheme adds manual intervention, which greatly improves practicability and accuracy; compared with purely template-based methods, it greatly enhances flexibility and extensibility.
  • a software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present description provide a method and apparatus for training a writing model, and a writing method and apparatus. The writing model comprises a word segmentation unit, a conversion unit, and a recurrent neural network. The training method comprises: obtaining a sample for training the writing model, the sample comprising a text and a calibration value for the text, the calibration value indicating whether the text is smooth; and training the writing model by using the sample so that, after the text is input into the writing model, the absolute value of the difference between the output value of the writing model and the calibration value is smaller after training than before training.

Description

Intelligent writing method and apparatus

Technical Field

The embodiments of the present specification relate to the field of machine learning technology, and more particularly, to a method and apparatus for training a writing model, and a writing method and apparatus.

Background

With the explosive development of today's new media, the means of public-opinion publicity have become increasingly varied, which brings great challenges to writing work. For each emergency and each social hot-spot issue, it is necessary not only to respond in a timely manner, but also to write articles in a style suited to the characteristics of the various media and readers, such as Weibo, WeChat, Zhihu, Toutiao, and so on. This greatly increases the workload of manuscript writing. Automatic writing with intelligent writing tools can therefore greatly improve writing efficiency and seize the initiative in brand publicity and public relations. Existing intelligent writing tools include the Giiso intelligent writing robot and automatic writing tools from academia. With these existing tools, there is very little manual intervention while the manuscript is being written, yet a large amount of manpower is required for proofreading afterwards; in addition, the completed articles tend to be rigidly formatted (for example, tabular) and cannot be adapted to a variety of writing scenarios. Therefore, a more effective intelligent writing method is needed.

Summary of the Invention

The embodiments of the present specification aim to provide a more effective intelligent writing method to remedy the deficiencies in the prior art.

To achieve the above object, one aspect of the present specification provides a method for training a writing model, the writing model including a word segmentation unit, a conversion unit, and a recurrent neural network. The method includes: obtaining a sample for training the writing model, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is smooth; and training the writing model with the sample so that, after the text is input to the writing model, the absolute value of the difference between the model's output value and the calibration value is smaller after training than before training. After the text is input to the writing model, the word segmentation unit obtains from the text a plurality of word segments arranged in order according to their positions in the text; the conversion unit converts the ordered word segments into a correspondingly ordered plurality of word vectors; and the recurrent neural network outputs the output value of the writing model based on the ordered word vectors.

In one embodiment, in the method for training a writing model, the text includes two non-adjacent texts from one existing article, and the calibration value is 0.

In one embodiment, in the method for training a writing model, the text includes two texts belonging to two different existing articles, and the calibration value is 0.

In one embodiment, in the method for training a writing model, the text includes two adjacently arranged texts from one existing article in their original order, and the calibration value is 1.

In one embodiment, in the method for training a writing model, the existing article belongs to a selected field, and the writing model is used after the training for writing articles belonging to the selected field.

In one embodiment, in the method for training a writing model, training the writing model with the sample includes training the writing model with the sample by a back propagation algorithm.

In one embodiment, in the method for training a writing model, the recurrent neural network includes one of the following networks: RNN, LSTM, and GRU.

Another aspect of the present specification provides a writing method, including: obtaining n first candidate texts for a first to-be-written text of an article to be written and m second candidate texts for a second to-be-written text, where n and m are predetermined numbers and, in the article to be written, the first to-be-written text is located in front of and adjacent to the second to-be-written text; combining each of the first candidate texts with each of the second candidate texts to obtain n × m input texts, where in each input text the first candidate text is located in front of the second candidate text; inputting the n × m input texts into a writing model trained by the above training method to obtain n × m model output values corresponding to the n × m input texts, the model output values predicting the smoothness of the corresponding input texts; and, based on the n × m model output values, determining the selected text of the first to-be-written text from the n first candidate texts and the selected text of the second to-be-written text from the m second candidate texts.

Another aspect of the present specification provides a writing method, including: obtaining q candidate texts for a to-be-written text in an article to be completed, where q is a predetermined number; obtaining from the article to be completed an existing text of a predetermined length that is adjacent to and in front of the to-be-written text; combining the existing text with each of the q candidate texts to obtain q input texts, where in each input text the existing text is located in front of the candidate text; inputting the q input texts into a writing model trained by the above training method to obtain q model output values corresponding to the q input texts, the model output values predicting the smoothness of the corresponding input texts; and, based on the q model output values, determining the selected text of the to-be-written text from the q candidate texts.

In one embodiment, the writing method further includes, after determining the selected text, performing synonymous replacement of words and/or sentences in the selected text.

In one embodiment, in the writing method, obtaining the q candidate texts of the to-be-written text in the article to be completed includes obtaining at least one keyword of the to-be-written text, and obtaining the q candidate texts of the to-be-written text based on the keywords.

In one embodiment, the writing method further includes, before obtaining at least one keyword of the to-be-written text, obtaining keywords of the topic of the article to be completed.

In one embodiment, in the writing method, obtaining the q candidate texts of the to-be-written text based on the keywords includes searching a corpus corresponding to the to-be-written text based on the keywords according to a predetermined search and ranking method, thereby obtaining the q candidate texts.

In one embodiment, in the writing method, the corpus corresponding to the to-be-written text includes at least one of the following: a corpus corresponding to the field of the to-be-written text, and a corpus corresponding to the form of the to-be-written text.

本说明书另一方面提供一种训练写作模型的装置,所述写作模型包括分词单元、转换单元和循环神经网络,所述装置包括:获取单元,配置为,获取用于训练所述写作模型的样本,所述样本包括文本和针对所述文本的标定值,所述标定值指示所述文本是否通顺;以及训练单元,配置为,利用所述样本训练所述写作模型,使得:在对所述写作模型输入所述文本之后,相比于训练前,所述写作模型在所述训练后的输出值与所述标定值之差的绝对值减小,其中,在对所述写作模型输入所述文本之后,所述分词单元从所述文本获取顺序排列的多个分词,其中,所述多个分词根据其在所述文本中的前后位置顺序排列,所述转换单元将所述顺序排列的多个分词转换成对应的顺序排列的多个词向量,所述循环神经网络基于所述顺序排列的多个词向量输出所述写作模型的输出值。Another aspect of the present specification provides a device for training a writing model. The writing model includes a word segmentation unit, a conversion unit, and a recurrent neural network. The device includes: an obtaining unit configured to obtain a sample for training the writing model. The sample includes a text and a calibration value for the text, the calibration value indicating whether the text is smooth; and a training unit configured to use the sample to train the writing model such that: After the model inputs the text, compared to before training, the absolute value of the difference between the output value of the writing model after training and the calibration value decreases, wherein the text is input to the writing model After that, the word segmentation unit obtains a plurality of sequentially arranged word segments from the text, wherein the plurality of segmented words are sequentially arranged according to their forward and backward positions in the text, and the conversion unit arranges the sequentially arranged plurality of Word segmentation is converted into a corresponding sequence of multiple word vectors, and the recurrent neural network outputs the writing based on the sequence of multiple word vectors Type output value.

在一个实施例中,在所述训练写作模型的装置中,所述训练单元还配置为,利用所述样本,通过反向传播算法训练所述写作模型。In one embodiment, in the apparatus for training a writing model, the training unit is further configured to use the samples to train the writing model by a back propagation algorithm.

本说明书另一方面提供一种写作装置,包括:获取单元,配置为,获取待写文章的第一待写文本的n个第一候选文本和第二待写文本的m个第二候选文本,其中n、m为预定数目,以及其中,在所述待写文章中,所述第一待写文本位于所述第二待写文本的前面、并与所述第二待写文本相邻;组合单元,配置为,将所述第一候选文本中的每个文本与所述第二候选文本中的每个文本两两组合,从而获取n×m个输入文本,其中,在每个所述输入文本中,所述第一候选文本位于所述第二候选文本的前面;输入单元,配置为,将所述n×m个输入文本分别输入通过根据上述训练方法训练的写作模型,以 获取与所述n×m个输入文本分别对应的n×m个模型输出值,所述模型输出值预测对应的输入文本的通顺程度;以及确定单元,配置为,基于所述n×m个模型输出值,从所述n个第一候选文本中确定所述第一待写文本的选定文本,以及从所述m个第二候选文本确定所述第二待写文本的选定文本。Another aspect of the present specification provides a writing device, including: an obtaining unit configured to obtain n first candidate texts of a first to-be-written text of an article to be written and m second candidate texts of a second to-be-written text, Where n and m are a predetermined number, and in the article to be written, the first to-be-written text is located in front of and adjacent to the second to-be-written text; combination A unit configured to combine each text in the first candidate text with each text in the second candidate text to obtain n × m input texts, wherein, in each of the input In the text, the first candidate text is located in front of the second candidate text; the input unit is configured to input the n × m input texts respectively through a writing model trained according to the above training method to obtain Describing n × m model output values corresponding to the n × m input texts respectively, the model output values predicting the smoothness of the corresponding input texts; and a determining unit configured to be based on the n × m model output values, From the n Determining a first candidate text to be written in the text of the selected text, and determining the selected text written text from the second to be the second candidate text m.

本说明书另一方面提供一种写作装置,包括:第一获取单元,配置为,获取待完成文章中的待写文本的q个候选文本,其中q为预定数目;第二获取单元,配置为,从所述待完成文章中获取将与所述待写文本相邻、且位于所述待写文本的前面的预定长度的已有文本;组合单元,配置为,将所述已有文本与所述q个候选文本分别组合,以获取q个输入文本,其中,在每个所述输入文本中,所述已有文本位于所述候选文本的前面;输入单元,配置为,将所述q个输入文本分别输入通过根据上述训练方法训练的写作模型,以获取与所述q个输入文本分别对应的q个模型输出值,所述模型输出值预测对应的输入文本的通顺程度;以及确定单元,配置为,基于所述q个模型输出值,从所述q个候选文本中确定所述待写文本的选定文本。Another aspect of the present specification provides a writing device, including: a first obtaining unit configured to obtain q candidate texts of to-be-written text in a to-be-completed article, where q is a predetermined number; and a second obtaining unit configured to: Obtaining a predetermined length of existing text adjacent to the to-be-written text and in front of the to-be-written text from the to-be-completed article; a combination unit configured to combine the existing text with the The q candidate texts are respectively combined to obtain q input texts, wherein in each of the input texts, the existing text is located in front of the candidate texts; the input unit is configured to input the q input texts The text is input respectively through a writing model trained according to the above training method to obtain q model output values corresponding to the q input texts respectively, the model output values predicting the smoothness of the corresponding input texts; and a determining unit, configuration To determine the selected text of the text to be written from the q candidate texts based on the q model output values.

In one embodiment, the writing apparatus further includes a replacement unit configured to perform, after the selected text is determined, synonymous replacement of words and/or sentences in the selected text.

In one embodiment, in the writing apparatus, the first acquisition unit includes: a first acquisition subunit configured to acquire at least one keyword of the to-be-written text; and a second acquisition subunit configured to acquire the q candidate texts of the to-be-written text based on the keyword.

In one embodiment, the writing apparatus further includes a third acquisition unit configured to acquire a keyword of the subject of the article to be completed before the at least one keyword of the to-be-written text is acquired.

In one embodiment, in the writing apparatus, the second acquisition subunit is further configured to search, according to a predetermined search and ranking method, a corpus corresponding to the to-be-written text based on the keyword, thereby obtaining the q candidate texts.

The intelligent drafting scheme according to the embodiments of the present specification proposes an algorithmic framework that can efficiently solve the intelligent drafting problem, solves the problem of corpus reuse by using synonymous replacement, and enriches the content of an article by inserting sentences after paragraphs. Compared with overly automated drafting methods, this scheme adds manual intervention and thereby greatly improves practicability and accuracy; compared with purely template-based methods, it greatly enhances flexibility and scalability.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments of the present specification can be made clearer by describing them with reference to the accompanying drawings:

FIG. 1 shows a schematic diagram of a system 100 according to an embodiment of the present specification;

FIG. 2 shows a method for training a writing model according to an embodiment of the present specification;

FIG. 3 shows a writing method according to an embodiment of the present specification;

FIG. 4 shows a writing method according to an embodiment of the present specification;

FIG. 5 shows an apparatus 500 for training a writing model according to an embodiment of the present specification;

FIG. 6 shows a writing apparatus 600 according to an embodiment of the present specification; and

FIG. 7 shows a writing apparatus 700 according to an embodiment of the present specification.

DETAILED DESCRIPTION

The embodiments of the present specification are described below with reference to the accompanying drawings.

FIG. 1 shows a schematic diagram of a system 100 according to an embodiment of the present specification. As shown in FIG. 1, the system 100 includes a writing model 11 and a selection unit 12. The writing model 11 in turn includes a word segmentation unit 111, a conversion unit 112, and a recurrent neural network 113. In the stage of training the writing model 11, a training sample is first acquired; the training sample includes a text and a calibration value for the text, the calibration value indicating whether the text is fluent. The writing model 11 is then trained with the training sample by, for example, the BPTT (backpropagation through time) algorithm, so that after the text is input to the writing model, the absolute value of the difference between the output value of the writing model after the training and the calibration value is reduced compared with that before the training. After the text of the training sample is input to the writing model 11, the word segmentation unit 111 segments the text of the sample to obtain a plurality of sequentially arranged segmented words. The conversion unit 112 then converts the segmented words into correspondingly ordered word vectors and inputs them sequentially to the recurrent neural network 113. The recurrent neural network 113 computes over the sequentially input word vectors and outputs a computation result. During training, the parameters of the recurrent neural network 113 are adjusted based on the sequentially input word vectors and the calibration value, for example according to the BPTT algorithm, so that after the plurality of word vectors have been sequentially input to the recurrent neural network 113, the output value of the network comes closer to the calibration value. A plurality of training samples may be acquired to train the writing model repeatedly, that is, to adjust the parameters of the recurrent neural network 113 repeatedly, so that the output value of the writing model for an input sample approaches the calibration value of that sample as closely as possible.
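As one way to picture the structure just described, the sketch below wires segmentation, word-vector lookup, and an LSTM scorer together in PyTorch. It is a minimal illustration under stated assumptions, not the embodiment's implementation: the use of jieba as the segmenter, the layer sizes, and the helper names (WritingModel, encode, score) are all assumptions; the embodiment only specifies the segmentation → word vector → recurrent network → output value structure.

```python
import jieba               # assumed Chinese word segmenter
import torch
import torch.nn as nn

class WritingModel(nn.Module):
    """Fluency scorer: segmented words -> word vectors -> RNN -> score."""
    def __init__(self, vocab_size, embed_dim=100, hidden_size=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, token_ids):               # token_ids: (1, seq_len)
        vectors = self.embedding(token_ids)     # ordered word vectors
        _, (h_n, _) = self.rnn(vectors)         # final hidden state
        return torch.sigmoid(self.out(h_n[-1]))  # score in (0, 1)

def encode(vocab, text):
    """Segment a text and map each segmented word to its vocabulary id."""
    ids = [vocab[w] for w in jieba.cut(text) if w in vocab]
    return torch.tensor([ids])

def score(model, vocab, text):
    """Output value of the writing model for one text."""
    with torch.no_grad():
        return model(encode(vocab, text)).item()
```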

In the stage of writing with the writing model 11, a plurality of input texts are first acquired and input to the writing model 11. For example, for the first paragraph and the second paragraph of an article to be written, n first candidate paragraphs for the first paragraph and m second candidate paragraphs for the second paragraph may be obtained by searching and combined pairwise, thereby obtaining n×m input texts, which are input to the writing model 11 one by one. Each time an input text is input to the writing model 11, the word segmentation unit 111 likewise first segments the text to obtain sequentially arranged segmented words, and the conversion unit 112 then converts this ordered group of segmented words into ordered word vectors and inputs them sequentially to the recurrent neural network 113. The recurrent neural network 113, having been trained as described above, computes over the sequentially input word vectors, outputs an output value corresponding to them, and sends the output value to the selection unit 12. After obtaining the n×m output values respectively corresponding to the n×m input texts, the selection unit 12 may take the input text with the largest output value as the first and second paragraphs of the article to be written.

For the third paragraph of the article to be written, q third candidate paragraphs may similarly be obtained by searching. The text of the selected second paragraph is then combined with each of the q third candidate paragraphs, thereby obtaining q input texts. Similarly, by inputting the q input texts to the writing model 11 one by one, q model output values respectively corresponding to the q input texts can be obtained. Finally, the selection unit 12 selects, from the q third candidate paragraphs, the paragraph with the largest corresponding model output value as the third paragraph. Subsequent paragraphs of the article to be written can be obtained in the same way.

The system 100 according to the embodiment of the present specification shown in FIG. 1 is merely exemplary; the system according to the embodiments of the present specification is not limited to the system 100. For example, the system 100 may further include a search unit for searching for texts and a combination unit for combining texts. In addition, the system 100 may include other units, such as a replacement unit for performing synonymous replacement of words and/or sentences in the selected texts of the article to be written, and so on. The writing model 11 is not limited to processing paragraphs of an article to be written; for example, based on a corresponding corpus, it may also be used to insert sentences into, or supplement the content of, an existing article.

FIG. 2 shows a method for training a writing model according to an embodiment of the present specification, the writing model including a word segmentation unit, a conversion unit, and a recurrent neural network, the method including:

In step S21, acquiring a sample for training the writing model, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is fluent; and

In step S22, training the writing model with the sample, so that after the text is input to the writing model, the absolute value of the difference between the output value of the writing model after the training and the calibration value is reduced compared with that before the training, wherein, after the text is input to the writing model, the word segmentation unit obtains a plurality of sequentially arranged segmented words from the text, the plurality of segmented words being arranged in order according to their positions in the text, the conversion unit converts the sequentially arranged segmented words into a corresponding sequence of word vectors, and the recurrent neural network outputs the output value of the writing model based on the sequentially arranged word vectors.

First, in step S21, a sample for training the writing model is acquired, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is fluent. In the embodiments of the present specification, the writing model includes a word segmentation unit, a conversion unit, and a recurrent neural network, the recurrent neural network being, for example, an RNN, an LSTM, or a GRU. What these recurrent neural networks have in common is that a final output value is obtained by inputting a group of vectors sequentially, where the computation over a vector input at a given time step takes into account one or more vectors input before that time step. That is, a recurrent neural network captures the relationship between the current output of a sequence and the information that preceded it.

In one embodiment, two non-adjacent texts may be taken from an existing article and combined into a sample text. In this case, since the two texts are non-adjacent in the original article, their combination is not fluent; the sample text is therefore labeled 0, indicating non-fluency. In one embodiment, one text may be taken from each of two different existing articles and combined into a sample text; likewise, this sample text is not fluent, and its calibration value is also 0. In another embodiment, two adjacently arranged texts are taken from an existing article and combined into a sample text in their original order. Since the two texts in this sample text are adjacent in the existing article and arranged in their original order, the sample text is fluent; it is therefore labeled 1, indicating fluency. The texts selected from existing articles may be sentences or paragraphs, and either may be used to train the writing model.
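A compact sketch of this sample construction is given below. The function name and the paragraph-level granularity are illustrative, since the embodiment allows sentences as well as paragraphs.

```python
import random

def make_samples(articles):
    """articles: lists of paragraphs drawn from existing articles."""
    samples = []
    for paras in articles:
        # adjacent paragraphs in their original order -> fluent, label 1
        for i in range(len(paras) - 1):
            samples.append((paras[i] + paras[i + 1], 1))
        # two non-adjacent paragraphs of one article -> not fluent, label 0
        if len(paras) >= 3:
            i, j = sorted(random.sample(range(len(paras)), 2))
            if j - i > 1:
                samples.append((paras[i] + paras[j], 0))
    # paragraphs from two different articles -> not fluent, label 0
    for a, b in zip(articles, articles[1:]):
        samples.append((random.choice(a) + random.choice(b), 0))
    return samples
```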

A large number of training samples can thus be obtained easily by acquiring existing articles. In the embodiments of the present specification, the writing model is trained repeatedly on samples on the order of hundreds of thousands to millions, that is, the method shown in FIG. 2 is repeated many times, so as to train a more accurate writing model. The existing articles may be obtained, for example, from websites, by searching, from databases, and so on. In one embodiment, existing articles in a selected domain may be acquired to train a writing model for writing articles in that domain. Domains may be divided along different dimensions: for example, in the dimension of content, domains may include entertainment, sports, technology, finance, and so on; in the dimension of style, domains may include Weibo, WeChat, Zhihu, Toutiao, and so on. For example, where the entertainment news domain is selected, a number of existing articles may be obtained from existing entertainment news and training samples derived from them to train the writing model. In that case, the trained writing model can be used to write articles in the entertainment news domain.

In step S22, the writing model is trained with the sample, so that after the text is input to the writing model, the absolute value of the difference between the output value of the writing model after the training and the calibration value is reduced compared with that before the training.

After the text is input to the writing model, the word segmentation unit obtains a plurality of sequentially arranged segmented words from the text, the segmented words being ordered according to their positions in the text. The segmentation may be performed, for example, by matching against a lexicon or by a deep learning model. The conversion unit converts the sequentially arranged segmented words into a corresponding sequence of word vectors; a specific model, for example a word2vec (word-to-vector) model, may be used to convert each segmented word into a corresponding word vector, that is, a dense vector understandable by a computer. An ordered sequence of word vectors corresponding to the ordered segmented words is thereby obtained. The recurrent neural network outputs the output value of the writing model based on the sequentially arranged word vectors; that is, the recurrent neural network performs a computation for each input word vector, and this computation is based not only on the features of that word vector but also on the computation over the preceding word vector. Thus, after the corresponding computations have been performed over the sequentially input word vectors, the output reflects the sequential dependencies among them and can therefore be used to measure the fluency of the text.
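As an example of the conversion step, a word2vec model can be trained with gensim as sketched below; gensim is one possible choice rather than the embodiment's prescribed tool, and the toy corpus, vector size, and window are placeholders chosen only for illustration.

```python
from gensim.models import Word2Vec

# two toy segmented sentences; a real corpus would be far larger
tokenized_corpus = [["明星", "出席", "活动"], ["活动", "现场", "热闹"]]
w2v = Word2Vec(sentences=tokenized_corpus, vector_size=100,
               window=5, min_count=1)
vector = w2v.wv["明星"]   # the dense vector for one segmented word
```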

In one embodiment, the model is trained by a backpropagation algorithm (the BPTT algorithm). The training process can be divided into three steps: forward computation, loss computation, and backward computation. First, the current model is used to compute the output value after the plurality of word vectors have been sequentially input to the model, together with the intermediate result of each step of computing each word vector. Then, the loss value of each step is computed, and the per-step losses are accumulated and averaged; for example, for a sample with a calibration value of 1, the model output value is subtracted from 1 to obtain the loss of the final step of the model's computation, and by pushing back through the parameters of the last layer of neurons, the loss value of each neuron in the preceding layer can be obtained. Finally, backward computation is performed based on the results of the forward pass, accumulating the gradient value of each step. The parameters of the current model are then adjusted by gradient descent, so that after the adjustment the model's output for the plurality of word vectors comes closer to the calibration value (for example, "1"). By feeding the model a large number of training samples and adjusting the parameters many times, the model's output for the training samples is made to approach the calibration values as closely as possible.

However, the training method is not limited to the backpropagation algorithm described above; training may also be performed with existing improved variants of BPTT, or the model may be trained with the help of deep learning libraries such as TensorFlow, Torch, or Theano.
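For instance, with PyTorch (one of the libraries named above) the forward/loss/backward cycle can be written in a few lines. This continues the hypothetical WritingModel, encode, and make_samples sketches from earlier; the optimizer and learning rate are arbitrary choices, not values given by the embodiment.

```python
import torch

model = WritingModel(vocab_size=len(vocab))      # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.BCELoss()    # compares output in (0,1) with label 0/1

for text, label in samples:                      # (text, calibration value)
    optimizer.zero_grad()
    output = model(encode(vocab, text)).view(1)  # forward computation
    loss = loss_fn(output, torch.tensor([float(label)]))
    loss.backward()             # backpropagation through time
    optimizer.step()            # gradient-descent parameter update
```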

FIG. 3 shows a writing method according to an embodiment of the present specification, including:

In step S31, acquiring n first candidate texts for a first to-be-written text of an article to be written and m second candidate texts for a second to-be-written text, where n and m are predetermined numbers, and where, in the article to be written, the first to-be-written text precedes and is adjacent to the second to-be-written text;

In step S32, combining each of the first candidate texts with each of the second candidate texts pairwise, thereby obtaining n×m input texts, where, in each input text, the first candidate text precedes the second candidate text;

In step S33, inputting the n×m input texts respectively into a writing model trained according to the above model training method, so as to obtain n×m model output values respectively corresponding to the n×m input texts, the model output values predicting the fluency of the corresponding input texts; and

In step S34, determining, based on the n×m model output values, a selected text for the first to-be-written text from the n first candidate texts and a selected text for the second to-be-written text from the m second candidate texts.

First, in step S31, n first candidate texts for the first to-be-written text of the article to be written and m second candidate texts for the second to-be-written text are acquired, where n and m are predetermined numbers, and where, in the article to be written, the first to-be-written text precedes and is adjacent to the second to-be-written text. The writing method shown in FIG. 3 is implemented with a writing model trained by the training method shown in FIG. 2; for example, the writing model is trained in the training stage on a corpus (training texts) from the entertainment news domain, that is, it is a model for writing entertainment news. When writing a new piece of entertainment news with this writing model, for the first paragraph (the first to-be-written text) and the second paragraph (the second to-be-written text) of the article to be written, the user may provide at least one keyword for the first paragraph, for example keywords about the person, time, or background, and at least one keyword for the second paragraph, for example keywords about the event or its course. Then, according to a predetermined search and ranking strategy, n candidate texts corresponding to the first paragraph and m candidate texts corresponding to the second paragraph may be retrieved from a corpus corresponding to the entertainment news domain and to paragraphs. Here, n and m are predetermined numbers, which may be, for example, on the order of tens or hundreds. A corpus of a specific domain and a specific form may be built by acquiring articles of the specific domain and extracting from them texts of the specific form (for example, paragraphs, sentences, and so on). Articles of the specific domain (for example, entertainment news) may be obtained from the corresponding sections of websites, by searching, or from a database.

The above method of acquiring candidate texts is merely illustrative. The first and second to-be-written texts are not limited to paragraphs, nor to the first and second paragraphs of the article to be written; they may be, for example, two sentences, or two parts of the text within one paragraph. The method of acquiring candidate texts is also not limited to keyword search; candidate texts may be obtained, for example, by manual curation or by machine learning. In addition, the corpus is not limited to the above corpus of a specific domain and specific form. For example, the corpus may correspond only to a domain, in which case the candidate texts retrieved from it are not limited to a specific form: they may be sentences or paragraphs. As another example, the corpus may correspond only to the form of the to-be-written text, in which case the candidate texts retrieved from it are not limited to a specific domain: they may belong to entertainment or to finance. The contents of the corpus can be set according to the needs of the specific scenario. In one embodiment, the domain of the corpus is chosen according to the domain the writing model applies to; for example, if the writing model is a model for writing articles in the entertainment news domain, then, when drafting with this writing model, a corpus of the entertainment news domain is chosen for searching for candidate texts for the to-be-written text.

In step S32, each of the first candidate texts is combined with each of the second candidate texts pairwise, thereby obtaining n×m input texts, where, in each input text, the first candidate text precedes the second candidate text. For example, in the above case of acquiring candidate texts for the first and second paragraphs of the article to be written, let letters stand for the sentences of a paragraph. If one candidate text for the first paragraph is "a, b, c." and one candidate text for the second paragraph is "d, e.", then combining the two candidate texts in order, that is, placing the first paragraph's candidate text before the second paragraph's candidate text, yields one input text "a, b, c. d, e.". By combining each of the n candidate texts of the first paragraph with each of the m candidate texts of the second paragraph pairwise as described above, n×m input texts can be obtained.
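Using the letter example above, the pairwise combination of step S32 reduces to a Cartesian product, as the sketch below shows; the candidate lists are placeholders.

```python
from itertools import product

first_candidates = ["a, b, c."]            # n candidates for paragraph 1
second_candidates = ["d, e.", "f, g."]     # m candidates for paragraph 2

# each first candidate placed before each second candidate: n*m inputs
input_texts = [p1 + p2 for p1, p2 in
               product(first_candidates, second_candidates)]
```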

In step S33, the n×m input texts are input one by one into a writing model trained according to the above model training method, so as to obtain n×m model output values respectively corresponding to the n×m input texts, the model output values predicting the fluency of the corresponding input texts.

For each of the n×m input texts, after it is input into the writing model trained according to the above model training method, the writing model outputs a computation result through the processing described above with reference to FIG. 2. That is, the word segmentation unit of the writing model obtains a plurality of sequentially arranged segmented words from the text, the segmented words being ordered according to their positions in the text; the conversion unit converts the sequentially arranged segmented words into a corresponding sequence of word vectors, for example by means of a word2vec (word-to-vector) model; and the recurrent neural network outputs the output value of the writing model based on the sequentially arranged word vectors.

Since, in the stage of training the writing model, a text composed of consecutive paragraphs is labeled 1, indicating fluency, and a text composed of non-consecutive paragraphs is labeled 0, indicating non-fluency, the output value of the writing model for an input text indicates the fluency of that input text: the closer the output value is to 1, the more fluent the input text, and the closer it is to 0, the less fluent the input text.

In step S34, based on the n×m model output values, a selected text for the first to-be-written text is determined from the n first candidate texts, and a selected text for the second to-be-written text is determined from the m second candidate texts. After the n×m model output values respectively corresponding to the n×m input texts have been obtained from the writing model, the input text corresponding to the largest of the n×m model output values may be taken as the selected text; that is, the first-paragraph candidate text it contains is the selected text for the first paragraph, and the second-paragraph candidate text it contains is the selected text for the second paragraph. In other words, the most fluent of the pairwise combinations of the n first candidate texts and the m second candidate texts is taken as the selected text of the article to be written.
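Steps S33 and S34 can then be sketched as scoring every combination and keeping the pair with the largest output value; score() is the hypothetical helper from the earlier sketch, and input_texts comes from the combination sketch above.

```python
scores = [score(model, vocab, t) for t in input_texts]
best = max(range(len(scores)), key=scores.__getitem__)
i, j = divmod(best, len(second_candidates))    # product() varies the
selected_first = first_candidates[i]           # second index fastest
selected_second = second_candidates[j]
```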

FIG. 4 shows a writing method according to an embodiment of the present specification, including:

In step S41, acquiring q candidate texts for a to-be-written text in an article to be completed, where q is a predetermined number;

In step S42, acquiring, from the article to be completed, an existing text of a predetermined length that will be adjacent to and precede the to-be-written text;

In step S43, combining the existing text with each of the q candidate texts to obtain q input texts, where, in each input text, the existing text precedes the candidate text;

In step S44, inputting the q input texts respectively into a writing model trained according to the above model training method, so as to obtain q model output values respectively corresponding to the q input texts, the model output values predicting the fluency of the corresponding input texts; and

In step S45, determining, based on the q model output values, a selected text for the to-be-written text from the q candidate texts.

First, in step S41, q candidate texts for the to-be-written text in the article to be completed are acquired, where q is a predetermined number. For example, for the above article to be completed in the entertainment news domain, in which the selected texts of the first and second paragraphs have been determined, the user may provide at least one keyword for the third paragraph, for example keywords about the event's impact or public opinion. Then, according to a predetermined search and ranking strategy, q candidate texts corresponding to the third paragraph may be retrieved from a corpus corresponding to entertainment news and to paragraphs, where q is a predetermined number, for example on the order of tens or hundreds.

Here, the method of acquiring candidate texts is merely illustrative. The to-be-written text is not limited to a paragraph; it may be, for example, a sentence, or part of the text within a paragraph. The method of acquiring candidate texts is also not limited to keyword search; candidate texts may be obtained, for example, by manual curation or by machine learning.

In step S42, an existing text of a predetermined length that will be adjacent to and precede the to-be-written text is acquired from the article to be completed. For example, in the above article to be completed in the entertainment news domain, if the to-be-written text is the third paragraph of the article, the existing text of predetermined length adjacent to and preceding the to-be-written text may be the second paragraph of the article. The predetermined length may be determined according to the length of the to-be-written text; for example, if the to-be-written text is a sentence, the sentence immediately preceding it in the article to be completed may be taken as the existing text.

In step S43, the existing text is combined with each of the q candidate texts to obtain q input texts, where, in each input text, the existing text precedes the candidate text. For example, for the above article to be completed in the entertainment news domain, the selected text of the existing second paragraph is combined with each of the q candidate texts of the third paragraph, thereby obtaining q input texts.

Step S44 is substantially the same as the description of step S33 of FIG. 3 above and is not repeated here.

In step S45, based on the q model output values, a selected text for the to-be-written text is determined from the q candidate texts. For example, in the above example, after the q model output values respectively corresponding to the q input texts have been obtained from the writing model, the candidate text contained in the input text corresponding to the largest of the q model output values may be taken as the selected text; that is, the third-paragraph candidate text it contains is the selected text for the third paragraph. In other words, the third-paragraph candidate text whose combination with the selected text of the second paragraph is the most fluent is taken as the selected text of the third paragraph.
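Steps S43 to S45 can likewise be sketched in a few lines, again reusing the hypothetical score() helper: the existing text is prepended to every candidate and the best-scoring continuation is kept. All the literals below are toy placeholders.

```python
existing = "d, e."                             # e.g. the selected paragraph 2
candidates = ["h, i.", "j, k."]                # the q candidate texts
inputs = [existing + c for c in candidates]    # existing text placed first
scores = [score(model, vocab, t) for t in inputs]
selected = candidates[max(range(len(scores)), key=scores.__getitem__)]
```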

In this method, the to-be-written text may be a sentence, a paragraph, or part of a paragraph. By using the above writing model to determine the most fluent text to insert after the existing paragraphs of an article, an existing article can be enriched and supplemented with additional content.

In one embodiment, after the selected text for the to-be-written text has been determined by the above writing model, in order to avoid simple repeated reuse, words and sentences in the selected text may be replaced synonymously, for example by replacing a word in the selected text with a synonym, or replacing a sentence in the selected text with another sentence of the same meaning but different expression. Such replacement may be performed with a synonym lexicon or a corpus of synonymous sentences.
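Word-level replacement can be as simple as the sketch below; the synonym table here is a hypothetical toy stand-in for a real synonym lexicon.

```python
import random

SYNONYMS = {"出席": ["现身", "亮相"], "热闹": ["火爆"]}   # toy thesaurus

def replace_synonyms(segmented_words):
    """Swap each word that has synonyms for a randomly chosen one."""
    return [random.choice(SYNONYMS[w]) if w in SYNONYMS else w
            for w in segmented_words]
```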

In one embodiment, when acquiring candidate texts for the to-be-written text, in addition to at least one keyword of the to-be-written text, keywords of the subject of the article to be written may also be acquired. A search strategy is then set based on the subject keywords and the keywords of the to-be-written text, and candidate texts are retrieved accordingly. For example, in the corpus, the texts may first be filtered by the subject keywords, and candidate texts then retrieved from the filtered set by the keywords of the to-be-written text.
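The two-stage strategy might look like the following sketch, where the ranking rule (a count of keyword hits) is an assumption standing in for the embodiment's unspecified predetermined search and ranking method.

```python
def search_candidates(corpus, subject_kws, text_kws, q):
    """Filter by subject keywords, then rank by the text's own keywords."""
    filtered = [t for t in corpus if any(k in t for k in subject_kws)]
    ranked = sorted(filtered,
                    key=lambda t: sum(k in t for k in text_kws),
                    reverse=True)
    return ranked[:q]   # the q candidate texts
```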

FIG. 5 shows an apparatus 500 for training a writing model according to an embodiment of the present specification. The writing model includes a word segmentation unit, a conversion unit, and a recurrent neural network. The apparatus 500 includes:

an acquisition unit 51 configured to acquire a sample for training the writing model, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is fluent; and

a training unit 52 configured to train the writing model with the sample, so that after the text is input to the writing model, the absolute value of the difference between the output value of the writing model after the training and the calibration value is reduced compared with that before the training, wherein, after the text is input to the writing model, the word segmentation unit obtains a plurality of sequentially arranged segmented words from the text, the plurality of segmented words being arranged in order according to their positions in the text, the conversion unit converts the sequentially arranged segmented words into a corresponding sequence of word vectors, and the recurrent neural network outputs the output value of the writing model based on the sequentially arranged word vectors.

In one embodiment, in the apparatus for training a writing model, the training unit is further configured to train the writing model with the sample by means of a back-propagation algorithm.

FIG. 6 shows a writing apparatus 600 according to an embodiment of the present specification, including:

an acquisition unit 61 configured to acquire n first candidate texts for a first to-be-written text of an article to be written and m second candidate texts for a second to-be-written text, where n and m are predetermined numbers, and where, in the article to be written, the first to-be-written text precedes and is adjacent to the second to-be-written text;

a combination unit 62 configured to combine each of the first candidate texts with each of the second candidate texts pairwise, thereby obtaining n×m input texts, where, in each input text, the first candidate text precedes the second candidate text;

an input unit 63 configured to input the n×m input texts respectively into a writing model trained according to the above training method, so as to obtain n×m model output values respectively corresponding to the n×m input texts, the model output values predicting the fluency of the corresponding input texts; and

a determination unit 64 configured to determine, based on the n×m model output values, a selected text for the first to-be-written text from the n first candidate texts and a selected text for the second to-be-written text from the m second candidate texts.

FIG. 7 shows a writing apparatus 700 according to an embodiment of the present specification, including:

a first acquisition unit 71 configured to acquire q candidate texts for a to-be-written text in an article to be completed, where q is a predetermined number;

a second acquisition unit 72 configured to acquire, from the article to be completed, an existing text of a predetermined length that will be adjacent to and precede the to-be-written text;

a combination unit 73 configured to combine the existing text with each of the q candidate texts to obtain q input texts, where, in each input text, the existing text precedes the candidate text;

an input unit 74 configured to input the q input texts respectively into a writing model trained according to the above training method, so as to obtain q model output values respectively corresponding to the q input texts, the model output values predicting the fluency of the corresponding input texts; and

a determination unit 75 configured to determine, based on the q model output values, a selected text for the to-be-written text from the q candidate texts.

In one embodiment, the writing apparatus 600 (700) further includes a replacement unit 65 (76) configured to perform, after the selected text is determined, synonymous replacement of words and/or sentences in the selected text.

In one embodiment, in the writing apparatus 700, the first acquisition unit 71 includes: a first acquisition subunit 711 configured to acquire at least one keyword of the to-be-written text; and a second acquisition subunit 712 configured to acquire the q candidate texts of the to-be-written text based on the keyword.

In one embodiment, the writing apparatus 700 further includes a third acquisition unit 77 configured to acquire a keyword of the subject of the article to be completed before the at least one keyword of the to-be-written text is acquired.

In one embodiment, in the writing apparatus, the second acquisition subunit 712 is further configured to search, according to a predetermined search and ranking method, a corpus corresponding to the to-be-written text based on the keyword, thereby obtaining the q candidate texts.

The intelligent drafting scheme according to the embodiments of the present specification proposes an algorithmic framework that can efficiently solve the intelligent drafting problem, solves the problem of corpus reuse by using synonymous replacement, and enriches the content of an article by inserting sentences after paragraphs. Compared with overly automated drafting methods, this scheme adds manual intervention and thereby greatly improves practicability and accuracy; compared with purely template-based methods, it greatly enhances flexibility and scalability.

Those of ordinary skill in the art should further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above in general terms of their functions. Whether these functions are carried out in hardware or in software depends on the specific application and the design constraints of the technical solution. Those of ordinary skill in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present application.

The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included in the scope of protection of the present invention.

Claims (28)

1. A method for training a writing model, the writing model including a word segmentation unit, a conversion unit, and a recurrent neural network, the method comprising: acquiring a sample for training the writing model, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is fluent; and training the writing model with the sample, so that after the text is input to the writing model, the absolute value of the difference between the output value of the writing model after the training and the calibration value is reduced compared with that before the training, wherein, after the text is input to the writing model, the word segmentation unit obtains a plurality of sequentially arranged segmented words from the text, the plurality of segmented words being arranged in order according to their positions in the text, the conversion unit converts the sequentially arranged segmented words into a corresponding sequence of word vectors, and the recurrent neural network outputs the output value of the writing model based on the sequentially arranged word vectors.
2. The method for training a writing model according to claim 1, wherein the text includes two non-adjacent texts from an existing article, and the calibration value is 0.
3. The method for training a writing model according to claim 1, wherein the text includes two texts respectively belonging to two existing articles, and the calibration value is 0.
4. The method for training a writing model according to claim 1, wherein the text includes, in order, two adjacently arranged texts from an existing article, and the calibration value is 1.
5. The method for training a writing model according to any one of claims 2-4, wherein the existing article belongs to a selected domain, and the writing model is used after the training to write articles belonging to the selected domain.
6. The method for training a writing model according to claim 1, wherein training the writing model with the sample includes training the writing model with the sample by means of a back-propagation algorithm.
7. The method for training a writing model according to claim 1, wherein the recurrent neural network includes one of the following networks: RNN, LSTM, and GRU.
8. A writing method, comprising: acquiring n first candidate texts for a first to-be-written text of an article to be written and m second candidate texts for a second to-be-written text, where n and m are predetermined numbers, and where, in the article to be written, the first to-be-written text precedes and is adjacent to the second to-be-written text; combining each of the first candidate texts with each of the second candidate texts pairwise, thereby obtaining n×m input texts, where, in each input text, the first candidate text precedes the second candidate text; inputting the n×m input texts respectively into a writing model trained by the method according to any one of claims 1-7, so as to obtain n×m model output values respectively corresponding to the n×m input texts, the model output values predicting the fluency of the corresponding input texts; and determining, based on the n×m model output values, a selected text for the first to-be-written text from the n first candidate texts and a selected text for the second to-be-written text from the m second candidate texts.
9. A writing method, comprising: acquiring q candidate texts for a to-be-written text in an article to be completed, where q is a predetermined number; acquiring, from the article to be completed, an existing text of a predetermined length that will be adjacent to and precede the to-be-written text; combining the existing text with each of the q candidate texts to obtain q input texts, where, in each input text, the existing text precedes the candidate text; inputting the q input texts respectively into a writing model trained by the method according to any one of claims 1-7, so as to obtain q model output values respectively corresponding to the q input texts, the model output values predicting the fluency of the corresponding input texts; and determining, based on the q model output values, a selected text for the to-be-written text from the q candidate texts.
10. The writing method according to claim 8 or 9, further comprising, after the selected text is determined, performing synonymous replacement of words and/or sentences in the selected text.
11. The writing method according to claim 9, wherein acquiring the q candidate texts for the to-be-written text in the article to be completed includes: acquiring at least one keyword of the to-be-written text; and acquiring the q candidate texts of the to-be-written text based on the keyword.
12. The writing method according to claim 11, further comprising, before acquiring the at least one keyword of the to-be-written text, acquiring a keyword of the subject of the article to be completed.
13. The writing method according to claim 11 or 12, wherein acquiring the q candidate texts of the to-be-written text based on the keyword includes searching, according to a predetermined search and ranking method, a corpus corresponding to the to-be-written text based on the keyword, thereby obtaining the q candidate texts.
14. The writing method according to claim 13, wherein the corpus corresponding to the to-be-written text includes at least one of the following: a corpus corresponding to the domain of the to-be-written text, and a corpus corresponding to the form of the to-be-written text.
15. An apparatus for training a writing model, the writing model including a word segmentation unit, a conversion unit, and a recurrent neural network, the apparatus comprising: an acquisition unit configured to acquire a sample for training the writing model, the sample including a text and a calibration value for the text, the calibration value indicating whether the text is fluent; and a training unit configured to train the writing model with the sample, so that after the text is input to the writing model, the absolute value of the difference between the output value of the writing model after the training and the calibration value is reduced compared with that before the training, wherein, after the text is input to the writing model, the word segmentation unit obtains a plurality of sequentially arranged segmented words from the text, the plurality of segmented words being arranged in order according to their positions in the text, the conversion unit converts the sequentially arranged segmented words into a corresponding sequence of word vectors, and the recurrent neural network outputs the output value of the writing model based on the sequentially arranged word vectors.
16. The apparatus for training a writing model according to claim 15, wherein the text includes two non-adjacent texts from an existing article, and the calibration value is 0.
17. The apparatus for training a writing model according to claim 15, wherein the text includes two texts respectively belonging to two existing articles, and the calibration value is 0.
18. The apparatus for training a writing model according to claim 15, wherein the text includes, in order, two adjacently arranged texts from an existing article, and the calibration value is 1.
19. The apparatus for training a writing model according to any one of claims 16-18, wherein the existing article belongs to a selected domain, and the writing model is used, after the training, to write articles belonging to the selected domain.
20. The apparatus for training a writing model according to claim 15, wherein the training unit is further configured to train the writing model with the sample by a back-propagation algorithm.
21. The apparatus for training a writing model according to claim 15, wherein the recurrent neural network comprises one of the following networks: RNN, LSTM, and GRU.
22. A writing apparatus, comprising:
an obtaining unit configured to obtain n first candidate texts for a first to-be-written text of an article to be written and m second candidate texts for a second to-be-written text of the article, wherein n and m are predetermined numbers, and wherein, in the article to be written, the first to-be-written text precedes and is adjacent to the second to-be-written text;
a combining unit configured to combine each of the first candidate texts with each of the second candidate texts pairwise to obtain n×m input texts, wherein, in each input text, the first candidate text precedes the second candidate text;
an input unit configured to input the n×m input texts respectively into a writing model trained by the method according to any one of claims 1-7 to obtain n×m model output values respectively corresponding to the n×m input texts, each model output value predicting the degree of smoothness of the corresponding input text; and
a determining unit configured to determine, based on the n×m model output values, a selected text for the first to-be-written text from the n first candidate texts, and a selected text for the second to-be-written text from the m second candidate texts.
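Continuing the sketch above, sample construction in the manner of claims 16 and 18 (adjacent text pairs calibrated 1, non-adjacent pairs calibrated 0) and back-propagation training in the manner of claim 20 might look as follows; the loss function, optimizer, and schedule are illustrative assumptions.

```python
import random
import torch
import torch.nn as nn

def make_samples(sentences):
    """Build training samples from one existing article (claims 16 and 18)."""
    samples = []
    for i in range(len(sentences) - 1):
        samples.append((sentences[i] + sentences[i + 1], 1.0))   # adjacent -> 1
        far = [k for k in range(len(sentences)) if abs(k - i) > 1]
        if far:  # articles shorter than three sentences yield no negatives
            samples.append((sentences[i] + sentences[random.choice(far)], 0.0))
    return samples

def train(model, samples, epochs=5, lr=1e-3):
    """Back-propagation (claim 20) pulls the model output toward the 0/1
    calibration value, shrinking |output - calibration| as claim 15 requires."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        for text, label in samples:
            optimizer.zero_grad()
            loss = loss_fn(model(text), torch.tensor(label))
            loss.backward()
            optimizer.step()
```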
23. A writing apparatus, comprising:
a first obtaining unit configured to obtain q candidate texts for a to-be-written text in an article to be completed, wherein q is a predetermined number;
a second obtaining unit configured to obtain, from the article to be completed, an existing text of a predetermined length that will be adjacent to, and precede, the to-be-written text;
a combining unit configured to combine the existing text with each of the q candidate texts to obtain q input texts, wherein, in each input text, the existing text precedes the candidate text;
an input unit configured to input the q input texts respectively into a writing model trained by the method according to any one of claims 1-7 to obtain q model output values respectively corresponding to the q input texts, each model output value predicting the degree of smoothness of the corresponding input text; and
a determining unit configured to determine, based on the q model output values, a selected text for the to-be-written text from the q candidate texts.
24. The writing apparatus according to claim 22 or 23, further comprising a replacement unit configured to, after the selected text is determined, perform synonym replacement on words and/or sentences in the selected text.
25. The writing apparatus according to claim 23, wherein the first obtaining unit comprises: a first obtaining subunit configured to obtain at least one keyword of the to-be-written text; and a second obtaining subunit configured to obtain the q candidate texts for the to-be-written text based on the at least one keyword.
26. The writing apparatus according to claim 25, further comprising a third obtaining unit configured to obtain, before the at least one keyword of the to-be-written text is obtained, a keyword of the theme of the article to be completed.
27. The writing apparatus according to claim 25 or 26, wherein the second obtaining subunit is further configured to search, according to a predetermined search and ranking method, a corpus corresponding to the to-be-written text based on the at least one keyword, thereby obtaining the q candidate texts.
28. The writing apparatus according to claim 27, wherein the corpus corresponding to the to-be-written text comprises at least one of: a corpus corresponding to the domain of the to-be-written text, and a corpus corresponding to the form of the to-be-written text.
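Claims 13 and 27 leave the "predetermined search and ranking method" open. One plausible instantiation, offered purely as an assumption, is TF-IDF retrieval with cosine ranking over the domain- or form-specific corpus of claims 14 and 28; for Chinese text the corpus would first be word-segmented (another assumption), since scikit-learn's default tokenizer splits on word boundaries.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_candidates(keywords, corpus, q):
    """Rank corpus texts against the keywords of the to-be-written text and
    return the top q as candidate texts (sketch of claims 13 and 27)."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(corpus)           # one row per corpus text
    query_vec = vectorizer.transform([" ".join(keywords)])
    scores = cosine_similarity(query_vec, doc_matrix)[0]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:q]]
```

These candidates would then be fed, together with the existing text, into the trained writing model for the smoothness-based selection of claims 9 and 23.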
PCT/CN2019/077443 2018-05-31 2019-03-08 Intelligent writing method and apparatus Ceased WO2019228016A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810551101.6A CN108874761A (en) 2018-05-31 2018-05-31 A kind of intelligence writing method and device
CN201810551101.6 2018-05-31

Publications (1)

Publication Number Publication Date
WO2019228016A1 (en)

Family

ID=64335762

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/077443 Ceased WO2019228016A1 (en) 2018-05-31 2019-03-08 Intelligent writing method and apparatus

Country Status (3)

Country Link
CN (1) CN108874761A (en)
TW (1) TWI716824B (en)
WO (1) WO2019228016A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108874761A (en) * 2018-05-31 2018-11-23 阿里巴巴集团控股有限公司 A kind of intelligence writing method and device
CN110287478B (en) * 2019-05-15 2023-05-23 广东工业大学 Machine writing system based on natural language processing technology
CN113762523B (en) * 2021-01-26 2025-07-18 北京沃东天骏信息技术有限公司 Text generation method and device, storage medium and electronic equipment
CN117291244A (en) * 2023-09-08 2023-12-26 北京声智科技有限公司 Model training methods, generative writing methods, devices and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104375989A (en) * 2014-12-01 2015-02-25 国家电网公司 Natural language text keyword association network construction system
CN106815194A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and device and keyword recognition method and device
US20180032870A1 (en) * 2015-10-22 2018-02-01 Tencent Technology (Shenzhen) Company Limited Evaluation method and apparatus based on text analysis, and storage medium
CN107729322A (en) * 2017-11-06 2018-02-23 广州杰赛科技股份有限公司 Segmenting method and device, establish sentence vector generation model method and device
CN108874761A (en) * 2018-05-31 2018-11-23 阿里巴巴集团控股有限公司 A kind of intelligence writing method and device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI608367B (en) * 2012-01-11 2017-12-11 國立臺灣師範大學 Text readability measuring system and method thereof
TWI592812B (en) * 2013-05-23 2017-07-21 恆鼎科技股份有限公司 Methods for identifying comment units of articles, and related devices and computer program prodcuts
US9679555B2 (en) * 2013-06-26 2017-06-13 Qualcomm Incorporated Systems and methods for measuring speech signal quality
CN104035751B (en) * 2014-06-20 2016-10-12 深圳市腾讯计算机系统有限公司 Data parallel processing method based on multi-graphics processor and device
CN107870964B (en) * 2017-07-28 2021-04-09 北京中科汇联科技股份有限公司 Statement ordering method and system applied to answer fusion system


Also Published As

Publication number Publication date
CN108874761A (en) 2018-11-23
TW202004709A (en) 2020-01-16
TWI716824B (en) 2021-01-21

Similar Documents

Publication Publication Date Title
Morente-Molinera et al. Carrying out consensual group decision making processes under social networks using sentiment analysis over comparative expressions
CN108829801B (en) An event-triggered word extraction method based on document-level attention mechanism
US12158906B2 (en) Systems and methods for generating query responses
US20190317986A1 (en) Annotated text data expanding method, annotated text data expanding computer-readable storage medium, annotated text data expanding device, and text classification model training method
CN110489523B (en) Fine-grained emotion analysis method based on online shopping evaluation
CN112818118B (en) Reverse translation-based Chinese humor classification model construction method
CN113673241B (en) A framework system and method for text summarization generation based on example learning
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN112182145B (en) Text similarity determination method, device, equipment and storage medium
CN114387537B (en) Video question-answering method based on descriptive text
TWI716824B (en) A smart writing method and device
CN112883722B (en) A Distributed Text Summarization Method Based on Cloud Data Center
CN112446217B (en) Emotion analysis method and device and electronic equipment
CN117540023A (en) Image joint text emotion analysis method based on modal fusion graph convolution network
CN106649250A (en) Method and device for identifying emotional new words
CN107305543B (en) Method and apparatus for classifying semantic relations of entity words
CN119047468B (en) Intelligent creation method based on NLP intention recognition
KR102330190B1 (en) Apparatus and method for embedding multi-vector document using semantic decomposition of complex documents
CN111259651A (en) User emotion analysis method based on multi-model fusion
KR102649948B1 (en) Text augmentation apparatus and method using hierarchy-based word replacement
CN114254622B (en) Intention recognition method and device
CN113204963B (en) Input method multi-word discovery method and device
Yturrizaga-Aguirre et al. Story visualization using image-text matching architecture for digital storytelling
CN113641800B (en) Text duplicate checking method, device and equipment and readable storage medium
CN115238672B (en) Sentence component recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19812291

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19812291

Country of ref document: EP

Kind code of ref document: A1