
WO2018010579A1 - Character string segmentation method, apparatus and device - Google Patents

Character string segmentation method, apparatus and device

Info

Publication number
WO2018010579A1
Authority
WO
WIPO (PCT)
Prior art keywords
segmentation result
word
segmentation
character string
reverse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/CN2017/091783
Other languages
French (fr)
Chinese (zh)
Inventor
张增明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Publication of WO2018010579A1
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools

Definitions

  • The present application relates to computer technology, and in particular to a word segmentation method, apparatus, and device for character strings.
  • Natural language processing is the use of computers to analyze and understand natural language, so that computers have human language capabilities to some extent.
  • Natural language processing often encounters dirty data that does not conform to natural language rules, which significantly degrades the processing results. It is therefore necessary to pre-process the English text into normal natural language containing multiple English words before processing it with a natural language model.
  • the dirty data in the prior art mainly includes a character string formed by concatenating a plurality of words due to the absence of a space character, a character string doped with an interfering character, and the like.
  • The prior-art process for segmenting English text is as follows: read the letters of the character string to be segmented one at a time, append each letter to the letters already read to form a substring, and check whether the substring can be found in a pre-acquired English dictionary. If it can, the substring is a word and is separated from the original string. The method is then repeated on the remaining string until the segmentation is complete or the remaining string cannot be split.
  • This method may cause semantic errors, or may even fail to segment the string at all, when the suffix of the previous word and the prefix of the following word form a word, or when the character string to be segmented is doped with interfering characters.
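  • As a minimal sketch of the prior-art process described above, the following Python splits off a word as soon as the growing substring is found in the dictionary. The function name and the toy dictionary are hypothetical and serve only to illustrate two failure cases: an early split that leaves an unsplittable remainder, and a leading interfering character that prevents any split.

```python
def prior_art_segment(text, dictionary):
    """Split off a word as soon as the growing substring is in the dictionary."""
    words, current = [], ""
    for letter in text:
        current += letter
        if current in dictionary:
            words.append(current)
            current = ""
    # `current` is the remainder that could not be split off as a word.
    return words, current


# Hypothetical toy dictionary, not taken from the patent.
TOY_DICTIONARY = {"for", "forest", "floor", "length"}

print(prior_art_segment("forest", TOY_DICTIONARY))        # (['for'], 'est')
print(prior_art_segment("xfloorlength", TOY_DICTIONARY))  # ([], 'xfloorlength')
```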
  • The invention provides a word segmentation method, apparatus, and device, which improve both the segmentation success rate and the probability that each word in the segmentation result is semantically correct.
  • the present invention provides a word segmentation method for a string, comprising:
  • a segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.
  • the obtaining a forward segmentation result of the character string to be segmented includes:
  • the character string to be segmented, with the first word removed, is used as a new character string to be segmented, and the process returns to the operation of performing forward segmentation on the character string to be segmented;
  • the first character in the forward direction of the character string to be segmented is deleted to obtain a processed character string to be segmented, the processed character string is used as a new character string to be segmented, and the process returns to the operation of performing forward segmentation on the character string to be segmented;
  • the operation of performing forward segmentation on the character string to be segmented is repeatedly performed until the segmentation of the character string to be segmented ends, and a forward segmentation result is obtained.
  • the forward segmentation method provided in this embodiment is a layer-by-layer forward progressive segmentation method. After layer-by-layer attempts, the interfering characters are overcome, and the forward segmentation result is finally obtained.
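  • The forward steps above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the claimed implementation: the function name, the `max_word_len` parameter, and the toy dictionary are hypothetical, and the longest-word matching mentioned later in the text is assumed for picking each first word.

```python
def forward_segment(text, dictionary, max_word_len=None):
    """Left-to-right longest match; drops one leading character when no word fits."""
    if max_word_len is None:
        max_word_len = max(map(len, dictionary), default=0)
    words = []
    while text:
        # Try the longest candidate prefix first, then shorter ones.
        for end in range(min(max_word_len, len(text)), 0, -1):
            if text[:end] in dictionary:
                words.append(text[:end])
                text = text[end:]  # remove the word and segment the rest
                break
        else:
            # No prefix is a word: treat the first character as interference.
            text = text[1:]
    return words


# Hypothetical toy dictionary, not taken from the patent.
TOY_DICTIONARY = {"floor", "length", "sleeve", "less", "sleeveless", "dress"}

print(forward_segment("floorlengthsleevelessdressst", TOY_DICTIONARY))
# ['floor', 'length', 'sleeveless', 'dress']
```

  • With this toy dictionary the trailing characters st match no word and are dropped one at a time; a real dictionary may instead match them as an erroneous word.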
  • the obtaining a reverse segmentation result of the character string to be segmented includes:
  • the character string to be segmented, with the second word removed, is used as a new character string to be segmented, and the process returns to the operation of performing reverse segmentation on the character string to be segmented;
  • the first character in the reverse direction of the character string to be segmented is deleted to obtain a processed character string to be segmented, the processed character string is used as a new character string to be segmented, and the process returns to the operation of performing reverse segmentation on the character string to be segmented;
  • the operation of performing reverse segmentation on the character string to be segmented is repeatedly performed until the segmentation of the character string to be segmented ends, and a reverse segmentation result is obtained.
  • the reverse segmentation method provided in this embodiment is a layer-by-layer reverse progressive segmentation method. After a layer-by-layer attempt, the interference characters are overcome, and finally the reverse segmentation result is obtained.
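  • A mirrored sketch of the reverse steps, again under assumptions: the function name, `max_word_len`, and the toy dictionary are hypothetical, and longest-word matching from the tail of the string is assumed. The words are collected right to left and reversed at the end so that the result reads in the original order.

```python
def reverse_segment(text, dictionary, max_word_len=None):
    """Right-to-left longest match; drops one trailing character when no word fits."""
    if max_word_len is None:
        max_word_len = max(map(len, dictionary), default=0)
    words = []
    while text:
        # Try the longest candidate suffix first, then shorter ones.
        for size in range(min(max_word_len, len(text)), 0, -1):
            if text[-size:] in dictionary:
                words.append(text[-size:])
                text = text[:-size]  # remove the word and segment the rest
                break
        else:
            # No suffix is a word: treat the last character as interference.
            text = text[:-1]
    words.reverse()  # restore reading order
    return words


# Hypothetical toy dictionary; "hirt" is included because the text's own
# sleepshirt example treats it as a dictionary word.
TOY_DICTIONARY = {"sleep", "sleeps", "shirt", "hirt",
                  "floor", "length", "sleeveless", "dress"}

print(reverse_segment("sleepshirt", TOY_DICTIONARY))  # ['sleep', 'shirt']
```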
  • the string is forward-divided or reverse-divided according to the dictionary tree.
  • when a character is added, the search may continue along the existing search path from the node reached before the character was added, proceeding to the next level of nodes. This avoids repeated lookups, minimizes unnecessary string comparisons, reduces query time, and improves search efficiency.
  • each first node of the forward dictionary tree stores a word frequency of a word corresponding to the first node
  • each second node of the reverse dictionary tree stores the word frequency of the word corresponding to the second node
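  • A dictionary tree that stores word frequencies on its nodes might look like the sketch below. The class and function names are hypothetical; the reverse tree is assumed to be built by inserting each word with its characters reversed; the frequencies reuse the sleepshirt figures given later in the text. Note how `longest_match` keeps walking from the current node for each added character instead of restarting the lookup.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = None  # set only on nodes that complete a word


def build_trie(word_freqs, reverse=False):
    """Build a forward tree, or a reverse tree when reverse=True."""
    root = TrieNode()
    for word, freq in word_freqs.items():
        node = root
        for ch in (reversed(word) if reverse else word):
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq
    return root


def longest_match(root, chars):
    """Walk the tree along chars; return (length, freq) of the longest word found."""
    node, best = root, (0, None)
    for depth, ch in enumerate(chars, start=1):
        node = node.children.get(ch)
        if node is None:
            break  # no deeper path, stop without re-scanning
        if node.freq is not None:
            best = (depth, node.freq)
    return best


WORD_FREQS = {"sleep": 10000, "sleeps": 100, "shirt": 9000, "hirt": 10}
forward_trie = build_trie(WORD_FREQS)
reverse_trie = build_trie(WORD_FREQS, reverse=True)

print(longest_match(forward_trie, "sleepshirt"))            # (6, 100)
print(longest_match(reverse_trie, reversed("sleepshirt")))  # (5, 9000)
```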
  • the acquiring the word frequency of each of the first words and the word frequency of each of the second words includes:
  • the method further includes:
  • Constructing a corpus including a word library and a word frequency of words in the word library;
  • the constructing a forward dictionary tree and a reverse dictionary tree including:
  • a forward dictionary tree and a reverse dictionary tree are constructed, and the word frequency of each word is stored to the corresponding first node and second node.
  • the preset text includes: a text that satisfies a preset use condition and a text to be divided; and the constructed corpus includes:
  • constructing the corpus according to the words in the word library and the number of occurrences of those words in the text that satisfies the preset use condition and in the text to be segmented.
  • the determining the number of occurrences of a word in the word library in the text to be segmented includes:
  • the word frequency of the words in the corpus is corrected using the text to be segmented and is therefore correlated with it, so that the word frequencies are closer to the application scenario of the text to be segmented. This can bring the semantics of the segmentation result closer to the semantics expressed by the text to be segmented and improve the correctness of the string segmentation.
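  • One way to picture the corpus construction is the sketch below. The text does not spell out how occurrences in the text to be segmented are counted, so the non-overlapping substring count used here is purely an assumption, as are the function name and sample texts.

```python
import re
from collections import Counter


def build_corpus(word_library, reference_text, text_to_segment):
    """Map each library word to its occurrence count in both texts."""
    # Reference text: count tokens delimited by non-letter characters.
    tokens = Counter(re.findall(r"[a-z]+", reference_text.lower()))
    target = text_to_segment.lower()
    corpus = {}
    for word in word_library:
        # Assumption: the text to be segmented may lack spaces, so count
        # non-overlapping substring occurrences instead of tokens.
        corpus[word] = tokens[word] + target.count(word)
    return corpus


corpus = build_corpus(
    {"sleep", "shirt", "sleeps"},
    "A sleep shirt is a long shirt worn to sleep in.",
    "sleepshirt cotton sleepshirt",
)
print(corpus["sleep"], corpus["shirt"], corpus["sleeps"])  # 4 4 2
```

  • Substring counting also credits sleeps for each sleepshirt, which shows why the counting rule matters in practice.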
  • the determining the segmentation result of the character string to be divided according to the word frequency of each of the first words and the word frequency of each of the second words includes:
  • both the forward segmentation and the reverse segmentation adopt the longest word segmentation method.
  • the present invention provides a word segmentation method for a character string, including:
  • receiving segmentation result information of the character string to be segmented fed back by the cloud server, where the segmentation result information includes a segmentation result of the character string to be segmented, and the segmentation result is the forward segmentation result or the reverse segmentation result;
  • the segmentation result is output to the user.
  • In the word segmentation method provided by this embodiment, the text to be segmented entered by the user is sent to the cloud server, so that the cloud server obtains the character string to be segmented and determines the segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result. Because the cloud server segments the character string in both directions, it can distinguish interfering characters at the head or tail of the string, which improves the segmentation success rate; determining the final segmentation result based on word frequency improves the probability that each word in the result is semantically correct.
  • The segmentation result information fed back by the cloud server, which includes the segmentation result of the character string to be segmented, is then received, and the segmentation result is output to the user.
  • The user can thus see the segmentation result and know the query words that correspond to the final query result, which improves the user experience.
  • the outputting the segmentation result to a user includes:
  • the segmentation result is displayed on the display interface.
  • the segmentation result information further includes a segmentation type corresponding to the segmentation result, and the segmentation type is a forward segmentation or a reverse segmentation;
  • Displaying the segmentation result on the display interface including:
  • the segmentation result and the segmentation type of the segmentation result are displayed on the display interface.
  • the segmentation information further includes a reverse segmentation result
  • the segmentation information further includes a forward segmentation result
  • Displaying the segmentation result on the display interface including:
  • the segmentation information further includes a word frequency of each of the first words in the forward segmentation result and a word frequency of each of the second words in the reverse segmentation result;
  • the method further includes:
  • Displaying the forward segmentation result and the reverse segmentation result on the display interface including:
  • the segmentation information further includes a first word frequency sum value corresponding to the first words in the forward segmentation result and a second word frequency sum value corresponding to the second words in the reverse segmentation result;
  • the method further includes:
  • Displaying the forward segmentation result and the reverse segmentation result on the display interface including:
  • the method further includes:
  • the present invention provides a word segmentation device, comprising:
  • a first segmentation module configured to obtain a forward segmentation result of the character string to be segmented, the forward segmentation result including at least one first word
  • a second segmentation module configured to acquire a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word
  • a word frequency acquisition module configured to acquire a word frequency of each of the first words and a word frequency of each of the second words, where the word frequency is a predetermined number of occurrences of each word in the preset text;
  • a result determining module configured to determine a segmentation result of the character string to be segmented according to a word frequency of each of the first words and a word frequency of each of the second words, wherein a segmentation result of the character string to be segmented is The forward segmentation result or the reverse segmentation result.
  • the present invention provides a word segmentation device, comprising:
  • a sending module configured to send, to the cloud server, the text to be divided by the user, so that the cloud server obtains the character string to be segmented, and according to the word frequency of each first word in the forward segmentation result and the reverse segmentation result The word frequency of each second word determines the segmentation result;
  • a receiving module configured to receive segmentation result information of the character string to be segmented fed back by the cloud server, where the segmentation result information includes a segmentation result of the character string to be segmented, and the segmentation result is the forward segmentation result or the reverse segmentation result;
  • an output module configured to output the segmentation result to a user.
  • the present invention provides a word segmentation device, comprising:
  • a processor coupled to the input device, configured to obtain a forward segmentation result of the character string to be segmented, the forward segmentation result including at least one first word, and acquiring a reverse segmentation of the character string to be segmented a result, the reverse segmentation result includes at least one second word; a word frequency of each of the first words and a word frequency of each of the second words are obtained, the word frequency being a predetermined word appearing in a preset text a number of times; determining a segmentation result of the character string to be divided according to a word frequency of each of the first words and a word frequency of each of the second words, wherein a segmentation result of the character string to be segmented is the forward direction Segmentation results or the reverse segmentation results.
  • the present invention provides a cloud server, including:
  • a processor coupled to the input device, configured to obtain a forward segmentation result of the character string to be segmented, the forward segmentation result including at least one first word, and acquiring a reverse segmentation of the character string to be segmented a result, the reverse segmentation result includes at least one second word; a word frequency of each of the first words and a word frequency of each of the second words are obtained, the word frequency being a predetermined word appearing in a preset text a number of times; determining a segmentation result of the character string to be divided according to a word frequency of each of the first words and a word frequency of each of the second words, wherein a segmentation result of the character string to be segmented is the forward direction Segmentation results or the reverse segmentation results.
  • the present invention provides a word segmentation device, comprising:
  • An output device configured to send, to the cloud server, the text to be divided by the user, so that the cloud server obtains the character string to be segmented, and according to the word frequency of each first word in the forward segmentation result and the reverse segmentation result The word frequency of each second word determines the segmentation result;
  • An input device configured to receive segmentation result information of the character string to be segmented fed back by the cloud server, where the segmentation result information includes a segmentation result of the character string to be segmented, and the segmentation result is the forward segmentation result or the reverse segmentation result;
  • a processor coupled to the output device and the input device, configured to control the input device to output the segmentation result to a user according to the segmentation result information.
  • the present invention provides a user equipment, including:
  • An output device configured to send, to the cloud server, the text to be divided by the user, so that the cloud server obtains the character string to be segmented, and according to the word frequency of each first word in the forward segmentation result and the reverse segmentation result The word frequency of each second word determines the segmentation result;
  • An input device configured to receive segmentation result information of the character string to be segmented fed back by the cloud server, where the segmentation result information includes a segmentation result of the character string to be segmented, and the segmentation result is the forward segmentation result or the reverse segmentation result;
  • a processor coupled to the output device and the input device, configured to control the input device to output the segmentation result to a user according to the segmentation result information.
  • Segmenting the string in both directions makes it possible to identify interfering characters at the head or tail of the string, which improves the segmentation success rate. The word frequency of each first word and of each second word is then obtained, and the segmentation result of the character string to be segmented is determined according to those word frequencies; determining the final segmentation result based on word frequency increases the probability that each word in the segmentation result is semantically correct.
  • FIG. 1 is a schematic diagram of a word segmentation scenario of a character string according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a word segmentation method of a character string according to an embodiment of the present invention
  • FIG. 3 is a schematic diagram of forward splitting according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of reverse splitting according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of forward and reverse splitting according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of forward splitting according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of reverse splitting according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of a forward dictionary tree according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a reverse dictionary tree according to an embodiment of the present invention.
  • FIG. 10 is a schematic flowchart of a word segmentation method according to an embodiment of the present invention.
  • FIG. 11 is a schematic flowchart of a word segmentation method according to an embodiment of the present invention.
  • FIG. 12 is a signaling flowchart of a word segmentation method of a character string according to an embodiment of the present invention.
  • FIG. 13 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • FIG. 14 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • FIG. 15 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • FIG. 16 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • FIG. 17 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • FIG. 18 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • FIG. 19 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention.
  • FIG. 20 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention.
  • FIG. 21 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention.
  • FIG. 22 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention.
  • FIG. 23 is a schematic structural diagram of hardware of a word segmentation device of a character string according to an embodiment of the present invention.
  • FIG. 24 is a schematic structural diagram of hardware of a cloud server according to an embodiment of the present invention.
  • FIG. 25 is a schematic structural diagram of hardware of a word segmentation device of a character string according to an embodiment of the present invention.
  • FIG. 26 is a schematic structural diagram of hardware of a user equipment according to an embodiment of the present invention.
  • FIG. 1 is a schematic diagram of a word segmentation scenario of a character string according to an embodiment of the present invention.
  • The user enters the text to be segmented, that is, a character string, on the user equipment 100, and the user equipment 100 then sends the text to be segmented to the cloud server 200. Because the character string entered by the user may contain dirty data, the cloud server 200 performs word segmentation on it.
  • The word segmentation method provided by this embodiment can be applied within natural language processing: it preprocesses the natural language to obtain natural language consisting of multiple semantically correct English words, which is then used as input to a natural language model for further processing.
  • the natural language model can be a model for highlight vocabulary extraction.
  • the user equipment 100 can be installed with an application corresponding to the e-commerce platform, or a browser can be installed, and the user can browse the e-commerce website through the browser.
  • When the user purchases a product through the application or the e-commerce website, the user first searches for the product. Specifically, the user enters a character string in the input interface of the application corresponding to the e-commerce platform or of the e-commerce website, and the user equipment 100 then sends the character string to the cloud server 200.
  • The cloud server 200 applies the word segmentation method provided by the present invention to segment the string into a plurality of English words, and then uses the highlight vocabulary extraction model to extract from those words the title, attributes, and other information of the product, that is, the highlight vocabulary capable of describing the elements, style, and other features of the product. The products the user needs are then provided to the user according to the highlight vocabulary.
  • the cloud server 200 may further feed back the word segmentation result to the user equipment, so that the user knows the word segmentation result, so as to know which words the cloud server specifically uses to find the matching product.
  • Alternatively, the forward segmentation result and the reverse segmentation result may both be fed back to the user equipment so that the user selects the word segmentation result; the user equipment 100 then feeds back the selected result to the cloud server 200, and the cloud server 200 performs subsequent processing according to the word segmentation result selected by the user.
  • The above describes one specific application scenario. In a specific implementation, the word segmentation method can also be applied to scenarios such as web page search.
  • Depending on the processing capability of the user equipment, such as a computer, mobile phone, or tablet, the word segmentation method may also be performed by the user equipment itself; this embodiment imposes no particular limitation. In the following, a detailed embodiment first explains how the cloud server segments a character string.
  • FIG. 2 is a schematic flowchart of a word segmentation method according to an embodiment of the present invention.
  • the word segmentation method of the string can be implemented by a word segmentation device of the string.
  • the device can be implemented by software and/or hardware.
  • the word segmentation device can also be configured into a cloud server, a computer, a mobile phone, a tablet, and the like.
  • the method includes:
  • Step 101 Obtain a forward segmentation result of the character string to be segmented, where the forward segmentation result includes at least one first word;
  • Step 102 Acquire a reverse segmentation result of the character string to be segmented, where the reverse segmentation result includes at least one second word;
  • Step 103 Obtain a word frequency of each of the first words and a word frequency of each of the second words, where the word frequency is a number of times the predetermined words appear in the preset text;
  • Step 104 Determine a segmentation result of the character string to be segmented according to a word frequency of each of the first words and a word frequency of each of the second words, where the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.
  • In a specific implementation, the text to be segmented sent by the user equipment is obtained, and the character string to be segmented is derived from it and then segmented. Those skilled in the art will understand that the character string to be segmented is a continuous character string without any symbols.
  • If the text to be segmented contains no symbols, the text itself is the character string to be segmented.
  • If the text to be segmented includes spaces and punctuation marks, a symbol deletion operation, that is, an operation deleting the spaces and punctuation marks, is performed on it to obtain a continuous character string to be segmented.
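  • The symbol deletion operation could be sketched as below. Keeping only ASCII letters and digits is an assumption made for this illustration; the text only says that spaces and punctuation marks are deleted. The function name is hypothetical.

```python
import re


def to_string_to_segment(text):
    """Delete spaces and punctuation, leaving one continuous character string."""
    # Assumption: everything other than ASCII letters and digits is a symbol.
    return re.sub(r"[^0-9A-Za-z]+", "", text)


print(to_string_to_segment("floor-length, sleeveless dress!"))
# floorlengthsleevelessdress
```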
  • After the character string to be segmented is obtained, step 101 and step 102 are performed: forward segmentation and reverse segmentation are each applied to the character string to be segmented to obtain the forward segmentation result and the reverse segmentation result.
  • There is no strict ordering between forward segmentation, which yields the forward segmentation result, and reverse segmentation, which yields the reverse segmentation result.
  • FIG. 3 is a schematic diagram of forward splitting according to an embodiment of the present invention. As shown in FIG. 3, this embodiment performs forward segmentation on the string floorlengthsleevelessdressst, and the final forward segmentation result is a plurality of first words: floor length sleeveless dress.
  • The forward segmentation process is as follows: characters are taken from left to right, and the dictionary is checked after each character to determine whether a word has been obtained. After floor is obtained, the process continues to try floorl, floorle, floorlen, and so on, until the whole string has been taken or the preset string length, which is the length of the longest word, is reached. Among all the words found, the longest is taken as the segmentation result. Since no longer word is found after floor, floor is the segmentation result.
  • Similarly, both sleeve (length 6) and sleeveless (length 10) are found; sleeveless is the longer word, so sleeveless is the segmentation result, and sleeve and less are not the final segmentation results.
  • This embodiment adopts the longest-word segmentation method because it best fits the semantics. In general, there are not many cases in which two words written together also form a word, but when they do, the combined word is the more semantically appropriate reading.
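  • The longest-word choice for a single step can be illustrated as follows. The helper name, the `max_word_len` cap standing in for the preset string length, and the toy dictionary are all hypothetical.

```python
def dictionary_prefixes(text, dictionary, max_word_len):
    """List every prefix of text, up to max_word_len characters, that is a word."""
    limit = min(max_word_len, len(text))
    return [text[:n] for n in range(1, limit + 1) if text[:n] in dictionary]


# Hypothetical toy dictionary, not taken from the patent.
TOY_DICTIONARY = {"floor", "length", "sleeve", "less", "sleeveless", "dress"}

candidates = dictionary_prefixes("sleevelessdress", TOY_DICTIONARY, max_word_len=10)
print(candidates)                # ['sleeve', 'sleeveless']
print(max(candidates, key=len))  # sleeveless
```

  • If the cap were shorter than 10, sleeveless would never be tried and sleeve would win, which is why the preset string length is tied to the longest word in the dictionary.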
  • the forward splitting or the reverse splitting in this embodiment may also adopt other splitting manners in the prior art, and the present embodiment is not particularly limited herein.
  • FIG. 4 is a schematic diagram of reverse splitting according to an embodiment of the present invention. As shown in Fig. 4, this embodiment reversely splits the character string ssfloorlengthsleevelessdress.
  • the specific reverse segmentation process is: taking characters from right to left, and checking the dictionary once every time to determine whether a word is obtained.
  • the specific segmentation process is similar to the forward segmentation process, and will not be described in detail in this embodiment.
  • the final reverse segmentation result is a plurality of second words: floor length sleeveless dress.
  • Another specific example is to split the string sleepshirt forward, and the forward segmentation result is sleeps hirt; the string sleepshirt is split in the reverse direction, and the reverse segmentation result is sleep shirt.
  • In step 103, the word frequency of each first word and the word frequency of each second word are obtained. The word frequency is the number of times each predetermined word appears in the preset text.
  • the preset text can be a complete collection of English literature or an English textbook.
  • The above embodiments are used as examples.
  • When floorlengthsleevelessdressst is forward-segmented, the forward segmentation result is a plurality of correct first words: floor length sleeveless dress; when it is reverse-segmented, an erroneous second word is obtained, and the word frequency of that second word is infinitely small.
  • When ssfloorlengthsleevelessdress is reverse-segmented, the reverse segmentation result is a plurality of correct second words: floor length sleeveless dress; when it is forward-segmented, an erroneous first word is obtained, and the word frequency of that first word is infinitely small.
  • FIG. 5 is a schematic diagram of forward and reverse segmentation according to an embodiment of the present invention.
  • For the string sleepshirt, the forward segmentation result is sleeps hirt, where the word frequency of sleeps is 100 and the word frequency of hirt is 10; the reverse segmentation result is sleep shirt, where the word frequency of sleep is 10000 and the word frequency of shirt is 9000.
  • The segmentation result of the character string to be segmented is determined according to the word frequency of each first word and the word frequency of each second word. Specifically, the word frequencies of all the first words may be summed to obtain a first word frequency sum value, and the word frequencies of all the second words summed to obtain a second word frequency sum value. If the first word frequency sum value is greater than the second word frequency sum value, the segmentation result of the character string to be segmented is determined to be the forward segmentation result; if the second word frequency sum value is greater than the first word frequency sum value, it is determined to be the reverse segmentation result.
  • when a correct reverse segmentation result cannot be obtained, the word frequency of the second word is infinitely small, so the segmentation result is the forward segmentation result.
  • when a correct forward segmentation result cannot be obtained, the word frequency of the first word is infinitely small, so the segmentation result is the reverse segmentation result.
  • in the example of FIG. 5, the first word frequency sum value is 110 and the second word frequency sum value is 19000, so the segmentation result is the reverse segmentation result.
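  • the sum-based decision described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the names frequency_sum, choose_by_frequency_sum and word_freq are hypothetical, an empty (failed) result is modeled as negative infinity to stand in for an "infinitely small" word frequency, and a tie is resolved in favor of the forward result, which the text does not specify.

```python
import math


def frequency_sum(words, word_freq):
    """Sum the word frequencies of one segmentation result.

    Assumption: an empty result (segmentation failed) counts as negative
    infinity, standing in for an "infinitely small" word frequency.
    """
    if not words:
        return -math.inf
    return sum(word_freq.get(word, 0) for word in words)


def choose_by_frequency_sum(forward, reverse, word_freq):
    """Return ("forward" | "reverse", chosen word list)."""
    forward_sum = frequency_sum(forward, word_freq)
    reverse_sum = frequency_sum(reverse, word_freq)
    if reverse_sum > forward_sum:
        return "reverse", reverse
    # Assumption: ties fall back to the forward result.
    return "forward", forward


# Word frequencies taken from the FIG. 5 example.
word_freq = {"sleeps": 100, "hirt": 10, "sleep": 10000, "shirt": 9000}
kind, result = choose_by_frequency_sum(
    ["sleeps", "hirt"], ["sleep", "shirt"], word_freq
)
print(kind, result)
```

  • with the FIG. 5 values the two sums are 110 and 19000, so the sketch selects the reverse segmentation result "sleep shirt".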
  • a word frequency threshold can also be set in a specific implementation. In that case, the number of words in the forward segmentation result whose word frequency exceeds the threshold and the number of words in the reverse segmentation result whose word frequency exceeds the threshold are determined, and whichever of the two results contains more such words is used as the final segmentation result. Various transformations can also be applied to the word frequencies before the segmentation result is determined. In other words, any implementation that uses the word frequency of each first word and the word frequency of each second word to ensure that the words in the segmentation result are relatively common words, and therefore semantically correct, falls within the scope of the present invention.
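  • the threshold variant can be sketched in the same way. The names count_common_words and choose_by_threshold, the threshold value 50, and the tie-breaking rule (forward wins on a tie) are assumptions made only for illustration.

```python
def count_common_words(words, word_freq, threshold):
    """Count the words whose frequency is strictly above the threshold."""
    return sum(1 for word in words if word_freq.get(word, 0) > threshold)


def choose_by_threshold(forward, reverse, word_freq, threshold):
    """Pick the result that contains more words above the threshold."""
    forward_count = count_common_words(forward, word_freq, threshold)
    reverse_count = count_common_words(reverse, word_freq, threshold)
    if reverse_count > forward_count:
        return "reverse", reverse
    # Assumption: ties fall back to the forward result.
    return "forward", forward


word_freq = {"sleeps": 100, "hirt": 10, "sleep": 10000, "shirt": 9000}
print(choose_by_threshold(["sleeps", "hirt"], ["sleep", "shirt"], word_freq, 50))
```

  • with a hypothetical threshold of 50, only sleeps exceeds the threshold in the forward result, while both sleep and shirt exceed it in the reverse result, so the reverse result is chosen.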
  • segmenting the character string in both directions makes it possible to identify interfering characters at the head or the tail of the string, which improves the segmentation success rate. The word frequency of each first word and the word frequency of each second word are then obtained, and the segmentation result of the character string to be segmented is determined according to them. Determining the final segmentation result based on word frequency increases the probability that each word in the segmentation result is semantically correct.
  • the word segmentation method can be further improved so that, even in the presence of interfering characters, reverse segmentation of the character string in the embodiment of FIG. 3 also yields several correct second words, and forward segmentation can likewise yield several correct first words. This is described in detail below with reference to FIGS. 6 and 7.
  • FIG. 6 is a schematic diagram of forward splitting according to an embodiment of the present invention.
  • the character string ssfloorlengthsleevelessdressst to be divided is forward-segmented to determine whether a first word is acquired. Because of the interfering characters ss, no first word is obtained, so the first character in the forward direction of the character string to be divided, namely s, is deleted, and the processed character string to be divided is obtained. The processed string is then used as the new string to be split, and forward segmentation continues. Because of the remaining interfering character s, a first word still cannot be obtained, so the first character s in the forward direction of the processed string is deleted again.
  • the processed character string to be divided is used as the new character string to be divided, and forward segmentation continues, at which point the first word floor is obtained. The character string with this first word removed is then used as the new string to be split, and the forward segmentation operation is performed repeatedly until segmentation of the string to be split ends and the forward segmentation result is obtained.
  • if an interfering character exists in the middle of the character string, it becomes the first character of the remaining character string once the first words already segmented have been removed. During forward segmentation, the interfering character in the middle position can therefore also be deleted, after which forward segmentation continues until segmentation of the string to be split ends and the forward segmentation result is obtained.
  • the forward segmentation result is floor length sleeveless dress.
  • the forward segmentation method provided in this embodiment is a layer-by-layer forward progressive segmentation method. After layer-by-layer attempts, the interfering characters are overcome, and the forward segmentation result is finally obtained.
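  • the layer-by-layer forward procedure of FIG. 6 can be sketched as follows. The sketch assumes that each forward segmentation attempt takes the longest dictionary word starting at the current position (a longest-match rule that this passage does not state explicitly) and uses a plain Python set in place of the dictionary tree; the names forward_segment and WORDS are hypothetical.

```python
def forward_segment(text, words):
    """Segment text from head to tail, deleting interfering characters.

    Assumption: at each position the longest dictionary word is taken.
    If no word starts at the current position, the first character is
    treated as an interfering character and deleted (skipped).
    """
    result = []
    longest = max((len(word) for word in words), default=0)
    start = 0
    while start < len(text):
        # Try candidate words from longest to shortest.
        for end in range(min(len(text), start + longest), start, -1):
            if text[start:end] in words:
                result.append(text[start:end])
                start = end  # the rest becomes the new string to be split
                break
        else:
            start += 1  # no word found: delete the first character
    return result


# Hypothetical word library for the example string.
WORDS = {"floor", "length", "sleeve", "less", "sleeveless", "dress"}
print(forward_segment("ssfloorlengthsleevelessdressst", WORDS))
```

  • on the example string, the two leading s characters and the trailing st are deleted one at a time, and the words floor, length, sleeveless and dress are returned in that order.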
  • FIG. 7 is a schematic diagram of reverse splitting according to an embodiment of the present invention.
  • the character string ssfloorlengthsleevelessdressst to be divided is reverse-segmented to determine whether a second word is acquired. Because of the interfering characters st, no second word is obtained, so the first character in the reverse direction of the character string to be divided, namely t, is deleted, and the processed character string to be divided is obtained. The processed string is then used as the new string to be split, and reverse segmentation continues. Because of the remaining interfering character s, a second word still cannot be obtained, so the first character s in the reverse direction of the processed string is deleted again.
  • the processed character string to be divided is used as the new character string to be divided, and reverse segmentation continues, at which point the second word dress is obtained. The character string with this second word removed is then used as the new string to be split, and the reverse segmentation operation is performed repeatedly until segmentation of the string to be split ends and the reverse segmentation result is obtained.
  • if an interfering character exists in the middle of the character string, it becomes the first character, in the reverse direction, of the remaining character string once the second words already segmented have been removed. During reverse segmentation, the interfering character in the middle position can therefore also be deleted, after which reverse segmentation continues until segmentation of the string to be divided ends and the reverse segmentation result is obtained.
  • the result of the reverse segmentation is floor length sleeveless dress.
  • the reverse segmentation method provided in this embodiment is a layer-by-layer reverse progressive segmentation method. After a layer-by-layer attempt, the interference characters are overcome, and finally the reverse segmentation result is obtained.
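  • the reverse procedure of FIG. 7 mirrors the forward one and can be sketched as follows. As before, the longest-match rule, the use of a plain set instead of the reverse dictionary tree, and the names reverse_segment and WORDS are assumptions made for illustration.

```python
def reverse_segment(text, words):
    """Segment text from tail to head, deleting interfering characters.

    Assumption: at each step the longest dictionary word that ends at the
    current tail is taken. If no such word exists, the last character is
    treated as an interfering character and deleted.
    """
    result = []
    longest = max((len(word) for word in words), default=0)
    end = len(text)
    while end > 0:
        # Smaller start index means a longer candidate, so longest comes first.
        for start in range(max(0, end - longest), end):
            if text[start:end] in words:
                result.append(text[start:end])
                end = start  # the rest becomes the new string to be split
                break
        else:
            end -= 1  # no word found: delete the last character
    result.reverse()  # words were collected from tail to head
    return result


# Hypothetical word library for the example string.
WORDS = {"floor", "length", "sleeve", "less", "sleeveless", "dress"}
print(reverse_segment("ssfloorlengthsleevelessdressst", WORDS))
```

  • on the example string, the trailing t and s are deleted first, dress is found, and the procedure continues toward the head of the string; applied to sleepshirt with a word library containing sleep, sleeps, hirt and shirt, it yields sleep shirt, matching the reverse result of FIG. 5.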
  • before word segmentation, that is, before forward segmentation and before reverse segmentation, the present application can also construct a forward dictionary tree and a reverse dictionary tree, so that during word segmentation the character string to be segmented is forward-segmented according to the forward dictionary tree and reverse-segmented according to the reverse dictionary tree.
  • the dictionary tree is a tree structure, described here as a variant of a hash tree. Its advantage is that it uses the common prefixes of strings to reduce query time and minimize unnecessary string comparisons, so its query efficiency is higher than that of a hash tree. It has three basic properties: the root node contains no character, and every node other than the root contains exactly one character; concatenating the characters along the path from the root node to a given node yields the string corresponding to that node; and all children of a node contain different characters.
  • some data can be stored in the node, such as the frequency of the word.
  • the word frequency of the word corresponding to the first node is stored in each of the first nodes of the forward dictionary tree, and the word frequency of the word corresponding to the second node is stored in each second node of the reverse dictionary tree.
  • the word frequency of the first word is obtained from the first node corresponding to the first word; and the word frequency of the second word is obtained from the second node corresponding to the second word.
  • FIG. 8 is a schematic diagram of a forward dictionary tree according to an embodiment of the present invention.
  • the forward dictionary tree is the dictionary tree built from the root node down through the child nodes at each level, following the forward order of the characters in each word.
  • for example, the words "expend" and "expense" have the same prefix "expen", so in the forward dictionary tree their search paths share a common part (that is, the path consisting of 5 nodes connected by dashed lines in the forward dictionary tree).
  • FIG. 9 is a schematic diagram of a reverse dictionary tree according to an embodiment of the present invention.
  • the reverse dictionary tree that is, the dictionary tree established from the root node to the child nodes at each level in the reverse order of the characters in the word.
  • the two words "endless” and “useless” with the same suffix "less” also have a common lookup path (dashed connection) in the reverse dictionary tree, ie, the same suffix can be made by reverse dictionary tree Two or more words have the same search path.
  • when the string is forward-segmented or reverse-segmented according to the dictionary tree, each time a character is added the search can continue to the next level of nodes from the search path of the characters before it. This avoids repeated lookups, minimizes unnecessary string comparisons, reduces query time, and improves search efficiency.
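  • a dictionary tree that stores the word frequency in the node where a word ends can be sketched as follows. The class TrieNode and the functions build_trie and longest_match are hypothetical names, and taking the longest word on the path is an assumption. The reverse dictionary tree is built by inserting each word with its characters in reverse order, and it is queried with the characters of the string read from tail to head.

```python
class TrieNode:
    """One node of a dictionary tree; the root holds no character."""

    def __init__(self):
        self.children = {}  # character -> child TrieNode
        self.freq = None    # word frequency, set only where a word ends


def build_trie(word_freq, reverse=False):
    """Build a forward tree, or a reverse tree when reverse=True."""
    root = TrieNode()
    for word, freq in word_freq.items():
        node = root
        for ch in (reversed(word) if reverse else word):
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq
    return root


def longest_match(root, chars):
    """Walk the tree one character at a time, reusing the path so far.

    Returns (length, frequency) of the longest word found along the path,
    or (0, None) if no word is found.
    """
    node, best = root, (0, None)
    for depth, ch in enumerate(chars, start=1):
        node = node.children.get(ch)
        if node is None:
            break
        if node.freq is not None:
            best = (depth, node.freq)
    return best


word_freq = {"sleep": 10000, "sleeps": 100, "hirt": 10, "shirt": 9000}
forward_tree = build_trie(word_freq)
reverse_tree = build_trie(word_freq, reverse=True)
print(longest_match(forward_tree, "sleepshirt"))            # forward lookup
print(longest_match(reverse_tree, reversed("sleepshirt")))  # reverse lookup
```

  • in this sketch the forward lookup on sleepshirt stops at sleeps (length 6, frequency 100), while the reverse lookup stops at shirt (length 5, frequency 9000), and the frequency is read directly from the node without a second lookup.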
  • a corpus can also be constructed in advance.
  • the corpus includes the word library and the word frequency of each word in the word library. The forward dictionary tree and the reverse dictionary tree are then constructed according to the corpus, and the word frequency of each word is stored in the corresponding first node and second node.
  • FIG. 10 is a schematic flowchart of a word segmentation method according to an embodiment of the present invention. As shown in FIG. 10, the method includes:
  • Step 201 Obtain a word library according to a dictionary that satisfies a preset use condition.
  • the dictionary satisfying the preset use condition may be a dictionary whose vocabulary size exceeds a preset value, a dictionary whose download count exceeds a preset number of times, or the like. The words in the dictionary are extracted, and all of these words form the word library.
  • Step 202 Determine the number of times the word in the word library appears in the text satisfying the preset use condition and the text to be divided.
  • the text that satisfies the preset use condition may be text whose frequency of use exceeds a preset value, such as a complete collection of English literature, English textbooks, or English newspapers. The number of times each word in the word library appears in these texts is determined.
  • punctuation is removed from the text to be split, and the text is separated at spaces into first character strings. Any first character string that is not in the dictionary is discarded; the remaining ones are words, that is, second character strings. The number of times each second character string appears in the text to be split is then counted, which gives the number of times each word in the word library appears in the text to be split.
  • Step 203 Construct a corpus according to the word library and the number of times the words in the word library appear in the text satisfying the preset use condition and in the text to be divided.
  • the corpus includes the word library and the word frequency of the words in the word library. If the same word appears in the text that satisfies the preset use condition and appears in the text to be split, the word frequency of the word is the number of occurrences of the word in the text satisfying the preset use condition and the to-be-divided The sum of the occurrences in the text.
  • the word frequency of the words in the corpus is thus corrected by the text to be segmented and is correlated with it, so that the word frequencies in the corpus are closer to the actual usage in the text to be segmented. This brings the semantics of the segmentation result closer to the semantics expressed by the text to be segmented and improves the correctness of string segmentation.
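  • steps 201 to 203 can be sketched as follows. The function names count_words and build_corpus are hypothetical, and lower-casing the text and treating every non-word, non-space character as punctuation are simplifying assumptions that the text does not prescribe.

```python
import re
from collections import Counter


def count_words(text, word_library):
    """Remove punctuation, split on whitespace, keep only library words."""
    # Assumption: text is lower-cased before matching against the library.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    return Counter(token for token in cleaned.split() if token in word_library)


def build_corpus(word_library, reference_texts, text_to_split):
    """Word frequency = occurrences in reference texts + in text to split."""
    corpus = Counter({word: 0 for word in word_library})
    for text in reference_texts:
        corpus.update(count_words(text, word_library))
    corpus.update(count_words(text_to_split, word_library))
    return dict(corpus)


word_library = {"sleep", "shirt", "dress"}
reference_texts = ["I sleep in a shirt. Sleep well!"]
corpus = build_corpus(word_library, reference_texts, "sleep shirt, xyz")
print(corpus)
```

  • in this small example sleep is counted twice in the reference text and once in the text to be split, giving a word frequency of 3; shirt has a word frequency of 2, dress stays at 0, and xyz is discarded because it is not in the word library.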
  • the cloud server in this embodiment may also interact with the user equipment, so that the user can know the segmentation result.
  • the detailed embodiments are described in detail below.
  • FIG. 11 is a schematic flowchart of a word segmentation method according to an embodiment of the present invention.
  • the word segmentation method of the string can be implemented by a word segmentation device of the string.
  • the device can be implemented by software and/or hardware.
  • the word segmentation device can also be configured into a user device, such as a computer, a cell phone, a tablet, and the like. In this embodiment, a detailed description will be given by taking the word segmentation device as a user equipment as an example.
  • the method includes:
  • Step 301 Send, to the cloud server, the text to be divided input by the user, so that the cloud server obtains the character string to be divided and determines the segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;
  • Step 302 Receive segmentation result information of the character string to be divided that is fed back by the cloud server, where the segmentation result information includes a segmentation result of the character string to be segmented; wherein, the segmentation of the character string to be segmented The result is the forward segmentation result or the reverse segmentation result;
  • Step 303 Output the segmentation result to a user.
  • in step 301, when the user browses the e-commerce platform through an application or a browser installed on the user device and needs to find a certain product, the user device acquires the text to be divided input by the user and then sends that text to the cloud server. Specifically, the user can input the text to be divided by voice or as text.
  • after obtaining the text to be segmented, the cloud server obtains the character string to be segmented from it and then performs word segmentation processing on the character string, thereby obtaining the forward segmentation result, the word frequency of each first word in the forward segmentation result, the first word frequency sum value, the reverse segmentation result, the word frequency of each second word in the reverse segmentation result, the second word frequency sum value, and the final segmentation result. For the specific implementation of the word segmentation processing performed by the cloud server, refer to the embodiments shown in FIG. 2 to FIG. 10 above; details are not described herein again.
  • step 302 after obtaining the segmentation result, the cloud server feeds back the segmentation result information of the character string to be segmented to the user equipment, and the segmentation result information includes the segmentation result.
  • step 303 after obtaining the segmentation result, the user equipment outputs the segmentation result to the user.
  • the user equipment may output the segmentation result in the form of voice or text.
  • in the word segmentation method provided by this embodiment, the text to be divided input by the user is sent to the cloud server, so that the cloud server obtains the character string to be segmented and determines the segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result. Because the cloud server segments the character string in both directions, it can identify interfering characters at the head or the tail of the string, which improves the segmentation success rate, and determining the final segmentation result based on word frequency increases the probability that each word in the result is semantically correct. The user equipment receives the segmentation result information of the character string to be segmented fed back by the cloud server, which includes the segmentation result, and outputs the segmentation result to the user. The user thus learns the segmentation result, and hence the query word corresponding to the final query result, which improves the user experience.
  • FIG. 12 is a signaling flowchart of a word segmentation method for a character string according to an embodiment of the present invention. As shown in FIG. 12, the method includes:
  • Step 401 The user equipment acquires the text to be divided input by the user;
  • Step 402 The user equipment sends the text to be divided input by the user to the cloud server.
  • Step 403 The cloud server obtains a character string to be divided according to the text to be divided, and determines a segmentation result of the character string to be divided.
  • Step 404 The cloud server sends, to the user equipment, segmentation result information of the character string to be divided.
  • Step 405 The user equipment outputs the segmentation result information to the user.
  • step 406 to step 408 may also be performed.
  • Step 406 The user equipment acquires a segmentation result to be processed determined by the user.
  • Step 407 The user equipment sends the segmentation result to be processed to the cloud server.
  • Step 408 Perform natural language processing on the segmentation result to be processed.
  • the user device interacts with the cloud server, so that the user can not only learn the segmentation result information but also determine the segmentation result to be processed, which improves the user experience.
  • the following describes in detail how the user equipment obtains the text to be divided input by the user and how the user equipment outputs the segmentation result information to the user.
  • a detailed description will be made by taking shopping by an e-commerce platform as an example.
  • a person skilled in the art can understand that the scenario is only an exemplary scenario, and the method can also be applied to a scenario such as a webpage search.
  • this embodiment does not limit the specific scenario.
  • FIG. 13 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • the user can input the type of the item to be viewed in the search box of the display interface of the user device.
  • the user inputs the text of “slee pshirt” in the search box of the display interface, and the user equipment sends the text to the cloud server.
  • after the cloud server obtains the text to be divided, it processes the text to obtain the character string "sleepshirt" to be divided.
  • the cloud server performs the segmentation process on the character string to be divided, and the specific segmentation process and the segmentation process result are shown in the embodiment shown in FIG. 5, and details are not described herein again.
  • after the cloud server obtains the segmentation result, it returns the segmentation result information to the user equipment. After receiving the segmentation result information, the user equipment outputs the segmentation result to the user according to that information.
  • the implementation process of the user equipment output segmentation result will be specifically described below with reference to FIG. 14 to FIG.
  • FIG. 14 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • the segmentation result information includes the segmentation result of the character string to be segmented, and the segmentation result is displayed correspondingly on the display interface of the user equipment. As shown in FIG. 14, the segmentation result "sleep shirt" is displayed on the display interface.
  • FIG. 15 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • the segmentation result information includes a segmentation result of the character string to be segmented, a segmentation type corresponding to the segmentation result, and the segmentation type is forward segmentation or reverse segmentation.
  • the segmentation result and the segmentation type of the segmentation result are displayed on the display interface of the user device. As shown in FIG. 15, the segmentation result "sleep shirt" is displayed on the display interface, together with the segmentation type "reverse segmentation" of that result.
  • FIG. 16 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • the segmentation result information includes the forward segmentation result, the reverse segmentation result, and the final segmentation result.
  • the forward segmentation result and the reverse segmentation result are displayed on the display interface of the user equipment, and the one that is the segmentation result of the string to be segmented is marked.
  • the reverse segmentation result "sleep shirt" and the forward segmentation result "sleeps hirt" are displayed on the display interface, and a gray background marks the reverse segmentation result as the segmentation result of the string to be segmented.
  • FIG. 17 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • the segmentation result information further includes the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result.
  • the reverse segmentation result and the word frequency of each second word in it are displayed, and the forward segmentation result and the word frequency of each first word in it are also displayed.
  • the word frequencies may be displayed directly on the display interface, or the content shown in FIG. 16 may be displayed on the display interface first; then, after a word frequency display instruction triggered by the user operating the display interface is acquired, the word frequency of each first word and/or the word frequency of each second word is displayed according to the word frequency display instruction.
  • the specific display content can be as shown in FIG.
  • FIG. 18 is a schematic diagram of a display interface of a word segmentation method according to an embodiment of the present invention.
  • the segmentation result information further includes the first word frequency sum value corresponding to the first words in the forward segmentation result and the second word frequency sum value corresponding to the second words in the reverse segmentation result.
  • the reverse segmentation result and the second word frequency sum value corresponding to the second words are displayed, and the forward segmentation result and the first word frequency sum value corresponding to the first words are also displayed.
  • the content shown in FIG. 18 may be displayed directly on the display interface, or the content shown in FIG. 16 may be displayed on the display interface first; then, after a word frequency display instruction triggered by the user operating the display interface is acquired, the first word frequency sum value and/or the second word frequency sum value is displayed according to the word frequency display instruction.
  • the specific display content can be as shown in FIG. 18.
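  • the fields that FIGS. 14 to 18 describe as part of the segmentation result information can be gathered into one record, sketched below. The class name SegmentationResultInfo and all field names are hypothetical; the text lists what the information may contain but does not define a concrete data format.

```python
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class SegmentationResultInfo:
    """Hypothetical container for the information fed back to the user device."""

    result: List[str]             # final segmentation result (FIG. 14)
    segmentation_type: str        # "forward" or "reverse" (FIG. 15)
    forward_result: List[str]     # forward segmentation result (FIG. 16)
    reverse_result: List[str]     # reverse segmentation result (FIG. 16)
    first_word_freqs: List[int]   # one frequency per first word (FIG. 17)
    second_word_freqs: List[int]  # one frequency per second word (FIG. 17)

    @property
    def first_sum(self) -> int:
        """First word frequency sum value (FIG. 18)."""
        return sum(self.first_word_freqs)

    @property
    def second_sum(self) -> int:
        """Second word frequency sum value (FIG. 18)."""
        return sum(self.second_word_freqs)


info = SegmentationResultInfo(
    result=["sleep", "shirt"],
    segmentation_type="reverse",
    forward_result=["sleeps", "hirt"],
    reverse_result=["sleep", "shirt"],
    first_word_freqs=[100, 10],
    second_word_freqs=[10000, 9000],
)
print(asdict(info), info.first_sum, info.second_sum)
```

  • a display interface could show only result for FIG. 14 and progressively more fields for FIGS. 15 to 18; which fields are actually transmitted is left open here.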
  • the user can determine, by operating the display interface, the segmentation result to be processed by the cloud server. Specifically, the user can operate on the forward segmentation result or the reverse segmentation result by clicking, sliding, or the like. The user equipment obtains operation information according to the user's operation on the forward segmentation result or the reverse segmentation result, and determines the segmentation result to be processed according to the operation information.
  • the user equipment acquires operation information according to the click operation; the operation information specifically indicates that the user has selected the reverse segmentation result, and the user equipment determines according to the operation information that the segmentation result to be processed is the reverse segmentation result. The user equipment then feeds back the segmentation result to be processed to the cloud server, and the cloud server subsequently processes it.
  • the user can determine the object to be searched for according to the forward segmentation result and the reverse segmentation result, which improves the accuracy and effectiveness of the search.
  • the word frequency is also displayed on the display interface, and after seeing the word frequency, the user can quickly make a more correct judgment and improve the user experience.
  • a word segmentation device of a character string will be described in detail below.
  • the word segmentation device of the string can be implemented on various devices, such as a server device, a server, a web server, and the like.
  • Those skilled in the art will appreciate that the word segmentation device of the string can be constructed using commercially available hardware components configured by the steps taught by the present solution.
  • the modules related to the control function and the update function in the following embodiments may be implemented using components such as a single chip microcomputer, a microcontroller, a microprocessor, and the like from a company such as Texas Instruments, Intel Corporation, and ARM Corporation.
  • FIG. 19 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention. As shown in Figure 19, the device includes:
  • a first segmentation module 10 configured to obtain a forward segmentation result of the character string to be segmented, the forward segmentation result including at least one first word
  • a second segmentation module 11 configured to acquire a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word
  • a word frequency acquisition module 12 configured to obtain a word frequency of each of the first words and a word frequency of each of the second words, where the word frequency is the number of times, counted in advance, that each word appears in the preset text;
  • a result determining module 13 configured to determine, according to the word frequency of each of the first words and the word frequency of each of the second words, a segmentation result of the character string to be divided, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.
  • the word segmentation device of the character string provided by the embodiment of the present application may perform the foregoing method embodiments, and the implementation principle and technical effects thereof are similar, and details are not described herein again.
  • FIG. 20 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention. This embodiment is implemented on the basis of the embodiment of FIG. 19, and the details are as follows:
  • the first segmentation module 10 is specifically configured to perform an operation of forward segmentation on the character string to be divided, and determine whether a first word is acquired;
  • if so, the character string to be divided with the first word removed is used as the new character string to be divided, and the operation of performing forward segmentation on the character string to be segmented is performed again;
  • if not, the first character in the forward direction of the character string to be divided is deleted to obtain the processed character string to be divided, the processed character string to be divided is used as the new character string to be divided, and the operation of performing forward segmentation on the character string to be segmented is performed again;
  • the operation of performing forward segmentation on the character string to be segmented is repeated until segmentation of the character string to be segmented ends, and the forward segmentation result is obtained.
  • the second segmentation module 11 is specifically configured to perform an operation of reverse segmentation on the character string to be divided, and determine whether a second word is acquired;
  • if so, the character string to be divided with the second word removed is used as the new character string to be divided, and the operation of performing reverse segmentation on the character string to be divided is performed again;
  • if not, the first character in the reverse direction of the character string to be divided is deleted to obtain the processed character string to be divided, the processed character string to be divided is used as the new character string to be divided, and the operation of performing reverse segmentation on the character string to be divided is performed again;
  • the operation of performing reverse segmentation on the character string to be segmented is repeated until segmentation of the character string to be segmented ends, and the reverse segmentation result is obtained.
  • the device further includes: a text obtaining module 14 configured to acquire the text to be divided, perform a symbol deletion operation on the text to be divided, and obtain the character string to be divided.
  • the device further includes: a dictionary tree building module 15 configured to construct a forward dictionary tree and a reverse dictionary tree;
  • the first segmentation module 10 is specifically configured to:
  • the second segmentation module 11 is specifically configured to:
  • a word frequency of the word corresponding to the first node is stored in each first node of the forward dictionary tree, and a word frequency of the word corresponding to the second node is stored in each second node of the reverse dictionary tree.
  • the word frequency acquisition module 12 is specifically configured to:
  • the device further includes: a corpus construction module 16 configured to construct a corpus, the corpus including a word library and the word frequency of each word in the word library;
  • the dictionary tree construction module 15 is specifically configured to construct a forward dictionary tree and a reverse dictionary tree according to the corpus, and store the word frequency of each word to the corresponding first node and the second node.
  • the preset text includes: text that meets preset usage conditions and text to be divided; the corpus construction module 16 is specifically configured to:
  • the corpus is constructed according to the word library and the number of times the words in the word library appear in the text satisfying the preset use condition and in the text to be divided.
  • the corpus construction module 16 is specifically configured to:
  • the result determining module 13 is specifically configured to:
  • the device further includes: a feedback module 17;
  • the text obtaining module 14 is specifically configured to acquire the text to be divided sent by the user equipment;
  • the feedback module 17 is configured to feed back the segmentation result information of the character string to be segmented to the user equipment, where the segmentation result information includes a segmentation result of the character string to be segmented, so that the user equipment outputs the segmentation result to the user.
  • the device further includes: a result obtaining module 18 and a processing module 19,
  • the result obtaining module 18 is configured to acquire a segmentation result to be processed sent by the user equipment;
  • the processing module 19 is configured to perform natural language processing on the segmentation result to be processed.
  • the word segmentation device of the character string provided by the embodiment of the present application may perform the foregoing method embodiments, and the implementation principle and technical effects thereof are similar, and details are not described herein again.
  • FIG. 21 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention. As shown in Figure 21, the device includes:
  • the sending module 20 is configured to send, to the cloud server, the text to be divided that is input by the user, so that the cloud server obtains the character string to be segmented and determines a segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;
  • the receiving module 21 is configured to receive the segmentation result information of the character string to be segmented fed back by the cloud server, where the segmentation result information includes a segmentation result of the character string to be segmented, and the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result;
  • the output module 22 is configured to output the segmentation result to a user.
  • the word segmentation device of the character string provided by the embodiment of the present application may perform the foregoing method embodiments, and the implementation principle and technical effects thereof are similar, and details are not described herein again.
  • FIG. 22 is a schematic structural diagram of a word segmentation device according to an embodiment of the present invention. As shown in FIG. 22, the embodiment is implemented on the basis of the embodiment shown in FIG. 21, and the details are as follows:
  • the output module 22 is specifically configured to display the segmentation result on a display interface.
  • the segmentation result information further includes a segmentation type corresponding to the segmentation result, where the segmentation type is forward segmentation or reverse segmentation;
  • the output module 22 is specifically configured to display the segmentation result and the segmentation type of the segmentation result on a display interface.
  • the segmentation information further includes a reverse segmentation result
  • the segmentation information further includes a forward segmentation result
  • the output module 22 is configured to display the forward segmentation result and the reverse segmentation result on the display interface, and label the segmentation result corresponding to the to-be-divided string.
  • the segmentation information further includes a word frequency of each of the first words in the forward segmentation result and a word frequency of each of the second words in the reverse segmentation result;
  • the display device further includes: an instruction acquisition module 23, configured to acquire a word frequency display instruction triggered by the user to operate the display interface;
  • the output module 22 is further configured to display, according to the word frequency display instruction, a word frequency of each of the first words and/or a word frequency of each of the second words;
  • the output module 22 is specifically configured to display, on the display interface, the forward segmentation result and the word frequency of each first word in the forward segmentation result, as well as the reverse segmentation result and the word frequency of each second word in the reverse segmentation result.
  • the segmentation information further includes a first word frequency sum value corresponding to the first words in the forward segmentation result and a second word frequency sum value corresponding to the second words in the reverse segmentation result;
  • the display device further includes: an instruction acquisition module 23, configured to acquire a word frequency display instruction triggered by the user to operate the display interface;
  • the output module 22 is further configured to display the first word frequency sum value and/or the second word frequency sum value according to the word frequency display instruction;
  • the output module 22 is specifically configured to display, on the display interface, the forward segmentation result, the first word frequency sum value, and the reverse segmentation result and the second word frequency sum value.
  • the device further includes: an operation information acquiring module 24, configured to acquire operation information of the user on the forward segmentation result or the reverse segmentation result on the display interface;
  • a determining module 25, configured to determine, according to the operation information, a segmentation result to be processed;
  • the sending module 20 is further configured to send the segmentation result to be processed to the cloud server, so that the cloud server performs natural language processing on the segmentation result to be processed.
  • the word segmentation device of the character string provided by the embodiment of the present application may perform the foregoing method embodiments, and the implementation principle and technical effects thereof are similar, and details are not described herein again.
  • FIG. 23 is a schematic structural diagram of hardware of a word segmentation device of a character string according to an embodiment of the present invention.
  • the word segmentation device of the character string may include an input device 30, a processor 31, a memory 32, and at least one communication bus 33 and an output device 34.
  • the communication bus 33 is used to implement a communication connection between components.
  • Memory 32 may include high speed RAM memory, and may also include non-volatile memory NVM, such as at least one disk memory, in which various programs may be stored for performing various processing functions and implementing the method steps of the present embodiments.
  • the input device 30 is configured to acquire text to be divided
  • the reverse segmentation result includes at least one second word; the word frequency of each first word and the word frequency of each second word are obtained, where the word frequency is the number of times a predetermined word occurs in the preset text; a segmentation result of the character string to be segmented is determined according to the word frequency of each first word and the word frequency of each second word, where the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.
  • the output device 34 is configured to feed back the segmentation result information of the character string to be segmented to the user equipment, where the segmentation result information includes a segmentation result of the character string to be segmented, so that the user equipment outputs the segmentation result.
  • the processor 31 is further configured to perform the foregoing method shown in FIG. 2 to FIG. 10, where the input device 30 performs the input operations and the output device 34 performs the output operations; for the specific implementation process, refer to the foregoing embodiments, and details are not described herein again.
  • FIG. 24 is a schematic structural diagram of hardware of a cloud server according to an embodiment of the present invention.
  • the cloud server may include an input device 40, a processor 41, a memory 42 and at least one communication bus 43 and an output device 44.
  • Communication bus 43 is used to implement a communication connection between the components.
  • Memory 42 may include high speed RAM memory, and may also include non-volatile memory NVM, such as at least one disk memory, in which various programs may be stored for performing various processing functions and implementing the method steps of the present embodiments.
  • the input device 40 is configured to acquire text to be divided
  • a processor 41 coupled to the input device 40, configured to: obtain a forward segmentation result of the character string to be segmented, the forward segmentation result including at least one first word; obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word; obtain the word frequency of each first word and the word frequency of each second word, where the word frequency is the number of times a predetermined word occurs in the preset text; and determine a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, where the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.
  • the output device 44 is configured to feed back the segmentation result information of the character string to be segmented to the user equipment, where the segmentation result information includes a segmentation result of the character string to be segmented, so that the user equipment outputs the segmentation result.
  • the processor 41 is further configured to perform the foregoing method shown in FIG. 2 to FIG. 10, where the input device 40 performs the input operations and the output device 44 performs the output operations; details are not described herein again.
  • FIG. 25 is a schematic diagram showing the hardware structure of a word segmentation device of a character string according to an embodiment of the present invention.
  • the word segmentation device of the character string may include an input device 50, a processor 51, a memory 52, and at least one communication bus 53 and an output device 54.
  • the communication bus 53 is used to implement a communication connection between components.
  • Memory 52 may include high speed RAM memory, and may also include non-volatile memory NVM, such as at least one disk memory, in which various programs may be stored for performing various processing functions and implementing the method steps of the present embodiments.
  • the output device 54 is configured to send, to the cloud server, the text to be divided that is input by the user, so that the cloud server obtains the character string to be segmented and determines a segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;
  • the input device 50 is configured to receive the segmentation result information of the character string to be segmented fed back by the cloud server, where the segmentation result information includes a segmentation result of the character string to be segmented, and the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result;
  • the processor 51 is configured to control, according to the segmentation result information, the input device to output the segmentation result to a user.
  • the processor 51 is further configured to perform the foregoing method shown in FIG. 11 to FIG. 18, where the input device 50 performs the input operations and the output device 54 performs the output operations; details are not described herein again.
  • FIG. 26 is a schematic structural diagram of hardware of a user equipment according to an embodiment of the present invention.
  • the word segmentation device of the character string may include an input device 60, a processor 61, a memory 62, and at least one communication bus 63 and an output device 64.
  • Communication bus 63 is used to implement a communication connection between the components.
  • Memory 62 may include high speed RAM memory, and may also include non-volatile memory NVM, such as at least one disk memory, in which various programs may be stored for performing various processing functions and implementing the method steps of the present embodiments.
  • the output device 64 is configured to send, to the cloud server, the text to be divided that is input by the user, so that the cloud server obtains the character string to be segmented and determines a segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;
  • the input device 60 is configured to receive the segmentation result information of the character string to be segmented fed back by the cloud server, where the segmentation result information includes a segmentation result of the character string to be segmented, and the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result;
  • the processor 61 is configured to control, according to the segmentation result information, the input device to output the segmentation result to a user.
  • the processor 61 is further configured to perform the foregoing method shown in FIG. 11 to FIG. 18, where the input device 60 performs the input operations and the output device 64 performs the output operations; details are not described herein again.
  • the processor may be implemented, for example, as a central processing unit (CPU), an application specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or another electronic component.
  • the input device may include a plurality of input devices, for example, at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, and a transceiver.
  • the device-oriented device interface may be a wired interface for data transmission between the device and the device, or may be a hardware insertion interface (for example, a USB interface, a serial port, etc.) for data transmission between the device and the device.
  • the user-oriented user interface may be, for example, a user-oriented control button, a voice input device for receiving voice input, or a touch-sensing device for receiving a user's touch input (e.g., a touch screen with a touch sensing function, a touchpad, etc.); optionally, the programmable interface of the software may be, for example, an entry point for the user to edit or modify the program, such as an input pin interface or an input interface of a chip; optionally, the transceiver may be a radio frequency transceiver chip with a communication function, a baseband processing chip, or a transceiver antenna.
  • the output device may include a plurality of output devices, for example, at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, and a transceiver.
  • the device-oriented device interface may be a wired interface for data transmission between the device and the device, or may be a hardware insertion interface (for example, a USB interface, a serial port, etc.) for data transmission between the device and the device.
  • the user-oriented user interface may be, for example, a user-oriented display device or a voice output device; optionally, the programmable interface of the software may be, for example, an entry point for the user to edit or modify the program, such as an input pin interface or an input interface of a chip; optionally, the transceiver may be a radio frequency transceiver chip with a communication function, a baseband processing chip, or a transceiver antenna.
  • although the terms first, second, third, etc. may be used to describe XXX in embodiments of the invention, these XXX should not be limited to these terms. These terms are only used to distinguish XXX from each other.
  • for example, a first XXX may also be referred to as a second XXX without departing from the scope of the embodiments of the present invention.
  • similarly, a second XXX may also be referred to as a first XXX.
  • the above readable storage medium may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read only memory (EEPROM), erasable programmable read only memory (EPROM), programmable read only memory (PROM), read only memory (ROM), magnetic memory, flash memory, or a magnetic disk or optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A character string segmentation method, apparatus and device, said method comprising: acquiring forward segmentation results of a character string to be segmented, said forward segmentation results comprising at least one first word (101); acquiring backward segmentation results of the character string to be segmented, said backward segmentation results comprising at least one second word (102); acquiring a word frequency of said first word and a word frequency of said second word, said word frequencies being the number of times predetermined words appear in a predetermined text (103); determining segmentation results of said character string to be segmented according to the word frequency of said first word and the word frequency of said second word, wherein the segmentation results of said character string to be segmented are said forward segmentation results or said backward segmentation results (104). The present invention not only improves the success rate of segmentation but also increases the probability that the words in the segmentation results are semantically correct.

Description

Character string segmentation method, apparatus and device

Technical field

The present application relates to computer technology, and in particular to a character string segmentation method, apparatus and device.

Background

Natural language processing uses computers to analyze and understand natural language, so that computers acquire human language capabilities to some extent. Natural language processing of English text often encounters dirty data that does not conform to natural language rules, which greatly degrades the processing results. Therefore, the English text must first be preprocessed by word segmentation to obtain normal natural language consisting of English words, which is then processed using a natural language model.

Dirty data in the prior art mainly includes character strings formed by several words run together because space characters are missing, character strings mixed with interfering characters, and the like. The prior art segments English text as follows: the letters of the character string to be segmented are read one at a time in order, each being appended to the letters already read to form a substring, and the substring is then looked up in a previously obtained English dictionary. If it is found, the substring is a word and is split off from the original character string. The method is then repeated on the remaining character string until word segmentation is complete, or until the remaining character string cannot be segmented and is output as is.

However, when a word in the character string to be segmented combines with the prefix of the following word to form another word, or when the character string is mixed with interfering characters, the prior-art method may segment improperly and cause semantic errors, or may fail to segment at all.

Summary of the invention

The present invention provides a character string segmentation method, apparatus and device, which not only improve the segmentation success rate but also increase the probability that each word in the segmentation result is semantically correct.

In a first aspect, the present invention provides a character string segmentation method, including:

obtaining a forward segmentation result of a character string to be segmented, the forward segmentation result including at least one first word;

obtaining a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word;

obtaining a word frequency of each first word and a word frequency of each second word, the word frequency being the number of times a predetermined word occurs in a preset text;

determining a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, where the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.

In a possible implementation, obtaining the forward segmentation result of the character string to be segmented includes:

performing a forward segmentation operation on the character string to be segmented, and determining whether a first word is obtained;

if so, taking the character string to be segmented with the first word removed as the new character string to be segmented, and returning to the forward segmentation operation;

if not, deleting the first character in the forward direction of the character string to be segmented, taking the resulting character string as the new character string to be segmented, and returning to the forward segmentation operation;

repeating the forward segmentation operation until segmentation of the character string to be segmented is finished, to obtain the forward segmentation result.

The forward segmentation method provided in this embodiment proceeds layer by layer in the forward direction; the successive attempts overcome interfering characters and finally yield the forward segmentation result.
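The forward procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `forward_segment`, the use of a plain set as the dictionary, and longest-prefix matching are all choices made for the example.

```python
def forward_segment(s, words):
    """Greedy forward segmentation; `words` is a set of known words (assumed)."""
    max_len = max((len(w) for w in words), default=0)
    result = []
    while s:
        # Try the longest dictionary word that is a prefix of the remaining string.
        match = next(
            (s[:n] for n in range(min(max_len, len(s)), 0, -1) if s[:n] in words),
            None,
        )
        if match:
            result.append(match)   # a first word was obtained: remove it and continue
            s = s[len(match):]
        else:
            s = s[1:]              # no word found: delete the leading character and retry
    return result
```

With the hypothetical dictionary `{"the", "cat"}`, calling `forward_segment("thexcat", {"the", "cat"})` skips the interfering `x` and returns `["the", "cat"]`.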

In a possible implementation, obtaining the reverse segmentation result of the character string to be segmented includes:

performing a reverse segmentation operation on the character string to be segmented, and determining whether a second word is obtained;

if so, taking the character string to be segmented with the second word removed as the new character string to be segmented, and returning to the reverse segmentation operation;

if not, deleting the first character in the reverse direction of the character string to be segmented, taking the resulting character string as the new character string to be segmented, and returning to the reverse segmentation operation;

repeating the reverse segmentation operation until segmentation of the character string to be segmented is finished, to obtain the reverse segmentation result.

The reverse segmentation method provided in this embodiment proceeds layer by layer in the reverse direction; the successive attempts overcome interfering characters and finally yield the reverse segmentation result.
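The reverse procedure can be sketched in the same spirit. The sketch assumes that "the first character in the reverse direction" is the last character of the string, that matching is longest-suffix, and that the words are reported in reading order; `reverse_segment` and the set-based dictionary are illustrative names, not taken from the source.

```python
def reverse_segment(s, words):
    """Greedy segmentation from the end of the string; `words` is a set (assumed)."""
    max_len = max((len(w) for w in words), default=0)
    result = []
    while s:
        # Try the longest dictionary word that is a suffix of the remaining string.
        match = next(
            (s[-n:] for n in range(min(max_len, len(s)), 0, -1) if s[-n:] in words),
            None,
        )
        if match:
            result.append(match)     # a second word was obtained: remove it and continue
            s = s[:-len(match)]
        else:
            s = s[:-1]               # no word found: delete the trailing character and retry
    result.reverse()                 # words were collected right to left
    return result
```

For the hypothetical dictionary `{"the", "there", "rest"}`, `reverse_segment("therest", ...)` returns `["the", "rest"]`, because scanning from the end finds `rest` before `there` can claim its letters.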

In a possible implementation, the method further includes:

obtaining text to be segmented, and performing a symbol deletion operation on the text to be segmented to obtain the character string to be segmented.
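The symbol deletion step might look like the sketch below. The source does not say which symbols are deleted, so treating ASCII punctuation and whitespace as the symbols is an assumption, as is the helper name `strip_symbols`.

```python
import string

# Assumption: "symbols" means ASCII punctuation and whitespace characters.
_DROP = str.maketrans("", "", string.punctuation + string.whitespace)

def strip_symbols(text):
    """Return the text with the assumed symbol characters removed."""
    return text.translate(_DROP)
```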

In a possible implementation, the method further includes:

constructing a forward dictionary tree and a reverse dictionary tree;

where performing the forward segmentation operation on the character string to be segmented includes:

performing the forward segmentation operation on the character string to be segmented according to the forward dictionary tree;

and performing the reverse segmentation operation on the character string to be segmented includes:

performing the reverse segmentation operation on the character string to be segmented according to the reverse dictionary tree.

In this embodiment, the character string is segmented in the forward or reverse direction according to a dictionary tree. Because words share common lookup paths, after one character is added to the substring read so far, the lookup can continue to the next-level node from the path found before the character was added. This avoids repeated lookups, minimizes unnecessary string comparisons, reduces query time, and improves lookup efficiency.
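To illustrate why a dictionary tree avoids repeated lookups, the sketch below builds a forward tree and a reverse tree and finds the longest matching word in a single walk. The class and function names are hypothetical, and building the reverse tree by inserting each word backwards is an assumption about how such a tree could be organized.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # next character -> child node
        self.is_word = False  # True if the path from the root spells a word

def build_trie(words, reverse=False):
    """Build a forward tree, or a reverse tree if reverse=True (words inserted backwards)."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in (reversed(w) if reverse else w):
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def longest_match(root, chars):
    """Walk the tree once over chars; return the length of the longest word found (0 if none)."""
    node, best = root, 0
    for i, ch in enumerate(chars, start=1):
        node = node.children.get(ch)
        if node is None:
            break             # no word continues along this path
        if node.is_word:
            best = i          # remember the longest word seen so far
    return best
```

Each added character moves one level down from the node already reached, so the prefix is never compared again. For reverse segmentation the same walk is applied to `reversed(s)` on the reverse tree.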

In a possible implementation, each first node of the forward dictionary tree stores the word frequency of the word corresponding to that first node, and each second node of the reverse dictionary tree stores the word frequency of the word corresponding to that second node;

obtaining the word frequency of each first word and the word frequency of each second word includes:

obtaining the word frequency of the first word from the first node corresponding to the first word;

obtaining the word frequency of the second word from the second node corresponding to the second word.
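Storing the frequency in the node that ends a word means the lookup that finds the word also yields its frequency. The sketch below uses nested dictionaries as the tree; the `"$freq"` key, the function names, and the frequency values in the usage note are all made up for illustration.

```python
def insert(root, word, freq):
    """Insert a word into a nested-dict tree and store its frequency at the end node."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$freq"] = freq  # hypothetical marker key; cannot collide with one-character keys

def word_freq(root, word):
    """Return the stored frequency of a word, or 0 if the word is not in the tree."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return 0
    return node.get("$freq", 0)
```

For example, after `insert(tree, "the", 120)` on an empty dict `tree`, `word_freq(tree, "the")` returns `120` while `word_freq(tree, "th")` returns `0`. For a reverse tree, the same functions can be called with `word[::-1]`.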

In a possible implementation, before constructing the forward dictionary tree and the reverse dictionary tree, the method further includes:

constructing a corpus, the corpus including a word library and the word frequencies of the words in the word library;

and constructing the forward dictionary tree and the reverse dictionary tree includes:

constructing the forward dictionary tree and the reverse dictionary tree according to the corpus, and storing the word frequency of each word in the corresponding first node and second node.

In a possible implementation, the preset text includes text that satisfies a preset use condition and the text to be segmented, and constructing the corpus includes:

obtaining the word library according to a dictionary that satisfies the preset use condition;

determining the number of times the words in the word library occur in the text satisfying the preset use condition and in the text to be segmented;

constructing the corpus according to the word library and the number of times the words in the word library occur in the text satisfying the preset use condition and in the text to be segmented.

In a possible implementation, determining the number of times the words in the word library occur in the text to be segmented includes:

obtaining at least one first character string according to the space characters in the text to be segmented;

matching the at least one first character string against the words in the word library to obtain at least one second character string that matches a word in the word library;

determining, according to the number of times each second character string occurs in the text to be segmented, the number of times the words in the word library occur in the text to be segmented.

In the corpus constructed in this embodiment, the word frequencies are corrected using the text to be segmented and are therefore correlated with it. The word frequencies in the corpus thus better reflect how words are used in the text to be segmented, which brings the semantics of the segmentation result closer to the semantics expressed by the text and improves the correctness of character string segmentation.
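The counting steps above can be sketched as follows. The sketch assumes that counts from the text satisfying the preset use condition are already available as a mapping, and that matching is case-insensitive; the function names and the sample data in the usage note are hypothetical.

```python
from collections import Counter

def count_in_target_text(text, word_library):
    """Count word-library words that appear as space-delimited strings in the text."""
    first_strings = text.split()  # split on the space characters in the text
    # Second strings: the first strings that match a word in the library.
    second_strings = [s.lower() for s in first_strings if s.lower() in word_library]
    return Counter(second_strings)

def build_corpus(word_library, reference_counts, target_text):
    """Combine reference-text counts with counts from the text to be segmented."""
    corpus = Counter({w: reference_counts.get(w, 0) for w in word_library})
    corpus.update(count_in_target_text(target_text, word_library))
    return corpus
```

With the library `{"the", "cat", "sat"}`, reference counts `{"the": 10, "cat": 2}` and the text `"The cat satonthe mat the"`, the corpus gives `the` a frequency of 12 and `cat` a frequency of 3; the run-together string `satonthe` matches no library word and contributes nothing.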

In one implementable manner, determining the segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word includes:

summing the word frequencies of all the first words to obtain a first word frequency sum;

summing the word frequencies of all the second words to obtain a second word frequency sum;

if the first word frequency sum is greater than the second word frequency sum, determining that the segmentation result of the character string to be segmented is the forward segmentation result; and

if the second word frequency sum is greater than the first word frequency sum, determining that the segmentation result of the character string to be segmented is the reverse segmentation result.

In one implementable manner, both the forward segmentation and the reverse segmentation use longest-word segmentation.
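The comparison of word frequency sums can be sketched as follows. The function name `choose_segmentation` is hypothetical. The text above does not say what happens when the two sums are equal, so the sketch arbitrarily keeps the forward result in that case; this tie-breaking rule is an assumption.

```python
def choose_segmentation(forward_words, reverse_words, word_freq):
    """Pick the forward or reverse result by comparing word frequency sums.

    word_freq maps each word to its count in the preset text; words missing
    from the mapping are treated as having frequency 0 (an assumption).
    """
    first_sum = sum(word_freq.get(word, 0) for word in forward_words)
    second_sum = sum(word_freq.get(word, 0) for word in reverse_words)
    if second_sum > first_sum:
        return "reverse", reverse_words
    # first_sum > second_sum, or a tie (tie handling is an assumption).
    return "forward", forward_words
```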

In a second aspect, the present invention provides a word segmentation method for a character string, including:

sending, to a cloud server, text to be segmented that is input by a user, so that the cloud server obtains a character string to be segmented and determines a segmentation result according to the word frequency of each first word in a forward segmentation result and the word frequency of each second word in a reverse segmentation result;

receiving segmentation result information for the character string to be segmented fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, wherein the segmentation result is the forward segmentation result or the reverse segmentation result; and

outputting the segmentation result to the user.

In the word segmentation method provided by this embodiment, the text to be segmented that is input by the user is sent to the cloud server, so that the cloud server obtains the character string to be segmented and determines the segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result. Because the cloud server segments the string in both directions, it can identify interfering characters at the head or tail of the string, which improves the segmentation success rate; determining the final result from word frequencies increases the probability that each word in the result is semantically correct. The segmentation result information fed back by the cloud server, which includes the segmentation result, is then received, and the segmentation result is output to the user. The user thereby learns the segmentation result and hence the query words to which the final query result corresponds, which improves the user experience.

In one implementable manner, outputting the segmentation result to the user includes:

displaying the segmentation result on a display interface.

In one implementable manner, the segmentation result information further includes a segmentation type corresponding to the segmentation result, the segmentation type being forward segmentation or reverse segmentation; and

displaying the segmentation result on the display interface includes:

displaying the segmentation result and the segmentation type of the segmentation result on the display interface.

In one implementable manner, if the segmentation result is the forward segmentation result, the segmentation information further includes the reverse segmentation result; or

if the segmentation result is the reverse segmentation result, the segmentation information further includes the forward segmentation result; and

displaying the segmentation result on the display interface includes:

displaying the forward segmentation result and the reverse segmentation result on the display interface, and marking the segmentation result that corresponds to the character string to be segmented.

In one implementable manner, the segmentation information further includes the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;

after displaying the forward segmentation result and the reverse segmentation result on the display interface and marking the segmentation result that corresponds to the character string to be segmented, the method further includes:

obtaining a word frequency display instruction triggered by the user operating the display interface; and

displaying, according to the word frequency display instruction, the word frequency of each first word and/or the word frequency of each second word;

or

displaying the forward segmentation result and the reverse segmentation result on the display interface includes:

displaying, on the display interface, the forward segmentation result together with the word frequencies of the first words in it, and the reverse segmentation result together with the word frequencies of the second words in it.

In one implementable manner, the segmentation information further includes a first word frequency sum corresponding to the first words in the forward segmentation result and a second word frequency sum corresponding to the second words in the reverse segmentation result;

after displaying the forward segmentation result and the reverse segmentation result on the display interface and marking the segmentation result that corresponds to the character string to be segmented, the method further includes:

obtaining a word frequency display instruction triggered by the user operating the display interface; and

displaying, according to the word frequency display instruction, the first word frequency sum and/or the second word frequency sum;

or

displaying the forward segmentation result and the reverse segmentation result on the display interface includes:

displaying, on the display interface, the forward segmentation result and the first word frequency sum, and the reverse segmentation result and the second word frequency sum.

In one implementable manner, after displaying the forward segmentation result and the reverse segmentation result on the display interface, the method further includes:

obtaining operation information of the user on the forward segmentation result or the reverse segmentation result on the display interface;

determining, according to the operation information, a segmentation result to be processed; and

sending the segmentation result to be processed to the cloud server, so that the cloud server performs natural language processing on the segmentation result to be processed.
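As an illustration of the client-side options above, the following Python sketch models the segmentation result information and a plain-text rendering of it. The class `SegmentationResultInfo`, its field names, and the `render` function are all hypothetical; the text above does not define a wire format or a user interface layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SegmentationResultInfo:
    """Hypothetical payload fed back by the cloud server."""
    chosen_type: str                 # "forward" or "reverse"
    forward_words: List[str]         # first words
    reverse_words: List[str]         # second words
    word_freq: Dict[str, int] = field(default_factory=dict)

    def freq_sum(self, words: List[str]) -> int:
        # Word frequency sum for one segmentation result.
        return sum(self.word_freq.get(word, 0) for word in words)


def render(info: SegmentationResultInfo, show_freq: bool = False) -> List[str]:
    """Return display lines; '*' marks the result chosen for the string."""
    lines = []
    for kind, words in (("forward", info.forward_words),
                        ("reverse", info.reverse_words)):
        marker = "*" if kind == info.chosen_type else " "
        line = f"{marker} {kind}: {' '.join(words)}"
        if show_freq:  # stands in for the word frequency display instruction
            line += f" (sum={info.freq_sum(words)})"
        lines.append(line)
    return lines
```

Calling `render(info, show_freq=True)` corresponds to the variant in which the word frequency sums are shown alongside both results.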

In a third aspect, the present invention provides a word segmentation apparatus for a character string, including:

a first segmentation module, configured to obtain a forward segmentation result of a character string to be segmented, the forward segmentation result including at least one first word;

a second segmentation module, configured to obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word;

a word frequency acquisition module, configured to obtain the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times each word appears in a preset text; and

a result determining module, configured to determine a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.

In a fourth aspect, the present invention provides a word segmentation apparatus for a character string, including:

a sending module, configured to send, to a cloud server, text to be segmented that is input by a user, so that the cloud server obtains a character string to be segmented and determines a segmentation result according to the word frequency of each first word in a forward segmentation result and the word frequency of each second word in a reverse segmentation result;

a receiving module, configured to receive segmentation result information for the character string to be segmented fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, wherein the segmentation result is the forward segmentation result or the reverse segmentation result; and

an output module, configured to output the segmentation result to the user.

In a fifth aspect, the present invention provides a word segmentation device for a character string, including:

an input device, configured to obtain text to be segmented; and

a processor, coupled to the input device and configured to: obtain a forward segmentation result of a character string to be segmented, the forward segmentation result including at least one first word; obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word; obtain the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times each word appears in a preset text; and determine a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result is the forward segmentation result or the reverse segmentation result.

In a sixth aspect, the present invention provides a cloud server, including:

an input device, configured to obtain text to be segmented; and

a processor, coupled to the input device and configured to: obtain a forward segmentation result of a character string to be segmented, the forward segmentation result including at least one first word; obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word; obtain the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times each word appears in a preset text; and determine a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result is the forward segmentation result or the reverse segmentation result.

In a seventh aspect, the present invention provides a word segmentation device for a character string, including:

an output device, configured to send, to a cloud server, text to be segmented that is input by a user, so that the cloud server obtains a character string to be segmented and determines a segmentation result according to the word frequency of each first word in a forward segmentation result and the word frequency of each second word in a reverse segmentation result;

an input device, configured to receive segmentation result information for the character string to be segmented fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, wherein the segmentation result is the forward segmentation result or the reverse segmentation result; and

a processor, coupled to the output device and the input device and configured to control, according to the segmentation result information, the input device to output the segmentation result to the user.

In an eighth aspect, the present invention provides a user equipment, including:

an output device, configured to send, to a cloud server, text to be segmented that is input by a user, so that the cloud server obtains a character string to be segmented and determines a segmentation result according to the word frequency of each first word in a forward segmentation result and the word frequency of each second word in a reverse segmentation result;

an input device, configured to receive segmentation result information for the character string to be segmented fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, wherein the segmentation result is the forward segmentation result or the reverse segmentation result; and

a processor, coupled to the output device and the input device and configured to control, according to the segmentation result information, the input device to output the segmentation result to the user.

This embodiment obtains a forward segmentation result including at least one first word and a reverse segmentation result including at least one second word. Segmenting the string in both directions makes it possible to identify interfering characters at the head or tail of the string, which improves the segmentation success rate. The word frequency of each first word and of each second word is then obtained, and the segmentation result of the character string to be segmented is determined from those word frequencies. Determining the final result from word frequencies increases the probability that each word in the segmentation result is semantically correct.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a word segmentation scenario for a character string according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of forward segmentation according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of reverse segmentation according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of forward and reverse segmentation according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of forward segmentation according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of reverse segmentation according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a forward dictionary tree according to an embodiment of the present invention;

FIG. 9 is a schematic diagram of a reverse dictionary tree according to an embodiment of the present invention;

FIG. 10 is a schematic flowchart of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 11 is a schematic flowchart of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 12 is a signaling flowchart of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 13 is a schematic diagram of a display interface of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 14 is a schematic diagram of a display interface of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 15 is a schematic diagram of a display interface of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 16 is a schematic diagram of a display interface of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 17 is a schematic diagram of a display interface of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 18 is a schematic diagram of a display interface of a word segmentation method for a character string according to an embodiment of the present invention;

FIG. 19 is a schematic structural diagram of a word segmentation apparatus for a character string according to an embodiment of the present invention;

FIG. 20 is a schematic structural diagram of a word segmentation apparatus for a character string according to an embodiment of the present invention;

FIG. 21 is a schematic structural diagram of a word segmentation apparatus for a character string according to an embodiment of the present invention;

FIG. 22 is a schematic structural diagram of a word segmentation apparatus for a character string according to an embodiment of the present invention;

FIG. 23 is a schematic diagram of the hardware structure of a word segmentation device for a character string according to an embodiment of the present invention;

FIG. 24 is a schematic diagram of the hardware structure of a cloud server according to an embodiment of the present invention;

FIG. 25 is a schematic diagram of the hardware structure of a word segmentation device for a character string according to an embodiment of the present invention;

FIG. 26 is a schematic diagram of the hardware structure of a user equipment according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments are described in detail here, and examples of them are shown in the drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.

FIG. 1 is a schematic diagram of a word segmentation scenario for a character string according to an embodiment of the present invention. As shown in FIG. 1, a user inputs text to be segmented through the user equipment 100; from the user's point of view, this text is simply the character string the user typed. The user equipment 100 then sends the text to be segmented to the cloud server 200. Because the character string input by the user may contain dirty data, the cloud server 200 performs word segmentation on it. In one specific application scenario, the word segmentation method provided by this embodiment can be applied within natural language processing: the method preprocesses the natural language input to obtain text made up of multiple semantically correct English words, and that text is then used as the input to a natural language model for further processing. For example, the natural language model may be a highlight vocabulary extraction model.

In one specific application scenario, an application corresponding to an e-commerce platform may be installed on the user equipment 100, or a browser may be installed through which the user browses an e-commerce website. When the user buys a product through the application or the e-commerce website, the user first searches for the product. Specifically, the user enters a character string in the input interface of the application or the e-commerce website, and the user equipment 100 sends the character string to the cloud server 200. Because the character string may contain dirty data, the cloud server 200 applies the word segmentation method provided by the present invention to segment the character string into multiple English words. The cloud server 200 then runs the highlight vocabulary extraction model on these English words to obtain information such as the title and attributes of the product, that is, the highlight vocabulary that describes features of the product such as its elements and style, and provides the user with the desired product according to the highlight vocabulary. Optionally, after obtaining the word segmentation result, the cloud server 200 may also feed the result back to the user equipment, so that the user learns the word segmentation result and therefore which words the cloud server uses to find matching products. Further, both the forward segmentation result and the reverse segmentation result may be fed back to the user equipment so that the user selects the word segmentation result; the user equipment 100 then feeds the selected result back to the cloud server 200, and the cloud server 200 performs subsequent processing according to the result selected by the user.

The present invention shows one specific application scenario here. In a specific implementation, the word segmentation method can also be applied to scenarios such as web search. Alternatively, when the user equipment, for example a computer, mobile phone, or tablet, has sufficient processing capability, the word segmentation method may be performed by the user equipment itself. This embodiment places no particular limitation on the application scenarios of the word segmentation method of the present invention. Detailed embodiments are used below to describe how the cloud server segments a character string.

FIG. 2 is a schematic flowchart of a word segmentation method for a character string according to an embodiment of the present invention. The method may be implemented by a word segmentation apparatus for a character string, which may be implemented in software and/or hardware and may be deployed in a device such as a cloud server, computer, mobile phone, or tablet. The method includes:

Step 101: obtaining a forward segmentation result of a character string to be segmented, the forward segmentation result including at least one first word;

Step 102: obtaining a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word;

Step 103: obtaining the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times each word appears in a preset text;

Step 104: determining a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result is the forward segmentation result or the reverse segmentation result.

In this embodiment, the text to be segmented sent by the user equipment is obtained, the character string to be segmented is derived from that text, and the character string is then segmented. A person skilled in the art will understand that the character string to be segmented is a continuous string without any symbols. If the text input by the user contains no symbols, the text to be segmented is itself the character string to be segmented. If the text to be segmented contains spaces or punctuation marks, a symbol deletion operation, that is, removal of the spaces and punctuation marks, is performed on the text to obtain a continuous character string to be segmented.
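The symbol deletion operation can be sketched in Python as follows. The function name `to_segmentable_string` is hypothetical, and the sketch assumes that "punctuation" means the ASCII punctuation characters in `string.punctuation`; the text above does not enumerate the symbols to remove.

```python
import string


def to_segmentable_string(text):
    """Remove whitespace and ASCII punctuation to get a continuous string."""
    return "".join(
        ch for ch in text
        if not ch.isspace() and ch not in string.punctuation
    )
```

For example, under these assumptions `to_segmentable_string("floor-length, sleeveless dress!")` yields `floorlengthsleevelessdress`.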

在获取到字符串之后,执行步骤101和步骤102,通过对待分割的字符串分别进行正向分割和反向分割,获取正向分割结果和反向分割结果。本领域技术人员可以理解,本实施例中对待分割的字符串进行正向分割以获取正向分割结果,与对待分割的字符串进行反向分割以获取反向分割结果的过程,没有严格的时序关系。After the character string is obtained, step 101 and step 102 are performed, and the forward segmentation result and the reverse segmentation result are obtained by performing forward segmentation and reverse segmentation respectively on the segment string to be segmented. A person skilled in the art can understand that in this embodiment, the character string to be segmented is forwardly segmented to obtain a forward segmentation result, and the segment to be segmented is inversely segmented to obtain a reverse segmentation result. There is no strict timing. relationship.

下面以几个具体的例子,来说明对字符串进行正向分割和反向分割的过程。The following is a few specific examples to illustrate the process of forward segmentation and reverse segmentation of strings.

一个具体的实施例,图3为本发明一实施例提供的正向分割示意图。 如图3所示,本实施例对字符串floorlengthsleevelessdressst进行正向分割,最终的正向分割结果为多个第一单词:floor length sleevelessdress。A specific embodiment, FIG. 3 is a schematic diagram of forward splitting according to an embodiment of the present invention. As shown in FIG. 3, this embodiment performs forward segmentation on the string floorlengthsleevelessdressst, and the final forward segmentation result is a plurality of first words: floor length sleevelessdress.

具体的正向分割过程为:从左到右取字符,每取一次查一次词典,来判断是否取到一个单词,当取到floor时,还会继续尝试floorl、floorle、floorlen直至取完整个字符串,或者达到预设字符串长度,该预设字符串长度为单词的最长长度,然后在所有单词中,取长度最长的单词作为分割结果,由于后续没有单词,则floor即为分割结果。The specific forward segmentation process is: taking characters from left to right, checking the dictionary once every time to determine whether to take a word. When taking the floor, it will continue to try floorl, floor, floorlen until the whole character is taken. String, or reach the preset string length, the preset string length is the longest length of the word, and then among all the words, take the longest word as the segmentation result. Since there is no word afterwards, the floor is the segmentation result. .

Likewise, a person skilled in the art can understand that, since the length of sleeveless is 10 and the length of sleeve is 6, sleeveless is the segmentation result, whereas sleeve and less are not. This embodiment uses longest-word matching because it best preserves the semantics: it is uncommon for two words written together to form another word, but when they do, the combined word is more likely to match the intended meaning.
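The longest-match rule described above can be illustrated with a minimal Python sketch. The dictionary `WORDS` and the function names are hypothetical and chosen only for this example; the text does not prescribe a particular implementation.

```python
# Hypothetical dictionary; a real system would load a full word library.
WORDS = {"floor", "length", "sleeve", "less", "sleeveless", "dress"}
MAX_WORD_LEN = max(len(w) for w in WORDS)  # the preset maximum word length


def longest_word_at_start(s, words=WORDS, max_len=MAX_WORD_LEN):
    """Return the longest dictionary word that is a prefix of s, or None."""
    best = None
    for end in range(1, min(len(s), max_len) + 1):
        if s[:end] in words:
            best = s[:end]  # keep trying: a longer word may still follow
    return best


def forward_segment(s):
    """Greedy longest-match segmentation from left to right.

    Stops at the first position where no dictionary word starts and
    returns the words found so far together with the unsegmented rest.
    """
    result = []
    while s:
        word = longest_word_at_start(s)
        if word is None:
            break
        result.append(word)
        s = s[len(word):]
    return result, s


print(forward_segment("floorlengthsleevelessdressst"))
# (['floor', 'length', 'sleeveless', 'dress'], 'st')
```

With this dictionary, sleeveless wins over sleeve because it is the longest candidate, and the trailing st is left unsegmented.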

It should be noted that, in a specific implementation, the forward or reverse segmentation of this embodiment may also use other segmentation approaches from the prior art, which is not particularly limited here.

However, when reverse segmentation is performed on the character string floorlengthsleevelessdressst, the interfering characters st at its end cause the reverse segmentation result to be an erroneous second word.

In another specific example, FIG. 4 is a schematic diagram of reverse segmentation according to an embodiment of the present invention. As shown in FIG. 4, this embodiment performs reverse segmentation on the character string ssfloorlengthsleevelessdress.

The reverse segmentation process is as follows. Characters are taken from right to left, and the dictionary is looked up each time a character is taken to determine whether a word has been obtained. The rest of the process is similar to forward segmentation and is not repeated here. The final reverse segmentation result is a plurality of second words: floor length sleeveless dress.

However, when forward segmentation is performed on the character string ssfloorlengthsleevelessdress, the interfering characters ss at its start cause the forward segmentation result to be an erroneous first word.

In yet another specific example, forward segmentation of the character string sleepshirt yields sleeps hirt, whereas reverse segmentation of the same string yields sleep shirt.
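The sleepshirt example can be reproduced with a small sketch of both directions. The dictionary below is an assumption made for illustration (it must contain sleeps and hirt for the forward result to come out as described), and the function names are hypothetical.

```python
# Hypothetical dictionary containing all four candidate words.
WORDS = {"sleep", "sleeps", "shirt", "hirt"}
MAX_WORD_LEN = max(len(w) for w in WORDS)


def forward_segment(s):
    """Longest-match segmentation from left to right; None on failure."""
    out = []
    while s:
        limit = min(len(s), MAX_WORD_LEN)
        candidates = [s[:n] for n in range(1, limit + 1) if s[:n] in WORDS]
        if not candidates:
            return None
        word = candidates[-1]  # the last candidate is the longest
        out.append(word)
        s = s[len(word):]
    return out


def reverse_segment(s):
    """Longest-match segmentation from right to left; None on failure."""
    out = []
    while s:
        limit = min(len(s), MAX_WORD_LEN)
        candidates = [s[-n:] for n in range(1, limit + 1) if s[-n:] in WORDS]
        if not candidates:
            return None
        word = candidates[-1]  # the last candidate is the longest
        out.append(word)
        s = s[:-len(word)]
    return out[::-1]  # words were collected right to left


print(forward_segment("sleepshirt"))  # ['sleeps', 'hirt']
print(reverse_segment("sleepshirt"))  # ['sleep', 'shirt']
```

The two directions disagree because greedy matching commits to sleeps when scanning from the left but to shirt when scanning from the right.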

In step 103, the word frequency of each first word and the word frequency of each second word are obtained. The word frequency is the predetermined number of times a word appears in a preset text. The preset text may be, for example, a complete collection of English literature or an English textbook.

Specifically, the examples above are used for illustration. In the embodiment shown in FIG. 3, forward segmentation of floorlengthsleevelessdressst yields a plurality of correct first words, floor length sleeveless dress, whereas reverse segmentation of the same string yields an erroneous second word. In this case, the word frequency of the second word is infinitely small.

In the embodiment shown in FIG. 4, reverse segmentation of the character string ssfloorlengthsleevelessdress yields a plurality of correct second words, floor length sleeveless dress, whereas forward segmentation yields an erroneous first word. In this case, the word frequency of the first word is infinitely small.

In the sleepshirt example above, forward segmentation yields two valid first words and reverse segmentation yields two valid second words. FIG. 5 is a schematic diagram of forward and reverse segmentation according to an embodiment of the present invention. As shown in FIG. 5, the forward segmentation result is sleeps hirt, where the word frequency of sleeps is 100 and that of hirt is 10; the reverse segmentation result is sleep shirt, where the word frequency of sleep is 10000 and that of shirt is 9000.

A person skilled in the art can understand that if, during forward or reverse segmentation, the character string itself is a correct word, the word frequency of that word is infinite.

In step 104, the segmentation result of the character string to be segmented is determined according to the word frequency of each first word and the word frequency of each second word. Specifically, the word frequencies of all first words may be summed to obtain a first word-frequency sum, and the word frequencies of all second words may be summed to obtain a second word-frequency sum. If the first word-frequency sum is greater than the second, the forward segmentation result is taken as the segmentation result of the character string; if the second word-frequency sum is greater than the first, the reverse segmentation result is taken as the segmentation result.

Take the embodiments shown in FIG. 3 to FIG. 5 as examples. In the embodiment shown in FIG. 3, no reverse segmentation result can be obtained, so the word frequency of the second word is infinitely small and the segmentation result is the forward segmentation result. In the embodiment shown in FIG. 4, no forward segmentation result can be obtained, so the word frequency of the first word is infinitely small and the segmentation result is the reverse segmentation result. In the embodiment shown in FIG. 5, the first word-frequency sum is 110 and the second word-frequency sum is 19000, so the segmentation result is the reverse segmentation result.
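The comparison of word-frequency sums can be sketched as follows. The function name is hypothetical, a failed segmentation is represented here by `None` and scored as negative infinity to model the "infinitely small" frequency, and the tie-breaking rule is an assumption because the text does not say what happens when the two sums are equal.

```python
import math


def choose_segmentation(forward, reverse, freq):
    """Pick the result whose words have the larger word-frequency sum.

    forward / reverse: list of words, or None if that direction failed.
    freq: mapping from word to its word frequency.
    """
    def total(words):
        if not words:
            return -math.inf  # failed segmentation: "infinitely small"
        return sum(freq.get(w, 0) for w in words)

    first_sum, second_sum = total(forward), total(reverse)
    if second_sum > first_sum:
        return reverse
    return forward  # also used on a tie (an assumption, not from the text)


# Frequencies taken from the FIG. 5 example.
freq = {"sleeps": 100, "hirt": 10, "sleep": 10000, "shirt": 9000}
print(choose_segmentation(["sleeps", "hirt"], ["sleep", "shirt"], freq))
# ['sleep', 'shirt']  (110 < 19000)
```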

A person skilled in the art can understand that, in a specific implementation, a word-frequency threshold may also be set. The number of words in the forward segmentation result whose frequency exceeds the threshold and the number of such words in the reverse segmentation result are then determined, and whichever result has the larger number is taken as the final segmentation result. The word frequency may also be transformed in various ways before the segmentation result is determined. That is, any implementation that uses the word frequencies of the first words and of the second words to ensure that the words in the segmentation result are commonly used words, and thus semantically correct, falls within the protection scope of the present invention.
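The threshold-based variant can be sketched in the same style. The function name and the threshold value of 500 are hypothetical, and preferring the forward result on a tie is an assumption.

```python
def choose_by_threshold(forward, reverse, freq, threshold):
    """Pick the result containing more words above the frequency threshold."""
    def count(words):
        return sum(1 for w in words if freq.get(w, 0) > threshold)

    # On a tie the forward result is kept (an assumption, not from the text).
    return forward if count(forward) >= count(reverse) else reverse


freq = {"sleeps": 100, "hirt": 10, "sleep": 10000, "shirt": 9000}
print(choose_by_threshold(["sleeps", "hirt"], ["sleep", "shirt"], freq, 500))
# ['sleep', 'shirt']  (0 words above the threshold versus 2)
```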

In this embodiment, a forward segmentation result including at least one first word and a reverse segmentation result including at least one second word are obtained. Segmenting the character string in both directions allows interfering characters at the head or the tail of the string to be identified, which improves the segmentation success rate. The word frequency of each first word and of each second word is then obtained, and the segmentation result of the character string to be segmented is determined according to these word frequencies. Determining the final segmentation result on the basis of word frequency increases the probability that each word in the segmentation result is semantically correct.

As can be seen from the above, in the embodiment shown in FIG. 3 reverse segmentation cannot obtain a correct second word, and in the embodiment shown in FIG. 4 forward segmentation cannot obtain a correct first word. This embodiment further improves the word segmentation method so that, in the presence of interfering characters, a plurality of correct second words can also be obtained for the character string of FIG. 3, and a plurality of correct first words can also be obtained for the character string of FIG. 4. A detailed description is given below with reference to FIG. 6 and FIG. 7.

FIG. 6 is a schematic diagram of forward segmentation according to an embodiment of the present invention. As shown in FIG. 6, forward segmentation is performed on the character string to be segmented, ssfloorlengthsleevelessdressst, and it is determined whether a first word is obtained. Because of the interfering characters ss, no first word can be obtained, so the first character of the string in the forward direction is deleted, that is, the leading s is removed, giving a processed character string. The processed character string is taken as the new character string to be segmented, and forward segmentation is performed again. Because an interfering character s remains, a first word still cannot be obtained, so the leading s of the processed string is deleted as well. The processed string is again taken as the new character string to be segmented and forward segmentation is performed again, which yields the first word floor. The string with this first word removed then becomes the new character string to be segmented, and forward segmentation is repeated in this way until the whole string has been processed, giving the forward segmentation result.

A person skilled in the art can understand that an interfering character in the middle of the character string becomes the first character of the remaining string once the first words already segmented have been removed. When forward segmentation fails to obtain a correct first word at that point, the interfering character can likewise be deleted and forward segmentation continued until the whole string has been processed, giving the forward segmentation result. In the end, the forward segmentation result is floor length sleeveless dress.

The forward segmentation method provided in this embodiment is a layer-by-layer, progressive forward segmentation: by trying one layer after another, it overcomes the interfering characters and finally obtains the forward segmentation result.
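The progressive forward segmentation described for FIG. 6 can be sketched as a loop that drops the leading character whenever no word starts at the current position. The dictionary and function name are hypothetical, and the loop is only one possible reading of the layer-by-layer procedure.

```python
# Hypothetical dictionary for the example string.
WORDS = {"floor", "length", "sleeve", "less", "sleeveless", "dress"}
MAX_WORD_LEN = max(len(w) for w in WORDS)


def forward_segment_skipping(s):
    """Forward longest-match segmentation that deletes interfering characters."""
    result = []
    while s:
        match = None
        # Try the longest candidate first so the first hit is the longest word.
        for end in range(min(len(s), MAX_WORD_LEN), 0, -1):
            if s[:end] in WORDS:
                match = s[:end]
                break
        if match is None:
            s = s[1:]  # no word starts here: delete the first character
        else:
            result.append(match)
            s = s[len(match):]  # continue with the rest of the string
    return result


print(forward_segment_skipping("ssfloorlengthsleevelessdressst"))
# ['floor', 'length', 'sleeveless', 'dress']
```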

FIG. 7 is a schematic diagram of reverse segmentation according to an embodiment of the present invention. As shown in FIG. 7, reverse segmentation is performed on the character string to be segmented, ssfloorlengthsleevelessdressst, and it is determined whether a second word is obtained. Because of the interfering characters st, no second word can be obtained, so the first character of the string in the reverse direction is deleted, that is, the trailing t is removed, giving a processed character string. The processed character string is taken as the new character string to be segmented, and reverse segmentation is performed again. Because an interfering character s remains, a second word still cannot be obtained, so the trailing s of the processed string is deleted as well. The processed string is again taken as the new character string to be segmented and reverse segmentation is performed again, which yields the second word dress. The string with this second word removed then becomes the new character string to be segmented, and reverse segmentation is repeated in this way until the whole string has been processed, giving the reverse segmentation result.

A person skilled in the art can understand that an interfering character in the middle of the character string becomes the first character, in the reverse direction, of the remaining string once the second words already segmented have been removed. When reverse segmentation fails to obtain a correct second word at that point, the interfering character can likewise be deleted and reverse segmentation continued until the whole string has been processed, giving the reverse segmentation result. In the end, the reverse segmentation result is floor length sleeveless dress.

The reverse segmentation method provided in this embodiment is a layer-by-layer, progressive reverse segmentation: by trying one layer after another, it overcomes the interfering characters and finally obtains the reverse segmentation result.
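The progressive reverse segmentation described for FIG. 7 mirrors the forward case: the string is matched from its end, and the trailing character is deleted whenever no word ends at the current position. As before, the dictionary and function name are hypothetical.

```python
# Hypothetical dictionary for the example string.
WORDS = {"floor", "length", "sleeve", "less", "sleeveless", "dress"}
MAX_WORD_LEN = max(len(w) for w in WORDS)


def reverse_segment_skipping(s):
    """Reverse longest-match segmentation that deletes interfering characters."""
    result = []
    while s:
        match = None
        # Try the longest suffix first so the first hit is the longest word.
        for n in range(min(len(s), MAX_WORD_LEN), 0, -1):
            if s[-n:] in WORDS:
                match = s[-n:]
                break
        if match is None:
            s = s[:-1]  # no word ends here: delete the last character
        else:
            result.append(match)
            s = s[:-len(match)]  # continue with the front of the string
    result.reverse()  # words were collected from right to left
    return result


print(reverse_segment_skipping("ssfloorlengthsleevelessdressst"))
# ['floor', 'length', 'sleeveless', 'dress']
```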

Further, on the basis of the above embodiments, in order to improve the efficiency of word lookup, the present application may also construct a forward dictionary tree and a reverse dictionary tree in advance, that is, before forward and reverse word segmentation. During segmentation, the character string to be segmented can then be segmented in the forward direction according to the forward dictionary tree and in the reverse direction according to the reverse dictionary tree.

Specifically, a dictionary tree (trie) is a tree structure and a variant of the hash tree. Its advantage is that it uses the common prefixes of strings to reduce query time and to minimize unnecessary string comparisons, so its query efficiency is higher than that of a hash tree. It has three basic properties: the root node contains no character, and every node other than the root contains exactly one character; the characters along the path from the root node to a given node, concatenated, form the string corresponding to that node; and the child nodes of any node all contain different characters.

In addition, a node may store data, such as the frequency of the corresponding word. Each first node of the forward dictionary tree stores the word frequency of the word corresponding to that first node, and each second node of the reverse dictionary tree stores the word frequency of the word corresponding to that second node. Accordingly, the word frequency of a first word is obtained from the first node corresponding to the first word, and the word frequency of a second word is obtained from the second node corresponding to the second word.

FIG. 8 is a schematic diagram of a forward dictionary tree according to an embodiment of the present invention. A forward dictionary tree is a dictionary tree built from the root node down through the child nodes at each level, following the characters of each word in forward order. As shown in FIG. 8, the two words "expend" and "expense" share the prefix "expen"; when they are represented in the forward dictionary tree, their lookup paths have a common part (the section of the path formed by the 5 nodes connected by dashed lines in the forward dictionary tree).

FIG. 9 is a schematic diagram of a reverse dictionary tree according to an embodiment of the present invention. A reverse dictionary tree is a dictionary tree built from the root node down through the child nodes at each level, following the characters of each word in reverse order. As shown in FIG. 9, the two words "endless" and "useless", which share the suffix "less", have a common lookup path (connected by dashed lines) in the reverse dictionary tree. In other words, the reverse dictionary tree gives two or more words with the same suffix a shared section of lookup path.

This embodiment performs forward or reverse segmentation of the character string according to a dictionary tree. Because of the common lookup paths, when one more character is appended to the substring read so far, the lookup can continue from the node reached before the character was added and simply move down to the next level. This avoids repeated lookups, minimizes unnecessary string comparisons, reduces query time and improves lookup efficiency.
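A minimal sketch of a dictionary tree that stores word frequencies in its nodes and supports both directions is shown below. The class names, the `reverse` flag and the frequency values are hypothetical; the sketch only illustrates how one lookup can walk down the tree a character at a time instead of re-querying the dictionary for every substring.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # one child per distinct character
        self.freq = None    # word frequency; None if no word ends at this node


class Trie:
    """Forward dictionary tree, or reverse one when reverse=True."""

    def __init__(self, reverse=False):
        self.root = TrieNode()  # the root node contains no character
        self.reverse = reverse

    def insert(self, word, freq):
        chars = reversed(word) if self.reverse else word
        node = self.root
        for ch in chars:
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq

    def longest_match(self, s):
        """Return (word, freq) for the longest word at the start of s
        (forward tree) or at the end of s (reverse tree), or None."""
        chars = reversed(s) if self.reverse else s
        node, best = self.root, None
        for depth, ch in enumerate(chars, start=1):
            node = node.children.get(ch)
            if node is None:
                break  # no dictionary word continues along this path
            if node.freq is not None:
                best = (depth, node.freq)
        if best is None:
            return None
        length, freq = best
        return (s[-length:] if self.reverse else s[:length]), freq


forward = Trie()
forward.insert("expend", 40)   # frequencies here are made-up values
forward.insert("expense", 70)
print(forward.longest_match("expenses"))  # ('expense', 70)

backward = Trie(reverse=True)
backward.insert("endless", 30)
backward.insert("useless", 50)
print(backward.longest_match("souseless"))  # ('useless', 50)
```

Storing the frequency in the node where a word ends means that a successful match returns the word and its frequency in the same traversal.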

Further, on the basis of the above embodiments, a corpus may also be constructed in advance. The corpus includes a word library and the word frequencies of the words in the word library. The forward dictionary tree and the reverse dictionary tree are then built from the corpus, and the word frequency of each word is stored in the corresponding first node and second node. The specific process of constructing the corpus according to the present invention is described below with reference to FIG. 10.

FIG. 10 is a schematic flowchart of a word segmentation method for a character string according to an embodiment of the present invention. As shown in FIG. 10, the method includes:

Step 201: Obtain a word library from a dictionary that satisfies a preset use condition.

A dictionary that satisfies a preset use condition is obtained. Such a dictionary may be, for example, a dictionary whose vocabulary size exceeds a preset value or a dictionary whose download count exceeds a preset number. The words in the dictionary are extracted, and together they form the word library.

Step 202: Determine the number of times each word in the word library appears in a text that satisfies a preset use condition and in the text to be segmented.

The text that satisfies the preset use condition may be a text whose frequency of use exceeds a preset value, such as a complete collection of English literature, an English textbook or an English newspaper. The number of times each word in the word library appears in these texts is determined.

To determine the number of times the words in the word library appear in the text to be segmented, at least one first character string is obtained by splitting the text to be segmented at space characters; the at least one first character string is matched against the words in the word library to obtain at least one second character string that matches a word in the word library; and the number of times the words in the word library appear in the text to be segmented is determined from the number of times each second character string appears in that text.

Specifically, punctuation is removed from the text to be segmented, and the text is split at spaces into individual first character strings. Any first character string that is not in the word library is discarded; those that remain are words, namely the second character strings. The number of times each second character string appears in the text to be segmented is then counted, which gives the number of times the words in the word library appear in the text to be segmented.

Step 203: Construct a corpus from the word library and the number of times the words in the word library appear in the text that satisfies the preset use condition and in the text to be segmented.

The corpus includes the word library and the word frequencies of the words in the word library. If the same word appears both in the text that satisfies the preset use condition and in the text to be segmented, the word frequency of that word is the sum of the number of times it appears in the text that satisfies the preset use condition and the number of times it appears in the text to be segmented.

In the corpus constructed in this embodiment, the word frequencies are adjusted using the text to be segmented and are therefore correlated with it. This brings the word frequencies in the corpus closer to how the words are actually used in the text to be segmented, so that the semantics of the segmentation result are closer to the semantics expressed by that text, which improves the correctness of the character string segmentation.
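Steps 201 to 203 can be sketched with the standard library. The function names and sample texts are hypothetical; lowercasing the tokens and applying the same punctuation-and-space tokenization to the reference text are assumptions that the text does not state.

```python
import string
from collections import Counter


def count_words(text, word_library):
    """Count how often each word of the word library occurs in text."""
    # Remove punctuation, then split the text at whitespace.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.lower().split()  # lowercasing is an assumption
    # Tokens that are not in the word library are discarded.
    return Counter(t for t in tokens if t in word_library)


def build_corpus(word_library, reference_text, target_text):
    """Map each word to its count in the reference text plus its count
    in the text to be segmented."""
    counts = (count_words(reference_text, word_library)
              + count_words(target_text, word_library))
    return {w: counts.get(w, 0) for w in word_library}


word_library = {"sleep", "shirt", "dress"}
corpus = build_corpus(
    word_library,
    "Sleep well. A shirt, a dress, and sleep.",  # stands in for preset texts
    "sleep shirt sleepshirt!",                   # stands in for the text to be segmented
)
print(sorted(corpus.items()))
# [('dress', 1), ('shirt', 2), ('sleep', 3)]
```

The unsegmented token sleepshirt is not in the word library and is therefore ignored when counting.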

When the word segmentation method of this embodiment is performed by a cloud server, the cloud server may also interact with user equipment so that the user can learn the segmentation result. A detailed description is given below using specific embodiments.

FIG. 11 is a schematic flowchart of a word segmentation method for a character string according to an embodiment of the present invention. The method may be implemented by a word segmentation apparatus, which may be implemented in software and/or hardware. The word segmentation apparatus may also be configured in user equipment such as a computer, a mobile phone or a tablet. In this embodiment, the case in which the word segmentation apparatus is configured in user equipment is taken as an example. As shown in FIG. 11, the method includes:

Step 301: Send the text to be segmented, as input by the user, to a cloud server, so that the cloud server obtains the character string to be segmented and determines a segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;

Step 302: Receive segmentation result information for the character string to be segmented fed back by the cloud server, where the segmentation result information includes the segmentation result of the character string to be segmented, and the segmentation result is either the forward segmentation result or the reverse segmentation result;

Step 303: Output the segmentation result to the user.

In step 301, when a user browsing an e-commerce platform through an application or browser installed on the user equipment needs to search for a product, the user equipment obtains the text to be segmented as input by the user and sends it to the cloud server. Specifically, the user may input the text to be segmented by voice or as text.

After obtaining the text to be segmented, the cloud server derives the character string to be segmented from it and performs word segmentation on that string. This yields the forward segmentation result, the word frequency of each first word in the forward segmentation result and the first word-frequency sum; the reverse segmentation result, the word frequency of each second word in the reverse segmentation result and the second word-frequency sum; and the final segmentation result. For the specific way in which the cloud server segments the character string, refer to the embodiments shown in FIG. 2 to FIG. 10 above; details are not repeated here.

In step 302, after obtaining the segmentation result, the cloud server feeds back segmentation result information for the character string to be segmented to the user equipment. The segmentation result information includes the segmentation result.

In step 303, after obtaining the segmentation result, the user equipment outputs it to the user. Specifically, the user equipment may output the segmentation result as voice or as text.

In the word segmentation method provided by this embodiment, the text to be segmented, as input by the user, is sent to the cloud server, so that the cloud server obtains the character string to be segmented and determines the segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result. Because the cloud server segments the character string in both directions, it can identify interfering characters at the head or the tail of the string, which improves the segmentation success rate, and determining the final segmentation result on the basis of word frequency increases the probability that each word in the result is semantically correct. The user equipment receives the segmentation result information fed back by the cloud server, which includes the segmentation result of the character string, and outputs the segmentation result to the user. The user thus learns the segmentation result and can see which query words correspond to the final query result, which improves the user experience.

The interaction between the user equipment and the cloud server is described below with a specific example, with reference to FIG. 12. FIG. 12 is a signaling flowchart of a word segmentation method for a character string according to an embodiment of the present invention. As shown in FIG. 12, the method includes:

Step 401: The user equipment obtains the text to be segmented as input by the user;

Step 402: The user equipment sends the text to be segmented to the cloud server;

Step 403: The cloud server derives the character string to be segmented from the text to be segmented and determines the segmentation result of the character string;

Step 404: The cloud server sends segmentation result information for the character string to be segmented to the user equipment;

Step 405: The user equipment outputs the segmentation result information to the user;

For the specific implementation of steps 401 to 405, refer to the embodiment shown in FIG. 11 above. Optionally, steps 406 to 408 may also be performed after step 405.

Step 406: The user equipment obtains the segmentation result to be processed, as determined by the user;

Step 407: The user equipment sends the segmentation result to be processed to the cloud server;

Step 408: Natural language processing is performed on the segmentation result to be processed.

Through the interaction between the user equipment and the cloud server, this embodiment allows the user not only to learn the segmentation result information but also to determine the segmentation result to be processed, which improves the user experience.

Specific embodiments are used below to describe in detail how the user equipment in this embodiment acquires the text to be segmented that is input by the user, and how it outputs the segmentation result information to the user. This embodiment takes shopping on an e-commerce platform as an example. Those skilled in the art will understand that this scenario is merely illustrative; the method can also be applied to scenarios such as web search, and this embodiment places no particular limitation on the specific scenario.

FIG. 13 is a schematic diagram of a display interface for the word segmentation method for a character string according to an embodiment of the present invention. In this embodiment, the user can enter the type of item to be viewed in the search box of the display interface of the user equipment. As shown in FIG. 13, the user enters the text "slee pshirt" in the search box, and the user equipment sends this text to the cloud server. After obtaining the text to be segmented, the cloud server processes it to obtain the character string to be segmented, "sleepshirt". The cloud server then segments this character string; for the specific segmentation process and its result, refer to the embodiment shown in FIG. 5, which is not repeated here.

In this embodiment, after the cloud server obtains the segmentation result, it returns the segmentation result information to the user equipment. After receiving the segmentation result information, the user equipment outputs the segmentation result to the user according to that information. How the user equipment outputs the segmentation result is described below with reference to FIG. 14 to FIG. 18.

FIG. 14 is a schematic diagram of a display interface for the word segmentation method for a character string according to an embodiment of the present invention. In this embodiment, the segmentation result information includes the segmentation result of the character string to be segmented, and the segmentation result is correspondingly displayed on the display interface of the user equipment. As shown in FIG. 14, the segmentation result "sleep shirt" is displayed on the display interface.

FIG. 15 is a schematic diagram of a display interface for the word segmentation method for a character string according to an embodiment of the present invention. In this embodiment, the segmentation result information includes the segmentation result of the character string to be segmented and the segmentation type corresponding to that result, the segmentation type being forward segmentation or reverse segmentation. Correspondingly, the segmentation result and its segmentation type are displayed on the display interface of the user equipment. As shown in FIG. 15, the display interface shows the segmentation result "sleep shirt" together with its segmentation type, "reverse segmentation".

FIG. 16 is a schematic diagram of a display interface for the word segmentation method for a character string according to an embodiment of the present invention. In this embodiment, the segmentation result information includes the forward segmentation result, the reverse segmentation result, and the final segmentation result. Correspondingly, the forward segmentation result and the reverse segmentation result are displayed on the display interface of the user equipment, and the one that is the segmentation result of the character string to be segmented is marked. As shown in FIG. 16, the display interface shows the reverse segmentation result "sleep shirt" and the forward segmentation result "sleeps hirt", and a gray background marks the reverse segmentation result as the segmentation result of the character string to be segmented.

FIG. 17 is a schematic diagram of a display interface for the word segmentation method for a character string according to an embodiment of the present invention. Building on the embodiment of FIG. 16, the segmentation result information in this embodiment further includes the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result. Correspondingly, the display interface shown in FIG. 17 shows the reverse segmentation result with the word frequency of each second word in it, and the forward segmentation result with the word frequency of each first word in it. In this embodiment, after obtaining the segmentation result information, the user equipment may display the content shown in FIG. 17 directly on the display interface. Alternatively, it may first display the content shown in FIG. 16 and then, upon acquiring a word frequency display instruction triggered by the user operating the display interface, display the word frequency of each first word and/or the word frequency of each second word according to that instruction. Those skilled in the art will understand that when both the word frequency of each first word and the word frequency of each second word are displayed according to the word frequency display instruction, the displayed content may be as shown in FIG. 17.

FIG. 18 is a schematic diagram of a display interface for the word segmentation method for a character string according to an embodiment of the present invention. Building on the embodiment of FIG. 16, the segmentation result information in this embodiment further includes a first word-frequency sum corresponding to the first words in the forward segmentation result and a second word-frequency sum corresponding to the second words in the reverse segmentation result. Correspondingly, the display interface shown in FIG. 18 shows the reverse segmentation result with the second word-frequency sum, and the forward segmentation result with the first word-frequency sum. In this embodiment, after obtaining the segmentation result information, the user equipment may display the content shown in FIG. 18 directly on the display interface. Alternatively, it may first display the content shown in FIG. 16 and then, upon acquiring a word frequency display instruction triggered by the user operating the display interface, display the first word-frequency sum and/or the second word-frequency sum according to that instruction. Those skilled in the art will understand that when both the first word-frequency sum and the second word-frequency sum are displayed according to the word frequency display instruction, the displayed content may be as shown in FIG. 18.
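The variants of the segmentation result information described for FIG. 14 to FIG. 18 can be pictured as one record whose optional fields are filled in progressively. The following is only a minimal sketch: the class name, field names, the `render` helper, and the `*` marker standing in for the gray background are all hypothetical, the per-word frequencies of FIG. 17 are left out for brevity, and the sums 900 and 31 are made-up illustration values, not figures from the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SegmentationResultInfo:
    """Hypothetical container for the segmentation result information."""
    result: List[str]                      # final segmentation result (FIG. 14)
    split_type: Optional[str] = None       # "forward" or "reverse" (FIG. 15)
    forward: Optional[List[str]] = None    # forward segmentation result (FIG. 16)
    reverse: Optional[List[str]] = None    # reverse segmentation result (FIG. 16)
    forward_sum: Optional[int] = None      # first word-frequency sum (FIG. 18)
    reverse_sum: Optional[int] = None      # second word-frequency sum (FIG. 18)


def render(info: SegmentationResultInfo) -> List[str]:
    """Return the text lines a display interface might show."""
    if info.forward is None or info.reverse is None:
        # Only the final result (and possibly its type) is available.
        label = f" ({info.split_type} segmentation)" if info.split_type else ""
        return [" ".join(info.result) + label]
    lines = []
    for words, total in ((info.reverse, info.reverse_sum),
                         (info.forward, info.forward_sum)):
        mark = "*" if words == info.result else " "   # "*" marks the chosen result
        suffix = f" [sum={total}]" if total is not None else ""
        lines.append(f"{mark} {' '.join(words)}{suffix}")
    return lines


info = SegmentationResultInfo(
    result=["sleep", "shirt"], split_type="reverse",
    forward=["sleeps", "hirt"], reverse=["sleep", "shirt"],
    forward_sum=31, reverse_sum=900,
)
for line in render(info):
    print(line)
```

Keeping every field beyond `result` optional lets the same record serve the minimal display of FIG. 14 and the richer displays of FIG. 16 and FIG. 18.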

In the embodiments shown in FIG. 16 to FIG. 18 above, the user can decide, by operating the display interface, which segmentation result the cloud server is to process. Specifically, the user can operate on the forward segmentation result or the reverse segmentation result by clicking, sliding, or a similar action. The user equipment obtains operation information from the way the user operates on the forward or reverse segmentation result, and determines the segmentation result to be processed according to that operation information. In this embodiment, when the user clicks the "sleep shirt" output box, the user equipment obtains operation information from the click; the operation information indicates that the reverse segmentation result was selected by the user, and the user equipment accordingly determines that the segmentation result to be processed is the reverse segmentation result. The user equipment then feeds the segmentation result to be processed back to the cloud server, which performs the subsequent processing on it.

In this embodiment, because the forward segmentation result and the reverse segmentation result are displayed on the display interface at the same time, the user can use both results to determine the object to be looked up or searched for, which improves the accuracy and effectiveness of the search. Further, this embodiment also displays the word frequencies on the display interface; after seeing them, the user can quickly make a more accurate judgment, which improves the user experience.

A word segmentation apparatus for a character string according to one or more embodiments of the present application is described in detail below. The apparatus can be implemented on various devices, for example a server-side device, a server, or a network server. Those skilled in the art will understand that the apparatus can be built from commercially available hardware components configured according to the steps taught in this solution. For example, the modules involving control functions and update functions in the following embodiments can be implemented with components such as single-chip microcomputers, microcontrollers, and microprocessors from companies such as Texas Instruments, Intel, and ARM.

The following are apparatus embodiments of the present application, which can be used to carry out the method embodiments of the present application. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the present application.

FIG. 19 is a schematic structural diagram of a word segmentation apparatus for a character string according to an embodiment of the present invention. As shown in FIG. 19, the apparatus includes:

a first segmentation module 10, configured to obtain a forward segmentation result of the character string to be segmented, the forward segmentation result including at least one first word;

a second segmentation module 11, configured to obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word;

a word frequency acquisition module 12, configured to obtain the word frequency of each first word and the word frequency of each second word, the word frequency being the predetermined number of times each word appears in a preset text;

a result determining module 13, configured to determine the segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.

The word segmentation apparatus for a character string provided by this embodiment of the present application can carry out the method embodiments above; its implementation principle and technical effects are similar and are not repeated here.

FIG. 20 is a schematic structural diagram of a word segmentation apparatus for a character string according to an embodiment of the present invention. This embodiment builds on the embodiment of FIG. 19, as follows:

Optionally, the first segmentation module 10 is specifically configured to:

perform a forward segmentation operation on the character string to be segmented, and determine whether a first word is obtained;

if so, take the character string to be segmented with the first word removed as the new character string to be segmented, and return to the forward segmentation operation on the character string to be segmented;

if not, delete the first character, in the forward direction, of the character string to be segmented to obtain a processed character string, take the processed character string as the new character string to be segmented, and return to the forward segmentation operation on the character string to be segmented;

repeat the forward segmentation operation on the character string to be segmented until segmentation of the character string is complete, obtaining the forward segmentation result.
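The forward segmentation loop above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the application: the passage does not say which candidate is taken when several dictionary words start at the current position, so the sketch assumes the longest match, and the function name and the small word set are hypothetical.

```python
from typing import List, Set


def forward_segment(string: str, words: Set[str]) -> List[str]:
    """Segment `string` from head to tail against the dictionary `words`."""
    result = []
    while string:
        # Assumption: prefer the longest dictionary word starting at the head.
        for end in range(len(string), 0, -1):
            if string[:end] in words:
                result.append(string[:end])   # a first word was obtained
                string = string[end:]         # remove it and segment the rest
                break
        else:
            # No word starts here: delete the leading character and retry.
            string = string[1:]
    return result


words = {"sleep", "sleeps", "shirt", "hirt"}    # hypothetical dictionary
print(forward_segment("sleepshirt", words))     # ['sleeps', 'hirt']
print(forward_segment("xsleep", words))         # ['sleep']
```

The second call shows how an interfering character at the head of the string is discarded by the delete-and-retry branch.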

Optionally, the second segmentation module 11 is specifically configured to perform a reverse segmentation operation on the character string to be segmented, and determine whether a second word is obtained;

if so, take the character string to be segmented with the second word removed as the new character string to be segmented, and return to the reverse segmentation operation on the character string to be segmented;

if not, delete the first character, in the reverse direction, of the character string to be segmented to obtain a processed character string, take the processed character string as the new character string to be segmented, and return to the reverse segmentation operation on the character string to be segmented;

repeat the reverse segmentation operation on the character string to be segmented until segmentation of the character string is complete, obtaining the reverse segmentation result.
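The reverse direction mirrors the forward one, working from the tail of the string. The sketch below again assumes longest-match selection, which this passage does not specify; the function name and the word set are hypothetical.

```python
from typing import List, Set


def reverse_segment(string: str, words: Set[str]) -> List[str]:
    """Segment `string` from tail to head against the dictionary `words`."""
    result = []
    while string:
        # Assumption: prefer the longest dictionary word ending at the tail.
        for start in range(len(string)):
            if string[start:] in words:
                result.append(string[start:])  # a second word was obtained
                string = string[:start]        # remove it and segment the rest
                break
        else:
            # No word ends here: delete the trailing character and retry.
            string = string[:-1]
    result.reverse()                           # restore reading order
    return result


words = {"sleep", "sleeps", "shirt", "hirt"}    # hypothetical dictionary
print(reverse_segment("sleepshirt", words))     # ['sleep', 'shirt']
print(reverse_segment("sleepx", words))         # ['sleep']
```

On "sleepshirt" this yields "sleep shirt", the reverse segmentation result shown in FIG. 16, whereas a head-first pass over the same dictionary yields "sleeps hirt".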

Optionally, the apparatus further includes a text acquisition module 14, configured to acquire the text to be segmented and perform a symbol deletion operation on the text to be segmented to obtain the character string to be segmented.
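A symbol deletion step of this kind might look like the sketch below. The exact set of symbols to delete is not listed here, so the sketch assumes that whitespace, punctuation, and underscores are removed while letters and digits are kept; the function name is hypothetical.

```python
import re


def to_segmentable_string(text: str) -> str:
    """Delete symbols from the text, keeping only letters and digits."""
    # Assumption: "symbols" covers whitespace, punctuation, and underscores.
    return re.sub(r"[\W_]+", "", text)


print(to_segmentable_string("slee pshirt"))   # sleepshirt
```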

Optionally, the apparatus further includes a dictionary tree construction module 15, configured to construct a forward dictionary tree and a reverse dictionary tree;

the first segmentation module 10 is specifically configured to:

perform the forward segmentation operation on the character string to be segmented according to the forward dictionary tree;

the second segmentation module 11 is specifically configured to:

perform the reverse segmentation operation on the character string to be segmented according to the reverse dictionary tree.

Optionally, each first node of the forward dictionary tree stores the word frequency of the word corresponding to that first node, and each second node of the reverse dictionary tree stores the word frequency of the word corresponding to that second node;

the word frequency acquisition module 12 is specifically configured to:

obtain the word frequency of the first word from the first node corresponding to the first word;

obtain the word frequency of the second word from the second node corresponding to the second word.

Optionally, the apparatus further includes a corpus construction module 16, configured to construct a corpus, the corpus including a word library and the word frequencies of the words in the word library;

the dictionary tree construction module 15 is specifically configured to construct the forward dictionary tree and the reverse dictionary tree according to the corpus, and to store the word frequency of each word in the corresponding first node and second node.
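One way to picture the two dictionary trees is as tries whose word-ending nodes carry the word frequency, with the reverse tree built from the reversed spelling of each word. The following is a minimal sketch; the class, the function names, and the frequencies in `corpus` are hypothetical illustration values.

```python
from typing import Dict, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.freq: Optional[int] = None   # set only on nodes that end a word


def build_trie(corpus: Dict[str, int], reverse: bool = False) -> TrieNode:
    """Build a forward trie, or a reverse trie when `reverse` is True."""
    root = TrieNode()
    for word, freq in corpus.items():
        node = root
        for ch in (reversed(word) if reverse else word):
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq                  # store the word frequency in the node
    return root


def lookup_freq(root: TrieNode, word: str, reverse: bool = False) -> Optional[int]:
    """Return the stored frequency of `word`, or None if it is not a word."""
    node = root
    for ch in (reversed(word) if reverse else word):
        node = node.children.get(ch)
        if node is None:
            return None
    return node.freq


corpus = {"sleep": 500, "sleeps": 30, "shirt": 400, "hirt": 1}  # made-up counts
forward_tree = build_trie(corpus)
reverse_tree = build_trie(corpus, reverse=True)
print(lookup_freq(forward_tree, "sleep"))                 # 500
print(lookup_freq(reverse_tree, "shirt", reverse=True))   # 400
```

Storing the frequency in the node means that the walk that recognizes a word also yields its frequency, so no separate lookup table is needed.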

Optionally, the preset text includes text that satisfies a preset usage condition and the text to be segmented; the corpus construction module 16 is specifically configured to:

obtain the word library from a dictionary that satisfies the preset usage condition;

determine the number of times each word in the word library appears in the text that satisfies the preset usage condition and in the text to be segmented;

construct the corpus from the word library and the number of times each word in the word library appears in the text that satisfies the preset usage condition and in the text to be segmented.

Optionally, the corpus construction module 16 is specifically configured to:

obtain at least one first character string according to the space characters in the text to be segmented;

match the at least one first character string against the words in the word library to obtain at least one second character string that matches a word in the word library;

determine the number of times each word in the word library appears in the text to be segmented according to the number of times each second character string appears in the text to be segmented.
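The counting procedure above can be sketched as follows. The function name and the sample inputs are hypothetical, and the sketch splits on any run of whitespace, which is a slightly broader rule than the space character named in the text.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def build_corpus(word_library: Set[str], texts: Iterable[str]) -> Dict[str, int]:
    """Count how often each word of the word library occurs in `texts`."""
    counts = Counter({word: 0 for word in word_library})
    for text in texts:
        first_strings = text.split()              # pieces separated by spaces
        # Keep only the pieces that match a word in the word library.
        second_strings = [s for s in first_strings if s in word_library]
        counts.update(second_strings)
    return dict(counts)


corpus = build_corpus(
    {"sleep", "shirt", "hirt"},
    ["sleep shirt for men", "red shirt"],          # made-up sample texts
)
print(corpus["sleep"], corpus["shirt"], corpus["hirt"])   # 1 2 0
```

Passing both the reference texts and past query texts in `texts` corresponds to counting occurrences in the text that satisfies the preset usage condition and in the text to be segmented.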

Optionally, the result determining module 13 is specifically configured to:

sum the word frequencies of all the first words to obtain a first word-frequency sum;

sum the word frequencies of all the second words to obtain a second word-frequency sum;

if the first word-frequency sum is greater than the second word-frequency sum, determine that the segmentation result of the character string to be segmented is the forward segmentation result;

if the second word-frequency sum is greater than the first word-frequency sum, determine that the segmentation result of the character string to be segmented is the reverse segmentation result.
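The comparison of the two word-frequency sums can be sketched as below. The function name and the frequencies are hypothetical, and this passage does not cover the case of equal sums, so the tie branch is an arbitrary placeholder rather than behavior taken from the application.

```python
from typing import Dict, List, Tuple


def choose_segmentation(
    forward: List[str], reverse: List[str], freq: Dict[str, int]
) -> Tuple[str, List[str]]:
    """Pick the segmentation whose words have the larger frequency sum."""
    first_sum = sum(freq.get(word, 0) for word in forward)
    second_sum = sum(freq.get(word, 0) for word in reverse)
    if first_sum > second_sum:
        return "forward", forward
    if second_sum > first_sum:
        return "reverse", reverse
    # Assumption: the tie case is unspecified here; forward is chosen arbitrarily.
    return "forward", forward


freq = {"sleep": 500, "sleeps": 30, "shirt": 400, "hirt": 1}   # made-up counts
print(choose_segmentation(["sleeps", "hirt"], ["sleep", "shirt"], freq))
# ('reverse', ['sleep', 'shirt'])
```

With these illustrative counts the forward sum is 31 and the reverse sum is 900, so the reverse result is selected.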

Optionally, the apparatus further includes a feedback module 17;

the text acquisition module 14 is specifically configured to acquire the text to be segmented sent by the user equipment;

the feedback module 17 is configured to feed back the segmentation result information of the character string to be segmented to the user equipment, the segmentation result information including the segmentation result of the character string to be segmented, so that the user equipment outputs the segmentation result to the user.

Optionally, the apparatus further includes a result acquisition module 18 and a processing module 19;

the result acquisition module 18 is configured to acquire the segmentation result to be processed sent by the user equipment;

the processing module 19 is configured to perform natural language processing on the segmentation result to be processed.

The word segmentation apparatus for a character string provided by this embodiment of the present application can carry out the method embodiments above; its implementation principle and technical effects are similar and are not repeated here.

FIG. 21 is a schematic structural diagram of a word segmentation apparatus for a character string according to an embodiment of the present invention. As shown in FIG. 21, the apparatus includes:

a sending module 20, configured to send the text to be segmented, as input by the user, to the cloud server, so that the cloud server obtains the character string to be segmented and determines the segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;

a receiving module 21, configured to receive the segmentation result information of the character string to be segmented fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result;

an output module 22, configured to output the segmentation result to the user.

The word segmentation apparatus for a character string provided by this embodiment of the present application can carry out the method embodiments above; its implementation principle and technical effects are similar and are not repeated here.

FIG. 22 is a schematic structural diagram of a word segmentation apparatus for a character string according to an embodiment of the present invention. As shown in FIG. 22, this embodiment builds on the embodiment shown in FIG. 21, as follows:

the output module 22 is specifically configured to display the segmentation result on a display interface.

Optionally, the segmentation result information further includes the segmentation type corresponding to the segmentation result, the segmentation type being forward segmentation or reverse segmentation;

the output module 22 is specifically configured to display the segmentation result and the segmentation type of the segmentation result on the display interface.

Optionally, if the segmentation result is the forward segmentation result, the segmentation result information further includes the reverse segmentation result; or

if the segmentation result is the reverse segmentation result, the segmentation result information further includes the forward segmentation result;

the output module 22 is specifically configured to display the forward segmentation result and the reverse segmentation result on the display interface, and to mark the one that is the segmentation result of the character string to be segmented.

Optionally, the segmentation result information further includes the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;

the display apparatus further includes an instruction acquisition module 23, configured to acquire a word frequency display instruction triggered by the user operating the display interface;

the output module 22 is further configured to display, according to the word frequency display instruction, the word frequency of each first word and/or the word frequency of each second word;

or

the output module 22 is specifically configured to display, on the display interface, the forward segmentation result with the word frequencies of the first words in it, and the reverse segmentation result with the word frequencies of the second words in it.

Optionally, the segmentation result information further includes a first word-frequency sum corresponding to the first words in the forward segmentation result and a second word-frequency sum corresponding to the second words in the reverse segmentation result;

the display apparatus further includes an instruction acquisition module 23, configured to acquire a word frequency display instruction triggered by the user operating the display interface;

the output module 22 is further configured to display, according to the word frequency display instruction, the first word-frequency sum and/or the second word-frequency sum;

or

the output module 22 is specifically configured to display, on the display interface, the forward segmentation result with the first word-frequency sum, and the reverse segmentation result with the second word-frequency sum.

Optionally, the apparatus further includes an operation information acquisition module 24, configured to acquire operation information of the user's operation on the forward segmentation result or the reverse segmentation result on the display interface;

a determining module 25, configured to determine the segmentation result to be processed according to the operation information;

the sending module 20 is further configured to send the segmentation result to be processed to the cloud server, so that the cloud server performs natural language processing on the segmentation result to be processed.

The word segmentation apparatus for a character string provided by this embodiment of the present application can carry out the method embodiments above; its implementation principle and technical effects are similar and are not repeated here.

FIG. 23 is a schematic diagram of the hardware structure of a word segmentation device for a character string according to an embodiment of the present invention. As shown in FIG. 23, the word segmentation device may include an input device 30, a processor 31, a memory 32, at least one communication bus 33, and an output device 34. The communication bus 33 provides the communication connections between the components. The memory 32 may include high-speed RAM and may also include non-volatile memory (NVM), for example at least one disk memory. The memory 32 can store various programs for performing various processing functions and implementing the method steps of this embodiment.

In this embodiment, the input device 30 is configured to acquire the text to be segmented;

the processor 31, coupled to the input device 30, is configured to: obtain a forward segmentation result of the character string to be segmented, the forward segmentation result including at least one first word, and obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word; obtain the word frequency of each first word and the word frequency of each second word, the word frequency being the predetermined number of times each word appears in a preset text; and determine the segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result;

the output device 34 is configured to feed back the segmentation result information of the character string to be segmented to the user equipment, the segmentation result information including the segmentation result of the character string to be segmented, so that the user equipment outputs the segmentation result to the user.

Optionally, the processor 31 is further configured to perform the methods shown in FIG. 2 to FIG. 10, with the input device 30 performing the corresponding input operations and the output device 34 performing the corresponding output operations. For the specific implementation, refer to the foregoing embodiments; details are not repeated here.

FIG. 24 is a schematic diagram of the hardware structure of a cloud server according to an embodiment of the present invention. As shown in FIG. 24, the cloud server may include an input device 40, a processor 41, a memory 42, at least one communication bus 43, and an output device 44. The communication bus 43 implements the communication connections between the components. The memory 42 may include high-speed RAM and may also include non-volatile memory (NVM), for example at least one disk memory. The memory 42 may store various programs for performing various processing functions and implementing the method steps of this embodiment.

In this embodiment, the input device 40 is configured to acquire the text to be segmented.

The processor 41 is coupled to the input device 40 and is configured to: obtain a forward segmentation result of the character string to be segmented, the forward segmentation result including at least one first word, and obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word; obtain the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times each word appears in preset text; and determine the segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, where the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.

The output device 44 is configured to feed back segmentation result information for the character string to be segmented to the user equipment, the segmentation result information including the segmentation result of the character string to be segmented, so that the user equipment outputs the segmentation result to the user.

Optionally, the processor 41 is further configured to perform the methods shown in FIG. 2 to FIG. 10, with the input device 40 performing the corresponding input operations and the output device 44 performing the corresponding output operations. For the specific implementation, refer to the foregoing embodiments; details are not repeated here.

FIG. 25 is a schematic diagram of the hardware structure of a word segmentation device for a character string according to an embodiment of the present invention. As shown in FIG. 25, the word segmentation device may include an input device 50, a processor 51, a memory 52, at least one communication bus 53, and an output device 54. The communication bus 53 implements the communication connections between the components. The memory 52 may include high-speed RAM and may also include non-volatile memory (NVM), for example at least one disk memory. The memory 52 may store various programs for performing various processing functions and implementing the method steps of this embodiment.

The output device 54 is configured to send the text to be segmented, as input by the user, to the cloud server, so that the cloud server obtains the character string to be segmented and determines the segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result.

The input device 50 is configured to receive the segmentation result information for the character string to be segmented that is fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, where the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.

The processor 51 is configured to control, according to the segmentation result information, the input device to output the segmentation result to the user.

Optionally, the processor 51 is further configured to perform the methods shown in FIG. 11 to FIG. 18, with the input device 50 performing the corresponding input operations and the output device 54 performing the corresponding output operations. For the specific implementation, refer to the foregoing embodiments; details are not repeated here.

FIG. 26 is a schematic diagram of the hardware structure of user equipment according to an embodiment of the present invention. As shown in FIG. 26, the word segmentation device for a character string may include an input device 60, a processor 61, a memory 62, at least one communication bus 63, and an output device 64. The communication bus 63 implements the communication connections between the components. The memory 62 may include high-speed RAM and may also include non-volatile memory (NVM), for example at least one disk memory. The memory 62 may store various programs for performing various processing functions and implementing the method steps of this embodiment.

The output device 64 is configured to send the text to be segmented, as input by the user, to the cloud server, so that the cloud server obtains the character string to be segmented and determines the segmentation result according to the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result.

The input device 60 is configured to receive the segmentation result information for the character string to be segmented that is fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, where the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.

The processor 61 is configured to control, according to the segmentation result information, the input device to output the segmentation result to the user.

Optionally, the processor 61 is further configured to perform the methods shown in FIG. 11 to FIG. 18, with the input device 60 performing the corresponding input operations and the output device 64 performing the corresponding output operations. For the specific implementation, refer to the foregoing embodiments; details are not repeated here.

In the embodiments shown in FIG. 23 to FIG. 26, the processor may be implemented, for example, as a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a controller, a microcontroller, a microprocessor, or another electronic component.

The input device may include a variety of input devices, for example at least one of a user-facing user interface, a device-facing device interface, a software programmable interface, and a transceiver. Optionally, the device-facing device interface may be a wired interface for data transmission between devices, or a hardware plug-in interface (for example, a USB interface or a serial port) for data transmission between devices. Optionally, the user-facing user interface may be, for example, user-facing control keys, a voice input device for receiving voice input, or a touch-sensing device for receiving the user's touch input (for example, a touch screen or touchpad with touch-sensing capability). Optionally, the software programmable interface may be, for example, an entry point through which a user edits or modifies a program, such as an input pin interface or an input interface of a chip. Optionally, the transceiver may be a radio-frequency transceiver chip with a communication function, a baseband processing chip, a transceiver antenna, or the like.

The output device may include a variety of output devices, for example at least one of a user-facing user interface, a device-facing device interface, a software programmable interface, and a transceiver. Optionally, the device-facing device interface may be a wired interface for data transmission between devices, or a hardware plug-in interface (for example, a USB interface or a serial port) for data transmission between devices. Optionally, the user-facing user interface may be, for example, a user-facing display device or a voice output device. Optionally, the software programmable interface may be, for example, an entry point through which a user edits or modifies a program, such as an input pin interface or an input interface of a chip. Optionally, the transceiver may be a radio-frequency transceiver chip with a communication function, a baseband processing chip, a transceiver antenna, or the like.

The terms used in the embodiments of the present invention are for the purpose of describing particular embodiments only and are not intended to limit the invention. The singular forms "a", "said", and "the" used in the embodiments of the present invention and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.

It should be understood that the term "and/or" used herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" may mean A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it.

It should be understood that although the terms first, second, third, and so on may be used in the embodiments of the present invention to describe XXX, the XXX should not be limited by these terms. These terms are used only to distinguish the XXX from one another. For example, without departing from the scope of the embodiments of the present invention, a first XXX may also be called a second XXX and, similarly, a second XXX may also be called a first XXX.

It should also be noted that the terms "comprise", "include", and any other variants are intended to cover non-exclusive inclusion, so that a product or system comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a product or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the product or system that comprises the element.

The readable storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random-access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (46)

1. A word segmentation method for a character string, characterized by comprising: obtaining a forward segmentation result of a character string to be segmented, the forward segmentation result comprising at least one first word; obtaining a reverse segmentation result of the character string to be segmented, the reverse segmentation result comprising at least one second word; obtaining the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times each word appears in preset text; and determining a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.
2. The method according to claim 1, wherein obtaining the forward segmentation result of the character string to be segmented comprises: performing a forward segmentation operation on the character string to be segmented and determining whether a first word is obtained; if so, taking the character string to be segmented with the first word removed as a new character string to be segmented, and returning to the forward segmentation operation; if not, deleting the first character, in the forward direction, of the character string to be segmented to obtain a processed character string, taking the processed character string as a new character string to be segmented, and returning to the forward segmentation operation; and repeating the forward segmentation operation until segmentation of the character string to be segmented is complete, to obtain the forward segmentation result.
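
By way of non-limiting illustration, the forward segmentation loop recited above might be sketched in Python as follows. The function name `forward_segment` and the use of a plain set as the dictionary are assumptions made for this sketch and do not come from the application; trying the longest candidate first is likewise an assumption at this point.

```python
def forward_segment(text, dictionary):
    """Greedy forward segmentation (illustrative sketch only).

    `dictionary` is a hypothetical set of known words.
    """
    max_len = max((len(w) for w in dictionary), default=0)
    words = []
    remaining = text
    while remaining:
        # Try the longest candidate prefix first, then shorter ones.
        for size in range(min(max_len, len(remaining)), 0, -1):
            if remaining[:size] in dictionary:
                words.append(remaining[:size])
                remaining = remaining[size:]  # remove the first word that was found
                break
        else:
            # No first word obtained: delete the leading character and retry.
            remaining = remaining[1:]
    return words


print(forward_segment("nowhere", {"no", "now", "here", "where"}))  # ['now', 'here']
```

Characters that start no dictionary word are simply dropped, which mirrors the "if not" branch of the loop.
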
3. The method according to claim 1, wherein obtaining the reverse segmentation result of the character string to be segmented comprises: performing a reverse segmentation operation on the character string to be segmented and determining whether a second word is obtained; if so, taking the character string to be segmented with the second word removed as a new character string to be segmented, and returning to the reverse segmentation operation; if not, deleting the first character, in the reverse direction, of the character string to be segmented to obtain a processed character string, taking the processed character string as a new character string to be segmented, and returning to the reverse segmentation operation; and repeating the forward segmentation operation on the character string to be segmented until segmentation of the character string to be segmented is complete, to obtain the reverse segmentation result.
4. The method according to any one of claims 1 to 3, further comprising: obtaining text to be segmented, and performing a symbol deletion operation on the text to be segmented to obtain the character string to be segmented.
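
As a hedged sketch of the symbol deletion and reverse segmentation steps described above, the following Python fragment may help. The names `strip_symbols` and `reverse_segment` are hypothetical, and treating every non-alphanumeric character as a "symbol" is an assumption of this sketch; the application does not define the symbol set here.

```python
import re


def strip_symbols(text):
    # Assumption: a "symbol" is any character that is not a letter or digit
    # (punctuation, whitespace, underscore).
    return re.sub(r"[\W_]+", "", text)


def reverse_segment(text, dictionary):
    """Greedy segmentation scanning from the end of the string (sketch only)."""
    max_len = max((len(w) for w in dictionary), default=0)
    words = []
    remaining = text
    while remaining:
        for size in range(min(max_len, len(remaining)), 0, -1):
            if remaining[-size:] in dictionary:
                words.append(remaining[-size:])
                remaining = remaining[:-size]  # remove the matched trailing word
                break
        else:
            # No word ends here: delete the last character, i.e. the first
            # character seen in the reverse direction.
            remaining = remaining[:-1]
    words.reverse()  # report the words in reading order (a presentation choice)
    return words


cleaned = strip_symbols("no-where!")
print(cleaned)                                                  # nowhere
print(reverse_segment(cleaned, {"no", "now", "here", "where"}))  # ['no', 'where']
```

On this input the reverse scan yields a different split from a forward scan, which is the situation the frequency comparison is meant to resolve.
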
5. The method according to claim 2 or 3, further comprising: constructing a forward dictionary tree and a reverse dictionary tree; wherein performing the forward segmentation operation on the character string to be segmented comprises: performing the forward segmentation operation on the character string to be segmented according to the forward dictionary tree; and performing the reverse segmentation operation on the character string to be segmented comprises: performing the reverse segmentation operation on the character string to be segmented according to the reverse dictionary tree.
6. The method according to claim 5, wherein each first node of the forward dictionary tree stores the word frequency of the word corresponding to that first node, and each second node of the reverse dictionary tree stores the word frequency of the word corresponding to that second node; and obtaining the word frequency of each first word and the word frequency of each second word comprises: obtaining the word frequency of the first word from the first node corresponding to the first word; and obtaining the word frequency of the second word from the second node corresponding to the second word.
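
One possible way to picture dictionary trees whose nodes carry word frequencies is the minimal Python sketch below. The class and function names are hypothetical, and building the reverse tree by inserting each word back to front is an assumption of this sketch rather than something stated above.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = None  # set only on nodes where a dictionary word ends


def build_trie(word_freq, reverse=False):
    """Build a forward trie, or a reverse trie when `reverse` is True."""
    root = TrieNode()
    for word, freq in word_freq.items():
        node = root
        for ch in (reversed(word) if reverse else word):
            node = node.children.setdefault(ch, TrieNode())
        node.freq = freq  # the frequency is stored in the word's own node
    return root


def longest_match(root, chars):
    """Return (length, freq) of the longest word along `chars`, or (0, None)."""
    node, best = root, (0, None)
    for i, ch in enumerate(chars, start=1):
        node = node.children.get(ch)
        if node is None:
            break
        if node.freq is not None:
            best = (i, node.freq)
    return best


freqs = {"no": 80, "now": 50, "here": 60, "where": 40}  # made-up counts
forward_trie = build_trie(freqs)
reverse_trie = build_trie(freqs, reverse=True)
print(longest_match(forward_trie, "nowhere"))            # (3, 50) -> "now"
print(longest_match(reverse_trie, reversed("nowhere")))  # (5, 40) -> "where"
```

Because the frequency sits on the node where the match ends, no second lookup is needed once a word has been found.
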
7. The method according to claim 6, further comprising, before constructing the forward dictionary tree and the reverse dictionary tree: constructing a corpus, the corpus comprising a word library and the word frequencies of the words in the word library; wherein constructing the forward dictionary tree and the reverse dictionary tree comprises: constructing the forward dictionary tree and the reverse dictionary tree according to the corpus, and storing the word frequency of each word in the corresponding first node and second node.
8. The method according to claim 7, wherein the preset text comprises text that satisfies a preset usage condition and the text to be segmented; and constructing the corpus comprises: obtaining the word library according to a dictionary that satisfies the preset usage condition; determining the number of times each word in the word library appears in the text that satisfies the preset usage condition and in the text to be segmented; and constructing the corpus according to the word library and the number of times each word in the word library appears in the text that satisfies the preset usage condition and in the text to be segmented.
9. The method according to claim 8, wherein determining the number of times the words in the word library appear in the text to be segmented comprises: obtaining at least one first character string according to the space characters in the text to be segmented; matching the at least one first character string against the words in the word library to obtain at least one second character string that matches a word in the word library; and determining, according to the number of times each second character string appears in the text to be segmented, the number of times the words in the word library appear in the text to be segmented.
10. The method according to claim 1, wherein determining the segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word comprises: summing the word frequencies of all the first words to obtain a first word frequency sum; summing the word frequencies of all the second words to obtain a second word frequency sum; if the first word frequency sum is greater than the second word frequency sum, determining that the segmentation result of the character string to be segmented is the forward segmentation result; and if the second word frequency sum is greater than the first word frequency sum, determining that the segmentation result of the character string to be segmented is the reverse segmentation result.
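
The frequency counting and the comparison of frequency sums described above can be illustrated with the following Python sketch. The function names are hypothetical, the sample texts are invented for the example, and falling back to the forward result when the two sums are equal is an assumption, since that case is not addressed above.

```python
def count_word_frequencies(texts, word_library):
    """Count how often each library word occurs as a space-delimited token."""
    counts = dict.fromkeys(word_library, 0)
    for text in texts:
        for token in text.split():  # split() breaks on any whitespace run
            if token in counts:
                counts[token] += 1
    return counts


def choose_result(forward_words, reverse_words, freq):
    first_sum = sum(freq.get(w, 0) for w in forward_words)
    second_sum = sum(freq.get(w, 0) for w in reverse_words)
    if second_sum > first_sum:
        return "reverse", reverse_words
    return "forward", forward_words  # assumption: a tie keeps the forward result


library = {"no", "now", "here", "where"}
texts = ["there is no way", "no one knows where", "here and now", "no"]
freq = count_word_frequencies(texts, library)
print(freq["no"], freq["now"], freq["here"], freq["where"])   # 3 1 1 1
print(choose_result(["now", "here"], ["no", "where"], freq))  # ('reverse', ['no', 'where'])
```

Here the forward split sums to 2 and the reverse split to 4, so the reverse segmentation result is selected.
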
11. The method according to any one of claims 1 to 3, wherein both the forward segmentation and the reverse segmentation use longest-word segmentation.
12. The method according to claim 4, wherein obtaining the text to be segmented comprises: obtaining the text to be segmented sent by user equipment; and after determining the segmentation result of the character string to be segmented, the method further comprises: feeding back segmentation result information for the character string to be segmented to the user equipment, the segmentation result information comprising the segmentation result of the character string to be segmented, so that the user equipment outputs the segmentation result to a user.
13. The method according to claim 12, wherein the segmentation result information further comprises a segmentation type corresponding to the segmentation result, the segmentation type being forward segmentation or reverse segmentation.
14. The method according to claim 12, wherein if the segmentation result is the forward segmentation result, the segmentation information further comprises the reverse segmentation result; or if the segmentation result is the reverse segmentation result, the segmentation information further comprises the forward segmentation result.
15. The method according to claim 14, wherein the segmentation information further comprises the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result.
16. The method according to claim 14, wherein the segmentation information further comprises a first word frequency sum corresponding to the first words in the forward segmentation result and a second word frequency sum corresponding to the second words in the reverse segmentation result.
17. The method according to claim 12, further comprising, after feeding back the segmentation result information for the character string to be segmented to the user equipment: obtaining a segmentation result to be processed sent by the user equipment; and performing natural language processing on the segmentation result to be processed.
18. A word segmentation method for a character string, characterized by comprising: sending text to be segmented, as input by a user, to a cloud server, so that the cloud server obtains a character string to be segmented and determines a segmentation result according to the word frequency of each first word in a forward segmentation result and the word frequency of each second word in a reverse segmentation result; receiving segmentation result information for the character string to be segmented that is fed back by the cloud server, the segmentation result information comprising the segmentation result of the character string to be segmented, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result; and outputting the segmentation result to the user.
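
Purely as an illustration, the segmentation result information exchanged between the cloud server and the user equipment could be modeled as below. Every field name, the JSON encoding, and the tie handling are assumptions made for this sketch; the text above only states what the information may contain, not how it is laid out.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass
class SegmentationResultInfo:      # hypothetical container, not from the source
    result: List[str]              # the chosen segmentation result
    split_type: str                # "forward" or "reverse"
    other_result: List[str]        # the segmentation result that was not chosen
    word_freqs: Dict[str, int]     # word frequency of each first and second word
    first_sum: int                 # first word frequency sum (forward)
    second_sum: int                # second word frequency sum (reverse)


def build_result_info(forward_words, reverse_words, freq):
    first_sum = sum(freq[w] for w in forward_words)
    second_sum = sum(freq[w] for w in reverse_words)
    forward_wins = first_sum >= second_sum  # assumption: a tie favors forward
    return SegmentationResultInfo(
        result=forward_words if forward_wins else reverse_words,
        split_type="forward" if forward_wins else "reverse",
        other_result=reverse_words if forward_wins else forward_words,
        word_freqs={w: freq[w] for w in forward_words + reverse_words},
        first_sum=first_sum,
        second_sum=second_sum,
    )


info = build_result_info(
    ["now", "here"], ["no", "where"],
    {"no": 80, "now": 50, "here": 60, "where": 40},  # made-up frequencies
)
print(json.dumps(asdict(info)))  # payload the server might feed back
```

Carrying both results and their sums in one message lets the user equipment show the alternative split without a second request.
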
19. The method according to claim 18, wherein outputting the segmentation result to the user comprises: displaying the segmentation result on a display interface.
20. The method according to claim 19, wherein the segmentation result information further comprises a segmentation type corresponding to the segmentation result, the segmentation type being forward segmentation or reverse segmentation; and displaying the segmentation result on the display interface comprises: displaying the segmentation result and the segmentation type of the segmentation result on the display interface.
21. The method according to claim 19, wherein if the segmentation result is the forward segmentation result, the segmentation information further comprises the reverse segmentation result; or if the segmentation result is the reverse segmentation result, the segmentation information further comprises the forward segmentation result; and displaying the segmentation result on the display interface comprises: displaying the forward segmentation result and the reverse segmentation result on the display interface, and marking the segmentation result corresponding to the character string to be segmented.
22. The method according to claim 21, wherein the segmentation information further comprises the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result; and after displaying the forward segmentation result and the reverse segmentation result on the display interface and marking the segmentation result corresponding to the character string to be segmented, the method further comprises: obtaining a word frequency display instruction triggered by the user operating the display interface; and displaying, according to the word frequency display instruction, the word frequency of each first word and/or the word frequency of each second word; or displaying the forward segmentation result and the reverse segmentation result on the display interface comprises: displaying, on the display interface, the forward segmentation result and the word frequencies of the first words in the forward segmentation result, and the reverse segmentation result and the word frequencies of the second words in the reverse segmentation result.
23. The method according to claim 21, wherein the segmentation information further comprises a first word frequency sum corresponding to the first words in the forward segmentation result and a second word frequency sum corresponding to the second words in the reverse segmentation result; and after displaying the forward segmentation result and the reverse segmentation result on the display interface and marking the segmentation result corresponding to the character string to be segmented, the method further comprises: obtaining a word frequency display instruction triggered by the user operating the display interface; and displaying, according to the word frequency display instruction, the first word frequency sum and/or the second word frequency sum; or displaying the forward segmentation result and the reverse segmentation result on the display interface comprises: displaying, on the display interface, the forward segmentation result and the first word frequency sum, and the reverse segmentation result and the second word frequency sum.
The method according to any one of claims 21 to 23, wherein after displaying the forward segmentation result and the reverse segmentation result on the display interface, the method further comprises:
obtaining operation information of the user on the forward segmentation result or the reverse segmentation result on the display interface;
determining a segmentation result to be processed according to the operation information;
sending the segmentation result to be processed to the cloud server, so that the cloud server performs natural language processing on the segmentation result to be processed.

A character string segmentation apparatus, comprising:
a first segmentation module, configured to obtain a forward segmentation result of a character string to be segmented, the forward segmentation result including at least one first word;
a second segmentation module, configured to obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word;
a word frequency acquisition module, configured to obtain the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times the word appears in a preset text;
a result determining module, configured to determine a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.

The apparatus according to claim 25, wherein the first segmentation module is specifically configured to:
perform a forward segmentation operation on the character string to be segmented, and determine whether a first word is obtained;
if so, take the character string to be segmented with the first word removed as a new character string to be segmented, and return to performing the forward segmentation operation on the character string to be segmented;
if not, delete the first character in the forward direction of the character string to be segmented to obtain a processed character string, take the processed character string as a new character string to be segmented, and return to performing the forward segmentation operation on the character string to be segmented;
repeat the forward segmentation operation on the character string to be segmented until segmentation of the character string to be segmented is finished, obtaining the forward segmentation result.
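The forward segmentation loop in the preceding claim can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function name `forward_segment`, the use of a plain set as the dictionary, and the preference for the longest matching word are all assumptions made for illustration, since the claim only states that a word is either obtained at the front of the string or the leading character is deleted.

```python
def forward_segment(text, vocabulary):
    """Greedy forward segmentation (illustrative sketch).

    vocabulary is a hypothetical set of known words standing in for the
    dictionary; the claim does not prescribe this data structure.
    """
    max_len = max((len(word) for word in vocabulary), default=0)
    words = []
    remaining = text
    while remaining:
        match = None
        # Assumption: try the longest candidate at the front first.
        for size in range(min(max_len, len(remaining)), 0, -1):
            if remaining[:size] in vocabulary:
                match = remaining[:size]
                break
        if match is not None:
            # A first word was obtained: remove it and continue.
            words.append(match)
            remaining = remaining[len(match):]
        else:
            # No word obtained: delete the leading character and continue.
            remaining = remaining[1:]
    return words
```

For example, with the hypothetical vocabulary `{"the", "cat", "sat"}`, the string `"thecatxsat"` yields `["the", "cat", "sat"]`; the unmatched `x` is discarded by the leading-character deletion step.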
The apparatus according to claim 25, wherein the second segmentation module is specifically configured to:
perform a reverse segmentation operation on the character string to be segmented, and determine whether a second word is obtained;
if so, take the character string to be segmented with the second word removed as a new character string to be segmented, and return to performing the reverse segmentation operation on the character string to be segmented;
if not, delete the first character in the reverse direction of the character string to be segmented to obtain a processed character string, take the processed character string as a new character string to be segmented, and return to performing the reverse segmentation operation on the character string to be segmented;
repeat the forward segmentation operation on the character string to be segmented until segmentation of the character string to be segmented is finished, obtaining the reverse segmentation result.

The apparatus according to any one of claims 25 to 27, further comprising: a text acquisition module, configured to obtain text to be segmented and perform a symbol deletion operation on the text to be segmented to obtain the character string to be segmented.
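A corresponding sketch of the reverse segmentation loop is given below, again only as an illustration. It reads the final step of the claim as repeating the reverse operation, treats "the first character in the reverse direction" as the last character of the string, prefers the longest match, and returns the words in reading order; these readings, the name `reverse_segment`, and the set-based dictionary are assumptions, not details confirmed by the claims.

```python
def reverse_segment(text, vocabulary):
    """Greedy reverse segmentation (illustrative sketch)."""
    max_len = max((len(word) for word in vocabulary), default=0)
    words = []
    remaining = text
    while remaining:
        match = None
        # Assumption: try the longest candidate at the end first.
        for size in range(min(max_len, len(remaining)), 0, -1):
            if remaining[-size:] in vocabulary:
                match = remaining[-size:]
                break
        if match is not None:
            # A second word was obtained: remove it from the end.
            words.append(match)
            remaining = remaining[:-len(match)]
        else:
            # No word obtained: delete the trailing character.
            remaining = remaining[:-1]
    # Words were collected from the end, so restore reading order.
    words.reverse()
    return words
```

With the hypothetical vocabulary `{"an", "ant", "art", "tart"}`, the string `"antart"` is split as `["an", "tart"]` in the reverse direction, whereas a longest-match forward pass over the same string gives `["ant", "art"]`. Disagreements of this kind are what the word frequency comparison is meant to resolve.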
The apparatus according to any one of claims 25 to 27, further comprising: a dictionary tree construction module, configured to construct a forward dictionary tree and a reverse dictionary tree;
the first segmentation module is specifically configured to perform the forward segmentation operation on the character string to be segmented according to the forward dictionary tree;
the second segmentation module is specifically configured to perform the reverse segmentation operation on the character string to be segmented according to the reverse dictionary tree.

The apparatus according to claim 29, wherein each first node of the forward dictionary tree stores the word frequency of the word corresponding to that first node, and each second node of the reverse dictionary tree stores the word frequency of the word corresponding to that second node;
the word frequency acquisition module is specifically configured to:
obtain the word frequency of the first word from the first node corresponding to the first word;
obtain the word frequency of the second word from the second node corresponding to the second word.

The apparatus according to claim 30, further comprising: a corpus construction module, configured to construct a corpus, the corpus including a word library and the word frequencies of the words in the word library;
the dictionary tree construction module is specifically configured to construct the forward dictionary tree and the reverse dictionary tree according to the corpus, and to store the word frequency of each word in the corresponding first node and second node.
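The dictionary trees with per-node word frequencies can be sketched as an ordinary trie. This is a hedged illustration: it assumes that "dictionary tree" means a standard character trie, that the reverse tree is built from the reversed spelling of each word, and that the longest match is wanted. The names `TrieNode`, `build_trie` and `longest_match` are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        # Word frequency, set only on nodes that complete a word.
        self.frequency = None


def build_trie(corpus, reverse=False):
    """Build a trie from a {word: frequency} mapping.

    With reverse=True each word is inserted back to front, which is one
    possible reading of the "reverse dictionary tree".
    """
    root = TrieNode()
    for word, frequency in corpus.items():
        node = root
        for ch in (reversed(word) if reverse else word):
            node = node.children.setdefault(ch, TrieNode())
        node.frequency = frequency
    return root


def longest_match(root, text, reverse=False):
    """Return (word, frequency) for the longest word found at the front
    of text (or at its end when reverse=True), or None if there is none."""
    node = root
    best = None
    chars = reversed(text) if reverse else text
    for depth, ch in enumerate(chars, start=1):
        node = node.children.get(ch)
        if node is None:
            break
        if node.frequency is not None:
            word = text[-depth:] if reverse else text[:depth]
            best = (word, node.frequency)
    return best
```

Because the frequency sits on the node where the word ends, the lookup that finds a word also yields its frequency, so no second dictionary access is needed when the frequencies are later compared.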
The apparatus according to claim 31, wherein the preset text includes text that satisfies a preset use condition and the text to be segmented; the corpus construction module is specifically configured to:
obtain the word library according to a dictionary that satisfies the preset use condition;
determine the number of times each word in the word library appears in the text satisfying the preset use condition and in the text to be segmented;
construct the corpus according to the word library and the number of times each word in the word library appears in the text satisfying the preset use condition and in the text to be segmented.

The apparatus according to claim 32, wherein the corpus construction module is specifically configured to:
obtain at least one first character string according to the space characters in the text to be segmented;
match the at least one first character string against the words in the word library to obtain at least one second character string that matches a word in the word library;
determine, according to the number of times each second character string appears in the text to be segmented, the number of times the words in the word library appear in the text to be segmented.
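The corpus construction described in these two claims amounts to counting how often each word of the word library occurs in the available text. The sketch below is illustrative only. It splits every text on whitespace, whereas the claims state the space-based splitting only for the text to be segmented, so applying it to the reference texts is an assumption, as are the names `build_corpus`, `lexicon` and `reference_texts`.

```python
def build_corpus(lexicon, reference_texts, text_to_segment):
    """Return {word: count} for every word in the lexicon.

    lexicon          hypothetical word library (iterable of words)
    reference_texts  hypothetical texts satisfying the preset use condition
    text_to_segment  the raw text submitted for segmentation
    """
    counts = {word: 0 for word in lexicon}
    for text in list(reference_texts) + [text_to_segment]:
        # "First character strings": pieces separated by space characters.
        for token in text.split():
            # "Second character strings": pieces that match a lexicon word.
            if token in counts:
                counts[token] += 1
    return counts
```

Words that never occur keep a count of zero, so every word in the library still has a defined frequency when the dictionary trees are built from this mapping.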
The apparatus according to claim 28, further comprising: a feedback module;
the text acquisition module is specifically configured to obtain the text to be segmented sent by a user equipment;
the feedback module is configured to feed back segmentation result information of the character string to be segmented to the user equipment, the segmentation result information including the segmentation result of the character string to be segmented, so that the user equipment outputs the segmentation result to a user.

The apparatus according to claim 34, further comprising: a result acquisition module and a processing module;
the result acquisition module is configured to obtain a segmentation result to be processed sent by the user equipment;
the processing module is configured to perform natural language processing on the segmentation result to be processed.
A character string segmentation apparatus, comprising:
a sending module, configured to send text to be segmented that is input by a user to a cloud server, so that the cloud server obtains a character string to be segmented and determines a segmentation result according to the word frequency of each first word in a forward segmentation result and the word frequency of each second word in a reverse segmentation result;
a receiving module, configured to receive segmentation result information of the character string to be segmented fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result;
an output module, configured to output the segmentation result to the user.

The segmentation apparatus according to claim 36, wherein the output module is specifically configured to display the segmentation result on a display interface.

The segmentation apparatus according to claim 37, wherein the segmentation result information further includes a segmentation type corresponding to the segmentation result, the segmentation type being forward segmentation or reverse segmentation;
the output module is specifically configured to display the segmentation result and the segmentation type of the segmentation result on the display interface.
The segmentation apparatus according to claim 37, wherein if the segmentation result is the forward segmentation result, the segmentation information further includes the reverse segmentation result; or
if the segmentation result is the reverse segmentation result, the segmentation information further includes the forward segmentation result;
the output module is specifically configured to display the forward segmentation result and the reverse segmentation result on the display interface, and to mark the segmentation result corresponding to the character string to be segmented.

The segmentation apparatus according to claim 39, wherein the segmentation information further includes the word frequency of each first word in the forward segmentation result and the word frequency of each second word in the reverse segmentation result;
the display device further comprises: an instruction acquisition module, configured to obtain a word frequency display instruction triggered by the user operating the display interface;
the output module is further configured to display the word frequency of each first word and/or the word frequency of each second word according to the word frequency display instruction;
or
the output module is specifically configured to display, on the display interface, the forward segmentation result and the word frequencies of the first words in the forward segmentation result, as well as the reverse segmentation result and the word frequencies of the second words in the reverse segmentation result.
The segmentation apparatus according to claim 39, wherein the segmentation information further includes a first word frequency sum corresponding to the first words in the forward segmentation result and a second word frequency sum corresponding to the second words in the reverse segmentation result;
the display device further comprises: an instruction acquisition module, configured to obtain a word frequency display instruction triggered by the user operating the display interface;
the output module is further configured to display the first word frequency sum and/or the second word frequency sum according to the word frequency display instruction;
or
the output module is specifically configured to display, on the display interface, the forward segmentation result and the first word frequency sum, as well as the reverse segmentation result and the second word frequency sum.

The segmentation apparatus according to any one of claims 39 to 41, further comprising: an operation information acquisition module, configured to obtain operation information of the user on the forward segmentation result or the reverse segmentation result on the display interface;
a determining module, configured to determine a segmentation result to be processed according to the operation information;
the sending module is further configured to send the segmentation result to be processed to the cloud server, so that the cloud server performs natural language processing on the segmentation result to be processed.
A character string segmentation device, comprising:
an input device, configured to obtain text to be segmented;
a processor, coupled to the input device, configured to: obtain a forward segmentation result of a character string to be segmented, the forward segmentation result including at least one first word, and obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word; obtain the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times the word appears in a preset text; and determine a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.
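The final determination step, choosing between the forward and the reverse result from the word frequencies, can be sketched as follows. The claims here only say that the choice is made "according to" the word frequencies. Comparing the two word frequency sums mentioned in the dependent claims, and preferring the forward result on a tie, are assumptions made for this illustration, as is the name `choose_segmentation`.

```python
def choose_segmentation(forward_words, reverse_words, frequency):
    """Pick the forward or the reverse result by total word frequency.

    frequency is a hypothetical {word: count} mapping; words missing
    from it are treated as having frequency 0.
    Returns (segmentation_type, words).
    """
    forward_sum = sum(frequency.get(word, 0) for word in forward_words)
    reverse_sum = sum(frequency.get(word, 0) for word in reverse_words)
    # Assumption: the larger sum wins, and ties go to the forward result.
    if reverse_sum > forward_sum:
        return "reverse", reverse_words
    return "forward", forward_words
```

With hypothetical frequencies `{"an": 5, "ant": 2, "art": 4, "tart": 3}`, the forward candidate `["ant", "art"]` sums to 6 and the reverse candidate `["an", "tart"]` sums to 8, so the reverse result is selected. Returning the type alongside the words mirrors the segmentation type that the client-side claims display next to the result.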
A cloud server, comprising:
an input device, configured to obtain text to be segmented;
a processor, coupled to the input device, configured to: obtain a forward segmentation result of a character string to be segmented, the forward segmentation result including at least one first word, and obtain a reverse segmentation result of the character string to be segmented, the reverse segmentation result including at least one second word; obtain the word frequency of each first word and the word frequency of each second word, the word frequency being a predetermined number of times the word appears in a preset text; and determine a segmentation result of the character string to be segmented according to the word frequency of each first word and the word frequency of each second word, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result.
A character string segmentation device, comprising:
an output device, configured to send text to be segmented that is input by a user to a cloud server, so that the cloud server obtains a character string to be segmented and determines a segmentation result according to the word frequency of each first word in a forward segmentation result and the word frequency of each second word in a reverse segmentation result;
an input device, configured to receive segmentation result information of the character string to be segmented fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result;
a processor, coupled to the output device and the input device, configured to control the input device to output the segmentation result to the user according to the segmentation result information.
A user equipment, comprising:
an output device, configured to send text to be segmented that is input by a user to a cloud server, so that the cloud server obtains a character string to be segmented and determines a segmentation result according to the word frequency of each first word in a forward segmentation result and the word frequency of each second word in a reverse segmentation result;
an input device, configured to receive segmentation result information of the character string to be segmented fed back by the cloud server, the segmentation result information including the segmentation result of the character string to be segmented, wherein the segmentation result of the character string to be segmented is the forward segmentation result or the reverse segmentation result;
a processor, coupled to the output device and the input device, configured to control the input device to output the segmentation result to the user according to the segmentation result information.
PCT/CN2017/091783 2016-07-13 2017-07-05 Character string segmentation method, apparatus and device Ceased WO2018010579A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610552115.0 2016-07-13
CN201610552115.0A CN107622044A (en) 2016-07-13 2016-07-13 Segmenting method, device and the equipment of character string

Publications (1)

Publication Number Publication Date
WO2018010579A1 true WO2018010579A1 (en) 2018-01-18

Family

ID=60952791

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/091783 Ceased WO2018010579A1 (en) 2016-07-13 2017-07-05 Character string segmentation method, apparatus and device

Country Status (3)

Country Link
CN (1) CN107622044A (en)
TW (1) TW201804341A (en)
WO (1) WO2018010579A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657738B (en) * 2018-10-25 2024-04-30 平安科技(深圳)有限公司 Character recognition method, device, equipment and storage medium
CN109800435B (en) * 2019-01-29 2023-06-20 北京金山数字娱乐科技有限公司 Training method and device for language model
CN111078083A (en) * 2019-06-09 2020-04-28 广东小天才科技有限公司 Method for determining click-to-read content and electronic equipment
CN110532112B (en) * 2019-08-29 2022-10-04 维沃移动通信有限公司 Object extraction method and mobile terminal
TWI772709B (en) * 2019-11-14 2022-08-01 雲拓科技有限公司 Automatic claim-element-noun-and-position-thereof obtaining equipment for no-space text
CN113591440B (en) * 2021-07-29 2023-08-01 百度在线网络技术(北京)有限公司 Text processing method and device and electronic equipment
CN117422071B (en) * 2023-12-19 2024-03-15 中南大学 Text term multiple segmentation annotation conversion method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102915299A (en) * 2012-10-23 2013-02-06 海信集团有限公司 Word segmentation method and device
CN103646018A (en) * 2013-12-20 2014-03-19 大连大学 Chinese word segmentation method based on hash table dictionary structure

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063424A (en) * 2010-12-24 2011-05-18 上海电机学院 Method for Chinese word segmentation
CN103699524A (en) * 2013-12-18 2014-04-02 百度在线网络技术(北京)有限公司 Word segmentation method and mobile terminal
CN103678282B (en) * 2014-01-07 2016-05-25 苏州思必驰信息科技有限公司 A kind of segmenting method and device
CN104899187A (en) * 2014-03-06 2015-09-09 武汉元宝创意科技有限公司 Man-computer interaction word segmentation and semantic marking method and man-computer interaction word segmentation and semantic marking system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PENG, QI ET AL.: "A General Method of Chinese Word Segmentation Based on the Resolution of Word Frequency Ambiguity", JOURNAL OF GUANGXI NORMAL UNIVERSITY (NATURAL SCIENCE EDITION), vol. 34, no. 1, 31 March 2016 (2016-03-31) *
ZHANG, HENG ET AL.: "Chinese Word Segmentation Method Based on Dictionary and Frequency of the Words", MICROCOMPUTER INFORMATION, vol. 24, no. 1-3, 31 December 2008 (2008-12-31) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109522550A (en) * 2018-11-08 2019-03-26 和美(深圳)信息技术股份有限公司 Text information error correction method, device, computer equipment and storage medium
CN109522550B (en) * 2018-11-08 2023-04-07 和美(深圳)信息技术股份有限公司 Text information error correction method and device, computer equipment and storage medium
CN112684905A (en) * 2019-10-17 2021-04-20 北京搜狗科技发展有限公司 Word learning method and device and electronic equipment
CN111310450A (en) * 2020-03-23 2020-06-19 中国建设银行股份有限公司 Character string word segmentation method, device, equipment and storage medium
CN111310450B (en) * 2020-03-23 2023-07-14 中国建设银行股份有限公司 Character string word segmentation method, device, equipment and storage medium
CN113569027A (en) * 2021-07-27 2021-10-29 北京百度网讯科技有限公司 Document title processing method and device and electronic equipment
CN113569027B (en) * 2021-07-27 2024-02-13 北京百度网讯科技有限公司 Document title processing method and device and electronic equipment
CN114722815A (en) * 2022-04-18 2022-07-08 上海喜马拉雅科技有限公司 Affix determination method, affix determination device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN107622044A (en) 2018-01-23
TW201804341A (en) 2018-02-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17826916

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17826916

Country of ref document: EP

Kind code of ref document: A1