SYMBOLIC TOKENIZER FOR WORDS AND PHRASES
TECHNICAL FIELD
This invention relates to microprocessors and more particularly to a custom microprocessor which permits a conventional microprocessor to manipulate quickly and efficiently entire language words and phrases as tokens as opposed to manipulating single characters.
BACKGROUND ART
Conventional microprocessors process text by manipu¬ lating binary values which represent single characters. In order to process an entire word the conventional micro¬ processor must sequentially process each individual char¬ acter of the word. This invention is a custom microprocessor called the symbolic tokenizer which permits a conventional micropro¬ cessor to process language words and phrases in their entirety instead of on a character-by-character basis. This leads to greatly increased speed and efficiency where the manipulation of text is involved.
A data compression method called tokenization is the basis of this invention. Language words and phrases are converted into tokens which utilize fewer bits than the ASCII or EBCDIC binary equivalents of the same words.
The following United States patent documents were considered in the investigation and evaluation of this invention:
2,709,199 3,309,694 3,388,380 3,599,205
3,685,033 3,810,154 4,122,299 4,229,817
Patent 2,709,199 entitled "CODE SIGNAL CONVERTER" is an apparatus for converting signals of a code having a given number of units into signals of a code employing a different number of units. An example would be converting signals from a five- to an eight-unit code.
This invention does not employ data compression me¬ thods. It actually increases the amount of code being transmitted and is merely used to change from one code format to another, longer code format for use in telegra¬ phy.
Patent 3,309,694 entitled "CODE CONVERSION APPARATUS" is an electronic translator for converting a binary code representation of a letter, number, or phrase to an audible, and/or visible indication of that letter, number, or phrase. This is not a system of text compression.
Patent 3,388,380 entitled "SYSTEM FOR DISPLAY OF A WORD DESCRIPTION OF PARAMETERS AND VALUES THEREOF IN RE¬ SPONSE TO AN INPUT OF A WORD DESCRIPTION OF THE PARAMETER" is a data display system. It does not involve the compres¬ sion of text.
Patent 3,599,205 entitled "BINARY TO TERNARY PROTECTED CODE CONVERTER" is an apparatus for converting nonprotected code signals having a specified number of bits to protected code having approximately the same number of bits. This system does not attempt to compress the code.
Patent 3,685,033 entitled "BLOCK ENCODING FOR MAGNETIC RECORDING SYSTEMS" is a high density recording and repro¬ ducing system. It is a method for increasing the number
of stored bits per transition on a magnetic recording medium. This is not a means of text compression.
Patent 3,810,154 entitled "DIGITAL CODE TRANSLATOR" is an apparatus for substantially increasing the amount of information that can be transferred by a teletype in a given amount of time. This method does not require an increase in the bandwidth of the carrier. The transmitting station converts parallel, fixed-place, digital-coded mes¬ sage characters into serial, variable-place, digital-coded message characters. The receiving station reverses the conversion. This is not a method of data compression using tokenization.
Patent 4,122,299 entitled "DATA OUTPUT MODIFYING SYSTEM" is a system for placing data in a format for accep- tance by a general purpose communications printer. The data is originally in a format for providing a television display to news wire service subscribers. This invention involves the conversion of data from one format to another. It does not attempt to compress text. Patent 4,229,817 entitled "PORTABLE ELECTRONIC cryp¬ tographic DEVICE" is a hand-held apparatus for enciphering and deciphering text. This device merely changes the presentation of data and does not compress it.
None of the above-cited patents concern data compres¬ sion techniques utilizing tokenization.
DISCLOSURE OF INVENTION
It is an object of this invention to provide a custom microprocessor which allows a conventional microprocessor to process in parallel entire language words and phrases by first converting the words and phrases into tokens. The tokens may then be manipulated in a manner similar to the conventional manipulation of characters. The tokens are converted back into language words or phrases for display or printout.
A further object of this invention is to provide a custom microprocessor that permits fast searching through parallel processing of a human language dictionary. The custom microprocessor contains an internal electronic dictionary which contains language words and phrases. This dictionary may be interrogated by an external comput¬ ing device such as a conventional microprocessor for pur¬ poses such as spelling verification. Search times are very fast because the symbolic tokenizer is designed to simultaneously process whole or partial words and phrases in parallel instead of individual characters.
These objects are attained through the use of a custom microprocessor, called the symbolic tokenizer that is de¬ signed to process human language. The symbolic tokenizer produces a binary code that represents a human language word or phrase. The binary symbol is used -as a token for the representation of a language word or phrase. The token can be used by a con¬ ventional microprocessor to manipulate text in such appli¬ cations as mass storage and telecommunications. When the text is to be output, for example, to a terminal or prin¬ ter, the tokens are reconverted to human language.
The tokenization of phrases is a two-step process. First, the words comprising a phrase are tokenized. When the control microprocessor recognizes a key word which may be the beginning of a tokenizable phrase, a search is made of the phrase portion of the dictionary using the already tokenized language words. Since the search is conducted using tokenized words, the same architecture allows for the simultaneous searching of the several words which comprise the phrase.
The symbolic tokenizer can be used with a number of different systems and serves many useful purposes. Token¬ ized text (a series of tokens representing words or phrases of the text under consideration) can be transmitted or
stored far more efficiently than its ASCII or EBCDIC equiv¬ alent. The ability to rapidly search a language dictionary through parallel processing makes a very fast spelling checker feasible. The control microprocessor can be any central process¬ ing unit from a simple microprocessor to a conventional mainframe computer.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram of the symbolic tokenizer showing a dictionary ROM and a control microprocessor em¬ bodying the present invention; FIG. 1A is a schematic and block diagram depicting a storage array in the control microprocessor containing data for operating the symbolic tokenizer;
FIG. 2 is a schematic and block diagram of the dic¬ tionary portion of the symbolic tokenizer and the control logic for the dictionary;
FIG. 3 is a schematic and block diagram of the com¬ parator portion of the symbolic tokenizer and showing a mailbus and output multiplexer;
FIG. 4 is a schematic and block diagram of a control logic portion of the symbolic tokenizer;
FIG. 5 is a schematic of an interface to a control microprocessor, such as a Commodore 64K computer;
FIG. 6 is a schematic and block diagram of a dictio¬ nary ROM text storage arrangement allowing parallel char¬ acter processing;
FIG. 7 is a schematic and block diagram depicting a flowchart of the process of controlling the symbolic token¬ izer for tokenization;
FIGS. 8A-8 are schematic and block diagrams depicting a flowchart of the process of tokenization;
FIG. 9 is a schematic and block diagram depicting a flowchart of the process of detokenization.
FIG. 10 is a schematic depiction of a keytoken list stored in the dictionary ROM;
FIG. 11 is a schematic and a depiction of a phrase table stored in the dictionary ROM;
FIG. 12 is a schematic and a depiction of a memory map stored in the control microprocessor and used to access the phrase table;
FIG. 13 is a schematic and a depictin of a memory map stored in the control microprocess and used to access the keytoken list; and
FIGS. 14 and 14A are a schematic and block diagram depicting a flowchart of the process of phrase tokeniza¬ tion.
MODES FOR CARRYING OUT THE INVENTION
As depicted in FIG. 1, a conventional microprocessor 100 is used to control the symbolic tokenizer 102A. The control microprocessor may be a Commodore 64K computer, an IBM Personal Computer, or any other type of processor. The control microprocessor includes a storage area, for example, ROM 100A. However, it is to be understood that the storage area to be used in the manner described below may also be a removable storage medium, such as a floppy diskette.
The storage area contains a map of the start and end points of areas to be searched in the dictionary. In the preferred embodiment, the map is a memory map similar to that depicted in FIG. 1A and is accessed according to the starting character of the English-language trial word to be tokenized and the number of characters in that word. For any given starting character and character length, there is both a unique starting address, alpha, corre¬ sponding to an address of a memory location in the dictio¬ nary ROM 104 and a representation of the total number of grammar bytes, beta, having the given starting character and the given character length. The term grammar byte will be defined below with respect to FIG. 6 after the dictionary architecture is defined. The starting address points to a location in the dictionary ROM for starting a search for words to be compared to the particular word under consideration. The number of grammar bytes indicates the total number of dictionary look-ups required to compare the trial word with all the words in the dictionary ROM containing the given starting character and the given number of characters in the trial word. Therefore, the number of grammar bytes is used to determine the end point for searching.
Prior to using the symbolic tokenizer, a dictionary of preselected, commonly-used words is stored in ROM 104.
The symbolic tokenizer's dictionary ROM 104 stores words according to starting letter and character string length. The dictionary effectively contains segments which depend upon the number of characters in the words they contain. In the preferred embodiment, the first segment contains words having between one and four characters, the second between five and eight, the third between nine and twelve, and the fourth between 13 and 16. The reason for the divisions into segments is that the language words in any given segment contain the same number of grammar bytes.
The use of four dictionary segments in the prototype symbolic tokenizer allows for language words of 16 charac¬ ters in length, or tokenized phrases of 80 bits in length in the preferred embodiment. Language words of twenty characters, or phrases of 100 bits, can be stored with an appropriate change in the memory map of FIG. 1A. Addition¬ ally, . a different trial word register would be required because the register of the present embodiment has a phys¬ ical capacity of only 16 characters or 80 bits.
Segment codes are used to indicate how many dictionary reads must occur before the results of the comparison should be sampled. Only four codes, SI, S2, S3, and S4, are required because, for example, for the character code SI, one grammar byte is needed to read a single character word, a two-character word, a three-character word, and a four-character word. Therefore, only one clock cycle or grammar byte is required to read the entire word for the character code S2, two grammar bytes are needed to read a 5-character word, and 6-, 7-, or 8-character words. Each read during a given clock cycle transfers a maximum of four characters from the dictionary ROM to the compare logic 103.
Each dictionary segment contains up to four subseg- ments and up to 104 sectors. There is one subsegment for each of the 4-character string lengths in the segment.
There is one sector for each different starting letter and each unique character string length of the language words within that dictionary segment. For example, if a particular dictionary subsegment contained language words of two characters starting with the letters A through Y, but no language words starting with Z, then that dictionary subsegment would contain 25 sectors of words. That is, one sector of words for each unique starting character and character string length. 0 Division of the memory map into both segments and sectors permits two-dimensional accessing of the memory map data for obtaining the dictionary starting address for the search. Therefore, no searching of the memory map is required, only direct look-up, and only a minimum 5 of searching of the dictionary is necessary to determine whether the trial word is present in the dictionary. Only that sector of the stored dictionary where words 'of the appropriate starting letter and character length are stored is searched. This two-dimensional accessing of 0 the memory map permits an extremely fast search of the- stored dictionary because only a small portion of the stored dictionary is actually searched.
The symbolic tokenizer compares a trial word to a language word within the stored dictionary and returns 5 the address in the dictionary ROM where that word may be found. The address is then provided to the control micro¬ processor where the token is derived. In the preferred embodiment, the token is the address of that word within the dictionary, divided by the number of the dictionary Q segment within which the word was found, and added to an offset. The offset is determined by the dictionary segment and depends upon the location of the segment within the dictionary.
This is done, instead of simply using the unaltered _ dictionary address as a token, because the division results
in a binary number one digit less in length than the ori¬ ginal number. The addition of the offset restores informa¬ tion lost in the division so that each token still defines a unique address within the dictionary.
During detokenization, the reverse process takes place. The control microprocessor knows from which dic¬ tionary segment the token was generated, because tokens generated from given segments fall within given numeric ranges. Tokens numerically greater than one given number, but less than another given number, must have their corre¬ sponding language word within a corresponding dictionary segment. That is, each dictionary segment has tokens that fall between predetermined numbers.
Since the dictionary segment can be determined from the token itself, the offset can be subtracted from the token and the token then multiplied by the appropriate dictionary segment number to find the address of the lan¬ guage word within the dictionary ROM. During detokeniza¬ tion, the ROM address is supplied to the symbolic tokenizer by the control microprocessor, and the symbolic toke'nizer returns the language word represented by the token to the control microprocessor.
As shown in FIG. 6, language words are represented in the dictionary by one or more 20-bit bytes called gram¬ mar bytes. Each 20-bit byte represents four characters from the language word. Language words may be represented by up to four 20-bit bytes in the dictionary because the present embodiment contemplates storing words of up to 16 characters. The use of up to four 20-bit grammar bytes provides a variable-byte word length of 20 bits each to maximize dictionary storage efficiency. For example, the word "water" is stored in two 20-bit grammar bytes in ROMs 1-5, four characters in the first byte and one char¬ acter in the second. Additionally, as indicated in FIG. 2 of the preferred embodiment, ten dictionary ROMs 1-10
1 are separated into two banks of five ROMs each, ROMs 1-5 and ROMs 6-10. Henceforth, ROMs 1-5 will be considered, but all the grammar bytes for any given word are stored in the same bank of ROMs, and it will be understood that
5 ROMs 6-10 can be treated in an equivalent manner. For the word "water", the last four characters are stored in one 20-bit grammar byte of ROMs 1-5, as indicated in FIG. 6. The least significant four bits of the letter "r" are stored in the least significant four bits of ROM 1. The 0 most significant bit of the letter "r" is stored in the least significant bit of first ROM 2. The least signifi¬ cant three bits of the letter "e" are stored in the remain¬ ing least significant three bits of the first four bits of ROM 2, and the most significant two bits of the letter
15 "e" are stored in the least two significant bits of the first four bits of ROM 3. The two least significant bits of the letter "t" are stored in the two most significant bits of the first four bits of ROM 3, and the three most significant bits of the letter "t" are stored in the three
20 least significant bits of the first four bits of ROM 4. The least significant bit of the letter "a" is stored in the most significant bit of the first four bits of ROM 4, and the four most significant bits of the letter "a" are stored in the first four bits of ROM 5. Because five 5 ROMs are used in a bank, and only four bits from each ROM are used for any given word, one character, or five bits, from the word "water" still remains to be stored. A sec¬ ond 20-bit byte is then used to store the last character. It should be noted that whether the least significant or
30 most significant bits of the ROMs are used depends only on the address used to access the ROMs and will vary ac¬ cording to how the actual words are stored.
As is apparent from FIG. 6, each of the characters "a", "t", "e", and "r" are stored in the same memory loca-
, tion having the same address in each of ROMs 1-5. There-
fore, all the letters in a grammar byte may be processed in parallel for any given address or memory location in the bank of ROMs. In other words, the letters "a", "t", "e", and "r" can be accessed simultaneously or in parallel. As is well known in the art, standard ROMs have eight pins for accessing the memory locations. However, the preferred embodiment provides access to only four pins, or bits, of memory locations per ROM at any one time and provides five ROMs in a bank so that an integer number (i.e., 4) of characters may be represented in the bank of ROMs without repeatedly leaving blanks or null characters. The other four bits, or memory locations, for a given address in ROMs 1-5 are used for another grammar byte, but only in conjunction with each other. Hereafter, only one set of four bits of the eight bits available in each of ROMs 1-5 will be discussed. It will be assumed that other grammar bytes are allocated among each of the remain¬ ing four bits in ROMs 1-5.
It is undesirable to use all eight bits from a bank °f only three ROMs, for example, all eight bits of each of ROMs 1 and 2, and only four bits of the eight bits of ROM 3 to form a 20-bit grammar byte. The remaining four bits in ROM 3 would not be used. Similarly, the same arrangement would not be used with a bank of five ROMs, since the remainder of ROM 3 and all of ROMs 4 and 5 may contain a 20-bit grammar byte of an entirely different word than the 20-bit grammar byte of ROMs 1 and 2 and the first half of ROM 3. Special multiplexing would be neces¬ sary to be able to read the grammar byte from ROMs 1-3 and not the grammar byte in ROMs 3-5. Therefore, using only four bits of each of five ROMs is preferred.
Because one whole character remains to be stored from the word "water", the next succeeding memory location in each of ROMs 1-5, defined by the next succeeding ad- dress, is used to store the character "w". In this regard.
the least significant four bits of the character "w" are stored in the least significant four bits of ROM 1, and the most significant bit of the character "w" is stored in the least significant bit of the first four bits of
5 ROM 2. Zeros or null characters are placed in the three most significant bits of the four bits used in ROM 2, and in the least significant four bits used in each of ROMs 3, 4, and 5.
Words containing only four characters or less would
10 require an appropriate number of bits of only one grammar byte in ROMs 1-5. Words containing between five and eight characters would require all 20 bits of a first grammar byte and an appropriate number of the 20 bits of a second grammar byte. For an 8-character word, two 20-bit grammar
2_5 bytes would be required. For words containing between 9 and 12 characters, a series of three 20-bit grammar bytes would be required, all of the bits in the first two.grammar bytes being used while an appropriate number of bits in the last 20-bit grammar byte are taken up by the required
2o bits for the appropriate number of characters remaining. The same comments apply with respect to words containing between 13 and 16 characters. Additionally, it would be a short extension to include words having between 17 and 20 characters by allowing up to five 20-bit grammar bytes
25 per word.
The storage of the bits in the various ROMs is con¬ trolled during creation of the dictionary in a manner that would be obvious to one skilled in the art in view of the description herein. The memory map in the control
30 microprocessor is then created based on the particular physical location of the grammar bytes in the respective bank of ROMs. Access to the ROMs for searching for dict¬ ionary words and for reading the contents of the various memory locations is as described below with respect to
__ FIG. 2.
1 Phrases up to 80 bits long are also stored in the dictionary. The phrases are stored as previously tokenized words; therefore, several words may comprise a phrase which is stored in four 20-bit bytes. That is, since the phrases 5 are stored as tokens of, at most, 20 bits each, and not as language words, their storage requires much less memory than if the phrases were stored as their ASCII or EBCDIC equivalents. Depending on whether a token is used, or merely the original address, the "token" is typically 10 between 14 and 16 bits long.
The character length distribution of language words varies depending upon a particular given dictionary's vocabulary. For example, a dictionary may be designed around a standard business vocabulary. The business vocab- 15 ulary has a character population density that places about 90% of its words in the 5- to 12-character length category.
The following is a segment map that represents a business dictionary. Other vocabularies may be divided differently. The segment division points are software
20 programmable functions of vocabulary and are defined within- the control microprocessor's tokenizer dictionary map.
The prototype symbolic tokenizer's business dictionary is composed of 13,312 language words. The business dic¬ tionary is divided into four segments designated as SI,
25 S2, S3, and S4.
Segment SI contains approximately one thousand words and is addressed by the hexadecimal numbers 0000 through 03FF.
Segment S2 contains approximately six thousand words
30 and is addressed by the hexadecimal numbers 0400 through 1BFF.
Segment S3 contains approximately five thousand words and is addressed by the hexadecimal numbers 1C00 through
2FFF. o ~o Segment S4 contains approximately one thousand words
and is addressed by the hexadecimal numbers 3000 through 33FF.
SI expresses words requiring one 20-bit grammar byte (up to four characters) . S2 expresses words requiring two 20-bit grammar bytes or 40 bits (between five and eight characters) . S3 expresses words requiring three 20-bit grammar bytes or 60 bits (between nine and twelve characters) . S4 expresses words requiring four 20-bit grammar bytes or 80 bits (between thirteen and sixteen 0 characters) .
The remainder of the tokenizer 102A will now be de¬ scribed. The compare logic 103 compares the trial word from the trial word register 111 to selected individual words from the proper segment of the dictionary ROM 104. 5 Address counter 105 determines each address of the dictio¬ nary ROM 104 from which a word, or a portion of a word is to be taken to be compared by the c'ompare logic 103. Words are transferred from the dictionary ROM 104 to the . compare logic 103 over the grammar bus 112. 0 The address counter 105 is a register/counter and holds the memory address for ROM 104 indicating the grammar byte to be accessed. After a first grammar byte is read, the address counter is incremented so that the grammar byte in the next succeeding address can be read. 5 The segment register 106 determines how many dictio¬ nary reads must take place to determine when the output of the compare logic 103 should be sampled. This is ne¬ cessary because words stored in the dictionary ROM 104 are variable in length and require from one to four reads Q to be output to the compare logic 103.
A segment counter 107 counts down while the address counter 105 counts up. The segment counter 107 counts the number of grammar bytes required to read a given language word from the selected segment of the dictionary ROM 104 being searched. During the read of dictionary ROM 104,
the first grammar byte will be read from the memory loca¬ tions of ROMs 1-5. At the end of the read, during the first clock cycle, the segment counter will be decremented, and the address counter will be incremented. Each time the segment counter 107 reaches zero (i.e., when the entire word has been read from the dictionary ROM) , the segment counter 107 is reloaded from the segment register 106 so the next word can be read.
The control logic 109 stops the address counter 105 from incrementing when either a comparison by the compare logic 103 is true, or the entire sector of the dictionary ROM 104 which was to be searched had been searched, and no comparison by the compare logic 103 was found to be true. The control logic 109 also accepts tokenizer control codes, shown as TCC in FIG. 1, from the control microprocessor
100 through the mailbus 102 to be described more fully below.
A brief description of the process will now be given with respect to FIGS. 1 and 7.
As shown in FIG. 1, a control microprocessor 100* (a Commodore 64K in the prototype) will have a memory map for the dictionary ROM 104 already created as discussed above, and the dictionary ROM will contain the desired library of dictionary words. The control microprocessor 100 issues a tokenizer reset command through interface
101 to the mailbus 102. The interface 101 is described below with respect to FIG. 5. The reset command is used to reset the segment register 106, and reinitialize the segment counter 107. The tokenizer reset also reinitial- izes the address counter 105 and the control logic 109.
The scan limit down counter 108 and the status register 110 are also reinitialized.
The control microprocessor then determines the number of characters in the trial word under consideration and determines a segment code according to the number of char-
acters in the trial word. As discussed above, if the trial word contains between one and four characters, a segment code representing "SI" will be issued to the seg¬ ment register 106. If the trial word contains between five and eight characters, a segment code representing "S2" will be loaded in the segment register. If the word contains between nine and twelve characters, a representa¬ tion of the segment code "S3" is loaded, and if the trial word contains between thirteen and sixteen characters, a 0 segment code representing "S4" is loaded. For example, the word "water" uses two grammar bytes, so the segment counter would contain a representation of "S2" .or "2", indicating that two grammar bytes must be read for "water".
The control microprocessor then supplies the trial 5 word "water" through the interface 101, and the parallel mailbus 102, tto the trial word register 111. During the tokenization process; the trial data is the word on which tokenization is to be attempted.
The control microprocessor utilizes the character o string length of the trial word and the beginning character of the trial word to access the dictionary ROM information contained in the memory map in storage device 100A. The microprocessor obtains the starting address and the total number of grammar bytes for the words starting with the 5 given character and having the given character count. As indicated above, beta indicates the number of total grammar bytes of the words containing the given letter and having the given character length. Since a single grammar byte may not contain representations of all characters in the stored word, several grammar bytes may be required to form a complete representation of the stored word. How¬ ever, beta is a representation of the total number of grammar bytes needed to step through all the entries in the dictionary ROM containing words starting with the given character and having the given character string
length. By way of example, the word "water" and all other five-character words beginning with the character "w" and stored in the dictionary ROM would require two grammar bytes to store the individual word. If there were five words stored in the dictionary ROM, each starting with the character "w", and each consisting of a total of five characters, then the total number of grammar bytes, beta, required to search through the dictionary ROM for the five words would be ten grammar bytes. In other words, the tokenizer 102A must cycle or search through the dic¬ tionary ROMs, starting at the starting address, represented by alpha, twice for each word. Carrying out the twice- repeated cycle for each stored word requires a total of ten cycles or look-ups. Therefore, beta, corresponding to words starting with the character "w" and having five characters, would contain a representation of "10".
The control microprocessor loads the starting address into the address counter 105. For example, if the words starting with "w" and having five characters were found beginning at the first address location of ROM 104, the address counter would be loaded with a representation of the hexadecimal number 0000. The control microprocessor also loads the scan limit down counter 108 with a repre¬ sentation of beta, the total number of 20-bit grammar bytes to be read from the dictionary ROM 104. In the present example, beta is 10.
Each of the above steps are each preferably accom¬ plished during a respective single clock cycle of the microprocessor. The particular order in which the steps are carried out is not significant, as long as the token¬ izer registers are loaded with representations of the appropriate information. However, as would be clear to one skilled in the art, reset of the tokenizer must occur prior to loading of any registers in the tokenizer. Once the registers have been loaded, the control microprocessor
issues a start-scan command to the tokenizer 102A. Once the start-scan command is issued, the tokenizer steps through a defined sequence of steps, and the control micro¬ processor is free to carry out other operations. The control microprocessor intermittently tests the mailbus 102 for one of three status codes, a representation of DEQ representing data equals, a representation of DNF meaning data not found, or a representation of SIP indi¬ cating scan in progress. In the prototype, a mailbus Q interrupt was possible for the interface 101 to interrupt the operation of the control microprocessor 100 when a mailbus interrupt signal is received from the tokenizer.
During the tokenization process, the tokenizer com¬ pares the trial word with each word in the dictionary ROM 5 identified according to the starting address and the number of grammar bytes. The compare logic 103 will compare the first grammar byte of the trial word contained in register 111 with the first grammar byte read from the first memory location in dictionary ROM 104. At the same time, the 0 segment counter will be decremented so that it contains a representation of "1", indicating that one more grammar byte must be read in order to read the entire word "water". Additionally, the address counter will be incremented so that the counter contains a representation of the hexade¬ 5 cimal number 0001. The scan limit down counter 108 is also decremented so that it contains a representation of "09". If a match is found in the first grammar byte, no further change in state of the tokenizer is made. During the next clock cycle, the next grammar byte is read from dictionary ROM 104 and compared with the next grammar byte in trial word register 111. Additionally, the segment down counter is decremented to a representation of "0", indicating that all grammar bytes for the given word have been read and compared. At this point, the compare logic _ is sampled to determine the status of the compare. Since 5
a match would be found in the present example, the token¬ izer issues the DEQ (data equals) signal, and the control microprocessor issues a read command to the tokenizer so that the address of the dictionary word for which a match was found can be read by the control microprocessor. The address is then used by the control microprocessor to determine an appropriate token. Specifically, the address is preferably divided by the segment number within which the word was found and added to an offset, as described above. The token is either then stored as a substitute text storage scheme, or may be communicated to another processor as part of a text transfer. Detokenization may then be conducted at any time.
If no match is found after the comparison of the trial word with the word in the starting address, the starting address is incremented, the segment down counter is reloaded with the representation of the segment code, the scan limit down counter is decremented„ and the next
5-character word starting with the character "w" will then be available to be read from dictionary ROM 104. It should be noted that if a match is not found for the first grammar byte, the comparison process is still continued so that the address counter is incremented, and the scan limit down counter is decremented, as appropriate, so that the tokenizer can start with the address of the next succeeding word as the comparison process continues.
Additionally, the continued cycling allows the segment down counter to be reloaded with the representation of the segment code so that all of the grammar bytes for the next succeeding word will be read and compared. If no match is found after the appropriate number of grammar bytes have been accessed, the DNF (data not found) signal is provided to the status register 110, which then produces a DNF to mailbus 102. This signal is provided to the control microprocessor which then either terminates the
text processing or repeats the above-described process with a subsequent trial word.
As shown in FIG. 7, if the control microprocessor identifies the trial word as having greater than 17 char- acters, a data not found code is issued, and the control microprocessor returns to process another word or ends the text processing. In the situation where the trial word contains more than 16 characters, the particular trial word is stored as the ASCII or EBCDIC equivalents. However, this would occur in a very small percentage of the words.
The tokenizer and the method and means by which the control microprocessor directs the tokenizer will now be considered in detail. The tokenizer control codes are 9- bit commands from the control microprocessor 100 to the symbolic tokenizer. The commands are divided between write commands, which cause outputs to be sent from the control microprocessor 100 to the symbolic 'Tokenizer, and read commands, which cause inputs to be sent from the symbolic tokenizer to the control microprocessor 100. The commands are as follows:
TABLE I WRITE COMMANDS
MAILBUS BIT HEX
ADDRESS VALUE FUNCTION SSSS00000 00 TOKENIZER RESET SSSS00001 01 TRIAL WORD BYTE 1A SSSS00010 02 TRIAL WORD BYTE IB SSSS00011 03 TRIAL WORD BYTE 1C SSSS00100 04 LOAD SCAN ADDRESS REGISTER LOW BYTE
SSSS00101 05 TRIAL WORD BYTE 2A SSSS00110 06 TRIAL WORD BYTE 2B SSSS00111 07 TRIAL WORD BYTE 2C SSSS01000 08 LOAD SCAN ADDRESS REGISTER HIGH BYTE
SSSS01001 09 TRIAL WORD BYTE 3A SSSS01010 OA TRIAL WORD BYTE 3B SSSS01011 OB TRIAL WORD BYTE 3C SSSS01100 OC "LOAD SCAN ADDRESS WORD COUNTER LOW BYTE
SSSS01101 OD TRIAL WORD BYTE 4A SSSS01110 OE TRIAL WORD BYTE 4B SSSS01111 OF TRIAL WORD BYTE 4C SSSS10000 10 LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE
SSSS10100 14 LOAD SEGMENT REGISTER SSSS11000 18 START SCAN
TABLE II READ COMMANDS
MAILBUS BIT HEX
ADDRESS VALUE FUNCTION SSSSXX000 00 TOKEN ADDRESS BITS 0-7 SSSSXX001 01 TOKEN ADDRESS BITS 8-14 SSSSXX010 02 GRAMMAR BUS BITS 0-7 SSSSXX011 03 GRAMMAR BUS BITS 8-15 SSSSXX100 04 GRAMMAR BUS BITS 16-19 SSSSXX101 05 TOKENIZER STATUS BITS 4-7
The first four digits of each mailbus bit address (SSSS) are the bits which the control microprocessor 100 uses to select with which symbolic tokenizer it will com¬ municate when more than one symbolic tokenizer sits across the mailbus 102. This is the address set into the dip switch 57 and sensed by the address select 53, both of FIG. 4. The X's in the read commands indicate digits not used.
The read and write commands are utilized by a mailbus address line 116, to be described below with respect to FIG. 4, to enable the correct logic component to accept the data being provided by the control microprocessor on the mailbus data lines 115.
The TOKENIZER RESET write command is sent from the control microprocessor 100 to the symbolic tokenizer to reset the symbolic tokenizer's logic.
The TRIAL WORD BYTE commands are used to load the trial word into the trial word register 111 in the form of register files 34-38. As discussed above, the trial word transmitted to the symbolic tokenizer may be up to 80 bits long. This allows for trial words of up to 16 characters, as twenty bits are required for each four characters. Transmission of the trial word to the symbolic tokenizer is accomplished in up to twelve separate trans- missions, i.e., a maximum of eight 8-bit loads and four 4-bit loads.
The TRIAL WORD BYTE 1A through 1C write commands represent commands to load the first 20 bits of the trial word. TRIAL WORD BYTE 2A through 2C represent commands to load the second 20 bits of the trial word. TRIAL WORD BYTE 3A through 3C represent commands to load the third 20 bits of the trial word. TRIAL WORD BYTE 4A through 4C represent commands to load the fourth 20 bits of the trial word. In each case, a TRIAL WORD BYTE having an !A' suf- fix, such as TRIAL WORD BYTE 4A, represents a command to
load bits 0 through 7, a TRIAL WORD BYTE having a 'B' suffix represents a command to load bits 8 through 15, and a TRIAL WORD BYTE having a 'C suffix represents a command to load bits 16 through 19. LOAD SCAN ADDRESS REGISTER LOW BYTE and LOAD SCAN ADDRESS REGISTER HIGH BYTE write commands enable loading in the address counter of the 16-bit dictionary address, where the search is to begin. This is the address of the first word in the dictionary sector where the word on which tokenization is being attempted will be contained, if it is in the dictionary. This address is derived from alpha in the memory map and is loaded in two 8-bit bytes to the address counter 105 in the form of token address counter 16-19. LOAD SCAN ADDRESS WORD COUNTER LOW BYTE and LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE write commands enable loading of the scan limit down counter 108 in the form of dictionary scan down counter 54-56 of FIG. 4. The dictio¬ nary scan down counter 54-56 is loaded with the number of grammar bytes in the sector of the dictionary being searched so that the dictionary scan down counter 54-56 can decrement as grammar bytes are compared until all words of that dictionary sector have been compared. SCAN ADDRESS WORD COUNTER LOW BYTE allows loading of eight bits of the count, and LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE allows loading of the remaining four bits of the count. As discussed above, the counter prevents dic¬ tionary reads beyond the proper dictionary sector. For example, if there are five 5-letter words beginning with "w", then the dictionary scan down counter will be loaded with a representation of 10 (five words times two grammar bytes per word) .
The LOAD SEGMENT REGISTER write command enables load¬ ing of the Segment Register 106 with the segment code. The START SCAN write command is the command from the
control microprocessor 100 to the symbolic tokenizer to begin a search of the dictionary. It is issued after the above output command functions have been issued and the appropriate logic device set.
5 The READ commands are address codes utilized in con¬ junction with a MBRW (mail 45 read/write) code from the control microprocessor to enable access to the contents of various logic components in the tokenizer for reading. The TOKENIZER ADDRESS BITS 0-7 and the TOKENIZER 0 ADDRESS BITS 8-14 read commands enable access by the con¬ trol microprocessor to the address counter to read the address of the dictionary word with which a comparison was found true.
The GRAMMAR BUS BITS 0-7, the GRAMMAR BUS BITS 8-15,
15 and the GRAMMAR BUS BITS 16-19 read commands load the contents of the grammar bus 112 into the control micropro¬ cessor 100. During detokenization, the grammar bus con¬ tains the language word which was represented by the ad¬ dress derived from the token.
20 The token address, shown in FIG. 1 as TKA, is a form of the token. The token may be the actual memory address or may be a derivation of the memory address, as described above. The token address is the address within the dic¬ tionary ROM 104 where the word was found. The token ad¬
25 dress is transmitted through the token address bus 113 and the mailbus 102 to the control microprocessor 100 when the compare logic 103 is true.
Detokenization, or the converting of tokens back into language words, is achieved by presenting the token
30 as a dictionary address on the token address bus 113 through the address counter 105. During detokenization, the address counter 105 functions as a simple address register, instead of as a counter.
The data found at the addressed ROM 104 location is
__ output to the grammar bus 112. This data is the language o5
1 word which was represented by the token on which detokeni¬ zation is taking place.
During detokenization, the language word stored in the dictionary ROM 104 at the address represented by the token
5 is transferred through the grammar bus 112 and the mailbus 102 to the control microprocessor 100.
The TOKENIZER STATUS BITS 4-7 read command requests the symbolic tokenizer to send its status to the control microprocessor 100. The symbolic tokenizer status may be
10 Data Equals (DEQ) , Data Not Found (DNF) , or Scan In Prog¬ ress (SIP) . Data Equals is represented by the binary code 001. Data Not Found is represented by the binary code 010. Scan In Progress is represented by the binary code 100.
15 Scan In Progress indicates to the control micropro¬ cessor 100 that a dictionary search is currently in pro¬ gress. Data Equals indicates that a comparison was found to be true, and Data Not Found indicates that no comparison was found to be true for the trial word on which tokeniza-
2o tion is being attempted.
The components of the tokenizer will now be described. As shown in FIG. 2, dictionary ROMs 1 through 10 store the dictionary of words which may be tokenized. The dic¬ tionary of the prototype unit utilizes ten 2764 EPROMs.
25. EPROMs are used in the prototype unit to allow for ease of changing their content as the tokenization of different vocabularies is studied. ROMs will be used in production units because of their low cost. The ROMs are 8-bit devices, each having a capacity of eight kilobytes. Toge-
30 ther they are used to generate 32 kilobytes of 20-bit words.
As discussed above, the tokenizer manipulates 20-bit grammar bytes. The grammar bytes are generated by taking five of the ten ROMs at a time and utilizing four of the 5 eight bits from each ROM to obtain a 20-bit grammar byte.
The ROMs are accessed using the dictionary address through token address counter 16-19 described below. The least significant bit of the dictionary address causes multiplexing between the least significant and the most
5 significant, four bits of each Dictionary ROM 1 through 10. In other words, if the address is even, for example, the first four bits of each ROM in the bank of ROMs will be accessed, whereas if the address is odd, only the last four bits of the eight bits of each ROM will be accessed.
10 This obtains the full 32-kilobyte address range of the ROMs and facilitates parallel processing as discussed above. The dictionary ROMs 1 through 10 are addressed by the token address bus 113.
There are eight data lines exiting each of the ten
15 ROMs and going to the data multiplexers 11-15 for reading the contents of ROMs. This gives forty data lines coming from each group of five ROMs. Twenty lines are required for each 20-bit grammar byte. As described above, multi¬ plexing switches between the low four and high four output
20 lines of each ROM so that each group of five ROMs forms a 20-bit byte at any given time. This effectively doubles the addressing capability of the dictionary ROM.
All of the lines on FIGS. 2 through 5 designated with a "+" are positive 5 volts.
25 The data multiplexers 11 through 15 are 74LS257 data multiplexer chips. They switch between the low and high four output lines of dictionary ROM 1-10 and are switched by the least significant bit of the absolute address (TKAD00) . That is, the multiplexers are switched by the
30 token address 00 of the token address bus 113. A zero on line 00 of the dictionary address bus indicates that the low four bits of each ROM on the dictionary address bus are to be used for data and a 1 on line 00 of the token address bus 113 indicates that the high four bits of each
,__ ROM on the dictionary address bus are to be used for data. •35
Token address line 00 (TKAD00) and token address line 14 (TKAD14) are both part of the token address bus 113.
The grammar bus 112 connects the data multiplexers 11-15 to the comparator 29-33 of FIG. 3. The absolute address required to access words in the dictionary ROM is 15 bits long, as can be seen from the output of the token address counter 16-19. The 15-bit address with multiplexing allows 32,768 absolute dictionary addresses. The token address counter 16-19 is comprised of four 74LS191 up/down counters configured as an incremental counter. The token address counter 16-19 accepts the starting address of the first word in the dictionary sector to be searched and then increments from that address to the end of the list in that sector at the mailbus clock rate. The end of the list is determined by the starting address and beta.
The starting address for the dictionary segment and sector derived from the memory map is parallel loaded from control microprocessor 100 through the external mail¬ bus 115 to the internal mailbus 114 and then into the address counter 16-19.
As sh wn in FIG. 5, the mailbus 102 of FIG. 1 is di¬ vided, first, into the external mailbus (Databus) 115, the mailbus address bus 116, a mailbus input/output (MBIO) , mailbus interrupt (MBINT inverse) , mailbus clock (MBCLK) , and mailbus read/write (MBRW) and, second, into the inter¬ nal mailbus 114. The external mailbus 115 is the eight data lines which enter the bidirectional data bus trans- ceiver 46 from the interface 101 of FIG. 1. The internal mailbus 114 is the eight data lines between the bidirec¬ tional data bus transceiver 46 and the tokenizer output multiplexer 41-45, the register file 34-38, as well as the token address counter 16-19 of FIG. 2, and the dictio¬ nary scan down counter 54-56 of FIG. 4.
Each of the three chips of the token address counter 16-18 supplies four of the bits re-quired to form the 15-bit dictionary address, with the last chip of the token address counter 19 supplying the most significant three bits. As seen in FIG. 2, inverter 20 is part of a 7404 digital inverter chip. Inverter 20 produces the inverse of the mailbus clock (MBCLK) from the mailbus clock (MBCLK) signal. The inverse of the mailbus clock is used by com¬ pare latch 39 of FIG. 3, and it is also the beginning 0 of a time-delay chain comprising inverters 21 and 22.
Digital inverters 20, 21, and 22 provide a timing delay of the clock signal through sequential propagation. Digital inverters 21 and 22 are part of a 74LS38 NAND gate configured as inverters. This delayed clock signal 5 is used for the clocking of the token address counter 16-19. This delay guarantees that the address counter will not increment prior to the acquisition of dictionary data from the grammar bus 112. The token address counter 16-19 increments on the rising edge of the delayed clock o signal.
Noninverting buffer 23 is a 7407 noninverting buffer chip. It provides a sequential propagation delay as part of the timing delay circuit which further delays the clock signal to the token address counter 16-19. 5 Dictionary enable AND Gates 25 and 26 are 7408 AND gates. They provide the dictionary enable, which supplies the enable signal to the dictionary ROMs 1-10 so that the ROM address locations may be accessed. Two AND gates are utilized to provide fanout since they must provide a signal Q to each of the ten dictionary ROMs 1 through 10. AND gate 25 provides an enable signal to six of the dictionary ROMs 1-5 and 6-10, and AND gate 26 provides an enable signal to four.
The dictionary ROMs 1-10 are tri-state output devices. When the outputs of 25 and 26 go low the selected dictio-
nary ROMs 1 through 5, or 6 through 10, appear on the data bus. The dictionary ROMs 1 through 5, or 6 through 10, are selected by the token address (TKAD 14) described below) , and its output appears on the grammar bus 112. The dictionary enable AND gates 25 and 26 go low when two signals are present at their inputs. Those sig¬ nals are mailbus clock inverse (MBCLK) and grammar bus lock inverse (GBLOCK) .
Grammar bus lock (GBLOCK) (see the output of NAND gate 28 in FIG. 3) is used for reading the contents of the dictionary ROM 1-10. During detokenization when a direct look-up is made to the dictionary ROMs to convert a token into a language word, the output of the grammar bus 112 must be on the mailbus 102 so that it may be accepted by the control microprocessor 100. GBLOCK allows output of the contents of the addressed memory.
Inverter 27 is a part-of a 7404 Hex inverter. Inver¬ ter 27 selects the group of five dictionary ROMs 1-5, or 6-10, which is to be enabled. The input of the inverter 27 is coupled to the bank of ROMs 1-5, while the output is coupled to the bank of ROMs 6-10. Either dictionary ROMs 1-5, or 6-10, will always be selected. The input to inverter 27 comes from the fifteenth bit (most significant bit) of the token address counter 19 (TKAD14) . That is, the fifteenth bit from the token address counter 19 deter¬ mines which group of five dictionary ROMs 1-5, or 6-10, is to be enabled. Dictionary ROMs 1-5 contain the lower 16 kilobytes of dictionary memory, and dictionary ROMs 6-10 contain the upper 16 kilobytes of dictionary memory. NAND Gate 28 of FIG. 3 is for producing GBLOCK inverse and is a part of a 74LS00 Quad NAND gate. It provides the grammar bus lock inverse (GBLOCK) which is used by the dictionary enable AND gates 25 and 26 to force the enable of the dictionary ROMs 1-10 so that the dictionary
ROMs 1-10 may be interrogated during the detokenization
1 operation.
FIGS. 3 and 4 can be placed side by side, with FIG. 4 on the right, to form one continuous schematic.
As shown in FIG. 3, comparators 29-33 compare the
5 contents of the grammar bus 112 (from the multiplexers 11-15) with a trial 20-bit grammar byte which is part of a trial word loaded into the register file 34-38. From one to four 20-bit bytes (grammar bytes) are compared for each language word, depending upon the length of the word.
10 Each grammar byte represents up to four characters of a language word.
Register file 34-38 is loaded with the trial word from the microprocessor through the bidirectional data bus transceiver 46 and the external mailbus 115, as de-
1 scribed above. Each chip of the register file is a group of 4-bit register slices, i.e., four 4-bit registers. At any one time, the address buffer 58 and the register file select 59, described below, may address any 4-bit register in each chip in order to load .data in the form
20 of bits representing characters of the trial word. How¬ ever, because internal mailbus 114 passes only eight bits of data at a time, only two 4-bit registers may be loaded at a time. The register file 34-38 is coupled to the internal mailbus 114 such that the lowest four bits of
25 the data bus and the highest four bits are coupled to pins 15, 1, 2, 3 of chips 34 and 35, respectively, and to the same pins of chips 36 and 37, respectively. The lowest four bits are also coupled to pins 15, 1, 2, 3 of chip 38. Because the control microprocessor and the internal
3Q mailbus provide only eight bits of data per cycle, the first four bits of the first eight bits of a trial word are placed in the first 4-bit register of chip 34, and the last four bits are placed in the same 4-bit register of chip 35. The same 4-bit register of chip 36 gets the
_ first four bits of the second eight bits of data, while
the same 4-bit register of chip 37 gets the second four bits. Finally, the same 4-bit register of chip 38 gets the next four bits of the trial word from the control microprocessor over Internal mailbus 114. In summary, the first 20-bit grammar byte of the trial word has been loaded from the control microprocessor into the first 4- bit registers of register file 34-38. The remaining 20- bit grammar bytes are loaded in a similar manner. There¬ fore, the trial word will be stored in the form of 20-bit ' grammar bytes, just as the language words are stored as 20-bit grammar bytes in the dictionary ROM. The remaining 20-bit grammar bytes, up to a total of four grammar bytes or 16 characters totaling 80 bits, are stored in respective 4-bit registers of the register file 34-38. As a result, all 20 bits representing four characters in a given grammar byte may be accessed and processed in parallel in a manner similar to that for a given bank of dictionary ROMs. A 20-bit or 4 character comparison can be made in one clock cycle. As words are read, grammar byte-by-grammar byte from the appropriate bank of the dictionary ROM 1-10, they are compared to respective grammar bytes of the trial word until either a match is found between words, or all of the words in the chosen sector of the dictionary ROM 1-10 are exhausted. The five 74170 chips combined have a maximum capacity of 80 bits, or four grammar bytes, com¬ prising a total of 20 characters.
Inter grammar byte compare latch 39 is a 7474 Dual D Flip-Flop. A successful comparison (e.g., bit-for-bit equivalence) on the appropriate number of grammar bytes as defined by the segment code will cause DEQ (Data Equals) on the inter grammar byte compare latch 39 to go high. When DEQ goes high produces scan termination logic 40, the stop scan signal, and a start scan flip-flop in scan control logic 51 is reset at pin 1. This causes scan enable inverse (SCANEN) from scan control logic 51 to go
high. The inter grammar byte compare latch 39 uses the mailbus clock inverse signal (MBCLK) from inverter 20 for timing.
Pins 1, 10, and 14 of inter grammar byte compare latch 39 go to positive 5 volts. Pin 7 is ground. Pin 4 is the output of NAND gate 65. Pin 11 is an input to NAND gate 65. Pin 8 is an input to NAND gate 70. Pin 13 goes to TKRST and to pin 10 on scan termination logic 40. Pin 2 goes to pin 6 of comparator 33. Pins 5 and 12 go
10 to pin 2 of comparator 29. Mailbus clock inverse (MBCLK) is pin 3, and Data Equals (DEQ) is pin 9.
Scan termination logic 40 allows the loading of suc¬ cessive 20-bit bytes, even though an earlier byte failed comparison. The appropriate number of bytes must be com-
15 pared for the selected dictionary segment, since each dictionary segment contains words of a given length. This permits a new comparison of a subsequent word in the same sequence to begin at the start of the next -language word. For example, a search for the word "water" may
2o first produce the dictionary word "waste" for comparison. Since a match would not be found for the first grammar byte, the third characters in each word not being the same, a failure would be indicated in the compare latch 39. However, the second grammar byte for the word "waste"
25 containing the letter "w" would nonetheless be loaded into the comparators 29-33, and the segment counter would be decremented, the address counter would be incremented, and the scan limit down counter would be decremented. This is done, even though a match is not found, so that
3Q the address of the next word for comparison is present in the address counter. Therefore, the continued cycling through the remaining grammar bytes of a word which failed comparison merely serves to update the appropriate counters and registers. Any loss in computing time is relatively
_ minimal. The scan termination logic 40 also serves to
1 reset scan control logic 51 from pin 8 to stop all counters when a match is found. Logic 40 may also provide an inter¬ rupt to processor 100 through AND gate 66.
Pins 4 and 10 of scan termination logic 40 are the
5 tokenizer reset inverse (TKRST) signal. Pin 5 is an input to NAND gate 70. Pin 3 goes to pin 13 of dictionary scan down counter 56. Pin 11 is an output from inverter 62. Pin 8 is an input to AND gate 66 and to pin 1 of scan control logic. Pins 2 and 7 go to ground. Pins 1, 13, 0 and 14 go to positive 5 volts. Pin 6 is the Data Not
Found (DNF) latch. Pin 12 is the output of NAND gate 70.
Tokenizer output multiplexer 41-45 multiplexes token address bus 113, the grammar bus 112, and the tokenizer status data to the microprocessor. The given multiplexer
1 used for output is determined by address decoder 52, de¬ scribed more fully below.
The tokenizer status is presented to the control microprocessor 100 from three free pins of the tokenizer output multiplexer 45. The status will be either Data o Equals (DEQ) , Data Not Found (DNF) , or Scan In Progress (SIP), which are found on pins 11, 13, and 15, respective¬ ly.
If the control microprocessor 100 interrogates the symbolic tokenizer and finds the status to be Data Equals
25 (DEQ) , then the control microprocessor 100 will read the address of the appropriate bank of dictionary ROM 1 through 10 where the word was found. This is done by providing the appropriate read command address to the mailbus address line 116. The address is decoded by the address decoder
30 52, described below, which provides an enable signal for tokenizer output multiplexers 41 and 42 for reading the appropriate data lines from the token address bus 113. The address is output onto internal mailbus 114 and read from the bidirectional data bus transceiver 46, described
__ below. The address can be read from the output of, or
<J5
settings for, the token address counter 16-19. This can be done because the token address counter was stopped when a match was found through compare latch 39 and scan termination logic 40. That address subtracted by the number of the segment code (the number of address incre¬ ments) divided by the dictionary segment number and added to an offset by the control microprocessor, then becomes the token for the word.
Bidirectional data bus transceiver 46 is an 8-bit 0 tri-state data transceiver to interface the symbolic token¬ izer with the external mailbus 115. It is used to direct data flow between the symbolic tokenizer and the control microprocessor 100.
Segment Up Counter 47 is a 2-bit binary up counter, 5 and it is reset prior to reading a new language word from the dictionary ROM 1-10 to reflect the proper number of grammar bytes required to compare the entire trial word in the register file 34-38. It progressively selects grammar bytes from the register file 34-38 for comparison o until the required number of grammar bytes have been com¬ pared with a language word.
Segment holding register 48 stores the selected 2- bit segment code, designated above as SI, S2, S3, or S4, generated by the control microprocessor. The segment 5 code is taken from the internal mailbus 114 on lines 120 to pins 2 and 12 of register 48. The segment code deter¬ mines the number of reads of the given bank of dictionary ROM 1-10 required to compare a given language word to each dictionary word during a search. Each read of the dictio- 0 nary ROM 1-10 loads a maximum of four characters into the comparators 29-33, so that to compare a 16-character word requires four successive loads.
Pins 1, 4, 10, 13, and 14 of segment holding register 48 go to positive 5 volts. Pin 2 goes to pin 18 of the bidirectional data bus transceiver 46. Pin 12 goes to
1 pin 17 of the bidirectional data bus transceiver. Pin 7 is ground. Pins 3 and 11 go to pin 10 of address decoder 50. Pin 9 goes to pin 1 of the segment down counter 49, and pin 5 goes to pin 15 of the Segment Down Counter 49.
5 Segment down counter 49 determines when the proper number of grammar bytes have been accessed by the segment up counter 47 for comparing with a given language word. Pin 11 goes to the output of AND gate 67. Pin 13 goes to pin 12 of scan control logic 51 and pin 14 goes to SCANCLK.
10 Pin 4 goes to pin 6 of scan control logic 51. This counter must be reloaded from register 48 each time the proper number of grammar bytes have been read from the dictionary.
Address decoder 50 is a 74LS138 decoder for decoding the address and write commands from the control micropro-
15 cessor and providing a strobe to the correct logic compo¬ nent. It supplies the inverse tokenizer reset signal (TKRST) which is initiated by the microprocessor to reset all of the tokenizer logic. Address decoder 50 also sup¬ plies the scan address load strobe low byte inverse
20 (SCANLDL) to token address counter 16 and 17 through in¬ ternal data bus 114. It causes the least significant eight bits of the token address counter 16 through 19 to be loaded from the control microprocessor 100.
Address decoder 50 also supplies the scan address
25 load high byte inverse (SCANLDH) , which loads the most significant eight bits of the token address counter 16 through 19. Only 15 bits of the combined least significant eight bits and most significant eight bits are actually used. Decoder 50 also supplies the scan counter load low
30 strobe (SCANCTL) to load the least significant eight bits of the dictionary scan down counter 54-56. Address decoder 50 also loads the most significant four bits of the scan counter using the scan counter load high strobe (SCANCTH) . The dictionary scan down counter is loaded from internal
,_. mailbus 114 with a representation of beta from the appro-
priate entry of the memory map.
Pins 5 and 8 of the address decoder 50 are ground. Pins 6 and 16 are positive 5 volts. Pin 1 goes to pin 3 of address decoder 52 and to pin 14 of address buffer 58. Pin 2 goes to pin 12 of address buffer 58. Pin 3 goes to pin 9 of address buffer 58. Pin 9 goes to pin 4 of scan control logic 51. Pin 15 goes to pin 13 of scan control logic 51 and provides the tokenizer reset code. Pin 10 goes to pins 3 and 11 of segment holding register 48. Pin 4 goes to pin 11 of register file select 59. Pin 11 is the scan counter high byte load strobe (SCANCTH) , and pin 12 is the scan counter low byte load strobe (SCANCTL) . These strobes are used to enable loading of the dictionary scan down counter 54, 55, and 56 when the control micro- processor 100 issues the LOAD SCAN ADDRESS WORD COUNTER LOW BYTE and LOAD SCAN ADDRESS WORD COUNTER HIGH BYTE write commands. Pin 13 is the scan address load strobe high (SCANLDH) , and pin 14 is the scan address load strobe low (SCANLDL) . These strobes are used to enable loading of the token, address counter 16-19 when the control micro¬ processor 100 issues the LOAD SCAN ADDRESS REGISTER LOW BYTE and LOAD SCAN ADDRESS REGISTER HIGH BYTE write com¬ mands.
The scan control logic 51 is a 7474 Dual D Flip-Flop. Scan control logic 51 provides the scan enable latch, controls the operation of the segment up counter 47, and provides the Scan In Progress code. Scan control logic also stops all tokenizer operation when logic 51 is reset. When the scan enable inverse of the scan control logic 51 goes high after the reset signal from scan termination logic 40, all address counting in the token address counter 16 through 19 terminates (see SCANEN input to device 16 of FIG. 2) . At the same time, a mailbus interrupt signal (MBINTO) is produced to get the attention of the control microprocessor 100. Also, the tokenizer status can at
1 this time be read by the microprocessor, and the address of the language word in the dictionary ROMs can be read after issuing an appropriate command.
Pin 14 of scan control logic 51 is positive 5 volts.
5 Pin 13 goes to pin 15 of address decoder 50. Pin 4 comes from pin 9 of address decoder 50. Pin 8 is an input to NAND gate 65 at pin 13 and to pins 2 and 3 of segment up counter 47. Pin 1 goes to pin 8 of scan termination logic 40. Pins 3 and 7 are ground. Pin 10 is the output of
10 NAND gate 65 through inverters 63 and 64. Pin 6 is the scan enable inverse (SCANEN) signal. Pin 9 is an input to pin 9 of AND gate 67. Pin 5 is an input to pin 10 of AND gate 67 and provides the Scan In Progress (SIP) signal. Pin 12 goes to pin 13 of the segment down counter 49.
15 Pin 11 goes to pin 14 of the segment down .counter 49 and is coupled to the SCANCLK signal.
Address decoder 52 is a 74LS138 Decoder for decoding the address and write commands of the control microproces¬ sor and for providing a strobe to the correct logic compo¬
20 nent. It selects the tokenizer output multiplexer 41-45 channels for control microprocessor 100 readout.
Pin 16 of address decoder 52 goes to positive 5 volts. Pin 6 goes to pin 6 of register file select 59 and pin 6 of address select 53 and to inverter 60. Pin 1 goes to
25 pin 1 of register file select 59 and to pin 18 of address buffer 58. Pin 2 goes to pin 2 of register file select 59 and to pin 16 of buffer 58. Pin 3 goes to pin 1 of address decoder 50, to pin 14 of address buffer 58, and to pins 14 of each register file chip 34-38. Pin 15 goes
30 to pins 1 and 19 of tokenizer output multiplexer 41. Pin 14 goes to 1 and 19 of tokenizer output multiplexer 42. Pin 13 goes to pins 1 and 19 of tokenizer output multiplexer 43. Pin 12 goes to pins 1 and 19 of tokenizer output multiplexer 44. Pin 11 goes to pin 1 of tokenizer
_ output multiplexer 45, and pin 10 goes to pin 19 of token-
izer output multiplexer 45. Pin 4 goes to pin 3 of regis¬ ter' file select 59 and to inverter 69. Pins 5 and 8 are ground.
Address select 53 is a 7485 comparator for sensing the condition of the dip switch 57 and for taking the most significant four bits of the nine bits of the mailbus address write or read commands and enabling the correct address decoder. Each symbolic tokenizer may have an address from 1 to 16. The output of address select 53, 0 is used by the symbolic tokenizer to ascertain if that symbolic tokenizer has been addressed by the microprocessor and if so, to enable transceiver 46, address decoder 52 and register file select 59.
The mailbus address lines 116, which allow the control 5 microprocessor 100 to address the symbolic tokenizer through the address select 53, include lines 05, 06, 07, and 08 of the complete mailbus address lines 116.
Dictionary scan down-counter 54, 55, and 56 are 74191 counters and are loaded with a representation of beta. o Scan down counter decrements with each mailbus clock cycle as words are read from the given banks of ROMs in the dictionary ROM 1-10 for comparison to the trial word in the register file 34-38. As discussed above, the dictio¬ nary scan down counter is loaded with a representation of 5 beta from the appropriate entry of the memory map, e.g., the number of grammar bytes needed to read the sector beginning with the starting address read from the appro¬ priate entry in the memory map into the token address counter 16-19. When the dictionary scan down-counter 54, 0 55, and 56 reaches zero, Data Not Found (DNF) is produced, causing a mailbus interrupt (MBINTO) to the control micro¬ processor 100. The mailbus interrupt causes the control microprocessor 100 to interrogate the status of the sym¬ bolic tokenizer. When the status of the symbolic tokenizer _ is interrogated, the DNF (Data Not Found) is sent to the 5
control microprocessor 100. That is, the word to be token¬ ized was not found in the appropriate bank of the dictio¬ nary ROM 1 through 10.
Dip switch 57 is used in conjunction with address select 53 to select the control microprocessor 100 address of the symbolic tokenizer. More than one symbolic token¬ izer may reside on the mailbus.
Address buffer 58 generates the binary address to supply a 2-bit code to pins 13 and 14 of the register file 34-38. This selects the address of the proper 20- bit byte of the four possible 20-bit bytes for loading into the register file 34-38. Address buffer 58 also selects whether address decoder 50 or register file select 59 is enabled and provides a code for the enabled component for producing an appropriate output. Address buffer is a 74244 chip.
Register file select 5"9 uses the input from address buffer 58 to either select the appropriate' register file 34-38 for loading the trial word from the control micropro¬ cessor 100 or to enable address decoder 50. The primary function is to select the appropriate registers in the register file 34-38.
Pin 16 of register file select 59 goes to positive 5 volts. Pin 10 goes to pin 12 of register files 34 and 35. Pin 9 goes to pin 12 of register files 36 and 37. Pin 7 goes to pin 12 of register file 38. Pins 5 and 8 are ground. Pin 3 is the output of inverter 69 and is coupled to pin 4 of address decoder 52. Pin 4 is the mailbus clock inverse signal (MBCLK) . Pin 1 goes to pin 1 of address decoder 52 and to pin 18 of address buffer 58. Pin 2 goes to pin 2 of address decoder 52 and to pin 16 of address buffer 58. Pin 6 goes to pin 6 of address decoder 52 and pin 6 of address select 53. Pin 11 goes to pin 4 of address decoder 50.
Inverter 60 is part of a 74LS04 digital inverter
chip. It inverts the enable signal from address select 53 to pin 19 of the tri-state enable of the bidirectional data bus transceiver 46 to provide the correct polarity or direction. Inverter 62 is part of a 74LS04 digital inverter chip. It inverts the SCANCLK signal to SCANCLK inverse for NAND gate 65.
Inverter 63 and inverter 64 are part of a 74LS04 digital inverter chip. They form a sequential propagation 0 delay to increase the pulse width of the preset pulse from the NAND gate 65 to the scan control logic 51 and the inter grammar byte compare latch 39.
NAND gate 65 is part of a 74LS38 inverting AND gate. It generates the preset pulse to the inverters 63 and 64. 5 The preset pulse presets the compare logic posting flip-flop of the inter grammar byte compare latch 39 which causes pins 5 and 12 in the inter grammar byte compare latch 39 to go high. This causes pin 3 of the comparator 29 to go high, which posts a 1 into the comparators 29-33 o which indicates that the last compare was true. A failed comparison thereafter results in a 0 on pin 3 of the com¬ parator 29 and on the remaining comparators until the preset pulse again presets the compare latch 39.
AND gate 66 is part of a 7408 AND gate chip. When 5 multiple symbolic tokenizers sit across the mailbus 102, AND gate 66 causes an interrupt to the control micropro¬ cessor 100 when a scan termination by the compare logic 103 is found true, i.e., when the scan is terminated by scan termination logic 40. This is caused by either a 0 true signal on the line from termination logic 40 or a true signal for mailbus interrupt input inverse (MBINTI) from one of the other tokenizers on the mailbus 102. The appropriate output of AND gate 66 alerts the control micro¬ processor 100 that a symbolic tokenizer has completed a _ scan cycle. The scan through the other tokenizers are
terminated. Interrogation of the tokenizer status will yield the cause of the termination. MBINTI low indicates that another tokenizer on the bus has already caused an interrupt, and every other symbolic tokenizer must wait for interrogation by the control microprocessor 100 if more than one interrupt request was generated.
AND gate 67 is part of a 7408 AND gate chip. It generates the load strobe for the segment counter 49 to cause it to be reloaded from the segment holding register 48.
Inverter 69 provides an appropriate signal to the transceiver 46 for input or output and to address decoder 52 to provide a correct output.
FIG. 5 shows the interface 101 of FIG. 1 which inter- faces the symbolic tokenizer to the control microprocessor 100 of FIG. 1. The prototype symbolic tokenizer utilizes a Commodore 64K computer for its control microprocessor. Any central processing unit can serve as a control micro¬ processor, if appropriately interfaced to the symbolic tokenizer.
A connector 201 attaches the interface 101 to the control microprocessor 100. This connection is made through the I/O expansion jack of the Commodore 64K compu¬ ter. The pin numbers of the Commodore 64K I/O expansion jack are shown on the connector 201.
NAND gate 202 is part of a 7437 inverting AND gate chip. It provides an enable signal to the bidirectional bus data transceiver 204.
NAND gate 203 is part of a 7437 inverting AND gate chip. It provides the mailbus I/O signal (MBIO) to the symbolic tokenizer.
The mailbus I/O signal (MBIO) is high when the mailbus 102 is being addressed by the control microprocessor 100.
Bidirectional data bus transceiver 204 is a 74LS245. It provides data transfer between the control microproces-
1 sor 100 and the symbolic tokenizer.
Buffers 205 and 206 are both 7407 buffer chips. They provide address, mailbus clock, and mailbus read/write signal isolation between the control microprocessor 100
5 and the symbolic tokenizer. The six lines exiting buffer 205 are mailbus address lines 116. The uppermost three lines exiting buffer 206 are also mailbus address lines. The lower two lines exiting buffer 206 are the mailbus clock signal (MBCLK) and the mailbus read/write signal
10 (MBRW) .
The mailbus interrupt inverse signal (MBINT) supplies an interrupt signal to the control microprocessor 100 when a dictionary scan is complete. This may be omitted if the control microprocessor also does other tasks.
15 The external mailbus 115 is the data path between the control microprocessor 100 and the symbolic tokenizer. The mailbus address lines 116, along with the mailbus clock and mailbus read/write signals, provide the control microprocessor 100 with the capability to address the
20 dictionary ROMs 1-10 of the symbolic tokenizer. When sev¬ eral symbolic tokenizers are connected to the external mailbus 115, the mailbus address lines 116 also provide the addressing signals by which each symbolic tokenizer recognizes that it is the symbolic tokenizer being ad¬
25 dressed. This happens when the address select 53 (of FIG. 4) decodes the address coming across the mailbus address lines 116 as that address set into the dip switches 57.
The method of using the apparatus described herein 0 will be described with respect to FIGS. 8 and 9. It will be assumed that a dictionary ROM 104 has been created according to the arrangement described with respect to FIG. 6, and that a memory map has been created based upon the ROM addresses. The memory map is contained in ROM
__ 100A. A software routine has been created which is usable
with the control microprocessor to derive tokens from the dictionary ROM address and to derive a dictionary ROM address and the corresponding language word given a particular token. Additionally, the read and write com- mands for accessing the tokenizer have also been stored in the control microprocessor. These have been defined as described above with respect to the flowchart of FIG. 7.
The following method description will be of the token- 0 izer 102A. The tokenizer condition is static and the tokenizer is available for accepting various read or write commands from the control microprocessor through interface 101. It will be assumed henceforth that all commands to be accepted by the tokenizer 102A are made by the control 5 microprocessor through interface 101 to mailbus 102, i.e., busses 115 and 116. The control microprocessor will pro- vide an address code over the mailbus address line 116, the address bus, and will provide data over the external mailbus 115, the data bus. Additionally, the mailbus o input/output, the inverse mailbus interrupt, ground, mail¬ bus clock, and the mailbus read/write codes are provided over pins on the interface 101, as indicated in FIG. 5.
Given a command from the control microprocessor, the tokenizer accepts a read or write command on mailbus ad- 5 dress line 116, as indicated in block 301 of FIG. 8A. The tokenizer accepts a 4-bit code from the four most significant bits of the read or write command and decodes the 4-bit code in address select 53 by comparing the code with the settings of the DIP switch 57. If the 4-bit corresponds to SSSS, transceiver 46, address decoder 52 and register file select 59 are enabled. If the address code does not correspond to SSSS, the tokenizer returns to 300 at the beginning of block 301 to await another address code. In block 304, and in subsequent blocks, it is assumed
that the most significant four bits of the address code correspond to the DIP switch set for the tokenizer. If there was no correspondence, the write or read command on the mailbus address line 116 would correspond to a differ- ent device coupled to the control microprocessor through mailbus 102. Hereafter, it will be assumed that the proper address SSSS was provided with the 8-bit address code, and only the least significant five bits on the mailbus address line 116 going into address buffer 58 will be
10 considered.
In blocks 304-340, blocks 372-382, and blocks 402- 414, write and read commands are being issued by the con¬ trol microprocessor for initializing and starting the tokenizer or for reading data. In blocks 342-368, the
15 tokenizer is processing the commands loaded from the con¬ trol microprocessor to provide either a memory address of a word to be tokenized, or a language word based on an address derived from a token in the control microprocessor (detokenization) . Number 370 is a flow indicator.
20 ~~y block 304, a tokenizer reset command is issued by the control microprocessor to reset the scan control logic 51 and the scan termination logic 40. The segment down counter 49 and segment up counter 47, are reset, and a 1 is posted on compare latch 39. Specifically, if the ad- 5 dress is 00000, the tokenizer is reset by enabling address decoder 50 to output TKRST from pin 15. The scan control logic 51 is reset through pin 13, and the scan termination logic 40 is reset through pins 4 and 10. Compare latch 39 is reset through pin 13 so that a 1 is posted at pins
3Q 5 and 12. The segment up counter 47 and segment down counter 49 are also reset. The tokenizer status then returns to 300 to await an additional command. Otherwise, the tokenizer operates as described below for a different input command.
_ In blocks 306-328, the register file 34-38 is loaded
with the trial word. Specifically, the trial word is separated by the control microprocessor 100 into 20-bit grammar bytes, each grammar byte being loaded into regis¬ ters in the register file 34-38 in a sequence of eight bits, eight bits, and four bits. Subsequent grammar bytes are loaded in the same manner. Specifically, if the ad¬ dress is 00001, the first four bits of register files 34 and 35 are enabled through pin 10 of register file select 59 and pins 12 and 14 of buffer 58 so that the first four bits of each of register files 34 and 35 will accept the respective four bits from the internal mailbus 114 through transceiver 46. As mentioned above, the first grammar byte is loaded first by loading the first eight bits of the grammar byte through the external mailbus 115. The next eight bits are loaded according to block 308, and the last four bits of the first grammar byte are loaded according to block 310. In any case, the tokenizer is ready for another command by returning to 300.
Specifically with respect to block..308, if the address is 00010, the first four bits of register files 36 and 37 are enabled through pin 9 of register file select 59 and pins 12 and 14 of buffer 58. The first bit of each regis¬ ter 36 and 37 will accept respective four bits from the eight bits of data coming from the internal mailbus 114. In this manner, eight bits of data can be loaded in paral¬ lel to the respective registers in a single clock cycle. After the first eight bits are loaded, the tokenizer is then ready for loading of additional bits of the first grammar byte. In block 308, this is represented by a return to 300.
In block 310, if the address code from control micro¬ processor 100 is 00011, the first four bits of register file 38 are enabled through pin 7 of register file select 59 and pins 12 and 14 of buffer 58. The first four bits will then accept the last four bits of the first grammar
1 byte from internal mailbus 114 to complete the loading of the first grammar byte into register file 34-38. Null characters are loaded into the appropriate bits in the register file in the situation where the trial word does
5 not take up an entire grammar byte. The tokenizer is then ready to accept an additional address code in the form of a read or write command. This is represented in block 310 by a return to 300.
In block 312, if the address is 00101, the second
10 four bits of the register files 34 and 35 are enabled through pin 10 of register file select 59 and pins 12 and 14 of buffer 58. The second four bits of each register will then be able to accept respective four bits from the first eight bits of the next grammar byte from internal
15 mailbus 114. This will occur only if the trial word is more than four characters in length, i.e., requires more than one grammar byte. The tokenizer is then ready to accept another address.
In block 314, if the address is 00110, the second 20 four bits of register files 36 and 37 are enabled through pin 9 of register file select 59 and pins 12 and 14 of buffer 58. The respective second four bits can then accept respective four bits of the next eight bits in the second grammar byte from internal mailbus 114. The tokenizer is
25 then ready to accept a further read or write command.
In block 316, if the address is 00111, the second four bits of register file 38 are enabled through pin 7 of register file select 59 and pins 12 and 14 of buffer 58. As a result, the second four bits of register file
30 38 will accept the remaining four bits from the second grammar byte off of internal mailbus 114. In summary, three clock cycles are used to load the second grammar byte from internal mailbus 114 into the registers 34-38 of the register file. If the trial word did not require
,_ a complete second grammar byte, null characters are in-
serted in the bits not required. If the word requires only two 20-bit grammar bytes, the remaining bits in the register file are not used.
If additional grammar bytes are required to store the trial word in register file 34-38, the control micro¬ processor will issue a write command having an address of 01001. In block 318, this address causes the third four bits of register files 34 and 35 to be enabled through pin 10 of register file select 59 and pins 12 and 14 of buffer 58. The third four bits of each register will then accept respective four bits from the first eight bits of the third grammar byte off of the internal mailbus 114.
In block 320, the next eight bits of the third grammar byte are loaded into the register file. Therefore, if the address is 01010, the third four bits of register files 36 and 37 are enabled through pin 9 of register file select 59 and pins 12 and 14 of buffer 58. .The third four bits of the two register files will accept respective four bits from the second eight bits of the third grammar byte off of internal mailbus 114.
In block 322, if the address is 01011, the third four bits of register file 38 are enabled through pin 7 of register file select 59 and pins 12 and 14 of buffer 58. The third four bits will accept the last four bits of the third grammar byte from internal mailbus 114. In this manner, three grammar bytes, or up to twelve charac¬ ters, are loaded into the register file in nine clock cycles. If an additional grammar byte is required to store the entire trial word, an additional write command is issued by the control microprocessor, as discussed below.
In block 324, the first eight bits of the fourth grammar byte are loaded into the register file. If the address is 01101, the fourth four bits of register files
1 34 and 35 are enabled through pin 10 of register file select 59 and pins 12 and 14 of buffer 58.' The fourth four bits will then accept respective four bits of the first eight bits of the fourth grammar byte off of internal
5 mailbus 114.
In block 326, if the address is OHIO, the fourth four bits of register files 36 and 37 are enabled through pin 9 of register file select 59 and pins 12 and 14 of buffer 58. The fourth four bits will then accept respec-
10 tive four bits from the second group of eight bits of the fourth grammar byte off of internal mailbus 114.
In block 328, if the address is 01111, the fourth four bits of register file 38 are enabled through pin 7 of register file select 59 and pins 12 and 14 of buffer
15 58. The fourth four bits will then accept the last four bits of the fourth grammar byte from internal mailbus 114. After this step, the register file will contain all four grammar bytes required to store the trial word. The tokenizer will then be ready to accept a further write
20 or read command, as appropriate. In summary, if all regis¬ ter bits are filled, a 16-charaσter trial word has been stored into the register file using twelve clock cycles to store eighty bits of information. This is done by reading representations of characters into the register
25 file in parallel and successive reads of eight bits, eight bits, and four bits. When the register file contains the trial word, the tokenizer is then ready to accept further commands.
It should be noted that four address bits are signif¬
30 icant for appropriate loading of the register file 34-38. The codes used to access the register file begin with the first address described in block 306, namely, 00001, and end with the address discussed in block 328, namely, 01111. Though only pins 12 and 14 of address buffer 58 and 7, 9,
__ and 10 of register file select 59 were discussed in blocks 35
1 306-328, it is to be understood that all four bits of the address were used to access and load the register file 34-38. For example, pins 16 and 18 of buffer 58 were used to produce the appropriate output signal from pins
5 7, 9, or 10 of register file select 59.
In block 330, the control microprocessor issues a write command for loading into the token address counter 16-19 part of the starting address of the sector to be searched. Specifically, if the address is 00100, the
10 address decoder 50 is enabled through low signals on pins 16 and 18 of buffer 58 and a high on pin 11 of register file select 59. This causes a 3-bit code to be sent from pins 9, 12, and 14 of address buffer 58 to the decoder 50 to strobe token address counter-register 16 and 17 from
15 pin 14 of the decoder. Upon the production of the strobe, SCANLDL is sent to counters 16 and 17 so that the counters will accept the low eight bits of the starting address from the internal mailbus 114 through transceiver 46. The tokenizer is then ready to accept the high eight bits
2o of the starting address to be loaded into counters 18 and 19.
In block 332, the remaining eight bits of the starting address are loaded into the counters 18 and 19 over the internal mailbus 114. Specifically, if the command address
25 is 01000, the address decoder 50 is enabled as in block 330 to strobe token address counter-register 18 and 19. A strobe SCANLDH is provided to counters 18 and 19 from pin 13 of decoder 50 to accept the high bits of the start¬ ing address from the internal mailbus 114. The token-
3Q izer is then ready to accept a further read or write com¬ mand.
In block 334, a portion of the value of beta, the total number of grammar bytes to be read to search through the entire sector under consideration, is loaded into the
_ dictionary scan down counter 54-56. Specifically, if the
1 command address is 01100, the address decoder 50 is enabled as in block 330 to strobe the dictionary scan down counter 54 and 55 to accept the low eight bits of beta from the internal mailbus 114. The upper four bits can then be
5 loaded into counter 56.
In block 336, the remaining four bits of the beta value are loaded into the counter 56. Therefore, if the command address is 10000, the address decoder 50 is enabled as in block 330 to strobe the dictionary scan down counter
10 56 using the SCANCTH from pin 10 of decoder 50. The dic¬ tionary scan down counter 56 then accepts the high four bits of beta from the internal mailbus 114.
In block 338, the se-gment code is loaded into segment holding register 48 according to a 2-bit code read from
15 the internal mailbus 114 through bidirectional transceiver 46. Specifically, if the control microprocessor outputs an address of 10100, address decoder 50 is enabled through low signals on pins 16 and 18 of the buffer 58 and a high signal on pin 11 of register file select 59. A 3-bit
20 code is sent from pins 9, 12, and 14 of buffer 58 to the decoder 50 to provide a load strobe from pin 15 of decoder 50 to cause the segment holding register 48 to accept the 2-bit segment code. The segment code is taken over lines 120 from the internal mailbus 114 through transceiver 46.
25 The tokenizer is then available during the next clock cycle to accept another command. Otherwise, a different command and address code was provided by the control micro¬ processor and is processed as provided herein.
It should be noted that the above-described sequence
30 of steps is not significant for operation of the tokenizer. As long as the various components are reset and loaded with the appropriate information, the particular order in which the information is loaded into the components is not significant. However, the appropriate values should
,5_ be entered in the registers prior to issuing the start
1 scan command, since the tokenizer functions as an indepen¬ dent processor once the required parameters are loaded. In this way, the control microprocessor can do separate tasks while the tokenizer is operating. However, once the
5 appropriate registers are loaded, the start scan command may be issued.
In block 340, the tokenizer is issued a command to start scanning for the word to be tokenized. Since the trial word is already stored in the register file 34-38,
10 and the start address is loaded in token address counter 16-19, the counters have been loaded with the appropriate values, the tokenizer can step through the language words stored in the appropriate bank of dictionary ROMs 1-10 to compare the words with the trial word. This is done in a
15 grammar byte-by-grammar byte sequence where the grammar byte of the trial word is compared with a respective gram¬ mar byte of a language word found in the dictionary ROM, all during a single clock cycle. If a match is found after all grammar bytes of a given language word have
20 been compared, a data equals (DEQ) is issued, and the scan is terminated. If the end of the sector is reached without a match being indicated, the data not found signal (DNF) is issued, arid the tokenizer awaits a further com-- mand. The actual tokenization process is depicted in
25 blocks 342-368 of the flowchart of FIG. 8. Therefore, if the write address is 11000, the address decoder 50 is enabled as in block 330 to strobe the scan control logic 51 to set the SCANEN latch for token address counter 16- 19. The latch is also provided to the segment down counter
30 49 and to the dictionary scan down counter 54-56 so that those components are enabled. The scan control logic 51 also provides a scan in progress (SIP) code to the internal mailbus 114 through multiplexer 45 so that a check by the microprocessor 100 will show that the tokenizer is scan-
„ ning. Thereafter, the tokenizer operates as described 35
1 below without further input from the control microprocessor until the scan is terminated.
In block 342, during the first clock cycle, the scan control logic 51 provides a high signal to pin 10 of AND gate 67, along with the SIP code, and provides a second high signal from pin 9 of scan control logic 51 to pin 9 of AND gate 67 for providing a load strobe to pin 11 of segment down counter 49. This allows loading of the 2- bit segment code from the segment holding register 48.
10 For a five-character word such as "water", two reads or two grammar bytes are required to completely access the language word from the appropriate bank of dictionary ROMs 1-10.
In block 344, the token address in counters 16-19,
15 including the token addresses TKAD00 and TKAD14 and the enable signals from AND gates 25 and 26 allow access to the first grammar byte of the first word under comparison in the given bank of ROMs 1-10. The particular high or low bits of each ROM are selected through the multiplexers
2o 11-15 by the address TKAD00, and the particular bank of ROMs is selected by the address TKAD14. The actual address in each ROM is determined by the contents of token address counter 16-19. Given the address in the token address counter, the first grammar .byte of the first word under
25 comparison is provided through multiplexers 11-15 to the bank of comparators 29-33. The first grammar byte is compared with the first grammar byte in register file 34- 38, as selected by the 2-bit code from segment up counter 47. For example, if the trial word is "water", and the
3Q first language word in the particular sector of dictionary ROMs 1-10 is "waste", two grammar bytes are required to make a complete comparison. The token address counter will have selected the first grammar byte for the word "waste", and the segment up counter 47 will have selected
_ the first grammar byte from the register file 34-38. For
example, the segment up counter may provide a code 00 to indicate the first four bits of the 4-bit register slice in each register 34-38 to be output to the comparator bank 29-33. In block 346, a determination is made as to whether or not the dictionary language word grammar byte is iden¬ tical to the corresponding trial word grammar byte under comparison. As discussed above, the tokenizer reset caused a 1 to be posted at pins 5 and 12 of the compare latch 39 to be input to the comparators 29-33. The comparators are linked by a daisy chain connection so that a 1 posted on the input of comparator 29 is also posted on the input of comparators 30, 31, 32, and 33. As a result, the output from pin 6 of comparator 33 also indicates a 1 for input to pin 2 of latch 39. However, if any comparator does not indicate a match, the output of that particular com¬ parator goes to zero, and the output from the subsequent comparators, including the output from pin 6 of comparator 33, goes to zero. Thereafter, a zero is provided to pin 2 of latch 39 and is output from pins 5 and 12 of the compare latch until the compare latch 39 is reset.
Assuming that there is no match in the first grammar byte under comparison, a zero is posted on pins 5 and 12 of the compare latch 39, indicating a mismatch. See block 348. As a result, the output from pin 9 of compare latch 39 remains low, indicating the lack of "data equals". The scan limit down counter 54-56 is then tested according to block 350 to determine whether or not the down counter has reached zero. As discussed above, the down counter is initially loaded with beta, the total number of grammar bytes required to be read from a given sector to read all of the words contained in that sector. By way of example, a sector comprising 5-character words, beginning with "w" and containing five language words, would be stored in five groups of two grammar bytes each so that a total of
1 ten grammar bytes would be required to be read in order to scan the contents of the sector.
If the scan limit down counter had reached zero, indicating that all grammar bytes of the words contained
5 in the given sector had been read as in block 352, a signal is sent from dictionary scan down counter to pin 3 of scan termination logic 40, which then provides a low output from pin 5 to NAND gate 70. The output of NAND gate 70 is then fed back to pin 12 of scan termination logic 40
10 as a high signal, which produces a low from pin 8. The low signal from pin 8 provides a reset to pin 1 of scan control logic 51, which causes the scan in progress (SIP) to go low. The scan enable code (SCANEN) is also stopped to stop all the counters in the tokenizer. Simultaneously,
15 a data not found code (DNF) is produced at pin 6 of scan termination logic 40 to be output through pin 13 of multi¬ plexer 45; The countdown of the dictionary scan down counter to zero indicates that all dictionary words in the sector have been scanned and compared. In the example
2o where the given sector under consideration contains five- words, five letters each, ten grammar bytes will have been read over the course of ten clock cycles. After the tokenizer is stopped and when the data not found signal is produced, the tokenizer waits for a read or write com-
25 mand.
Where the scan limit down counter has not yet reached zero in block 350, the tokenizer then tests to see whether or not the segment down counter 49 has reached zero, as indicated in block 354. In the example of "water", the
30 segment code originally indicated that two grammar bytes must be read from the appropriate bank of the dictionary ROMs 1-10 in order to compare the entire word. Since only one grammar byte had been read, the segment down counter shows a representation of "1", having been decre-
_. __ mented once.
•JO
Where the segment down counter has reached zero, as indicated in block 365, the scan control logic 51 is given a signal from the segment down counter 49 to test the compare latch 39 to determine the results of the compare of the two words. Since block 346 produced a mismatch, scan outputs a load strobe from AND gate 67 to reload the segment down counter 49 from the segment holding register 48. The signal to scan control logic 51 also resets the segment up counter 47 through the output from pin 8 of scan control logic 51. The same signal resets the compare latch 39 to post a 1 on pins 5 and 12 of latch 39 and increments the token address counter and decrements the dictionary scan down counter. This indicates that all grammar bytes for a given word have been read, but that subsequent words remain to be compared in the comparators 29-33, since dictionary scan down counter had not reached zero. The compare process then repeats, as discussed above and as described below.
If the segment down counter has not yet reached zero, the segment up counter 47 is incremented as indicated in block 360. This provides a signal to the register files 34-38 to allow parallel access to the appropriate 4-bit registers in each register 34-38. Additionally, the segment down counter is decremented, and the dictionary scan down counter is also decremented. Finally, the token address counter is incremented to allow access to the next succeeding grammar byte in the appropriate bank of dictionary ROMs 1-10. The compare process is then repeated through block 344 for each succeeding grammar byte and subsequent language words on a grammar byte-by-grammar byte basis (since a mismatch was found for the word under comparison) .
Returning to block 346, if a match is indicated, the flow continues to block 362. In this case, the daisy chain of comparators 29-33 have found a match of the re-
1 spective bits being compared and maintain a 1 at the output from pin 6 of comparator 33. Therefore, a 1 is maintained on pins 5 and 12 of compare latch 39. For example, if a match was not found for "waste", and if the next word in
5 the dictionary ROMs is "water", a comparison of the first grammar byte for the word "water" from the dictionary ROM with the first grammar byte from register files 34-38 would produce a match. Then, the scan limit down counter would be tested as in block 362 to determine whether or
10 not the counter had reached zero.
Assuming for the moment that the scan limit down counter did not reach zero, block 362 continues to block 364. The tokenizer then determines whether or not the segment down counter has reached zero. This step deter- 5 mines whether or not all the grammar bytes for the trial word "water" have been read and compared. In the case where the second grammar byte for "water" has not been compared, the tokenizer increments the segment up counter 47 to designate a new register file location to be accessed
20 and decrements the segment down counter with the next clock cycle. The tokenizer also decrements the dictionary scan down counter and increments the token address counter so that the next grammar byte in the dictionary ROMs can be accessed. These steps are carried out in block 360. The
25 operation then returns to the input of block 344 to conduct an additional comparison.
Continuing to block 364, and assuming that a match was found for the word "water" in the comparison of the second grammar byte, the segment down counter will have
30 reached zero, since a representation of the segment code for two grammar bytes in the segment down counter 49 will have been decremented to zero. This indicates that all grammar bytes for the word "water" have been read and compared. A reset signal is then provided to pin 12 of
__ scan control logic 51, which provides an output signal at «J5
1 pin 8 to pin 11 of compare latch 39. This tests the status of compare latch 39 and finds a high signal at pin 9, indicating DEQ. At the same time, a low signal is output from pin 8 of latch 39 to NAND gate 70, which provides a
5 high signal to pin 12 of scan termination logic 40. This produces an interrupt signal at pin 8 of logic 40 to reset scan control logic 51. As a result, the scan in progress
(SIP) goes low, and the SCANEN goes high. This stops all counters and maintains the token address counter 16-19
10 with the same address as was present during the last read for which a match was found between the language word and the trial word. The tokenizer is then available to accept additional input such as a read command.
Assuming now that the decision block 362 indicates
15 that the scan limit down counter has reached zero, and that a match was indicated for comparators 29-33, block 368 is entered. It should be apparent that, if the scan limit down counter reaches zero, the segment down counter 49 has also reached zero. In that situation, a signal is
20 sent to pin 3 of the scan termination logic 40 from pin 13 of the •dictionary scan down counter 56. Additionally, the segment down counter 49 has sent a signal to scan control logic 51, which produces an output at pin 8 for testing the status of compare latch 39. Since the output
25 at pin 9 of compare latch 39 is high, indicating DEQ, an output is provided from pin 8 to NAND gate 70 which pro¬ duces a low pulse to pin 12 of scan termination logic 40. As discussed above, pin 8 provides an interrupt signal to AND gate 66 and a reset to scan control logic 51 through 0 pin 1 of logic 51. This stops all counters and terminates the SCANEN and stops the compare. The tokenizer is then available for reading the address of the matched word. The address may then be used to produce a token using the software of control microprocessor 100.
_ In summary, the sequence of operations for loading and
«
-61-
1 starting the tokenizer have been described. The operation of the tokenizer, after appropriate parameters have been loaded, has also been described. Once the tokenizer pro¬ cesses the trial word and the language words stored in
5 dictionary ROMs 1-10, output signals are provided indi¬ cating the status of the tokenizer with respect to the processing conducted by the tokenizer. During that pro¬ cess, an address may be derived which can be used by the control microprocessor to produce a token, as is well
10 known in the art.
In this regard, it will be assumed that an address for a matched word resides in token address counter 69- 19. In block 372, the control microprocessor provides a read command to access the address for the purpose of
15 deriving a token. Specifically, if the address is xxOOO and MBRW is high, use decoder 52 to enable reading of the least significant eight bits of the token address counter 69-19 through token address bus 113 and external mailbus 115. Then return to 300 to await a read command for the
20 remaining bits of the token address.
In block 274, the remaining bits of the token address for the match trial word is read from the token address counter 69-19. Specifically, if the read address is xxOOl and MBRW is high, use the decoder 52 to enable reading of 5 the most significant seven bits of the token address and return to 300 to await a further command.
Once the control microprocessor has read the 15 bits of the token address, a token can be derived as discussed above.
30 The next succeeding three blocks 376-380 are used to effectively translate a token into a human language word. This process is called detokenization and is dis¬ cussed below with respect to Fig. 9. However, assuming the control microprocessor issues a read command to read
- the ASCII or EBCDIC representation of a word contained in
*
-62-
the dictionary ROM corresponding to the address, the lan¬ guage word can be output to the control microprocessor on a grammar byte-by-grammar byte basis. Specifically, in block 376, an address xxOOlO and a high signal for MBRW causes GBLOCK is low. The decoder 52 is used to enable reading of the first eight bits of the grammar byte of the language word contained in the appropriate bank of ROMs 1-10.
If the control microprocessor issues an address xxOll and a high signal for MBRW, the decoder 52 is used to enable reading of the second eight bits of the grammar byte of the language word contained in the given bank of ROMs 1-10. Finally, in block 380, an address xxlOO and a high for MBRW allows decoder 52 to enable reading -of the last four bits of the grammar byte of the language word. If the particular word corresponding to the token requires more than one • grammar byte for storage of the word, the remaining grammar byte or grammar bytes can be accessed in the same manner.
The final read command the control microprocessor can produce is the address xxlOl. If MBRW is high, the decoder 52 enables reading of the tokenizer status bits from the pins of multiplexer 45. The tokenizer then re¬ turns to 300 to await a subsequent command and returns to 300 if the tokenizer fails to recognize any of the commands output by the control microprocessor.
During the process of detokenization of a token to produce a natural language word, the read and write com¬ mands described are used. Considering the flow chart of Fig. 9, detokenization is initiated by the control microprocessor first issuing a tokenizer reset command for resetting the tokenizer as indicated in block 402. The control microprocessor then utilizes an appropriate routine to derive a segment code and address from the particular token for which a natural language word is
desired from ROMs 1-10. The segment code and address are used as data and placed on the data bus 115 when the appro¬ priate address code is produced at the address bus. Specifically, block 406 indicates that the control micro- processor issues a write command ssssOOlOO for loading the low byte of the scan address register. When the ad¬ dress decoders decode the write command, the low byte of the scan address register is loaded with the least significant bits of the address to be accessed. Block 408 indicates the control microprocessor issues the address ssssOlOOO to load the high byte or the scan address regis¬ ter. This write command is also decoded in the decoders 52 and 50. The most significant bits of the address are then loaded in the token address counter 16-19. Reading of the language word from the appropriate bank of ROMs 1-10 is accomplished on a grammar byte-by- grammar byte basis wherein representations of each grammar byte of the language word are transmitted in a sequence of eight bits, eight bits, and four bits over three clock cycles. Specifically, the control microprocessor issues a MBRW code and an address ssssxxQIO in order to read the first eight bits of the appropriate grammar byte from grammar bus 112. The next eight bits and the last four bits of the particular grammar byte are read sequentially after the control microprocessor issues the read command ssssxxOll and ssssxxlOO.
As indicated in block 412, additional grammar bytes may be required to be read if the segment code derived by the control microprocessor from the particular token indicates that more than one grammar byte must be read to obtain the entire language word. If the segment code is a representation of "S2" the above steps in blocks 406, 408 and 410 are repeated once by incrementing the address by one and reloading the incremented address into the token address counter. Similar comments apply if the
1 segment code is a representation of "S3" or "S4". The control microprocessor then waits for the next read or write command.
Using the above described method and apparatus, natu-
5 ral language words can be transformed into tokens according to addresses of common words contained in a dictionary. The tokens derived in the tokenization process can be used to store files in a compressed form or to provide textual transmission over a substantially decreased amount
10 of time. When the language word is to be retrieved using the token, the token is manipulated to derive an address and a segment code for the language word corresponding to the address. The token may be processed to provide the language word so that the tokenized text can be displayed
15 or printed.
A further application of the method and means de¬ scribed herein is for tokenizing strings of words such as .phrases. Therefore, not only can common words be repre¬ sented by tokens, but common phrases including common 0 words which are already stored in the dictionary ROM can also be represented by a single token. This application provides for even greater reduction in required storage space or transmission time over regular tokenization. The same apparatus as described above can be used for
25 phrase tokenization. However, the dictionary ROM 104 contains additional lists comprising a keytoken list and a phrase table.
The keytoken list is preferably comprised of 16-bit tokens stored as a 20-bit grammar byte in a given bank of
30 ROMs 1-5 or 6-10. The last four bits of the grammar byte are not used. Additionally, phrases of up to four words in length are the preferred size phrases to be tokenized. It has been found that the effectiveness decreases with larger strings of words. It should be noted that 16-bit m m tokens are used for keytokens, rather than 15-bit tokens,
to take full advantage of the 30,000 language words storage capability of ROMs 1-10. In the prototype discussed above, 15-bit tokens wee used, because the full capacity of the address register was not used. The keytoken list is essentially a 1 x N array of tokens representing keytokens. In the language sense, a keytoken is taken to be the starting word of any common phrase which has been represented by a token in the dic¬ tionary ROMs 1-10. The phrase is then represented by a single 16-bit token in ROMs 1-10. If the common phrase has four words in it, those four words in the defined sequence are represented by a single 16-bit token. The same is true of three word phrases and two word phrases.
The keytoken list is arranged according to the al- phabetiσ starting letter of the word which is a potential keytoken and the relative address at which the keytoken is found for th .given starting, letter. For example, all of the 'possible keytokens having a starting letter of "a" are listed horizontally on the first row of Fig. 10. Each keytoken entry is in the form of a 16-bit keytoken byte stored in a portion of the dictionary ROMs 1-10. All the Ka entries in Fig. 10 represent predefined key¬ tokens representing words having an alphabetic beginning letter of "a".
A portion of the dictionary ROMs 1-10 also includes a phrase table such as the phrase table schematically depicted in Fig. 11. This portion of the dictionary ROM includes a two dimensional array of token triplets, S1-3. Each triplet has a phrase table address and a corresponding keytoken address. For example, the keytoken address 0 from Fig. 10 corresponds to the keytoken address 0 in the phrase table in Fig. 11. Each entry or triplet com¬ prises three tokens, the first token being the token for the first word after the keytoken word in the phrase, the second token representing the second word in the phrase
after the keytoken word, and the third token representing the third word in the phrase after the keytoken word. Each triplet in a given column under the keytoken address are tokens representing potential second, third, and fourth
5 words in a phrase beginning with the word represented by the keytoken found in the keytoken list of Fig. 10. Each entry in the phrase table comprises a triplet of tokens representing at most three words in a phrase. Each triplet is entered in the phrase table according to a starting
10 address derived from the keytoken list address of Fig. 10 and the particular phrase table address for the given triplet. Therefore, for a given starting address in the phrase table, a search can be conducted through each of the triplets to determine whether or not the trial text
15 phrase is included in one of the triplets under the asso¬ ciated starting address. If a match is found between a triplet and the trial phrase, the phrase table address "corresponding to the matching triplet is used by the con¬ trol microprocessor to produce a token for the given
2o phrase. The token is then used by the control micropro¬ cessor in the same manner as any other token representing a single language word.
Finally, the control microprocessor includes a memory map in the form of a phrase table memory map such as that
25 depicted schematically in Fig. 12. The memory map includes two entries for corresponding to a plurality of keytoken addresses. The keytoken address is that derived from the address of Fig. 10. The phrase table memory map is used to derive a starting address for the phrase table of
30 Fig. 11 and the total number of grammar bytes required to read a particular set of triplets corresponding to the starting address derived from the memory map. For example, a keytoken word which is the starting word for a four word phrase and which has a beginning letter of "a" may m __ match the keytoken Kan. The control microprocessor then
reads the address, namely 1, and accesses the phrase table memory map to determine the starting address and the total number of grammar bytes so that the trial phrase, in the form of tokens, can be compared to the stored tokens for the given starting address. For example, the address derived from the keytoken list of Fig. 10 corresponds to a starting address for the phrase triplets of 101 and includes a total of 15 grammar bytes. This information is used to load the token address counter 16-19 with the 0 starting address and to load the dictionary scan down counter 54-56 with a representation of the total number of grammar bytes, 15. The segment holding register 48 is loaded with a representation of 3 since the triplets in the phrase table in the dictionary ROM comprises three 5 tokens.
Considering the flow chart of Fig. 13 in conjunction with the tables of Figs. 10-12, the process of phrase tokenization will now be described. It will be assumed that a block of text is available in the control micro- 0 processor for tokenization. For each word in the text, a token is derived as discussed above with respect to Fig. 8. However, the dictionary ROMs now contain a keytoken list and a phrase table in addition to the stored library of language words. Therefore, phrase tokenization includes 5 the additional step of determining whether or not a token can be substituted for an entire phrase of up to four language words. As indicated in Fig. 13, a language word is retrieved from the text contained in the control pro¬ cessor and tokenized using the tokenizer 102a. During 0 block 504, the token is then used as a potential keytoken, namely the first word in a potential phrase to be substi¬ tuted by a token. The control microprocessor accesses a memory map for keytokens, such as that depicted schemat¬ ically in Fig. 13, using the starting letter of the origi¬ nal text word to determine a starting address and a key- D
1 token segment length for searching in a keytoken list. The start address is the address in dictionary ROMs 1-10 at which the search for the keytoken is to begin. As indicated above, the keytoken list contains the first
5 word of all phrases represented by tokens in the dictionary ROMs. The keytoken segment length is the number of gram¬ mar bytes required to be searched and compared for all keytokens of a given starting character. This step is similar to tokenization as discussed above.
10 During block 506, a search is conducted of the key¬ token list in ROMs 1-10 for a keytoken match using the start address obtained from the memory map and the keytoken segment length. Specifically, the start address is loaded into token address counter 16-19 with an appropriate write
15 address code. This will allow access to the keytoken list in dictionary ROMs 1-10. The dictionary scan down counter 54-56 is loaded with a representation of the key¬ token segment length to serve the same function as the term beta used in conjunction with tokenization. The .
2o dictionary scan down counter is loaded using an appropriate- address code to the mailbus address line 116. The segment holding register 48 is loaded with a representation of 1 since all keytokens are one grammar byte in length. The segment holding register is loaded using an appropriate
25 address to address buffer 58. Finally, the keytoken is loaded into register file 34-38 with an appropriate address write command in a manner similar to that discussed above with respect to tokenization. A start scan command is then provided to the mailbus address line and the tokenizer
3Q then begins searching through the keytoken list for a match.
As indicated in block 508, a match is indicated in the same manner as discussed above with respect to the description in conjunction with Fig. 8 for tokenization.
_ If a data not found command is issued, the control micro-
1 processor returns to retrieve the next subsequent language word for repeating the above-described process. If a match is found, the word from the text is a keytoken and is the first word in a potential phrase in the text. Therefore, it must be determined whether or not the word in the text is followed by at most three words which are identical to a phrase in the phrase table. To this end, the control microprocessor reads the address from the keytoken list at which a match was found between the trial
10 token and the keytoken.
As indicated in block 510, the keytoken address is used to search the phrase table memory map for the start address of the potential phrase and the segment length in the phrase table for those potential phrases having
1 the keytoken as the beginning word.
The phrase table memory map is depicted schematically in Fig. 12. Given the keytoken address from the keytoken list, for example address 1, the starting address of the three potential words or triplets in the phrase table and
20 the total number of grammar bytes needed to search through all the triplets having the keytoken as the first word can be found. For example, where the keytoken address is 1, the start address of the phrase triplets in the phrase table is 101. The total number of grammar bytes needed
25 to access all the token triplets corresponding to the keytoken is 15. It should be noted that the triplets to be searched in the phrase table are tokens rather than language words. The tokens were defined during creation of the dictionary ROM in a manner similar to the creation
30 of the dictionary library discussed above with respect to Figs. 1-9.
During block 512, the control microprocessor tokenizes the three words coming immediately after the trial keytoken obtained from the text in block 502. The three tokens
• ,JO_ are then loaded into the register file 34-38 to be compared
1 to the respective tokens in the triplets in the phrase table. The first word of the potential phrase need not be compared with any word in the phrase table since the trial keytoken was already found to match a keytoken in
5 the keytoken list.
In block 514, the start address derived from the phrase table memory map, 101, and the segment length, 15, are loaded in the tokenizer. Specifically, a repre¬ sentation of 101 is loaded into the token address counter
10 16-19 with an appropriate write address command. A repre¬ sentation of 15 is loaded into the dictionary scan down counter 54-56 with an appropriate write address command. The segment holding register 48 is loaded with a 2-bit code representing a segment code S3 using an appropriate
15 address command. The segment code is S3 since there are three tokens or three grammar bytes to be compared in comparators 29-33. It should be noted that the three trial tokens representing the last three words in the potential phrase are loaded into the register file 34-38
20 in a manner similar to that described with respect to tokenization in Fig. 8. This also allows for parallel processing resulting in shorter scan times. A start scan command is then issued by the control microprocessor to begin comparing the thre tokens in the register file 5 34-38 with the token triplets in the phrase' table starting at starting address 101. Specifically, the first token in the register file is compared with the token represented by SI, the second token is compared with the token repre¬ sented by S2 and the third token is compared with the token
3Q represented by S3. The steps involved are similar to those described with respect to Fig. 8. If a match is found so that compare latch 39 provides a data equals code, DEQ, the tokenizer is stopped and the address in the phrase table is used as the token for the phrase. mm The control microprocessor reads the address in a manner
similar to that described above with respect to tokeniza¬ tion. The control microprocessor then continues with the next word after the phrase which was just tokenized.
As indicated in block 518, if a match is not found between the first trial token and SI, the second trial token and S2 or the third trial token and S3, the control microprocessor computes a memory variable to determine whether or not the tokenizer compared three trial tokens with three tokens in the phrase table. If a comparison was made of three tokens, and a match was not found, block 520 is entered and the control microprocessor issues a write command to the tokenizer 102a to null the bits repre¬ senting the last token in the trial tokens being compared. This is done by placing null characters in the last grammar byte loaded into register file 34-38. In the present example, where three trial tokens were being compared, two trial tokens containing usable .information for comparing remain for comparison with the tokens represented by SI and S2. However, the control microprocessor still loads the segment holding register with a representation of S3. This allows comparison by the tokenizer of all three tokens represented by SI, S2, and S3. The control microprocessor then issues an address write command to load the dictionary scan down counter with the starting address of the phrase corresponding to the keytoken address 1, namely 101, and the total number of grammar bytes, 5.
The search is conducted again through the phrase table in the dictionary ROM 1-10 using the representa¬ tion of 101 as the starting address. The tokenizer co - pares only the first two tokens in the register file 34-38 with the respective grammar bytes represented by SI and S2. The tokenizer determines whether or not there is a match according to block 516 as described above. If a match is found, the phrase table address is read by the control microprocessor to be used in creating a token.
It should be noted in this case that the address and there¬ fore the resulting token represents only three language words; namely the keytoken and the language words repre¬ sented by SI and S2. It should be noted even though the phrase in the phrase table for which a match was found in this case only contains two tokens corresponding to SI and S2, the token corresponding to S3 contains only null characters. As a result, the control microprocessor re¬ places only three words from the text, namely the keytoken and the next two words, with the token derived from the phrase table. Tokenization of words and of phrases then continues as usual.
Returning to block 516, if a match is not found, the control microprocessor senses the data not found code (DNF) . The control microprocessor then tests its memory variable to determine whether or not the memory variable is greater than 1. If the memory variable is greater then 1, this indicates that in addition to the keytoken, there is at least one token in the phrase table the can still combine with the keytoken to constitute a phrase. The process is then repeated until a match is not found, and the bits storing the token corresponding to S2 and S3 have already been loaded with null characters. In such a case, the text word for which a keytoken was found will merely be represented by the 16-bit token.
In the process described above, an entire textual block can be represented by tokens requiring significantly less storage space or processing time than the ASCII or EBCDIC equivalent. Many of the phrases in the textual passage will be capable of being represented by a single token, the remaining words being represented by their own tokens.
Significantly, the tokenizer can conduct searches or scans very efficiently because a complete operation, in terms of a comparison with a match or rejection, can be
carried out in at least one system clock cycle and in, at most, four clock cycles. This is in contrast to other processors which require many clock cycles to complete a given task. Additionally, the disclosed embodiments pro- vide for sufficient flexibility to allow less than four grammar bytes to be processed in a given search, if the trial word contains less than thirteen characters.
The disclosed method and apparatus also enhances searching efficiency by limiting the search or scan range only to that small portion of the dictionary ROM where the word will appear, if at all. The scan limit down counter stops the search and compare process after there is no chance of finding a match. Also, all grammar bytes of a word are compared, even though a match is no longer possible, in order to properly increment the address coun¬ ter for the next compare cycle.
It is also significant that parallel processing is used to simultaneously access language words in the dic¬ tionary ROM to compare the trial and dictionary words and to read and write data. This enhances the efficiency.
TABLE III
A 02 - 0000 OFFF A 03 - OOFO OOAO A 04 - OFFF OOEA
A16 • B 02
10 B 03
B 16 -
15
Z 16 - ODDD; OOFO
20
25
30
35
The following is a listing of a basic program for causing the symbolic tokenizer to check the dictionary ROM 1 through 10 for a word typed into the control micro¬ processor 100 through its keyboard. If the word is not found in the dictionary ROM 1 through 10, then the word is highlighted after the return key is struck. If the word is found, then the word is not highlighted.
TABLE IV
1 PRINT CHR$(14) ; CHR$ (5) ; CHR$(147)
2 POKE 53280,0: POKE 53281,0
10 IF FL=1 THEN 25
20 FL=1: LOAD "SPELL.HC",8,1
25 GOSUB 1000
30 A$=»"
40 B$=""
50 A=l: G=0 6 600 G=G+1
65 GET A$: IF A$="" THEN 65
70 IF A$=CHR$(13) OR A$=CHR$ (141) THEN 65
75 PRINT A$;: B$=B$+A$: GOSUB 1000
77 K=ASC(A$): IF K>127 THEN K=K-128
78 IF K=32 OR K=45 THEN 240
79 IF K <65 OR K>90 THEN G=G-l:GOTO 100
80 POKE (52991+G) , ASC(A$)
100 A$=»"
110 A=A+1
120 GOTO 60
130 POKE 52991,G
140 SYS 49182
150 IF PEEK (52990)=1 THEN 30
160 PRINT CHR$(146) ;
170 GOSUB 2000
180 PRINT CHR$(18) ; B$; CHR$(146) ; : A$=""
190 GET A$: IF A$="" THEN 190
200 IF ASC(A$) <> 133 AND ASC (A$) <> 137,
THEN A=l: B$="": G=l: GOTO 70
210 GOSUB 2000
220 GOSUB 1000
230 GOTO 30
240. IF A=2 THEN 30
250 GOTO 130 1 1000000 PRINT CHR$(18)" " CHR$(157) ; CHR$ (146) ;:RETURN
2000 FOR C=l TO A: PRINT CHR$(157) ;: NEXT C: RETURN
The following is a Commodore 6510 Assembly Language listing of the code required to cause the control micro¬ processor 100 (a Commodore 64K) to access the symbolic tokenizer when the above Basic program is run. The Commo- dore 64K computer uses a 6510 central processing unit which has a similar assembly language instruction set to the more popular 6502 central processing unit.
l TABLE V
MAP .WORD $C800
.WORD $C868
.WORD $C8D0 .WORD $C938
.WORD $C9A0 .WORD $CA08 .WORD $CA70 .WORD $CAD8 10 .WORD $CB40
.WORD $CBA8 •WORD $CC10 .WORD $CC78 .WORD $CCE0 15 .WORD $CD48
.WORD $DDB0
RFLG=52990
• CCNT=RFLG+1
CSTK=CCNT+1 o MEM=CSTK+40
LENGTH=MEM+1 SEGC=LENGTH+1 OUTSTK=SEGC+l CNT5=OUTSTK+3 5 CNT8=CNT5+1
XSAVE=CNT8+1 ZFLAG=XSAVE+1 SHIFTC=ZFLAG+1 YCNT=SHIFTC+1 0 FLET=YCNT+1 ZPAGE=$FB TKRST=56832 TKSAL=56836 TKSAH=56840 TKSCL=56844 5 TKSCH=56848
TABLE V (Contd.) TKSEG=56852 TKRUN=56856 TKSTAT=56837 CLD ;CLEAR DECIMAL
LDX CCNT ;GET CHARACTER COUNT DEX ;LAST CHARACTER
LDA CSTK,X ;GET CHARACTER AND #$7F FORCE LOWER CASE STA MEM STUFF IT SEC SET CARRY HI
SBC #$5A SUBTRACT TOP OF RANGE BCC NCHR BRANCH IF NOT CHARACTER LDA #$40 GET BOTTOM OF RANGE SBC MEM GET CHARACTER BCS NCHR+1 BRANCH IF CHARACTER
NCHR DEX LAST NQT LETTER
STX LENGTH LAST CHARACTER ADDRESS LDX #0 INDEX TO 0 RFMTLP LDA CSTK,X GET FIRST LETTER EOR #39 TEST FOR APOSTROPHE BNE LETTER BRANCH IF NOT LDA #26 GET APOSTROPHE STA CSTK,X STUFF IT JMP CET GET NEXT LETTER
LETTER SEC SET CARRY HI
LDA CSTK,X GET A LETTER AND #$7F MAKE IT LOWER CASE SBC #64 CREATE 5 BIT CODE STA CSTK,X PUT IT BACK
CET CPX LENGTH TEST FOR LAST LETTER BEQ PACK BRANCH IF RFMT DONE INX SELECT' NEXT LETTER
JMP RFMTLP GO REFORMAT TO 5 BITS
PACK LDA CSTK GET FIRST CHARACTER STA FLET SAVE IN FIRST LETTER CELL
TABLE V (Contd.)
LDA #0 ;GET 0
STA SEGC ;GET SEGMENT TO 1
STA TKRST ;RESET TOKENIZER
STX XSAVE ;SAVE X
OUTCLR LDA #0 •GET 0
STA CNT5 ;RESET SOURCE BIT COUNTER
STA CN 8 ?RESET DESTINATION BIT COUNTER
STA ZFLAG ?RESET ZERO LETTER FLAG
STA SHIFTC ?RESET SHIFT COUNTER
STA OUTSTK •CLEAR TRIAL BYTE A
STA OUTSTK+1 •CLEAR TRIAL BYTE B
STA OUTSTK+2 ■CLEAR TRIAL BYTE C
STA YCNT •CLEAR PSEUDO Y
SLOOP LDX XSAVE •GET X
LDA ZFLAG . ;GET ZERO FLAG
- CMP #1 " •SEE IF SET
CLC •CLEAR CARRY
BEQ ZERO ■BRANCH IF SET
ROR CSTK,X •SHIFT SOURCE TO CARRY
STX XSAVE •SAVE X
ZERO LDX YCNT GET PSEUDO Y
ROR OUTSTK, SHIFT INTO TRIAL BYTE
LDX XSAVE GET X BACK
INC SHIFTC INCREMENT BIT COUNT
LDA CNT5 GET SOURCE BIT COUNT
CMP #4 TEST FOR BYTE EXHAUSTION
BEQ RST5 ; BRANCH IF EXHAUSTED
INC CNT5 INCREMENT SOURCE BIT COUNT
JMP BTST J GO CHECK TOTAL BITS
RSTΞ LDA #0 GET ZERO
STA CNT5 " ~ ', CLEAR SOURCE BIT COUNT
CPX #0 SEE IF LAST LETTER OUT
BNE NXTLET J BRANCH IF NOT
LDA #1 GET 1
TABLE V (Contd. )
STA ZFLAG ;SET ZERO LETTERS FLAG
JMP BTST ;GO CHECK TOTAL BITS
NXTLET DEC XSAVE ;SELECT NEXT LETTER
BTST LDA SHIFTC ;GET SHIFT COUNT
CMP #20 ;TEST FOR 20 BITS SHIFTED
BEQ GBOUT ;OUTPUT THE GRAMMAR BYTE
LDA CNT8 ;GET DESTINATION BIT COUNT
CMP #7 ;TEST FOR BYTE EXHAUSTION
BEQ NXT8 ;BRANCH TO GET NEXT BYTE
INC CNT8 ;INCREMENT DESTINATION BIT COUNT
. JMP SLOOP ;GO SHIFT MORE BITS
NXT8 LDA #0 ;GET ZERO
STA CNT8 ;CLEAR THE DESTINATION COUNT
INC YCNT ;INCREMENT PSEUDO Y
JMP SLOOP ;GO SHIFT MORE BITS
GBOUT CLC ;CLEAR CARRY
STX XSAVE ;SAVE X
LDX YCNT ;GET PSEUDO X
ROR OUTSTK,X ;MOVE LAST 4 BITS DOWN
ROR OUTSTK,X
ROR OUTSTK,X
ROR OUTSTK,X
LDY #0 ;GET 0
LDA SEGC ;GET SEGMENT COUNT
CMP #0 ;CHECK FOR 0
BEQ OFFO ;BRANCH IF SEGMENT 1
CMP #1 ;CHECK FOR 1
BEQ OFF1 ;BRANCH IF SEGMENT 2
CMP #2 ;CHECK FOR 2
BEQ OFF2 ;BRANCH IF SEGMENT 3
LDX #13 ;GET S4 OFFSET
JMP DOIT
OFF0 LDX #1 ;GET SI OFFSET
JMP DOIT
TABLE V (Contd.)
OFF1 LDX #5 ;GET S2 OFFSET JMP DOIT
OFF2 LDX #9 GET S3 OFFSET DOIT JSR DOUT OUTPUT TRIAL BYTE A JSR DOUT OUTPUT TRIAL BYTE B JSR DOUT OUTPUT TRIAL BYTE C JMP NXGB SEE IF MORE GRAMMAR BYTES
DOUT LDA OUTSTK,Y GET BYTE FROM STACK STA TKRST,X OUTPUT TO TOKENIZER INX NEXT TRIAL ADDRESS INY NEXT STACK ADDRESS RTS RETURN
NXGB LDX XSAVE RESTORE X CPX #0 SEE IF DONE BEQ SEGLD BRANCH IF WORD LOADED INC SEGC SELECT NEXT SEGMENT LDA SEGC CMP #4 BEQ ABORT • ;BRANCH IF MORE THAN 16 LETTERS
DEC XSAVE SELECT NEXT LETTER JMP OUTCLR GO' LOAD NEXT SEGMENT
ABORT LDA #1 GET OK CODE STA RFLG PASS IT TO BASIC RTS RETURN TO BASIC
SEGLD LDA SEGC GET FINAL SEGMENT STA TKSEG PUT IN TOKENIZER DEC LENGTH OFFSET WORD LENGTH LDA LENGTH GET WORD LENGTH ASL A MULTIPLY BY 2 TAX PUT IN X LDA MAP,X READ THE MAP LO BYTE STA ZPAGE SAVE IT INX ADVANCE TO HI BYTE LDA MAP,X READ THE MAP HI BYTE
TABLE V (Contd.) STA' ZPAGE+1 ;SAVE IT DEC FLET ;BIAS THE LETTER CODE LDA FLET ;GET THE LETTER OF THE WORD ASL A ; ULTIPLY BY 4 ASL A TAY USE AS SUB-MAP INDEX
LDA (ZPAGE) ,Y GET START ADR LO STA TKSAL OUTPUT TO TOKENIZER INY ADVANCE INDEX
LDA (ZPAGE) ,Y GET START ADR HI STA TKSAH OUTPUT TO TOKENIZER INY ADVANCE INDEX
LDA (ZPAGE) ,Y GET SCAN LIMIT LO STA TKSCL OUTPUT TO TOKENIZER INY ADVANCE INDEX
LDA (ZPAGE) ,Y GET SCAN LIMIT HI STA TKSCH OUTPUT TO TOKENIZER LDA #0 GET 0 STA TKRUN START THE SCAN
TKWAIT LDA TKSTAT GET TOKENIZER STATUS AND #$30 MASK OFF UNWANTED BITS CMP #0 CHECK FOR BUSY BEQ TKWAIT WAIT FOR DONE CMP #$20 TEST FOR DATA NOT FOUND BEQ BAD BRANCH IF NOT FOUND LDA #1 GET FOUND CODE
DONE STA RFLG PASS IT TO BASIC RTS RETURN TO BASIC BAD LDA #0 GET DATA NOT FOUND CODE JMP DONE GO PASS IT ON .LIB MAP.SR
The following is a sample page from the memory map of the dictionary ROM 1 through 10 as stored in the control microprocessor 100 (a Commodore 64K) . This memory map is a portion of the Assembly Language program listed above. It tells the control microprocessor 100 where the beginning address for each sector is and how many words are in each sector. This information is used to load the address counter 105 (in FIG, 1) and the scan limit down counter 108 (in FIG. 1) as discussed earlier. The map entries are listed in groups of two. The first entry is the starting address for each sector, and the second entry is the number of grammar bytes in that sector.
For example, in .WORD $FFFF below, FFFF is the start- ing address of a sector. It is the sector containing words starting with "A" and being two characters in length. In .WORD 13, which follows .WORD $FFFF, 13 is the number of grammar bytes in that sector. In .WORD 11, 11 is the starting address of another sector. This sector contains words starting with "B" and which are also two characters in length.
1 TABLE VI
*=$C800
.WORD $FFFF
.WORD 13 5 .WORD 11
.WORD 4
.WORD 14
.WORD 3
•WORD 16 10 .WORD 4
.WORD 19
.WORD 7
.WORD 25
.WORD 2 15 .WORD 26
.WORD 2 , .WORD 27
.WORD 5
.WORD 31 20
25
30
35
The following is a list of the designated components e symbolic tokenizer:
TABLE VII
1-10 Dictionary ROM FFIIGG.
11- -15 Data Multiplexers i"t
16- -19 Token Address Counter I"I
20- -22 Digital Inverters I"I
23 Non-Inverting Buffer I"I
24 Deleted "
25- •26 Dictionary Enable AND Gates "
27 DELETED "
28 GBLOCK Inverse AND Gate FFIIGG.. 3
29- -33 Comparators I"I
34- ■38 Register File I"I
39 Inter Grammar Byte Compare Latch I"I
40 Scan Termination Logic I"I
41- •45 Tokenizer Output Multiplexer I"I
46 Bi-Directional Data Bus Transceiver I"I
47 Segment Up Counter FFIIGG.. 4
48 Segment Holding Register 2-bit register " Dual Flip-Flop
49 Segment Down Counter - "
50 Address Decoder "
51 Scan Control Logic "
52 Address Decoder "
53 Address Select "
54- ■56 Dictionary Scan Down Counter "
57 Dip Switch "
58 Address Buffer "
59 Register File Select "
60 Inverter "
61 Inverter "
62 Inverter "
63 Inverter "
64 Inverter "
65 AND Gate "
66 AND Gate »
TABLE VII (Contd.)
67 AND Gate II
68 Inverter FIG. 3
69 Inverter II
70 NAND Gate II
100 Control Microprocessor FIG. 1
101 Interface It
102 Mailbus II
103 Compare Logic II
104 Dictionary ROM II
105 Address Counter II
106 Segment Register II
107 Segment Counter II
108 Scan Limit Down Counter II
109 Control Logic II
110 Status Register II
111 Trial Word Register II
112 Grammar Bus II .
113 Token Address Bus II
114 Internal Mailbus II
115 External Mailbus »
116 Mailbus Address Lines II
118 Lines Between Figs 3 and 4 FIGS. 3
120 II II II II II
122 II II II II II
124 II II II II II
126 II II II II II
128 II II II II II
201 Commodore 64K Connector FIG. 5
202 Bus Transceiver Enable NAND Gate II
203 Mailbus I/O NAND Gate II
204 Bidirectional Bus Transceiver II
205- 206 Buffers II