Tsai, C.-H. (2001). Word identification and eye movements in reading Chinese: A modeling approach. Doctoral dissertation, University of Illinois at Urbana-Champaign.
This part of the study consists of a series of analyses based on Guo's (1997) original finding on the two types of ambiguity in tokenization: conjunctive (hidden) and disjunctive (critical).
The materials were 4,327,663 critical fragments identified from the entire reduced ASBC corpus in the manner described below. For each sentence, each critical fragment was identified with the 131,616-word lexicon, using the definition of critical fragment in Guo (1997). When a critical fragment was identified, the correct tokenization of the original string was used to check if the fragment itself happened to be a correct word, and such information was also recorded. The following is an illustration of the identification process.
Suppose "ABCDEFGHIJK" is a Chinese string where each letter is a Chinese character. We first use the lexicon to look for the longest match beginning with "A". Suppose it is "ABC". Then we look for the longest words beginning with "B" and "C". Suppose they are "BC" and "C". Since none of these exceeds the right boundary of the longest word "ABC", we know that a critical fragment with conjunctive ambiguity is found, and its length is three.
Now the unprocessed string becomes "DEFGHIJK". Again we find the longest word beginning with "D". Suppose we find "DE". Then we look for the longest word beginning with "E". Suppose it is "EFG". Since it exceeds the right boundary of "DE", we know that "DEFG" is possibly a critical fragment with disjunctive ambiguity. But we do not stop here. We continue to look for the longest words beginning with "F" and "G" and check whether they exceed the right boundary of "DEFG". If none of them exceeds the right boundary of "DEFG", a critical fragment with disjunctive ambiguity is found, which is "DEFG". On the other hand, if at least one of them does exceed the right boundary of "DEFG", the process continues recursively. For example, suppose from "G" the longest word is "GH", and from "H" the longest word is "H". At this point we know that a critical fragment with disjunctive ambiguity is found, and that it is "DEFGH".
Now suppose "DEFGH" is indeed a critical fragment, so the unprocessed string becomes "IJK". We look for the longest word beginning with "I", and the longest word is "I". In this situation, "I" itself is a critical fragment, with no ambiguity, of course. The unprocessed string becomes "JK" and the identification process continues until it reaches the end of string.
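The identification process described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the treatment of any single character as a length-one fallback word, and the max_len lookup cutoff are all assumptions of the sketch.

```python
def longest_match(s, i, lexicon, max_len=8):
    """Length of the longest lexicon word starting at s[i]; 1 if none
    (a single character is always a fallback match)."""
    for n in range(min(max_len, len(s) - i), 1, -1):
        if s[i:i+n] in lexicon:
            return n
    return 1

def critical_fragments(s, lexicon):
    """Split s into critical fragments, yielding (fragment, ambiguity) pairs,
    where ambiguity is 'none', 'conjunctive', or 'disjunctive'."""
    i = 0
    while i < len(s):
        right = i + longest_match(s, i, lexicon)  # boundary of longest word from i
        disjunctive = False
        j = i + 1
        while j < right:                          # check every interior start position
            end = j + longest_match(s, j, lexicon)
            if end > right:                       # a word crosses the boundary:
                right = end                       # extend the fragment (disjunctive)
                disjunctive = True
            j += 1
        if disjunctive:
            amb = "disjunctive"
        elif right - i > 1:
            amb = "conjunctive"
        else:
            amb = "none"
        yield s[i:right], amb
        i = right
```

With a toy lexicon {"ABC", "BC", "DE", "EFG", "GH"}, scanning "ABCDEFGHIJK" yields "ABC" (conjunctive), "DEFGH" (disjunctive), and then "I", "J", "K" (no ambiguity), exactly as in the illustration above.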
A total of 4,327,663 critical fragment tokens were identified. Table A2 shows the frequency counts of critical fragments with different types of ambiguity (see Table A2, p. 89). Critical fragments with disjunctive ambiguity accounted for less than 4%. A "dictionary" of unique critical fragments was also compiled. There were 191,426 unique critical fragments.
It was also found that 93.93% of the critical fragments were correct words themselves (that is, the appropriate words in the original text), thus replicating Guo's (1997) finding. Since critical fragments with disjunctive ambiguity cannot be words, the high identification rate is contributed primarily by critical fragments with conjunctive or no ambiguity. (Identification and recall rates were two indices used in the analyses; they are commonly used in information retrieval studies. The identification rate is the proportion of identified items that are correct identifications, relative to the total number of items identified. The recall rate is the proportion of items correctly identified, relative to the total number of items that should be identified in the source material. For example, suppose the correct tokenization consists of five words, and a word identification algorithm identifies six words, four of which are correct. The identification rate is 66.67% (4/6), and the recall rate is 80% (4/5).)
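The two indices can be expressed directly; using the corpus figures reported in this chapter (4,327,663 fragments identified, 4,064,861 of them correct, out of 4,612,595 words), they reproduce the rates above. The function names are mine, for illustration only.

```python
def identification_rate(n_correct, n_identified):
    """Proportion of identified items that are correct identifications."""
    return n_correct / n_identified * 100

def recall_rate(n_correct, n_should):
    """Proportion of items that should be identified that were correctly identified."""
    return n_correct / n_should * 100
```

For example, identification_rate(4064861, 4327663) gives the 93.93% reported above, and recall_rate(4064861, 4612595) gives the 88.12% word recall reported later in this chapter.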
Table A3 shows the distribution of lengths of unique critical fragments with conjunctive or no ambiguity (see Table A3, p. 90). The distribution is very similar to that of unique words. In Table A3, the recall rate is the proportion of unique critical fragments relative to the number of unique words of a given length in the lexicon. As can easily be seen, the recall rates are all very high. However, this is not surprising: since this distribution is based on unique fragments/words, it does not necessarily reflect the real situation.
Table A4 shows the distribution of lengths of critical fragment tokens with conjunctive or no ambiguity (see Table A4, p. 91). Again, the distribution is very similar to that of word tokens, with one and two as the most dominant lengths; fragments longer than two characters are quite rare. As expected, the recall rates are not as high as those in Table A3, but they are still quite high. It is also interesting that the identification rates are very high.
The overall recall rate for words is 88.12%. Of all 4,612,595 words, 4,064,861 were correctly identified. A sentence was counted as correctly identified if all its constituent words were correctly identified. The recall rate for sentences is 69.15%. Of all 714,390 sentences, 493,987 were correctly identified.
We have seen that most critical fragments have either no ambiguity or conjunctive ambiguity, and that most of these critical fragments are the desired words themselves. Moreover, nearly 90% of word tokens can be identified this way without further disambiguation. On the other hand, critical fragments with disjunctive ambiguity remain unresolved.
If a heuristic favoring the longest word is so effective in dealing with critical fragments with conjunctive ambiguity, it should in principle also work for those with disjunctive ambiguity. To achieve this generalization, we need to ask what it means that most critical fragments with conjunctive ambiguity are the desired words. The most direct inference is that a single word is favored over multiple words. This can easily be generalized to a heuristic that favors fewer words over more words. In fact, this is also one of the components of Brent and Cartwright's (1996) DR function for word segmentation.
A heuristic that favors tokenizations with fewer words over those with more words can be re-described as one that favors tokenizations with longer average word lengths over those with shorter ones. I will hereafter call this heuristic generalized maximum matching (GMM), since it in principle could operate on any string with multiple tokenizations. Suppose a critical fragment "ABCD" has two critical tokenizations: "AB/CD" and "A/BC/D". The first tokenization has two words, or an average word length of two characters, while the second has three words, or an average word length of 1.33 characters. GMM will favor the first tokenization over the second.
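GMM can be sketched as follows. This is only an illustration under stated assumptions: the lexicon is a set of multi-character words, any single character may stand alone as a word, and all tokenizations are enumerated (the study itself only ranks critical tokenizations).

```python
def tokenizations(s, lexicon):
    """All ways to split s into words, treating any single character as a word."""
    if not s:
        yield []
        return
    for n in range(1, len(s) + 1):
        if n == 1 or s[:n] in lexicon:
            for rest in tokenizations(s[n:], lexicon):
                yield [s[:n]] + rest

def gmm(s, lexicon):
    """Generalized maximum matching: keep the tokenizations with the fewest
    words (equivalently, the longest average word length).  Ties remain."""
    candidates = list(tokenizations(s, lexicon))
    fewest = min(len(t) for t in candidates)
    return [t for t in candidates if len(t) == fewest]
```

With the lexicon {"AB", "BC", "CD"}, gmm("ABCD", ...) returns only ["AB", "CD"], as in the example above; note that the function returns a list of winners, since GMM may leave ties.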
Another generalization, which is much simpler and is already widely used, is the forward maximum matching (FMM) heuristic. It scans the string from left to right to find longest words one by one. In the above case of "ABCD", FMM would attempt to find the longest possible first word, which is "AB". Then it continues the process with the rest of string and finds "CD". So in this case it produces the same result as GMM. But this is not always the case. For example, the English word string "fundandruff" has two critical tokenizations: "fun/dandruff" and "fund/and/ruff". GMM will choose the first tokenization while FMM will choose the second.
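FMM itself is a simple greedy left-to-right scan, which can be sketched as follows (the max_len lookup cutoff is an assumption of the sketch, not part of the original heuristic):

```python
def fmm(s, lexicon, max_len=10):
    """Forward maximum matching: take the longest lexicon word starting at the
    current position, falling back to a single character, then advance."""
    words, i = [], 0
    while i < len(s):
        n = 1
        for k in range(min(max_len, len(s) - i), 1, -1):
            if s[i:i+k] in lexicon:
                n = k
                break
        words.append(s[i:i+n])
        i += n
    return words
```

With the lexicon {"fun", "fund", "dandruff", "and", "ruff"}, fmm("fundandruff", ...) returns ["fund", "and", "ruff"], illustrating how the greedy first choice of "fund" blocks the two-word tokenization that GMM would prefer.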
FMM is very greedy and does not take context into consideration. GMM, on the other hand, takes local context into consideration and is more conservative. Suppose "ABC" has two critical tokenizations, "AB/C" and "A/BC". FMM would favor "AB/C". However, since both "AB/C" and "A/BC" have an average word length of 1.5 characters, GMM cannot decide which one is "better". As a result, for GMM to work, additional heuristics are needed to handle ties. FMM could be one of them, but I will propose two more heuristics based on word frequency and mutual information.
An infinite number of heuristics could be used to resolve these GMM ties. However, when it comes to psychological plausibility, the first thing a psychologist should consider is word frequency. After all, effects of word frequency are strong in almost every lexical processing study. If we use the frequency of a word as an index of its "strength", we can use it to rate the "total strength" of a tokenization. This can be viewed as a crude implementation of lexical competition in the mental lexicon. Since the effects of word frequency are usually a function of the logarithm of raw frequency (Forster, 1990), using the logarithm, rather than the raw frequency, better represents the "strength" of a word. Base 2 is used in all logarithms in the current studies, for convention and convenience. In the rest of the text, I will use "word frequency" to refer to log2(word frequency), and "raw word frequency" to refer to the raw frequency count.
How do we measure the strengths of tokenizations? Since the numbers of words vary, the sum of word frequencies is not suitable; the average of word frequencies is more appropriate. This heuristic is hereafter called the average word frequency heuristic (AWF).
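AWF can be sketched in a few lines. The raw_freq mapping and the fallback count of 1 for unseen words (so the logarithm is defined) are assumptions of this sketch:

```python
import math

def awf(tokenization, raw_freq):
    """Average word frequency: mean of log2(raw frequency) over the words.
    raw_freq maps words to raw corpus counts; unseen words count as 1."""
    return sum(math.log2(raw_freq.get(w, 1)) for w in tokenization) / len(tokenization)

def best_by_awf(candidates, raw_freq):
    """Among tied tokenizations, pick the one with the highest average strength."""
    return max(candidates, key=lambda t: awf(t, raw_freq))
```

For example, with raw counts {"AB": 1024, "C": 4, "A": 16, "BC": 64}, the tie between "AB/C" and "A/BC" is broken in favor of "AB/C", whose average log2 frequency is 6.0 versus 5.0.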
Without resorting to input from higher levels such as syntax and semantics, and even without word frequency information, it is still possible to use low-level, primitive statistical information to rank competing tokenizations. This is possible based on our impression that many characters seem to have different probabilities of occurring in different positions of words. Some characters are most frequently used as single-character words, some are most frequently used as initial characters of multi-character words, and some are most frequently used as final characters of multi-character words. As a result, a character can carry four kinds of word boundary information:

1. It occurs as a single-character word.
2. It occurs as the initial character of a multi-character word.
3. It occurs as a middle character of a multi-character word.
4. It occurs as the final character of a multi-character word.

I will call these boundary roles. The set of characters (C) and the set of boundary roles (BR) constitute an information channel, which can be mathematically described as a two-dimensional matrix (C by BR) of conditional probabilities: Pi,j = P(brj | ci). Each row contains all the probabilities that a particular character ci takes on the particular boundary role brj.
The simple probability values do not take the probability of ci and the a priori probability of brj into account. To measure the amount of information gained due to the reception of ci, an index called mutual information is used (Hamming, 1986):
I(brj; ci) = I(ci; brj) = log2( p(ci, brj) / (p(ci) × p(brj)) )
With this mutual information index it is then possible to evaluate tokenizations. Because a character may have different boundary roles in different tokenizations, the amount of information it carries varies across tokenizations. Since the length (in character unit) is fixed, we can use either the average information per character or the sum of information of all characters as an index. This heuristic is hereafter called the mutual information heuristic (MI).
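Under these definitions, constructing the C × BR matrix and scoring tokenizations can be sketched as follows. The English role labels, the frequency-weighted estimation of the joint probabilities, and the floor value for unseen (character, role) pairs are all assumptions of this sketch, not details of the original study.

```python
import math
from collections import Counter

def char_roles(word):
    """Boundary role of each character: single, initial, middle, or final."""
    if len(word) == 1:
        return [(word, "single")]
    return ([(word[0], "initial")]
            + [(c, "middle") for c in word[1:-1]]
            + [(word[-1], "final")])

def mi_table(lexicon_freq):
    """Mutual information I(c; br) = log2(p(c, br) / (p(c) * p(br))),
    estimated from frequency-weighted character/role co-occurrence counts."""
    joint = Counter()
    for word, f in lexicon_freq.items():
        for c, br in char_roles(word):
            joint[(c, br)] += f
    total = sum(joint.values())
    p_c, p_br = Counter(), Counter()
    for (c, br), n in joint.items():
        p_c[c] += n
        p_br[br] += n
    return {(c, br): math.log2((n / total) / ((p_c[c] / total) * (p_br[br] / total)))
            for (c, br), n in joint.items()}

def mi_score(tokenization, mi, floor=-10.0):
    """Average mutual information per character of a tokenization."""
    pairs = [p for w in tokenization for p in char_roles(w)]
    return sum(mi.get(p, floor) for p in pairs) / len(pairs)
```

For example, if "AB" and "C" are frequent in a toy lexicon while "A" and "BC" are rare, the matrix assigns positive mutual information to (A, initial), (B, final), and (C, single), so "AB/C" outscores "A/BC".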
The 131,616-word lexicon and its frequency information were used to construct a C x BR mutual information matrix. The 149,907 critical fragment tokens with disjunctive ambiguity were used to test the effectiveness of the various heuristics.
Table A5 shows the distribution of lengths of unique critical fragments with disjunctive ambiguity (see Table A5, p. 92). As can be seen from the table, three and four are the most frequent lengths. The average number of words is two in the correct tokenization of three-character critical fragments. This is predictable, since the competition between "AB/C" and "A/BC" is the sole cause. The average number of words is also two in the correct tokenization of four-character critical fragments. This is more interesting, because a critical fragment with disjunctive ambiguity would have at least two words; in other words, this is indirect evidence for the effectiveness of GMM. Longer critical fragments are rare, and the average numbers of words in their correct tokenizations also seem to support GMM. Another interesting point is that although any critical fragment with disjunctive ambiguity has at least two critical tokenizations, throughout the corpus the majority of them have only one correct tokenization, and the alternative tokenizations never occur. (The "correct tokenizations" are defined operationally. For example, suppose a critical fragment "ABC" occurs in many different contexts of the corpus. Although it could be tokenized as either "AB/C" or "A/BC", in reality "AB/C" is the correct [desired] tokenization in all instances of "ABC". In such a situation we say that "ABC" has only one correct tokenization. Now suppose another critical fragment "DEF" also occurs in many different contexts of the corpus. Mostly "DE/F" is correct, while occasionally "D/EF" is correct. In this situation we say that "DEF" has multiple correct tokenizations.)
Table A6 shows the distribution of lengths of critical fragment tokens with disjunctive ambiguity (see Table A6, p. 93). This token-based distribution is very similar to the type-based distribution in Table A5, with three and four being the most dominant lengths. Other measurements are also similar, except that the proportion of multiple correct tokenizations for three-character fragments is slightly higher than that in the type-based statistics (12.47% vs. 2.86%).
Table A7 compares the performance of FMM with that of GMM in disambiguating unique critical fragments with disjunctive ambiguity (see Table A7, p. 94). FMM and GMM have comparable performance levels for critical fragments longer than three characters, although FMM's performance is slightly better; this is due to GMM's conservativeness. It is also worth noting that the recall rates are in general not very high, except for those rare, long critical fragments. The recall rates produced by FMM indicate a preference for left-to-right order, but the preference is not very strong. Specifically, a left-to-right preference seems least effective for three-character critical fragments, where the recall rate is less than fifty percent. GMM can do nothing with three-character critical fragments, because their alternative critical tokenizations always have the same average number of words, which is two. Table A8 shows similar statistics based on critical fragment tokens (see Table A8, p. 95); the pattern is quite similar.
GMM left about one fifth to one fourth of the critical fragments unresolved, due to ties in ranking tokenizations by the average number of words. As a result, further analyses were performed on these critical fragments. Three heuristics were tested: FMM, AWF, and MI. Table A9 shows the results based on unique critical fragments, and Table A10 shows the results based on critical fragment tokens (see Tables A9 and A10, pp. 96-97). Overall, AWF has the best performance, with recall rates around 85% to 90%. MI is good but not as good, with recall rates around 75% to 80%. FMM has the worst performance, with recall rates around 40% or lower.
The first part of the study replicated Guo's (1997) finding that, when a sentence is matched against a lexicon, unambiguous word boundaries can be found. Over 96% of the critical fragments (character strings segmented by unambiguous word boundaries) had conjunctive or no ambiguity; less than 4% had disjunctive ambiguity. Over 93% of the critical fragments were the desired words themselves, and over 88% of the words in the corpus were correctly recalled this way.
Such findings can be regarded as supportive evidence for the effectiveness of the maximum matching heuristic, which favors the tokenization containing only one word, in disambiguating conjunctive ambiguity. A generalized maximum matching heuristic that favors the tokenization with the minimum number of words, or the maximum average word length, was derived. GMM alone performed nearly as effectively as the directional, greedier FMM. Both the word-based, competition-driven AWF and the character-based, information-driven MI effectively resolved cases GMM could not handle due to ties in the ranking of tokenizations.
There seems to be no strong directional preference in disambiguating three-character critical fragments with disjunctive ambiguity, because FMM achieved only slightly less than 50% accuracy. On the other hand, either GMM+AWF or GMM+MI could disambiguate them quite effectively.
Finally, even disjunctive ambiguity was not that ambiguous in reality. The majority of critical fragments with disjunctive ambiguity were "single-sided" in the corpus; that is, although they had multiple possible tokenizations, only one actually showed up in the corpus. Together with the fact that most of them could be correctly identified without resorting to input from higher levels, it is not difficult to understand why this apparently difficult task seems so easy to Chinese readers.
© Copyright by Chih-Hao Tsai, 2001