Chapter 6
Word Identification in Reading Chinese: A New Analysis

Instead of viewing word identification in reading as an arbitrary, isolated problem-solving task, as has usually been done in the past, I will treat it as an adaptive problem and reanalyze it by bringing it back to its original context: cognition and language processing.

Spoken Word Identification

Speech perception may seem a very remote field to reading researchers. When it comes to word identification, however, especially in Chinese, reading and speech perception become much more similar. In speech there are no consistent, physical boundaries between words. Although prosodic and phonological cues may help listeners identify words, the volatile and noisy nature of the speech signal may limit the reliability of those cues.

Since speech perception is brought in because of its similarity to Chinese reading in terms of the absence of consistent, physical word boundaries, I will focus on computational models of spoken word identification that operate without relying on phonological cues. TRACE (McClelland & Elman, 1986) and Shortlist (Norris, 1994) are two well-known connectionist models of adult listeners' online spoken word identification. TRACE allows top-down feedback from lexical units to phoneme units, while Shortlist does not. Their architectures also differ: Shortlist can operate on a lexicon of realistic size, whereas TRACE has a lexicon of only about 200 words. Nevertheless, the two models share a number of features. Both identify only words already stored in the lexicon and have no way to discover new words. The identification of a word relies on competition among lexical candidates automatically activated by the input.

When a longer word is composed of two or more shorter ones, such as party (par tea) or cargo (car go), our experience is that we usually perceive only the longest word. Of course, stress or intonation may partially account for this phenomenon, but it is interesting to note that TRACE and Shortlist exhibit similar behavior.

With respect to this maximal matching heuristic, the most relevant point made evident in TRACE and Shortlist is that, in the interactive-activation framework, long words are favored over the words embedded in them because they eventually receive more bottom-up input and win the competition. Segmentation is achieved through the same competition: the input is segmented if two or more lexical candidates finally win out.
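
To make the competition dynamics concrete, the following sketch in Python simulates a much-reduced version of this idea. It is not TRACE or Shortlist: the three-word lexicon, the per-letter support, the inhibition weight, and the damped update rule are all illustrative assumptions. Its only purpose is to show that a candidate spanning more of the input accumulates more bottom-up support and suppresses the words embedded in it.

def compete(input_str, lexicon, steps=200, per_letter=0.1, inhibit=0.8, rate=0.3):
    # Collect every lexicon word that occurs in the input, with the span of
    # input positions it covers.
    candidates = []
    for word in lexicon:
        start = input_str.find(word)
        while start != -1:
            candidates.append((word, start, start + len(word)))
            start = input_str.find(word, start + 1)

    act = {c: 0.0 for c in candidates}
    for _ in range(steps):
        new_act = {}
        for cand in candidates:
            word, lo, hi = cand
            bottom_up = per_letter * (hi - lo)   # more covered letters, more support
            # Candidates that overlap in the input inhibit one another.
            inhibition = inhibit * sum(
                act[other] for other in candidates
                if other != cand and not (other[2] <= lo or other[1] >= hi)
            )
            target = max(0.0, bottom_up - inhibition)
            new_act[cand] = (1 - rate) * act[cand] + rate * target  # damped update
        act = new_act
    return act

acts = compete("cargo", {"car", "go", "cargo"})
for (word, lo, hi), a in sorted(acts.items(), key=lambda kv: -kv[1]):
    print(f"{word:6s}[{lo},{hi})  {a:.2f}")
# cargo settles near 0.50, while its embedded words car and go are driven to 0.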

PARSER (Perruchet & Vinter, 1998) is a model simulating Saffran, Newport, and Aslin's (1996) finding that adults were able to segment into "words" an artificial language that contained no pauses or other prosodic cues to word boundaries. The language was constructed by repeating six trisyllabic words in random order. PARSER took in chunks of one to three syllables at random. If a chunk was new, it was added to the lexicon; if it was already in the lexicon, its weight was increased. The lexicon was subject to forgetting and interference. Nevertheless, PARSER was able to acquire the six trisyllabic words in the long run.
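
The following sketch implements only the simplified description given above; it is not Perruchet and Vinter's full model, which additionally builds percepts out of existing lexicon units and includes interference among similar units. The six nonsense words, the parameter values, and the decay rule are illustrative assumptions.

import random

def parser_sketch(syllables, steps=20000, gain=1.0, decay=0.005, init=1.0):
    lexicon = {}
    pos = 0
    for _ in range(steps):
        size = random.randint(1, 3)                 # percept of one to three syllables
        if pos + size > len(syllables):
            pos = 0                                 # wrap around the stream
        chunk = "".join(syllables[pos:pos + size])
        pos += size
        if chunk in lexicon:
            lexicon[chunk] += gain                  # recognized: strengthen it
        else:
            lexicon[chunk] = init                   # new: add it to the lexicon
        for unit in list(lexicon):                  # forgetting: every entry decays
            lexicon[unit] -= decay
            if lexicon[unit] <= 0:
                del lexicon[unit]
    return lexicon

# An artificial language of six trisyllabic words repeated in random order,
# with no pauses between them (cf. Saffran, Newport, & Aslin, 1996).
random.seed(0)
words = ["tupiro", "golabu", "bidaku", "padoti", "kimano", "rofetu"]
stream = []
for _ in range(600):
    w = random.choice(words)
    stream.extend([w[i:i + 2] for i in range(0, 6, 2)])    # split into syllables

lexicon = parser_sketch(stream)
three_syllable = {u: wt for u, wt in lexicon.items() if len(u) == 6}
for unit, weight in sorted(three_syllable.items(), key=lambda kv: -kv[1])[:8]:
    print(f"{unit:10s}{weight:8.1f}")
# The six words end up with weights far above any three-syllable chunk that
# straddles a word boundary; boundary-straddling chunks keep decaying away.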

DR (Distributional Regularity; Brent & Cartwright, 1996) is not an online word identification model. Like PARSER, it is a model testing hypotheses about how language learners segment speech into words even though most words are unfamiliar to them. The DR model is a typical statistical learning model: it first enumerates all possible segmentations of the input, computes a value for each segmentation according to some function, and then finds the minimum or maximum (depending on the function) of all the values. The segmentation associated with that value is chosen as the output. The function Brent and Cartwright developed, called the DR function, has four components: (a) the number of word types in the segmentation, (b) the number of word tokens in the segmentation, (c) the sum of the lengths (measured in phonemes) of the words in the lexicon, and (d) the entropy of the relative frequencies of the words. The segmentation algorithm favors segmentations with fewer word types over those with more, segmentations with fewer tokens over those with more, segmentations with shorter words over those with longer ones, and segmentations in which a few words account for most of the frequency over those in which the frequency is divided more evenly. The inputs were phonetic transcripts of spontaneous child-directed English. The DR function reached an accuracy of 41.3%, while the baseline function reached only 13.4%. (The baseline randomly segmented the input into the correct number of words.)
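
The general recipe, enumerating every segmentation, scoring each one, and keeping the best, can be sketched as follows. The cost function below simply adds the four ingredients named above with equal weight; it is an illustrative simplification rather than Brent and Cartwright's actual DR function, and the toy input string (orthographic rather than phonetic) is likewise an assumption.

from itertools import combinations
from math import log2

def segmentations(s):
    # Yield every way of cutting s into contiguous words.
    n = len(s)
    for cuts in range(n):                           # number of internal boundaries
        for points in combinations(range(1, n), cuts):
            bounds = (0,) + points + (n,)
            yield [s[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def dr_like_cost(words):
    types = set(words)
    n_types = len(types)                            # (a) number of word types
    n_tokens = len(words)                           # (b) number of word tokens
    lexicon_size = sum(len(w) for w in types)       # (c) summed word lengths in the lexicon
    freqs = [words.count(w) / n_tokens for w in types]
    entropy = -sum(p * log2(p) for p in freqs)      # (d) entropy of relative frequencies
    return n_types + n_tokens + lexicon_size + entropy

best = min(segmentations("thedogtheratthedog"), key=dr_like_cost)
print(best)
# -> ['thedog', 'therat', 'thedog'] under this equal-weight toy cost: reusing
# the repeated chunk beats both leaving the string whole and cutting it finer.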

INCDROP (INCremental DR Optimization) (Brent, 1997; Dahan & Brent, 1999) is an extension of DR. Whereas DR was originally intended to model language learners' word discovery process, INCDROP models adults' online word segmentation and identification by segmenting the input incrementally. In addition, new words in the selected segmentation are stored in the lexicon, and the frequencies of familiar words in the selected segmentation are updated.
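
A minimal sketch of this incremental scheme, with assumed costs rather than Brent's actual function, is given below: familiar words are charged by the negative logarithm of their relative frequency, unfamiliar words are charged an assumed price per character, and the words of the winning segmentation are then added to the lexicon or have their counts incremented.

from itertools import combinations
from math import log2

def segmentations(s):
    # Same helper as in the previous sketch.
    n = len(s)
    for cuts in range(n):
        for points in combinations(range(1, n), cuts):
            bounds = (0,) + points + (n,)
            yield [s[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def segment_cost(words, lexicon, total):
    cost = 0.0
    for w in words:
        if w in lexicon:
            cost += 1.0 - log2(lexicon[w] / total)  # familiar: cheap if frequent
        else:
            cost += 1.0 + 4.0 * len(w)              # new: assumed per-character price
    return cost

def incdrop_sketch(utterances):
    lexicon, total = {}, 1                          # token count starts at 1 (crude smoothing)
    for utt in utterances:
        best = min(segmentations(utt), key=lambda ws: segment_cost(ws, lexicon, total))
        for w in best:                              # store new words, update frequencies
            lexicon[w] = lexicon.get(w, 0) + 1
            total += 1
        print(utt, "->", best)
    return lexicon

incdrop_sketch(["thedog", "therat", "thedogandtherat"])
# The first two utterances are stored whole; the third is then segmented as
# ['thedog', 'and', 'therat'], identifying the two familiar words and
# discovering 'and' as a new one.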

Speech Perception Versus Reading

Although the absence of word boundaries poses a very similar word identification problem in both speech perception and reading, reading seems to be in a more advantageous position for solving it. First, speech is input through a strictly serial channel, whereas visual information within the perceptual span is available in parallel. Second, the speech signal is much noisier than that of written language. Third, the processing of speech is unidirectional, whereas visual scanning in reading can be bidirectional. Finally, in Chinese especially, the written language is less ambiguous with respect to word boundaries, because many characters are homophonic and the written form disambiguates homophones.

Nevertheless, the nature of the word recognition process (the identification of words in the mental lexicon) is still the same. It is now generally believed that when a word is encountered, its constituent units (letters, phonemes, etc.) activate the words in the mental lexicon that contain them. The activated words then compete to provide a unique account of the input. In speech perception, segmentation occurs as a by-product of such competition if more than one word eventually wins out (e.g., cat/log). When a single long word is composed of shorter words (e.g., catalog vs. cat a log), the longest word usually wins the competition. Since the nature of the word identification problem and the basic word recognition mechanism are the same, it seems reasonable to assume that the word identification process is also similar in speech perception and reading.

Characteristics of Word Boundary Ambiguity

Tokenization is a technical term in natural language processing referring to the process of dividing the input into distinct tokens: words and punctuation marks. Computer scientists working on practical text processing systems have found that a lexicon is necessary to solve the tokenization problem effectively. The basic approach involves two steps. First, words in the text are recognized by matching against a lexicon. This greatly reduces the degree of ambiguity by ruling out tokenizations containing improbable words. Second, disambiguation heuristics are applied to resolve the remaining ambiguity (K.-J. Chen & Liu, 1992; Chiang, Chang, Lin, & Su, 1996; Fan & Tsai, 1988; N. Liang, 1990; Mo, Yang, Chen, & Huang, 1996; Sproat & Shih, 1990; Sproat, Shih, Gale, & Chang, 1996; Yeh & Lee, 1991; see Webster & Kit, 1992, for an overview). The proposed heuristics are mostly exploratory, making use of all kinds of information ranging from statistics to syntax, and are usually aimed at producing correct results without considering the efficiency of operation.

One of the simplest and most frequently used heuristics is forward maximum matching. It scans the text from left to right, first finding the longest word in the lexicon that matches at the beginning of the text. It then works recursively, finding the longest word after each identified word until it reaches the end of the sentence. This is a surprisingly simple yet powerful heuristic: in general, it can achieve a recall rate (the percentage of words correctly identified) of 90% or higher in Chinese. As we have seen in Section 6.1, human listeners and computational models of spoken word identification also tend to favor longer words over embedded shorter ones. Thus the maximum matching principle (as distinct from the particular implementation of forward maximum matching) appears to have a certain degree of psychological plausibility.
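
The following sketch shows forward maximum matching on toy strings. The small lexicon of English letter strings stands in for a Chinese dictionary, and single characters serve as the fallback when no longer word matches.

def forward_maximum_matching(text, lexicon):
    max_len = max(map(len, lexicon))
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon word
        # (or a single character, the fallback) is found.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words

lexicon = {"the", "blue", "print", "blueprint", "fun", "fund", "and", "ruff", "dandruff"}
print(forward_maximum_matching("theblueprint", lexicon))   # ['the', 'blueprint']
print(forward_maximum_matching("fundandruff", lexicon))    # ['fund', 'and', 'ruff']
# For the ambiguous second string, the heuristic commits to one reading
# (fund and ruff) and never considers the alternative fun dandruff.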

K.-J. Chen and Liu's (1992) maximum matching heuristic is a variant of the maximum matching principle. It also scans the text from left to right, but it looks ahead three words instead of one. All possible three-word substrings starting at the current position are generated. If there is ambiguity and only one combination has the longest total length (i.e., there are no ties), the first word of that longest three-word substring is declared the correct identification. The procedure then continues recursively from the first character after the identified word. Chen and Liu reported that this simple heuristic correctly resolved 93% of the ambiguities. When there are ties, a number of additional heuristics are applied to resolve the ambiguity.
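
The sketch below illustrates the three-word lookahead idea in simplified form. It omits Chen and Liu's additional tie-breaking heuristics, and the toy lexicon of English letter strings is again an assumption standing in for a Chinese dictionary.

def matches_at(text, i, lexicon):
    # All lexicon words starting at position i, plus the single-character fallback.
    out = [text[i:i + 1]]
    out += [text[i:i + n] for n in range(2, len(text) - i + 1) if text[i:i + n] in lexicon]
    return out

def chunks(text, i, lexicon, depth=3):
    # All sequences of up to `depth` words starting at position i.
    if depth == 0 or i >= len(text):
        return [[]]
    result = []
    for w in matches_at(text, i, lexicon):
        for rest in chunks(text, i + len(w), lexicon, depth - 1):
            result.append([w] + rest)
    return result

def three_word_mm(text, lexicon):
    words, i = [], 0
    while i < len(text):
        cands = chunks(text, i, lexicon)
        best_len = max(sum(len(w) for w in c) for c in cands)
        best = [c for c in cands if sum(len(w) for w in c) == best_len]
        # When several chunks tie on total length, Chen and Liu apply further
        # heuristics; this sketch simply takes the first.
        words.append(best[0][0])
        i += len(best[0][0])
    return words

lexicon = {"on", "one", "edge", "of", "the", "table"}
print(three_word_mm("onedgeofthetable", lexicon))
# -> ['on', 'edge', 'of', 'the', 'table']; plain forward maximum matching
# would commit to "one" first and garble the rest of the string.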

Despite the vast number of heuristics proposed, the nature of the word identification problem is poorly understood. Virtually all researchers have proceeded to present the heuristics they invented and report their results without first investigating the nature of the problem. This is partly because many of them were doing practical research, in which formal models were not important. Only very recently has a formal, mathematical analysis of the nature and properties of word identification ambiguities in Chinese been developed (Guo, 1997); it provides a mathematical description of the commonly adopted principle of maximum matching.

Guo (1997) classified word identification ambiguities into two categories. (a) Hidden (conjunctive) ambiguity: a tokenization has hidden ambiguity if some of its words can be further decomposed into strings of words; for example, in the string theblueprint, the word blueprint can be decomposed into blue print. Such ambiguities are called hidden because other tokenizations cover them. "Cover" is used as a mathematical term in Guo's article; in plain language, tokenization A covers tokenization B if B can be derived from A by inserting word boundaries. (b) Critical (disjunctive) ambiguity: a tokenization has critical ambiguity if there exists another tokenization such that neither covers the other. For example, the string fundandruff has critical ambiguity, because the two tokenizations fund and ruff and fun dandruff do not cover each other. Guo also introduced the concept of the critical point: critical points are all and only the unambiguous boundaries. Critical fragments are the substrings separated by critical points. For example, in the string theblueprint, besides the beginning and end of the string, there is only one critical point, the boundary between the and blueprint.

In Guo's analysis, all characters were also listed as single-character words in the dictionary to avoid ill-formed tokenizations. As a result, any critical fragment longer than one character must have hidden ambiguity, critical ambiguity, or both. It is important to stress that Guo's analysis is purely mathematical and should not be taken as a psychological or linguistic assumption.
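
These definitions can be made concrete with a short sketch. Following the characterization above, a position is treated as a critical point when no multi-character dictionary word spans it (single-character words never span a boundary), and critical fragments are the substrings between consecutive critical points; the toy lexicon is an assumption.

def boundary_positions(tokenization):
    # Cumulative end positions of the words in a tokenization.
    positions, total = set(), 0
    for word in tokenization:
        total += len(word)
        positions.add(total)
    return positions

def covers(a, b):
    # Tokenization a covers b if b can be derived from a by inserting boundaries.
    return boundary_positions(a) <= boundary_positions(b)

def critical_points(text, lexicon):
    # Positions (0..len) spanned by no multi-character dictionary word.
    spanned = set()
    for i in range(len(text)):
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in lexicon:
                spanned.update(range(i + 1, j))     # interior positions of the match
    return [p for p in range(len(text) + 1) if p not in spanned]

def critical_fragments(text, lexicon):
    points = critical_points(text, lexicon)
    return [text[points[k]:points[k + 1]] for k in range(len(points) - 1)]

lexicon = {"the", "blue", "print", "blueprint", "fun", "fund", "and", "ruff", "dandruff"}
print(covers(["blueprint"], ["blue", "print"]))              # True: hidden ambiguity
print(covers(["fund", "and", "ruff"], ["fun", "dandruff"]))  # False: neither covers the other
print(critical_fragments("theblueprint", lexicon))           # ['the', 'blueprint']
print(critical_fragments("fundandruff", lexicon))            # ['fundandruff']
# theblueprint has one internal critical point, between the and blueprint;
# fundandruff is a single critical fragment with a critical ambiguity.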

One of Guo's most striking empirical findings was that, when tested on a large Chinese corpus (Guo, 1993), 98% of the critical fragments generated were by themselves the desired words (that is, the words appropriate to the meaning of the text). In other words, about 98% accuracy can be achieved efficiently without disambiguation (Guo, 1997); only about 2% of the critical fragments have critical ambiguities that require further disambiguation.

