Tsai, C.-H. (2001). Word identification and eye movements in reading Chinese: A modeling approach. Doctoral dissertation, University of Illinois at Urbana-Champaign.
By systematically analyzing a large, representative Chinese corpus, the first part of this two-part study replicated Guo's (1997) finding that unambiguous word boundaries exist and that ambiguities can be contained in critical fragments. Most (93%) of the critical fragments are desired words by themselves, and those correctly identified words account for 88% of total word tokens.
Such findings demonstrate the effectiveness of the word-length based maximum matching principle. A generalized maximum matching (GMM) heuristic was derived, and it was found to be quite effective in disambiguation. Two supportive heuristics for GMM, the lexical competition based AWF and the information based MI, were also found to be quite effective. The implication is that words can be identified with simple heuristics that use only low-level information (i.e., without linguistic input from higher levels), such as word length, word frequency, and the mutual information between characters and word boundaries. The combination of GMM+AWF is particularly interesting, because it approximates the competition of words in an interactive-activation model of lexical organization and word identification (McClelland & Elman, 1986; McClelland & Rumelhart, 1981; Norris, 1994; Rumelhart & McClelland, 1982). In other words, word identification in reading Chinese could be achieved in a manner very similar to word identification in the auditory channel.
In the second part of the study, the findings of Part 1 were used to construct a model of word identification and eye movements in reading Chinese text. The perceptual constraints (the perceptual span and foveal/parafoveal vision) were implemented. An uncertainty-driven heuristic of eye movement control was also devised to instruct the model to take the next input based on the result of word identification in the current window.
The model was able to identify nearly 94% of word tokens correctly with a perceptual span similar in size to that of human readers. The saccade length distribution produced by the model was a function of two factors: (a) the distribution of word lengths, and (b) the size of the perceptual span. The majority of word tokens are either one- or two-character words. Such a distribution would cause the uncertain region (i.e., the last full or partial word) of a window of any size to be most frequently one or two characters long. For example, for a four-character window, the word-identified strings "A/B/C/D", "A/BC/D", "AB/C/D" have a one-character last word, while the word-identified strings "A/B/CD" and "AB/CD" have a two-character last word. Since uncertainties are usually associated with short words, as those words frequently occur as the beginnings of longer words, a model with a fixed-size, four-character window would frequently make two- or three-character saccades, which is exactly what has been observed. With a three-character window, the word-identified strings "A/B/C" and "AB/C" have a one-character last word, while the word-identified string "A/BC" has a two-character last word. The actual distribution of saccade lengths the model produced indicates that one-character last words are more common than two-character ones for three-character windows.
The distributions of human data lie somewhere between those produced by models with fixed-size windows of three and four characters. By allowing the model to have a variable-size window, the model was able to produce a distribution of saccade lengths very similar to that of human readers. Therefore, it takes more than a fixed-size window and a distribution of word lengths to produce a distribution of saccades similar to that of human readers. Variability in window sizes (due to variation in the probability of identifying a character in the parafovea, resulting from character complexity differences) is critical for the model to produce a human-like distribution.
Using the artificial fixation points of the model to compute fixation-based indices commonly adopted in reporting human data, the model was able to produce distributions of skipping rates and landing positions very similar (albeit not identical) to those of human readers. The model was also able to produce effects of (fixated) word frequency on fixation and gaze durations, despite the fact that the fixation durations were computed on the basis of all words in the window, not just the fixated word. These findings suggest that in reading Chinese, the eyes are not necessarily targeted on words. Eye movement control may be more similar to what the model does, which is more relaxed. That is, the eyes move in order to take input from a different location, rather than to send the fixation to a specific target.
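The window examples above can be checked by brute-force enumeration. The following is a minimal sketch, not the model itself: it assumes that only one- and two-character words occur, that every segmentation is equally possible (the real frequencies differ), and that the saccade jumps to the start of the uncertain last word. The function names are hypothetical.

```python
def segmentations(window, max_word_len=2):
    """Yield every way to split `window` into words of length 1..max_word_len."""
    if not window:
        yield []
        return
    for n in range(1, min(max_word_len, len(window)) + 1):
        for rest in segmentations(window[n:], max_word_len):
            yield [window[:n]] + rest


def last_word_lengths(window, max_word_len=2):
    """Group the segmentations of `window` by the length of their last word."""
    groups = {}
    for seg in segmentations(window, max_word_len):
        groups.setdefault(len(seg[-1]), []).append("/".join(seg))
    return groups


for window in ("ABCD", "ABC"):
    for length, segs in sorted(last_word_lengths(window).items()):
        # Assumed saccade rule: move to the start of the uncertain last word.
        saccade = len(window) - length
        print(window, "last word:", length, "saccade:", saccade, sorted(segs))
```

Under these assumptions a four-character window yields saccades of three characters (one-character last word) or two characters (two-character last word), which is the pattern described above.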
As mentioned earlier, there are two dimensions of word identification: (a) the identification of words in the mental lexicon, and (b) the identification of words in the text. Traditional lexical processing research has focused more on the former, while this study focuses on the latter. The two dimensions, although distinct, are nevertheless tightly related. More specifically, the automatic word identification in the mental lexicon and the competition among lexical entries drive the identification of words in the text.
This study has demonstrated the relationship between the two dimensions of word identification quite clearly, and this demonstration has important theoretical implications. Ultimately, a model of eye movements and reading in Chinese needs to be developed to consolidate our knowledge of Chinese reading and to guide further research. Without a good understanding of both dimensions, as well as the relationship between them, an adequate model of eye movements and word identification during reading cannot be developed.
The lack of physical word boundaries in Chinese text has led most past psycholinguistic inquiries to make one of the following two assumptions: (a) the identification of words in text is difficult (therefore reading will be facilitated if word boundaries are provided); (b) words are not necessarily the reading or perception units in reading Chinese.
Both assumptions ignore the automatic aspect of word identification in the mental lexicon. By binding the two dimensions of word identification together, this study has found that the identification of words in text is usually not difficult, given a set of simple heuristics that approximates the lexical competition process. It was also found in this study that a word-based model of eye movement control could capture many characteristics of reading eye movement data, thus supporting the position that words are the reading units.
The phenomenon that native speakers of Chinese do not have a clear and intuitive concept of what a word is has raised the question of the psychological reality of the "word" in Chinese. This study has not assessed this question directly. Rather, this study has maintained that words are psychologically real, and it is our inability to explicitly access implicit knowledge, rather than the psychological reality of the word, that has blurred the notion of the "word" in Chinese.
Such an assumption is derived in part from the finding that the word unit is linguistically universal, and the word in Chinese is no exception (Packard, 2000). In fact, even if the present study had not made such an assumption, the findings could still have served as supporting evidence for the psychological reality of the "word" in Chinese. Specifically, this study has demonstrated that the distribution of saccade lengths of readers can be reproduced by a model that reads words with psychologically plausible processing characteristics. Furthermore, such a finding is robust in that it is quite resistant to change in the criteria of the "word". Different criteria may create slightly different distributions of word lengths, but most of those distributions will be very similar to the one shown in Table A1. As a result, one must see the forest beyond the trees to realize that both the forest and the trees are real. The seeming fuzziness of the notion of "word" in Chinese vanishes, and the word comes into sharp focus when it is viewed from an appropriate perspective.
The systematic analyses in Part 1 provide us with a substantial understanding of the nature of word identification ambiguity in Chinese text, and the modeling analyses in Part 2 further provide a psychologically plausible model of word identification and eye movements in reading Chinese. Both the systematic analyses in Part 1 and the modeling analyses in Part 2 are the first attempts in the field. The success of this study indicates the usefulness of the methodology and approach adopted in this study. Since this is an initial work, the progress made by this study is limited. However, its limitations also indicate its potential for further expansion. In the following sections I discuss the major findings and limitations of this study and the directions for future research in the context of these findings and limitations.
In normal reading, readers encounter words not listed in their mental lexicons from time to time. These may include newly formed words and personal and proper names. With the aid of morphology, adult readers can recognize these words with little difficulty. The problem of unknown words in this study is solved by creating a lexicon that lists all words. By doing so, the involvement of morphology becomes unnecessary. Such simplification may be acceptable in a preliminary, analytical approach to modeling such as Part 2, but eventually morphology must be added if the model is to be further developed.
Despite the model's lack of morphology, findings from this study have an important implication for the role of morphology in word identification: morphology is not necessary in identifying the majority of words. Our results indicate that the automatic activation of words and the competition of words in the mental lexicon can resolve most ambiguities effectively and efficiently. In other words, a rule-based (morphological rules) approach to word identification appears to be unnecessary except for unknown words. In fact, it is not just unnecessary, but probably also infeasible. A rule-based approach means words are identified as a result of pre-lexical parsing. Since the set of morphology and the set of syntax are not mutually exclusive, and many characters also represent free morphemes, the parsing operation would be extremely difficult. Packard (1999) also argued that a real-time morpheme combination algorithm is improbable. In addition, he argued that even a morpheme-indexed lexical access that identifies a morpheme as the first step in searching a list of precompiled lexical entries is improbable. This is easy to understand, since if a word can be looked up as a whole, there seems to be little benefit in taking the much slower, indirect route. Consequently, morphology seems to play a minor role in identifying known words, though it is definitely necessary in identifying unknown words.
The lack of word boundaries in Chinese text has made it difficult to study sentence parsing. For example, Lee (1995) presented a few Chinese garden path sentences. However, careful inspection of his sentences reveals that many of these sentences are not true garden path sentences at all. For example, the following are two tokenizations of the same sentence:
Zhe jige nianqingren ai shang mingxing de dang
Zhe jige nianqingren aishang mingxing de dang
"Zhe jige nianqingren" means "these few young men". The first sentence is the one with correct tokenization in which "ai" (love) is the main verb and "shang mingxing de dang" (to be fooled by celebrities) is a complement clause. The second tokenization is still syntactically well formed, but is semantically ill formed: "aishang" (fall in love with) "mingxing de dang" (celebrities' tricks).
Despite the fact that this is not a true garden path sentence, it is a good example of the complicated interaction between syntax, morphology, and word identification. Parsing ambiguities resulting from errors in tokenization only occur in Chinese. As a result, they provide a unique opportunity for Chinese psychologists and psycholinguists to explore the nature of language processing. The model developed in this study does not receive input from the morphological or syntactic level, but it still has much explanatory power. In the above example, it is obvious that the ambiguity between "aishang" and "ai shang" is hidden, so readers may already have the tendency to identify "aishang" because the corresponding lexical entry gets activated automatically. Although a rule-based explanation, such as late closure (Frazier, 1979), is also possible, the word identification explanation is nevertheless more parsimonious and more plausible.
The word identification heuristics in this model (GMM+AWF/MI) can be viewed as an imprecise approximation of the more dynamic and distributional lexical competition process. This study has demonstrated that such a competition can indeed resolve most word identification ambiguities. Naturally, the logical next step is to implement the real competition process in a connectionist network. Since the Chinese writing system has thousands of characters and hundreds of thousands of words, the representations for characters and words must be highly distributed; otherwise the network would require too much computational power to run. Another low-cost solution is to build smaller in-context networks (Norris, 1994). For example, if four characters are seen in some fixation, a small network is dynamically constructed. The network has four input units, each of which represents a character in the perceptual span. Words associated with these characters would constitute the output units. When the process is finished, the network is destroyed. A new network is built when the new input is taken.
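The pull toward "aishang" can be illustrated with plain forward maximum matching. This is only a sketch of the simple word-length principle, not the GMM+AWF/MI heuristic of this study. The lexicon is a hypothetical toy list, and each romanized syllable stands in for one character.

```python
def forward_maximum_matching(syllables, lexicon, max_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            # Fall back to a single unit if nothing in the lexicon matches.
            if candidate in lexicon or n == 1:
                words.append(candidate)
                i += n
                break
    return words


# Toy lexicon (hypothetical, just enough to cover the example sentence).
LEXICON = {"zhe", "jige", "nianqingren", "ai", "shang", "aishang",
           "mingxing", "de", "dang"}
SYLLABLES = ["zhe", "ji", "ge", "nian", "qing", "ren",
             "ai", "shang", "ming", "xing", "de", "dang"]

print(" ".join(forward_maximum_matching(SYLLABLES, LEXICON)))
```

With this toy lexicon the greedy match produces the second tokenization, because the longer entry "aishang" wins over "ai" followed by "shang".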
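The construction of such a disposable network might look like the following sketch. It covers only the set-up step and leaves out any activation dynamics. It assumes that "words associated with these characters" means lexicon words spanning contiguous characters inside the window, and that word units sharing a character position compete. The function name, lexicon, and window are hypothetical.

```python
from itertools import combinations


def build_in_context_network(window, lexicon, max_len=4):
    """Build a throwaway network for one fixation.

    Input units are the character positions in `window`; output units are
    lexicon words that span contiguous positions. Returns the output units
    and the pairs of units that overlap (and would therefore compete).
    """
    units = []  # each unit: (word, start, end) with end exclusive
    for start in range(len(window)):
        for end in range(start + 1, min(start + max_len, len(window)) + 1):
            word = "".join(window[start:end])
            if word in lexicon:
                units.append((word, start, end))
    # Assumption: two word units compete when they claim a shared position.
    rivals = [(a, b) for a, b in combinations(units, 2)
              if a[1] < b[2] and b[1] < a[2]]
    return units, rivals


# Hypothetical four-character window, written as romanized syllables.
window = ["ai", "shang", "ming", "xing"]
lexicon = {"ai", "shang", "aishang", "mingxing"}
units, rivals = build_in_context_network(window, lexicon)
print(units)
print(rivals)
# The network is simply discarded before the next fixation builds a new one.
```

Because each network only holds the handful of words compatible with the current window, its size stays small regardless of how large the full lexicon is.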
In this study, characters are treated as unanalyzed visual patterns, and the only factor that affects their identification is the number of strokes. In reality, characters are not just visual patterns of random strokes. They are highly structured. Some character components carry phonological information, while some carry semantic information. Their role may be a minor one in resolving word identification ambiguities, because the competition is at a higher level. However, their visual structure may well affect the identification process, especially when they are seen in the parafovea. Representing the structure of characters would be crucial if a finer-grained model of eye movement control is to be developed.
This study assumes a rather modular eye movement control mechanism. The words are identified first, and a saccade is planned based on the result of word identification. Although many models of eye movements in reading English also make a similar assumption (e.g., Legge et al., 1997; Reichle et al., 1998), the large variability in human eye movement data suggests that word identification and saccade planning may not be two completely independent stages, and that the link between word identification and saccade planning may not be so direct.
Ever since Hung and Tzeng's (1981) seminal review on the cognitive processes of reading different orthographies (especially Chinese), there have been numerous studies on the psychology of reading in Chinese (see Hoosain [1991] and Wang, Inhoff, and Chen [1999], for two milestone reviews). When Hung and Tzeng wrote their review, reading research had regained the interest of psychologists for less than ten years (before that, it had long been suppressed by the dominance of behaviorism). Twenty years ago, English and Chinese reading research were at about the same preliminary stage of discovery: automatic phonological recoding in word recognition, word frequency effects, and so on. Today, we have witnessed significant progress in English reading research, but very little progress in Chinese reading research. Most researchers are still working on issues regarding the processing of isolated characters or words, and it is very rare to find a researcher who is interested in the more interesting, and more complicated, "real" reading process. In addition to its unique orthography, the lack of physical word boundaries is perhaps the most salient characteristic of the Chinese writing system. Liu et al. (1974) had this vision as far back as the very beginning of the psychology of reading in Chinese. However, in the past thirty years there has been virtually no advance in our knowledge about this topic.
This study has set a new direction and established a new research framework in the psychological studies of reading in Chinese. It is hoped that this study will help trigger a new wave of research and pave the way for further research into the various complicated processes involved in online reading.
© Copyright by Chih-Hao Tsai, 2001