Tsai, C.-H. (2001). Word identification and eye movements in reading Chinese: A modeling approach. Doctoral dissertation, University of Illinois at Urbana-Champaign.
By systematically analyzing a large, representative Chinese corpus, the first part of this two-part study replicated Guo's (1997) finding that there exist unambiguous word boundaries, and ambiguities can be contained in critical fragments. Most (93%) of the critical fragments are desired words by themselves, and those correctly identified words account for 88% of total word tokens.
These findings demonstrate the effectiveness of the word-length based maximum matching principle. A generalized maximum matching (GMM) heuristic was derived, and it was found to be quite effective in disambiguation. Two supportive heuristics for GMM, the lexical competition based AWF and the information based MI, were also found to be quite effective. The implication is that words can be identified with simple heuristics that use only low-level information (i.e., without linguistic input from higher levels), such as word length, word frequency, and the mutual information between characters and word boundaries. The combination of GMM+AWF is particularly interesting, because it approximates the competition of words in an interactive-activation model of lexical organization and word identification (McClelland & Elman, 1986; McClelland & Rumelhart, 1981; Norris, 1994; Rumelhart & McClelland, 1982). In other words, word identification in reading Chinese could be achieved in a manner very similar to word identification in the auditory channel.
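As a rough illustration of the word-length based principle mentioned above, the following sketch implements plain forward maximum matching: at each position it takes the longest lexicon entry that fits. This is not the GMM heuristic of this study, nor AWF or MI; the function name, the `max_len` parameter, and the toy lexicon of Latin letters are all assumptions made for illustration only.

```python
def forward_maximum_match(text, lexicon, max_len=4):
    """Greedy longest-match segmentation (illustrative only).

    Characters that start no lexicon entry are emitted as one-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical lexicon: "AB" and "ABC" compete for the same starting position.
toy_lexicon = {"AB", "ABC", "CD", "DE"}
print("/".join(forward_maximum_match("ABCDE", toy_lexicon)))  # ABC/DE
```

The longer entry "ABC" wins over "AB" purely on length, which is the kind of low-level information the paragraph above refers to.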
In the second part of the study, the findings of Part 1 were used to construct a model of word identification and eye movements in reading Chinese text. The perceptual constraints (the perceptual span and foveal/parafoveal vision) were implemented. An uncertainty-driven heuristic of eye movement control was also devised to instruct the model to take the next input based on the result of word identification in the current window.
The model was able to identify nearly 94% of word tokens correctly with a perceptual span similar in size to that of human readers. The saccade length distribution produced by the model was a function of two factors: (a) the distribution of word lengths, and (b) the size of the perceptual span. The majority of word tokens are either one- or two-character words. Such a distribution causes the uncertain region (i.e., the last full or partial word) of a window of any size to be most frequently one or two characters long. For example, for a four-character window, the word-identified strings "A/B/C/D", "A/BC/D", and "AB/C/D" have a one-character last word, while the word-identified strings "A/B/CD" and "AB/CD" have a two-character last word. Since uncertainties are usually associated with short words, as those words frequently occur as the beginnings of longer words, a model with a fixed-size, four-character window would frequently make two- or three-character saccades, which is exactly what has been observed. With a three-character window, the word-identified strings "A/B/C" and "AB/C" have a one-character last word, while the word-identified string "A/BC" has a two-character last word. The actual distribution of saccade lengths the model produced indicates that one-character last words are more common than two-character ones for three-character windows.
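The enumeration in the preceding paragraph can be reproduced with a short sketch. It lists every way to fill a window with one- and two-character words and tallies the length of the last word. Two points are assumptions rather than details of the model: longer words are ignored (as in the examples above), and the saccade length is taken to be the window size minus the length of the uncertain last word, which is inferred from the saccade lengths quoted in the text. The tallies count segmentation patterns, not corpus frequencies.

```python
from collections import Counter


def segmentations(n, max_word_len=2):
    """Yield every split of n character positions into words of length 1..max_word_len."""
    if n == 0:
        yield []
        return
    for first in range(1, min(max_word_len, n) + 1):
        for rest in segmentations(n - first, max_word_len):
            yield [first] + rest


def last_word_counts(window):
    """Count segmentation patterns of a window by the length of their last word."""
    return Counter(seg[-1] for seg in segmentations(window))


for window in (3, 4):
    counts = last_word_counts(window)
    for last_len in sorted(counts):
        # Assumed rule: the eyes advance past everything except the uncertain last word.
        saccade = window - last_len
        print(f"window={window} last_word={last_len} "
              f"patterns={counts[last_len]} implied_saccade={saccade}")
```

For a four-character window this yields three patterns with a one-character last word and two with a two-character last word, matching the strings listed above.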
The distributions of human data lie somewhere between those produced by models with fixed-size windows of three and four characters. By allowing the model to have a variable-size window, the model was able to produce a distribution of saccade lengths very similar to that of human readers. Therefore, it takes more than a fixed-size window and a distribution of word lengths to produce a distribution of saccades similar to that of human readers. Variability in window size (due to variation in the probability of identifying a character in the parafovea, resulting from differences in character complexity) is critical for the model to produce a human-like distribution.
Using the artificial fixation points of the model to compute fixation-based indices commonly adopted in reporting human data, the model was able to produce distributions of skipping rates and landing positions very similar (albeit not identical) to those of human readers. The model was also able to produce effects of (fixated) word frequency on fixation and gaze durations, despite the fact that the fixation durations were computed on the basis of all words in the window, not just the fixated word. These findings suggest that in reading Chinese, the eyes are not necessarily targeted at words. Eye movement control may be more similar to what the model does, which is more relaxed. That is, the eyes move in order to take input from a different location, rather than to send the fixation to a specific target.
As mentioned earlier, there are two dimensions of word identification: (a) the identification of words in the mental lexicon, and (b) the identification of words in the text. Traditional lexical processing research has focused more on the former, while this study focuses on the latter. The two dimensions, although distinct, are nevertheless tightly related. More specifically, the automatic word identification in the mental lexicon and the competition among lexical entries drive the identification of words in the text.
This study has demonstrated the relationship between the two dimensions of word identification quite clearly, and this demonstration has important theoretical implications. Ultimately, a model of eye movements and reading in Chinese needs to be developed to consolidate our knowledge of Chinese reading and to guide further research. Without a good understanding of both dimensions, as well as the relationship between them, an adequate model of eye movements and word identification during reading cannot be developed.
The lack of physical word boundaries in Chinese text has led most past psycholinguistic inquiries to make one of the following two assumptions: (a) the identification of words in text is difficult (therefore reading will be facilitated if word boundaries are provided); (b) words are not necessarily the reading or perception units in reading Chinese.
Both assumptions ignore the automatic aspect of word identification in the mental lexicon. By binding the two dimensions of word identification together, this study has found that the identification of words in text is usually not difficult, given a set of simple heuristics that approximates the lexical competition process. It was also found that a word-based model of eye movement control could capture many characteristics of reading eye movement data, thus supporting the position that words are the reading units.
The phenomenon that native speakers of Chinese do not have a clear and intuitive concept of what a word is has raised the question of the psychological reality of the "word" in Chinese. This study has not assessed this question directly. Rather, this study has maintained that words are psychologically real, and it is our inability to explicitly access implicit knowledge, rather than the psychological reality of the word, that has blurred the notion of the "word" in Chinese.
Such an assumption is derived in part from the finding that the word unit is linguistically universal, and the word in Chinese is no exception (Packard, 2000). In fact, even if the present study had not made such an assumption, the findings could still have served as supporting evidence for the psychological reality of the "word" in Chinese. Specifically, this study has demonstrated that the distribution of saccade lengths of readers can be reproduced by a model that reads words with psychologically plausible processing characteristics. Furthermore, such a finding is robust in that it is quite resistant to changes in the criteria for the "word". Different criteria may create slightly different distributions of word lengths, but most of those distributions will be very similar to the one shown in Table A1. As a result, one must see the forest beyond the trees to realize that both the forest and the trees are real. The seeming fuzziness of the notion of "word" in Chinese vanishes, and the word comes into sharp focus when it is viewed from an appropriate perspective.
The systematic analyses in Part 1 provide us with a substantial understanding of the nature of word identification ambiguity in Chinese text, and the modeling analyses in Part 2 further provide a psychologically plausible model of word identification and eye movements in reading Chinese. Both the systematic analyses in Part 1 and the modeling analyses in Part 2 are the first attempts in the field. The success of this study indicates the usefulness of its methodology and approach. Since this is initial work, the progress it has made is limited. However, its limitations also indicate its potential for further expansion. In the following sections I discuss the major findings and limitations of this study and the directions for future research in the context of these findings and limitations.
In normal reading, readers encounter words not listed in their mental lexicons from time to time. These may include newly formed words and personal and proper names. With the aid of morphology, adult readers can recognize these words with little difficulty. The problem of unknown words in this study is solved by creating a lexicon that lists all words. By doing so, the involvement of morphology becomes unnecessary. Such simplification may be acceptable in a preliminary, analytical approach to modeling such as Part 2, but eventually morphology must be added if the model is to be further developed.
Despite the model's lack of morphology, findings from this study have an important implication for the role of morphology in word identification: morphology is not necessary for identifying the majority of words. Our results indicate that the automatic activation of words and the competition of words in the mental lexicon can resolve most ambiguities effectively and efficiently. In other words, a rule-based approach to word identification (one based on morphological rules) appears to be unnecessary except for unknown words. In fact, it is not just unnecessary, but probably also infeasible. A rule-based approach means words are identified as a result of pre-lexical parsing. Since morphology and syntax are not mutually exclusive, and many characters also represent free morphemes, the parsing operation would be extremely difficult. Packard (1999) also argued that a real-time morpheme combination algorithm is improbable. He further argued that even a morpheme-indexed lexical access that identifies a morpheme as the first step in searching a list of precompiled lexical entries is improbable. This is easy to understand: if a word can be looked up as a whole, there seems to be little benefit in taking the much slower, indirect route. Consequently, morphology seems to play a minor role in identifying known words, though it is definitely necessary in identifying unknown words.
The lack of word boundaries in Chinese text has made it difficult to study sentence parsing. For example, Lee (1995) presented a few Chinese garden path sentences. However, careful inspection of his sentences reveals that many of these sentences are not true garden path sentences at all. For example, the following are two tokenizations of the same sentence:
Zhe jige nianqingren ai shang mingxing de dang
Zhe jige nianqingren aishang mingxing de dang
"Zhe jige nianqingren" means "these few young men". The first sentence is the one with correct tokenization in which "ai" (love) is the main verb and "shang mingxing de dang" (to be fooled by celebrities) is a complement clause. The second tokenization is still syntactically well formed, but is semantically ill formed: "aishang" (fall in love with) "mingxing de dang" (celebrities' tricks).
Despite the fact that this is not a true garden path sentence, it is a good example of the complicated interaction between syntax, morphology, and word identification. Parsing ambiguities resulting from errors in tokenization occur only in Chinese. As a result, they provide a unique opportunity for Chinese psychologists and psycholinguists to explore the nature of language processing. The model developed in this study does not receive input from the morphological or syntactic level, but it still has much explanatory power. In the above example, it is obvious that the ambiguity between "aishang" and "ai shang" is hidden, so readers may already have the tendency to identify "aishang" because the corresponding lexical entry gets activated automatically. Although a rule-based explanation, such as late closure (Frazier, 1979), is also possible, the word identification explanation is nevertheless more parsimonious and more plausible.
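The hidden ambiguity in the example can be made concrete with a small sketch that enumerates every tokenization of the ambiguous stretch permitted by a lexicon. The lexicon below is a hypothetical toy list written in pinyin, each syllable stands in for one character, and the function name is an assumption; the sketch only lists candidates and does not model how readers choose among them.

```python
def tokenizations(syllables, lexicon):
    """Return every way to split the syllable sequence into lexicon words."""
    if not syllables:
        return [[]]
    results = []
    for end in range(1, len(syllables) + 1):
        word = "".join(syllables[:end])
        if word in lexicon:
            # Keep this word and tokenize whatever follows it.
            for rest in tokenizations(syllables[end:], lexicon):
                results.append([word] + rest)
    return results


# One syllable per character; the lexicon is a hypothetical toy list.
syllables = ["ai", "shang", "ming", "xing", "de", "dang"]
toy_lexicon = {"ai", "shang", "aishang", "mingxing", "de", "dang"}

for tokens in tokenizations(syllables, toy_lexicon):
    print(" ".join(tokens))
# ai shang mingxing de dang
# aishang mingxing de dang
```

Both candidates are licensed by the lexicon, so nothing at the character level signals which one is intended.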
The word identification heuristics in this model (GMM+AWF/MI) can be viewed as an imprecise approximation of the more dynamic and distributional lexical competition process. This study has demonstrated that such competition can indeed resolve most word identification ambiguities. Naturally, the logical next step is to implement the real competition process in a connectionist network. Since the Chinese writing system has thousands of characters and hundreds of thousands of words, the representations for characters and words must be highly distributed; otherwise the network would require too much computational power to run. Another low-cost solution is to build smaller in-context networks (Norris, 1994). For example, if four characters are seen in some fixation, a small network is dynamically constructed. The network has four input units, each of which represents a character in the perceptual span. Words associated with these characters constitute the output units. When the process is finished, the network is destroyed. A new network is built when new input is taken.
In this study, characters are treated as unanalyzed visual patterns, and the only factor that affects their identification is the number of strokes. In reality, characters are not just visual patterns of random strokes. They are highly structured. Some character components carry phonological information, while some carry semantic information. Their role may be a minor one in resolving word identification ambiguities, because the competition is at a higher level. However, their visual structure may well affect the identification process, especially when they are seen in the parafovea. Representing the structure of characters would be crucial if a finer-grained model of eye movement control is to be developed.
This study assumes a rather modular eye movement control mechanism. The words are identified first, and a saccade is planned based on the result of word identification. Although many models of eye movements in reading English make a similar assumption (e.g., Legge et al., 1997; Reichle et al., 1998), the large variability in human eye movement data suggests that word identification and saccade planning may not be two completely independent stages, and that the link between word identification and saccade planning may not be so direct.
Ever since Hung and Tzeng's (1981) seminal review on the cognitive processes of reading different orthographies (especially Chinese), there have been numerous studies on the psychology of reading in Chinese (see Hoosain [1991] and Wang, Inhoff, and Chen [1999], for two milestone reviews). When Hung and Tzeng wrote their review, reading research had been back in favor among psychologists for less than ten years (before that, it had long been suppressed by the dominance of behaviorism). Twenty years ago, English and Chinese reading research were at about the same preliminary stage of discovery: automatic phonological recoding in word recognition, word frequency effects, and so on. Since then, we have witnessed significant progress in English reading research, but very little progress in Chinese reading research. Most researchers are still working on issues regarding the processing of isolated characters or words, and it is very rare to find a researcher who studies the more interesting, and more complicated, "real" reading process. In addition to its unique orthography, the lack of physical word boundaries is perhaps the most salient characteristic of the Chinese writing system. Liu et al. (1974) had this insight at the very beginning of the psychology of reading in Chinese. However, in the past thirty years there has been virtually no advance in our knowledge about this topic.
This study has set a new direction and established a new research framework for psychological studies of reading in Chinese. It is hoped that this study will help trigger a new wave of research and pave the way for further research into the various complicated processes involved in online reading.
© Copyright by Chih-Hao Tsai, 2001