Tsai, C.-H. (2001). Word identification and eye movements in reading Chinese: A modeling approach. Doctoral dissertation, University of Illinois at Urbana-Champaign.
In the first part of this study, the characteristics of different types of word boundary ambiguity and the logic of disambiguation were carefully analyzed. In addition, it was demonstrated how low-level, primitive information, such as word lengths, word frequency counts, and the relative probabilities of boundary roles of individual characters, can effectively resolve most cases of tokenization ambiguity.
In the second part of this study, I will restore the word identification problem to its natural context: reading. A model of reading in Chinese will be developed that identifies words as it reads while operating under the same constraints as human readers: a limited input window (perceptual span) and poor parafoveal vision. Since the input window is limited, the model needs to move the window around to acquire information from different parts of the text. Consequently, the model is not just a model of word identification, but also one of eye movements, in reading Chinese. Since there are at present no psychological models of either word identification or eye movements in reading Chinese, this model is the first of its kind and has to be developed entirely from scratch. The goal, therefore, is not to develop a model capable of simulating every behavioral detail of word identification and eye movements in reading Chinese. To keep the scale of the model manageable, it is bound to involve assumptions that are psychologically plausible but do not (yet) have direct support from empirical data, as well as simplifications in various respects.
However, even with these limitations and restrictions, this part of the study is still crucial. It demonstrates a novel research methodology in the psychology of reading and psycholinguistics that incorporates both an adaptive, computational perspective on problem analysis (Part 1) and a deductive, constructive approach to scientific discovery in psychology (Parts 1 and 2).
The model is a rule-based model of reading in Chinese. To simulate perceptual constraints, the model has a limited-size input window that can hold only a few characters. In addition, the window consists of a fixed-size foveal area and a one-character parafoveal area to the right of the foveal area. Characters in the foveal area are 100% identifiable, while the probability of identifying the character in the parafoveal area depends on that character's complexity.
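The window's perceptual constraints can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the text says only that parafoveal identifiability decreases with character complexity, so the particular stroke-count-to-probability function below is a placeholder assumption.

```python
import random

def perceive(window_chars, stroke_counts, rng=random.random):
    """Return the characters actually identified from a 3+1 input window.

    window_chars:  up to four characters (three foveal + one parafoveal)
    stroke_counts: dict mapping each character to its stroke count
    rng:           random source, injectable for testing
    """
    seen = list(window_chars[:3])  # foveal characters: 100% identifiable
    if len(window_chars) > 3:
        para = window_chars[3]
        # Placeholder: identifiability falls with stroke count (assumption).
        p = max(0.1, 1.0 - stroke_counts[para] / 30)
        if rng() < p:
            seen.append(para)
    return "".join(seen)
```

With a deterministic `rng`, the two outcomes for the parafoveal character can be forced, which is convenient for testing the rest of the pipeline.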
The model reads from left to right with its window. Each time an input is taken, the model identifies words with a set of word identification heuristics. When word identification is completed, the model decides where to take the next input based on the result of word identification in the current window. As a result, the output consists of the result of word identification and the location of the next input window. Saccades in the model are represented as movements of the window.
Except for the rule that determines the size of the window, which is probabilistic, all other processing rules are deterministic. Consequently, the model is not designed to perform data fitting or parameter estimation. Rather, it is an implementation of what is known to be true, and what is assumed to be reasonable and psychologically plausible, about word identification and eye movements in reading Chinese. By constructing the model, having it read, and observing its performance, we will gain a better understanding of reading in Chinese.
As mentioned earlier, human readers have very limited perceptual spans. For Chinese readers, the perceptual span is about one character space to the left of fixation and two to the right (Inhoff & Liu, 1998; Tsai & McConkie, 1995). Furthermore, characters seen in the parafoveal vision are not always identifiable. Complex characters are less likely to be identified, while simple characters have higher chances of being identified in the parafovea (Yang, 1994).
Assuming the width of each character is about 0.75 to one degree of visual angle, the fixated character and the ones immediately to its left and right fall within the fovea and can therefore be identified without difficulty. Whether the fourth character (the second to the right of fixation) can be identified depends on its complexity. Character identification itself is a complicated process that may involve processing of character components, phonology, and semantics. Since the primary focus of this study is word identification, the model simply adopts the assumptions stated above: foveal characters are always identified, and the parafoveal character is identified with a probability that depends on its complexity.
The limited perceptual span not only limits the information available on any fixation, but also creates uncertainty. Nevertheless, Chinese readers cope with these severe limitations and read well. How is this possible? The answer lies in the unique processing characteristics of human readers.
Readers do not read one letter at a time, or one character at a time. Visual information within the perceptual span is processed in parallel, and that parallelism is further enhanced by the automatic word recognition process. Readers do not need to deliberately look up a word in a dictionary. Rather, words are automatically activated by the visual input, just as words are automatically activated by the auditory input in speech perception.
Yet even with parallel processing and automatic word activation, readers still face the problem of uncertainty. In left-to-right reading, there should be no uncertainty regarding word boundaries near the left boundary of the perceptual span. There is, however, uncertainty near the right boundary of the span, because information beyond that boundary is unavailable. Another characteristic of word recognition that could be helpful in such a situation is that words can be activated even when the input is incomplete. This does not eliminate uncertainty completely, but it at least reduces it. Such uncertainty could affect word identification and/or eye movement control. The role of uncertainty in eye movement control will be the focus of the next section. This section deals primarily with word identification.
The word identification process is assumed to be basically the same as in speech perception: words are activated automatically. Consequently, different tokenizations are made available automatically. The activated words compete for a unique solution, and the surviving words constitute the winning tokenization. Ideally, this kind of process is best implemented as an artificial neural network, or PDP model. However, current artificial neural networks have to be implemented on traditional, serial computers; parallel processing can only be simulated. With a realistic size of lexicon, the network would contain too many connections, and such a network would be incredibly slow even on the fastest computers available to date. It is possible to avoid this problem by dynamically building, on each fixation, a small network consisting of only the relevant characters and words, as in the Shortlist model (Norris, 1994). However, even small networks contain large numbers of connections, and that still severely hurts performance when processing a large data set, such as a large corpus.
As a result, the model uses traditional, symbolic algorithms to approximate the parallel activation and competition processes of a neural network. Parallel activation is simulated by simply enumerating all possible tokenizations. Partial words are recognized with a pre-compiled, expanded lexicon that contains all possible leading fragments of all words. With the capability of recognizing partial words, when the model sees a string "ABC" with a three-character window, it is able to tell whether "ABC", "BC", or "C" could be the beginning of a longer word.
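A minimal sketch of this partial-word check, assuming a toy three-word lexicon in place of the real one: the expanded lexicon is reduced here to a set of leading fragments, against which each suffix of the window is tested.

```python
def build_fragment_set(lexicon):
    """Collect every proper leading fragment of every word in the lexicon."""
    fragments = set()
    for word in lexicon:
        for i in range(1, len(word)):
            fragments.add(word[:i])
    return fragments

def could_start_word(s, fragments):
    """True if s is a leading fragment of some longer word."""
    return s in fragments

# Toy lexicon (illustrative only).
lexicon = {"AB", "BCD", "CDE"}
fragments = build_fragment_set(lexicon)

# For window "ABC", test each suffix: "ABC", "BC", "C".
window = "ABC"
suffix_status = {window[i:]: could_start_word(window[i:], fragments)
                 for i in range(len(window))}
```

Here "BC" and "C" could begin longer words ("BCD", "CDE"), while "ABC" could not.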
As to the competition, or ambiguity resolution, process, three simple heuristics discovered in Part 1 that approximate the competition process in the mental lexicon (or in a neural network) are used: GMM, AWF, and FMM. GMM is applied to the set of all possible tokenizations first. If there are ties, AWF is applied. If there are still ties after AWF is applied (such cases are extremely rare), FMM makes the final decision. (The role of FMM is not prominent here; the important part is still the GMM+AWF combination.) AWF is favored over MI because AWF has slightly better performance and represents the concept of lexical competition more directly.
Word frequencies become a little more complicated, because the lexicon now contains not just words, but also partial words. An entry in the lexicon can have more than one frequency count. If it is a word by itself, it has a word frequency count. If it is also the beginning of longer words, then it has another frequency count, which is the sum of the frequencies of the words beginning with it (I will call this the bound fragment frequency). If an entry cannot be a word by itself, then it has only a bound fragment frequency.
Which frequency does the AWF heuristic use? It follows a simple two-step decision process. If there is only one frequency count, use that one. If there are two, use the one with greater value.
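The two-step decision reduces to taking the larger of whatever counts an entry carries. A small sketch, with the entry structure (two optional counts) being illustrative rather than the dissertation's data layout:

```python
def awf_frequency(word_freq=None, fragment_freq=None):
    """Frequency count used by AWF for an expanded-lexicon entry.

    Step 1: if only one count exists, use it.
    Step 2: if both exist, use the greater.
    """
    counts = [f for f in (word_freq, fragment_freq) if f is not None]
    if not counts:
        raise ValueError("entry must carry at least one frequency count")
    return max(counts)  # with one count, max() is that count
```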
The relationship between word identification (or, more generally, cognition) and eye movements is not without controversy. The controversy is not whether cognition and eye movements are related; they are. It is the degree of cognitive control over eye movements that has been at the center of the controversy.
The group of models proposed by Morrison (1984), Rayner and Pollatsek (1987), and Reichle et al. (1998) is very popular. These are word-based models: words are not just perceptual units, but also targets for saccade programming. These models are primarily aimed at simulating fixation and gaze durations and word skipping.
Another recent model, Mr. Chips (Legge, Klitz, & Tjan, 1997), is aimed at simulating another dimension of eye movement behavior: saccades. The model makes saccades based on the information available in the current perceptual span. Since in English orthography words are separated by spaces, there is no difficulty locating the last word (which may be only partially seen) in the input window (visual span). Incomplete words create uncertainty, and Mr. Chips uses the lexicon to compute the next window location that reduces the most uncertainty while maximizing saccade length. Although Mr. Chips merely minimizes the uncertainty caused by incomplete words and has no word-based saccade targeting assumption or algorithm, it is nevertheless capable of simulating many phenomena, such as word skipping, optimal viewing position, and regression.
All of the above models can be viewed as having a considerable degree of cognitive control. However, they are all based on English orthography and are therefore hardly applicable to Chinese. In fact, we have no data or theory at all about how eye movements in reading Chinese are controlled. But we do know that in reading Chinese the perceptual span is limited, and that there is similar uncertainty caused by incomplete words. From an adaptive perspective, the eyes should move in a way that maximizes the efficiency of information acquisition. To do so, readers must use the information available in the current perceptual span to calculate where to take the next input so as to minimize uncertainty while maximizing saccade length.
The above analysis will serve as the basis of the eye movement part of the model of word identification and eye movements in Chinese. It lacks empirical evidence, but it is logical. It also shares the same "ideal-observer" spirit as Mr. Chips.
The model assumes that word identification precedes eye movement planning. When the word identification process is completed, the model identifies the region with uncertainty and plans a saccade accordingly. Ideally, when words (and possibly partial words) are identified in the input window, the model could use the same principle as that of Mr. Chips' to plan saccades. However, since there is much less variability of word length in Chinese, complicated computations may not be necessary.
Once the word identification process is completed, the model identifies the "last fragment" in the perceptual span. The uncertainty associated with the last fragment is then used to determine the location of the next input window. For example, if the word identification process determines that the string "ABC" seen in a three-character window should be tokenized as "A/BC", then the last fragment is "BC".
Unique identification. If "BC" itself is not a word but can be the beginning of one and only one longer word in the lexicon (suppose the only word in the lexicon beginning with "BC" is "BCD"), then advance the input window by four characters. That is, the next window begins with the second character after "C"; the character "D" is not seen in either window. A unique identification results in a saccade (window offset) longer than the size of the current window.
Firm boundary. If the last fragment is a word by itself, and there are no words beginning with "ABC", "BC", or "C" in the lexicon, then there is a firm word boundary to the right of "C". The input window is then advanced by three characters. That is, the next window begins with the first character after "C". A firm boundary results in a saccade equal to the size of the current window.
Left-justify. Chinese has a very productive morphology, so almost every character can be the leading character of longer words. Firm boundaries are rarely seen, and because of this productivity, unique identifications are also infrequent. Consequently, last fragments are more likely to carry uncertainty. If there is neither a unique identification nor a firm boundary, advance the window by an offset of the current window size minus the length of the last fragment. That is, the next window begins with the first character of the last fragment of the current window. In our example, the next window begins with "BC". In other words, this heuristic left-justifies the last fragment with the next input window. The left-justify heuristic results in a saccade shorter than the size of the current window.
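The three heuristics can be summarized as one window-advance computation. This is a sketch following the "A/BC" example: the general unique-identification offset (window size plus the unseen tail of the unique word) is extrapolated from the "BCD" case, which is the only one the text states.

```python
def window_offset(window_size, last_fragment_len,
                  unique_word_len=None, firm_boundary=False):
    """Return how many characters to advance the input window.

    unique_word_len: length of the one word the last fragment uniquely
                     begins, if any (unique identification)
    firm_boundary:   True if no suffix of the window can begin a word
    """
    if unique_word_len is not None:
        # Unique identification: also skip the unseen tail of the word.
        return window_size + (unique_word_len - last_fragment_len)
    if firm_boundary:
        # Firm boundary: next window starts right after the current one.
        return window_size
    # Left-justify: next window starts at the last fragment.
    return window_size - last_fragment_len
```

For the running example (window size 3, last fragment "BC"): a unique word "BCD" gives an offset of 4, a firm boundary gives 3, and left-justify gives 1.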
Although most Chinese words are either one- or two-character words, there are still longer words. What if a word is at least as long as the size of the input window? Let us continue using "ABC" as an example. If "ABC" itself is a word or could be the beginning of longer words, it is viewed as the last fragment. If there is a unique identification or firm boundary, advance the window according to the standard heuristics.
If, however, there is no unique identification or firm boundary, then the model keeps "ABC" in a temporary buffer and advances the input window by three. As a result, the next input window begins with the first character after "C". Suppose what is seen in the next window is "DEF", what the model needs to process is "ABCDEF". The word identification process is then applied to the six-character string, rather than to what is seen in the current window.
If "ABCDEF" is tokenized as "ABCD/EF" after word identification, the buffer is erased. "EF" is treated as the last fragment, and the same procedure in the previous section applies.
If, however, "ABCDEF" is a word or could be the beginning of longer words, then "ABCDEF" is stored in the buffer, and the next input is taken. This process continues recursively until the word identification process results in at least two fragments. When this occurs, the last fragment is taken and the same procedure in the previous section applies.
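The buffering procedure above can be sketched as a loop; the toy tokenizer stands in for the GMM+AWF word identification process, and the window contents are illustrative.

```python
def read_long_word(windows, tokenize):
    """Accumulate successive window contents until tokenization yields
    at least two fragments.

    windows:  iterator of successive, non-overlapping window contents
    tokenize: word identification function, string -> list of fragments
    Returns (identified fragments, last fragment).
    """
    buffer = ""
    for window in windows:
        buffer += window
        fragments = tokenize(buffer)
        if len(fragments) >= 2:
            return fragments, fragments[-1]
        # The whole buffer is still one (partial) word: keep it, read on.
    return [buffer], buffer  # text exhausted while buffering
```

With windows "ABC" then "DEF" and a tokenizer that splits "ABCDEF" into "ABCD/EF", the loop buffers "ABC", takes the next input, and returns "EF" as the last fragment, matching the walk-through in the text.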
The model's eye movement programming mechanism is aimed at reducing memory load by avoiding use of its buffer. It is also aimed at minimizing errors: unless it can be sure there is a unique identification or a firm boundary, it always lets two consecutive windows overlap. The model does not make regressive saccades.
The model only has an input window, and does not have a fixation point. To simulate fixation-related phenomena, the fixation point of each input window is assumed to be on the center of the second character.
An immediate question raised by defining the fixation point is how to compute fixation duration. Again, there are no theories about the relationship between fixation duration and cognition in reading Chinese. The only established fact so far is that word frequency affects gaze duration, the amount of time spent fixating on a word during its first encounter (Yang & McConkie, 1999). Superficially, it may appear as if the frequency of the "fixated word" is the primary source of variation in fixation/gaze durations. However, Chinese words are short. With an average word length of 1.5 characters, a three-character perceptual span can easily cover two words, and a four-character span about 2.7, in most situations. It does not make much sense to assume that only "fixated words" affect fixation durations. It seems more plausible to assume that fixation duration reflects the time it takes to recognize all words in the perceptual span.
For the recognition of individual words, the model assumes that the recognition time is negatively correlated with the logarithm of word frequency. The following simple equation is used to compute the model's recognition time for a word with frequency x:
recognition time = 19 - int(log2(x))
Only the integer part of the logarithm is taken. Because the integer part of the base-two logarithm of the highest word frequency in the lexicon is 18, subtracting it from 19 yields a recognition time of one. The lowest frequency has an integer base-two logarithm of zero, and thus a recognition time of 19. Note that these "recognition times" are relative indices: it makes sense to say that 4 is faster than 6, but not that 4 is 50% faster than 6.
Since parallel processing is assumed, the time it takes to recognize all words would be equal to the time needed to recognize the word with the lowest frequency in the perceptual span. The model computes fixation duration based on the results of the word identification process. When the "last fragment" has two frequencies (word and bound fragment frequencies), the higher one is used.
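The recognition-time equation and the parallel-processing assumption together give the model's fixation duration, which can be sketched directly:

```python
import math

def recognition_time(freq):
    """Relative recognition-time index for a word of frequency freq (>= 1),
    per the equation in the text: 19 - int(log2(freq))."""
    return 19 - int(math.log2(freq))

def fixation_duration(frequencies):
    """Under parallel processing, fixation duration equals the recognition
    time of the lowest-frequency (slowest) word in the perceptual span."""
    return max(recognition_time(f) for f in frequencies)
```

A word with the highest frequency in the lexicon (integer log2 of 18) gets a recognition time of 1; a word of frequency 1 gets 19; a span containing both takes 19.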
The materials were the reduced, 714,390-sentence ASBC. The expanded lexicon needed to give the model the capacity to identify partial words was compiled using the following procedure. A word of length N has N-1 leading fragments. For example, the four-character word "ABCD" has three leading fragments: "A", "AB", and "ABC". Each fragment was used to search the lexicon. If the fragment was found, the word frequency of "ABCD" was added to the bound fragment frequency count of that entry; if not, a new entry was created. The number of words beginning with each fragment was also recorded, as was the length of the longest word beginning with each fragment. The expanded lexicon consisted of a total of 167,599 fragments (including words), or 35,983 more entries than the original lexicon.
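The compilation procedure can be sketched as follows. The entry fields mirror those named in the text (bound fragment frequency, number of words beginning with the fragment, longest such word), but the dict-based layout is illustrative, not the dissertation's data structure.

```python
def expand_lexicon(lexicon):
    """Compile an expanded lexicon from a dict of word -> frequency.

    Returns a dict of entry -> stats, where entries are words plus all
    leading fragments of words.
    """
    # Seed with the words themselves (each carries its word frequency).
    expanded = {w: {"word_freq": f, "frag_freq": 0,
                    "n_words": 0, "longest": len(w)}
                for w, f in lexicon.items()}
    for word, freq in lexicon.items():
        for i in range(1, len(word)):        # the N-1 leading fragments
            frag = word[:i]
            entry = expanded.setdefault(
                frag, {"word_freq": None, "frag_freq": 0,
                       "n_words": 0, "longest": 0})
            entry["frag_freq"] += freq       # accumulate bound fragment freq
            entry["n_words"] += 1            # count words starting with frag
            entry["longest"] = max(entry["longest"], len(word))
    return expanded
```

For a toy lexicon {"A": 5, "AB": 10, "ABCD": 3}, the entry "A" ends up with a word frequency of 5 and a bound fragment frequency of 13 (10 + 3), while "ABC" is a fragment-only entry with a bound fragment frequency of 3.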
Four smaller lexicons were also compiled by using different frequency cutoffs to select entries from the 131,616-word master lexicon. Only words with frequencies greater than or equal to the cutoff point could enter each lexicon. The four cutoff points were 8 (2^3), 64 (2^6), 512 (2^9), and 4096 (2^12), and the sizes of the resulting lexicons were 32,326, 6,865, 1,212, and 130, respectively. Four expanded lexicons were also compiled from these smaller lexicons; their sizes were 38,112, 7,794, 1,496, and 146, respectively. The purpose of manipulating lexicon size was twofold. First, it was an attempt to replicate the behavioral pattern of Mr. Chips (Legge et al., 1997), whose saccades lengthened when the size of the lexicon was reduced. Second, by comparing saccade length distributions produced with different sizes of lexicon, we could evaluate the stability of the distribution produced with the full lexicon.
The stroke counts for traditional Chinese characters were from Tsai (1999). The stroke counts for simplified characters were from Stroke counts for GB characters (1994).
Here is a summary of model parameters:
The standard (psychologically plausible) traditional-character model used the following parameters: a fixed window size of three with one extra parafoveal character allowed (hereafter the "3+1" window); fixation point on the second character; traditional characters; and the full lexicon.
For any given set of parameters, the model read the whole corpus with that set of parameters, and summary statistics such as distributions of saccade lengths and fixation durations, etc., were recorded. The behaviors of the model that were observed can be grouped into three categories:
Human data, when available, were plotted against model data.
A total of 4,329,730 of the 4,612,595 word tokens were correctly identified by the standard model. The recall rate was 93.87%, which was very high. The model does not have the capability of handling errors in word identification, since the detection of errors requires input from higher levels. However, it is fair to assume that if human readers identify words in a similar manner as the model does, what causes the model to go wrong would also bother the readers. Changes in eye movements due to errors in word identification could include lengthened fixation durations, refixations, or regressions.
Figure B1 shows the distributions of saccade lengths for models with four different window sizes: 3, 3+1, 4, and 4+1 (see Figure B1, p. 99). One of the most salient characteristics of these distributions is, not surprisingly, that sizes of saccades are bounded by window sizes. In other words, saccade sizes tend to lengthen as the window size becomes longer, and unique identifications (based upon seen partial words) that would result in saccades longer than window sizes are extremely rare. Another salient characteristic is that lengths of saccades equal to the sizes of windows (that is, cases where the windows on successive fixations do not overlap) are also rare, which is most evident in models with fixed-size windows (3 and 4). This can only happen when the right boundary is a firm word boundary, which means that most frequently the right boundaries of windows are not firm word boundaries.
In addition, short (one-character) saccades are fewer than longer ones. This is also predictable from the logic of the model. A one-character saccade implies that the first word in the window is a one-character word, and the rest of the text in the window contains uncertainty. If in a three-character window the remaining text is a two-character word, a one-character saccade will be made if there is neither unique identification nor firm boundary. This kind of situation is not rare, but also not very common. As the window size gets larger, the chance of seeing three-character or longer last fragments becomes much smaller, thus resulting in even fewer one-character saccades.
The distribution for a model with a fixed-size, three-character window is unimodal, with over 60% two-character saccades (one character less than the size of the window). For models with windows larger than three (or that at least have a chance to grow to three, such as the 3+1 model), the distributions are still unimodal, but much less extreme. Most saccades are either one or two characters shorter than the window size. The models with window sizes 4 and 4+1 produced a very similar pattern, in which the two most frequent saccade lengths occur almost equally often. The distribution of the standard 3+1 model differs from the others in that it has more two-character saccades than three-character ones.
Finally, the recall rates are quite close. Recall rates for models 3, 3+1, 4 and 4+1 are 93.87%, 93.80%, 93.78% and 93.81%, respectively. This indicates that the buffering mechanism was flexible enough to cope with smaller windows, and the recall rate of 94% may be an upper bound for GMM+AWF.
Before comparing the distribution of the standard model and those of actual saccades made by human readers, one more test was performed. Figure B2 shows the effects of lexicon size on the distribution of the standard (3+1) model (see Figure B2, p. 100). It is evident from Figure B2 that as the size of lexicon decreases, the distribution of saccade lengths shifts rightward. That is, with a smaller lexicon the model tends to make longer saccades. This is exactly the same pattern observed by Legge, Klitz, and Tjan (1997) in their Mr. Chips model. This kind of pattern is easy to explain, because as lexicon size decreases, uncertainty also decreases. The decrease of uncertainty, in turn, leads to increased frequencies of unique identifications and firm boundaries, which produces an increase of saccade lengths.
Perhaps the most important thing to note in Figure B2 is that even when the size of the lexicon is reduced to roughly one quarter of the original, the saccade length distribution remains very similar to that obtained with the entire lexicon. The pattern is thus quite stable for large (greater than or equal to 6,865 words) lexicons.
Figure B3 shows the distribution of saccade lengths for the standard model, along with three distributions of saccade lengths made by human readers in three independent studies (see Figure B3, p. 101). The distributions of Inhoff, Liu, and Tang (1999) and McConkie and Yang (1999) are very similar to that of the standard model. The only notable differences are that human readers make slightly more long saccades and slightly fewer two-character saccades. But the basic pattern holds quite well. J.-L. Tsai's (2000) data are more interesting. The relative proportion of two- and three-character saccades still holds, but in general there are more one-character saccades and more saccades longer than three characters. The one-character saccades are hard to explain; the longer saccades are much easier. Unlike Inhoff, Liu, and Tang (1999) and McConkie and Yang (1999), in which participants read unrelated sentences (not unlike what the model did in simulation), J.-L. Tsai's participants read articles. Participants are usually more cautious when reading unrelated sentences in experiments. When reading articles, on the other hand, the goal of higher-level comprehension and richer contextual information come into play. Readers may therefore have a greater chance of making longer saccades.
It is possible to make the model less conservative so it could make longer saccades. Such a model needs to make probabilistic judgments regarding firm boundaries and unique identification, and needs to be willing to take risks when the probability of firm boundary or unique identification is high but not 100%.
Simplified Chinese characters have fewer strokes in general than do traditional characters, which may result in better chances for characters to be identified in the parafovea. Figure B4 shows two distributions, one for the standard model reading traditional characters, the other for the standard model reading simplified characters (see Figure B4, p. 102). The two distributions are extremely similar, although the model reading simplified characters did tend to make slightly longer saccades. At present, there are no human data comparing eye movements when reading simplified and traditional characters with properly equated reading groups.
In real eye movement data, a word is said to have been skipped if there is no fixation on it. An analysis was conducted to observe how the "simulated" fixation point, which is on the second character of the 3+1 window, would behave. Figure B5 compares the skipping rates of human readers and of the standard model (see Figure B5, p. 103). Human data were from Inhoff and Liu (1998). The skipping rate for one-character words is high and comparable between human readers and the model. For both, skipping rates for longer words are much lower, and word length does not seem to have an effect. The only difference is that for these longer words the model's skipping rates are near zero, while human readers' are around 25%. Such a difference may have been caused by the conservativeness of the model.
A word is said to be refixated if it is fixated more than once before the eyes leave it. In other words, refixations are consecutive fixations on the same word. The model refixates on a word when two simulated fixation points fall on the same word. For example, suppose "A/BC" is the identification result of a three-character window, so the fixated character is "B", and the fixated word is "BC". Now, suppose the model makes a one-character saccade and takes the next input "BCD". The fixated character is "C", and the fixated word is still "BC". As a result, the word is refixated.
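Refixation detection in this sense can be sketched as follows. Words are represented here as (start, end) character spans in the text, an illustrative encoding rather than the model's own.

```python
def count_refixations(word_spans, fixations):
    """Return the set of word spans refixated by a fixation sequence.

    word_spans: list of (start, end) inclusive character spans
    fixations:  list of fixated character indices, in order
    A word is refixated when two consecutive fixation points fall
    inside the same span.
    """
    def word_of(pos):
        for span in word_spans:
            if span[0] <= pos <= span[1]:
                return span
        return None

    refixated = set()
    for prev, cur in zip(fixations, fixations[1:]):
        w = word_of(cur)
        if w is not None and w == word_of(prev):
            refixated.add(w)
    return refixated
```

In the "A/BC" example above, the fixation points land on "B" and then "C", both inside the span of "BC", so that word counts as refixated.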
Figure B6 shows the refixation rate (proportion of words refixated) by word length for the standard model (see Figure B6, p. 104). It is predictable that refixation rate increases as word length increases. However, there is a big jump between refixation rates for five- and six-character words. It is unclear whether empirical data from human readers would also show a similar pattern.
The landing position is the initial fixation location on a word, if the word is fixated at least once. The simulated fixation point of the model always lands on the center of a character. As a result, if the fixation point is on the first character, the landing position is 0.5. If on the second, the landing position is 1.5, and so on. Figure B7 compares the landing positions for different lengths of words between empirical (Inhoff & Liu, 1998) and modeling data (see Figure B7, p. 105). For one-character words, readers tended to land on the center of the words (characters). For words longer than one character, readers' tendency was to land on the middle of the left half of the second character. On average, the model tended to land on the middle of the first and the second characters. But this is just an average, since the model does not have the capability to actually land on the middle of two characters. Even with such a severe limitation, the model was still able to produce a pattern of landing position by word length that is very similar to that of human readers.
It has already been observed that the frequency of fixated words affects gaze durations in reading Chinese text (Yang & McConkie, 1999). Recall that the model's fixation duration is computed from all words in the window, rather than solely from the fixated word. It is therefore interesting to see whether the frequency of the fixated word and the model's fixation duration are related. Figure B8 shows the effects of the frequency of the fixated word on first fixation and gaze durations (see Figure B8, p. 106). Word frequency effects are clearly present, and the effect for gaze durations is larger than that for first fixation durations. As can be seen in Figure B8, gaze durations tend to be longer than first fixation durations, especially for low frequency words, because low frequency words are more likely to be refixated. Yang and McConkie (1999) reported the same tendency in human readers' eye movement data. In the model, however, word frequency itself may not be the real cause, since there is no direct relationship between a word's frequency and its probability of being refixated. The real link is that word frequency and word length are negatively correlated: high frequency words tend to be shorter, and low frequency words tend to be longer, as is evident from comparing the type- vs. token-based distributions of word lengths in Table A1.
By using the model's "composite" measure of fixation duration, the effect of frequency of the fixated word can still be observed. In natural language, many things are correlated. As a result, the most plausible explanation is that the frequencies of consecutive words are correlated. Of course, whether the model's algorithm for deriving fixation durations is psychologically real is an empirical question. What the model has contributed here is giving a warning signal saying, "what you see is not necessarily what you get". A correlation of word frequency and fixation duration does not necessarily mean that the frequencies of fixated words are solely responsible for the word frequency effect.
The second part of the study successfully brought the word identification problem from offline analysis back to online reading. A model of word identification and eye movements in reading Chinese was built on the basis of the findings from Part 1 and of psychologically plausible, logically reasonable assumptions about perceptual constraints, processing characteristics, and eye movement behaviors. This model was not a statistical model, nor was it designed to fit existing data. It was a processing model that implemented well-defined constraints and algorithms. In other words, the model was theory-driven rather than data-driven. It provides a fully instantiated, testable model that can be verified, augmented, or challenged.
The resulting model, although far from perfect, was still able to identify words correctly and captured many spatial and temporal characteristics of human readers' eye movements. Specifically, compared with empirical data, the model produced very similar patterns of saccade length distributions, skipping rates, and landing positions. In addition, the effect of word frequency on the fixation duration of the fixated word was also observed.
© Copyright by Chih-Hao Tsai, 2001