Doctoral Dissertation of Chih-Hao Tsai

July 2001

Tsai, C.-H. (2001). Word identification and eye movements in reading Chinese: A modeling approach. Doctoral dissertation, University of Illinois at Urbana-Champaign.


Chapter 7
Research Framework

The lack of an interdisciplinary perspective has prevented past attempts from effectively binding knowledge of linguistics and psychology together. In addition, the limitations of experimentation have prevented past attempts from making significant empirical and theoretical progress. Building on the new analysis, this study therefore takes an unconventional approach to the word identification problem in Chinese reading.

Corpus-based Research

The Academia Sinica Balanced Corpus (ASBC)

This study is a corpus-based computational study. It uses a large, representative sample of segmented Chinese text--the Academia Sinica Balanced Corpus (ASBC; Academia Sinica, 1998). ASBC is segmented into segmentation units, each defined as the smallest string of character(s) that has both an independent meaning and an identifiable and constant grammatical function (Huang, Chen, Chen, & Chang, 1997). In other words, Huang et al.'s definition is an orthographic-syntactic definition based on the invariance of form-function mapping. The units segmented by this definition are very similar to the "words" defined by the syntactic criterion in Packard (2000), with one exception: Grammatical affixes as defined by Packard (e.g., the verb aspect markers -le, -zhe, and -guo) are independent segmentation units in ASBC. This is because the standard Huang et al. proposed is not a strict linguistic criterion. Rather, it is a compromise among linguistic felicity, computational feasibility, and data uniformity. In this respect, the grammatical markers are indeed "detachable" because of their invariant orthographic appearances and grammatical functions. The segmentation standard by Huang et al. (1997) is summarized in the next section.

The Segmentation Standard

The segmentation standard consists of a set of segmentation criteria and a reference lexicon. The segmentation criteria consist of a definition of the segmentation unit, two segmentation principles, and a set of heuristic guidelines. The reference lexicon contains a list of Mandarin Chinese words and other linguistic units that the heuristic guidelines must refer to.

Segmentation Unit

The segmentation unit is the smallest string of character(s) that has both an independent meaning and an identifiable and constant grammatical function.

Segmentation Principles

  1. A string whose meaning cannot be derived by the sum of its components should be treated as a segmentation unit.
  2. A string whose structural composition is not determined by the grammatical requirements of its components, or a string which has a grammatical category other than the one predicted by its structural composition should be treated as a segmentation unit.

Segmentation Guidelines

  1. Bound morphemes should be attached to neighboring words to form a segmentation unit when possible.
  2. A string of characters that has a high frequency in the language or high co-occurrence frequency among the components should be treated as a segmentation unit when possible.
  3. Strings separated by overt segmentation markers should be segmented.
  4. Strings with complex internal structures should be segmented when possible.

The Reference Lexicon

Both the segmentation principles and the guidelines refer to the reference lexicon. Entries in this lexicon include non-derivational words, bound morphemes, and productive morpho-lexical affixes. It also contains the list of mandatory segmentation markers, such as end-of-sentence markers. However, neologisms constantly add new forms and meanings to the words of a language, and old forms and meanings become obsolete. Therefore, the reference lexicon must be constantly updated.


Apart from the segmentation standard, several properties of ASBC make it suitable for this study. First, it is balanced, which means that it is a representative sample of the text that typical readers have encountered. By analyzing this sample from an adaptive perspective, one can discover what information readers could have used to optimize their reading performance. Second, hypotheses can be tested effectively using the corpus: because the corpus is representative, the test results are also very likely to be representative. Finally, since the goal of this study is to develop a psychologically plausible model, the more practical, orthographic-syntactic definition seems appropriate.

Structure of Research

This study consists of two parts. Part 1 investigates the nature of ambiguities in word identification as well as the heuristics of disambiguation. Part 2 is the actual modeling study that constructs and tests a model of word identification and eye movements in reading Chinese by taking advantage of findings from Part 1.

General Method

Computer Software and Hardware

The entire study was carried out on a PC with an Intel Pentium III 933 MHz CPU running the Microsoft Windows 2000 operating system. Computer programs used in all analyses were written in the C programming language (American National Standards Institute, 1990). Executables were compiled with GCC 2.95.3 (Free Software Foundation, Inc., 2001).


The original ASBC contained 749,126 sentences, 4,911,231 words, or 8,107,600 characters. A lexicon of 131,616 unique Chinese words and their frequency counts was compiled from the raw corpus. The lexicon was complete in the sense that it contained all unique words in the corpus. In other words, with the lexicon, all words in the corpus were recognizable. Table A1 (p. 88) shows the distribution of word lengths. Table A1 shows that quite a few words are much longer than typical words. Most of those words are numbers, which are segmented as words in ASBC. For example, the 11-character word yibaibashiliuyijiuqianyibaiwan (one hundred eight ten six billion nine thousand one hundred ten-thousand) represents the number 18,691,000,000, that is, 186 x 100,000,000 + 9,100 x 10,000 (a Chinese "billion" [yi] is a hundred million, rather than a thousand million as in English).
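The lexicon compilation described above amounts to counting how often each segmented word occurs in the corpus. The following C sketch illustrates one way this could be done; it is not the program used in this study. The names Lexicon, lexicon_add, and lexicon_count, the fixed table capacity, and the byte limit per word are all assumptions made for the illustration, and a corpus of the size of ASBC would need a much larger or growable table.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LEXICON_CAPACITY 1024  /* hypothetical size; must be a power of two */
#define MAX_WORD_BYTES   64    /* hypothetical limit on word length in bytes */

typedef struct {
    char word[MAX_WORD_BYTES];
    unsigned long count;       /* 0 marks an empty slot */
} LexiconEntry;

typedef struct {
    LexiconEntry slots[LEXICON_CAPACITY];
    size_t unique_words;       /* number of distinct words stored */
} Lexicon;

/* Simple multiplicative string hash (djb2 variant). */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381UL;
    while (*s != '\0')
        h = h * 33UL + (unsigned char)*s++;
    return h;
}

void lexicon_init(Lexicon *lex)
{
    memset(lex, 0, sizeof *lex);
}

/* Linear probing: returns the slot holding `word`, or the first empty
   slot on its probe path, or NULL if the table is full. */
static LexiconEntry *lexicon_slot(Lexicon *lex, const char *word)
{
    size_t i = (size_t)(hash_word(word) & (LEXICON_CAPACITY - 1));
    size_t probes;
    for (probes = 0; probes < LEXICON_CAPACITY; probes++) {
        LexiconEntry *e = &lex->slots[i];
        if (e->count == 0 || strcmp(e->word, word) == 0)
            return e;
        i = (i + 1) & (LEXICON_CAPACITY - 1);
    }
    return NULL;
}

/* Records one occurrence of `word`.
   Returns 0 on success, -1 if the word is too long or the table is full. */
int lexicon_add(Lexicon *lex, const char *word)
{
    LexiconEntry *e;
    assert(lex != NULL && word != NULL);
    if (strlen(word) >= MAX_WORD_BYTES)
        return -1;
    e = lexicon_slot(lex, word);
    if (e == NULL)
        return -1;
    if (e->count == 0) {       /* first occurrence: store the word itself */
        strcpy(e->word, word);
        lex->unique_words++;
    }
    e->count++;
    return 0;
}

/* Returns the frequency count of `word`, or 0 if it was never added. */
unsigned long lexicon_count(Lexicon *lex, const char *word)
{
    LexiconEntry *e = lexicon_slot(lex, word);
    return (e != NULL) ? e->count : 0UL;
}
```

With a lexicon built this way, every word token in the corpus has a count of at least one, which is the sense in which the lexicon is complete. For example, after the tokens wo, kan, shu, wo are added in turn, lexicon_count returns 2 for wo and 0 for a word that never occurred.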

After the lexicon was compiled, sentences containing non-Chinese words, such as Arabic numerals or English words, were removed to create a reduced, Chinese-only corpus. Comments and tags were also removed. All subsequent analyses were performed on the reduced corpus. The reduced corpus consisted of 714,390 sentences, 4,612,595 words, or 7,459,582 characters.
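The filtering step above can be sketched in C as follows. This is a minimal illustration rather than the procedure actually used: it assumes UTF-8 input (the encoding of the corpus files is not stated here), it treats a word as Chinese only if every character falls in the CJK Unified Ideographs block U+4E00 to U+9FFF, and the function names word_is_chinese and sentence_is_chinese_only are hypothetical. Punctuation handling is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the UTF-8 string `word` is non-empty and consists only of
   characters in U+4E00..U+9FFF, and 0 otherwise.  Every code point in that
   range is encoded as exactly three bytes, so any other byte pattern
   (ASCII letters, digits, truncated sequences) is rejected. */
int word_is_chinese(const char *word)
{
    const unsigned char *p = (const unsigned char *)word;
    assert(word != NULL);
    if (*p == '\0')
        return 0;
    while (*p != '\0') {
        unsigned long cp;
        if ((p[0] & 0xF0) != 0xE0)      /* not a three-byte lead byte */
            return 0;
        /* Short-circuit evaluation keeps us from reading past the
           terminating NUL of a truncated sequence. */
        if ((p[1] & 0xC0) != 0x80 || (p[2] & 0xC0) != 0x80)
            return 0;
        cp = ((unsigned long)(p[0] & 0x0F) << 12)
           | ((unsigned long)(p[1] & 0x3F) << 6)
           |  (unsigned long)(p[2] & 0x3F);
        if (cp < 0x4E00UL || cp > 0x9FFFUL)
            return 0;
        p += 3;
    }
    return 1;
}

/* Returns 1 if the sentence has at least one word and all `n` words pass
   word_is_chinese, i.e., the sentence would be kept in the reduced corpus
   under the assumptions stated above. */
int sentence_is_chinese_only(const char *const words[], size_t n)
{
    size_t i;
    if (n == 0)
        return 0;
    for (i = 0; i < n; i++) {
        if (!word_is_chinese(words[i]))
            return 0;
    }
    return 1;
}
```

Under these assumptions, a sentence containing a token such as 2001 or an English word would be dropped as a whole, which matches the sentence-level removal described above.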

© Copyright by Chih-Hao Tsai, 2001