Doctoral Dissertation of Chih-Hao Tsai

July 2001

Tsai, C.-H. (2001). Word identification and eye movements in reading Chinese: A modeling approach. Doctoral dissertation, University of Illinois at Urbana-Champaign.

Chapter 2
Words in Chinese

It seems odd that in a study of word identification one has to discuss what a word is in the first place. After all, most psychologists studying Chinese lexical processing simply assume that words are what can be found in a dictionary. However, compared with their Western counterparts, native speakers of Chinese have a less clear concept of what a word is. Researchers often find it difficult to give instructions to participants in experiments involving word units, even in simple lexical decision experiments. For example, when participants hear an instruction like "press YES when the characters displayed on the screen constitute a word and NO when they do not", they will usually turn to the experimenter with puzzled faces and ask what a word is. In response, the experimenter usually has to change the word "word" in the instruction to something like "meaningful unit". In this chapter I analyze this phenomenon from historical, psychological, and linguistic perspectives.

Differential Evolution of Language and Script

The Chinese have an orthography with a very long history, which can be traced back to the second millennium BC (Norman, 1988). Ancient Chinese orthography was pictographic, but the writing system has evolved into a modern form in which characters are no longer pictographic. Instead, just as in all modern writing systems, Chinese characters now serve as written tokens of the spoken language (DeFrancis, 1989; Hung & Tzeng, 1981). In the modern system, the characters map onto the language at both syllable and morpheme levels. Despite the significant shift of their function from logographic to morphosyllabic, the Chinese characters still bear much resemblance in structure to their classic yet stylish forms stabilized in the Han dynasty about two thousand years ago.

In contrast to the orthography, which has been quite stable over the past two millennia, the Chinese languages themselves have changed continuously, with the most drastic changes occurring in the past two centuries. For a very long period in history, the classic Chinese lexicon consisted mainly of monosyllabic words. In the 19th century, concepts from the Western world and Japan began to flood into China. Chinese scholars realized that it was no longer possible to invent new characters to represent new concepts, since there were simply too many new concepts. Consequently, new words were generated by compounding existing words. Numerous polymorphemic words have been constructed and have entered the lexicon (Masini, 1993). Today, compounding is still the major word-formation device, and the modern lexicon consists mainly of polymorphemic words.

China adopted a standard national language even later. The Ministry of Education of the Republic of China adopted Mandarin as the national language in 1909. Around the same time, in parallel with the adoption of a national language, there was a vernacular movement in literary circles promoting the writing of spoken Mandarin instead of classic literature (Norman, 1988). The vernacular movement has fostered the modern role of the Chinese characters.

Despite the fact that multi-character words have outnumbered single-character words, in the modern Chinese writing system characters are still spaced evenly. This style of character spacing has not changed for thousands of years. Thus, the word identification problem that modern Chinese readers face results from the differential evolution of the Chinese language and writing system.

The Concept of Word in Chinese

Because words are not visibly marked in written Chinese, they also play a much less significant role in people's concepts about that particular language unit. In fact, the Chinese word for "word" (ci) occurs with only about half the probability of the word for "character" (zi) in a large corpus (.00384 vs. .00842; Chinese Knowledge Information Processing Group, 1994). Hoosain (1992) also reported that when a group of college students was asked to mark the word boundaries in a text, there was disagreement among the responses. He also indicated (Hoosain, 1991, p. 18) that the underlying meaning of the character for "word" is not very well understood by many ordinary Chinese speakers. Tsai, McConkie, and Zheng (1998) replicated Hoosain's finding and found that there is indeed disagreement among the word-marking responses. However, they also found that there is a systematic basis underlying much of this disagreement. It appeared that the disagreement results from difficulty in consciously accessing implicit knowledge of the word units, rather than from a lack of consistent knowledge of words. They concluded that it is a performance issue, rather than a competence issue.

Neither Hoosain's (1992) nor Tsai, McConkie, and Zheng's (1998) study can determine whether the magnitude of disagreement observed is indeed larger in Chinese than in English or other languages in which words are marked in written forms. Recently, Miller, Chen, and Zhang (2000) have further demonstrated that English speakers showed strong agreement on word boundaries when asked to parse sentences (whose spaces between words were removed) into words. Chinese speakers, on the other hand, showed considerable disagreement. Therefore, there is indeed substantial disagreement on word boundary judgments in Chinese.
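To make the notion of disagreement on word boundaries concrete, one simple way to quantify it is to compare, for each pair of participants, the positions at which they placed boundaries. The following Python sketch is only an illustration under stated assumptions: the function names (boundary_positions, mean_pairwise_agreement), the agreement measure itself, and the example responses are hypothetical and are not taken from Hoosain (1992), Tsai, McConkie, and Zheng (1998), or Miller, Chen, and Zhang (2000).

```python
from itertools import combinations


def boundary_positions(segmentation):
    """Return the set of internal boundary offsets implied by a segmentation.

    A segmentation is a list of strings that concatenate to the original
    text; an offset k means a boundary was placed after the k-th character.
    """
    positions = set()
    offset = 0
    for unit in segmentation[:-1]:  # the end of the text is not a judgment
        offset += len(unit)
        positions.add(offset)
    return positions


def mean_pairwise_agreement(segmentations):
    """Average, over all pairs of participants, of the proportion of
    between-character slots on which the two participants made the same
    boundary / no-boundary decision. (Hypothetical measure, for illustration.)
    """
    if len(segmentations) < 2:
        raise ValueError("need at least two participants")
    text = "".join(segmentations[0])
    if any("".join(s) != text for s in segmentations):
        raise ValueError("all segmentations must cover the same text")
    slots = len(text) - 1  # number of possible boundary locations
    if slots < 1:
        raise ValueError("text must contain at least two characters")

    boundary_sets = [boundary_positions(s) for s in segmentations]
    scores = []
    for a, b in combinations(boundary_sets, 2):
        disagreements = len(a ^ b)  # slots marked by exactly one of the two
        scores.append(1 - disagreements / slots)
    return sum(scores) / len(scores)
```

For instance, if one invented participant parses a six-character sentence as ["他", "喜歡", "吃", "蘋果"] and another as ["他", "喜歡", "吃蘋果"], the two differ on one of the five possible boundary locations, so mean_pairwise_agreement returns 0.8. Identical responses from every participant would yield 1.0.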

Since Western psycholinguistics gives such a prominent place to the word unit in its analyses, the lack of prominence of words, and even of the concept of word itself, in Chinese is quite surprising. It further raises the question of whether the role of the word unit has been overemphasized in traditional psycholinguistics, or whether the critical nature of the word unit is only characteristic of processing in languages in which words are prominently marked.

In their introduction to a special issue of Language and Cognitive Processes devoted to issues of processing East Asian languages, H.-C. Chen and Zhou (1999) clearly expressed the feelings of uncertainty of Chinese psychologists about the concept of the word:

For instance, contemporary theories of language processing unexceptionally consider words as the basis of complex comprehension processes.... This is not surprising, because, after all, words are transparent units for speakers of European languages. However, it is not obvious whether the same arguments and conclusions relating to word processing that have been reached through psycholinguistic studies with European languages can be generalized to other languages, such as Chinese, in which words are not transparent units (pp. 425-426).

H.-C. Chen and Zhou might have made an inappropriate inference. One must keep in mind that most linguistic knowledge is implicit, and it is well known that implicit knowledge is often hard to access and manipulate explicitly at the conscious level. It is just like our inability to describe how we keep our balance when riding a bicycle. Since Chinese readers are seldom called upon to indicate what does and does not constitute a word in any situation in which they receive consistent feedback, they may simply fail to develop explicit knowledge of word units that can be called upon to perform this type of task. By this explanation, English readers develop a consistent notion of what is and is not a word because words are marked in the language and whenever they write they must constantly decide where the boundaries between words should be placed. Errors of judgment in this decision are corrected by teachers and parents. Without this type of explicit feedback, Chinese readers have no basis for developing agreement on what constitutes a word.

Defining the Word

Admittedly, the word itself is sometimes difficult to define, even in languages whose word boundaries are marked in the scripts. This is mainly due to the fact that there are multiple identities or senses of "word" (Crystal, 1991; Packard, 2000). For example, Packard listed eight different criteria:

  1. Orthographic: A language unit defined by writing conventions.
  2. Sociological: The basic language unit intuitively recognized by linguistically naïve people.
  3. Lexical: An entry in the lexicon.
  4. Semantic: A concept.
  5. Phonological: A "word-sized" entity that is defined using phonological criteria.
  6. Morphological: An output of a word-formation rule.
  7. Syntactic: An independent occupant of a syntactic form class slot.
  8. Psycholinguistic: A construct at roughly the "word" level of linguistic analysis that is salient and highly relevant to language processing.

In English, the sets of words defined by different criteria overlap to a very high extent. In Chinese, however, different criteria could result in very different kinds of "words". For example, the first two criteria basically define the set of characters as words. The sets of units defined by the first two criteria are explicitly accessible. On the other hand, the units defined by the remaining six criteria exist more or less as implicit knowledge in the speakers' minds.

Packard argues that the syntactic criterion of "word" is the best, because it is the most common current linguistic characterization of the notion "word", and because it motivates the concept of "word" in most other languages. Moreover, the sets of words defined by the syntactic criterion and by the Chinese technical term for "word" (ci) overlap to a very large extent.

Other criteria have their relative weaknesses (Packard, 2000). Orthographic and sociological: Characters basically are inappropriate units. Lexical: There are words not listed in the lexicon, and there are also entries in the lexicon that are not words. Semantic: "Concept" is even harder to define than "word". Phonological: Phonological structure is not as good as other definitions in predicting speakers' intuitions of what words are. Morphological: Some entities formed by morphological rules must be augmented with additional information before they can appear freely in utterances. Psycholinguistic: Little research has been done to determine what its properties might be in any language.

The syntactic criterion may not be very appealing to many psychologists, especially those studying Chinese reading and word identification, as it is a very abstract definition that lies purely in the domain of linguistics and is largely insulated from psychology. But the question is: If the word is a linguistic unit, why should it not be defined linguistically? For pure visual perception activities, a linguistic definition of word may not be that crucial. However, reading is not a pure visual perception activity. Reading is an activity whose ultimate goal is to comprehend the language. From this perspective, it should be fairly reasonable to accept the syntactic criterion in developing theories about the processing of "words".

© Copyright by Chih-Hao Tsai, 2001