Lexical Parsing by Chinese Readers

Chih-Hao Tsai and George W. McConkie

University of Illinois at Urbana-Champaign, U.S.A.

Xianjun Zheng

Beijing Normal University, China

Tsai, C. H., McConkie, G. W., & Zheng, X. J. (1998, November). Lexical parsing by Chinese readers. Poster session presented at the Advanced Study Institute on Advances in Theoretical Issues and Cognitive Neuroscience Research of the Chinese Language, University of Hong Kong.


We investigated several issues concerning lexical parsing and the Chinese reader's understanding of the concept of the 'word.' A total of 138 Chinese readers from China and Taiwan were asked to put marks between the words in a 300-character text by using subjective evaluation. Results show substantial disagreement among Chinese readers on where to divide the characters into words. However, there is a systematic basis underlying much of this disagreement. We concluded that Chinese readers largely agree in their linguistic competence about words. Variation comes in the performance aspect of the task. Between-subject consistency in this task only develops through practice with consistent feedback provided by a word-based orthography, such as hanyu pinyin.


In written Chinese, all characters are equally spaced, and there are no visible word-boundary markers apart from the recently introduced convention of placing punctuation marks at the ends of sentences and some clauses. Readers must parse the string of characters into lexical units, a process that involves considerable ambiguity and difficulty.

Furthermore, it seems that people's concepts about written language are also affected. Hoosain (1991, p. 18) reports that the Chinese word for 'word' occurs at about 1/10 the frequency of the word for 'character' in a large sample of written language. He also reports (Hoosain, 1992) that when a group of college students was asked to mark the word boundaries in a text, there were disagreements among the responses. Finally, he indicates (Hoosain, 1991, p. 18) that the underlying meaning of the word for 'word' is not very well understood by many ordinary Chinese speakers.

We employed Hoosain's (1992) word-marking task to explore several issues concerning lexical parsing and the Chinese reader's understanding of the concept of the 'word.' (1) We wished to provide a systematic and quantitative indication of how much people disagree about what constitutes the words in a text. (2) Assuming that Hoosain's observation of a lack of agreement was replicated, we wished to investigate whether the differences among people's responses should be regarded as arising from differences in linguistic competence or in performance. (3) We wished to determine whether Mainland Chinese readers show greater consistency in their word judgements than Taiwanese readers do, which would indicate that consistency in word judgements tends to develop with experience with a word-based orthography (hanyu pinyin).


Participants. Sixty-nine undergraduate students at Beijing Normal University, China, and 69 undergraduate students at Kaohsiung Medical College, Taiwan, participated in this study.

Material. A typed page containing a 300-character story was used. The 31 boundaries that coincided with punctuation marks were excluded from the data analyses, leaving 269 valid boundary locations. Colleagues who were native speakers of Mainland Mandarin and of Taiwanese Mandarin judged the text to be neutral between the two dialects in wording, sentence patterns, and content. Two versions of the story were prepared: one in traditional characters, used with the Taiwanese students, and one in simplified characters, used with the Mainland Chinese students.

Procedure. At each site, a local experimenter administered the task to the group of participants. Each participant was given a printed copy of the story and was instructed to put marks between the words in the text. A short word-marked passage was given as an example; it had been segmented roughly according to a linguistic analysis. The participants were told to use their subjective evaluation and that there were not necessarily any correct answers.
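For analysis, each participant's marked copy can be reduced to one binary judgment per boundary location. The following Python sketch shows one possible encoding; it is not the procedure actually used in the study. The slash as marking symbol, the punctuation set, the function name `boundary_marks`, and the toy sentence are all assumptions made for illustration, as is the convention that a boundary is the gap after each character.

```python
# Hypothetical conventions, not taken from the study materials.
MARK = "/"                       # assumed symbol a participant writes between words
PUNCTUATION = set("，。、；：？！")  # assumed set of punctuation marks

def boundary_marks(marked_text):
    """Return {boundary_index: 0 or 1} for one participant's marked text.

    Boundary i is the gap after the i-th character (0-based) of the
    unmarked text. Gaps occupied by punctuation are dropped, mirroring
    the exclusion of punctuation boundaries from the analyses.
    """
    gaps = []  # state of the gap after each character: "none", "mark", "punct"
    for ch in marked_text:
        if ch == MARK:
            if gaps and gaps[-1] == "none":
                gaps[-1] = "mark"
        elif ch in PUNCTUATION:
            if gaps:
                gaps[-1] = "punct"  # punctuation overrides any mark in this gap
        else:
            gaps.append("none")     # an ordinary character opens a new gap
    return {i: int(state == "mark")
            for i, state in enumerate(gaps) if state != "punct"}

# Toy example: six characters, one gap taken by the final punctuation mark.
print(boundary_marks("我们/喜欢/读/书。"))
```

Stacking these dictionaries over participants gives a participants-by-locations table from which the number of marks per participant and the marking rate per location can both be computed.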


The Chinese participants marked significantly more locations in the text (M=122.35, SD=11.92) than did the Taiwanese participants (M=98.46, SD=27.00), t(136)=6.722, p<.001. The Taiwanese had significantly greater variance in the number of marks used than did the Chinese, F(68,68)=5.13, p<.0001. The marking patterns of the two groups across boundary locations were highly correlated, r(269)=.96, p<.001.
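The t and F values reported here can be checked from the means and standard deviations alone. The following Python sketch does so under two assumptions that are consistent with the reported degrees of freedom but are not stated in the text: that a pooled-variance (Student's) t test was used, and that F is the ratio of the larger sample variance to the smaller. The function names are illustrative.

```python
import math

def pooled_t(m1, sd1, n1, m2, sd2, n2):
    """Student's t for two independent groups, from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

def variance_ratio(sd_larger, sd_smaller):
    """F ratio of two sample variances, larger variance in the numerator."""
    return sd_larger ** 2 / sd_smaller ** 2

# Reported values: 69 participants per group.
t, df = pooled_t(122.35, 11.92, 69, 98.46, 27.00, 69)
f = variance_ratio(27.00, 11.92)

print(f"t({df}) = {t:.2f}")
print(f"F(68,68) = {f:.2f}")
```

Run as written, this prints t(136) = 6.72 and F(68,68) = 5.13, in line with the reported values up to rounding of the summary statistics.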

We examined the 65 locations at which responses were most inconsistent. The Chinese participants placed more marks at syntactic (between-word) boundaries than at morphological (within-word) boundaries,¹ F(1,61)=32.35, p<.0001 (effect of linguistic level). The Taiwanese showed a similar pattern, F(1,61)=31.98, p<.0001. The Chinese participants also placed more marks at content boundaries, whose adjacent items were both content morphemes or words, than at function boundaries, whose adjacent items included at least one function morpheme or word, F(1,61)=8.03, p<.01 (effect of linguistic context). This effect was not significant for the Taiwanese, F(1,61)=0.00, p>.95. The interaction of linguistic level and context was significant for the Chinese, F(1,61)=6.54, p<.05, but not for the Taiwanese, F(1,61)=0.84, p>.35. (See Figure 1.)

¹ Chao (1968), Li and Thompson (1981), and Packard (1996) were the primary reference sources used to determine the linguistic level and context of each boundary.
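The text does not say how the most inconsistent locations were identified. One plausible operationalization, offered here purely as an assumption, is to rank locations by how close the proportion of participants marking them is to 0.5. The Python sketch below illustrates that idea on invented toy data; the function names and the 0.5 criterion are hypothetical and are not taken from the study.

```python
def marking_proportions(responses):
    """Proportion of participants who marked each boundary location.

    responses: list of equal-length 0/1 lists, one list per participant.
    """
    n = len(responses)
    return [sum(column) / n for column in zip(*responses)]

def most_inconsistent(proportions, k):
    """Indices of the k locations whose marking proportion is nearest 0.5.

    Assumed criterion: a location marked by about half of the participants
    shows the least agreement. Ties are broken in favor of earlier locations.
    """
    ranked = sorted(range(len(proportions)),
                    key=lambda i: abs(proportions[i] - 0.5))
    return sorted(ranked[:k])

# Invented toy data: 4 participants, 5 boundary locations.
toy = [
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [1, 0, 0, 0, 1],
]
props = marking_proportions(toy)
print(props)
print(most_inconsistent(props, 2))
```

In the toy data, locations 0 and 1 show complete agreement (everyone or no one marks them), so the selection falls on locations 2 and 3.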


1. Hoosain (1992) observed that Chinese readers do not agree on how to divide the characters in text into words. We have confirmed this observation in a more quantitative and systematic manner.

2. There is a systematic basis underlying much of this disagreement. That is, there is substantial agreement among Chinese readers in their linguistic intuitions about words, reflecting their linguistic competence; variation comes in the criterion used in applying this competence to the unfamiliar task of parsing the string of characters into words, which is the performance aspect.

3. The greater consistency in word marking among Mainland Chinese suggests that between-subject consistency in this task only develops through practice with consistent feedback provided by hanyu pinyin orthography.


Figure 1.