Similarities Between Tongyong Pinyin and Hanyu Pinyin: Comparisons at the Syllable and Word Levels

Chih-Hao Tsai

Kaohsiung Medical University

2000-10-16 (Updated: 2004-07-01)

Related Research: The response conflict problem in reading tongyong pinyin: A cognitive perspective

Similarities between tongyong pinyin and hanyu pinyin at the syllable level and the word level were investigated. Both type-based and token-based similarity statistics were calculated. It was found that at the syllable level, 19.47% of the 416 syllable types had different spellings, and 27.49% of the 83,434,515 syllable tokens had different spellings. At the word level, 48.84% of the 111,415 word types had different spellings, and 38.27% of the 51,119,108 word tokens had different spellings. The findings indicate that hanyu pinyin and tongyong pinyin are not as similar as proponents of tongyong pinyin have claimed. Since the pinyin orthography, either hanyu pinyin or tongyong pinyin, is word based, similarity between the two systems at the word level is a very suitable index of compatibility. Obviously, there is no reason to say that hanyu pinyin and tongyong pinyin are compatible. Tongyong pinyin should not be adopted as Taiwan's national standard unless internationalization is considered of no importance.

Hanyu pinyin, the most commonly used system for Chinese romanization, has been the national standard of China since 1958, and an international standard (ISO 7098:1991, 2nd ed.) since 1982. See Feima Xiaojian ("Pegasus Studio"; Chinese BIG5 page) for official specifications of the two layers of hanyu pinyin: (a) The phonetic system: Hanyu Pinyin Fang'an ("Hanyu Pinyin Scheme"), and (b) The writing system: Hanyu Pinyin Zhengcifa Jiben Guize ("Basic Rules for Hanyu Pinyin Orthography"). Almost all Mandarin-speaking communities, except Taiwan, have adopted hanyu pinyin. Taiwan has traditionally viewed hanyu pinyin, as well as simplified characters, as political symbols of China's rule. It is therefore not difficult to understand why Taiwan has been resisting hanyu pinyin for such a long time.

Tongyong (universal) pinyin, devised chiefly by Bo-cyuan Yu (Boquan Yu) in 1998, is best described as an attempt to achieve internationalization without adopting hanyu pinyin. Tongyong pinyin is a modified version of hanyu pinyin, with a difference of about 10%. Most of the changes are made to consonant symbols: tongyong pinyin gets rid of those q's, x's, and zh's. There are also some changes of vowel symbols, such as using "iou" to replace "iu". The hard-to-type umlaut is also dropped. Basically Yu wanted to make his system friendly to English speakers. See Republic of China Yearbook (Government Information Office, 2000), for a more detailed review of the history of Chinese romanization. In addition, Yaochu Qiu and Hezhong Xu's Chinese Romanization Commentary Page (Chinese BIG5 page), Dan Jacobson's Homepage (Chinese BIG5 and English page), and Yimei Hsueh's Discussing Natural (Tongyong) Pinyin Website (Chinese BIG5 page) have collected much historical as well as current data about tongyong pinyin since its birth in 1997. Please note that I do not necessarily endorse their comments, and I suggest you separate data from comments when visiting their sites and make your own judgments.

Although tongyong pinyin has been changing so constantly that it is hard to find a stable version, it has nevertheless gained more and more attention. Advertised as "compatible with hanyu pinyin", and thus as able to achieve internationalization without chinalization, tongyong pinyin is intuitively appealing. On October 7th, 2000, even the Mandarin Commission of the Ministry of Education decided to propose a draft that adopts tongyong pinyin as the national standard of romanization. Please visit the Mandarin Commission's website for the latest draft of tongyong pinyin, which was made available in mid-November, 2000.

The Mandarin Commission's decision has already triggered a war between proponents of hanyu pinyin and tongyong pinyin. Tongyong pinyin's political correctness appears to have attracted more supporters. However, the assumption on which its capacity for internationalization rests, namely compatibility with hanyu pinyin, is probably unwarranted. Ovid Tzeng, Minister of Education and a cognitive scientist well known for his expertise in the cognitive processes of the Chinese language, has expressed his concern. He mentioned that a small difference at a lower level could expand exponentially as the levels go higher. For example, chimpanzees and humans have 97% or more of their DNA in common. However, the 3% difference in DNA results in a huge difference between the two species ("Zhongwen yiyin," 2000). After devoting three weeks to the Chinese romanization issue, on October 30th, 2000, Ovid Tzeng announced in a press conference that the final version of the draft for Chinese romanization to be submitted to the Executive Yuan by the Ministry of Education would be based on hanyu pinyin, rather than tongyong pinyin. However, he also stressed that the adopted hanyu pinyin is likely to be slightly modified by consulting several features of tongyong pinyin. It has to be noted that the Minister of Education's decision is still not final. The final decision has yet to be made by the Executive Yuan.

The present study compares tongyong pinyin and hanyu pinyin at two levels higher than that of phonetic symbols, where the two systems are known to differ by about 10%. The first level is the syllable level, which represents the basic writing unit. The second level is the word level, which represents the basic linguistic unit, written with a space on either side.

Lexicon. The words and their associated pronunciation and frequency information in the lexicon distributed with the libtabe library (Hsiao, Hsieh, Tan, Tsai, & Yeh, 2000) were used. The lexicon consisted of 138,612 entries. Entries without pronunciation information were excluded from the analyses. For entries that had more than one pronunciation, only the first was used. This first-pass cleanup resulted in a list of 111,415 unique words. See Table 1 for word counts across different word lengths.

Table 1
Word counts across different word lengths

Length (characters)    Number of unique words
1                      13,058
2                      63,754
3                      18,413
4                      14,953
5                         711
6                         339
7                         130
8                          54
9                           3
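The first-pass cleanup described in the Lexicon paragraph can be sketched roughly as follows. This is a minimal illustration, not the program used in the study: the `Entry` record, its field names, and the rule of keeping the first occurrence of a repeated word are all assumptions, since the source does not describe the libtabe lexicon format or how duplicates were resolved.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Entry(NamedTuple):
    """Hypothetical in-memory form of one lexicon entry."""
    word: str
    pronunciations: List[str]  # zero or more readings, in lexicon order
    frequency: int


def clean_lexicon(entries: Iterable[Entry]) -> Dict[str, Tuple[str, int]]:
    """Drop entries without a pronunciation and keep only the first reading.

    Assumption: when a word appears more than once, the first entry wins.
    """
    cleaned: Dict[str, Tuple[str, int]] = {}
    for entry in entries:
        if not entry.pronunciations:
            continue  # no pronunciation information: excluded
        if entry.word in cleaned:
            continue  # keep words unique
        cleaned[entry.word] = (entry.pronunciations[0], entry.frequency)
    return cleaned


def count_by_length(cleaned: Dict[str, Tuple[str, int]]) -> Counter:
    """Tally unique words by length in characters, as in Table 1."""
    return Counter(len(word) for word in cleaned)
```

Applied to the real lexicon, `count_by_length` would be expected to yield a tally of the same shape as Table 1, provided the cleanup rules match the ones actually used.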

Syllables. The 416 basic Mandarin syllables (tones disregarded) defined in Wang (1998, p. 98) were adopted as the standard list for this study.

Zhuyin, Hanyu Pinyin, and Tongyong Pinyin Cross-Reference Table. The table was compiled by the author (Tsai, 2000b). Hanyu pinyin information was taken from DeFrancis's (1996) ABC Chinese-English Dictionary. Tongyong pinyin information was taken from Hsueh's (2000) Ziran Pinyin Dajia Tan (Discussing Natural Pinyin; in Chinese BIG5) website. Syllables spelled differently were encoded in the table. Since tongyong pinyin is quite young and still undergoing continuous revision, it was not easy to find a "stable" version. I chose Hsueh's website as the information source for tongyong pinyin because her zhuyin-tongyong pinyin mapping table appeared to be quite recent. I cannot guarantee that my table is 100% accurate. If you should find any error, please let me know. (A month after the compilation of this mapping table, the Mandarin Commission of the Ministry of Education finally made the final version of the draft for tongyong pinyin available on the web in mid-November, 2000. I have verified my table against this recently released draft, and everything appears to be up-to-date.)
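To make the role of the cross-reference table concrete, here is a minimal sketch of how such a table could be used to decide whether a word is spelled differently in the two systems. The handful of syllables shown is an illustrative subset chosen for this example, not the author's table, and the tongyong spellings follow the commonly published form of the system, so they should be checked against Tsai (2000b). Treating a word as different whenever any of its syllables differs is also an assumption of the sketch.

```python
from typing import Dict, List, Sequence

# Illustrative subset only (hypothetical contents); the full table would
# cover all 416 basic syllables.
HANYU_TO_TONGYONG: Dict[str, str] = {
    "zhong": "jhong",  # zh -> jh
    "xie": "sie",      # x -> s
    "qing": "cing",    # q -> c
    "liu": "liou",     # iu -> iou
    "guo": "guo",      # unchanged
    "tong": "tong",    # unchanged
    "yong": "yong",    # unchanged
}


def to_tongyong(hanyu_syllables: Sequence[str]) -> List[str]:
    """Convert a word, given as hanyu pinyin syllables, via table lookup."""
    return [HANYU_TO_TONGYONG[s] for s in hanyu_syllables]


def is_spelled_differently(hanyu_syllables: Sequence[str]) -> bool:
    """A word differs if at least one of its syllables differs."""
    return to_tongyong(hanyu_syllables) != list(hanyu_syllables)
```

For instance, `is_spelled_differently(["zhong", "guo"])` is true because the first syllable changes, while `is_spelled_differently(["tong", "yong"])` is false.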

Design and Procedure

Analyses were performed separately for syllable and word levels. For each syllable/word, whether the spellings of the two systems were exactly the same was recorded. Then, for each level, two kinds of statistics were calculated. The first was based on unique syllable/word types without considering frequencies, which is commonly known as type-based statistics. The second was weighted by syllable/word frequencies, which is commonly known as token-based statistics.
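The distinction between the two kinds of statistics can be sketched as follows. This is a minimal illustration with made-up toy data, not the C programs used in the study; the function name and the `(unit, frequency, is_different)` record layout are assumptions.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, bool]  # (syllable or word, frequency, spelled differently?)


def difference_stats(records: Iterable[Record]) -> Dict[str, float]:
    """Return type-based and token-based percentages of differing units."""
    types_total = types_diff = 0
    tokens_total = tokens_diff = 0
    for _unit, frequency, is_different in records:
        types_total += 1            # each unique unit counts once
        tokens_total += frequency   # each unit counts once per occurrence
        if is_different:
            types_diff += 1
            tokens_diff += frequency
    return {
        "type_pct": 100 * types_diff / types_total,
        "token_pct": 100 * tokens_diff / tokens_total,
    }


# Toy data with invented frequencies, for illustration only.
toy = [
    ("zhongguo", 50, True),
    ("tongyong", 30, False),
    ("pinyin", 20, False),
    ("xiexie", 100, True),
]
stats = difference_stats(toy)  # type_pct 50.0, token_pct 75.0
```

In the toy data, two of four types differ (50%), but because the differing words are also the frequent ones, 150 of 200 tokens differ (75%). The two measures can therefore diverge in either direction.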

Syllable Level

Table 2
Number and percentage of syllable types with different spellings

Different      Total          %Different
81             416            19.47

Table 3
Number and percentage of syllable tokens with different spellings

Different      Total          %Different
22,936,039     83,434,515     27.49

Word Level

Table 4
Number and percentage of word types with different spellings

Different      Total          %Different
54,411         111,415        48.84

Table 5
Number and percentage of word tokens with different spellings

Different      Total          %Different
19,562,049     51,119,108     38.27
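As a quick arithmetic check, the percentages in Tables 2 through 5 can be recomputed from the reported counts. The sketch below uses only numbers that appear in the tables; the dictionary and function names are chosen for this example.

```python
# (different, total) pairs copied from Tables 2-5.
RESULTS = {
    "syllable types": (81, 416),
    "syllable tokens": (22_936_039, 83_434_515),
    "word types": (54_411, 111_415),
    "word tokens": (19_562_049, 51_119_108),
}


def pct_different(different: int, total: int) -> float:
    """Percentage of units spelled differently, rounded to two decimals."""
    return round(100 * different / total, 2)


def units_per_difference(different: int, total: int) -> float:
    """Average number of units read per differently spelled unit."""
    return total / different


for label, (different, total) in RESULTS.items():
    print(f"{label}: {pct_different(different, total)}% different, "
          f"about one in every {units_per_difference(different, total):.1f}")
```

The second helper expresses each token-based rate as an average spacing, which is how a reader would experience the difference in running text.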

From the results obtained, it is very obvious that as the level goes higher, the difference between the two romanization systems expands. And the expansion in type-based measures is indeed approximately exponential, roughly doubling or more at each step: from 10% (phonetic symbols) to 19.47% (syllables) to 48.84% (words). Consequently, it clearly demonstrates that the two systems are not as similar as proponents of tongyong pinyin have claimed.
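One way to see why a difference at the syllable level grows at the word level is a simple compounding model: if each syllable of a word independently has probability p of being spelled differently, a word of n syllables differs with probability 1 - (1 - p)^n. The sketch below applies this to the word-length distribution in Table 1. It is an illustration under strong assumptions (independence, every syllable type equally likely, one syllable per character), not the analysis performed in the study, and it is not expected to reproduce the observed 48.84%.

```python
from typing import Dict

# Word counts by length, copied from Table 1.
WORD_COUNTS: Dict[int, int] = {
    1: 13_058, 2: 63_754, 3: 18_413, 4: 14_953, 5: 711,
    6: 339, 7: 130, 8: 54, 9: 3,
}


def word_difference_rate(p_syllable: float, n_syllables: int) -> float:
    """Chance that at least one of n independent syllables differs."""
    return 1 - (1 - p_syllable) ** n_syllables


def expected_type_difference(p_syllable: float, counts: Dict[int, int]) -> float:
    """Average the per-length rate over the word-length distribution."""
    total = sum(counts.values())
    weighted = sum(
        count * word_difference_rate(p_syllable, length)
        for length, count in counts.items()
    )
    return weighted / total


# Syllable-level type difference from Table 2 (81 of 416).
estimate = expected_type_difference(81 / 416, WORD_COUNTS)
```

Even this crude model gives a word-level rate well above the syllable-level rate, because most words have two or more syllables. Any gap between the estimate and the observed figure reflects the assumptions, for example that differing syllables may be more common in real words than a uniform draw from the syllable inventory suggests.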

The token-based measures reflect the actual differences a person would encounter when he or she reads or writes/types. For example, suppose a Chinese book is transcribed into both hanyu pinyin and tongyong pinyin. If we compare the two versions character by character (actually, syllable by syllable), we would get a percentage of difference around 27.49%. In other words, roughly one syllable in every four is spelled differently. If we compare the two versions word by word, we would get a percentage of difference around 38.27%. In other words, more than one word in every three is spelled differently. Again, by no means can the two romanization systems be said to be "similar".

One might be tempted to ask if the distribution obtained is representative, especially when considering the quality of the lexicon used. It is true that the libtabe lexicon was not compiled by computational linguists, nor was its frequency information calculated from a carefully balanced corpus. I have also performed the same analyses on an academic-level lexicon, and the resulting distribution is surprisingly similar. The results from the academic-level lexicon are not presented in this paper. This is my style of doing web publication: If I find something interesting and think it valuable to the public while in the midst of working on a real project, I switch to a non-research data source, replicate the findings, and post the replicated findings. (Another serious reason is that academic-level resources usually restrict legal usage to academic purposes only, so the results usually cannot be made available on the web.) My Mandarin syllable frequency counts for Chinese characters web page (Tsai, 2000a) is another example of this dual-mode practice. The underlying philosophy is "quzhi-yuwang, yongzhi-yuwang". That is, you take good stuff from the web and find it beneficial, so you make it even better and feed it back to the web.

You may also want to replicate my findings with other lists of words. What I would like to remind you is that I am not a person who wants to deceive the readers by providing biased data. As you can see, the present study is fully objective and unbiased. You are encouraged to find faults with any part of the study, and you are also encouraged to replicate my findings.

Finally, I would like to take the chance to clarify the concept of "pinyin system," which is commonly misunderstood by proponents of tongyong pinyin. A Chinese romanization system, either hanyu pinyin or tongyong pinyin, is more than a phonetic scheme used to transcribe character pronunciations. It is also a writing system. It can be viewed as the "second form" of the Chinese writing system. (The first form is, of course, the Chinese characters.) The "second form" is commonly used when Chinese characters are not desirable or cannot be used. Although it is rare to see a pure Chinese text written completely in pinyin, the second form is commonly used when the target readers do not know Chinese characters well or when Chinese words or sentences need to be used in non-Chinese documents. Road signs are good examples. Written languages are intended to be read. Visual word recognition is not merely converting letters to sounds and then sounds to meanings. Reading is a visual activity in which frequently used words are recognized directly without the mediation of phonology (Rayner & Pollatsek, 1989). If Mandarin words are spelled consistently across different cultures and countries, they will be easily recognizable. However, if nearly half of the Mandarin words are spelled differently in Taiwan, it will only lead to confusion, rather than internationalization. After all, we spend more time reading than writing. "Compatibility" should certainly be evaluated at the level of reading. Since the pinyin orthography, either hanyu pinyin or tongyong pinyin, is word based, similarity between the two systems at the word level is a very suitable index of compatibility. Obviously, tongyong pinyin and hanyu pinyin are not compatible. (See also The Response Conflict Problem in Reading Tongyong Pinyin: A Cognitive Perspective, for a micro-level analysis of the compatibility problem from the perspectives of cognitive and engineering psychology.)

Proponents of tongyong pinyin have argued that tongyong pinyin can achieve internationalization because it is more compatible with English orthography-phonology conversion rules. They have reasoned that tongyong pinyin is more friendly to English speakers. Since English is the most widely spoken language in the world, they have further reasoned that tongyong pinyin can connect Taiwan to the world directly without going through China (i.e., without using hanyu pinyin). With the correct concept of a pinyin system, we immediately recognize the fallacy of such reasoning. Their reasoning is based on the assumption that a pinyin system is nothing more than a phonetic scheme. They have neglected the far more important role of pinyin as a writing system. Consequently, an incorrect assumption leads to incorrect conclusions. And the political correctness of dechinaization has further prevented them from seeing the truth.

Hanyu pinyin and tongyong pinyin are not as similar as proponents of tongyong pinyin have claimed. In fact, they are very dissimilar at the level of both writing units (characters/syllables) and linguistic units (words). If one uses the similarity between the two systems at the word level as an index of compatibility, there is no reason to believe that hanyu pinyin and tongyong pinyin are compatible. With the additional evidence from "The Response Conflict Problem in Reading Tongyong Pinyin", it becomes even more evident that the claim that the compatibility issue has been resolved is untrue. Consequently, tongyong pinyin should not be adopted as Taiwan's national standard unless internationalization is considered of no importance.

DeFrancis, J. (1996). ABC Chinese-English dictionary. Honolulu, HI: University of Hawaii Press.

Government Information Office. (2000). The Republic of China yearbook [On-line]. Available

Hsiao, P.-H., Hsieh, T.-H., Tan, K.-S., Tsai, C.-H., & Yeh, W. (2000). Localization library for Taiwan and BIG5 encoding (libtabe) (Version 0.1-3) [Computer programming library]. Available

Hsueh, Y.-M. (2000). Ziran pinyin dajia tan [Discussing natural (tongyong) pinyin] [On-line]. Available

Rayner, K., & Pollatsek, A. (1989). The psychology of reading. Englewood Cliffs, NJ: Prentice-Hall.

Tsai, C.-H. (2000a). Mandarin syllable frequency counts for Chinese characters [On-line]. Available

Tsai, C.-H. (2000b). Zhuyin, hanyu pinyin, and tongyong pinyin cross-reference table [On-line]. Available

Wang, H.-M. (1998). Statistical analysis of Mandarin acoustic units and automatic extraction of phonetically rich sentences based upon a very large Chinese text corpus. Computational Linguistics & Chinese Language Processing, 3, 93-114.

Zhongwen yiyin xitong bian bu bian? Zeng Zhilang: Xu pinggu daijia you duo gao [Should the romanization system be modified? Ovid Tzeng: The cost resulting from modification must be evaluated]. (2000, October 14). China Times.

Author Note

I am very grateful to Bo-cyuan Yu, Iau-chu Chiu, and Dan Jacobson for their helpful comments on earlier versions of this article.

Research reported here was an independent project and was not financially supported by any institution or individual person. In fact, it was a low-cost project in terms of both finance and time. The computer on which the study was carried out was just an ordinary PC. The C/C++ compiler used to compile the programs for computing various statistics is DJGPP, a freely available (under the GNU GPL license), complete 32-bit C/C++ development system for Intel 80386 (and higher) PCs running MS-DOS/Windows. For those of you who are familiar with Unix, yes, DJGPP is the PC version of gcc. The lexicon used belongs to the libtabe project (of which I am one of the developers), which is freely available under the BSD license.

StarOffice, a freely available, full-featured, integrated office suite, was used in preparing the final write-up of this study. NoteTab Light, a freeware text editor, was used in composing the HTML version of the write-up. Freely available HTML Tidy was used to check the HTML syntax and reformat the HTML code. Cascade, also free, was used to write the cascading style sheets (CSS). The web pages were published on my website at GeoCities, whose space was also obtained for free. The email address I used to communicate with the readers is provided for free by Yahoo!Mail. Finally, AltaVista Free Access, the dial-up service I used to establish the Internet connection when transferring the HTML files to my website, is also free. As a result, the only noticeable financial cost was the cost of a couple of local telephone calls.

In terms of time, it took about four hours to prepare the cross-reference table and the C programs. Running the programs and collecting the outputs took just a couple of minutes. It took six hours to write the first draft in English and two hours to make it an HTML file. It took six extra hours to translate the English write-up into Chinese. Including the time spent on double-checking to make sure that everything was right, the total time spent from the beginning of the study to the final write-up was less than 24 hours.

