Fanjian Translation

Introduction

Most people literate in Chinese, native speakers included, share the misconception that the two systems of Chinese characters, Simplified Chinese ("SC") and Traditional Chinese ("TC"), correspond directly with each other, and that converting between them requires only a simple code-to-code mapping. It actually takes much more than that, as we will see below.

Background

Before the establishment of the People's Republic of China in 1949, there was only one Chinese script, Traditional Chinese. In the 1950s the Communist government introduced a new script, named Simplified Chinese, which has been the official script of the PRC ever since. Some SC characters are existing characters; others are simplified forms of traditional characters that were commonly believed to be too difficult to memorize. The latter number 2,244 according to the latest edition of the Comprehensive List of Simplified Characters, published in 1986.

Hong Kong, Macau, Taiwan and most overseas Chinese communities, however, still use Traditional Chinese, while the PRC and Singapore use Simplified Chinese.

The Complexities of Conversion

1. Many simplified characters are no longer recognizable from their traditional forms, e.g. SC 呆 - TC 獃.

2. In numerous cases, one Simplified character corresponds to two or more Traditional forms, e.g. SC 呆 - TC 獃 and 呆. Sometimes only one of these is correct; sometimes any of them may be correct, depending on the context:

SC Source    TC Target    Meaning        TC Example
发 fa1       發           emit           出發 start off
发 fa4       髮           hair           頭髮 hair
干 gan1      乾           dry            乾燥 dry
干 gan4      幹           trunk          精幹 able, strong
干 gan1      干           intervene      干涉 interfere with
干 gan4      榦           tree trunk     楨榦 central figure
面 mian4     麵           noodles        湯麵 noodle soup
面 mian4     面           face           面具 mask
后 hou4      後           after          後天 day after tomorrow
后 hou4      后           queen          王后 queen
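
The difference between the two kinds of mapping in the table above can be sketched in a few lines of Python. This is only an illustration: the tables `CHAR_MAP` and `WORD_MAP` and both function names are assumptions made for this example, they hold just a handful of entries taken from the table, and they are not the rule set of any actual converter. The sketch contrasts a one-to-one code mapping with a word-level lookup that tries the longest matching word first.

```python
# Hypothetical one-to-one table: each SC character gets a single default TC form.
CHAR_MAP = {"发": "發", "干": "乾", "后": "後", "面": "面",
            "头": "頭", "汤": "湯", "桢": "楨"}

# Hypothetical word-level table for cases where the default form is wrong.
WORD_MAP = {"头发": "頭髮", "精干": "精幹", "干涉": "干涉",
            "桢干": "楨榦", "汤面": "湯麵", "王后": "王后"}


def convert_by_code(text):
    """Naive conversion: map every character on its own."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)


def convert_by_word(text):
    """Longest-match lookup in WORD_MAP, falling back to CHAR_MAP."""
    max_len = max(len(word) for word in WORD_MAP)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in WORD_MAP:
                out.append(WORD_MAP[chunk])
                i += size
                break
        else:
            # No word matched here: convert a single character.
            out.append(CHAR_MAP.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(convert_by_code("头发"), convert_by_word("头发"))  # 頭發 頭髮
print(convert_by_code("王后"), convert_by_word("王后"))  # 王後 王后
```

The character-level pass picks 發 and 後 every time, so it gets "hair" and "queen" wrong; the word-level pass gets them right only because those words happen to be listed.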

3. Encoding: SC is encoded in GB2312-80 or GBK; TC is encoded in Big5. The GB and Big5 standards are not compatible, and each side lacks numerous characters found in the other:

Chinese Character    GB Code    Big5 Code
看                   BFB4       ACDD
漢                   (none)     BA7E
互                   BBA5       A4AC
聯                   (none)     C170
網                   (none)     BAF4
軟                   (none)     B36E
件                   BCFE       A5F3
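
A table like this can be reproduced with the `gb2312` and `big5` codecs that ship with Python. The snippet below is a minimal sketch; the helper name `encode_hex` is an assumption made for this example. It prints the code of each character in both encodings, or `None` where the character does not exist in that encoding.

```python
def encode_hex(ch, codec):
    """Return the code of ch in codec as upper-case hex, or None if it is absent."""
    try:
        return ch.encode(codec).hex().upper()
    except UnicodeEncodeError:
        return None


# Characters from the table, plus the simplified form 汉 for comparison.
for ch in "看漢互聯網軟件汉":
    print(ch, encode_hex(ch, "gb2312"), encode_hex(ch, "big5"))
```

Traditional-only characters such as 漢 have no GB2312 code, and a simplified-only character such as 汉 has no Big5 code, which is why a byte-for-byte mapping between the two encodings cannot work.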

4. Vocabulary: SC mainly, though not necessarily, follows the vocabulary usage of Mainland China, whereas TC follows that of Taiwan and Hong Kong.

Another common misconception is that one can switch a Chinese website from Traditional Chinese to Simplified Chinese, or vice versa, simply by choosing a different encoding in the browser. Doing so inevitably results only in screenfuls of indecipherable symbols.
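
The effect of switching the browser encoding can be imitated in Python: the bytes of the page stay the same and only their interpretation changes. The function name `misread_as` is an assumption made for this sketch, and `gbk` stands in for whatever GB encoding the browser is told to use.

```python
def misread_as(text, actual, assumed):
    """Encode text with its real encoding, then decode the bytes with another one."""
    raw = text.encode(actual)
    # errors="replace" marks undecodable bytes instead of raising an exception.
    return raw.decode(assumed, errors="replace")


page = "互聯網軟件"
print(misread_as(page, "big5", "big5"))  # decoded correctly: the original text
print(misread_as(page, "big5", "gbk"))   # same bytes read as GBK: unrelated symbols
```

No conversion takes place in the second call; the Big5 bytes are simply looked up in the wrong table.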

Our Approach

In the paper "The Pitfalls and Complexities of Chinese to Chinese Conversion", Jack Halpern and Jouni Kerman of the CJK Dictionary Institute offer tremendous insight into the philosophy of Chinese-to-Chinese conversion. We are grateful for their work. We believe we share their understanding of the problem, and we use their work to evaluate our software.

In the context of Halpern and Kerman's work, our research found that Code and Orthographic conversion rules are critical to reaching an accuracy level of up to 99%, which is possible because of the grammatical proximity of TC and SC. Accuracy can be boosted beyond 99.9% by incorporating carefully selected Lexemic and Contextual rules. However, the extra accuracy comes at the cost of significantly lower performance, especially when sophisticated Contextual rules are in place.

To strike a good balance between performance and accuracy, we opted to forsake the Contextual and Lexemic approaches. This development strategy has made our translation engine one of the most sophisticated, high-performance real-time platforms for conversion between TC and SC available on the commercial market. On a Pentium III/800 CPU, our translation engine consistently delivers a conversion rate exceeding 10,000 characters per second while maintaining an average accuracy of more than 95%.

Our Implementation

We implemented our translation engine in both C and Java; both versions are currently in production use on platforms ranging from low-end Linux and Windows workstations to high-end Solaris and AIX servers. A COM version is also available for Windows-based applications, and an experimental PHP extension was implemented at one point. The Orthographic conversion rules that govern multiple-character mapping and vocabulary substitution are kept in rule database files, which can be updated so that the engine adapts to different contexts without recompilation.
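
The idea of keeping rules in data files rather than in code can be sketched as follows. The format of the actual rule database files is not described here, so this example assumes a hypothetical plain-text format with one tab-separated "source, target" pair per line and "#" for comments; the name `load_rules` is likewise an assumption.

```python
import io


def load_rules(stream):
    """Parse 'source<TAB>target' lines; blank lines and '#' comments are skipped."""
    rules = {}
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split("\t", 1)
        rules[source] = target
    return rules


# An in-memory stand-in for a rule file in the assumed format.
SAMPLE = "# hypothetical rule file\n头发\t頭髮\n软件\t軟體\n"
rules = load_rules(io.StringIO(SAMPLE))
print(rules)
```

Because the rules are read at run time, adding or correcting an entry only means editing the file and loading it again; the conversion code itself does not change.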