Fujitsu

Archived content

NOTE: this is an archived page and the content is likely to be out of date.

Deep Learning with a Reduced Amount of Training Data

Training images cut by 70% for ancient Chinese character recognition

Fujitsu Research & Development Center Co., Ltd.

Beijing, China, February 21, 2017

Fujitsu R&D Center Co., Ltd.[1] today announced an advanced deep learning technique for ancient Chinese character recognition. With this technique, highly accurate character recognition can be achieved with only a small amount of training data. The technique uses two types of recognition engines: one performs supervised learning of character image features with their corresponding character labels, and the other analyzes a pair of images to judge whether they show the same character. Compared with conventional supervised learning techniques, it achieves high accuracy with less training data. In a recent study on ancient Chinese character recognition, the technique cut the training data by 70% while achieving the same recognition accuracy as conventional methods. We hope this technique will contribute further to the protection of ancient Chinese works held by libraries and archives in China.

This technique was presented at the “IS&T International Electronic Imaging Symposium 2017”, held on Tuesday, January 31, in San Francisco, USA.

Background

China has about 50 million volumes of ancient books. They have very high heritage value, and many of them still appeal to modern society. However, with the passage of time and the impact of human activities, much of this cultural heritage has been eroded or destroyed.

To preserve ancient books, libraries usually store them as digital images. Beyond protecting the literature, libraries also rely on ICT to make full use of it by converting character images into character codes or text. At present, only a few of the ancient books have been completely digitized, through manual annotation by domain experts. Digitizing all of the ancient Chinese books in this way would require a great deal of time and cost.

Issues

In deep learning based character recognition, training data is of great importance. It consists of a huge number of text images and their associated text labels, so that the recognition engine can be trained sufficiently. In general, the more training data, the higher the recognition accuracy. However, because the number of character categories in ancient Chinese is huge, manually collecting and labeling character samples is impractical due to the prohibitive cost and time.

In contrast, training data for recognizing alphabet letters and numerals can be collected relatively easily, since there are only a few characters. However, it is difficult to apply these techniques to ancient Chinese literature, which contains a large number of fonts and writing styles.

Technology

  1. Make full use of a huge number of unlabeled character images and a small amount of labeled training data

    To increase the available training data, we use a large number of character images that conventional training leaves unused because they are not associated with correct character labels. As shown in Figure 1, recognition engine X gives each input character image a temporary label. At this step, because of insufficient training, these temporary labels may contain some errors, as shown by the red characters in Figure 1.

  2. Compare the patterns of ‘same-label’ image pairs and the patterns of ‘different-label’ image pairs to improve the recognition engine

    A large number of sample pairs are extracted from the temporarily labeled character image set and sent to recognition engines Y and Z, which share the same structure as recognition engine X. Recognition engine X is then further refined according to the comparison result, that is, whether the input pair of images shows the same character or not. This process can be carried out iteratively. With more training iterations, the number of wrongly labeled character images falls and the accuracy of engine X increases. By training a powerful recognition engine X in this way, the technique can achieve highly accurate recognition results even with limited training data.

    Figure 1: The developed character recognition framework
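
The two steps above can be illustrated with a small Python sketch. It is not the implementation behind this announcement: the deep recognition engines are replaced by a simple nearest-centroid classifier over hand-made feature vectors, the pair comparison is represented by a standard contrastive loss that is only computed and not used to update a network, and the iterative refinement is approximated by re-estimating the centroids from the temporarily labeled samples. All function names and the toy data are hypothetical.

```python
import math
from itertools import combinations


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def fit_centroids(labeled):
    """Stand-in for training engine X: one centroid per character label."""
    return {label: centroid(vecs) for label, vecs in labeled.items()}


def assign_temporary_labels(centroids, unlabeled):
    """Step 1: give each unlabeled sample the label of its nearest centroid."""
    return [min(centroids, key=lambda lab: distance(centroids[lab], v))
            for v in unlabeled]


def make_pairs(vectors, labels):
    """Step 2: build (a, b, same) triples; same is 1 if the temporary labels agree."""
    return [(vectors[i], vectors[j], int(labels[i] == labels[j]))
            for i, j in combinations(range(len(vectors)), 2)]


def contrastive_loss(a, b, same, margin=1.0):
    """Standard contrastive loss: pull same-label pairs together and push
    different-label pairs at least `margin` apart."""
    d = distance(a, b)
    return d ** 2 if same else max(0.0, margin - d) ** 2


def refine(labeled, unlabeled, rounds=3):
    """Simplified iteration: relabel, then refit on labeled + temporary labels."""
    centroids = fit_centroids(labeled)
    temp = assign_temporary_labels(centroids, unlabeled)
    for _ in range(rounds):
        merged = {lab: list(vecs) for lab, vecs in labeled.items()}
        for v, lab in zip(unlabeled, temp):
            merged[lab].append(v)
        centroids = fit_centroids(merged)
        temp = assign_temporary_labels(centroids, unlabeled)
    return centroids, temp


# Toy data (hypothetical): one labeled sample per character, four unlabeled ones.
labeled = {"A": [[0.0, 0.0]], "B": [[4.0, 4.0]]}
unlabeled = [[0.2, 0.1], [3.9, 4.2], [0.1, -0.3], [4.1, 3.8]]

centroids, temp_labels = refine(labeled, unlabeled)
pairs = make_pairs(unlabeled, temp_labels)
mean_loss = sum(contrastive_loss(a, b, s) for a, b, s in pairs) / len(pairs)
print(temp_labels, len(pairs), mean_loss)
```

In a real system the pair loss would drive gradient updates of weight-sharing networks, so that a wrong temporary label tends to be exposed by pairs that disagree with it. The sketch only shows how temporary labels turn unlabeled images into same-label and different-label training pairs.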


Effects

The developed technique was applied to 1000 historical document images from Dunhuang, China[2]. The results show that, compared with traditional techniques, only a small amount of training data is needed to achieve a recognition accuracy of 81%. On average, the training data can be reduced by 70% for each character. In addition, when using 50 training samples per character, the technique achieves 88% recognition accuracy, while the previously published record in the academic community is 82%.

Figure 2: The effect of the developed character recognition technique applied to the Dunhuang dataset


Future plans

Fujitsu R&D Center will continue to promote its historical book digitization solution to major local libraries and archives in China, advancing the integration of ancient book digitization, image retrieval, and cultural preservation. This technology can also effectively improve recognition accuracy for Japanese, Korean, and other languages. With the aim of integrating it into Fujitsu's AI platform, "Human Centric AI Zinrai", in 2018, we will continue to apply this effective small-sample training technique to more object recognition fields, such as the classification of multiple image data and the identification of plant species.

Notes

  • [1] Fujitsu R&D Center Co., Ltd.
  • [2] The Chinese Dunhuang historical document images: A set of ancient documents and digital texts published on the Internet by the National Library of China for public and academic use. Because it has no access restrictions and can be widely used, it is often used as a standard dataset for the recognition of ancient literature.

About Fujitsu

Fujitsu is the leading Japanese information and communication technology (ICT) company offering a full range of technology products, solutions and services. Approximately 156,000 Fujitsu people support customers in more than 100 countries. We use our experience and the power of ICT to shape the future of society with our customers. Fujitsu Limited (TSE: 6702) reported consolidated revenues of 4.7 trillion yen (US$41 billion) for the fiscal year ended March 31, 2016. For more information, please see http://www.fujitsu.com.

About Fujitsu R&D Center Co., Ltd.

Established in 1998, Fujitsu R&D Center Co., Ltd. is a wholly owned R&D center of Fujitsu Limited, located in Beijing. The center's research areas cover the major business fields of the Fujitsu Group, including information processing, telecommunications, semiconductors, and software and services. For more information, please see: http://www.fujitsu.com/cn/frdc/en/

Technical Contacts

E-mail: hndwrt-recog@ml.labs.fujitsu.com
Company: Fujitsu Laboratories Ltd.
Knowledge Information Processing Laboratory

Press Release ID: 2017-02-21
Date: 21 February, 2017
City: Beijing, China
Company: Fujitsu Research and Development Center Co., Ltd.