Archived content

NOTE: this is an archived page and the content is likely to be out of date.

Text and Background Separation Technology for High Performance Document Image Compression

Fujitsu Research & Development Center Co. Ltd.

February 12, 2015

Fujitsu Research and Development Center today announced an accurate text and background separation technology for high-performance document image compression. It separates an image into a background layer (paper texture and pictures) and a foreground layer (text and line drawings). A compression algorithm suited to each layer is then applied, preserving the sharp color transitions between high-contrast areas while keeping the file size compact for efficient storage and transfer.

Unlike traditional text and background separation technologies, our method classifies the pixels of a document image into four categories: dark text, light text, figure and background. This intermediate representation enables robust text separation and produces visually more pleasing compressed images for various document types. Our technology allows very high resolution images of scanned documents to be distributed on the Internet while keeping the file size compact. Information that was previously trapped in hard-copy form can now be made available to a wide audience.

Details of the new technology will be introduced at the IS&T/SPIE Symposium on Electronic Imaging 2015, to be held 8-12 February 2015 in San Francisco, California, United States.

【 Background 】

Despite the remarkable development of networked means of information distribution such as portal websites, personal blogs and social media, the vast majority of human knowledge is still recorded in paper documents. One obstacle to digitizing this rich content and putting it on the network is the large file size of high-resolution scanned images. For example, a typical A4-size color document scanned at 300 dpi requires approximately 25 megabytes of storage without compression. Document digitization demands both high image quality and small file size, and these two requirements conflict. Content-based document image compression is one promising solution. Our algorithm achieves world-leading performance in separating text from images. The first target application is document image scanners.
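The 25-megabyte figure can be checked with a short calculation. The sketch below assumes 24-bit RGB (3 bytes per pixel) and the standard A4 dimensions of 210 mm by 297 mm; the function name `uncompressed_bytes` is illustrative only.

```python
MM_PER_INCH = 25.4

def uncompressed_bytes(width_mm, height_mm, dpi, bytes_per_pixel=3):
    """Raw size of a scanned page, assuming a fixed number of bytes per pixel."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px * height_px * bytes_per_pixel

# A4 page (210 mm x 297 mm) at 300 dpi, 24-bit color (assumed)
size = uncompressed_bytes(210, 297, 300)
print(size)                          # 26099520
print(f"{size / 2**20:.1f} MiB")     # 24.9 MiB
```

The page comes to 2480 by 3508 pixels, or roughly 26 million bytes, which is consistent with the approximate figure quoted above.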

【 Topics 】

Conventional image encoding formats such as JPEG and JPEG 2000 are designed for natural scene images, and they tend to produce prohibitively large files for documents at adequate resolution. The mixed raster content (MRC) standard specifies a framework for document compression that can dramatically improve the tradeoff between image quality and compression rate. Among the modules of the MRC encoding framework, text segmentation is the most important step: it creates a binary mask that separates text and line graphics from natural image and background regions in the document.
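To illustrate the role of the binary mask, here is a minimal sketch of how a decoder in an MRC-style scheme recombines the layers: where the mask is set, the pixel comes from the foreground layer, otherwise from the background layer. It assumes all three layers have the same dimensions and are stored as lists of rows; real MRC files may store layers at different resolutions. The function name `compose_mrc` is hypothetical.

```python
def compose_mrc(mask, foreground, background):
    """Rebuild a page from a binary mask and two same-sized color layers."""
    return [
        [fg if m else bg for m, fg, bg in zip(mask_row, fg_row, bg_row)]
        for mask_row, fg_row, bg_row in zip(mask, foreground, background)
    ]

# Tiny 2 x 3 example with grayscale values for brevity
mask       = [[1, 0, 0],
              [0, 1, 1]]
foreground = [[0, 0, 0],
              [0, 0, 0]]          # black text color everywhere
background = [[250, 251, 252],
              [253, 254, 255]]    # paper texture

page = compose_mrc(mask, foreground, background)
print(page)   # [[0, 251, 252], [253, 0, 0]]
```

Because the mask alone carries the sharp text edges, the two color layers can be smooth and compressed aggressively.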

【 Technology 】

Our text separation algorithm takes into account variations in text font style, size, intensity polarity and string orientation. It avoids two kinds of failure:

  • Text region missing error
  • Background elements being wrongly extracted as text

Our method first separates the pixels of a document image into four categories: “dark text/lines”, “bright text/lines”, “dark figures/graphics” and “white background”, as shown in Figure 1. With this representation, we can extract text regions that appear in a variety of contexts, as well as picture regions embedded in a smooth background. To further refine the initial segmentation, we group candidate text components into paragraphs and text lines and reject non-text components (i.e., false text detections).

Figure 1: The technology separates the pixels of a document image into four categories: dark text/lines (gray level = 0), bright text/lines (gray level = 128), figures (gray level = 192), and white background (gray level = 255).
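The four-level label map in Figure 1 can be reduced to the binary text mask used for layer separation. The sketch below only shows that reduction, using the gray levels given in the caption; it does not reproduce the classification step itself, and the names `TEXT_LEVELS`, `text_mask` and `category_counts` are hypothetical.

```python
from collections import Counter

# Gray levels taken from the Figure 1 caption
DARK_TEXT, BRIGHT_TEXT, FIGURE, BACKGROUND = 0, 128, 192, 255
TEXT_LEVELS = {DARK_TEXT, BRIGHT_TEXT}

def text_mask(label_map):
    """Return 1 where a pixel is dark or bright text/lines, else 0."""
    return [[1 if v in TEXT_LEVELS else 0 for v in row] for row in label_map]

def category_counts(label_map):
    """Count how many pixels fall into each gray level."""
    return Counter(v for row in label_map for v in row)

labels = [[255,   0,   0, 255],
          [192, 192, 128, 255]]

print(text_mask(labels))               # [[0, 1, 1, 0], [0, 0, 1, 0]]
print(category_counts(labels)[255])    # 3
```

Keeping dark and bright text as separate categories lets both normal and inverted text end up in the same foreground mask.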

After the document image is segmented into foreground and background layers, a different compression algorithm can be applied to each layer so that a high compression ratio is achieved. However, the two layers inevitably contain many “holes”. For example, the background layer contains only the pixels of the background region, so the positions of the foreground pixels in the background image carry no meaningful color information. To achieve a high compression ratio for both the background and foreground images, we developed a hierarchical blank region filling technology that produces a complete, smooth image. Figure 2 shows an example of the filled foreground and background images.

Figure 2: An example of foreground and background image filling
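The idea of filling holes so that a layer becomes smooth can be sketched with a much simpler technique than the hierarchical method described above: repeatedly assign each unknown pixel the average of its already-known 4-neighbors until no holes remain. This is a stand-in for illustration only, not the Fujitsu algorithm; it works on a single grayscale channel, and the name `fill_holes` is hypothetical.

```python
def fill_holes(image, known):
    """Fill unknown pixels by iteratively averaging known 4-neighbors.

    image: list of rows of numbers; known: same shape, True where valid.
    """
    h, w = len(image), len(image[0])
    img = [row[:] for row in image]
    known = [row[:] for row in known]
    while True:
        updates = []
        for y in range(h):
            for x in range(w):
                if known[y][x]:
                    continue
                vals = [img[ny][nx]
                        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                        if 0 <= ny < h and 0 <= nx < w and known[ny][nx]]
                if vals:
                    updates.append((y, x, sum(vals) / len(vals)))
        if not updates:   # nothing left to fill, or no known pixels at all
            break
        # Apply after the scan so each pass only uses the previous pass's values
        for y, x, v in updates:
            img[y][x] = v
            known[y][x] = True
    return img

row = fill_holes([[10, 0, 30]], [[True, False, True]])
print(row)   # [[10, 20.0, 30]]
```

Filled regions that blend into their surroundings avoid the artificial edges that would otherwise cost bits in a transform-based codec.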

Using our technology, full-color pages scanned at 300 dpi (dots per inch) can be compressed from 25 MB down to files of 30 to 100 KB. For color documents at similar quality, it typically achieves compression ratios about 5 to 10 times better than existing methods such as JPEG and GIF.
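As a rough check on what these figures imply, the following sketch computes the compression ratio for both ends of the stated range. It assumes decimal units (1 MB = 1,000,000 bytes, 1 KB = 1,000 bytes); the function name `compression_ratio` is illustrative.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Ratio of original size to compressed size."""
    return original_bytes / compressed_bytes

ORIGINAL = 25 * 1_000_000   # 25 MB, decimal units assumed

for compressed_kb in (30, 100):
    ratio = compression_ratio(ORIGINAL, compressed_kb * 1_000)
    print(f"{compressed_kb} KB -> about {ratio:.0f}:1")
# 30 KB -> about 833:1
# 100 KB -> about 250:1
```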

【 Future Plan 】

Together with our partners, Fujitsu R&D Center will promote our text and background separation technology to libraries and publishing institutions for document digitization. We will continue to improve the technology based on customer feedback.


All company or product names mentioned herein are trademarks or registered trademarks of their respective owners. Information provided in this press release is accurate at time of publication and is subject to change without advance notice.

About Fujitsu

Fujitsu is the leading Japanese information and communication technology (ICT) company offering a full range of technology products, solutions and services. Approximately 162,000 Fujitsu people support customers in more than 100 countries. We use our experience and the power of ICT to shape the future of society with our customers. Fujitsu Limited (TSE: 6702) reported consolidated revenues of 4.8 trillion yen (US$46 billion) for the fiscal year ended March 31, 2014. For more information, please see http://www.fujitsu.com.

About Fujitsu Research and Development Center

Established in 1998, Fujitsu Research and Development Center Co., Ltd. is a wholly owned R&D center of Fujitsu Limited, located in Beijing. The center's research areas cover the major business fields of the Fujitsu Group, including information processing, telecommunications, semiconductors, and software and services. For more information, please see: http://www.fujitsu.com/cn/en/about/local/subsidiaries/frdc/.

Technical Contacts

E-mail: sunjun@cn.fujitsu.com
Company: Fujitsu R&D Center Co., Ltd.
