Industries:

  • Education
  • Government
  • Life Sciences

Offering Groups:

  • Servers

Solution Areas:

  • XML/Web Services

Regions:

  • Japan

Challenges:

  • Faster access to a vast and rapidly growing database
  • Support for data volumes that increase by 1.5-2 times annually
  • Large reductions in the time and cost required for indexing and tuning when updating data

Benefits:

  • Stable, high-speed retrieval in 5-6 seconds is maintained regardless of data volume (about 43 million records), complexity of search conditions, or amount of traffic.
  • Growing data volumes can be supported simply by adding CPUs and memory. Search performance is assured, and the outlook for the future is clearer.
  • Indexing and tuning become unnecessary. Data can be updated with simple text manipulation.

National Institute of Genetics - Center for Information Biology and DNA Data Bank of Japan (DDBJ)


World’s fastest next-generation bio-database using the Fujitsu-developed XML database engine [Shunsaku].

The decoding of the human genome was completed and released in April 2003, and genome-related research has shifted from decoding to application. Three major international DNA data banks support research and development in biotechnology, an increasingly important area of research. The Center for Information Biology and DNA Data Bank of Japan (DDBJ) at the National Institute of Genetics manages the Japanese DNA data bank. To meet requirements such as support for, and high-speed searching of, rapidly increasing volumes of data, the center is developing next-generation biotechnology databases in conjunction with Fujitsu.

Interstage Shunsaku Data Manager, the world-leading XML-based database engine developed by Fujitsu, has been adopted as the infrastructure technology for the next-generation biotechnology database. Its unique functionality, which goes beyond earlier concepts of database management, supports remarkably fast searching while removing the need for excessive indexing. This opens the way to solving problems that would otherwise occur with current and future high-volume biotechnology databases. The test version, a stepping stone towards a de facto standard in the life sciences field, has been released publicly.

As a result, Fujitsu's advanced technology is contributing to the early realization of ubiquitous biotechnology based medical/health treatment on an international scale.

Introduction Background

Support for huge volumes of data (100 Million data articles, equivalent to 500 years of news, projected)

Professor Takashi Gojobori
Chief of the Center for Information Biology and DNA Data Bank of Japan (DDBJ)

The National Institute of Genetics, the center of Japan's life sciences, was established in 1949, 4 years before the discovery of the DNA double-helix by Watson and Crick (1953). The National Institute of Genetics is very much a pioneer in life sciences.

Information technology skills that quickly process and analyze the DNA data accumulated every day are essential in today's life science research. The Center for Information Biology and DDBJ within the National Institute of Genetics, as an international base for bioinformatics (Note 1), must answer these needs of researchers.

The DDBJ is one of three international DNA data banks. The other two are provided by Europe's molecular biology laboratories (the European Bioinformatics Institute, EBI, of the European Molecular Biology Laboratory, EMBL) and America's National Center for Biotechnology Information (NCBI/GenBank). To manage the DNA data discovered and validated through experiments performed by biotechnology researchers, the three data banks collaboratively construct the DDBJ/EMBL/GenBank International Base Sequence Database (Note 2) and release it publicly via the internet.

About 43 million articles of DNA data (120 billion characters, around 200 years of news) are currently registered in the international base sequence database. In addition, data volumes are increasing by 1.5-2 times yearly.
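As a rough worked example of what this growth rate implies, the Python sketch below estimates how long it would take to grow from about 43 million articles to the projected 100 million. It assumes constant compound growth at the quoted 1.5 to 2 times per year, which is a simplification, and `years_to_reach` is a hypothetical helper introduced only for this illustration.

```python
import math


def years_to_reach(current: float, target: float, yearly_factor: float) -> float:
    """Years of compound growth needed for `current` to reach `target`."""
    return math.log(target / current) / math.log(yearly_factor)


current_records = 43e6    # about 43 million articles, from the text
target_records = 100e6    # projected 100 million articles, from the text

# Assumption: growth compounds at a constant factor each year.
slow = years_to_reach(current_records, target_records, 1.5)
fast = years_to_reach(current_records, target_records, 2.0)
print(f"{fast:.1f} to {slow:.1f} years")
```

Under these assumptions the projected volume is only about one to two years away, which is why search performance that holds up as the data grows mattered so much.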

"The absolute quantity and the rates of increase of the [DDBJ] database are huge. As the data load grows, a lot of time is required for searches. Research and development in biotechnology is a race against time, with speed an indispensable element. Providing a solution to the conflicting problems of huge volumes of data and high speed was a big problem for DDBJ, EMBL and GenBank from the beginning," said Professor Takashi Gojobori, Chief of the Center for Information Biology and DNA Data Bank of Japan (DDBJ).

Introduction Process

Many millions of articles are searched instantaneously. DDBJ data was used for the prototype version, with amazing results.

The DDBJ holds data on living things, together with patented and registered base sequence information from genome projects all over Japan. Its use has now been extended to researchers around the world as well as in Japan. With DDBJ seen as the face of bioinformatics in Japan, the problems had to be resolved quickly.

These problems could be consolidated into four points. The first was a review of the fundamental database to achieve a breakthrough in data management. The second was the creation of features that would differentiate DDBJ from the other two major international DNA data banks. The third was to achieve results in research and development in advanced bioinformatics. The fourth was the acceleration of university-industry cooperation. "Fujitsu has been supporting the applications and maintenance of the supercomputers used by DDBJ for over 10 years. When Fujitsu learned, through our association, of the problems faced by the DDBJ, they proposed the adoption of Shunsaku, a new-paradigm XML (Note 3) database engine that is very fast and completely different from earlier databases." (Professor Gojobori)

"What is considered high-speed? To see is to believe. Since it was possible in theory to use XML data directly, and the DDBJ data was already in XML format, it could be used in a trial. It was really surprising to be able to search millions of articles of data instantaneously. So we put in the effort to understand this technology." (Professor Gojobori)

Shunsaku has at its core the super-high-speed algorithm "SIGMA" (Note 4). The algorithm was developed by Arikawa Setsuo's R&D group at Kyushu University's Department of Science, and Fujitsu has spent 10 years refining and perfecting it.

"Professor Arikawa has been well known for many years. We hoped to take a unique technology developed and implemented in Japan and nurture it further, with the world in view. In addition to solving current problems, I believe Shunsaku has the technical answers to the future development and evolving usage of the DDBJ database. Once these details of the system implementation were decided, it became an applied joint research and development project in the life sciences field." (Professor Gojobori)

Effects of Introduction

In addition to high-speed searching, using Shunsaku as the infrastructure technology simplified updating, as indexing was unnecessary.

The first next-generation biotechnology database, containing 35 million articles of data, was released to the researchers of the National Institute of Genetics in July 2004.

"Above all, everyone was pleasantly surprised by the search and retrieval speeds. Until now, the researchers had believed that search and retrieval were time-consuming. Complicated searches could take 10-20 minutes or more. With Shunsaku, this was shortened to 5-6 seconds, even with very complicated search conditions. Many were amazed by this technology," said Professor Gojobori. "Another surprise is that the retrieval time hardly changes even when the number of accesses increases," he continued.

In addition to the super-high-speed algorithm "SIGMA", Shunsaku uses high traffic management technology (Note 5), which organizes multiple search requests. This achieves stable high speeds regardless of the complexity of each individual search request or the amount of traffic.
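The idea of organizing multiple search requests can be illustrated with a minimal Python sketch. It is not Fujitsu's high traffic management technology, whose internals this case study does not describe. It only assumes that pending requests are collected into a batch and answered during one sequential pass over the records. The function name `batch_search` and the sample records are hypothetical.

```python
from typing import Dict, List


def batch_search(records: List[str], queries: List[str]) -> Dict[str, List[int]]:
    """Answer every query in one pass over the records.

    Returns a mapping from each query string to the positions of the
    records that contain it as a substring.
    """
    hits: Dict[str, List[int]] = {query: [] for query in queries}
    for position, record in enumerate(records):   # each record is read once
        for query in hits:                         # and checked against all pending queries
            if query in record:
                hits[query].append(position)
    return hits


# Hypothetical sample data, for illustration only.
records = [
    "<entry><organism>Homo sapiens</organism><gene>BRCA1</gene></entry>",
    "<entry><organism>Mus musculus</organism><gene>Trp53</gene></entry>",
    "<entry><organism>Homo sapiens</organism><gene>TP53</gene></entry>",
]
results = batch_search(records, ["Homo sapiens", "Trp53"])
print(results)
```

With this arrangement the number of passes over the data does not grow with the number of requests in a batch, although the matching work per record still does.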

"The ease of data update has also been highly appreciated. Many corrections are made to the DDBJ data despite its huge size. The former relational database (RDB) required registration, indexation (Note 6) and standardization (unification of data items) whenever the database was changed or updated. This needed huge amounts of manpower, time and cost. With Shunsaku, indexation is not necessary, and only simple text manipulation is needed for updates." (Professor Gojobori)

As the Shunsaku search method simply reads the XML text data word by word from start to finish, it is easy to increase the number and type of registration items. Moreover, as the number of words and items is not restricted, it is not necessary to plan the data area meticulously beforehand. As a result, indexing and tuning (Note 7) become unnecessary. "As indexation has become unnecessary, each search can be carried out freely on all items without being bound by the index. This guarantees the flexibility of research in the future and is a breakthrough in its own right." (Professor Gojobori)
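A minimal Python sketch of this kind of index-free searching: each XML record is read from start to finish and any element can serve as a search condition, so a record with a new element needs no schema change or index rebuild. The sketch uses Python's standard `xml.etree.ElementTree` parser and plain substring comparison. It is not the SIGMA algorithm or the Shunsaku API, and the element names and records are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, List


def scan(records: List[str], element: str, keyword: str) -> Iterator[str]:
    """Yield each XML record in which some <element> contains keyword."""
    for text in records:                      # sequential read, no index lookup
        root = ET.fromstring(text)
        for node in root.iter(element):       # matching elements at any depth
            if keyword in (node.text or ""):
                yield text
                break                         # one match per record is enough


# Hypothetical records; the second one carries an extra <note> element.
records = [
    "<entry><organism>Homo sapiens</organism>"
    "<definition>tumor protein p53</definition></entry>",
    "<entry><organism>Mus musculus</organism>"
    "<definition>tumor protein p53</definition>"
    "<note>revised annotation</note></entry>",
]

human = list(scan(records, "organism", "Homo"))
noted = list(scan(records, "note", "revised"))
print(len(human), len(noted))
```

The extra `<note>` element in the second record becomes searchable as soon as the record is stored, which is the flexibility the paragraph above describes. The trade-off is that every search reads all of the data.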

[Figure: System configuration]

With regard to the pending problem of increasing amounts of data, Shunsaku's search time is as much a function of the number of CPUs and amount of memory as of the amount of data. The problem can therefore be solved simply by adding CPUs and memory, and search performance can be assured in the same way. "In response to future increases in the amount of data, the simple addition of CPUs and memory can more than support the increase. Fujitsu's explanation of this process promises a bright future." (Professor Gojobori)
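The scaling argument can be sketched in Python under a simple assumption that the case study does not state: the data is split into near-equal partitions, one per worker, and each worker sequentially scans only its own partition. The names `partition`, `search_partition` and `parallel_search` are hypothetical. Threads stand in for the additional CPUs here; in CPython they illustrate the structure rather than deliver a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def partition(records: List[str], workers: int) -> List[List[str]]:
    """Split records into `workers` near-equal slices (round-robin)."""
    return [records[i::workers] for i in range(workers)]


def search_partition(part: List[str], keyword: str) -> List[str]:
    """Sequentially scan one partition for records containing keyword."""
    return [record for record in part if keyword in record]


def parallel_search(records: List[str], keyword: str, workers: int) -> List[str]:
    """Scan all partitions concurrently and merge the matches."""
    parts = partition(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(search_partition, parts, [keyword] * workers))
    return sorted(match for part_hits in results for match in part_hits)


# Hypothetical data: doubling both the records and the workers keeps the
# largest partition, and so the longest single scan, the same size.
small = [f"record-{i:04d}" for i in range(1000)]
large = [f"record-{i:04d}" for i in range(2000)]
small_load = max(len(p) for p in partition(small, 2))
large_load = max(len(p) for p in partition(large, 4))
print(small_load, large_load)
print(len(parallel_search(large, "record-19", 4)))
```

Under this assumption the per-worker load is unchanged when data and workers double together, which is the sense in which adding CPUs and memory can keep search time steady.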

Future visions and expectations of Fujitsu

De facto standard for database technology in the field of life sciences

The Center for Information Biology and DNA Data Bank of Japan test-released the prototype version of their DDBJ next generation biotechnology database, together with a new keyword search system called ARSA (All-round Retrieval of Sequence and Annotation), on their web site on 27th December 2004.

"ARSA is being test-released now. In full-scale operation, it is believed to be the fastest in the world." (Professor Gojobori)

Closer integration of the data created by multiple projects is essential to advancing research and development in the biotechnology field. A data grid (Note 8), which provides virtual integration of two or more databases, has therefore become an important concept.

"In achieving a data grid, indexation and regularization are big hurdles with past database technology. With Shunsaku, however, these hurdles are likely to be cleared without difficulty. In addition, there have been requests for further development in complex high-speed retrieval of different types of information and in natural language processing. We believe that Shunsaku has the potential to support these new needs. For example, Shunsaku can be applied to improve preventive healthcare and tailor-made medical treatments (Note 9). Fujitsu aims to be the world's No. 1 in such database technology. This is no dream, as the database technology developed with DDBJ is progressing towards becoming Japan's de facto standard in the world-changing life sciences field. It could contribute to breakthroughs in other industries as well." (Professor Gojobori)

Based on the results of this joint development, Fujitsu is planning to introduce this technology to a variety of businesses. In the future, with bioinformatics positioned as key to biological research, Shunsaku will continue to contribute as an advanced and comprehensive information technology. It will enable progress towards a ubiquitous society for health and medical treatment based on targeted and tailor-made medical treatments and genome medicine (Note 10).


Organizational Outline

National Institute of Genetics

  • Head of Institute: Yuji Kohara
  • Established: 1949. Restructured as National Institute of Genetics in 2004.
  • Aims: Conduct research on genetics and its applications, and contribute to the development of academic research
  • Homepage: National Institute of Genetics

Center for Information Biology and DNA Data Bank of Japan (DDBJ), National Institute of Genetics

Profile of Professor Takashi Gojobori

  • Present Profession: Director and Professor, Center for Information Biology and DNA Data Bank of Japan (DDBJ), National Institute of Genetics / Professor, School of Life Science, The Graduate University for Advanced Studies / Vice-Director, Japan Biological Information Research Center (JBIRC), National Institute of Advanced Industrial Science and Technology (AIST) / Visiting Professor, Keio University.

Footnote

Note 1: Bioinformatics

A technical field which combines biotechnology and information technology.

Note 2: International nucleotide sequence database

The International Nucleotide Sequence Database Collaboration (INSDC, http://insdc.org) is a joint effort to collect and disseminate databases containing DNA and RNA sequences. It involves the following computerized databases: DNA Data Bank of Japan (Japan), GenBank (USA) and the EMBL Nucleotide Sequence Database (European Molecular Biology Laboratory, Germany). New and updated data on nucleotide sequences contributed by research teams to each of the three databases are synchronized on a daily basis through continuous interaction between the staff at each collaborating organization.

Note 3: XML (Extensible Markup Language)

The XML (Extensible Markup Language) is a W3C-recommended general-purpose markup language that supports a wide variety of applications. It is one of the infrastructure technologies of information distribution and IT integration in the broadband era.

Note 4: Super-high-speed algorithm [SIGMA]

A high-speed string pattern-matching algorithm based on one-way sequential search, developed by Professor Arikawa Setsuo (Kyushu University).

Note 5: High Traffic Technology

A technology developed by Fujitsu, which organizes multiple search requests, achieving stable high speed, regardless of complexity of search requests and amount of traffic. (Patent pending)

Note 6: Index

Classification of location information on the retrieval target, in order to achieve high-speed retrieval.

Note 7: Tuning

Adjustment of database to optimal condition, to allow high-speed processing of data operations.

Note 8: Data Grid

An infrastructure which allows the user to fully utilize the data, which has been distributed across various systems, regardless of the access method and location of data.

Note 9: Tailor-made Medical Treatment

Medical treatment which is customized to individual needs. For example, analysis of minute differences of individual genes, and judgment of efficacy and side effects of drugs prior to application.

Note 10: Genome Medicine

Medicine developed on the basis of information about specific genes and proteins, such as those that make a person prone to cancer. By matching the medicine to an individual's genes, highly effective treatments can be developed.


Note: This content is a translation of a case study in Japan dated February 23, 2005.
All job titles, proper names and numerical values are correct as at the date of publication on this website. Please note that they may have changed at time of browsing.


For more information:

  • Interstage Shunsaku Data Manager: Interstage Shunsaku Data Manager
  • Industry Standard Servers: PRIMERGY
  • UNIX servers: PRIMEPOWER