Kawasaki, Japan, August 20, 2012
Fujitsu Laboratories Limited today announced the development of the industry's first integrated big data development platform for processing large volumes of diverse time-series data.
In recent years, the volume of diverse data, as represented by sensor data, human location data, and other kinds of time-series data, has continued to grow at an explosive pace. This has prompted the development of parallel batch processing technologies such as Hadoop(1), as well as complex event processing technologies(2) for processing data in real time. However, because each processing technology has used a different development and execution environment, it has been difficult to quickly apply insights gained from data analysis results to real-time processing applications. Moreover, maximizing the performance of Fujitsu's event processing engine(3) has required considerable knowledge of parallel application design, such as how to estimate network traffic.
By developing an integrated development platform able to handle description languages for both stored data analysis and complex event processing, Fujitsu was able to reduce development time for both batch and event processing by roughly 80% (from 8 weeks to 1.5 weeks) in a case study involving POS analysis-based coupon issuing. The platform is also equipped with a newly developed parallelism extraction function that automatically improves the processing efficiency of complex event processing. This function automatically extracts parallelism opportunities from event processing applications and recommends how to combine each analysis step in a way that optimizes execution plans without any extra effort.
This is one of the technologies that will be put to use to support human-centric computing, which will provide precisely targeted services anywhere.
Details of the new technology will be published at the IPSJ/SIGSE Software Engineering Symposium 2012 (SES2012), to be held from August 27 to 29, 2012.
Background
In recent years, the volume of diverse data, including sensor data, human location data, and other kinds of time-series data, has continued to grow at an explosive pace. There is strong demand for efficiently extracting valuable information from this kind of "big data" and putting it to immediate use in delivering services, such as various navigation services.
Challenges
To process big data, Hadoop and other parallel batch processing technologies are deployed to analyze large volumes of stored data, while complex event processing technologies are deployed to process event data in real time as it arrives. However, these technologies are supported by different development and execution environments that, until now, have not been integrated. As a result, it has been difficult for analysts to quickly apply insights gained from data analysis results to complex event processing.
Furthermore, high-speed processing across multiple servers in the cloud has proven crucial for complex event processing of large volumes of events. While Fujitsu has implemented a distributed event processing engine, raising its performance simply by provisioning new servers, so as to take advantage of a cloud computing environment, has depended on a complex application design phase to identify effective ways of distributing each processing step.
Newly Developed Technology
Fujitsu has developed an integrated development platform that combines big data analysis and complex event processing. Using this platform, for example, companies can analyze the most up-to-date purchasing trends from accumulated POS data and then home in on a specific customer segment to issue coupons in real time, all as part of a simple process that does not require additional programming.
This technology consists of two parts: 1) a development platform integration feature that automatically generates programs, regardless of development language; and 2) a parallelism extraction function that automatically improves the processing efficiency of complex event processing (Figure 1).
Figure 1: Overview of the Integrated Development Platform
Features of the newly developed technology are as follows.
1. Development platform integration feature
After the processing details have been defined via data-flow diagrams and properties (corresponding to processing parameters), an appropriate set of patterns is selected and used to automatically generate either batch or real-time processing programs. During this generation phase, the operations produced by the selected templates are automatically supplemented with data conversion steps wherever necessary. Finally, the generated programs are deployed and executed on either a batch or a real-time execution environment (Figure 2).
Figure 2: Development Platform Integration Feature
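The generation flow described above can be illustrated with a minimal sketch. It is not Fujitsu's implementation: the `Step` class, the `TEMPLATES` table, the `generate` function, the record formats, and the property values are all hypothetical names chosen for illustration. It shows one data-flow definition being turned into either a batch or a real-time program, with a conversion step inserted wherever adjacent steps disagree on record format.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One node of a data-flow diagram (hypothetical structure)."""
    name: str         # name of the processing step
    consumes: str     # record format the step expects
    produces: str     # record format the step emits
    properties: dict  # processing parameters set by the analyst


# Hypothetical templates: one per target execution environment.
TEMPLATES = {
    "batch": "batch_job(step={name!r}, params={params})",
    "realtime": "event_rule(step={name!r}, params={params})",
}


def generate(flow, target):
    """Render each step with the target's template, adding conversions as needed."""
    template = TEMPLATES[target]
    program = []
    previous = None
    for step in flow:
        # Insert a data conversion step when formats do not line up.
        if previous is not None and previous.produces != step.consumes:
            program.append(f"convert({previous.produces!r} -> {step.consumes!r})")
        program.append(template.format(name=step.name, params=step.properties))
        previous = step
    return program


# Assumed example flow, loosely modeled on the POS coupon case study.
flow = [
    Step("select_segment", consumes="csv", produces="csv",
         properties={"min_purchases": 3}),
    Step("issue_coupon", consumes="json", produces="json",
         properties={"discount": 10}),
]

batch_program = generate(flow, "batch")
realtime_program = generate(flow, "realtime")
```

In this sketch, changing a value in `properties` changes the generated program without touching the templates, which mirrors the idea of adjusting processing parameters without additional programming.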
2. Parallelism extraction function for complex event processing
The parallelism extraction function extracts parallelism from real-time processing programs that have been automatically generated by the integrated development platform. The function automatically recommends optimal combinations of parallelization schemes in order to decrease network traffic (Figure 3).
In real-time processing, incoming events can be distributed among multiple servers for parallel execution. For each processing step, various distribution schemes are applicable, and performance varies greatly depending on the scheme that is chosen. Generally speaking, a distribution scheme with finer granularity makes it easier to distribute loads evenly and facilitates better performance. For event processing, however, better performance is achieved by reducing network traffic. Here, traffic is minimized by applying a uniform distribution scheme that tries to place inter-dependent processing steps on the same server whenever possible. Doing so avoids intermediary transfers and optimizes overall application performance.
At runtime, the recommended distribution scheme will be used to select an optimal server allocation strategy in response to event volume fluctuations, resulting in better overall performance.
Figure 3: Parallelism Extraction Function for Complex Event Processing
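The effect of the distribution scheme on network traffic can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the actual parallelism extraction function: the helper names `server_for` and `count_transfers`, the step names, and the use of a customer ID as the distribution key are all hypothetical. A "uniform" scheme routes every step of an event by the same key, while an "isolated" scheme routes each step independently.

```python
import hashlib


def server_for(route_key: str, servers: int) -> int:
    """Map a routing key to a server index with a stable hash."""
    digest = hashlib.sha256(route_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % servers


def count_transfers(events, steps, servers, uniform):
    """Count how often an event must move between servers between steps."""
    transfers = 0
    for event in events:
        previous = None
        for step in steps:
            if uniform:
                # Same key for every step: dependent steps share a server.
                route_key = event["customer"]
            else:
                # Each step distributes events on its own, independently.
                route_key = f"{step}:{event['customer']}"
            server = server_for(route_key, servers)
            if previous is not None and server != previous:
                transfers += 1
            previous = server
    return transfers


# Assumed workload: 100 events passing through three dependent steps.
events = [{"customer": f"c{i}"} for i in range(100)]
steps = ["aggregate", "match_segment", "issue_coupon"]

uniform = count_transfers(events, steps, servers=4, uniform=True)
isolated = count_transfers(events, steps, servers=4, uniform=False)
print("uniform:", uniform, "isolated:", isolated)
```

In this toy model the uniform scheme needs no transfers between steps, because every step of an event resolves to the same server. The simulation does not reproduce the measured results reported in this release; it only shows why co-locating inter-dependent steps removes intermediary transfers.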
Results
1. Development platform integration feature
In a Fujitsu case study, the new integrated development platform shortened development time for both batch and event processing by approximately 80% (from 8 weeks to 1.5 weeks). Moreover, because the parameters for each kind of processing can be modified without additional programming, trial-and-error tests are easy to perform on the development platform, for example to quickly apply insights gained from data analysis results to event search criteria.
2. Parallelism extraction function for complex event processing
The new parallelism extraction function generates executable programs that are specifically adapted for dynamic load balancing. Those programs can be easily scaled out or in without having to recompile the original application.
In addition, after measuring the performance of sample programs with different event distribution schemes, Fujitsu Laboratories confirmed that a uniform distribution scheme, by placing inter-dependent processing steps onto the same server, is able to reduce communications traffic by 60% and achieve a 3.5x improvement in processing efficiency compared to isolated distribution schemes that distribute each processing step independently (Figure 4).
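One way a program can be scaled without recompilation is to treat the set of servers as runtime data rather than something fixed in the code. The sketch below illustrates that general idea only; the `route` function and the server names are hypothetical, and this release does not describe how the generated programs actually implement load balancing.

```python
import hashlib


def route(customer_id: str, active_servers: list[str]) -> str:
    """Pick a server for a key from whatever servers are active right now."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(active_servers)
    return active_servers[index]


# The server list is plain data, so it can change while the program runs.
servers = ["cep-1", "cep-2"]
before = route("customer-42", servers)

servers.append("cep-3")  # scale out: no code change, no recompilation
after = route("customer-42", servers)
```

With simple modulo hashing like this, many keys move to a different server when the list changes; consistent hashing is a common alternative that limits how many keys are reassigned.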
Figure 4: Results from Parallelism Extraction Function for Complex Event Processing
Future Developments
Going forward, Fujitsu plans to further expand the features of the new technology while aiming to commercialize it in the company's platforms and middleware for big data by fiscal year 2013. Fujitsu will also explore deploying the technology in a wide range of applications, such as its services and products, in order to enable the utilization of valuable information generated through the process of collecting, accumulating and analyzing large volumes of sensor data.