Clock Design Technology for High-Performance Processors

Kinya Ishizaka  Hiroaki Komatsu

This paper introduces the technology used to distribute high-frequency clock signals across the whole chip in high-performance processors for servers. A clock synchronization system must be designed so that the clock signal reaches all synchronous sequential circuits in the processor at the same time. As clock frequencies have risen, chip sizes have grown, and low-resistance interconnects have appeared, the influence of interconnect inductance can no longer be ignored and must be given sufficient consideration in the design. The technology introduced here combines circuit design techniques that suppress clock skew and crosstalk noise with CAD techniques that apply them to an actual chip. By simulating the clock circuits as actually designed, the influence of inductance on clock transmission can be analyzed electrically at the design stage, which helps lead to higher-performance processors.

1. Introduction

The high-performance, high-speed processor designed by the authors adopts a clock synchronization system. By distributing a single clock signal across the whole chip, this system allows all circuit blocks to perform processing and signal transmission with the same timing. Ideally, the clock signal should reach all the synchronous sequential circuits in the processor at the same time. In practice, however, differences between signal paths produce a timing gap. This gap is called “clock skew,” and minimizing it is the key to achieving high performance.

To minimize clock skew while distributing a high-frequency clock over a long distance across the whole processor, it is effective to improve the drive capability of the clock distribution buffers and to reduce the resistance of the clock signal wiring (hereafter, clock wiring). With these measures in place, however, the influence of inductance in the processor interconnects can no longer be ignored, and it has a serious impact on delay and crosstalk noise. Adequate consideration must therefore be given to inductance when designing the circuit.

The processor designed by the authors in the current study distributes a 3.0 GHz clock signal over an area of approximately 21 mm square. This paper explains the design approach taken to achieve this distribution.

2. Clock distribution system

The current processor design adopts a clock tree system based on an H shape, as illustrated in Figure 1. Clock skew is reduced by using equal-length, equal-delay interconnects, which also keep the overall clock delay as small as possible. Where the delay balance is poor, a dummy load unrelated to the logic is connected to the clock tree to restore the desired balance.
At points where it is hard to equalize the delay only by distributing load and changing the interconnect path, one of several buffers with identical cell sizes is used. Each of these buffers contains a different internal delay line (in increments of 10 ps), so the delay can be changed substantially by replacing one buffer with another.

Even with these design approaches, the clock skew can exceed its design value, and in some cases it is difficult to achieve sufficient performance in the actual chip. Possible causes include process variation within the chip, temperature deviations, and power supply noise. To address this issue, a ring oscillator that uses the clock path is built in advance for each clock domain (i.e., a circuit domain controlled by the same clock), and the inter-domain clock skew is determined by observing the frequencies of these ring oscillators. In addition, a clock distribution buffer is installed whose delay can be increased or decreased via a scan chain, so that processor performance can be maximized during manufacture based on these observations.
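
How a frequency observation can be turned into a skew estimate is sketched below. This is not the authors' procedure: it assumes that one oscillation period equals two traversals of the loop, and that everything in the loop other than the clock path is identical in both domains and therefore cancels. The function names and frequencies are hypothetical.

```python
def loop_delay_ps(freq_mhz):
    """Loop delay of a ring oscillator, from its measured frequency.

    Assumption: one period equals two traversals of the loop
    (one per output polarity), so delay = period / 2.
    """
    period_ps = 1.0e6 / freq_mhz  # a 1 MHz signal has a period of 1e6 ps
    return period_ps / 2.0


def inter_domain_skew_ps(freq_a_mhz, freq_b_mhz):
    """Positive result: the clock path of domain B is slower than that of domain A.

    Assumption: the non-clock-path part of each loop is identical and cancels.
    """
    return loop_delay_ps(freq_b_mhz) - loop_delay_ps(freq_a_mhz)


# Hypothetical measurements: domain B oscillates slightly slower than domain A.
skew = inter_domain_skew_ps(500.0, 490.0)
```

A positive estimate like this would suggest adding delay to domain A or removing it from domain B through the scan-controlled buffer.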

In the sequential circuits, transparent latches are used to reduce path delay. Because a transparent latch passes its input data to its output continuously while the clock is active, care must be taken to avoid racing (a phenomenon in which incorrect data is passed to the next stage because the input data from the previous stage changes before the latch output has been determined). For this reason, a signal with a narrow pulse width (on the order of several tens of picoseconds) is input to the latch as the clock, instead of a signal with a 50% duty cycle. However, with such a narrow pulse, variations in chip characteristics may cause the pulse to disappear in some cases. To guard against this, a function to widen the pulse width is provided as a means of checking for or avoiding such failures.

As a power-saving measure, a clock distribution buffer with an integrated “Enable” function for its output signal is used. This buffer blocks unnecessary supply of the clock to inactive areas, which reduces active power.

3. Inductance effect of clock wiring

Clock signals are transmitted over longer distances and at higher frequencies than the signals used for general data transmission. This is achieved by using buffers with higher drive capability and upper-layer interconnects that are wider, thicker, and lower in resistance. For these reasons, the influence of inductance cannot be ignored. Figure 2 (a) compares simulation results for the same interconnect using the RLC model, which takes inductance into account, and the RC model, which does not. The waveforms are steeper under the influence of inductance.

In addition, impedance mismatch may lead to abnormal waveforms such as overshoot or undershoot [Figure 2 (b)] and bumps [Figure 2 (c)]. These inductance effects should be identified in the circuit design phase to avoid problems in the actual processor. To this end, clock path delays are calculated by circuit simulation equivalent to Simulation Program with Integrated Circuit Emphasis (SPICE) using the RLC interconnect model. At the same time, a waveform check is carried out. If a waveform is judged to be abnormal, it is corrected by changing the clock wiring path, the buffer placement, or the buffer drive capability until a normal waveform is obtained.
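
Why low-resistance wiring makes inductance visible can be illustrated with a single lumped series-RLC section. This is a textbook approximation, not the distributed RLC interconnect model used in the actual analysis, and the component values are hypothetical. The damping ratio is (R/2)·sqrt(C/L); when it falls below 1, the step response overshoots.

```python
import math


def damping_ratio(r_ohm, l_h, c_f):
    """Damping ratio of a series RLC section: zeta = (R / 2) * sqrt(C / L)."""
    return (r_ohm / 2.0) * math.sqrt(c_f / l_h)


def peak_overshoot_fraction(r_ohm, l_h, c_f):
    """Peak overshoot of the step response, as a fraction of the final value.

    Returns 0.0 for critically damped or overdamped sections (zeta >= 1),
    which respond like an RC model with no ringing.
    """
    zeta = damping_ratio(r_ohm, l_h, c_f)
    if zeta >= 1.0:
        return 0.0
    return math.exp(-math.pi * zeta / math.sqrt(1.0 - zeta * zeta))


# Hypothetical values: 0.1 nH and 1 pF, with the wire resistance varied.
L_H, C_F = 1e-10, 1e-12
high_r = peak_overshoot_fraction(100.0, L_H, C_F)  # resistance dominates
mid_r = peak_overshoot_fraction(10.0, L_H, C_F)
low_r = peak_overshoot_fraction(5.0, L_H, C_F)     # inductance dominates
```

In this approximation, halving the resistance lowers the damping ratio and raises the overshoot, which matches the observation that wider, thicker, lower-resistance wires need an RLC rather than an RC model.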

4. Wiring geometry to control noise

Large inductance in the clock wiring increases the crosstalk noise induced in the surrounding wiring. Attention must also be paid to crosstalk noise caused by capacitive coupling between the clock wiring and its surroundings. To minimize these effects, we shield the clock wiring by arranging dedicated power source wiring in its vicinity (Figure 3).

Multiple shield power source wires are arranged in parallel around the clock wiring. These wires serve as return paths for the clock signal current. Arranging the shield wiring as close to the clock wiring as possible reduces the loop inductance. Furthermore, finely dividing the clock wiring into stripes increases the number of return current paths, which reduces the loop inductance further. As a consequence, induced crosstalk noise is reduced. 1)

The shield power source wiring also reduces capacitive coupling between the clock wiring and general data wiring, which reduces capacitively coupled crosstalk noise as well.

This configuration offers the additional advantage of making it easier to extract wiring inductance and capacitance, because the clock wiring is closely surrounded by shield power source wiring.

5. CAD support (interconnect function)

In designing a processor using advanced technologies, automatic routing is used for the final-stage clock distribution network that supplies the latches. However, the critical clock distribution network upstream of this final stage, to which timing constraints such as clock skew apply, is routed manually. An interactive interconnect editor is used for this so that the designer’s intent in meeting these constraints can be easily reflected in the interconnects. 2), 3)

When clock wiring interleaved with shield wiring spans multiple wiring layers, a very complicated geometry is formed in which a via is allocated to each piece of the divided clock wiring and shield wiring, as illustrated in Figure 4. If wiring is carried out on such finely divided geometry, both initial wiring and later modifications take an extremely long time. To simplify the editing of this wiring, our team developed a “unified wiring” feature and an “interlock wiring of lower layer shield wiring” feature.

1) Unified wiring

Unified wiring represents multiple pieces of wiring by a single piece of virtual wiring. The virtual wiring is given a width that encloses all of the pieces of wiring within the same layer. Wiring is carried out in this unified state using the interactive interconnect editor [Figure 5 (a)]. Once wiring is complete, a dedicated tool generates the real divided wiring from the unified wiring according to the division rules [Figure 5 (b)]. In the CAD database, one network holds two types of data (real divided wiring and virtual unified wiring) so that they can be treated separately. In addition, because the virtual wiring uses a width that is not used for ordinary wiring, the original clock wiring can be identified from the wiring width alone. Because the virtual wiring is handled in the same way as real wiring, the divided wiring within a layer can be manipulated together as one unit, which gives the same operability as ordinary wiring during editing. As a secondary benefit, unified wiring makes drawings easier to read because the representation is simpler.
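
The generation of real divided wiring from a unified wire can be sketched as follows. The actual division rules are not given in this paper, so the sketch assumes the simplest pattern: clock and shield stripes of equal width alternate at uniform spacing, with a shield stripe on each outer edge. The function names and dimensions are hypothetical.

```python
def unified_width_um(n_clock, stripe_um, space_um):
    """Width of the virtual wire that encloses every stripe and gap."""
    count = 2 * n_clock + 1  # n clock stripes plus n + 1 shield stripes
    return count * stripe_um + (count - 1) * space_um


def divide_unified_wire(x_left_um, n_clock, stripe_um, space_um):
    """Return (kind, left edge, right edge) for each real stripe.

    Assumed division rule: shield, clock, shield, ... ending with a shield.
    """
    stripes = []
    x = x_left_um
    for i in range(2 * n_clock + 1):
        kind = "shield" if i % 2 == 0 else "clock"
        stripes.append((kind, x, x + stripe_um))
        x += stripe_um + space_um
    return stripes


# Hypothetical rule: two clock stripes, 1.0 um wide, 0.5 um apart.
stripes = divide_unified_wire(0.0, 2, 1.0, 0.5)
virtual_width = unified_width_um(2, 1.0, 0.5)
```

With such a rule, the editor only ever moves one rectangle of `virtual_width`, and the stripes are regenerated from it after each edit.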

2) Interlock wiring of lower layer shield wiring

Next, to improve operability and reduce human error, where clock wiring has shield power source wiring on the layer below, that lower layer shield wiring is generated automatically in step with the upper layer clock wiring. A dedicated shield wiring category is also defined so that this wiring is treated as clock signal wiring during editing and as power source wiring during via connection and checking. Grouping the data in this way makes it easier to handle.

With unified wiring, applying the same design rule checks (e.g., spacing) as for ordinary wiring can produce spurious violations, because the virtual wiring width differs from the width the rules assume. This problem was solved by having the checks obtain the real divided wiring data. A unified data access function returns either of the two types of wiring data (virtual or real) depending on what the CAD tool is processing, so the CAD program can process the data without being aware of the division. Also, if the shield power source wiring enclosed in the virtual wiring is connected to the power source, it appears as though the virtual wiring is shorted to the power source. So as not to confuse the designer, the way data is presented was changed, for example by displaying check results based only on the real wiring and by displaying the wiring information of the real wiring.
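
The idea of a unified data access function can be sketched as a single accessor that hides which representation a tool receives. The class, method, and purpose names below are hypothetical and do not reflect the actual CAD database interface.

```python
from dataclasses import dataclass, field


@dataclass
class ClockNet:
    """One network holding both representations of its wiring."""
    name: str
    virtual_wires: list = field(default_factory=list)  # unified wiring, for editing
    real_wires: list = field(default_factory=list)     # divided wiring, for checking

    def wires_for(self, purpose):
        """Return the representation suited to the calling tool.

        Assumption: editing sees the virtual wiring, while design rule
        checks and result display see the real divided wiring.
        """
        if purpose == "edit":
            return self.virtual_wires
        if purpose in ("check", "display"):
            return self.real_wires
        raise ValueError(f"unknown purpose: {purpose}")


# Hypothetical net: one virtual wire standing in for five real stripes.
net = ClockNet(
    name="clk_trunk",
    virtual_wires=["virtual_7.0um"],
    real_wires=["shield_0", "clock_0", "shield_1", "clock_1", "shield_2"],
)
for_editor = net.wires_for("edit")
for_checker = net.wires_for("check")
```

A checking tool written against `wires_for` never needs to know that the wire it was asked to check is drawn as a single virtual wire in the editor.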

6. CAD support (delay calculation)

In conventional clock wiring, where the wiring is not divided and each connection consists of a single piece of wiring, no closed loop arises in the wiring pattern within a single network. With divided wiring, however, vias are placed where the wiring changes layers, and closed loops are formed at such points. As a result, wiring delay cannot be calculated with simplified approaches such as the widely used Elmore delay model. Furthermore, to calculate delays with high accuracy while accounting for inductance, the delays must be calculated by waveform analysis of the kind performed in a SPICE simulation.
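
The restriction on the Elmore model can be seen in a minimal sketch of the calculation. The Elmore delay to a node is the sum, over each resistor on the path from the driver, of that resistance times the total capacitance downstream of it, which is only well defined when the network is a tree. The function below therefore rejects networks with closed loops, including two parallel stripes joined at both ends. The node names and element values are hypothetical.

```python
from collections import defaultdict


def elmore_delays(root, edges, cap):
    """Elmore delay from root to every node of an RC tree.

    edges: list of (node_u, node_v, resistance_ohm)
    cap:   {node: capacitance_farad}
    Raises ValueError if the network is not a tree.
    """
    adj = defaultdict(list)
    for u, v, r in edges:
        adj[u].append((v, r))
        adj[v].append((u, r))

    # Breadth-first traversal from the driver; parents precede children in `order`.
    parent = {root: (None, 0.0)}
    order = [root]
    for node in order:
        for nxt, r in adj[node]:
            if nxt not in parent:
                parent[nxt] = (node, r)
                order.append(nxt)

    # A connected network is a tree only if it has exactly (nodes - 1) resistors.
    if len(edges) != len(parent) - 1:
        raise ValueError("network is not a tree (closed loop or disconnected)")

    # Accumulate downstream capacitance from the leaves toward the root.
    downstream = {n: cap.get(n, 0.0) for n in order}
    for node in reversed(order):
        p, _ = parent[node]
        if p is not None:
            downstream[p] += downstream[node]

    delay = {root: 0.0}
    for node in order[1:]:
        p, r = parent[node]
        delay[node] = delay[p] + r * downstream[node]
    return delay


# Hypothetical tree: driver -> a, then a fans out to b and c.
tree = [("drv", "a", 100.0), ("a", "b", 50.0), ("a", "c", 50.0)]
caps = {"a": 1e-15, "b": 2e-15, "c": 2e-15}
delays = elmore_delays("drv", tree, caps)
```

Adding a resistor between `b` and `c`, as a via joining two stripes would, makes the call raise `ValueError`, which is the situation that forces waveform-based analysis.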

Meanwhile, when designing a processor, timing analysis is carried out on the whole chip for greater accuracy, because factors such as dummy metal inserted for planarization and the presence or absence of wiring on the upper layer must be taken into account. When handling data for the whole chip, the processing time, i.e., turn-around time (TAT), is an important consideration in the design phase. Specifically, whole-chip analysis means a larger circuit scale and a longer TAT, and this must be addressed.

Therefore, to ensure accurate timing analysis of the whole chip while reducing TAT, we carry out SPICE-equivalent delay calculation only for the clock distribution circuits. The analysis is performed in the following four steps:

1) Generation of a SPICE netlist from the capacitance extraction results
2) Circuit division focusing on the number of stages and connections
3) Circuit simulation using a high-speed, high-accuracy SPICE-compatible tool
4) Feedback of the analysis results

For the circuit division in Step 2), the tree structure from the PLL, which is the clock source, to the choppers that drive the latches in the final stage is divided into three groups along the direction of the stages. The groups are evaluated one by one, starting from the first. In the second and third groups, there are multiple inputs that serve as starting points of the tree structure, so these groups are further divided per input terminal and the circuit simulations are run in parallel. Furthermore, by managing the simulations with a queue, we made it possible to run a large number of circuit simulations efficiently. Together, the smaller circuit scale and the parallel processing reduced the analysis time.
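
The scheduling pattern described above (groups in sequence, subcircuits within a group in parallel, jobs drawn from a queue) can be sketched with a worker pool. The simulation itself is replaced by a placeholder that adds a fixed stage delay to the arrival time at the subcircuit input; the subcircuit names, delay values, and data layout are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def simulate_subcircuit(job):
    """Placeholder for one SPICE-equivalent run on a divided subcircuit."""
    name, input_arrival_ps, stage_delay_ps = job
    return name, input_arrival_ps + stage_delay_ps


def run_groups(groups, max_workers=4):
    """Evaluate groups in order; run the subcircuits of each group in parallel.

    groups: list of {subcircuit: (driving subcircuit or None, stage delay in ps)}
    Returns the clock arrival time at the output of every subcircuit.
    """
    arrival = {}
    # The executor keeps pending jobs in an internal queue and feeds idle workers.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for group in groups:
            jobs = [
                (name, arrival.get(driver, 0.0), delay)
                for name, (driver, delay) in group.items()
            ]
            # Each group waits for the previous one, whose outputs are its inputs.
            for name, t in pool.map(simulate_subcircuit, jobs):
                arrival[name] = t
    return arrival


# Hypothetical three-group division from the PLL to the choppers.
groups = [
    {"pll_trunk": (None, 120.0)},
    {"branch_0": ("pll_trunk", 80.0), "branch_1": ("pll_trunk", 82.0)},
    {"chopper_0": ("branch_0", 40.0), "chopper_1": ("branch_1", 39.0)},
]
arrival = run_groups(groups)
```

In a real flow each job would launch an external simulator process, so the gain comes from running many small netlists concurrently instead of one large one.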

In addition, we established a system that automatically detects undesirable waveforms such as overshoot, undershoot, and bumps from the waveform data produced by the circuit simulation. Automating what had been a visual inspection of a large amount of clock waveform data significantly reduced both the effort required and human error.
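
A waveform check of this kind might look like the following sketch for a rising edge. The detection criteria actually used are not stated in this paper; the 10% margin and the definition of a bump as a voltage drop while the signal is between 10% and 90% of the supply are assumptions.

```python
def check_rising_edge(samples_v, vdd=1.0, margin=0.1):
    """Return the set of abnormalities found in a sampled rising edge.

    Assumed criteria:
      overshoot  - any sample above vdd * (1 + margin)
      undershoot - any sample below -vdd * margin
      bump       - the voltage falls while still inside the 10%..90% transition
    """
    issues = set()
    if max(samples_v) > vdd * (1.0 + margin):
        issues.add("overshoot")
    if min(samples_v) < -vdd * margin:
        issues.add("undershoot")

    low, high = 0.1 * vdd, 0.9 * vdd
    for prev, cur in zip(samples_v, samples_v[1:]):
        if low < prev < high and cur < prev:
            issues.add("bump")
            break
    return issues


# Hypothetical sampled waveforms in volts.
clean = check_rising_edge([0.0, 0.2, 0.5, 0.8, 1.0, 1.0])
ringing = check_rising_edge([0.0, 0.4, 0.95, 1.25, 1.05, 1.0])
bumpy = check_rising_edge([0.0, 0.3, 0.6, 0.5, 0.8, 1.0])
dipping = check_rising_edge([0.0, -0.15, 0.3, 0.7, 1.0])
```

Running such a check on every simulated clock node turns the visual inspection into a list of flagged nodes for the designer to review.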

7. Conclusion

This paper outlined the circuit and measurement technologies used to minimize clock skew and noise, together with the CAD technology used to implement them in real chips. These technologies are used to design the clock of a high-performance, high-speed processor for servers. By developing clock control technologies further, both to deliver higher-quality clocks and to conserve energy, we hope to help enhance processor performance and make servers more competitive.

References