An integer cosine transform chip design for image compression

G. A. Ruiz, J. A. Michell, A. M. Barón, J. M. Solana, M. A. Manzano, J. Díaz
Department of Electronic and Computers, Facultad de Ciencias, University of Cantabria
Avda.Los Castros s/n. 39005-Santander. SPAIN
michelli@unican.es

ABSTRACT

The Integer Cosine Transform (ICT) has been shown to be an alternative to the Discrete Cosine Transform (DCT) for image processing. This paper presents a parallel-pipelined architecture of an 8x8 ICT (10, 9, 6, 2, 3, 1) processor for image compression. The main characteristics of this architecture are: high throughput, low latency, reduced internal storage and 100% efficiency in all computational elements. The processor has been designed in 0.35-µm CMOS technology with an estimated operational frequency of 300MHz.

Keywords: Integer cosine transform, multiplication free DCT, discrete cosine transform, image compression, parallel pipelined architectures, VLSI.

INTRODUCTION

Since the introduction of the Discrete Cosine Transform (DCT) several contributions have been published on the subject of fast algorithms and architectures for different applications, many of which describe 2D-DCT processors for image compression [1-6] and approximations to the DCT that are multiplication-free in order to reduce implementation complexity [7-8]. Although the DCT is the most widely used transform for image processing, other transforms have appeared in the last two decades which have less compression capacity and lower implementation costs [9-13]. From them, the Integer Cosine Transform (ICT) has been shown to be a promising alternative to the DCT due to its implementation simplicity, similar performance and compatibility with the DCT [9]. The ICT(10, 9, 6, 2, 3, 1) has the advantage of not needing multipliers, as its kernel is a matrix of integers [12-13].

This paper describes the design and implementation of an 8x8 2-D ICT processor for image compression, which meets the numerical characteristic of the IEEE std. 1180-1990. This processor uses a low latency data flow that minimizes the internal memory and a parallel pipelined architecture, based on a numerical strength reduction Integer Cosine Transform (10, 9, 6, 2, 3, 1) algorithm, in order to attain high throughput and continuous data flow. A prototype of the 8x8 ICT processor has been implemented using a standard cell design methodology and a 0.35-µm CMOS CSD 3M/2P 3.3V process [14] on a 10mm² die. Pipeline circuit techniques have been used to attain the frequency of operation of 300MHz, maximum allowed by the technology. The circuit includes 12446 cells, 6757 of which are flip-flops. Two clock signals have been distributed: an external one, Clk1, operating at input data sampling frequency fs, and an internal one, Clk2, operating at fs/2. The high number of flip-flops has forced the use of a strategy to minimize clock-skew, combining large sized buffers on the periphery and using wide metal lines (clock-trunks) to distribute the signals.

2. DECOMPOSITION OF THE INTEGER COSINE TRANSFORM

The ICT was derived from the DCT through the concept of dyadic symmetry. The order-8 ICT kernel is

\[ T = KJ \] (1)

where \( K \) is the normalization matrix, and \( J \) an orthogonal matrix defined as

whose elements are all integers and satisfy
\[
\begin{align*}
a \ b = a + c + b d + c d, & \quad a \geq b \geq c \geq d \quad \text{and} \quad e \geq f
\end{align*}
\]
There are many possible J matrices, and the corresponding ICTs are denoted as ICT(a, b, c, d, e, f); g is always 1.
The 1-D ICT for a real input sequence \(x(n)\) is defined as
\[
X = T x = K J x = K Y
\]
where \(X\) and \(x\) are dimension-8 column matrices. Reordering the input sequence and the transform coefficients according to the rules
\[
\begin{align*}
x'(n) &= x(n), & x'(7 - n) &= x(n+4), & n \in [0,3] \\
X'(m) &= X(Br8[m]), & X'(m+4) &= X(2m+1), & m \in [0,3]
\end{align*}
\]
where \(Br8[m]\) represents bit-reverse operation of length 8, then the 1-D ICT can be expressed as
\[
X = T_R \ x' = K_R \ J_R \ x' = K_R \ Y'
\]
The reordered integer ICT kernel is
\[
J_R = \begin{bmatrix} J_{se} & 0 \\ 0 & J_{so} \end{bmatrix} \begin{bmatrix} I_4 & I_4 \\ I_4 & -I_4 \end{bmatrix}
\]
\(I_4\) being the dimension-4 identity matrix, and
\[
\begin{align*}
J_{se} &= \begin{bmatrix} g & g & g & g \\ g & g & -g & -g \\ e & f & -f & -e \\ f & -e & e & -f \end{bmatrix}, & J_{so} &= \begin{bmatrix} a & b & c & d \\ b & -d & -a & -c \\ c & -a & d & b \\ d & -c & b & -a \end{bmatrix}
\end{align*}
\]
Applying the decomposition rules defined in equations (5) and (6) to the \(J_{se}\) matrix results
\[
J_{se} = \begin{bmatrix} J_{se} & 0 \\ 0 & J_{so} \end{bmatrix} \begin{bmatrix} I_2 & I_2 \\ I_2 & -I_2 \end{bmatrix} R_4
\]
where \(R_4\) is the reordering matrix of length 4, \(I_2\) is the dimension-2 identity matrix, and
\[
\begin{align*}
J_{se} &= \begin{bmatrix} g & g \\ g & -g \end{bmatrix}, & J_{so} &= \begin{bmatrix} f & e \\ f & -e \end{bmatrix}
\end{align*}
\]
\( J_{4e} = \begin{bmatrix}
  1 & 1 & 1 & 1 \\
  1 & -1 & -1 & 1 \\
  3 & 1 & -1 & -3 \\
  1 & -3 & 3 & -1
\end{bmatrix} \), and \( J_{4o} = \begin{bmatrix}
  10 & 9 & 6 & 2 \\
  9 & -2 & -10 & 2 \\
  6 & -10 & 2 & 9 \\
  2 & -6 & 9 & -10
\end{bmatrix} \) \( (12) \)

Figure 1. Signal flow graph of 1-D ICT

As can be seen in Figure 1, the first computing level operates on the input data, reordering them according to rule (5); additions and subtractions of data pairs formed with sequences \( x(n) \) and \( x(n+4) \) \( (n=0, 1, 2, 3) \) are executed. In the second computing level the transformations \( J_{4e} \) and \( J_{4o} \) are obtained, their nuclei being the matrices defined by \( (12) \). The transformation \( J_{4e} \) is applied to the first half of the intermediate data sequence, \( a_0 \) to \( a_3 \), giving as a result the even coefficients \( \langle Y_0, Y_4, Y_6, Y_8 \rangle \) of the ICT, without normalization. Similarly, \( J_{4o} \) is applied to the other half of the middle data sequence, \( a_7 \) to \( a_4 \), giving as a result the odd coefficients \( \langle Y_1, Y_5, Y_7, Y_9 \rangle \) of the ICT, also without normalization. In the third computing level, the coefficients \( Y_i \) are normalized by \( k_i \) and the transform sequence of the coefficients \( X(m) \) appears reordered according to rule (6). Applying the decomposition procedure of \( J_{4e} \) established in \( (10) \) and \( (11) \), we get:

\[
\begin{bmatrix}
  Y_0 \\ Y_4 \\ Y_2 \\ Y_6
\end{bmatrix} =
\begin{bmatrix}
  1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 1 & -3
\end{bmatrix}
\begin{bmatrix}
  a_0 \\ a_1 \\ a_2 \\ a_3
\end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix}
  1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 3 & 1 \\ 0 & 0 & 1 & -3
\end{bmatrix}
\begin{bmatrix}
  b_0 \\ b_1 \\ b_2 \\ b_3
\end{bmatrix}
\] \( (13) \)

\( b_0, b_1, b_2 \) and \( b_3 \) being the intermediate data of the computation of transformation \( J_{4e} \). Operating on \( (13) \), we get:

\[
\begin{bmatrix}
  Y_0 \\ Y_4 \\ Y_2 \\ Y_6
\end{bmatrix} =
\begin{bmatrix}
  1 & -1 \\ 1 & -1 \\ 0 & 0 \\ 0 & 0
\end{bmatrix}
\begin{bmatrix}
  b_0 \\ b_1 \\ b_2 \\ b_3
\end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix}
  1 & -1 \\ 1 & -1 \\ 0 & 0 \\ 0 & 0
\end{bmatrix}
\begin{bmatrix}
  b_0 \\ b_1 \\ b_2 \\ b_3
\end{bmatrix}
\] \( (14) \)

Figure 2 shows the signal flow graph obtained from \( (13) \) and \( (14) \).
As can be seen in (13), the computation of the even coefficients of the ICT can be performed with additions and subtractions, as multiplication by 3 can be easily implemented by means of add and shift operations. The computation of the odd coefficients of the ICT can also be simplified; decomposition of the \( J_4 \) matrix as the addition of matrices having elements that are powers of 2, gives:

\[
\begin{align*}
Y_0 &= \begin{bmatrix} 2 & 2 & -2 & 2 \end{bmatrix} \begin{bmatrix} a_7 \end{bmatrix} + \begin{bmatrix} 0 & 8 & 8 & 0 \end{bmatrix} \begin{bmatrix} a_7 \end{bmatrix} \\
Y_4 &= \begin{bmatrix} 2 & -2 & -2 & 2 \end{bmatrix} \begin{bmatrix} a_8 \end{bmatrix} + \begin{bmatrix} 8 & 0 & 0 & -8 \end{bmatrix} \begin{bmatrix} a_8 \end{bmatrix} \\
Y_2 &= \begin{bmatrix} -2 & -2 & 2 & 2 \end{bmatrix} \begin{bmatrix} a_9 \end{bmatrix} + \begin{bmatrix} 8 & 0 & 0 & 8 \end{bmatrix} \begin{bmatrix} a_9 \end{bmatrix} \\
Y_6 &= \begin{bmatrix} 2 & 2 & 2 & -2 \end{bmatrix} \begin{bmatrix} a_{10} \end{bmatrix} + \begin{bmatrix} 0 & -8 & 8 & 0 \end{bmatrix} \begin{bmatrix} a_{10} \end{bmatrix}
\end{align*}
\]

(15)

In this form, the odd coefficients of the ICT can be implemented simply in terms of add and shift operations. Figure 3 shows the signal flow graph, obtained from (15) for the transformation \( J_4 \), having three computing levels, \( d_i, f_i, e_i \), and \( g_i \) (i=0, 1, 2, 3) being the intermediate data.

### 3. 1-D ICT PARALLEL PIPELINED ARCHITECTURE

The 1-D ICT processor architecture, whose scheme is shown in Figure 4, has been designed to implement the computing diagram of Figure 1 with the highest degree of efficiency. It has an input processor computing the intermediate data of the first computing level, \( a_0 \) to \( a_7 \), two processors in parallel, computing the transformations \( J_{4e} \) and \( J_{4o} \), and an output mixer generating the coefficients sequence of the ICT, ordered in natural form. The three processors have parallel architecture, allowing the operation frequency to be reduced to \( f_s/2 \), where \( f_s \) is the input data sampling frequency. The output mixer gives the coefficients sequence of the ICT at the frequency \( f_s \).

The input processor (Fig. 4) has a shift register of length 11 which stores the input data sampled at frequency \( f_s \), two multiplexers 4:1, an adder and a subtractor. The adder and the subtractor both have pipeline structure and operate in parallel at frequency \( f_s/2 \), generating the input sequences, \( <a_0, a_1, a_2, a_3> \) and \( <a_4, a_5, a_6> \), of processors \( J_{4e} \) and \( J_{4o} \) from the data stored in the register. In this way an efficiency of 100% is attained for the arithmetic elements.
The processor J_{4e} has been conceived to calculate the even coefficients of the ICT using the procedure established in (13) and in the signal flow graph of Figure 2. As can be seen in Figure 5, this processor has four shift registers, four multiplexers 4:1, three arithmetic units having pipeline structure, and an output mixing and ordering circuit. The adder and the subtracter, which operate in parallel at frequency $f_s/2$, generate first the intermediate data $<b_0, b_1, b_2, b_3>$ from the input sequence $<a_0, a_1, a_2, a_3>$, stored in the register SRA1; $b_0$ and $b_1$ are stored in register SRB1; while $b_2$ and $b_3$ are stored in SRB2. The multiplier by 3, implemented by adding and shifting, generates the data $3b_2$ and $3b_3$, which are stored in register SRB3. After that, the even coefficients of the ICT are generated from the data stored in SRB1, SRB2 and SRB3: $Y_0$ and $Y_2$ in the adder, $Y_4$ and $Y_6$ in the subtracter. The output mixer orders the even coefficient sequence of the ICT $<Y_0, Y_2, Y_4, Y_6>$. In this processor, the adder and the subtracter have an efficiency of 100%.

The processor J_{4o} architecture, shown in Figure 6, has been conceived to implement the computing diagram of Figure 3. This processor has five shift registers, ten multiplexers 4:1, and five arithmetic units with pipeline structure, operating in parallel at frequency $f_s/2$. Multiplications by 2 and by 8 are implemented as wired shift operations. The adder/subtractor AE5 and the subtracter AE6, generate first the intermediate data $<d_0, d_1, d_2, d_3>$ from the input sequence $<a_7, a_6, a_5, a_4>$, stored in the register SRA2; $d_0$ and $d_1$ are stored in the register SRD1, while $d_2$ and $d_3$ are stored...
in SRD2. Next, from the data stored in SRD1 and SRD2, in AE5 the intermediate data, e0 and e3, are generated and in AE6 e1 and e2; e0 and e3 are stored in SRD1, and e1 and e2 in SRD2. The adder/subtractor AE7 and AE8 generate first the intermediate data <f0, f1, f2, f3> from the input sequence <a7, a6, a5, a4>, stored in the register SRA2; f0 and f2 are stored in the register SRF1, while f1 and f3 are stored in SRF2. After that, from the data stored in SRD1, SRD2, SRF1 and SRF2, the intermediate data g0 and g3, are generated in AE7 and in AE8 g1 and g2; g0 and g2 are stored in SRF1, and g1 and g3 in SRF2. The output adder AE9 generates the odd coefficients of the ICT from the intermediate data <e0, e1, e2> and <g0, g1, g2, g3>, stored in the registers SRD1, SRD2, SRF1 and SRF2. These coefficients are produced in natural order: <Y1, Y3, Y5, Y7>. All the arithmetic elements of processor J4o have an efficiency of 100%.
4. 2-D ICT ASIC PROCESSOR

The 2-D ICT for a real input sequence $x$ is defined as

$$X = T x T = K J x J^t K^t = K Y K^t$$

(16)

where $X$ and $x$ are $8 \times 8$ matrices. Reordering the input data and the transform coefficients applying the rules defined in (5) and (6) to both dimensions, the 2-D ICT can be expressed as

$$X = T_r x T_r^t = K_r Y K_r^t$$

(17)

the 2-D $J$ transform being

$$Y' = \begin{bmatrix} J_4 & 0 \\ 0 & J_4 \end{bmatrix} \begin{bmatrix} I_4 & I_4 \\ I_4 & -I_4 \end{bmatrix} Y \begin{bmatrix} I_4 & I_4 \\ I_4 & -I_4 \end{bmatrix} \begin{bmatrix} J_4 & 0 \\ 0 & J_4 \end{bmatrix}$$

(18)

The 2-D ICT architecture shown in Figure 7 is based on a computing scheme derived from eq. (17) and (18), which first obtains the 2-D $J$ transform followed by the normalization. This 2-D $J$ transform is implemented using two 1-D $J$ processors that share a memory to store the intermediate data. This memory, whose scheme shown in Figure 7, is made up of a shift-register file with $8 \times 8$ elements and a control circuit having a 7-bit pipeline counter and some simple additional logic. Reading and writing the shift register file is a two steps process: by rows and by columns. In the first step, the output data of the first processor, corresponding to the eight 1-D ICT of an image block, are stored by rows, following a decreasing order, until the memory is filled up; at the same time, the second processor reads by rows the intermediate data stored in the register file, corresponding to the eight 1-D ICT of the previous image block. In the second step, columns store the intermediate data corresponding to the next image block, while the second processor reads by columns the intermediate data of the previous block.

Using the architecture shown in Figure 7, a 2-D ICT processor chip for transform image compression has been designed, being compatible with standard IEEE 1180-1990. The output coefficients can be selected without normalization (23 bits output) or with normalization (12 bits output according to standard IEEE 1180-1990). The design

![Figure 7. Scheme of 8x8 2-D ICT image processing](image)
has been made with a semi-custom methodology, using the 0.35µm CMOS CSD 3M/2P 3.3V technology of Austria-Microsystems [14]. The circuit, whose microphotograph is shown in Figure 8, has an area of 2900x3200≈9.3 mm² (the core is 2200x2480≈5.5 mm²) with 13 input pads, 26 output pads and 22 power pads. The circuit contains 12446 cells, 6737 of them being flip-flops. The full functionality of prototype has been tested using a Logic Analysis System and has been verified at 200 MHz using a simplified test arrangement. From detailed circuit simulation, including line delays, a maximum operating frequency of about 300 MHz has been established.

For adder/subtractor implementation, the binary look-ahead carry (BCL) adder [15] has been selected. The BCL adder keeps the number of gates along the critical path to \( \log_2 n + 2 \) for an \( n \)-bit addition. Moreover, its structure is highly suitable for pipelining because it presents the same number of gates from each input to each output. All adders/subtractors in the processor are based on the BCL structure with one stage of pipelined registers in the middle to attain a high-speed addition. A highly pipelined output multiplier, having 3006 cells, performs the final normalization. The multiplier and the shift-register file account for nearly 50% of the total surface.

The processor has four main control signals: Clk1, external clock at frequency \( f_s \), Clk2, internal clock at frequency \( f_s/2 \), and the multiplexer selection signals M1 at frequency \( f_s/4 \) and M2 at frequency \( f_s/8 \). Figure 9 shows the fan out of each of these signals and the scheme used to distribute them on the circuit. As this is a core-limited circuit, periphery buffers of \( 2\,mA \) drive strength have been placed in the pad ring. These buffers are short to create a single net and are placed side-by-side with their own power and ground supplies. To avoid clock skew, Clk1 and Clk2 are routed as a main trunk that goes through the middle of the chip, with branches on each side. M1 and M2 are globally routed in the chip. Power pads have been placed on all four sides of the chip to obtain an even current distribution to I/O pads, buffers and core. Internal buses and power feed thru cells are used to get a better current distribution to the standard cell area. The main supply buses of VDD and GND are also shown in Figure 9.

5. CONCLUSIONS

This paper presents an 8x8 2-D ICT(10, 9, 6, 2, 3, 1) processor chip for image compression compatible with standard IEEE 1180-1990. This processor uses a minimal internal memory and a parallel-pipelined architecture to attain the maximum frequency of operation allowed by the technology estimated in 300MHz. As result, the architecture presents a
low latency, high throughput and continuous data flow with a 100% efficiency in all computational elements. A prototype has been implemented in a 0.35-µm CMOS technology with an area of 9.3 mm². To avoid clock skew, clocks are routed using a clock-trunk strategy and the buffers are placed on the periphery. The prototype has been tested satisfactory using a test environment based on a Logic Analysis System. Simple test has been passed at 200 MHz.

ACKNOWLEDGMENT

We wish to acknowledge the Spanish Ministry of Science and Technology for the financial help TIC2000-1289 received to support this work.

REFERENCES