# HIGH THROUGHPUT 2D DCT/IDCT PROCESSOR FOR VIDEO CODING

G. A. Ruiz, J. A. Michell, A. M. Burón

Department of Electronic and Computers. University of Cantabria (SPAIN). ruizrg@unican.es, michellj@unican.es, burona@unican.es

### ABSTRACT

This paper describes the architecture of an 8x8 2-D DCT/IDCT processor with high throughput, reduced hardware, and a parallel-pipeline scheme. This architecture allows the processing elements and arithmetic units to work in parallel at half the frequency of the data input rate. A fully pipelined row-column decomposition method based on two 1-D DCTs and a transpose buffer based on D-type flip-flops are used. The processor has been implemented in a 0.35- $\mu$ m CMOS process with a core area of 3mm<sup>2</sup> and 11.7k gates. It meets the requirements of IEEE Std. 1180-1990. The data input rate frequency is 300MHz with a latency of 172 cycles for 2-D DCT and 178 cycles for 2-D IDCT. The proposed design is compact and suitable for HDTV applications.

## **1. INTRODUCTION**

The Discrete Cosine Transform (DCT) is widely considered to provide the best performance for transform coding and image compression [1]. Thus, the DCT has been applied for most of recent picture international standards as JPEG, MPEG, H.261 and H.263, as well as in high-definition television (HDTV) systems. The computation complexity requirements in many real-time applications often lead to the use of efficient dedicated hardware (ASIC's) operating at high speed with an acceptable cost in area [3]-[7].

This paper describes the architecture of an 8x8 2-D DCT/IDCT processor with a high input data rate and a cost-effective hardware. The 2D DCT/IDCT is calculated using the separability property, so that its architecture is made up of two 1-D processors and a transpose buffer as intermediate memory. The processor has been designed aiming to attain high throughput, reduced hardware, parallel and pipeline architecture, and a maximum efficiency in all arithmetic elements. This architecture allows the processing elements and arithmetic units to work in parallel at half the frequency of the data input rate, except for the normalisation of the transform which is

carried out by a multiplier operating at maximum frequency. Moreover, it has been verified that the precision analysis of the proposed processor meets the demands of IEEE Std. 1180-1990 [2] used in video codecs ITU-T H.261 and ITU-T H.263, among others. The processor has been conceived using a standard cell design methodology and manufactured in a  $0.35\mu m$  CMOS process. Fast computing speed as well as low hardware costs make the proposed design suitable for HDTV applications.

## 2. TWO DIMENSIONAL 8X8 DCT/IDCT

Let  $S_{R8}$  the 8-point DCT matrix with rows reordered according to the sequence (0,4,2,6,1,5,3,7):

$$\mathbf{S}_{\mathbf{R8}} = \frac{1}{2} \begin{bmatrix} C_0 & C_0 \\ C_4 & -C_4 & -C_4 & C_4 & C_4 & -C_4 & -C_4 & C_4 \\ C_2 & C_6 & -C_6 & -C_2 & -C_6 & C_6 & C_2 \\ C_6 & -C_2 & C_2 & -C_6 & -C_6 & C_2 & -C_2 & C_6 \\ C_1 & C_3 & C_5 & C_7 & -C_7 & -C_5 & -C_3 & -C_1 \\ C_5 & -C_1 & C_7 & C_3 & -C_3 & -C_7 & C_1 & -C_5 \\ C_3 & -C_7 & -C_1 & -C_5 & C_5 & C_1 & C_7 & -C_3 \\ C_7 & -C_5 & C_3 & -C_1 & C_1 & -C_3 & C_5 & -C_7 \end{bmatrix} (I)$$

where:

$$C_0 = \frac{1}{\sqrt{2}}; C_i = \cos\frac{i \cdot \pi}{16}, i = 1, 2, ..., 7$$
 (2)

This matrix can be decomposed in the following way:  $S_{R8} = P_{R8} J_{R8}$  (3)

where,

$$\mathbf{P_{R8}} = \frac{1}{2} \cdot \mathbf{Diagonal} (C_0, C_4, C_2, C_2, C_1, C_5, C_5, C_1) \quad (4)$$

and

$$\mathbf{J}_{\mathbf{R8}} = \begin{bmatrix} \mathbf{J}_{\mathbf{SE4}} \mathbf{Q}_{\mathbf{R4}} & \mathbf{0} \\ \mathbf{0} & \mathbf{J}_{\mathbf{04B}} \mathbf{J}_{\mathbf{04C}} \mathbf{J}_{\mathbf{04D}} \end{bmatrix} \mathbf{Q}_{\mathbf{R8}}$$
(5)

These matrices are given by

$$\mathbf{J}_{\mathbf{SE4}} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & T_2 & 1 \\ 0 & 0 & -1 & T_2 \end{bmatrix}, \ \mathbf{J}_{\mathbf{O4B}} = \begin{bmatrix} T_1 & 0 & 0 & 1 \\ 0 & T_5 & 1 & 0 \\ 0 & -1 & T_5 & 0 \\ -1 & 0 & 0 & T_1 \end{bmatrix}$$
(6)

$$\mathbf{J}_{04C} = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \quad \mathbf{J}_{04D} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & -C_4 & C_4 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$
(7)
$$\mathbf{Q}_{R4} = \begin{bmatrix} \mathbf{I}_2 & \mathbf{I}_{D2} \\ \mathbf{I}_{D2} & -\mathbf{I}_2 \end{bmatrix}, \quad \mathbf{Q}_{R8} = \begin{bmatrix} \mathbf{I}_4 & \mathbf{I}_{D4} \\ \mathbf{I}_{D4} & -\mathbf{I}_4 \end{bmatrix}$$
(8)

where  $I_N$  is the dimension-N unit matrix,  $I_{D4}$  the matrix resulting from permuting the  $I_4$  rows according to the ordering sequence (3,2,1,0),  $I_{D2}$  ordering according to (1,0),  $T_1=C_7/C_1$ ,  $T_2=C_6/C_2$  and  $T_5=C_3/C_5$ .

Since  $S_{R8}$ ,  $Q_{R8}$  and  $P_{R8}$  are orthogonal, after Eq. (3) and (5) we get:

$$\mathbf{S}_{R8}^{-1} = \mathbf{J}_{R8}^{t} \mathbf{P}_{R8} = \mathbf{Q}_{R8} \begin{bmatrix} \mathbf{Q}_{R4} \mathbf{J}_{SE4}^{t} & \mathbf{0} \\ \mathbf{0} & \mathbf{J}_{O4D} \mathbf{J}_{O4C} \mathbf{J}_{O4B}^{t} \end{bmatrix} \mathbf{P}_{R8}$$
(9)

as  $Q_{R4}$ ,  $J_{O4D}$  and  $J_{O4C}$  are also orthogonal. The 8x8 2-D DCT can be expressed on the basis of the nuclei of the  $S_{R8}$  matrix transform as:

$$\mathbf{X}_{\mathbf{R}} = \mathbf{S}_{\mathbf{R8}} \cdot \mathbf{x} \cdot \mathbf{S}_{\mathbf{R8}}^{\mathsf{t}} = \mathbf{K}_{\mathbf{8}} \bullet \left( \mathbf{J}_{\mathbf{R8}} \cdot \mathbf{x} \cdot \mathbf{J}_{\mathbf{R8}}^{\mathsf{t}} \right)$$
(10)

where  $(\bullet)$  represent the Hadamard product and  $\mathbf{K}_8$  is the normalization matrix defined as

Similarly, the 8x8 IDCT is obtained as:

$$\mathbf{x} = \mathbf{S}_{\mathbf{R8}}^{\mathsf{t}} \cdot \mathbf{X}_{\mathbf{R}} \cdot \mathbf{S}_{\mathbf{R8}} = \mathbf{J}_{\mathbf{R8}}^{\mathsf{t}} \cdot (\mathbf{K}_{\mathbf{8}} \bullet \mathbf{X}_{\mathbf{R}}) \cdot \mathbf{J}_{\mathbf{R8}}$$
(12)

# **3. ARCHITECTURE OF J\_{R8} / J\_{R8}^t PROCESSOR**

Figure 1 shows the architecture of the  $J_{R8} / J_{R8}^t$  processor designed to implement the matrix decompositions defined in Eq. (5) and (9) respectively. It is made up of a double-input  $Q_{R8}$  processor, and processors  $\{Q_{R4}, J_{SE4} / J_{SE4}^t\}$  and  $\{J_{O4D}, J_{O4C}, J_{O4B} / J_{O4B}^t\}$  operating in parallel. The configuration of forward DCT or IDCT of the architecture is performed by multiplexers and the operation of processors  $J_{SE4} / J_{SE4}^t$  and  $J_{O4B} / J_{O4B}^t$ .

One important characteristic is that the input data  $I_E$ and  $I_O$  are processed in parallel and thus the output data  $O_E$  and  $O_O$  are also generated in parallel. In this way, the operation frequency of the  $J_{R8}/J_{R8}^t$  processor is halved from the input data sampling frequency (f<sub>s</sub>).



**Fig. 1**. Architecture of  $J_{R8} / J_{R8}^{t}$  processor.

Figure 2 shows the architecture of each of the basic processors specified in Figure 1 which are derived from (6)-(8). All of them operate with 4 input data in series and produce 4 output data also in series, these outputs being compatible with the next processor. They are synchronized by the internal clock Clk2 at  $f_s/2$ . They are made up of shift registers (S-R), multiplexers (MUX), carry incrementer adders/substracters and hardwired multipliers. These processors have been designed aiming at an efficiency of 100% in the arithmetic elements in most cases.

The carry incrementer adder scheme has been chosen because it has an asymptotic performance with O(n) area and  $O(\sqrt{n})$  time, and provides a compromise between a ripple-carry adder and a carry look-ahead adder. The concept of hardwired multiplication and binary signed digit representation for fixed coefficients has been adopted to simplify the hardware complexity for realizing multiplication through a carry-saver adder scheme. The multiplication by fixed-coefficients is computed in three types of configurable multipliers which perform the following arithmetic operations:  $P=d_i\{T_1 \text{ or } T_5\}+d_i$ ,  $P=d_i\{1 \text{ or } T_2\}+d_i \text{ and } P=d_iC_4$ , where  $d_i$  and  $d_i$  are input data. These multipliers are made up of a carry save adder tree based on 4:2 and 5:3 compressors and configured according to the type of coefficient, and a final carry incrementer adder with rounding correction. In order to shorten the critical path, one pipeline stage is inserted within the multipliers.

#### 4. 8x8 2-D DCT/IDCT

The 2-D 8x8 DCT/IDCT architecture is implemented by the row-column decomposition technique according to Eqs. (10) and (12). Figure 3 shows the block diagram of the proposed architecture made up by two 1-D processors  $(J_{R8}/J_{R8}^t)$ , a transpose buffer (TB) for storing the intermediate data, one down-sampling (D-S) unit and another up-sampling (U-S) unit, and a multiplier for performing the normalization of the transform according to matrix  $K_8$  defined in (11). These processors operate at  $f_s/2$  and the normalization at  $f_s$ . The normalization is



Fig. 2. Architecture of basic processors: a)  $\mathbf{Q}_{\mathbf{R8}}$ , b)  $\mathbf{J}_{\mathbf{SE4}}$  /  $\mathbf{J}_{\mathbf{SE4}}^{t}$ , c)  $\mathbf{J}_{\mathbf{O4D}}$ , d)  $\mathbf{Q}_{\mathbf{R4}}$  and  $\mathbf{J}_{\mathbf{O4C}}$ , and e)  $\mathbf{J}_{\mathbf{O4B}}$  /  $\mathbf{J}_{\mathbf{O4B}}^{t}$ .



Fig. 3. Architecture of 2D DCT/IDCT processor.

performed by a single  $K_8$  multiplier operating at the output for the forward DCT and at the input for the IDCT.

The 8x8 intermediate data generated by the first 1-D DCT processor has to be stored and transposed in the TB before the second 1-D DCT is performed. This TB allows simultaneous read and write operations between the two processors and performance the equivalent to a matrix transposition. To achieve this, the data are read out of the memory column-wise if the previous intermediate data were written into the memory row-wise, and vice versa. This TB is made up by D-type flip-flops and multiplexers.

#### 4.1. Pipeline scheme

As the arithmetic circuits limit the speed of the processor, the DCT/IDCT processor uses pipelining to shorten the cycle time and perform real-time processing for applications with high pixel rates. The pipeline registers are inserted in the critical path to improve the operation speed with minimal overhead and to provide balanced paths.

Figure 4 shows the data cycle timing for calculating the two 1-D DCT/IDCT and TB transposing operations for the different 8x8 blocks. The input data are fed in rowwise order at 1 pixel/clock and the output data are produced in column-wise order. The total latency for 2-D DCT is 172 cycles and for 2-D IDCT is 178 cycles. This small difference in the number of cycles is forced by the necessary synchronization between the different basic processors shown in Fig. 2.

#### 4.2. Accuracy specifications

The computation accuracy of the processor is affected by quantization error of coefficients and by finite internal wordlength. The IEEE-std. 1180-1990 has established some strident specification to evaluate the errors caused by the computing of the IDCT [2]. The fulfillment of the specifications ensures the compatibility between different implementations of the IDCT. This standard has been used to define the accuracy of the data-path and wordlength of the coefficients of the processor. The IDCT architecture must be exercised with 10,000 8x8 blocks of random numbers in the ranges [-5, 5], [-256, 255] and [-300, 300]. The simulations carried out with MATLAB lead to the following minimum architectural requirements: 20-b for data-path, 12-b for coefficients wordlength and 13-b for normalization coefficients of K8. Table I summarizes the simulation results with these wordlength specifications.



|            | MSE(<0.02) | ME(<0.0015) | PMSE(<0.06) | PME(<0.015) |
|------------|------------|-------------|-------------|-------------|
| [-256,255] | 0.016733   | 0.000008    | 0.0224      | 0.0028      |
| [255,-256] | 0.017011   | 0.000136    | 0.022500    | 0.0029      |
| [-300,300] | 0.014727   | 0,000083    | 0.018300    | 0.0029      |
| [300,-300] | 0.014872   | 0.000075    | 0.018700    | 0.0024      |
| [-5,5]     | 0.011494   | 0.000209    | 0.013600    | 0.0027      |
| [5,-5]     | 0.011422   | 0.000181    | 0.0144      | 0.0025      |

MSE denotes overall mean square error, ME denotes overall mean error, PMSE denotes peak mean square error and PME denotes peak mean error

**Fig. 4**. Data timing of 8x8 2-D DCT/IDCT processor

| <b>Fable I</b> . A | ccuracy | test result | of the | IDCT |
|--------------------|---------|-------------|--------|------|
|--------------------|---------|-------------|--------|------|

| Reference   | [3]          | [4]         | [5]          | [6]         | [7]          | Ours         |
|-------------|--------------|-------------|--------------|-------------|--------------|--------------|
| Year        | 1996         | 1999        | 2002         | 2003        | 2004         | 2004         |
| Function    | DCT/IDCT     | IDCT        | DCT/IDCT     | IDCT        | DCT/IDCT     | DCT/IDCT     |
| Tech. CMOS  | 0.3 µm (FC)  | 0.6 µm (SC) | 0.18 µm (SC) | 0.6 µm (SC) | 0.25 µm (SC) | 0.35 µm (SC) |
| Area (mm2)  | 4            | 12.3 (core) |              | 21 (core)   | 1.5 (core)   | 3 (core)     |
| Gates       |              | 10k gates   | 28k gates    | 7.5 gates + | 52k gates    | 11.7k gates  |
| Transistors | 120K Tr      | 78k Tr      |              | RAM         |              |              |
| Frecuency   | 150MHz@0.9 V | 100MHz@3.3V | 33.6MHz@1.6V | 100MHz      | 150MHz       | 300MHz@3.3V  |
| Latency     | 112 cycles   | 86 cycles   | 227 cycles   |             | 64 cycles    | 178 cycles   |

Table II. Comparison with dedicated hardware designs (FC denotes full-custom and SC semi-custom)

# 5. IMPLEMENTATION AND COMPARISONS

6. CONCLUSIONS

A prototype of an 8x8 2-D DCT processor has been designed using standard cells in a semi-custom methodology. It uses 9-b input data and 12-b output data for DCT and 12-b and 9-b for IDCT. The processor was implemented with a 0.35 $\mu$ m CMOS 3M/2P 3.3V technology. The processor has an area of 2.5x2.5 $\approx$ 6.25 mm<sup>2</sup> (the core is 3mm<sup>2</sup>). It contains a total of 11.7k gates, 5.8k gates of which are flip-flops and 826 gates are full adder/half adder. The hardware costs in terms of number of gates for the different blocks are: **Q**<sub>R8</sub> processor needs 549 gates, **Q**<sub>R4</sub> 318, **Q**<sub>R4</sub>, **J**<sub>04C</sub> 320, **J**<sub>O4B</sub>/**J**<sup>t</sup><sub>O4B</sub> 600, **J**<sub>SE4</sub>/**J**<sup>t</sup><sub>SE4</sub> 372, **J**<sub>O4D</sub> 502, TB 2563, **K**<sub>8</sub> multiplier 1528 and others 2300. The maximum operating frequency of about 300 MHz has been established and the computing time of a block is close to 580ns.

Table II lists features of the proposed processor and other DCT implementations selected from among those which fulfill the specifications of the standard [2]. The parallel-pipeline architecture and arithmetic units operating at half the frequency gives an input data rate of 300MHz, far higher than that of the fastest processor listed in this table [3][7]. This speed does not imply any additional cost in terms of the number of gates since it offers an efficient hardware implementation.

One advantage of the proposed architecture is that the  $K_8$  normalization can be incorporated into the decoding process in an adaptive transform coding system. Therefore, the quantization process at the receiver or at the transmitter can include this normalization. Then, the  $K_8$  multiplier can be completely removed in the 2D DCT processor, reducing area by 13% and latency by 10%.

This paper describes the architecture of an 8x8 2-D DCT/IDCT processor with high throughput, reduced hardware, parallel and pipeline architectyre. This good performance in the computing speed as well as hardware cost, indicate that the proposed design is suitable for HDTV applications.

ACKNOWLEDGMENT: This work was supported by the Spanish Ministry of Education and Science under project TEC 2004-04868-C02-02/MIC.

## 7. REFERENCES

- V. Bhaskaran and K. Konstantinides, *Image and Video Compression Standards: Algoritms and Architectures*. Kluwer Academic Publishers. 2nd Edition, 1997.
- [2] *IEEE standard specifications for the implementations of 8x8 inverse discrete cosine transform*, Institute of Electrical and Electronics Engineers, New York, March 1991.
- [3] T. Kuroda et al., "A 0.9-V, 150MHz, 10-mW, 4 mm2, 2-D Discrete Cosine Transform Core Processor with variable Threshold-Voltage (VT) Scheme,", IEEE Journal on Solid-State Circuits, vol. 31, no. 11, pp. 1770-1779, Nov 1996.
- [4] T.H. Chen, "A cost-effective 8x8 2-D IDCT core processor with folded architecture," IEEE on Consumer Electronics, vol. 45, no. 2, pp. 333-339, May 1999.
- [5] L. Fanucci, S, Saponara, "Data driven VLSI computation for low power DCT-based video coding," 9th IEEE Int. Conf. Electronics Circuits& Systems, vol. 2, pp. 541-544, 2002.
- [6] J.I. Guo, J.C. Yen, "An efficient IDCT processor design for HDTV applications," Journal of VLSI Signal Processing, vol. 33, pp. 147-155, October 2003.
- [7] D. Gong, Y. He, Z. Cao, "New cost-effective VLSI implementation of a 2-D discrete cosine transform and its inverse," IEEE trans. on Circuits and Systems for Video Technology, vol. 14, no. 4, pp. 405-415, April 2004.