

A Peer Reviewed Open Access International Journal

# Architecture for Matching Of Data Encoded With Hard Systematic Error-Correcting Codes Using Verilog HDL



P. Sampath Kumar Associate professor & HOD, Department of ECE, Malla Reddy Institute of Technology, Hyderabad.

## Abstract:

A new architecture for matching the data protected with an error-correcting code (ECC) is presented in this brief to reduce latency and complexity. Based on the fact that the codeword of an ECC is usually represented in a systematic form consisting of the raw data and the parity information generated by encoding, the proposed architecture parallelizes the comparison of the data and that of the parity information. To further reduce the latency and complexity, in addition, a new butterfly-formed weight accumulator (BWA) is proposed for the efficient computation of the Hamming distance. Grounded on the BWA, the proposed architecture examines whether the incoming data matches the stored data if a certain number of erroneous bits are corrected. For a (40, 33) code, the proposed architecture reduces the latency and the hardware complexity by ~32% and 9%, respectively, compared with the most recent implementation.

## **Index Terms:**

Data comparison, error-correcting codes (ECCs), Hamming distance, systematic codes, tag matching.

## **I.INTRODUCTION:**

Data comparison is widely used in computing systems to perform many operations such as the tag matching in a cache memory and the virtual-to-physical address translation in a translation look aside buffer (TLB). Because of such prevalence, it is important to implement the comparison circuit with low hardware complexity.



Bairi Sumanth Kumar M.Tech (VLSI &ES), Department of ECE, Malla Reddy Institute of Technology, Hyderabad.

Besides, the data comparison usually resides in the critical path of the components that are devised to increase the system performance, e.g., caches and TLBs, whose outputs determine the flow of the succeeding operations in a pipeline. The circuit, therefore, must be designed to have as low latency as possible, or the components will be disqualified from serving as accelerators and the overall performance of the whole system would be severely deteriorated. As recent computers employ error-correcting codes (ECCs) to protect data and improve reliability [1]–[5], complicated decoding procedure, which must precede the data comparison, elongates the critical path and exacerbates the complexity overhead. Thus, it becomes much harder to meet the above design constraints. Despite the need for sophisticated designs as described, the works that cope with the problem are not widely known in the literature since it has been usually treated within industries for their products.Recently, however, [6] triggered the attraction of more and more attentions from the academic field. The most recent solution for the matching problem is the direct compare method [6], which encodes the incoming data and then compares it with the retrieved data that has been encoded as well. Therefore, the method eliminates the complex decoding from the critical path. In performing the comparison, the method does not examine whether the retrieved data is exactly the same as the incoming data. Instead, it checks if the retrieved data resides in the error correctable range of the codeword corresponding to the incoming data. As the checking necessitates an additional circuit to compute the Hamming distance, i.e., the number of different bits between the two code words, the saturate adder (SA) was presented in [6] as a basic building block for calculating the Hamming distance.

Volume No: 2 (2015), Issue No: 7 (July) www.ijmetmr.com



A Peer Reviewed Open Access International Journal

However, [6] did not consider an important fact that may improve the effectiveness further, a practical ECC codeword is usually represented in a systematic form in which the data and parity parts are completely separated from each other [7]. In addition, as the SA always forces its output not to be greater than the number of detectable errors by more than one, it contributes to the increase of the entire circuit complexity.In this brief, we renovate the SA-based direct compare architecture to reduce the latency and hardware complexity by resolving the aforementioned drawbacks. More specifically, we consider the characteristics of systematic codes in designing the proposed architecture and propose a low-complexity processing element that computes the Hamming distance faster. Therefore, the latency and the hardware complexity are decreased considerably even compared with the SA based architecture.

### **II.PREVIOUS WORKS:**

This section describes the conventional decode-andcompare architecture and the encode-and-compare architecture based on the direct compare method. For the sake of concreteness, only the tag matching performed in a cache memory is discussed in this brief, but the proposed architecture can be applied to similar applications without loss of generality.





## A. Decode-and-Compare Architecture:

Let us consider a cache memory where a k-bit tag is stored in the form of an n-bit codeword after being encoded by a (n, k) code. In the decode-and-compare architecture depicted in Fig. 1(a), the n-bit retrieved codeword should first be decoded to extract the original k-bit tag. The extracted k-bit tag is then compared with the k-bit tag field of an incoming address to determine whether the tags are matched or not. As the retrieved codeword should go through the decoder before being compared with the incoming tag, the critical path is too long to be employed in a practical cache system designed for high-speed access. Since the decoder is one of the most complicated processing elements, in addition, the complexity overhead is not negligible.

### **B. Encode-and-Compare Architecture:**

Note that decoding is usually more complex and takes more time than encoding as it encompasses a series of error detection or syndrome calculation, and error correction [7]. The implementation results in [8] support the claim. To resolve the drawbacks of the decode-and compare architecture, therefore, the decoding of a retrieved codeword is replaced with the encoding of an incoming tag in the encode-and-compare architecture More precisely, a k-bit incoming tag is first encoded to the corresponding n-bit codeword X and compared with an n-bit retrieved codeword Y as shown in Fig. 1(b).

The comparison is to examine how many bits the two code words differ, not to check if the two code words are exactly equal to each other. For this, we compute the Hamming distance d between the two code words and classify the cases according to the range of d.Let tmax and rmax denote the numbers of maximally correctable and detectable errors, respectively. The cases are summarized as follows.

1) If d = 0, X matches Y exactly.

2) If  $o < d \le tmax$ , X will match Y provided at most tmax errors in Y are corrected.

) If tmax < d  $\leq$  rmax, Y has detectable but uncorrectable errors. In this case, the cache may issue a system fault so as to make the central processing unit take a proper action.

#### 4) If rmax < d, X does not match Y.

Assuming that the incoming address has no errors, we can regard the two tags as matched if d is in either the first or the second ranges.



A Peer Reviewed Open Access International Journal

In this way, while maintaining the error-correcting capability, the architecture can remove the decoder from its critical path at the cost of an encoder being newly introduced. Note that the encoder is, in general, much simpler than the decoder, and thus the encoding cost is significantly less than the decoding cost. Since the above method needs to compute the Hamming distance, [6] presented a circuit dedicated for the computation.



Match/Mismatch/Fault

# Fig.2: SA-based architecture supporting the direct compare method

The circuit shown in Fig. 2 first performs XOR operations for every pair of bits in X and Y so as to generate a vector representing the bitwise difference of the two code words. The following half adders (HAs) are used to count the number of 1's in two adjacent bits in the vector. The numbers of 1's are accumulated by passing through the following SA tree. In the SA tree, the accumulated value z is saturated to rmax + 1 if it exceeds rmax. More precisely, given inputs x and y, z can be expressed as follows:

$$Z = \begin{cases} x + y, & \text{if } x + y \le r_{max} \\ r_{max} + 1, & \text{otherwise.} \end{cases}$$
(1)

The final accumulated value indicates the range of d. As the compulsory saturation necessitates additional logic circuitry, the complexity of a SA is higher than the conventional adder.

#### **III.PROPOSED ARCHITECTURE:**

This section presents a new architecture that can reduce the latency and complexity of the data comparison by using the characteristics of systematic codes. In addition, a new processing element is presented to reduce the latency and complexity further.

## A. Datapath Design for Systematic Codes:

In the SA-based architecture [6], the comparison of two code words is invoked after the incoming tag is encoded. Therefore, the critical path consists of a series of the encoding and the n-bit comparison as shown in Fig. 3(a). However, [6] did not consider the fact that, in practice, the ECC codeword is of a systematic form in which the data and parity parts are completely separated as shown in Fig. 4.



Fig.3: Timing diagram of the tag match in (a) direct compare method [6] and (b) proposed architecture.



#### **Fig.4: Systematic representation of an ECC codeword** As the data part of a systematic codeword is exactly the same as the incoming tag field, it is immediately available for comparison while the parity part becomes available only after the encoding is completed. Grounded on this fact, the comparison of the k-bit tags can be started before the remaining (n–k)-bit comparison of the parity bits.In the proposed architecture, therefore, the encoding process to generate the parity bits from

the incoming tag is performed in parallel with the tag comparison, reducing the overall latency as shown in Fig. 3(b).

# **B.** Architecture for Computing the Hamming Distance:

The proposed architecture grounded on the data path design is shown in Fig. 5.



A Peer Reviewed Open Access International Journal

The proposed architecture contains multiple butterflyformed weight accumulators (BWAs) proposed to improve the latency and complexity of the Hamming distance computation.



Match/Mismatch/Fault

Fig.5: Proposed architecture optimized for systematic code words.

The basic function of the BWA is to count the number of 1's among its input bits. It consists of multiple stages of Half Adders as shown in Fig. 6(a), where each output bit of a HA is associated with a weight.



Fig.6: Proposed BWA. (a) General structure and (b) new structure revised for the matching of ECC-protected data. Note that sum-bit lines are dotted for visibility.

The HAs in a stage are connected in a butterfly form so as to accumulate the carry bits and the sum bits of the upper stage separately. In other words, both inputs of a HA in a stage, except the first stage, are either carry bits or sum bits computed in the upper stage. This connection method leads to a property that if an output bit of a HA is set, the number of 1's among the bits in the paths reaching the HA is equal to the weight of the output bit. In Fig. 6(a), for example, if the carry bit of the gray-colored HA is set, the number of 1's among the associated input bits, i.e., A, B, C, and D, is 2. At the last stage of Fig. 6(a), the number of 1's among the input bits, d, can be calculated as

d = 8I + 4(J + K + M) + 2(L + N + 0) + P. (2)

Since what we need is not the precise Hamming distance but the range it belongs to, it is possible to simplify the circuit. When rmax = 1, for example, two or more than two 1's among the input bits can be regarded as the same case that falls in the fourth range. In that case, we can replace several HAs with a simple OR-gate tree as shown in Fig. 6(b). This is an advantage over the SA that resorts to the compulsory saturation expressed in (1).Note that in Fig. 6, there is no overlap between any pair of two carry-bit lines or any pair of two sum-bit lines. As the overlaps exist only between carry-bit lines and sum-bit lines, it is not hard to resolve overlaps in the contemporary technology that provides multiple routing layers no matter how many bits a BWA takes. We now explain the overall architecture in more detail. Each XOR stage in Fig. 5 generates the bitwise difference vector for either data bits or parity bits, and the following processing elements count the number of 1's in the vector, i.e., the Hamming distance. Each BWA at the first level is in the revised form shown in Fig. 6(b), and generates an output from the OR-gate tree and several weight bits from the HA trees. In the interconnection, such outputs are fed into their associated processing elements at the second level.

The output of the OR-gate tree is connected to the subsequent OR-gate tree at the second level, and the remaining weight bits are connected to the second level BWAs according to their weights. More precisely, the bits of weight w are connected to the BWA responsible for w-weight inputs. Each BWA at the second level is associated with a weight of a power of two that is less than or equal to Pmax, where Pmax is the largest power of two that is not greater than rmax + 1.

Volume No: 2 (2015), Issue No: 7 (July) www.ijmetmr.com



A Peer Reviewed Open Access International Journal

As the weight bits associated with the fourth range are all ORed in the revised BWAs, there is no need to deal with the powers of two that are larger than Pmax. For example, let us consider a simple (8, 4) single-error correction double-error detection code. The corresponding first and second level circuits are shown in Fig. 7. Note that the encoder and XOR banks are not drawn in Fig. 7 for the sake of simplicity.



Fig.7: First and second level circuits for a (8, 4) code.

Since rmax = 2, Pmax = 2 and there are only two BWAs dealing with weights 2 and 1 at the second level. As the bits of weight 4 fall in the fourth range, they are ORed. The remaining bits associated with weight 2 or 1 are connected to their corresponding BWAs. Note that the interconnection induces no hardware complexity, since it can be achieved by a bunch of hard wiring. Taking the outputs of the preceding circuits, the decision unit finally determines if the incoming tag matches the retrieved codeword by considering the four ranges of the Hamming distance. The decision unit is in fact a combinational logic of which functionality is specified by a truth table that takes the outputs of the preceding circuits as inputs. For the (8, 4) code that the corresponding first and second level circuits are shown in Fig. 7, the truth table for the decision unit is described in Table I. Since U and V cannot be set simultaneously, such cases are implicitly included in do not care terms in Table I.

# Table I: Truth Table of the Decision Unit for a (8,4) code

| Q OR R OR S | Т | U | V | Decision |
|-------------|---|---|---|----------|
| 0           | 0 | 0 | х | Match    |
|             | 0 | 1 | х | Fault    |
|             | 1 | 0 | 0 | Fault    |
|             | 1 | 0 | 1 | Mismatch |
|             | 1 | 1 | x | Mismatch |
| 1           | х | х | х | Mismatch |

# C. General Expressions for the Complexity and the Latency:

The complexity as well as the latency of combinational circuits heavily depends on the algorithm employed. In addition, as the complexity and the latency are usually conflicting with each other, it is unfortunately hard to derive an analytical and fully deterministic equation that shows the relationship between the number of gates and the latency for the proposed architecture and also for the conventional SA-based architecture. To circumvent the difficulty in analytical derivation, we present instead an expression that can be used to estimate the complexity and the latency by employing some variables for the nondeterministic parts. The complexity of the proposed architecture, C, can be expressed as

$$C = C_{XOR} + C_{ENC} + C_{BWA}(k) + C_{BWA}(n-k) + C_{2nd} + C_{DU}$$
  
$$\leq n + C_{ENC} + 2C_{BWA}(n) + C_{DU} \quad (3)$$

where CXOR, CENC, C2nd, CDU, and CBWA(n) are the complexities of XOR banks, an encoder, the second level circuits, the decision unit, and a BWA for n inputs, respectively. Using the recurrence relation, CBWA(n) can be calculated as

$$C_{BWA}(n) = C_{BWA}(\lfloor n/2 \rfloor) + C_{BWA}(\lfloor n/2 \rfloor) + 2 \lfloor \frac{n}{2} \rfloor$$
 (4)

where the seed value, CBWA(1), is o. Note that when a + b = c, CBWA(a) + CBWA(b)  $\leq$  CBWA(c) holds for all positive integers a, b, and c. Because of the inequality and the fact that an OR-gate tree for n inputs is always simpler than a BWA for n inputs, both CBWA(k) + CBWA(n -k) and C2nd are bounded by CBWA(n). The latency of the proposed architecture, L, can be expressed as

$$L \leq \max \left[ L_{\text{XOR}} + L_{\text{BWA}}(k), L_{\text{ENC}} + L_{\text{XOR}} + L_{\text{BWA}}(n-k) \right]$$
  
+ 
$$L_{\text{2nd}} + L_{\text{DU}}$$
  
$$\leq \max \left( 1 + \left\lceil \log_2 k \right\rceil, L_{\text{ENC}} + 1 + \left\lceil \log_2(n-k) \right\rceil \right)$$
  
+ 
$$\left\lceil \log_2 n \right\rceil + L_{\text{DU}}$$
(5)

where LXOR, LENC, L2nd, LDU, and LBWA(n) are the latencies of an XOR bank, an encoder, the second level circuits, the decision unit, and a BWA for n inputs, respectively. Note that the latencies of the OR-gate tree and BWAs for  $x \le n$  inputs at the second level are all bounded by [log2 n]. As one of BWAs at the first level finishes earlier than the other, some components at the second level may start earlier. Similarly, some BWAs or the OR-gate tree at the second level may provide their output earlier to the decision unit so that the unit can begin its operation without waiting for all of its inputs.



A Peer Reviewed Open Access International Journal

In such cases, L2nd and LDU can be partially hidden by the critical path of the preceding circuits, and L becomes shorter than the given expression.

## **IV.SIMULATION RESULTS:**

The simulations of the proposed design are carried out by using Verilog HDL language in Xilinx tool. The Top view and simulated waveforms of the proposed design are shown in below figure:



#### Fig.8: simulated waveforms of proposed design



#### Fig.9: Top- view of the proposed design

## **V.CONCLUSION:**

To reduce the latency and hardware complexity, a new architecture has been presented for matching the data protected with an ECC. The proposed architecture examines whether the incoming data matches the stored data if a certain number of erroneous bits are corrected. To reduce the latency, the comparison of the data is parallelized with the encoding process that generates the parity information. The parallel operations are enabled based on the fact that the systematic codeword has separate fields for the data and parity. In addition, an efficient processing architecture has been presented to further minimize the latency and complexity. As the proposed architecture is effective in reducing the latency as well as the complexity considerably, it can beregarded as a promising solution for the comparison of ECC-protected data. Though this brief focuses only on the tag match of a cache memory, the proposed method is applicable to diverse applications that need such comparison.

### **REFERENCES:**

[1] J. Chang, M. Huang, J. Shoemaker, J. Benoit, S.-L. Chen, W. Chen, S. Chiu, R. Ganesan, G. Leong, V. Lukka, S. Rusu, and D. Srivastava, "The 65-nm 16-MB shared on-die L3 cache for the dual-core Intel xeon processor 7100 series," IEEE J. Solid-State Circuits, vol. 42, no. 4, pp. 846–852, Apr. 2007.

[2] J. D. Warnock, Y.-H. Chan, S. M. Carey, H. Wen, P. J. Meaney, G. Gerwig, H. H. Smith, Y. H. Chan, J. Davis, P. Bunce, A. Pelella, D. Rodko, P. Patel, T. Strach, D. Malone, F. Malgioglio, J. Neves, D. L. Rude, and W. V. Huott "Circuit and physical design implementation of the microprocessor chip for the zEnterprise system," IEEE J. Solid-State Circuits, vol. 47, no. 1, pp. 151–163, Jan. 2012.

[3] H. Ando, Y. Yoshida, A. Inoue, I. Sugiyama, T. Asakawa, K. Morita, T. Muta, and T. Motokurumada, S. Okada, H. Yamashita, and Y. Satsukawa, "A 1.3 GHz fifth generation SPARC64 microprocessor," in IEEE ISSCC. Dig. Tech. Papers, Feb. 2003, pp. 246 247.

[4] M. Tremblay and S. Chaudhry, "A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC processor," in ISSCC. Dig. Tech. Papers, Feb. 2008, pp. 82–83.

[5] AMD Inc. (2010). Family 10h AMD Opteron Processor Product Data Sheet, Sunnyvale, CA, USA [Online].