Design and Implementation of a Floating-Point Unit

June 20, 2005


Abstract

Floating-point representation is the computer form of scientific notation, and floating-point operations make up the bulk of multimedia computation. The floating-point unit (FPU) is therefore an important part of every kind of processor and largely determines its performance. This thesis describes the design and implementation of a 32-bit floating-point unit and the related research results, submitted as my undergraduate thesis. The unit is fully compliant with the IEEE 754 floating-point standard and supports 32-bit single-precision addition, multiplication, division, square root, and conversion between integers and floating-point numbers. The arithmetic front end adopts a multiply-add fused (MAF) scheme, computing A + B × C, to perform both addition and multiplication. The processing of denormalized numbers is merged into the dataflow. To narrow the main adder, the normalization shift is performed before the addition. Rounding is then performed on the fly in a main adder that is only as wide as the significand. Synthesis results show that the delay of the MAF unit is considerably lower than in previous designs. The MAF unit is divided into three pipeline stages, which raises throughput. A radix-4 SRT algorithm computes division and square root iteratively. Different rounding schemes are chosen for the different representations of the intermediate results, which limits the number of iteration cycles and keeps the arithmetic efficient. The whole design is described in Verilog HDL and simulated. It is mapped onto SMIC 0.18 µm CMOS technology using an automated standard-cell flow. Post-simulation results show that the design meets its goals.

Key Words: FPU (floating point unit), MAF (multiply-add fused), denormalized number processing, rounding


1.1 0 (Float Point Unit) FPU 3D FPU FPU ALU CPU 486 Intel i80486dx CPU MMX 3DNow! CPU DSP FPU 1.2 1 FPU FLOPS Floating Point Operations Per Second 2005 2 SONY IBM TOSHIBA ISSCC05 90nm SOI Cell SPE (Synergistic Processing Element) 4GHz SPE 32GFLOPS SONY Play Station 3-1 -

2 Intel P4 23 7 AMD Athlon 16 4 P4 Athlon 3 1994 Intel Pentium 5 1.3 SMIC 0.18µm CMOS MAF(Multiply-add fused) A+B C A B C MAF IBM z990 [1] MAF MAF Z990 (millicode) [2] [3] MAF - 2 -

[3] MAF MAF A/B [1] 4 SRT 24 12 1 14 1.2 AMD Athlon SQRT(A) z990 FPU 2 SRT 1 20 P4 23 [4] 4 SRT 2 14 IEEE 754 [4] MAF Verilog HDL RTL Synopsys VCS Synopsys Design Compiler Synopsys Astro IBM z990 FPU 0.13µm CMOS 1.2GHz 156MHz IEEE 754 FPU FPU

IEEE 754 [4] IEEE 754 2.1 ±S B E 32 2.1.1 1 8 23 s e f 2.1.1 s(sign) 0 1 S(significand) 1 1 (mantissa) (fraction) E(exponent) 32 127 8 0~255 1~254-126~127 0-126 B(base) 2.1.1 32 IEEE 754 32-4 -
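As an illustration of the single-precision layout described in Section 2.1 (1 sign bit, 8 exponent bits with a bias of 127, 23 fraction bits), the following minimal Python sketch unpacks a value into its three fields and classifies it. The helper names `decode_single` and `field_value` are hypothetical and introduced here only for illustration; they are not part of the thesis design.

```python
import struct

def decode_single(x):
    """Split a number into IEEE 754 single-precision fields (s, e, f) plus a class label."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31              # 1-bit sign
    e = (bits >> 23) & 0xFF     # 8-bit biased exponent
    f = bits & 0x7FFFFF         # 23-bit fraction
    if e == 255:
        kind = "nan" if f else "inf"
    elif e == 0:
        kind = "denormal" if f else "zero"
    else:
        kind = "normal"
    return s, e, f, kind

def field_value(s, e, f):
    """Value of a finite encoding: +/-2^(e-127) * 1.f, or +/-2^-126 * 0.f when e == 0."""
    sign = -1.0 if s else 1.0
    if e == 0:
        return sign * (f / 2**23) * 2.0**-126
    return sign * (1 + f / 2**23) * 2.0**(e - 127)
```

For example, `decode_single(1.0)` yields a biased exponent of 127 with a zero fraction, and the smallest denormalized number has `e == 0` and `f == 1`.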

2 32 2.1.1 0 1 0 1 0/1 0/1 0/1 255 255 255 0 0 0<e<255 0 1 1 1 0 0 0 0 0 f f 0-0 - NaN ±2 e-127 1.f ±2-126 0.f 2.2 (rounding) IEEE 754 4 (round to nearest/even) IEEE + - (interval arithmetic) - 5 -

0 2.2.1 00.0 01.0 10.0 2.2.1 2.3 IEEE 754 2.3.1 < <+ 2.3.2 NaN NaN (signaling)nan NaN (quiet)nan NaN - 6 -

1) NaN 2) ( ) + (+ ) 3)0 4)0/0 / 5) x REM 0 REM y 6) 2.3.3 0 - x x + x = x (-x) x -0-0 2.3.4 IEEE 754 1 1 2.3.1 [2 n 2 n+1 ] 0 0 IEEE 32 2 23 2-126 2 23 [2-126 2-125 ] 0 2-126 - 7 -

0 2-126 2-125 2-124 2-123 2.3.1 (gradual underflow) 2.4 Exception IEEE 754 5 2.4.1 NaN 2.3.2 2.4.2 2.4.3 overflow 1) 2) 3) 4) - 8 -

2.4.4 underflow 2.4.5 / 2.5 32 IEEE 754 [4] 1 8 23 24 1 0 1-9 -

3.1 3.1.1 [1] [6] 1 C B [7] (CSA, Carry-Save Adder) 48 A 3:2CSA A B C A B C B C 74 A 48 B C 3:2 CSA 74 74 24 [3] [3] [3] [6] 1-10 -

1 [1] [2] A B C A B C A B C B C CSA 3 2 CSA 74 3.1.1 MAF 3.2 24 MAF - 11 -

ExpA sub ExpB ExpC A 24 Bit invert bits of A and 0's if add or bits of ~A and 1's if sub B 24 C 24 mul record ExpA LeadA ExpB LeadB ExpC LeadC Calculate the exponent of intermediate results 27-d 74-bits alignment shifter Sticky bit st1 calculation Part of sticky 74 26 MSB 48 LSB sub 13:2 CSA tree 48 48 ~st1 denormalized number processing Adjustment for potential denormalized results 3:2 CSA 49 49 HA Carry word HA Sum word HA inv.inputs Carry word LZA Logic for the LSBs MSBs processing Part of Dual adder 74 49 0 1 mux 75 75 Part of Dual adder 24 bits 49 bits XX GG XX.PP complement Sign detection to the add/round module 75-bits 75-bits normalization normalization shifter shifter Bits shifted-in during normalization Carry word: 0... 0 Sum word: 0... 0 (if complement=0) 1... 1 (if complement=1) 75 75 25 bits 50 bits X. X.... X X......... X X. X.... X X......... X L bit 50 24 24 50 st1 complement Exponent calculationone bit correction Rest of 22-bits Flagged prefix adder Rounding Correction Carry and sticky bits calculation Rounding Mode RN=1 if rounding to nearest RI=1 if rounding to infinity Exponent Resul 3.2.1 MAF 3.2.1 1 0 24 1-12 -

3 1 B C (Booth) C 24 C 0 13 B 0 ±1 ±2 3:2 13 2 [7] 2 A B C A+ B C A A 75 3.2.2(a) 24 A 51 1 51 0 A B C d = exp(a) (exp(b) + exp(c)) 3.2.2(b) d = 27 d 27 A B C B C B C A exp(a) d 27 A 0 1 shift amount = 27 d 75 d 48 24 A 24 24 0 1 0 (sticky bit) 24 0 1-13 -

st1 0 75 24 51 A A A 2 48 75 B C (a) 2 24 A 51 B C (b) 24 1 1 (c) 49 3:2 CSA 49 50 B C A 26 3:2CSA 49 0 1 3:2CSA 50 49 51 (d) 3:2CSA 3.2.2 MAF 3 1 A B C 1 24 lead 0 lead=0-14 -

75 B C 26+leadB+leadC 27+leadB+leadC A leada + shift amount A B C 3.2.2 49 3:2 (Carry Save Adder) A 49 48 B C A 1 1 24 1 st1 1 75 B C 0 3:2 CSA 49 50 49 (d) B C 48 B C 3:2 CSA B C B C [0,4) 4 11.*+01.* 1X.*+1X.* 01.*+11.* B C B C [1,4) 10.*+01.* 1 B C [0,2) 2 (1X.*+XX.* XX.*+1X.*) B C multi-carry 3:2 CSA carry[50] 75 75 26 A 49 3:2 CSA 75 49 3:2 CSA - 15 -

49 multi-carry 0 carry[50] 25 0 75 26 multi-carry 1 carry[50] 1 26 0 75 26 multi-carry 1 carry[50] 0 26 1 75 26 3:2 CSA 75 75 24 1 (Leading Zeros Anticipator) 1 3:2 CSA 3 1 75 51 A 1 24 75 A 51 [8] 1 [9] [8] 0 1 0-16 -

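$$T_i = A_i \oplus B_i, \quad G_i = A_i B_i, \quad Z_i = \overline{A_i}\,\overline{B_i}$$
$$f_0 = \overline{T_0}\,T_1$$
$$f_i = T_{i-1}\left(G_i \overline{Z_{i+1}} + Z_i \overline{G_{i+1}}\right) + \overline{T_{i-1}}\left(Z_i \overline{Z_{i+1}} + G_i \overline{G_{i+1}}\right), \quad i > 0 \tag{3.1}$$

51 A 1 3.2.1 A 1 27 26 0 2 1 4 1 t[n]^ (~z[n+1]) 1 A 1 26 25 0 t[n]^ (~z[n+1]) 1 A 1 26 25 0 A t[n] ^ (~g[n+1]) 0 A 25 24 0 1 t[n] ^ (~g[n+1]) 51 [3] 32 3.2.3 51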

3.2.3 51 Lz 75 a) A 0 A 2525 A B C 2 24 A B C 3 A B C 3 A ( ) A A A A 25 B C b) 25 0 1( ) A 1 A 1-18 -

c) A 25 A B C B C ExpBC = + ExpC Ebias 51 51 4 47 51 A 25 24 1 51 24 + Lz ExpR 1 ExpR = ExpBC Lz + 3 1 3 4 Lz 4 1 3 Lz ExpBC + 2 Lz > ExpBC+2 + 2 ExpR 1 3.2.1 ARS NLS ExpR 3.2.1 NLS ExpR ARS = 0 0 ExpA ARS <25 ARS ExpA ARS <25 A ARS ExpA 0<ARS<25 A ARS - 1 ExpA + 1 ARS 25 Lz ExpBC + 2 24 + Lz ExpBC Lz -3 ARS 25 Lz > ExpBC+2 26 + ExpBC 1 ARS 25 Lz ARS<25 0 1 1 0 2 IEEE 754 51 A B C B C - 19 -

3 [3] 75 3.2.3 NLS 1 0 1 1 24 51 51 1 [10] MAF A+B A+B+1 [11] flagged prefix adder [11] IEEE 754 IEEE 754-20 -

3.3 [10] MAF 1 75 1 2 1(Eone ) 0 C F, S F, R F C I, S I, R I m0, m1 n0, n1 l g(m) st2 25 50 3.3.1 75 C S R C S R 25 50 25 S I C I R I S F C F R F R = S + C + 1 ( 1 ) R I = S I + C I R F = S F + C F + 1 1 S F,C F [0,1) R F [0,2) g R F st2 st1 49 R F 49 1 st2 r 1 R F - 21 -

r c R F m R F l R I R F n0 n1 0 1 R I m0 m1 p R F l p R I 25 50 2 X 2 = 4 R I 25 R F 50 24 24 R I,R I+2 4 1 l p c lp fix0,fix1 Sns0, Sns1, Sls0, Sls1 50 m0,m1 n0,n1 c,m st1 st st2 NOR 1 Eone 3.3.2 [10] IEEE 754-22 -

3.3.1 p=0 RF (LS) (NS) [0,0.5) RI (n, l, st): (X0X): RI (010): RI (011): RI + 2 (11X): RI + 2 [0.5,1) (l, st): (00): RI (01): RI ( l 1) (n, l, st) (X0X): RI (X1X): RI+2 (1X): RI+2 ( l 0) [1,1,5) (l): (0): RI ( l 1) (n, l, st): (000): RI (1): RI +2 ( l 0) (001): RI + 2 (01X): RI + 2 (1XX): RI + 2 [1.5,2) (l, st): (0X): RI + 2 RI + 2 (10): RI +2 ( l 0) (11): RI + 2 Sns0 Sns1 24 R I R I +2 Sls0 Sls1 24 R I R I +2 23 l NS LS R1 R2 R3 R4 R F Sls0 & ~ ~ Sls0 = LS & (R1 R2 & ~l R3 & ~l) LS LS = ~m0 & (R1 R2) ~m1 & (R3 R4) R1 = ~c & ~m R2 = ~c & m R3 = c & ~m - 23 -

R4 = c & m m0 0 R I m0 = R I [23] m1 1 R I R I +1 R I R I +2 m1 = R I [23] & l (R I +2)[23] & ~l Eone LS = ~Eone & (~m0 & (R1 R2) ~m1 & (R3 R4)) Sls1 Sns1 Sns0 fix0 fix1 R I +3 R I 1 1 p = C F.m S F.m 24 R I +p R I +p+2 3.3.2 RF (LS) (NS) [0,0.5) RI+p (lp): (0): RI + p (1): RI + p + 2 [0.5,1) (l, p): (00): RI + p ( l 1) ( lp, p) (00): RI + p +2 (10): RI + p + 2 ( l 0) (01): RI + p (X1): RI + p (10): RI + p + 2 (11): RI + p + 2 [1,1,5) RI + p (1): RI +2 ( l 0) [1.5,2) (lp): (0): RI + p ( l 0) (1): RI + p + 2 ( l 1) (lp): (0): RI + p (1): RI + p + 2 RI + p + 2 Sls0 Sls1 Sns1 Sns0 fix0 fix1-24 -

MAF 3.3.3 RN p=0 Slns0 = ~m0 & ~c & (~m ~l) ~m1 & c & ~m & ~l Sls1 = (~m0 & ~c & m & l ~m1 & c & (m l)) & ~m1 Sns0 = m0 & ~c & ( ~l ~m & ~n0 & ~st ) m1 & c & ~m & ~n1 & ~l & ~st Srs1 = m0 & l & ~c & (n0 st m) m1 & c & (n1 l st m)) ( ~m0 & ~c & m & l ~m1 & c & (m l)) & m1 fix0 = ~m0 & l & ~c & m ~m1 & l & c & (~m ~st) fix1 = ~m0 & ~c & m & ~l & st ~m1 & c & ~m & ~l RI p = CF.m SF.m Sls0 = ~m0 & ~c & (~m & ~st ~lp p) ~m1 & c & ( ~m & ~st ~lp) Sls1 = ~m0 & ~c & ( m st ) & lp & ~p ~m1 & c & lp Sns0 = m0 & ~c & ~lp & (~m & ~st p) ~m1 & c & ~m & ~st & ~lp Sns1 = m0 & ~c & (lp ~p & (m st)) m1 & c fix0 = ~m0 & ~c & lp & ~p & (m st) ~m1 & c & lp fix1 = ~m0 & ~c & ~lp & ~p & (m st) ~m1 & c & ~lp RZ p = 0 Sls0 = ~m0 & ( ~c ~l) Sls1 = ~m1 & c & l Sns0 = m0 & (~c ~l) Sns1 = m1 & c & l fix0 = ~m1 & c & l fix1 = ~m1 & c & ~l Eone m0 m1 m1 = Eone R I [23] & l (R I +2)[23] & ~l m0 = Eone R I [23] n0 = R I [0] - 25 -

n1 = R I [0] & l (R I +2)[0] & ~l NS LS 3.4 MAF MAF Verilog HDL SMIC 0.18µm Synopsys Design Compiler 7.15ns 1.8 (tdetect) A (talign) 3:2 CSA (tcsa) LZD (tcalculate) (tnorm) (tadd) tmaf = tdetect + talign + tcsa + tcalculate + tnorm + tadd MAF CSA MAF 75 75 MAF LZD 50 24 CSA 75 + 75 + 75 + 75 50 + 75 + 24 75 + 75 2 50 24 8 MAF 8 NOR - 26 -

MAF MAF 350MHz 2-27 -

4.1 SRT FPU 4 SRT IBM eserver z990
$$P_{i+1} = r \cdot P_i - q_{i+1} \cdot D \tag{4.1}$$
P q D r q i+1 P i+1 q i+1 q i+1 $|P_{i+1}| < (q_{\max} \cdot D)/(r-1)$ FPU $q \in \{-3,-2,-1,0,1,2,3\}$ $r = 4$ $P < D$ P-D 4.1.1 IBM z990 PD PD 3 1 5 3 $P_{in} < D$

4.1.2 PD 4.2 24 Qpos Qneg Pcarry, Psave 5
$$Psave_{i+1} + Pcarry_{i+1} = 4 \cdot (Psave_i + Pcarry_i) - q_{i+1} \cdot D \tag{4.2}$$

$q_i > 0$: $Qpos_i = q_i$; $q_i < 0$: $Qneg_i = q_i$
$$Q = Qpos + Qneg = \sum_i q_i \cdot 4^{-i} \tag{4.3}$$
/ PD / 5 Qpos Qneg 4.2.1 PD 1 0 [1,2) (0.5,2) 1 ExpA, ExpB leada, leadb $ExpR = ExpA - ExpB + Ebias - leada + leadb$ 1 1 (0.5,2) ExpR 1 24 4
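The recurrence (4.1) and the quotient accumulation (4.3) can be sketched in a few lines of Python using exact rational arithmetic. This is only an illustrative model under stated assumptions: the function name `srt4_divide` is hypothetical, the quotient digit is chosen by an exact comparison rather than by the truncated PD lookup table a hardware design would use, and the partial remainder is kept as a single value instead of in carry-save form.

```python
from fractions import Fraction

def srt4_divide(x, d, iterations=13):
    """Radix-4 recurrence P[i+1] = 4*P[i] - q[i+1]*D with digits q in {-3..3}.

    Assumes 0 <= x < d (the condition P < D). Returns (Q, final remainder).
    """
    x, d = Fraction(x), Fraction(d)
    if not 0 <= x < d:
        raise ValueError("requires 0 <= x < d")
    p = x                      # partial remainder P[0]
    q_pos = Fraction(0)        # sum of the positive quotient digits
    q_neg = Fraction(0)        # sum of the negative quotient digits (value <= 0)
    for i in range(1, iterations + 1):
        shifted = 4 * p
        # Exact digit selection (assumption): nearest integer, clamped to the digit set.
        q = max(-3, min(3, round(shifted / d)))
        p = shifted - q * d
        assert abs(p) <= d     # remainder stays within q_max * D / (r - 1) = D
        weight = Fraction(1, 4 ** i)
        if q > 0:
            q_pos += q * weight
        elif q < 0:
            q_neg += q * weight
    return q_pos + q_neg, p
```

After n iterations the result satisfies x = Q·D + P·4^(-n), so the quotient error is at most 4^(-n).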

:P<D (D 1.00 Pin<0.010) 24 13 1 (0.5,1) 0 13 1 14 1 Qpos Qneg 24 15 ExpR<-23-23 ExpR<1 23+ExpR 23+ExpR (27+ExpR)/2 (29+ExpR)/2 23+ExpR (26+ExpR)/2 (28+ExpR)/2 4.3 [11] - 31 -

Qpos 13 Qneg 13 24 26 23/24 4.3.1 14 Qpos Qneg 1 1 Qpos + ~Qneg = Qpos Qneg 1 1 2-32 -

5.1 IBM z990 FPU 2 SRT 1 4 SRT $\{-2,-1,0,1,2\}$, X s X X [1,2) $s = X^{1/2}$ ε $s \in [1, 2)$ s m $\varepsilon < 4^{-m}$ j S[j] $S[0] = 1$
$$S[j] = \sum_{i=0}^{j} s_i \cdot 4^{-i}, \quad s_i \in \{-2,-1,0,1,2\} \tag{5.1}$$
$s = S[m]$ ε
$$w[j] = 4^{j}\left(x - S[j]^2\right)$$
$$w[j+1] = 4\,w[j] - 2\,S[j]\,s_{j+1} - s_{j+1}^2 \cdot 4^{-(j+1)} \tag{5.2}$$
$$-\frac{4}{3}S[j] + \frac{4}{9} \cdot 4^{-j} \le w[j] \le \frac{4}{3}S[j] + \frac{4}{9} \cdot 4^{-j} \tag{5.3}$$
PD [12] Ŝ[j] j S[j] 4 A 1.A 2 A 3 A 4 A 1 =1 A 1 S 1 (1,1, 0, ), j = 0 ( S1, S2, S3, S4) = (1,1,1,1), A1 = 0 & j 0 (1, A2, A3, A4), j 0 (5.4)

$m_k(i) \le w[j] < m_{k+1}(i) \Rightarrow s_{j+1} = k$ 5.1.1

i          0      1      2      3      4      5      6      7
Ŝ[j]       8/16   9/16   10/16  11/16  12/16  13/16  14/16  15/16
m_2(i)     3/2    7/4    2      2      9/4    5/2    5/2    11/4
m_1(i)     1/2    1/2    1/2    1/2    3/4    3/4    1      1
m_0(i)     -1/2   -5/8   -3/4   -3/4   -3/4   -1     -1     -1
m_-1(i)    -13/8  -7/4   -2     -17/8  -9/4   -5/2   -11/4  -23/8

[11] 5.2 $w[j+1] = 4\,w[j] - 2\,S[j]\,s_{j+1} - s_{j+1}^2 \cdot 4^{-(j+1)}$
1) 4w[j]
2) 4w[j] S[j] s j+1
3) w[j+1] 5.2.1
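The square-root recurrence w[j+1] = 4w[j] − 2S[j]s_{j+1} − s_{j+1}²·4^(−(j+1)) can be exercised with a small exact-arithmetic model. The sketch below is illustrative only and makes several assumptions that are not taken from the thesis: the name `sqrt_radix4` is hypothetical, the operand is restricted to 1 ≤ x < 2, and each digit is selected by testing the residual bound (−(4/3)S + (4/9)·4^(−j) ≤ w ≤ (4/3)S + (4/9)·4^(−j)) exactly instead of comparing a truncated estimate against the selection constants m_k(i).

```python
from fractions import Fraction

def sqrt_radix4(x, iterations=13):
    """Digit-recurrence square root with radix 4 and digits s in {-2,-1,0,1,2}.

    Assumes 1 <= x < 2 and S[0] = 1. Returns (S[m], w[m]) as exact fractions.
    """
    x = Fraction(x)
    if not 1 <= x < 2:
        raise ValueError("this sketch assumes 1 <= x < 2")
    S = Fraction(1)            # partial root S[0]
    w = x - S * S              # residual w[0] = 4^0 * (x - S[0]^2)
    rho = Fraction(2, 3)       # redundancy factor for digit set {-2..2}, radix 4
    for j in range(iterations):
        ulp = Fraction(1, 4 ** (j + 1))
        for s in (0, 1, -1, 2, -2):
            w_next = 4 * w - 2 * S * s - s * s * ulp   # residual recurrence
            S_next = S + s * ulp
            low = -2 * rho * S_next + rho * rho * ulp
            high = 2 * rho * S_next + rho * rho * ulp
            if low <= w_next <= high:                  # residual bound holds
                break
        else:
            raise ArithmeticError("no digit keeps the residual bounded")
        S, w = S_next, w_next
    return S, w
```

Because the digit set is redundant, more than one digit may satisfy the bound at a given step; the sketch simply takes the first one it finds.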

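X S[m] w[j] $w[j+1] = 4\,w[j] + F[j]$, $F[j] = -2\,S[j]\,s_{j+1} - s_{j+1}^2 \cdot 4^{-(j+1)}$ S[j] s j+1 5.2.1 [11] S[j] $A[j] = S[j]$, $B[j] = S[j] - 4^{-j}$
$$A[j+1] = \begin{cases} A[j] + s_{j+1} \cdot 4^{-(j+1)}, & s_{j+1} \ge 0 \\ B[j] + (4 - |s_{j+1}|) \cdot 4^{-(j+1)}, & s_{j+1} < 0 \end{cases} \tag{5.5}$$
$$B[j+1] = \begin{cases} A[j] + (s_{j+1} - 1) \cdot 4^{-(j+1)}, & s_{j+1} > 0 \\ B[j] + (3 - |s_{j+1}|) \cdot 4^{-(j+1)}, & s_{j+1} \le 0 \end{cases} \tag{5.6}$$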

$s \in [1, 2)$ $A[0] = 1.00\ldots00$ $B[0] = 0.00\ldots00$ 2 w[j+1] $F[j] = -2\,S[j]\,s_{j+1} - s_{j+1}^2 \cdot 4^{-(j+1)}$
$$F[j] = \begin{cases} -\left(2A[j] + s_{j+1} \cdot 4^{-(j+1)}\right) s_{j+1}, & s_{j+1} > 0 \\ 0, & s_{j+1} = 0 \\ \left(2B[j] + (8 - |s_{j+1}|) \cdot 4^{-(j+1)}\right) |s_{j+1}|, & s_{j+1} < 0 \end{cases} \tag{5.7}$$
w[j] s[j+1] A[j] B[j] F[j] 5.2.2 s [1, 2) 0 2-149 24 12 13 14

5.3 MAF S[m] 1 1 [13] A[j] B[j] S[j] C[j] $A[j] = S[j]$, $B[j] = S[j] - 4^{-j}$, $C[j] = S[j] + 4^{-j}$ C[j] A[j] B[j] 5.3.1

s_{j+1}   A[j+1]     B[j+1]     C[j+1]
 0        (A[j],0)   (B[j],3)   (A[j],1)
 1        (A[j],1)   (A[j],0)   (A[j],2)
-1        (B[j],3)   (B[j],2)   (A[j],0)
 2        (A[j],2)   (A[j],1)   (A[j],3)
-2        (B[j],2)   (B[j],1)   (B[j],3)

A[m] B[m] C[m]
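The update table for A, B and C can be checked with a short Python model in which each register is an integer counted in units of the current last digit position, so that appending a radix-4 digit d to a register R is simply 4·R + d. This is a sketch for illustration: the names `NEXT` and `on_the_fly` are hypothetical, and the starting values A = 1, B = 0, C = 2 assume S[0] = 1.

```python
# Each entry gives, for the new digit s, the (source register, appended digit)
# pairs for the next A, B and C, as listed in the table.
NEXT = {
    0:  (("A", 0), ("B", 3), ("A", 1)),
    1:  (("A", 1), ("A", 0), ("A", 2)),
    -1: (("B", 3), ("B", 2), ("A", 0)),
    2:  (("A", 2), ("A", 1), ("A", 3)),
    -2: (("B", 2), ("B", 1), ("B", 3)),
}

def on_the_fly(digits):
    """Convert signed radix-4 digits to A = S, B = S - ulp, C = S + ulp without carries.

    Results are integers in units of 4^-m, where m = len(digits).
    """
    a, b, c = 1, 0, 2          # S[0] = 1, so A = 1, B = A - 1, C = A + 1
    for s in digits:
        regs = {"A": a, "B": b}
        a, b, c = (4 * regs[name] + digit for name, digit in NEXT[s])
    return a, b, c
```

Only concatenations are needed at each step, which is why the rounded result (S, S − ulp or S + ulp) is available as soon as the last digit is known, with no carry-propagating addition.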

2003 International Technology Roadmap for Semi-conductor, ITRS2003 2:1 3:1 1994 Intel 4.75 6.1-38 -

6.1.1 6.1.1 IEEE 754 constrains biasing 100 3 2 32X3 100 10 23 2 2 32X2 100 10 14 1 2 32 100-39 -

0.23 1000 6.1.2 IEEE 754 C CPU CPU CPU verilog verilog 6.1.3-40 -

1 Synopsys VCS Coverage Metrics license (line/statement coverage) (toggle coverage) (path coverage) (condition coverage) if?: 100% 100% case 6.2 (symbolic simulation) (model checking) (Theorem Proving) Intel STE IBM SixSense AMD ACL2 MAF [14] - 41 -

Synopsys Formality MAF MAF - 42 -

Verilog HDL Synopsys VCS Synopsys Design Compiler Synopsys Astro SMIC 0.18µm Synopsys Design Compiler 3.07ns 4.04 3.34 0.70 1.8V 91.2mW 51.76 mW 39.43 mW 1.64 µW DC 6.2.1

4mm 4mm PAD 1mm 1mm 47% 6.68ns MAF PAD FPU 2004 [15] 12 32 32 (RF) 6.2.2 6.2.1 [15] SMIC 0.18 CMOS TSMC 0.18 CMOS 1.8V Core 2.5V PAD 1.8V Core 2.5V PAD 1.00mm 1.00mm 1.06mm 1.06mm 3 5 15 12 156MHz 266MHz [15] - 44 -

[15] IBM z99 MAF 6.2.2 Intel P4 AMD Athlon K7 IBM z990 3 5 4 5 3 7 4 5 15 23 16 25 14 23 19 FPU

8.1 IEEE 754 32 4 SRT FPU 1mm 2 150MHz 64 MAF SRT VCS DC Astro CAD Solaris UNIX 8.2-46 -

1 AMD ACL2 AMD Athlon K5 20 [16] 2 CPU DSP CPU DSP CPU DSP CPU DSP 2005 ISSCC IBM SONY SCE Toshiba 90nm SOI Cell 2500 Cell IBM 64 Power 8 FPU FPU 3-47 -

[1] G. Gerwig, H. Wetter, E. M. Schwarz, J. Haess, et al. The IBM eServer z990 floating-point unit. In: IBM Journal of Research and Development, v 48, n 3-4, May/July 2004, p 311-322.
[2] G. Gerwig, H. Wetter, E. M. Schwarz, and J. Haess. High Performance Floating-Point Unit with 116 Bit Wide Divider. In: Proceedings of the 16th Symposium on Computer Arithmetic, Santiago de Compostela, Spain, June 2003, p 87-94.
[3] T. Lang and J. D. Bruguera. Floating-point multiply-add fused with reduced latency. In: IEEE Transactions on Computers, v 53, n 8, August 2004, p 988-1003.
[4] ANSI/IEEE Standard 754-1985: IEEE Standard for Binary Floating-Point Arithmetic. Piscataway, NJ: IEEE Press, 1985.
[5] N. Burgess. The "Flagged prefix adder" for dual additions. In: Proceedings of SPIE - The International Society for Optical Engineering, v 3461, 1998, p 567-575.
[6] Chichyang Chen, Liang-An Chen, and Jih-Ren Cheng. Architectural Design of a Fast Floating-point Multiplication-Add Fused Unit Using Signed-Digit Addition. In: IEE Proceedings: Computers and Digital Techniques, v 149, n 4, July 2002, p 113-120.
[7] Jan M. Rabaey, Anantha Chandrakasan, and Borivoje Nikolic. Digital Integrated Circuits: A Design Perspective, Second Edition, 2004, p 586-594.
[8] M. S. Schmookler and K. J. Nowka. Leading zero anticipation and detection - A comparison of methods. In: Proceedings - Symposium on Computer Arithmetic, 2001, p 7-12.
[9] J. D. Bruguera and T. Lang. Leading-one prediction with concurrent position correction. In: IEEE Transactions on Computers, v 48, n 10, October 1999, p 1083-1097.
[10] N. T. Quach, N. Takagi, and M. J. Flynn. Systematic IEEE Rounding Method for High-Speed Floating-Point Multipliers. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, v 12, n 5, May 2004, p 511-521.
[11] N. Burgess. Prenormalization Rounding in IEEE Floating-Point Operations Using a Flagged Prefix Adder. In: IEEE Transactions on VLSI Systems, v 13, n 2, February 2005, p 266-277.
[12] M. D. Ercegovac and T. Lang. Radix-4 square root without initial PLA. In: IEEE Transactions on Computers, v 39, n 8, August 1990, p 1016-1024.
[13] M. D. Ercegovac and T. Lang. On-the-fly rounding for division and square root. In: Proceedings - Symposium on Computer Arithmetic, 1989, p 169-173.
[14] C. Jacobi, K. Weber, V. Paruthi, and J. Baumgartner. Automatic formal verification of fused-multiply-add FPUs. In: Proceedings - Design, Automation and Test in Europe, 2005, v 2, p 1298-1303.
[15] T.-J. Kwon, J.-S. Moon, J. Sondeen, and J. Draper. A 0.18µm implementation of a floating-point unit for a processing-in-memory system. In: Proceedings - IEEE International Symposium on Circuits and Systems, v 2, 2004.
[16] D. M. Russinoff. A case study in formal verification of register-transfer logic with ACL2: the floating point adder of the AMD Athlon processor. In: Formal Methods in Computer-Aided Design, Third International Conference, FMCAD 2000, Proceedings (Lecture Notes in Computer Science, v 1954), 2000, p 3-6.


IBM eserver z990 G. Gerwig H. Wetter E. M. Schwarz J. Haess C. A. Krygowski B. M. Fleischer M. Kroener IBM eserver z990 FPU IBM SRT IBM zseries IEEE754 FPU 4 SRT 2 IBM z990 eserver*[1] (FPU) IEEE754 [2] BFP) IBM z/architecture*[3] (HFP) IBM PowerPC * z * HFP BFP z990 FPU 1996 G3 FPU[4] 1997 G4 FPU[5,6] 1998 G5 FPU[7,8] 1990 G6 FPU 2000 z900 FPU[9] z990 FPU BFP z Linux** JAVA** C++ BFP G5 G6 z900 FPU BFP HFP BFP - 51 -

BFP HFP BFP IBM p POWER4* [10] z POWER4 SRT [11,12] z990 1 SRT 2 3 4 SRT 5 6 7 8 9 SRT BFP HFP 1998 IBM z G5 HFP BFP HFP BFP HFP BFP [13] HFP 2 n-1 BFP - 52 -

(2 n-1-1) BFP HFP XBPFi=(-1) Xs (1+Xf) 2 Xe-biasBi bias Bii =2 n-1-1=32767 X HFPi =(-1) Xs Xf 2 Xe-biasHi, bias Hii =2 n-1 =32768 [10] 3 (FPU FPU E1 E2 E3 E4 E5 E (E-1) - 53 -

E0 (FPR) E0 E1 E2 E3 0 E4 E5 E6 E0 A B C (LWRs) 16 FPR 4 5 LWR E0-54 -

4 FPU RX RR RX RR [13]IBM PowerPC RX PowerPC RR z HFP 56 56 4 29-55 -

BFP 1 Y X P X Wj lzcl[14,15] 0 n 1 X x x = + i i= 1 i 2 i 0 n 1 Y = y + y i2 j= 1 j j Y n 1 + 1 2 = W i 4 j= 1 j j j { 2, 1, 0, 1, 2} W + + n 1 + 1 2 P= W ixi 4 j= 1 j j 1 n 1 = + i X x x i= 1 i 2 i X = X x 0 n 1 + 1 2 j P = Wj ix i4 Yi x j= 1 0-56 -

lzcl = Yi x 0 D D+1 D-1 [10] 56 112 176 HFP BFP BFP 176 60 116 1.cccc...cGGG xx.pppp...pggggggggg ^ ^ 1 2-57 -

G c p E c 1 60 E p 2 LZC E n =E p -LZC En<Emin BFP En<Emin LZCmin=Ep-Emin En Emin [10] IEEE754 LSB BFP 113-58 -

5 116 116 SRT 1 FPU 32 64 128 1 SRT SRT P i+1 =r P i -q i+1 D P q D r - 59 -

P i+1 =r P i =q i+1 2Q i -q 2 i+1r -(i+1) P q Q r 6 q i+1 P i+1 q i+1 q i+1 P i+1 <(q max D)/(r-1) P-D q i+1-60 -

4 [16,17] P i+1 =P Ci+1 +P Ci+1 P Si+1 +P Ci+1 =4(P Si +P Ci )-q i+1 D q i+1 {-3,-2,-1,0,0,+1,+2,+3} q i+1 =q i+1,1 + q i+1,2 P Si+1 +P Ci+1 =4(P Si +P Ci )-q i+1,1 1D-q i+1,2 2D P S P C -q i+1,1 q i+1,2 1 2-1 0 +1 2(r=2) P i+1 =P Si+1 +P Ci+1 Q i =Q Pi +Q Ni, P Si+1 +P Ci+1 =2(P Si +P Ci )-q i+1 2Q Pi +q i+1 2Q Ni -q 2 i+1 r -(i+2) q i+1 {-1,0,0,+1} q 2 i+1 r -(i+2) - 61 -

7 BFP 113 PD +/- - 62 -

q i+1,1 q i+1,2 HFP 116 116 28 6 4:2 3:2 (CSAs) CPAs CPA CPA CPA 116 Q pos Q neg q i+1 Q pos Q neg q i+1 FPU BFP FPU IEEE - 63 -

3 IEEE q i+1 n V n D n Q n Q0 =n V -n D V norm <D norm n I =n V -n D+1 V norm D norm n Qe P Start P Stop P Start =64-n Qe - 64 -

P Stop =64 4 FPU 3.76 0.22 FPU 6% IBM 0.13 m CMOS SOI 1.15V 50 1.2GHz A B C 56 116 116 FPR A B C CMOS FPU - 65 -

8 IBM eserver z990-66 -

80 z990 FPU Juergen Foag Andree Marth Hans-Juergen Muenster Lukas Daellenbach, Dave Rude, Peter Cook, Steve Klepner, Fanchieh Yee, Harald Mielich, Rainer Clemen Juergen Vielfort Klaus Keuerleber * (IBM) ** Linus Torvalds Sun Microsystems, Inc. - 67 -
