Design and Implementation of a Floating-Point Unit


June 20, 2005


Abstract

Floating-point representation is the computer's form of scientific notation, and floating-point operations make up the major part of multimedia computation. The floating-point unit is therefore an important part of every kind of processor and largely determines its performance. This thesis, my undergraduate thesis, describes the design and implementation of a 32-bit floating-point unit and the related research results. The unit is fully compliant with the IEEE 754 floating-point standard and supports 32-bit single-precision operations including addition, multiplication, division, square root, and conversion between integers and floating-point numbers.

The front-end arithmetic adopts a multiply-add fused (MAF) scheme that computes addition and multiplication together (A + B × C). The processing of denormalized numbers is merged into the dataflow. To narrow the main adder, the normalization shift is performed before the addition. Rounding is performed on the fly, together with the main addition at significand width. The synthesis results show that the delay of the MAF unit is substantially lower than in previous designs. The MAF unit is divided into 3 pipeline stages, which greatly increases throughput.

A radix-4 SRT algorithm is implemented to compute division and square root iteratively. Different rounding schemes are chosen for the different representations of the intermediate results. As a result, the number of iteration cycles is bounded and the arithmetic is efficient.

The whole design is described in Verilog HDL and simulated. It is mapped onto SMIC 0.18µm CMOS technology as an automatically generated standard-cell implementation. The post-layout simulation results show that the design meets its goals.

Key Words: FPU (floating point unit), MAF (multiply-add fused), denormalized number processing, rounding


1.1 0 (Float Point Unit) FPU 3D FPU FPU ALU CPU 486 Intel i80486dx CPU MMX 3DNow! CPU DSP FPU 1.2 1 FPU FLOPS Floating Point Operations Per Second 2005 2 SONY IBM TOSHIBA ISSCC05 90nm SOI Cell SPE (Synergistic Processing Element) 4GHz SPE 32GFLOPS SONY Play Station 3-1 -

2 Intel P4 23 7 AMD Athlon 16 4 P4 Athlon 3 1994 Intel Pentium 5 1.3 SMIC 0.18µm CMOS MAF(Multiply-add fused) A+B C A B C MAF IBM z990 [1] MAF MAF Z990 (millicode) [2] [3] MAF - 2 -

[3] MAF MAF A/B [1] 4 SRT 24 12 1 14 1.2 AMD Athlon SQRT(A) z990 FPU 2 SRT 1 20 P4 23 [4] 4 SRT 2 14 IEEE 754 [4] MAF Verilog HDL RTL Synopsys VCS Synopsys Design Compiler Synopsys Astro IBM z990 FPU 0.13µm CMOS 1.2GHz 156MHz IEEE 754 FPU FPU

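This chapter introduces the IEEE 754 standard [4].

2.1 Floating-point format

A floating-point number has the form ±S × B^E. Figure 2.1.1 shows the 32-bit format: 1 sign bit s, 8 exponent bits e, and 23 fraction bits f.

Figure 2.1.1: 32-bit floating-point format (s: 1 bit, e: 8 bits, f: 23 bits)

s (sign): 0 for positive, 1 for negative.
S (significand): carries an implied leading 1 for normalized numbers; it is also called the mantissa. Only the fraction is stored.
E (exponent): in the 32-bit format the bias is 127. The 8-bit field ranges over 0~255; field values 1~254 encode exponents -126~127, and a field value of 0 also stands for exponent -126.
B (base): 2.

Table 2.1.1 lists the values represented by the 32-bit IEEE 754 format.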
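As a minimal sketch of the 32-bit IEEE 754 layout (1 sign bit, 8 exponent bits, 23 fraction bits, bias 127), the fields can be extracted and interpreted as below. This is an illustration in Python, not the thesis's Verilog HDL implementation; the names `decode_single`, `value_of`, and `bits_of` are hypothetical.

```python
import struct

def decode_single(bits: int):
    """Split a 32-bit pattern into (s, e, f) and name its value class."""
    s = (bits >> 31) & 0x1        # sign bit
    e = (bits >> 23) & 0xFF       # biased exponent field
    f = bits & 0x7FFFFF           # 23-bit fraction field
    if e == 255:
        kind = "inf" if f == 0 else "nan"
    elif e == 0:
        kind = "zero" if f == 0 else "denormal"
    else:
        kind = "normal"
    return s, e, f, kind

def value_of(bits: int) -> float:
    """Interpret the pattern as +/- 2^(e-127) * 1.f or +/- 2^-126 * 0.f."""
    s, e, f, kind = decode_single(bits)
    sign = -1.0 if s else 1.0
    if kind == "inf":
        return sign * float("inf")
    if kind == "nan":
        return float("nan")
    if e == 0:
        # zero and denormalized numbers: no implied leading 1
        return sign * (f / 2**23) * 2.0**-126
    return sign * (1 + f / 2**23) * 2.0**(e - 127)

def bits_of(x: float) -> int:
    """Bit pattern of x after conversion to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]
```

For example, `value_of(bits_of(-6.5))` round-trips to `-6.5`, and the pattern `0x00000001` decodes to the smallest denormalized number.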

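Table 2.1.1: Values represented by the 32-bit format

Sign   Exponent e     Fraction f   Value
0      0              0            +0
1      0              0            -0
0      255            0            +∞
1      255            0            -∞
0/1    255            ≠ 0          NaN
0/1    0 < e < 255    f            ±2^(e-127) × 1.f
0/1    0              f ≠ 0        ±2^(-126) × 0.f

2.2 Rounding

IEEE 754 specifies 4 rounding modes: round to nearest (ties to even), round toward 0, round toward +∞, and round toward -∞. The two directed modes toward +∞ and -∞ support interval arithmetic.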
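The four IEEE 754 rounding modes can be illustrated with a small sketch. It assumes the exact result is available as a rational number measured in units of the last place, so rounding means picking a neighboring integer. The function name `round_ulps` and the mode strings are hypothetical, chosen only for this example.

```python
from fractions import Fraction
import math

def round_ulps(x: Fraction, mode: str) -> int:
    """Round an exact value (in ulps) to an integer under one IEEE 754 mode."""
    lo = math.floor(x)            # neighbor toward -infinity
    rem = x - lo                  # discarded part, 0 <= rem < 1
    if rem == 0:
        return lo                 # exact: no rounding needed
    if mode == "nearest_even":
        half = Fraction(1, 2)
        # ties go to the neighbor whose last bit is 0
        if rem > half or (rem == half and lo % 2 == 1):
            return lo + 1
        return lo
    if mode == "toward_zero":
        return lo if x > 0 else lo + 1
    if mode == "toward_pos_inf":
        return lo + 1
    if mode == "toward_neg_inf":
        return lo
    raise ValueError(mode)
```

In hardware the same decision is usually made from the last kept bit plus guard and sticky bits rather than from an exact fraction.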

0 2.2.1 00.0 01.0 10.0 2.2.1 2.3 IEEE 754 2.3.1 < <+ 2.3.2 NaN NaN (signaling)nan NaN (quiet)nan NaN - 6 -

1) NaN 2) ( ) + (+ ) 3)0 4)0/0 / 5) x REM 0 REM y 6) 2.3.3 0 - x x + x = x (-x) x -0-0 2.3.4 IEEE 754 1 1 2.3.1 [2 n 2 n+1 ] 0 0 IEEE 32 2 23 2-126 2 23 [2-126 2-125 ] 0 2-126 - 7 -

0 2-126 2-125 2-124 2-123 2.3.1 (gradual underflow) 2.4 Exception IEEE 754 5 2.4.1 NaN 2.3.2 2.4.2 2.4.3 overflow 1) 2) 3) 4) - 8 -

2.4.4 underflow 2.4.5 / 2.5 32 IEEE 754 [4] 1 8 23 24 1 0 1-9 -

3.1 3.1.1 [1] [6] 1 C B [7] (CSA, Carry-Save Adder) 48 A 3:2CSA A B C A B C B C 74 A 48 B C 3:2 CSA 74 74 24 [3] [3] [3] [6] 1-10 -

1 [1] [2] A B C A B C A B C B C CSA 3 2 CSA 74 3.1.1 MAF 3.2 24 MAF - 11 -

ExpA sub ExpB ExpC A 24 Bit invert bits of A and 0's if add or bits of ~A and 1's if sub B 24 C 24 mul record ExpA LeadA ExpB LeadB ExpC LeadC Calculate the exponent of intermediate results 27-d 74-bits alignment shifter Sticky bit st1 calculation Part of sticky 74 26 MSB 48 LSB sub 13:2 CSA tree 48 48 ~st1 denormalized number processing Adjustment for potential denormalized results 3:2 CSA 49 49 HA Carry word HA Sum word HA inv.inputs Carry word LZA Logic for the LSBs MSBs processing Part of Dual adder 74 49 0 1 mux 75 75 Part of Dual adder 24 bits 49 bits XX GG XX.PP complement Sign detection to the add/round module 75-bits 75-bits normalization normalization shifter shifter Bits shifted-in during normalization Carry word: 0... 0 Sum word: 0... 0 (if complement=0) 1... 1 (if complement=1) 75 75 25 bits 50 bits X. X.... X X......... X X. X.... X X......... X L bit 50 24 24 50 st1 complement Exponent calculationone bit correction Rest of 22-bits Flagged prefix adder Rounding Correction Carry and sticky bits calculation Rounding Mode RN=1 if rounding to nearest RI=1 if rounding to infinity Exponent Resul 3.2.1 MAF 3.2.1 1 0 24 1-12 -

3 1 B C (Booth) C 24 C 0 13 B 0 ±1 ±2 3:2 13 2 [7] 2 A B C A+ B C A A 75 3.2.2(a) 24 A 51 1 51 0 A B C d = exp(a) (exp(b) + exp(c)) 3.2.2(b) d = 27 d 27 A B C B C B C A exp(a) d 27 A 0 1 shift amount = 27 d 75 d 48 24 A 24 24 0 1 0 (sticky bit) 24 0 1-13 -
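The passage above mentions Booth recoding of the 24-bit multiplier C into 13 partial products that are multiples 0, ±1, ±2 of B, reduced by 3:2 carry-save adders. A minimal sketch of that idea follows, assuming standard radix-4 Booth recoding of an unsigned operand. It uses unbounded Python integers instead of fixed-width buses, and the names `booth_radix4`, `csa_3to2`, and `multiply` are hypothetical.

```python
def booth_radix4(c: int, width: int = 24):
    """Recode an unsigned `width`-bit multiplier into radix-4 digits in {-2..2}."""
    n_digits = width // 2 + 1          # 13 digits for a 24-bit operand
    ext = c << 1                       # implicit 0 below the LSB
    digits = []
    for k in range(n_digits):
        window = (ext >> (2 * k)) & 0b111   # bits c[2k+1], c[2k], c[2k-1]
        b_lo = window & 1
        b_mid = (window >> 1) & 1
        b_hi = (window >> 2) & 1
        digits.append(b_lo + b_mid - 2 * b_hi)
    return digits

def csa_3to2(a: int, b: int, c: int):
    """3:2 carry-save adder: a + b + c == sum_word + carry_word."""
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word

def multiply(b: int, c: int, width: int = 24) -> int:
    """Accumulate the Booth partial products in carry-save form."""
    s, cy = 0, 0
    for k, d in enumerate(booth_radix4(c, width)):
        partial = (d * b) << (2 * k)   # 0, +/-B or +/-2B, weighted by 4^k
        s, cy = csa_3to2(s, cy, partial)
    return s + cy                      # one carry-propagate add at the end
```

The final `s + cy` stands in for the single carry-propagate addition; a hardware CSA tree would reduce the 13 partial products in parallel rather than in a loop.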

st1 0 75 24 51 A A A 2 48 75 B C (a) 2 24 A 51 B C (b) 24 1 1 (c) 49 3:2 CSA 49 50 B C A 26 3:2CSA 49 0 1 3:2CSA 50 49 51 (d) 3:2CSA 3.2.2 MAF 3 1 A B C 1 24 lead 0 lead=0-14 -

75 B C 26+leadB+leadC 27+leadB+leadC A leada + shift amount A B C 3.2.2 49 3:2 (Carry Save Adder) A 49 48 B C A 1 1 24 1 st1 1 75 B C 0 3:2 CSA 49 50 49 (d) B C 48 B C 3:2 CSA B C B C [0,4) 4 11.*+01.* 1X.*+1X.* 01.*+11.* B C B C [1,4) 10.*+01.* 1 B C [0,2) 2 (1X.*+XX.* XX.*+1X.*) B C multi-carry 3:2 CSA carry[50] 75 75 26 A 49 3:2 CSA 75 49 3:2 CSA - 15 -

49 multi-carry 0 carry[50] 25 0 75 26 multi-carry 1 carry[50] 1 26 0 75 26 multi-carry 1 carry[50] 0 26 1 75 26 3:2 CSA 75 75 24 1 (Leading Zeros Anticipator) 1 3:2 CSA 3 1 75 51 A 1 24 75 A 51 [8] 1 [9] [8] 0 1 0-16 -

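$$T = A \oplus B, \quad G = A B, \quad Z = \overline{A}\,\overline{B}$$

$$f_0 = \overline{T_0}\, T_1$$

$$f_i = T_{i-1}\left(G_i \overline{Z_{i+1}} + Z_i \overline{G_{i+1}}\right) + \overline{T_{i-1}}\left(Z_i \overline{Z_{i+1}} + G_i \overline{G_{i+1}}\right), \quad i > 0 \tag{3.1}$$

51 A 1 3.2.1 A 1 27 26 0 2 1 4 1 t[n]^ (~z[n+1]) 1 A 1 26 25 0 t[n]^ (~z[n+1]) 1 A 1 26 25 0 A t[n] ^ (~g[n+1]) 0 A 25 24 0 1 t[n] ^ (~g[n+1]) 51 [3] 32 3.2.3 51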

3.2.3 51 Lz 75 a) A 0 A 2525 A B C 2 24 A B C 3 A B C 3 A ( ) A A A A 25 B C b) 25 0 1( ) A 1 A 1-18 -

c) A 25 A B C B C ExpBC = + ExpC Ebias 51 51 4 47 51 A 25 24 1 51 24 + Lz ExpR 1 ExpR = ExpBC Lz + 3 1 3 4 Lz 4 1 3 Lz ExpBC + 2 Lz > ExpBC+2 + 2 ExpR 1 3.2.1 ARS NLS ExpR 3.2.1 NLS ExpR ARS = 0 0 ExpA ARS <25 ARS ExpA ARS <25 A ARS ExpA 0<ARS<25 A ARS - 1 ExpA + 1 ARS 25 Lz ExpBC + 2 24 + Lz ExpBC Lz -3 ARS 25 Lz > ExpBC+2 26 + ExpBC 1 ARS 25 Lz ARS<25 0 1 1 0 2 IEEE 754 51 A B C B C - 19 -

3 [3] 75 3.2.3 NLS 1 0 1 1 24 51 51 1 [10] MAF A+B A+B+1 [11] flagged prefix adder [11] IEEE 754 IEEE 754-20 -

3.3 [10] MAF 1 75 1 2 1(Eone ) 0 C F, S F, R F C I, S I, R I m0, m1 n0, n1 l g(m) st2 25 50 3.3.1 75 C S R C S R 25 50 25 S I C I R I S F C F R F R = S + C + 1 ( 1 ) R I = S I + C I R F = S F + C F + 1 1 S F,C F [0,1) R F [0,2) g R F st2 st1 49 R F 49 1 st2 r 1 R F - 21 -

r c R F m R F l R I R F n0 n1 0 1 R I m0 m1 p R F l p R I 25 50 2 X 2 = 4 R I 25 R F 50 24 24 R I,R I+2 4 1 l p c lp fix0,fix1 Sns0, Sns1, Sls0, Sls1 50 m0,m1 n0,n1 c,m st1 st st2 NOR 1 Eone 3.3.2 [10] IEEE 754-22 -

3.3.1 p=0 RF (LS) (NS) [0,0.5) RI (n, l, st): (X0X): RI (010): RI (011): RI + 2 (11X): RI + 2 [0.5,1) (l, st): (00): RI (01): RI ( l 1) (n, l, st) (X0X): RI (X1X): RI+2 (1X): RI+2 ( l 0) [1,1,5) (l): (0): RI ( l 1) (n, l, st): (000): RI (1): RI +2 ( l 0) (001): RI + 2 (01X): RI + 2 (1XX): RI + 2 [1.5,2) (l, st): (0X): RI + 2 RI + 2 (10): RI +2 ( l 0) (11): RI + 2 Sns0 Sns1 24 R I R I +2 Sls0 Sls1 24 R I R I +2 23 l NS LS R1 R2 R3 R4 R F Sls0 & ~ ~ Sls0 = LS & (R1 R2 & ~l R3 & ~l) LS LS = ~m0 & (R1 R2) ~m1 & (R3 R4) R1 = ~c & ~m R2 = ~c & m R3 = c & ~m - 23 -

R4 = c & m m0 0 R I m0 = R I [23] m1 1 R I R I +1 R I R I +2 m1 = R I [23] & l (R I +2)[23] & ~l Eone LS = ~Eone & (~m0 & (R1 R2) ~m1 & (R3 R4)) Sls1 Sns1 Sns0 fix0 fix1 R I +3 R I 1 1 p = C F.m S F.m 24 R I +p R I +p+2 3.3.2 RF (LS) (NS) [0,0.5) RI+p (lp): (0): RI + p (1): RI + p + 2 [0.5,1) (l, p): (00): RI + p ( l 1) ( lp, p) (00): RI + p +2 (10): RI + p + 2 ( l 0) (01): RI + p (X1): RI + p (10): RI + p + 2 (11): RI + p + 2 [1,1,5) RI + p (1): RI +2 ( l 0) [1.5,2) (lp): (0): RI + p ( l 0) (1): RI + p + 2 ( l 1) (lp): (0): RI + p (1): RI + p + 2 RI + p + 2 Sls0 Sls1 Sns1 Sns0 fix0 fix1-24 -

MAF 3.3.3 RN p=0 Slns0 = ~m0 & ~c & (~m ~l) ~m1 & c & ~m & ~l Sls1 = (~m0 & ~c & m & l ~m1 & c & (m l)) & ~m1 Sns0 = m0 & ~c & ( ~l ~m & ~n0 & ~st ) m1 & c & ~m & ~n1 & ~l & ~st Srs1 = m0 & l & ~c & (n0 st m) m1 & c & (n1 l st m)) ( ~m0 & ~c & m & l ~m1 & c & (m l)) & m1 fix0 = ~m0 & l & ~c & m ~m1 & l & c & (~m ~st) fix1 = ~m0 & ~c & m & ~l & st ~m1 & c & ~m & ~l RI p = CF.m SF.m Sls0 = ~m0 & ~c & (~m & ~st ~lp p) ~m1 & c & ( ~m & ~st ~lp) Sls1 = ~m0 & ~c & ( m st ) & lp & ~p ~m1 & c & lp Sns0 = m0 & ~c & ~lp & (~m & ~st p) ~m1 & c & ~m & ~st & ~lp Sns1 = m0 & ~c & (lp ~p & (m st)) m1 & c fix0 = ~m0 & ~c & lp & ~p & (m st) ~m1 & c & lp fix1 = ~m0 & ~c & ~lp & ~p & (m st) ~m1 & c & ~lp RZ p = 0 Sls0 = ~m0 & ( ~c ~l) Sls1 = ~m1 & c & l Sns0 = m0 & (~c ~l) Sns1 = m1 & c & l fix0 = ~m1 & c & l fix1 = ~m1 & c & ~l Eone m0 m1 m1 = Eone R I [23] & l (R I +2)[23] & ~l m0 = Eone R I [23] n0 = R I [0] - 25 -

n1 = R I [0] & l (R I +2)[0] & ~l NS LS 3.4 MAF MAF Verilog HDL SMIC 0.18µm Synopsys Design Compiler 7.15ns 1.8 (tdetect) A (talign) 3:2 CSA (tcsa) LZD (tcalculate) (tnorm) (tadd) tmaf = tdetect + talign + tcsa + tcalculate + tnorm + tadd MAF CSA MAF 75 75 MAF LZD 50 24 CSA 75 + 75 + 75 + 75 50 + 75 + 24 75 + 75 2 50 24 8 MAF 8 NOR - 26 -

MAF MAF 350MHz 2-27 -

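4.1 SRT

The FPU uses a radix-4 SRT algorithm, following the IBM eServer z990:

$$P_{i+1} = r \cdot P_i - q_{i+1} \cdot D \tag{4.1}$$

Here $P$ is the partial remainder, $q$ the quotient digit, $D$ the divisor, and $r$ the radix. The digit $q_{i+1}$ is selected so that $|P_{i+1}| < (q_{max} \cdot D)/(r-1)$. In this FPU $q \in \{-3,-2,-1,0,1,2,3\}$ and $r = 4$, so the iteration requires $P < D$.

P-D 4.1.1 IBM z990 PD PD 3 1 5 3 $P_{in} < D$

4.1.2 PD

4.2

24 Qpos Qneg Pcarry, Psave 5

$$\mathit{Psave}_{i+1} + \mathit{Pcarry}_{i+1} = 4\,(\mathit{Psave}_i + \mathit{Pcarry}_i) - q_{i+1} \cdot D \tag{4.2}$$

$$\mathit{Qpos}_i = q_i \ \ (q_i > 0), \qquad \mathit{Qneg}_i = q_i \ \ (q_i < 0)$$

$$Q = \mathit{Qpos} + \mathit{Qneg} = \sum_i q_i \cdot 4^{-i} \tag{4.3}$$

/ PD / 5 Qpos Qneg 4.2.1 PD 1 0 [1,2) (0.5,2) 1 ExpA, ExpB leada, leadb

$$\mathit{ExpR} = \mathit{ExpA} - \mathit{ExpB} + E_{bias} - \mathit{leadA} + \mathit{leadB}$$

1 1 (0.5,2) ExpR 1 24 4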
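The division recurrence (4.1) with r = 4 and digits in {-3,...,3}, together with the Qpos/Qneg accumulation of (4.3), can be sketched as follows. This is a behavioral illustration under stated assumptions: it uses exact rational arithmetic and selects each digit by rounding 4P/D, whereas the thesis selects digits from a P-D table on a carry-save remainder. The function name `srt4_divide` is hypothetical.

```python
from fractions import Fraction

def srt4_divide(x, d, iterations):
    """Radix-4 SRT-style division sketch: returns (Q, final remainder P)."""
    p = Fraction(x)               # P_0 is the dividend
    d = Fraction(d)
    assert d > 0 and abs(p) < d   # precondition P < D
    qpos = Fraction(0)            # accumulates positive digits
    qneg = Fraction(0)            # accumulates negative digits
    for i in range(1, iterations + 1):
        shifted = 4 * p
        # assumed selection rule: nearest digit, clamped to {-3..3}
        q = max(-3, min(3, round(shifted / d)))
        p = shifted - q * d       # P_{i+1} = 4 P_i - q_{i+1} D
        weight = Fraction(1, 4**i)
        if q > 0:
            qpos += q * weight
        elif q < 0:
            qneg += q * weight
    return qpos + qneg, p         # Q = Qpos + Qneg = sum q_i 4^-i
```

After n iterations the identity x = Q·d + P·4^(-n) holds exactly, so the quotient error is bounded by 4^(-n) as long as |P| < D is maintained.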

:P<D (D 1.00 Pin<0.010) 24 13 1 (0.5,1) 0 13 1 14 1 Qpos Qneg 24 15 ExpR<-23-23 ExpR<1 23+ExpR 23+ExpR (27+ExpR)/2 (29+ExpR)/2 23+ExpR (26+ExpR)/2 (28+ExpR)/2 4.3 [11] - 31 -

Qpos 13 Qneg 13 24 26 23/24 4.3.1 14 Qpos Qneg 1 1 Qpos + ~Qneg = Qpos Qneg 1 1 2-32 -

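5.1

The IBM z990 FPU computes the square root with a radix-2 SRT algorithm, 1 bit per iteration. This design uses radix-4 SRT with the digit set {-2,-1,0,1,2}.

For an operand $X \in [1,2)$ the result $s$ approximates $X^{1/2}$ with error $\varepsilon$, $s \in [1,2)$. For a result of $m$ digits, $|\varepsilon| < 4^{-m}$. Let $S[j]$ be the partial result after $j$ iterations, with $S[0] = 1$:

$$S[j] = \sum_{i=0}^{j} s_i 4^{-i}, \quad s_i \in \{-2,-1,0,1,2\} \tag{5.1}$$

The final result is $s = S[m]$. With the residual $w[j] = 4^j (x - S[j]^2)$, the recurrence is

$$w[j+1] = 4w[j] - 2S[j]s_{j+1} - s_{j+1}^2 4^{-(j+1)} \tag{5.2}$$

and the residual is bounded by

$$-\frac{4}{3}S[j] + \frac{4}{9}4^{-j} \le w[j] \le \frac{4}{3}S[j] + \frac{4}{9}4^{-j} \tag{5.3}$$

PD [12] $\hat{S}[j]$ j S[j] 4 $A_1.A_2A_3A_4$ $A_1 = 1$ $A_1$ $S_1$

$$(S_1, S_2, S_3, S_4) = \begin{cases} (1, 1, 0, \ ), & j = 0 \\ (1, 1, 1, 1), & A_1 = 0 \ \&\ j \ne 0 \\ (1, A_2, A_3, A_4), & j \ne 0 \end{cases} \tag{5.4}$$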

The selection constants determine the digit: $m_k(i) \le w[j] < m_{k+1}(i)$ gives $s_{j+1} = k$.

Table 5.1.1

i         0       1       2       3       4       5       6       7
Ŝ[j]      8/16    9/16    10/16   11/16   12/16   13/16   14/16   15/16
m_2(i)    3/2     7/4     2       2       9/4     5/2     5/2     11/4
m_1(i)    1/2     1/2     1/2     1/2     3/4     3/4     1       1
m_0(i)    -1/2    -5/8    -3/4    -3/4    -3/4    -1      -1      -1
m_-1(i)   -13/8   -7/4    -2      -17/8   -9/4    -5/2    -11/4   -23/8

[11]

5.2

$$w[j+1] = 4w[j] - 2S[j]s_{j+1} - s_{j+1}^2 4^{-(j+1)}$$

1) 4w[j] 2) 4w[j] S[j] $s_{j+1}$ 3) w[j+1]

5.2.1

X S[m] w[j]

$$w[j+1] = 4w[j] + F[j], \qquad F[j] = -2S[j]s_{j+1} - s_{j+1}^2 4^{-(j+1)}$$

S[j] $s_{j+1}$ 5.2.1 [11] S[j]

$$A[j] = S[j], \qquad B[j] = S[j] - 4^{-j}$$

$$A[j+1] = \begin{cases} A[j] + s_{j+1} 4^{-(j+1)}, & s_{j+1} \ge 0 \\ B[j] + (4 - |s_{j+1}|)\, 4^{-(j+1)}, & s_{j+1} < 0 \end{cases} \tag{5.5}$$

$$B[j+1] = \begin{cases} A[j] + (s_{j+1} - 1)\, 4^{-(j+1)}, & s_{j+1} > 0 \\ B[j] + (3 - |s_{j+1}|)\, 4^{-(j+1)}, & s_{j+1} \le 0 \end{cases} \tag{5.6}$$

$s \in [1, 2)$: $A[0] = 1.00\ldots00$, $B[0] = 0.00\ldots00$

2 w[j+1]

$$F[j] = \begin{cases} -\left(2A[j] + s_{j+1} 4^{-(j+1)}\right) s_{j+1}, & s_{j+1} > 0 \\ 0, & s_{j+1} = 0 \\ \left(2B[j] + (8 - |s_{j+1}|)\, 4^{-(j+1)}\right) |s_{j+1}|, & s_{j+1} < 0 \end{cases} \tag{5.7}$$

w[j] s[j+1] A[j] B[j] F[j] 5.2.2 s [1, 2) 0 2^-149 24 12 13 14
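The square-root recurrence w[j+1] = 4w[j] − 2S[j]s_{j+1} − s_{j+1}² 4^{-(j+1)} with S[0] = 1 and digits in {-2,...,2} can be exercised with a short behavioral sketch. It is an illustration only: it uses exact rational arithmetic and, as an assumption, picks the digit that minimizes the next residual instead of the thesis's table-based selection. The function name `sqrt_radix4` is hypothetical.

```python
from fractions import Fraction

DIGITS = (-2, -1, 0, 1, 2)

def sqrt_radix4(x, iterations):
    """Digit-recurrence square root sketch for x in [1, 2).

    Returns (S, w, digits) with w == 4**iterations * (x - S*S).
    """
    x = Fraction(x)
    assert 1 <= x < 2
    S = Fraction(1)               # S[0] = 1
    w = x - 1                     # w[0] = 4^0 * (x - S[0]^2)
    digits = []
    for j in range(iterations):
        ulp = Fraction(1, 4**(j + 1))
        # assumed selection rule: digit giving the smallest |w[j+1]|
        best = min(DIGITS, key=lambda s: abs(4 * w - 2 * S * s - s * s * ulp))
        w = 4 * w - 2 * S * best - best * best * ulp   # w[j+1] = 4w[j] + F[j]
        S = S + best * ulp                             # S[j+1] = S[j] + s 4^-(j+1)
        digits.append(best)
    return S, w, digits
```

Because the residual is tracked exactly, `w` can be checked against its definition after any number of iterations, which is a convenient sanity test for the recurrence itself.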

5.3

MAF S[m] 1 1 [13] A[j] B[j] S[j] C[j]

$$A[j] = S[j], \qquad B[j] = S[j] - 4^{-j}, \qquad C[j] = S[j] + 4^{-j}$$

C[j] A[j] B[j]

Table 5.3.1

s_{j+1}   A[j+1]      B[j+1]      C[j+1]
 0        (A[j],0)    (B[j],3)    (A[j],1)
 1        (A[j],1)    (A[j],0)    (A[j],2)
-1        (B[j],3)    (B[j],2)    (A[j],0)
 2        (A[j],2)    (A[j],1)    (A[j],3)
-2        (B[j],2)    (B[j],1)    (B[j],3)

A[m] B[m] C[m]
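The update rules of Table 5.3.1 replace every addition or subtraction by appending one radix-4 digit to A[j] or B[j]. A minimal sketch, assuming the values are held as integers scaled by 4^j and that S[0] = 1 (so A[0] = 1, B[0] = 0, C[0] = 2), is shown below. The names `UPDATE` and `on_the_fly` are hypothetical.

```python
# For each digit s: (source, appended digit) for the new A, B and C.
UPDATE = {
    0:  (("A", 0), ("B", 3), ("A", 1)),
    1:  (("A", 1), ("A", 0), ("A", 2)),
    -1: (("B", 3), ("B", 2), ("A", 0)),
    2:  (("A", 2), ("A", 1), ("A", 3)),
    -2: (("B", 2), ("B", 1), ("B", 3)),
}

def on_the_fly(digits):
    """Convert signed digits s_1..s_m into A, B, C scaled by 4**m."""
    a, b, c = 1, 0, 2                 # S[0] = 1, S[0] - 1, S[0] + 1
    for s in digits:
        old = {"A": a, "B": b}
        # concatenation (X, d) is 4*X + d on scaled integers
        a, b, c = (4 * old[src] + d for src, d in UPDATE[s])
    return a, b, c
```

After m digits, A holds S[m], while B and C hold the neighbors one unit in the last place below and above, which is what makes the final rounding a selection rather than an addition.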

2003 International Technology Roadmap for Semi-conductor, ITRS2003 2:1 3:1 1994 Intel 4.75 6.1-38 -

6.1.1 6.1.1 IEEE 754 constrains biasing 100 3 2 32X3 100 10 23 2 2 32X2 100 10 14 1 2 32 100-39 -

0.23 1000 6.1.2 IEEE 754 C CPU CPU CPU verilog verilog 6.1.3-40 -

1 Synopsys VCS Coverage Metrics license (line/statement coverage) (toggle coverage) (path coverage) (condition coverage) if?: 100% 100% case 6.2 (symbolic simulation) (model checking) (Theorem Proving) Intel STE IBM SixSense AMD ACL2 MAF [14] - 41 -

Synopsys Formality MAF MAF - 42 -

Verilog HDL Synopsys VCS Synopsys Design Compiler Synopsys Astro SMIC 0.18µm Synopsys Design Compiler 3.07ns 4.04 3.34 0.70 1.8V 91.2mW 51.76mW 39.43mW 1.64µW DC 6.2.1

4mm 4mm PAD 1mm 1mm 47% 6.68ns MAF PAD FPU 2004 [15] 12 32 32 (RF) 6.2.2 6.2.1 [15] SMIC 0.18 CMOS TSMC 0.18 CMOS 1.8V Core 2.5V PAD 1.8V Core 2.5V PAD 1.00mm 1.00mm 1.06mm 1.06mm 3 5 15 12 156MHz 266MHz [15] - 44 -

[15] IBM z990 MAF 6.2.2 Intel P4 AMD Athlon K7 IBM z990 3 5 4 5 3 7 4 5 15 23 16 25 14 23 19 FPU

8.1 IEEE 754 32 4 SRT FPU 1mm 2 150MHz 64 MAF SRT VCS DC Astro CAD Solaris UNIX 8.2-46 -

1 AMD ACL2 AMD Athlon K5 20 [16] 2 CPU DSP CPU DSP CPU DSP CPU DSP 2005 ISSCC IBM SONY SCE Toshiba 90nm SOI Cell 2500 Cell IBM 64 Power 8 FPU FPU 3-47 -

[1] G. Gerwig, H. Wetter, E. M. Schwarz, J. Haess, et al., The IBM eServer z990 floating-point unit, In: IBM Journal of Research and Development, v 48, n 3-4, May/July, 2004, p 311-322
[2] G. Gerwig, H. Wetter, E. M. Schwarz, and J. Haess, High Performance Floating-Point Unit with 116 Bit Wide Divider, In: Proceedings of the 16th Symposium on Computer Arithmetic, Santiago de Compostela, Spain, June 2003, p 87-94
[3] T. Lang and J. D. Bruguera, Floating-point multiply-add fused with reduced latency, In: IEEE Transactions on Computers, v 53, n 8, August, 2004, p 988-1003
[4] ANSI/IEEE Standard 754-1985: IEEE Standard for Binary Floating-Point Arithmetic. Piscataway, NJ: IEEE Press, 1985
[5] N. Burgess, The "Flagged prefix adder" for dual additions, In: Proceedings of SPIE - The International Society for Optical Engineering, v 3461, 1998, p 567-575
[6] Chichyang Chen, Liang-An Chen and Jih-Ren Cheng, Architectural Design of a Fast Floating-point Multiplication-Add Fused Unit Using Signed-Digit Addition, In: IEE Proceedings: Computers and Digital Techniques, v 149, n 4, July, 2002, p 113-120
[7] Jan M. Rabaey, Anantha Chandrakasan, Borivoje Nikolic, Digital Integrated Circuits: A Design Perspective, Second Edition, 2004, p 586-594
[8] M. S. Schmookler and K. J. Nowka, Leading zero anticipation and detection - A comparison of methods, In: Proceedings - Symposium on Computer Arithmetic, 2001, p 7-12
[9] J. D. Bruguera and T. Lang, Leading-one prediction with concurrent position correction, In: IEEE Transactions on Computers, v 48, n 10, Oct, 1999, p 1083-1097
[10] N. T. Quach, N. Takagi, M. J. Flynn, Systematic IEEE Rounding Method for High-Speed Floating-Point Multipliers, In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, v 12, n 5, May, 2004, p 511-521
[11] N. Burgess, Prenormalization Rounding in IEEE Floating-Point Operations Using a Flagged Prefix Adder, In: IEEE Transactions on VLSI Systems, v 13, n 2, Feb, 2005, p 266-277
[12] M. D. Ercegovac and T. Lang, Radix-4 square root without initial PLA, In: IEEE Transactions on Computers, v 39, n 8, Aug, 1990, p 1016-1024
[13] M. D. Ercegovac and T. Lang, On-the-fly rounding for division and square root, In: Proceedings - Symposium on Computer Arithmetic, 1989, p 169-173
[14] C. Jacobi, K. Weber, V. Paruthi, J. Baumgartner, Automatic formal verification of fused-multiply-add FPUs, In: Proceedings. Design, Automation and Test in Europe, 2005, v 2, p 1298-1303
[15] T.-J. Kwon, J.-S. Moon, J. Sondeen and J. Draper, A 0.18µm implementation of a floating-point unit for a processing-in-memory system, In: Proceedings - IEEE International Symposium on Circuits and Systems, v 2, 2004
[16] D. M. Russinoff, A case study in formal verification of register-transfer logic with ACL2: the floating point adder of the AMD Athlon™ processor, In: Formal Methods in Computer-Aided Design, Third International Conference, FMCAD 2000, Proceedings (Lecture Notes in Computer Science, v 1954), 2000, p 3-6


IBM eserver z990 G. Gerwig H. Wetter E. M. Schwarz J. Haess C. A. Krygowski B. M. Fleischer M. Kroener IBM eserver z990 FPU IBM SRT IBM zseries IEEE754 FPU 4 SRT 2 IBM z990 eserver*[1] (FPU) IEEE754 [2] BFP) IBM z/architecture*[3] (HFP) IBM PowerPC * z * HFP BFP z990 FPU 1996 G3 FPU[4] 1997 G4 FPU[5,6] 1998 G5 FPU[7,8] 1990 G6 FPU 2000 z900 FPU[9] z990 FPU BFP z Linux** JAVA** C++ BFP G5 G6 z900 FPU BFP HFP BFP - 51 -

BFP HFP BFP IBM p POWER4* [10] z POWER4 SRT [11,12] z990 1 SRT 2 3 4 SRT 5 6 7 8 9 SRT BFP HFP 1998 IBM z G5 HFP BFP HFP BFP HFP BFP [13] HFP 2 n-1 BFP - 52 -

(2 n-1-1) BFP HFP XBPFi=(-1) Xs (1+Xf) 2 Xe-biasBi bias Bii =2 n-1-1=32767 X HFPi =(-1) Xs Xf 2 Xe-biasHi, bias Hii =2 n-1 =32768 [10] 3 (FPU FPU E1 E2 E3 E4 E5 E (E-1) - 53 -

E0 (FPR) E0 E1 E2 E3 0 E4 E5 E6 E0 A B C (LWRs) 16 FPR 4 5 LWR E0-54 -

4 FPU RX RR RX RR [13]IBM PowerPC RX PowerPC RR z HFP 56 56 4 29-55 -

BFP 1 Y X P X Wj lzcl[14,15] 0 n 1 X x x = + i i= 1 i 2 i 0 n 1 Y = y + y i2 j= 1 j j Y n 1 + 1 2 = W i 4 j= 1 j j j { 2, 1, 0, 1, 2} W + + n 1 + 1 2 P= W ixi 4 j= 1 j j 1 n 1 = + i X x x i= 1 i 2 i X = X x 0 n 1 + 1 2 j P = Wj ix i4 Yi x j= 1 0-56 -

lzcl = Yi x 0 D D+1 D-1 [10] 56 112 176 HFP BFP BFP 176 60 116 1.cccc...cGGG xx.pppp...pggggggggg ^ ^ 1 2-57 -

G c p E c 1 60 E p 2 LZC E n =E p -LZC En<Emin BFP En<Emin LZCmin=Ep-Emin En Emin [10] IEEE754 LSB BFP 113-58 -

5 116 116 SRT 1 FPU 32 64 128 1 SRT SRT P i+1 =r P i -q i+1 D P q D r - 59 -

P i+1 =r P i =q i+1 2Q i -q 2 i+1r -(i+1) P q Q r 6 q i+1 P i+1 q i+1 q i+1 P i+1 <(q max D)/(r-1) P-D q i+1-60 -

4 [16,17] P i+1 =P Ci+1 +P Ci+1 P Si+1 +P Ci+1 =4(P Si +P Ci )-q i+1 D q i+1 {-3,-2,-1,0,0,+1,+2,+3} q i+1 =q i+1,1 + q i+1,2 P Si+1 +P Ci+1 =4(P Si +P Ci )-q i+1,1 1D-q i+1,2 2D P S P C -q i+1,1 q i+1,2 1 2-1 0 +1 2(r=2) P i+1 =P Si+1 +P Ci+1 Q i =Q Pi +Q Ni, P Si+1 +P Ci+1 =2(P Si +P Ci )-q i+1 2Q Pi +q i+1 2Q Ni -q 2 i+1 r -(i+2) q i+1 {-1,0,0,+1} q 2 i+1 r -(i+2) - 61 -

7 BFP 113 PD +/- - 62 -

q i+1,1 q i+1,2 HFP 116 116 28 6 4:2 3:2 (CSAs) CPAs CPA CPA CPA 116 Q pos Q neg q i+1 Q pos Q neg q i+1 FPU BFP FPU IEEE - 63 -

3 IEEE q i+1 n V n D n Q n Q0 =n V -n D V norm <D norm n I =n V -n D+1 V norm D norm n Qe P Start P Stop P Start =64-n Qe - 64 -

P Stop =64 4 FPU 3.76 0.22 FPU 6% IBM 0.13 m CMOS SOI 1.15V 50 1.2GHz A B C 56 116 116 FPR A B C CMOS FPU - 65 -

8 IBM eserver z990-66 -

80 z990 FPU Juergen Foag Andree Marth Hans-Juergen Muenster Lukas Daellenbach, Dave Rude, Peter Cook, Steve Klepner, Fanchieh Yee, Harald Mielich, Rainer Clemen Juergen Vielfort Klaus Keuerleber * (IBM) ** Linus Torvalds Sun Microsystems, Inc. - 67 -
