数据库原理与技术

Similar documents
数据库原理与技术

目錄

學 科 100% ( 為 單 複 選 題, 每 題 2.5 分, 共 100 分 ) 1. 請 參 閱 附 圖 作 答 : (A) 選 項 A (B) 選 項 B (C) 選 項 C (D) 選 項 D Ans:D 2. 下 列 對 於 資 料 庫 正 規 化 (Normalization) 的 敘

Microsoft PowerPoint - 05-SQL3-advanced.ppt

SQL: Interactive Queries (2)

Microsoft Word - TIP006SCH Uni-edit Writing Tip - Presentperfecttenseandpasttenseinyourintroduction readytopublish

6-1 Table Column Data Type Row Record 1. DBMS 2. DBMS MySQL Microsoft Access SQL Server Oracle 3. ODBC SQL 1. Structured Query Language 2. IBM

Microsoft PowerPoint - NCBA_Cattlemens_College_Darrh_B

1-1 database columnrow record field 不 DBMS Access Paradox SQL Server Linux MySQL Oracle IBM Informix IBM DB2 Sybase 1-2

untitled

4. 每 组 学 生 将 写 有 习 语 和 含 义 的 两 组 卡 片 分 别 洗 牌, 将 顺 序 打 乱, 然 后 将 两 组 卡 片 反 面 朝 上 置 于 课 桌 上 5. 学 生 依 次 从 两 组 卡 片 中 各 抽 取 一 张, 展 示 给 小 组 成 员, 并 大 声 朗 读 卡


穨control.PDF

Windows XP

Oracle 4

PowerPoint Presentation

VASP应用运行优化

2015年4月11日雅思阅读预测机经(新东方版)

Microsoft PowerPoint - ch6 [相容模式]

Microsoft PowerPoint - CH 04 Techniques of Circuit Analysis

untitled

untitled

Preface This guide is intended to standardize the use of the WeChat brand and ensure the brand's integrity and consistency. The guide applies to all d


ebook 165-5

Abstract Today, the structures of domestic bus industry have been changed greatly. Many manufacturers enter into the field because of its lower thresh

WTO

Microsoft Word - Final Exam Review Packet.docx

(baking powder) 1 ( ) ( ) 1 10g g (two level design, D-optimal) 32 1/2 fraction Two Level Fractional Factorial Design D-Optimal D

Microsoft PowerPoint - Aqua-Sim.pptx

ENGG1410-F Tutorial 6

天 主 教 輔 仁 大 學 社 會 學 系 學 士 論 文 小 別 勝 新 婚? 久 別 要 離 婚? 影 響 遠 距 家 庭 婚 姻 感 情 因 素 之 探 討 Separate marital relations are getting better or getting worse? -Exp

BC04 Module_antenna__ doc

2/80 2

ch_code_infoaccess

声 明 本 人 郑 重 声 明 : 此 处 所 提 交 的 硕 士 学 位 论 文 基 于 等 级 工 鉴 定 的 远 程 考 试 系 统 客 户 端 开 发 与 实 现, 是 本 人 在 中 国 科 学 技 术 大 学 攻 读 硕 士 学 位 期 间, 在 导 师 指 导 下 进 行 的 研 究

Microsoft PowerPoint - STU_EC_Ch08.ppt

厦 门 大 学 学 位 论 文 原 创 性 声 明 本 人 呈 交 的 学 位 论 文 是 本 人 在 导 师 指 导 下, 独 立 完 成 的 研 究 成 果 本 人 在 论 文 写 作 中 参 考 其 他 个 人 或 集 体 已 经 发 表 的 研 究 成 果, 均 在 文 中 以 适 当 方

untitled

國立中山大學學位論文典藏

Microsoft Word - template.doc

Fun Time (1) What happens in memory? 1 i n t i ; 2 s h o r t j ; 3 double k ; 4 char c = a ; 5 i = 3; j = 2; 6 k = i j ; H.-T. Lin (NTU CSIE) Referenc


A Community Guide to Environmental Health

Microsoft Word - ChineseSATII .doc

Improved Preimage Attacks on AES-like Hash Functions: Applications to Whirlpool and Grøstl

coverage2.ppt

( ) ( ) ( ) ( )

致 谢 开 始 这 篇 致 谢 的 时 候, 以 为 这 是 最 轻 松 最 愉 快 的 部 分, 而 此 时 心 头 却 充 满 了 沉 甸 甸 的 回 忆 和 感 恩, 一 时 间 竟 无 从 下 笔 虽 然 这 远 不 是 一 篇 完 美 的 论 文, 但 完 成 这 篇 论 文 要 感 谢

Microsoft Word - ORA doc

F4

硕 士 学 位 论 文 论 文 题 目 : 北 岛 诗 歌 创 作 的 双 重 困 境 专 业 名 称 : 中 国 现 当 代 文 学 研 究 方 向 : 中 国 新 诗 研 究 论 文 作 者 : 奚 荣 荣 指 导 老 师 : 姜 玉 琴 2014 年 12 月

Untitled-3

Oracle Database 10g: SQL (OCE) 的第一堂課

Public Projects A Thesis Submitted to Department of Construction Engineering National Kaohsiung First University of Science and Technology In Partial

國立中山大學學位論文典藏.PDF

東莞工商總會劉百樂中學

東吳大學

untitled

Microsoft Word doc

Important Notice SUNPLUS TECHNOLOGY CO. reserves the right to change this documentation without prior notice. Information provided by SUNPLUS TECHNOLO


2015 Chinese FL Written examination

% % % % % % ~

國立中山大學學位論文典藏.PDF

RUN_PC連載_12_.doc

入學考試網上報名指南

Microsoft Word 記錄附件


Abstract Since 1980 s, the Coca-Cola came into China and developed rapidly. From 1985 to now, the numbers of bottlers has increased from 3 to 23, and

世新稿件end.doc

66 臺 中 教 育 大 學 學 報 : 人 文 藝 術 類 Abstract This study aimed to analyze the implementing outcomes of ability grouping practice for freshman English at a u

國 史 館 館 刊 第 23 期 Chiang Ching-kuo s Educational Innovation in Southern Jiangxi and Its Effects ( ) Abstract Wen-yuan Chu * Chiang Ching-kuo wa

國家圖書館典藏電子全文

2

Abstract There arouses a fever pursuing the position of being a civil servant in China recently and the phenomenon of thousands of people running to a

Stochastic Processes (XI) Hanjun Zhang School of Mathematics and Computational Science, Xiangtan University 508 YiFu Lou talk 06/

数据分析技术介绍

Microsoft Word - 11月電子報1130.doc

Microsoft Word - 論文封面 修.doc

Microsoft Word - 中級會計學--試題.doc

Microsoft Word - (web)_F.1_Notes_&_Application_Form(Chi)(non-SPCCPS)_16-17.doc

2 response personnel to speed up the rescue operations after various natural or man-made disasters. Keywords: SMS, Database, Disaster

USPTO Academic research Corporate needs Global/International Inventors Libraries News Media/Publication Patent Attorney or Agent USPTO e (ebusiness Ce

L L L-1 L-1 L-1 L-1 L-1 L-2 L-1 L-1 L-2 L-2 L-2 L-2 L-2 L-2 L-2 L-2 L-2 L-2 L-3 L-3 L-3 L-3 L-2 L-2 L-2 L-2 L

<4D F736F F D C4EAC0EDB9A4C0E04142BCB6D4C4B6C1C5D0B6CFC0FDCCE2BEABD1A15F325F2E646F63>

The Development of Color Constancy and Calibration System

Introduction to Hamilton-Jacobi Equations and Periodic Homogenization

快乐蜂(Jollibee)快餐连锁店 的国际扩张历程


DB2 (join) SQL DB2 11 SQL DB2 SQL 9.1 DB2 DB2 ( ) SQL ( ) DB2 SQL DB2 DB2 SQL DB2 DB2 SQL DB2 ( DB2 ) DB2 DB2 DB2 SQL DB2 (1) SQL (2) S

Analysis of Cultural Elements of Meinong s Paper Umbrella Painting Abstract Meinong paper umbrellas are a traditional industrial art for the Hakka peo

基于UML建模的管理管理信息系统项目案例导航——VB篇

Microsoft Word - ??????-2.doc

X UDC A Post-Evaluation Research on SINOPEC Refinery Reconstruction and Expanding Project MBA 厦门大学博硕士论文摘要库

Microsoft PowerPoint - ATF2015.ppt [相容模式]

考試學刊第10期-內文.indd

Microsoft PowerPoint - AWOL - Acrobat Windows Outlook.ppt [Compatibility Mode]

20-25%,

13 A DSS B DSS C DSS D DSS A. B. C. CPU D. 15 A B Cache C Cache D L0 L1 L2 Cache 16 SMP A B. C D 17 A B. C D A B - C - D

Transcription:

第三篇系统篇

DBMS 的系统成分和工作流程

DBMS 的系统成分和工作流程

DBMS 的系统成分和工作流程

DBMS 的系统成分和工作流程

第三篇系统篇 第? 章存储管理及索引第九章查询处理及查询优化第十章数据库恢复技术第十一章并发控制

Storage, File Organization and Indexing 11/30

File Organization(1) 数据库中的数据存储在操作系统管理的文件中 : 一个关系对应到一个文件 一个关系对应到多个文件 一个文件对应到多个关系 操作系统分配给数据库系统一个大的操作系统文件 所有关系都存储在这个文件中, 这个文件的管理由数据库系统进行

File Organization(2) 样例数据库 : Branch-schema = (branch-name,branch-city,assets) Customer-schema = (customer-name,customer-street, customer-city) Account-schema = ( account-number, branch-name, balance) Loan-schema = ( loan-number, branch-name,amount) Depositor-schema = (customer-name,account-number) Borrower-schema = (customer-name,loan-number)

File Organization(3) 记录的物理存储 文件中记录的组织

Record Storage(1) Fixed-Length Records( 定长记录文件 ) 文件中所有的记录都具有相同的长度, 从而一个块中所有的记录都是等长的 Variable-Length Records( 变长记录文件 ) 文件中的记录可以有不同的长度, 从而一个块中的各个记录可以具有不同的长度

Record Storage(2) 定长记录文件 顺序存储法 严重缺点 除非块的大小恰好是记录大小的倍数, 否则一些记录会跨过块的边界, 从而读写这样一条记录需要两次块访问 记录 0 Perryridge A-102 400 记录 1 Rouond Hill A-305 350 记录 2 Mianus A-215 700 记录 3 Downtown A-101 500 记录 4 Redwood A-222 700 记录 5 Perryridge A-201 900 记录 6 Brighton A-217 750 记录 7 Downtown A-110 600 记录 8 Perryridge A-218 700 account 关系

定长记录文件 顺序存储法 严重缺点 删除一条记录十分困难 待删除记录所占据的空间必须由文件中的其它记录来填充, 或者必须用一种方法标记被删除的记录使得它可以被忽略 Record Storage(2) 记录 0 Perryridge A-102 400 记录 1 Rouond Hill A-305 350 记录 2 Mianus A-215 700 记录 3 Downtown A-101 500 记录 4 Redwood A-222 700 记录 5 Perryridge A-201 900 记录 6 Brighton A-217 750 记录 7 Downtown A-110 600 记录 8 Perryridge A-218 700 account 关系

Record Storage(3) 定长记录文件 空闲指针链法 account 关系 文件头 记录 0 Perryridge A-102 400 记录 1 记录 2 Mianus A-215 700 记录 3 Downtown A-101 500 记录 4 记录 5 Perryridge A-201 900 记录 6 记录 7 Downtown A-110 600 记录 8 Perryridge A-218 700

Record Storage(4) 变长记录文件以下情况下需要变长记录文件 : 多种记录类型在一个文件中存储 Account-list ( branch-name, account-info: (account-number, balance)* ) 记录类型允许一个或多个字段是变长的 VARCHAR 记录类型允许可重复的字段

Record Storage(5) 变长记录文件 ( 定长表示法 ) 使用一个或多个定长记录来代表一个变长记录 预留空间 : 使用长度为最大记录长度的定长记录 对较短记录未使用的空间用特殊的空值或记录终结符号来填充 使用指针 : 变长记录用一系列通过指针链接起来的定长记录来表示

Record Storage(6) 预留空间的方法 记录 0 Perryridge A-102 400 A-201 900 A-218 700 记录 1 Round Hill A-305 350 记录 2 Mianus A-215 700 记录 3 Downtown A-101 500 A-110 600 记录 4 Redwood A-222 700 记录 5 Brighton A-217 750 account-info 关系

Record Storage(7) 使用指针的方法 记录 0 Perryridge A-102 400 记录 1 Rouond Hill A-305 350 记录 2 Mianus A-215 700 记录 3 Downtown A-101 500 记录 4 Redwood A-222 700 记录 5 A-201 900 记录 6 Brighton A-217 750 记录 7 A-110 600 记录 8 A-218 700 account-info 关系

Record Storage(8) 使用指针的方法 (anchor block and overflow block) anchor block Perryridge A-102 400 Round Hill A-305 350 Mianus A-215 700 Downtown A-101 500 Redwood A-222 700 Brighton A-217 750 overflow block A-210 900 A-218 700 A-110 600

Record Storage(9) 变长记录文件 ( 变长表示法 ) Byte-String( 字节流 ) 表示 : 把每个记录作为一个连续的字节流存储, 每个记录的末尾附加一个特殊的记录终止符号 0 Perryridge A-102 400 A-201 900 A-218 700 1 Round Hill A-305 350 2 Mianus A-215 700 3 Downtown A-101 500 A-110 600 4 Redwood A-222 700 5 Brighton A-217 750 不易重新使用被删除记录的空间 分配给记录的空间不能随记录的改变而改变

Record Storage(10) 变长记录文件 ( 变长表示法 ) slotted-page( 分槽的页 ) 结构 : 每个块的开始处 有一个块头 块头 记录条目个数 大小 空闲空间尾指针位置 空闲空间

Organization of Records in Files(1) How to organize the records in a file: Heap file( 堆文件组织 ): 记录没有顺序, 一条记录可以放在文件中的任何地方 通常一文件一关系. Sequential file( 顺序文件组织 ): 记录根据检索码的值顺序存储 Hashing file( 散列文件组织 ): 散列函数的计算结果确定记录应存储到文件的哪个块中 Clustering file( 聚集文件组织 ): 几个不同关系的记录可以存储在同一个文件中 不同关系中的相关记录存储在相同的块中

Organization of Records in Files(2) Many relational-database systems store each relation in a separate file to use OS file system However, many large-scale database systems do not relay directly on the underlying OS file. Instead, one large OS file is allocated to database system. All relations are stored in this one file. Database systems manage the file.

Organization of Records in Files(3) To see the advantage of method, consider the following SQL for bank database: Select account-number, customer-name, customer-street, customer-city from depositor, customer where depositor.customer-name = customer.customer-name

Organization of Records in Files(4) 聚集文件组织 Customer-name Account-number Hayes A-102 Hayes A-220 Hayes A-503 Turner A-305 depositor 关系 聚集文件 Hayes Main Brooklyn Hayes A-102 Hayes A-220 Hayes A-503 Turner Putnam Stanford Turner A-305 Customer-name Customer-street Customer-city Hayes Main Brooklyn Turner Putnam Stanford Customer 关系

Organization of Records in Files(5) Clustering enhances the processing of join (depositor, customer), but slowing processing of other types of query. For example: Select * from customer; The query requires more block accesses than it did in one-relation-in-one-file It can be improved by chaining together all the records of the relation using pointers

Organization of Records in Files(6) 聚集文件组织 ( 带指针链 ) Hayes Main Brooklyn Hayes A-102 Hayes A-220 Hayes A-503 Turner Putnam Stanford Turner A-305

Organization of Records in Files(7) The determination of when clustering is to be used depends on the types of query that database designer believes to be most frequent Careful use of clustering can produce significant performance gains in query processing

Large Objects Storage(1) Large objects are called binary large objects (blobs) In relational systems, use long field to store large objects Most relational databases restrict the size of a record to be no larger than the size of a page, to simplify the buffer manager and free-space manager Large objects or long fields are often stored in a special file (or collection of files)

Large Objects Storage(2) Allocation of buffer pages for large objects is a big problem We often modify large objects by updating part of the object. For practical reasons, large objects are manipulated in application programs, rather than in database: Text data Graphical data Audio/video data

Access Methods: Sequential Access/Scan Consider query: select * from Employees where name = Bill Sequential access/scan: (tuples are scanned one at a time) to evaluate the condition on name. This is reasonable if Relation Employees is small, or High percentage of the tuples in the relation satisfy the condition. Too slow for most other cases.

Access Methods: Hash Index (1) Using a hash table: Create a hash table based on the values of A in advance. Suppose h( ) is the hash function used. To process a query with condition A = v, first use h(v) to find the bucket in the hash table and then compare v with the A-values in the bucket to identify qualified tuples.

Access Methods: Hash Index (2) Example: Suppose Employees has 9 tuples. Build a hash table based on Name. Suppose we have 3 buckets and each can hold 3 entries and an overflow pointer. Hash function: h(x) = seq(x) mod 3 seq(x) returns the serial number of the first letter in x in the English alphabet.

Access Methods: Hash Index (3) Hash Table Stored Employees Table 0 1 2 Liz Roy Amy Mike Ann Bill Bob Bill Beth SSN Name Salary 123456789 Bill 35k 234567891 Amy 28k 345678912 Bob 30k 456789123 Mike 27k 567891234 Liz 43k 678912345 Bill 30k 789123456 Ann 31k 891234567 Beth 25k 912345678 Roy 36k

Access Methods: Hash Index (4) Evaluate query using the hash table: select * from Employees where Name = Bill 1. Apply hash function to obtain a bucket#: hash( Bill ) = seq( Bill ) mod 3 = 2 mod 3 = 2 2. Search the found bucket (and related overflow bucket(s)) for Bill and follow proper pointer(s) to obtain the tuple(s).

Access Methods: Hash Index (5) Using hash table can reduce search time substantially. Example: Suppose Employees table has 1000 pages. The hash table on SSN has 50 pages and can be kept in memory. Finding a tuple using sequential scan needs about 500 page I/Os. Finding a tuple using the hash table needs 1 or 2 page I/Os.

Access Methods: Hash Index (6) Hash tables are effective only when op is = but not effective for other comparators. Best for primary keys where order is meaningless (e.g., SSN). Complications of implementing hash tables: hash function selection, collision resolution, bucket overflow, hash table overflow.

Access Methods: B + Tree (1) A B + tree is a balanced search tree. Each node (internal or external) is stored as a page. Format of an internal node: p1 a1 ai-1 pi ai ak-1 pk ak a <= a1... ai-1 < a <= ai... ak-1 < a < ak

Access Methods: B + Tree (2) Format of a leaf node: a1 p1 ai pi ak pk p All leaf nodes form a linked list in nondecreasing A-values. Every non-root node must be at least half full.

Access Methods: B + Tree (3) 25 37 11 14 25 34 37 8 11 12 14 22 25 27 34 36 37

Access Methods: B + Tree (4) Indexes in databases are usually implemented as a B + tree. Definition: If the tuples of a relation R are stored in ascending (or descending) A- values, then the B + tree index on A is a primary index (clustered index). Primary index is not directly related with the primary key.

Access Methods: B + Tree (5) Definition: A secondary index (nonclustered index) is an index on an attribute whose values are not sorted.

Clustered / Non clustered index Clustered index (primary index) A clustered index on attribute X co-locates records whose X values are near to one another. Non-clustered index (secondary index) A non clustered index does not constrain table organization. There might be several non-clustered indexes per table. Records Records

Access Methods: B + Tree (5) Several pointer implementations possible: 1. Each p i in the leaf node points to a tuple. This is suitable for attributes that are candidate keys or have low-degree of repeating values.

Access Methods: B + Tree (6) 26 37 11 14 26 34 37 8 11 12 14 25 26 27 34 36 37

Access Methods: B + Tree (7) Build an index based on R.A: (1) Sort R based on A and store the sorted R. (2) Create a pair (a, p) for each tuple t, where a is t[a] and p is the address of t. Sort all pairs based on a.

Access Methods: B + Tree (8) (3) Build the leaf nodes, starting with pairs with the smallest a s. Each leaf node has a pointer to the next leaf node. (4) Build the upper level nodes, one level at a time. (5) If insertion and update are expected, leave some free space in each page.

Access Methods: B + Tree (9) Stored Employees Table SSN Name Salary e1: 234567891 Amy 28k e2: 789123456 Ann 31k e3: 891234567 Beth 25k e4: 123456789 Bill 35k e5: 678912345 Bob 30k e6: 345678912 Fay 30k e7: 567891234 Liz 43k e8: 456789123 Mike 27k e9: 912345678 Roy 36k Page k Page k+1 Page k+2 Page k+3 Page k+4

Access Methods: B + Tree (10) Fay Roy Ann Bill Fay Mike Roy Amy Ann Beth Bill Bob Fay Liz Mike Roy e1 e2 e3 e4 e5 e6 e7 e8 e9

Access Methods: B + Tree (11) Another pointer implementation: 2. Each p i in the leaf node points to a page that contains pointers to tuple(s) whose index value is a i. This is suitable for attributes that have high-degree of repeating values. For each distinct index value, exactly one pointer exists in the leaf node. An extra indirection is needed to access a tuple.

Access Methods: B + Tree (12) 11 25 36 8 11 14 25 34 36

Dense / Sparse Index Sparse index Pointers are associated to pages Dense index Pointers are associated to records Non clustered indexes are dense P1 P2 Pi record record record

Access Methods: B + Tree (13) Normal height of a B + tree is only 2 or 3. Example: Suppose Employees has 100,000 tuples of length 100 bytes. Each key value is 10 bytes and each pointer is 4 bytes. Each page is 2 KB. Each cell (a, p) has 14 bytes. Each page (2048 bytes) can contain 146 cells. # of cells (a, p): 100000, needs 685 pages at the bottom level. # of cells at next level: 685, needs 5 pages. # of cells at next level: 5, needs 1 page.

Access Methods: B + Tree (14) Some questions: Can we create multiple primary indexes for the same relation? Can we create multiple secondary indexes for the same relation? Which is faster: primary index or secondary index?

Guidelines on Index Creation (1) Information that guides the creation of indexes is the transactional analysis of the requirements document. The following are the main factors on whether or not to create an index on an attribute A: Search frequency: How frequently A is searched, including for evaluating conditions and joins? Update frequency: How frequently A is updated, including insertion and deletion? Candidate key: Must the values of A be unique? Size: How large is the relation containing A?

Guidelines on Index Creation (2) No index on attribute A is needed if: There is low search frequency on A. The size of the relation containing A is very small. An index on attribute A should be created if: The search frequency on A is high. The values of A must be kept unique and there are many insertions and updates involving A. Maintenance cost: Maintaining an index can be costly if frequent changes are needed. Be cautious if A is not a candidate key and has high update frequency.

Guidelines on Index Creation (3) Among all the attributes of a relation that should have an index, only one attribute can have a primary index, and all others will have secondary indexes. Attribute A should be favored to have a primary index if: Searches on A are more likely to returned multiple tuples (due to repeating values and range queries). A is the primary key of the relation (more joins are likely involve primary key).

Dense / Sparse Index Sparse index Pointers are associated to pages Dense index Pointers are associated to records Non clustered indexes are dense P1 P2 Pi record record record

An example (1) Consider table R(A, B, C, D) with the following information: A is the primary key and no other candidate key exists. Approximately, the number of distinct values under B is two times of that under C and this ratio does not change. Out of every 100 operations against R, 10 will be insertions; 10 will be deletions with 5 based on conditions on A and 5 based on conditions on B; 70 will be selection queries with 30 based on conditions on A, and 20 based on conditions on B and 20 based on conditions on C; and 10 will be updates with 5 based on conditions on B and 5 based on conditions on C. All conditions are equality conditions. Discuss which attribute should be used to build the primary index and which attribute(s) should be used to build secondary index(es).

An example (2) From the given information, the following can be derived: Out of every 100 operations: 45 need to search based on A (10 insertions, 5 deletions, 30 selections) 30 need to search based on B (5 deletions, 20 selections, 5 updates) 25 need to search based on C (20 selections, 5 updates) 0 need to search based on D

Types of Queries 1. Point Query SELECT balance FROM accounts WHERE number = 1023; 2. Multipoint Query SELECT balance FROM accounts WHERE branchnum = 100; 3. Range Query SELECT number FROM accounts WHERE balance > 10000; 4. Prefix Match Query SELECT * FROM employees WHERE name = Jensen and firstname = Carl and age < 30;

Types of Queries 5. Extremal Query SELECT * FROM accounts WHERE balance = max(select balance from accounts) 6. Ordering Query 7. Grouping Query SELECT branchnum, avg(balance) FROM accounts GROUP BY branchnum; 8. Join Query SELECT * FROM accounts ORDER BY balance; SELECT distinct branch.adresse FROM accounts, branch WHERE accounts.branchnum = branch.number and accounts.balance > 10000;

Index and Query 1. Use a hash index for point queries only. Use a B-tree if multipoint queries or range queries are used 2. Use clustering if your queries need all or most of the fields of each records returned if multipoint or range queries are asked 3. Use a dense index to cover critical queries 4. Don t use an index if the time lost when inserting and updating overwhelms the time saved when querying

Covering Index Select name from employee where department = marketing Good covering index would be on (department, name) Index on (name, department) less useful. Index on department alone moderately useful. 3 - Index Tuning 64

Constraints and Indexes Primary Key, Unique A non-clustered index is constructed on the attribute(s) that compose the primary key with the constraint that values are unique. Foreign Key By default, no index is created to enforce a foreign key constraint.

Index Implementations in some major DBMS SQL Server B+-Tree data structure Clustered indexes are sparse Indexes maintained as updates/insertions/deletes are performed DB2 B+-Tree data structure, spatial extender for R-tree Clustered indexes are dense Explicit command for index reorganization Oracle B+-tree, hash, bitmap, spatial extender for R-Tree No clustered index Index organized table (unique/clustered) Clusters used when creating tables. MySQL B+-Tree, R-Tree (geometry and pairs of integers) Indexes maintained as updates/insertions/deletes are performed

Multiple-key Access(1) 前面讨论的基本索引结构都是一维的, 即, 它们使用单个的检索码 当然检索码可以是单个属性, 也可以是属性组 检索码由多个属性构成时, 我们把检索码的值取成各属性值的拼接 建立多属性索引时, 构成检索码的多个属性不拼接成一个值处理, 而是看成各自独立的值

Multiple-key Access(2) 对多维索引的需要 : 例关系 account( account-number, branch-name, balance) 在 branch-name 和 balance 属性上分别建有一维索引 select account-number from account where branch-name = "Perryridge and balance = 1000

Multiple-key Access(3) 三种处理策略如果 : 1. 利用满足 branch-name branch-name= Perryridge 上的索引, 找出的记录很多 Perryridge 的分支机构的所有记录 检查每条记录是否满足满足 balance = 1000 的记录很多而满足 balance=1000 branch-name= Perryridge and balance = 1000 的记录很少 2. 利用 balance 上的索引, 找出所有余额等于 1000 美元的则为了得到一个很小的结果集需要扫描大量指针记录 检查每条记录是否满足 branch-name = "Perryridge" 3. 利用 branch-name 上的索引找出 Perryridge 的分支机构的记录的所有指针 同样, 利用 balance 上的索引找出指向余额等于 1000 美元的记录的所有指针 计算这两个指针集合的交 交集中的所有指针指向满足条件的记录

Multiple-key Access(4) 解决的办法 : 在 (branch-name, balance) 建立组合索引 组合索引序 : (a1,a2) < (b1,b2) if a1<b1, or a1=b1 and a2<b2 组合索引是一种多维索引 组合索引的建立可以采用任何建立单属性索引的方法

Multiple-key Access(5) 基于顺序索引的组合索引会有一些问题例如 : select account-number from account where branch-name < "Perryridge and balance = 1000 在检索码值小于等于 Perryridge,1000 的记录中, 找出余额等于 1000 美元的那些记录 然而, 检索码值小于等于 Perryridge,1000 的记录可能很多, 而其中满足 balance=1000 的记录可能很少, 因此大量的磁盘访问实际上都是不必要的 可考虑用 grid file,partitional hashing,r-tree 结构解决这样的问题

多维索引 (1) 多维索引结构 类似于散列表的结构 基于树形的结构 == 网格文件分段散列 == k-d 树四分树 R 树

网格文件 通过对每个维的值进行排序, 将检索码值 散列 到 桶中 结构 : 一个网格数组, 其每个单元包含一个指向桶的指针, 可以有多个单元指向同一个桶 每个维一个线性标量, 对该维的值进行划分 由线性标量确定一个多维检索码值应该落到网格数组的哪一个单元中

5.4.2.1 网格文件 (2) 4 Townsend 3 Perryridge 2 Mianus 1Central 4 3 2 1 Bi Bj Branch-name 的线性标量 balance 的线性标 0 0 1 2 3 4 5 6 1k 2k 5k 10k 50k 100k 桶 网格数组

网格文件 (3) 说明 适合于多码查询, 也可以回答包含一个检索码的查询 支持范围查询 线性标量的选择必须使记录在单元格中的分布是均匀的 如果桶满了还要插入新的记录, 则要分配一个新桶, 并进行指针的调整和记录的重新分布, 或把新桶作为溢出桶 优点 : 大大减少了多码查询的处理时间 缺点 : 增加了空间开销, 和记录插入和删除的开销 此外, 很难选择适当的 保证记录均匀分布的分段范围

Buffer Manager(1) 缓冲区 : 主存储器中用于存储磁盘块的拷贝的部分, 由固定数目的缓冲块构成 目的 : 减少磁盘和主存储器之间传输的块的数目 缓冲区管理器 : 负责缓冲区空间分配的子系统

Buffer Manager(2) Read/Write Requests Buffer Buffer Manager

DBMS 的系统成分和工作流程

Buffer-Replacement Policies(1) 最近最少使用 (LRU): 替换出最长时间没有读或写过的块 先进先出 (FIFO): 替换出被同一个块占用时间最长的缓冲块 时钟 算法 :LRU 的一个常见的 有效的近似 系统控制 : 查询优化器或者其它的 DBMS 部件可以给缓冲区管理器提供建议来避免象 LRU,FIFO, 或者时钟这样的严格的策略可能引起的问题

Buffer-Replacement Policies(2) 系统控制 查询优化器或者其它的 DBMS 部件可以给缓冲区管理器提供建议 例如, 将某些块定义为 固定的 来强迫它们保持在内存中, 如 B- 树的根 数据字典中的块 又如, 对于象一遍散列连接那样的算法, 查询处理器可以 固定 较小的关系的块, 使得确保在全部时间内它都将留在内存中

分段散列 对散列的扩充, 以对多个属性进行散列 散列函数产生 k 个二进制位, 这 k 位在 n 个属性中进行划分, 设为第 i 个属性产生 k i 位散列值, 于是 k 1 + k 2 + + k n =k 更精确地说, 散列函数 h 实际上是一组散列函数 (h 1,h 2,,h n ), 其中每个 h i 运用到第 i 个属性上且产生 k i 位二进制位序列 进行散列时, 在这 n 个属性上值为 (v 1,v 2,,v n ) 的元组所属的桶通过拼接二进制序列 h 1 (v 1 )h 2 (v 2 ) h n (v n ) 计算得到

分段散列 (2) 例检索码 (customer-street,customer-city) 的分段散列函数 search-key value hash value (Main,Harison) 101 111 (North,Rye) 110 101 (Main,Brooklyn) 101 001 (North,Princeton) 110 000 (Park,Palo Alto) 010 010 (Putnam,Stamford) 011 001 (Nassau,Princeton) 011 000 (Spring,Brooklyn) 001 001 (Alma,Palo Alto) 110 010

分段散列 (3) 说明 : 适合于多码查询 也可以方便地回答包含一个检索码的查询 不支持范围查询

k-d 树 k-d 树是二叉检索树的扩展,k-d 树的每一层将空间分成两个 树的顶层结点按一维进行划分, 下一层结点按另一维进行划分, 以此类推, 各个维循环往复 划分要使得在每个结点, 大约一半存储在子树中的点落入一侧, 而另一半落入另一侧 当一个结点中的点数少于给定的最大点数时, 划分结束

k-d 树 (2) 例顾客数据库, 假定相关的属性只有顾客的年龄和工资 示例数据库中有 12 个顾客 (25,60) (45,60) (50,75) (50,100) (50,120) (70,110) (85,140) (30,260) (25,400) (45,350) (50,275) (60,260)

k-d 树 (3) k-d 树工资 150 年龄 60 年龄 47 工资 80 70,110 工资 300 85,140 50,275 60,260 年龄 38 50,100 50,120 30,200 25,400 45,330 25,60 45,60 50,75

k-d 树 (4) k-d 树隐含的划分 500k * * * * * 0 * * * * 0 100 * *

k-d 树 (5) 讨论 : 适合于多码查询, 也可以回答包含一部分检索码的查询 所有维都给定值的情况 : 类似于二叉树查找 给定某些属性值的情况 : 当处于属性值给定的层的结点时, 沿着一个子树走 ; 当处于属性值未给定的层的结点时, 必须考察它的两个子树 可以支持范围查询 : 给定的范围可能引向结点的一棵子树, 但如果给定的范围跨越了结点的划分值, 则必须考察它的两个子树 若允许每个内部节点有多个子节点, 则扩展为 k-d-b 树, 更适合于辅助存储器

四分树 四分树的每个内部节点对应于二维空间中的一个正方形区域, 或是 K 维空间的 K 维立方体, 外部结点对应于存放空间中点的块 以二维的情形为例, 如果一个正方形中的点数不比一个块中能存放的数多, 那么我们就把这个正方形看作树的叶结点, 该结点就表示成存放它的点的块 ; 如果矩形中还有太多的点以至于一个块存放不下, 那么我们把这个正方形看作内部结点, 它的子结点对应于它的四个象限

四分树 (2) 例 400k * * * * * 0 * * * 0 100 * *

四分树 (3) 四分树 50,200 25,60 45,60 50,275 75,100 25,300 60,260 50,75 50,100 85,140 50,120 70,110 30,260 25,400 45,350

R 树 R- 树是平衡树的结构, 非常象 B- 树 它对矩形和其它多边形的索引很有用处 R- 树的结点中不是放取值范围, 而是每个树结点对应一个矩形边界框 多边形只存在叶结点上 叶结点的边界框是包含叶结点中所有对象的最小矩形 类似地内部结点的边界框是包含其子结点的边界框的最小矩形

R 树 (2) 例 0 A C B 1 G 3 0 H 1 2 3 D E F I 2 A B C D E F G H I

B + Tree: Common Operations (1) Let a, b search key values T the pointer to the root of the tree search(a, T): find all tuples whose A- values are a. search(a, b, T): find all tuples whose A- values are between a and b. insert((a,p), T): insert cell (a, p). Delete(a, T): delete all cells whose A- values are a.

B + Tree: Common Operations (2) Search(a, T) /* with index on primary key */ case 1: T is an internal node. Compare a with the A-values in T. (a) If a a 1, call search(a, p 1 ). (b) If a i-1 < a a i, call search(a, p i ). (c) If a > a k-1, call search(a, p k ).

B + Tree: Common Operations (3) case 2: T is a leaf node. Compare a with the A-values in T. (a) If no A-value in T equals a, report not found. (b) If a i in T equals a, follow pointer p i to fetch the tuple.