國立中山大學學位論文典藏.PDF

國立中山大學資訊工程學系碩士論文
Department of Computer Science and Engineering
National Sun Yat-sen University
Master Thesis

一個於資料串流移動視窗中以權重排序晶格來探勘最大權重頻繁樣式的方法
Weight-Order-Based Lattice Algorithm for Mining Maximal Weighted Frequent Patterns over a Data Stream Sliding Window

研究生 (Student): 邱宗哲 (Tzung-Je Chiou)
指導教授 (Advisor): 張玉盈 博士 (Dr. Ye-In Chang)

中華民國 104 年 6 月
June 2015

ACKNOWLEDGMENTS (謝辭)

"Life is a long road. Rough or smooth, we have to walk it ourselves; no one can walk it for us. Fortunately, a group of family and friends has accompanied me through every stage. When the traffic jammed, I had ample time to enjoy the scenery along the way; when the road was hard to travel, I had the chance to stop and think; and the map in my hand kept me from taking detours and losing my direction." (excerpted from the encouraging words of the counseling center)

First, I would like to thank my advisor, Dr. Ye-In Chang. Thanks to her careful guidance over these two years, I was able to grow both in my studies and in dealing with people, and to finish my degree on schedule. Next, I thank the members of my oral examination committee, Prof. 陳健輝, Prof. 李建億, and Prof. 郭大維, who took time out of their busy schedules to review this thesis and offered incisive opinions and suggestions that made it more complete. "Few things in the world are more powerful than positive encouragement: a smile, an optimistic and hopeful word, a 'you can do it' when things become difficult." (DeVos) I thank my labmates 凱甯 and 心怡, who accompanied me for two years of constant mutual encouragement and growth, and the senior and junior members of the laboratory, 陳子翔, 李迦恩, 蔡孟璇, 琪智, 修培, and 嘉慧, for their encouragement and help over the past year. Studying and growing on a campus beside the mountains and the sea, and in such a warm laboratory, has been the happiest part of my two years as a master's student. Finally, I thank my family and friends for their hard work and care, which allowed me to complete my studies without worry. As I complete this thesis, I dedicate the honor to everyone who has helped me. Thank you all.

Academic Year: 104
Semester: 2
University: National Sun Yat-sen University
Department: Department of Computer Science and Engineering
Thesis Title (Chinese): 一個於資料串流移動視窗中以權重排序晶格來探勘最大權重頻繁樣式的方法
Thesis Title (English): Weight-Order-Based Lattice Algorithm for Mining Maximal Weighted Frequent Patterns over a Data Stream Sliding Window
Degree: Master
Language: English
Student ID: 2348
Open Access: No
Pages: 63
Student: Chiou, Tzung-Je (邱宗哲)
Advisor: Dr. Ye-In Chang (張玉盈)
Keywords (Chinese): 資料串流, 項目集, 移動式視窗模型, 晶格, 最大權重頻繁項目集
Keywords (English): Data Stream, Itemset, Sliding Window Model, Lattice, Weighted Maximal Frequent Itemset

摘要 (Abstract in Chinese, translated)

In real life, weighted frequent pattern mining over data streams is a very important field, for example, in supermarkets. Among these problems, mining weighted maximal frequent patterns is also an important issue. A weighted maximal frequent pattern is a pattern that is not a subset of any other weighted frequent pattern and whose weighted support is greater than or equal to the threshold. However, many previous methods cannot be applied to weighted frequent pattern mining, because most of them adopt the anti-monotone property; with weights, even if a subset X of a pattern Y is not a weighted frequent pattern, Y may still be a weighted frequent pattern. In addition, because data streams are fast, continuous, unbounded, and real-time, we can scan the database only once, so the previous algorithms for mining traditional databases are not applicable to data streams. Furthermore, many applications focus on the recent part of a data stream, and the sliding window model is exactly the model that processes the most recent data. To find weighted maximal frequent patterns in a sliding window, Ryu et al. proposed the WMFP-SW algorithm, which uses an FP-tree to mine weighted maximal frequent patterns and uses the maximal weight to reduce the candidate patterns. However, it spends too much time when the window slides, because whenever new transactions arrive, the WMFP-SW algorithm must rebuild the entire FP-tree. Moreover, the WMFP-SW algorithm may have missing cases. To solve these problems, in this thesis we propose the Weighted-Order-Based Lattice algorithm based on the sliding window model. We use a lattice to store the transaction data; the lattice structure stores the relationships between parent and child nodes, and in each lattice node we store the itemset and its count. When a new transaction arrives, we classify it into one of five set relations: (1) equivalent, (2) subset, (3) intersection, (4) empty set, and (5) superset. Based on these five relations, inserting or updating data becomes more efficient. In addition, we use a global maximal weight pruning strategy and a local maximal weight pruning strategy to avoid unnecessary computation. From the performance study, we show that on both real data and synthetic data, our Weighted-Order-Based Lattice algorithm is more efficient than the WMFP-SW algorithm.

(關鍵詞: 資料串流, 項目集, 移動式視窗模型, 晶格, 最大權重頻繁項目集)

ABSTRACT

Weighted frequent pattern mining in data streams is an important field for the real world, such as the supermarket. Moreover, mining the weighted maximal frequent patterns is also an important issue. A weighted maximal frequent pattern is a pattern which is not a subset of any other weighted frequent pattern and whose weighted support is larger than the threshold. However, many previous Apriori-like algorithms cannot be used in weighted frequent pattern mining. The reason is that even though a subset X of a pattern Y is not a weighted frequent pattern, the pattern Y may be a weighted frequent pattern. Besides, because data streams are continuous, high-speed, unbounded, and real-time, we can scan the data streams only once. Therefore, the previous algorithms for the traditional databases are not suitable for data streams. Furthermore, many applications are interested in the recent data streams, and the sliding window is the model which deals with the most recent data streams. In order to mine weighted maximal frequent patterns based on the sliding window model, Ryu et al. propose the WMFP-SW algorithm. The WMFP-SW algorithm uses the FP-tree to mine the weighted maximal frequent patterns. It also uses the maximal weight to prune the patterns. But it takes a long time in mining the weighted maximal frequent patterns, because when a new transaction comes, the WMFP-SW algorithm always has to reconstruct the FP-tree. Moreover, the WMFP-SW algorithm may have a missing case. To solve those problems, in this thesis, we propose the Weighted-Order-Based Lattice algorithm based on the sliding window model. We use the lattice structure to store the information of the transactions. The structure of the lattice stores the relationship between the child node and the father node. In each node, we record the itemset and the count. When a new transaction comes, we consider five relations: (1) equivalent, (2) subset, (3) intersection, (4) empty set, and (5) superset. With those five relations, we can add the new transactions and update the supports efficiently. Moreover, we use the global maximal weight pruning strategy and the local maximal weight pruning strategy to avoid generating invalid candidate patterns. From the performance study on both real data and synthetic data, we show that the Weighted-Order-Based Lattice algorithm provides better performance than the WMFP-SW algorithm in both cases.

(Key Words: Data Stream, Itemset, Sliding Window Model, Lattice, Weighted Maximal Frequent Itemset)

WEIGHT-ORDER-BASED LATTICE ALGORITHM FOR MINING MAXIMAL WEIGHTED FREQUENT PATTERNS OVER A DATA STREAM SLIDING WINDOW

A Thesis
Submitted to the Faculty of
National Sun Yat-sen University
by
Tzung-Je Chiou
In Partial Fulfillment of the Requirements
for the Degree of
Master of Science
June 2015

TABLE OF CONTENTS

1  Introduction
   1.1  Data Stream Mining
        1.1.1  Window Models in Data Streams
   1.2  Mining Weighted Maximal Frequent Itemsets
   1.3  Applications
   1.4  Related Work
   1.5  Motivation
   1.6  Organization of the Thesis
2  Survey of Algorithms for Mining Weighted Frequent Patterns
   2.1  The MWFIM Algorithm
   2.2  The WMFP-SW Algorithm
        2.2.1  Data Structures for WMFP-SW
        2.2.2  Pruning Patterns by MaxW
        2.2.3  Pruning Patterns in Single-Path
   2.3  The MWS Algorithm
   2.4  The Set-Checking Algorithm
        2.4.1  Data Structure
        2.4.2  The Set-Checking Algorithm
3  Weight-Order-Based Lattice Algorithm
   3.1  Data Structure
   3.2  Our Proposed Algorithm
        3.2.1  Data Initialization
        3.2.2  The Pruning Strategy
        3.2.3  Data Deletion
        3.2.4  Data Insertion
        3.2.5  Comparison
4  Performance
   4.1  The Performance Model
   4.2  Experiment Results
        4.2.1  Real Data
        4.2.2  Simulation Data
5  Conclusion
   5.1  Summary
   5.2  Future Work
BIBLIOGRAPHY

LIST OF FIGURES

1.1   Landmark window model
1.2   Tilted-time window model
1.3   Sliding-window model
1.4   Weight of items in Transaction Database TDB1
1.5   Transaction Database TDB1
1.6   The relationship of the subset X and superset Y in the Apriori algorithm
1.7   The relationship of the subset X and superset Y in weighted frequent pattern mining
1.8   An example EX1: (a) frequent patterns; (b) weighted frequent patterns
1.9   WMFP-SW-tree [4]
1.10  WOB-Lattice tree
2.1   An example of a prefix tree containing items according to the descending order of weights
2.2   An example of a vertical bitmap representation
2.3   An example of Transaction Database TDB2 in the sliding-window model [4]
2.4   WMFP-SW-tree for sliding window w1: (a) before the reconstructing operation; (b) after the reconstructing operation [4]
2.5   WMFP-SW-tree for sliding window w2: (a) Deletion: TID 1 (B,C,E) and TID 2 (D); (b) Insertion: TID 5 (D,F) and TID 6 (D,F,G); (c) after the reconstructing operation again [4]
2.6   Weight of the items in Transaction Database TDB2 [4]
2.7   WMFP-array: (a) the global WMFP-SW-array; (b) G's conditional WMFP-SW-array [4]
2.8   Single-path of the E prefix tree: (a) the path of E; (b) every pattern in the E prefix tree [2]
2.9   Case 1: Equivalent [9]
2.10  Case 2: Subset [9]
2.11  Case 3: Superset [9]
2.12  Case 4-1: The intersected node is not in the lattice [9]
2.13  Case 4-2: The intersected node is in the lattice [9]
2.14  Case 5: Empty set [9]
3.1   Transaction Database TD without sorting
3.2   Sorting the items by weight order: (a) before sorting; (b) after sorting
3.3   The lattice of Window W1
3.4   The flowchart of the algorithm
3.5   The Bitmap of TID:2 in Transaction Database TD
3.6   Procedure GMAXW pruning
3.7   Global MAX weight (GMAX-W) pruning strategy
3.8   Local MAX weight (LMAX-W) pruning strategy
3.9   Procedure LMAXW pruning
3.10  Procedure CheckWMFP
3.11  An example of the WMFP-table
3.12  Procedure Delete lattice
3.13  The deletion: (a) WMFP-table; (b) lattice
3.14  Procedure Decrease Max
3.15  Data insertion: (a) WMFP-table; (b) lattice
3.16  Procedure Insert T
3.17  Procedure Increase Max
3.18  Procedure Delete WMFP
3.19  Function Find T
3.20  Procedure CheckCase
3.21  The table of Window 2
3.22  Checking the WMFP-table
3.23  An example of Transaction Database TDB2 in the Sliding-Window model
3.24  WMFP-SW-tree for Window W1: (a) before the reconstructing operation; (b) after the reconstructing operation
3.25  WMFP-SW-tree for Window w2: (a) Deletion: TID 1 (B,C,E) and TID 2 (D); (b) Insertion: TID 5 (D,F) and TID 6 (D,F,G); (c) after the reconstructing operation again
3.26  Lattice from TID 1 to TID 4
3.27  Lattice from TID 3 to TID 4
3.28  Lattice from TID 3 to TID 6
3.29  Conditional tree of pattern {D}
3.30  Conditional tree of pattern {AD}
3.31  Conditional tree of pattern {CD}
4.1   A comparison of the processing time of the Mushroom dataset under the change of the threshold
4.2   A comparison of the processing time of the T5MTN2Dk dataset under the change of the threshold
4.3   A comparison of the processing time of the T(5-8)MTN2Dk dataset under the average size of the transactions
4.4   A comparison of the processing time of the TMT2N(2-5)Dk dataset under the change of the number of items

LIST OF TABLES

3.1  Variables
3.2  An example of the data stream mapped to the bit-pattern
4.1  Parameters used in the experiments
4.2  Details of the real datasets used in the experiments
4.3  Details of the simulation datasets used in the experiments
4.4  A comparison of the processing time for the Retail dataset under the change of the threshold
4.5  The processing time of the Mushroom dataset under the change of the threshold
4.6  The processing time of the T5MTN2Dk dataset under the change of the threshold
4.7  The processing time of the T(5-8)MTN2Dk dataset under the average size of the transactions
4.8  The processing time of the TMT2N(2-5)Dk dataset under the change of the number of items

CHAPTER I
Introduction

Data mining is the process of finding hidden and useful knowledge from large databases. Moreover, data mining has been used in many areas, including geographic [5], network [], and traffic data [6]. Recently, finding frequent patterns has become one of the most popular research topics in data mining. The most important process in frequent pattern mining is mining association rules []. By using association rules, the computer can discover new patterns from datasets. However, items have different importance in the real world. Therefore, we have to consider the importance and the count of the items at the same time. For this reason, the algorithms for mining association rules cannot be followed in weighted environments, and the Apriori algorithm [2] also cannot be used in weighted frequent pattern mining. On the other hand, frequent pattern mining has two variations, maximal frequent pattern mining [6] and closed frequent pattern mining [7, ]. Moreover, weighted frequent pattern mining has two variations, weighted maximal frequent pattern mining [8, 9] and weighted closed frequent pattern mining [2]. Besides, the concept of a data stream, which may have infinite transactions, is more appropriate than a data set for many recent applications, for example, sensor network data analysis and web click streams. By nature, a stored data set is appropriate when significant portions of the data are queried again and again, and updates are small or relatively infrequent. In contrast, a data stream is appropriate when the data is changing constantly, and it is either unnecessary or impractical to operate on large portions of the data many times.

1.1 Data Stream Mining

Data streams are continuous, unbounded, and real-time, usually coming at high speed and having a data distribution that is often time-sensitive. Hence, such data is also called streaming data. Besides, the volume of a data stream is large and usually open-ended [3, 8, 9,, 4, 5, 6, 9, 2].

Mining data streams has many new challenges. First, it is unrealistic to keep the entire stream in main memory, or even in a secondary storage area, since a data stream comes continuously and the amount of data is unbounded. Second, traditional algorithms that mine stored datasets with multiple scans are infeasible, since streaming data is passed only once. Third, mining streams requires fast and real-time processing in order to keep up with the high data arrival rate, and mining results are expected to be available within a short response time. In addition, the combinatorial explosion of itemsets exacerbates mining frequent itemsets over streams in terms of both memory consumption and processing efficiency [3, 8, 9,, 4, 5, 6, 9, 2].

1.1.1 Window Models in Data Streams

In the research on mining frequent patterns from streaming data, the models of data streams can be divided into three types: the Landmark window, the Tilted-time window, and the Sliding window [4, 5].

1. In the Landmark window model, as shown in Figure 1.1, the time range of mining frequent patterns starts from the beginning time to the present.

2. In the Tilted-time window model, as shown in Figure 1.2, the time range of mining frequent patterns still starts from the beginning time to the present. But the Tilted-time window does not treat all the data equally. The time period is divided into several time slots, where the recent time slots are kept at a finer granularity and the older time slots at a coarser granularity.

3. In the Sliding window model, as shown in Figure 1.3, the knowledge discovery is performed over a fixed number of recently generated data elements, which are the target of data mining.

[Figure 1.1 Landmark window model]
[Figure 1.2 Tilted-time window model]
[Figure 1.3 Sliding-window model]
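Of the three models, the sliding window is the one adopted throughout this thesis. A minimal sketch (the helper name `slide` is ours, not from the thesis) of keeping only the most recent transactions in a fixed-size window:

```python
from collections import deque

def slide(window, new_tx, size):
    """Maintain a fixed-size sliding window of transactions.

    When a new transaction arrives and the window is full,
    the oldest transaction is evicted before the new one enters.
    """
    if len(window) == size:
        window.popleft()       # delete the oldest transaction
    window.append(new_tx)      # insert the newest transaction
    return window

# Window size 4, fed with the six transactions of the running
# example TDB2 used later in Chapter 2.
w = deque()
for tid, tx in enumerate([{'B', 'C', 'E'}, {'D'}, {'A', 'B'},
                          {'A', 'F'}, {'D', 'F'}, {'D', 'F', 'G'}],
                         start=1):
    slide(w, (tid, tx), 4)

print([tid for tid, _ in w])  # -> [3, 4, 5, 6]
```

Note that the thesis's windows advance by batches of transactions (W1 = TID 1-4, W2 = TID 3-6); the per-transaction version above reaches the same final window after TID 6.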

1.2 Mining Weighted Maximal Frequent Itemsets

Weighted maximal frequent pattern mining is one of the most important research topics in data mining. In traditional frequent pattern mining, the support of a frequent itemset XS = {X1, X2, ..., Xn}, denoted by sup(XS), always satisfies sup(XS) ≥ minsup × |TDB|, where minsup is a user-specified minimum support threshold in the range of [0, 1] and |TDB| is the size of database TDB. If XS is a maximal frequent pattern, there is no frequent pattern which is a superset of XS. But in weighted frequent pattern mining, the minimum support minsup may not be in the range of [0, 1]. Here, we give a weighted frequent pattern PS = {P1, P2, ..., Pn}, where Pi is the i-th item in PS. Moreover, let minsup = 1.8, and let the weight of the pattern PS be defined as WP(PS), i.e., WP(PS) = (WP(P1) + WP(P2) + ... + WP(Pn)) / n, where Pi ∈ PS. Pattern PS has to satisfy WSup(PS) ≥ minsup, where WSup(PS) = count(PS) × WP(PS), and minsup can be more than 1. For instance, given a pattern {B, C}, the weight of {B, C} is (0.8 + 0.4)/2 = 0.6, where the weights of the items are shown in Figure 1.4. Since the count of {B, C} is 3, as shown in Figure 1.5, WSup({B, C}) is 0.6 × 3 = 1.8. Because WSup({B, C}) is not less than minsup, {B, C} is a weighted frequent pattern. In Figure 1.5, pattern {B, C} is also a weighted maximal frequent pattern, because there is no weighted frequent pattern which is a superset of it. In the Apriori algorithm, we combine the itemsets in Lk to find Lk+1. The relationship of the subset X and the superset Y in the Apriori algorithm is shown in Figure 1.6, and the relationship of the subset X and the superset Y in weighted frequent pattern mining is shown in Figure 1.7. For instance, in Figure 1.8-(a), we use L1, the frequent 1-itemsets {A}, {B}, and {C}, to find C2, the candidate 2-itemsets {A, B}, {A, C}, and {B, C}. Then, we check each itemset in C2 to decide whether its support is larger than the threshold. Those itemsets become L2, the frequent 2-itemsets {A, B}, {A, C}, and {B, C}. But,
in Figure 1.8-(b), itemset {B, C}, which is a weighted frequent pattern, has a subset {C} that is not weighted frequent, because the items {B} and {C} have different weights. When an infrequent itemset is added to a frequent

weighted itemset, the new itemset may become a frequent itemset. In this case, the weighted frequent patterns are {A}, {B}, and {B, C}. A weighted maximal frequent pattern is a pattern which does not have any weighted frequent super-pattern. So, the weighted maximal frequent patterns in this case are {A} and {B, C}.

Item | Weight
A    | 0.6
B    | 0.8
C    | 0.4

Figure 1.4 Weight of items in Transaction Database TDB1

TID | Transaction
1   | A, C
2   | B, C
3   | A, B, C
4   | A, B
5   | B, C

Figure 1.5 Transaction Database TDB1
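These numbers can be checked mechanically. The sketch below (plain Python; the helper names are ours, and the weights and transactions are those of the running example) enumerates every itemset of TDB1, computes its weighted support, and confirms that the weighted frequent patterns are {A}, {B}, and {B,C}, with {A} and {B,C} maximal, even though the subset {C} of {B,C} is not weighted frequent:

```python
from itertools import combinations

# Weights and transactions of TDB1 (Figures 1.4 and 1.5).
weights = {'A': 0.6, 'B': 0.8, 'C': 0.4}
tdb1 = [{'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}, {'A', 'B'}, {'B', 'C'}]
minsup = 1.8

def wsup(pattern):
    """WSup(P) = (average item weight of P) * count(P)."""
    avg_w = sum(weights[i] for i in pattern) / len(pattern)
    count = sum(1 for t in tdb1 if set(pattern) <= t)
    return avg_w * count

# Enumerate every itemset; a small epsilon absorbs float rounding.
wfps = [frozenset(p) for n in (1, 2, 3)
        for p in combinations(sorted(weights), n)
        if wsup(p) >= minsup - 1e-9]
# {C} is not weighted frequent (WSup = 1.6) although {B,C} is (1.8),
# so Apriori-style subset pruning cannot be applied here.
maximal = [p for p in wfps if not any(p < q for q in wfps)]

print(sorted(sorted(p) for p in wfps))     # [['A'], ['B'], ['B', 'C']]
print(sorted(sorted(p) for p in maximal))  # [['A'], ['B', 'C']]
```

The epsilon matters: with exact arithmetic WSup({A}) = 0.6 × 3 = 1.8 meets the threshold, but naive float comparison could reject it.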

[Figure 1.6 The relationship of the subset X and superset Y in the Apriori algorithm]

[Figure 1.7 The relationship of the subset X and superset Y in weighted frequent pattern mining]

(a) {itemset}:count — {A,B,C}:1; {A,B}:2; {A,C}:2; {B,C}:3; {A}:3; {B}:4; {C}:4; Min_Sup: 2. The marked patterns are those whose support is larger than or equal to the threshold.

(b) {itemset}:WSup — {A,B,C}:0.6; {A,B}:1.4; {A,C}:1.0; {B,C}:1.8; {A}:1.8; {B}:3.2; {C}:1.6; Min_Sup: 1.8. The marked patterns are those whose weighted support is larger than or equal to the threshold.

Figure 1.8 An example EX1: (a) frequent patterns; (b) weighted frequent patterns

1.3 Applications

In the real world, item A may not be as important as item B. For instance, merchandises in the market, like an iPhone and a telephone, have different prices and costs, which give them different benefits. So, finding the most valuable itemsets is an important task in market analysis. Because of this reason,

weighted frequent pattern mining [3, 4] and utility frequent pattern mining [5] have been proposed. Besides, weighted frequent pattern mining can be widely used not only in market analysis, but also in web click stream analysis, and even in biological gene database analysis. Recently, the database and data mining communities have focused on a new data model, where data arrives in the form of continuous streams. It is often referred to as data streams or streaming data. Many applications generate large amounts of data streams in real time, such as sensor data generated from sensor networks, transaction flows in retail chains, Web records and click streams in Web applications, performance measurements in network monitoring and traffic management, call records in telecommunications, and so on [, 6]. In the data stream model, individual data items may be relational tuples, e.g., network measurements, call records, web page visits, sensor readings, and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable and unbounded streams appears to yield some fundamentally new research problems [3, 8, 9,, 4, 5, 6, 9, 2]. The work reported here could be viewed as a step towards enhancing data streams with functionalities to process queries as follows:

1. Sensor network data analysis: In a sensor network environment, stream data comes from multiple remote sources. Such an environment imposes excessive communication overhead and wastes computational resources when data is dynamic. In this situation, how to minimize the communication cost, how to combine frequency counts from multiple nodes, and how to mine data streams in parallel and update the associated information incrementally are additional issues which we need to consider.

2. Web click streams: In web mining, data can be collected at the server side, the client side, proxy servers, or obtained from the database of an organization which contains business data or consolidated web data. Each type of data collection differs
not only in terms of the location of the data source, but also in the kinds of data available, the segment of the population from which the data was collected, and its method of implementation.

3. Network monitoring and traffic management: Analyzing sequential patterns can discover which (time-based) sequences of audit events frequently occur together. These frequent event patterns provide guidelines for incorporating temporal statistical measures into intrusion detection models.

1.4 Related Work

In [2], Agrawal et al. proposed the Apriori algorithm for mining association rules. In the Apriori algorithm, all candidate itemsets are generated from the result of the previous iteration. The final results are only a few among all the generated candidate itemsets. However, to generate each candidate itemset, it needs to count its frequency over all transactions. Each iteration requires only one scan over the database in the Apriori algorithm. It does not have missing cases, but it takes a lot of time to find the frequent patterns.

In [3], Han et al. proposed a compact tree structure, called the Frequent-Pattern tree, or FP-tree for short. There have been many mining algorithms using the FP-tree in the mining part. Since the FP-tree does not generate candidate itemsets, it does not need to scan the database so many times. The FP-tree is a prefix-tree structure storing crucial, quantitative information about frequent patterns. To ensure the FP-tree is compact, it only stores length-1 items in the tree, and the way those nodes are stored depends on the frequency of each item. The FP-tree-based mining starts from finding each length-1 pattern, examines only its conditional pattern base, then constructs its FP-tree and performs the mining recursively. It does not have missing cases, but it takes a lot of time to construct the conditional trees.

In [22], Yun et al. proposed the Maximal Weighted Frequent Itemset Mining (MWFIM) algorithm for mining maximal weighted frequent patterns. The MWFIM algorithm represents each transaction as a vertical bitmap and uses the AND operation to find the count of itemsets. The structure of the MWFIM algorithm is the weight-descending-order prefix tree. They use
the weight-descending-order prefix tree and depth-first traversal to find the maximal weighted frequent patterns. If a node

pattern in the weight-descending-order prefix tree is not a weighted frequent pattern, then its child nodes in the sub-tree are not weighted frequent patterns either. Finally, they examine those patterns to find the maximal weighted frequent patterns. This approach can find the maximal weighted frequent patterns easily, but it takes a lot of time when the number of transactions is large.

In [4], Lee et al. proposed the Weighted Maximal Frequent Pattern mining over data streams based on the Sliding Window model (WMFP-SW) algorithm for mining maximal weighted frequent patterns in data streams within a transaction sliding window. The WMFP-SW algorithm uses the WMFP-SW-tree to store the transactions in the present window and reconstructs the tree in item-count-descending order. While the window slides, this algorithm removes the old transactions, inserts the new transactions, and reconstructs the tree again. To decrease the number of candidates, they find MaxW, the largest weight of any item in the itemset, and calculate minsup/MaxW. If the count of the itemset is smaller than minsup/MaxW, it is not a weighted frequent pattern. Finally, they traverse the single-path to find the maximal weighted frequent patterns. This algorithm can easily find weighted maximal frequent patterns based on the sliding window model.

In [2], Yun et al. proposed the maximal frequent pattern mining with weight conditions over data streams (MWS) algorithm for mining maximal weighted frequent patterns in the landmark window. The MWS algorithm is similar to the WMFP-SW algorithm, but it does not need to remove the old transactions. Besides, it shows that the number of nodes under the count-descending order is smaller than the number of nodes under the weight-descending order in its tree structure, which was never compared in [4].

1.5 Motivation

The discovery of maximal frequent itemsets is an important problem in the area of data mining. A weighted maximal frequent pattern is an itemset which is not a subset of any other weighted frequent pattern and whose weighted
support is larger than or equal to the threshold. As a result, the weighted maximal itemsets are

more compact than the complete set of weighted frequent patterns. Previous algorithms for mining maximal frequent itemsets in data streams have been based on many different models. The WMFP-SW algorithm [4] proposed by Lee et al. is an algorithm for mining weighted maximal frequent patterns based on the sliding window model. In their structure, as shown in Figure 1.9, the WMFP-SW algorithm uses a compact structure to mine the weighted maximal frequent (WMFP) patterns. In the WMFP-SW algorithm, they use the WMFP-SW-tree and the WMFP-SW-array to mine WMFP patterns, and they use MaxW pruning and single-path pruning to find the maximal weighted frequent patterns. However, the WMFP-SW algorithm takes a long time to reconstruct the WMFP-SW-tree while the window slides. Because of this, if the size of the transactions is large, the WMFP-SW algorithm has to spend a lot of time reconstructing the tree. Moreover, the WMFP-SW algorithm has missing cases when it prunes patterns by the single-path. Therefore, in this thesis, we propose the Weighted-Order-Based Lattice (WOB-Lattice) algorithm to efficiently find the result without reconstructing the whole data structure. In our data structure, as shown in Figure 1.10, we use a lattice to incrementally mine the weighted frequent patterns. We check the relation between the incoming transaction and the old transactions when the new transaction comes. There are five relations which are considered in our algorithm: (1) equivalent, (2) subset, (3) intersection, (4) empty set, and (5) superset. Because of the relations between the set and the subset, there are two advantages. First, when the support increases, the WOB structure remains the same. Second, we will not lose any result. From the simulation results, we show that our proposed WOB-Lattice algorithm provides better performance than the WMFP-SW algorithm. Because our lattice structure can decrease the time of reconstructing the tree, our WOB-Lattice algorithm can
decrease the processing time.
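The five set relations used when a new transaction arrives can be sketched with plain set operations (an illustrative sketch with our own function name; the thesis's actual procedures are given in Chapter 3):

```python
def relation(new_tx, node):
    """Classify an incoming transaction against a lattice node's itemset
    into the five cases considered when the window slides."""
    new_tx, node = set(new_tx), set(node)
    if new_tx == node:
        return 'equivalent'    # same itemset: only the count changes
    if new_tx < node:
        return 'subset'        # proper subset of the stored itemset
    if node < new_tx:
        return 'superset'      # proper superset of the stored itemset
    if new_tx & node:
        return 'intersection'  # shares some, but not all, items
    return 'empty set'         # shares no items at all

print(relation({'A', 'B'}, {'A', 'B'}))       # equivalent
print(relation({'A'}, {'A', 'B'}))            # subset
print(relation({'A', 'B', 'C'}, {'A', 'B'}))  # superset
print(relation({'A', 'F'}, {'A', 'B'}))       # intersection
print(relation({'D'}, {'A', 'B'}))            # empty set
```

The equivalent case explains the first advantage above: when an identical transaction arrives, only a counter is incremented and the lattice structure itself is untouched.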

[Figure 1.9 WMFP-SW-tree [4]: the header table (Item, Count) and the prefix tree for the items in the current window]

[Figure 1.10 WOB-Lattice tree: for TID 1 (B,C,E), TID 2 (D), TID 3 (A,B), and TID 4 (A,F), the lattice under the root contains the nodes (B,C,E):1 Tid(1); D:1 Tid(2); (A,B):1 Tid(3); (A,F):1 Tid(4); B:2 Tid(1,3); and A:2 Tid(3,4)]

1.6 Organization of the Thesis

The rest of this thesis is organized as follows. Chapter 2 gives a survey of some algorithms for problems related to mining association rules, including mining weighted frequent patterns. Chapter 3 presents the proposed Weighted-Order-Based Lattice algorithm. Chapter 4 presents the performance study of the Weighted-Order-Based Lattice algorithm. Finally, Chapter 5 gives the conclusion.

CHAPTER II
Survey of Algorithms for Mining Weighted Frequent Patterns

Weighted frequent pattern mining is a very important problem in real databases. In this chapter, we give a survey of some well-known algorithms for finding weighted maximal frequent patterns among objects in databases, including the MWFIM algorithm [22], the WMFP-SW algorithm [4], and the MWS algorithm [2]. On the other hand, it is hard to extract all frequent patterns in data streams, because data streams keep changing and large numbers of new elements are added. To solve this problem, maximal frequent patterns can be used, as in the Set-Checking algorithm [9]. In this chapter, we first introduce the MWFIM algorithm, the WMFP-SW algorithm, and the MWS algorithm. Next, we introduce the Set-Checking algorithm.

2.1 The MWFIM Algorithm

A weighted frequent pattern means that the weighted support of the pattern WP = {P1, P2, ..., Pn} must be larger than the threshold, where Pi is the i-th item in WP. Moreover, the weighted support WSup is defined as the average weight of the pattern multiplied by the count of the pattern, as shown in the following formula:

WSup(WP) = ((P1 + P2 + ... + Pn) / length(WP)) × count(WP)

If a weighted frequent pattern has no weighted frequent superset, it is called a Maximal Weighted Frequent Pattern (MWFP). However, one of the difficulties of mining maximal weighted frequent patterns is that some subsets of an MWFP might not be weighted frequent.
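Since the weighted support is the average item weight times the count, and the average weight can never exceed the largest item weight MaxW, a lower bound on the count follows; the WMFP-SW algorithm (Section 2.2.2) uses exactly this MaxW bound to prune candidates. A sketch (hypothetical function name, not from the thesis):

```python
def can_prune(count, max_w, minsup):
    """MaxW bound: WSup = avgW * count <= max_w * count, so a pattern
    whose count is below minsup / max_w can never be weighted
    frequent and may be discarded as a candidate."""
    return count < minsup / max_w

# With minsup = 1.8 and MaxW = 0.8 (the largest item weight in the
# running example), a candidate needs a count of at least
# 1.8 / 0.8 = 2.25, i.e. a count of 3 or more.
print(can_prune(2, 0.8, 1.8))  # True:  pruned
print(can_prune(3, 0.8, 1.8))  # False: kept as a candidate
```

The bound is safe but not tight: a pattern that survives it may still fail the exact WSup test once its true average weight is used.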

The MWFIM algorithm [22] uses a search strategy for MWFP mining according to the weighted descending order. Figure 2.1 shows items in weighted descending order and their prefix tree. The weight of itemset {B} is equal to or larger than the weights of itemset {B, A} and itemset {B, C}. When a new item is added to a pattern, the weight of the new item is always equal to or smaller than the weight of the last item in the pattern, since all items have been sorted in weighted descending order. For this reason, we can be sure that the weight of every parent node is equal to or larger than the weight of its child nodes in the prefix tree. Besides, the count of a subset is always equal to or larger than the count of a superset. So, the WSup of every parent node is equal to or larger than the WSup of its child nodes in the prefix tree. If the WSup of a parent node is smaller than MinSup, the WSup of any one of its child nodes is also smaller than MinSup.

Figure 2.1: An example of a prefix tree containing items according to the descending order of weights. (root; {B}, {A}, {C}; {B, A}, {B, C}, {A, C}.)

The MWFIM algorithm also uses the vertical bitmap representation to find patterns in transactions. Figure 2.2 shows that the bitmap of itemset {A, B} can be generated easily by performing an AND-operation on all of the bits in the bitmaps for itemset {A} and itemset {B}. Then, we can easily count the occurrences of itemset {A, B}. Basically, the bitmap representation is ideal for both steps of candidate pattern generation and support counting.
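The vertical bitmap idea can be written down in a few lines; the item bitmaps below are hypothetical, with the leftmost bit standing for TID 1.

```python
# Sketch of the vertical bitmap representation: one bit per transaction
# for each item. ANDing two bitmaps gives the candidate itemset's bitmap,
# and counting the set bits gives its support. Bitmaps are illustrative.

bitmaps = {
    "A": 0b10110,  # A occurs in TIDs 1, 3, 4 (leftmost bit = TID 1)
    "B": 0b11010,  # B occurs in TIDs 1, 2, 4
}

candidate = bitmaps["A"] & bitmaps["B"]  # bitmap of itemset {A, B}
support = bin(candidate).count("1")      # popcount = support of {A, B}
print(support)                           # {A, B} occurs in TIDs 1 and 4
```

Candidate generation and support counting thus both reduce to cheap bitwise operations.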

Figure 2.2: An example of a vertical bitmap representation over five transactions.

2.2 The WMFP-SW Algorithm

Ryu et al. propose the WMFP-SW algorithm [4] to mine weighted maximal frequent patterns (WMFPs) over a sliding window. Their WMFP-SW algorithm performs mining operations based on a tree structure; accordingly, they need tree structures suitable for finding WMFPs over sliding-window-based data streams.

Figure 2.3: An example of Transaction Database TDB2 in the sliding-window model [4].

TID | Transaction
1 | A, B, C, E
2 | A, D
3 | A, B
4 | A, F
5 | D, F
6 | D, F, G

(Window w1 covers TIDs 1-4, and window w2 covers TIDs 3-6.)

2.2.1 Data Structures for WMFP-SW

Their WMFP-SW algorithm [4] performs mining operations based on the WMFP-SW-tree and the WMFP-SW-array. The tree structure is similar to the FP-tree [2], but the WMFP-SW-tree has additional weight data and is constructed with

only one scan. For example, Figure 2.3 is a sliding-window data stream, and sliding window w1 has been scanned as shown in Figure 2.4-(a). Then, the tree has to be reconstructed in descending count order, as shown in Figure 2.4-(b). As the window slides, a part of the tree is removed as shown in Figure 2.5-(a), the new data items are added into the tree in Figure 2.5-(b), and then the tree reconstructing operation is performed as shown in Figure 2.5-(c).

Figure 2.4: The WMFP-SW-tree for sliding window w1: (a) before the reconstructing operation; (b) after the reconstructing operation [4].

Figure 2.5: The WMFP-SW-tree for sliding window w2: (a) Deletion: TID 1 (A,B,C,E) and TID 2 (A,D); (b) Insertion: TID 5 (D,F) and TID 6 (D,F,G); (c) after the reconstructing operation again [4].

Figure 2.6: The weights of the items in Transaction Database TDB2 [4].

Item | Weight
A | 0.5
B | 0.7
C | 0.8
D | 1.0
E | 0.4
F | 0.9
G | 0.6

The WMFP-SW-array stores a part of the node data of the WMFP-SW-tree. Figure 2.7-(a) is the WMFP-SW-array of Figure 2.5-(c). The number in entry [F, D] is 2, which means that the count of item D under item F in Figure 2.5-(c) is 2. Using conditional trees to find WMFPs requires scanning the WMFP-SW-tree many times; that is why they use the WMFP-SW-array. In data streams, mining patterns with the conditional WMFP-SW-array yields better performance. In Figure 2.7-(b), we can see that item G always appears in the same transactions as item F and item D.

Figure 2.7: The WMFP-SW-array [4]: (a) the global WMFP-SW-array; (b) G's conditional WMFP-SW-array.

2.2.2 Pruning Patterns by MaxW

Because weighted candidates do not satisfy the anti-monotone property [7] in general, a weighted infrequent pattern could be a subset of a weighted frequent pattern [22]. As a result, incorrect pruning operations based on the weights can cause weighted pattern losses. To solve this problem and perform efficient pruning procedures, they define a pruning condition applied in the sliding window model, named MaxW. MaxW is the largest value among the weights of all items included in the current tree; the weights of the items are shown in Figure 2.6. So the current MaxW of the tree in Figure 2.5-(c) is 1.0. Assume that the minimum support threshold, MinSup, is 2. Item G is weighted infrequent, because its count multiplied by MaxW would be less than MinSup. On the other hand, in the conditional tree of item B,

M axw is smaller than, since item D does not participate in the mining process any longer and the new M axw becomes 7 in the conditional tree 223 Pruning Patterns in Single-Path If any single-path is generated in the process of the mining steps, they would calculate W Sup, the weighted support, for the patterns If the W Sup value is not larger than the threshold, they would find other W M F P s which are the subsets of the single-path For example, Figure 28 shows how to extract W M F P s in a single-path and M insup is 2 In the first step, since W Sup of itemset {, C, D, B, E} is 8 and smaller than M insup So, they start to find the longest subsets of the itemset {, C, D, B, E} In length 4, the W Sup of itemset {, C, D, E} and the W Sup of itemset {, C, B, E} are larger than M insup, and the W Sup of itemset {, C, B, E} and the W Sup of itemset {, D, B, E} are smaller than M insup Thus, itemset {, C, D, E} and itemset {, C, B, E} are W M F P s 2

Figure 2.8: A single path of the E prefix tree: (a) the path of E; (b) every pattern in the E prefix tree [2].

2.3 The MWS Algorithm

The MWS algorithm [2] is also proposed by Ryu et al., and it is similar to the WMFP-SW algorithm. The two algorithms have the same structures and strategies. The difference is that the MWS algorithm is based on the landmark window model, while the WMFP-SW algorithm is based on the sliding window model; therefore, we do not discuss the data structure and the strategies of the MWS algorithm in this section. However, since items are never deleted from the structures of the MWS algorithm, the MWS algorithm is easier to implement. Moreover, their work shows that the descending count order is better than the descending weight order for the tree structure. Although with the descending weight order we would not need to compute weights to find MaxW, it would cause an inefficient and incompact tree to be built. For these reasons, the MWS algorithm and the WMFP-SW algorithm use the descending count order, not the descending weight order.

2.4 The Set-Checking Algorithm

The Set-Checking algorithm is proposed by Chang et al. It uses a lattice structure to store the transactions [9].

2.4.1 Data Structure

The lattice structure [8, 9, 22] has been widely used in weighted data mining, because the lattice structure easily presents the relation between the new transaction and the current transactions, and the supports of the patterns can be updated efficiently. Each node in the lattice structure contains two parts: (1) Bit-pattern, which represents the itemset, and (2) Sup, which represents the support of the itemset. For instance, the bit pattern of itemset {a, c} is denoted as Bit(10100), since there are five items {a, b, c, d, e}.

2.4.2 The Set-Checking Algorithm

The Set-Checking algorithm has three main steps:

1. Transform the itemset to the bit pattern.

2. Check the relation between the new transaction and the old transactions. There are five cases:

Case 1: Equivalent, as shown in Figure 2.9.
Case 2: Subset, as shown in Figure 2.10.
Case 3: Superset, as shown in Figure 2.11.
Case 4: Intersection.
  Case 4-1: The intersected node is not in the lattice, as shown in Figure 2.12.
  Case 4-2: The intersected node is in the lattice, as shown in Figure 2.13.
Case 5: Empty set, as shown in Figure 2.14.

3. Examine which transactions are the maximal frequent itemsets.

Figure 2.9: Case 1: Equivalent [9]. (Inserting {abc} into a lattice that already holds abc: 1, Tid(1) updates the node to abc: 2, Tid(1, 2).)

Figure 2.10: Case 2: Subset [9]. (Inserting {bc} under abc: 1, Tid(1) adds the child node bc: 2, Tid(1, 2).)

Figure 2.11: Case 3: Superset [9]. (Inserting {abcd} above ad: 1, Tid(1) adds abcd: 1, Tid(2) with the child ad: 2, Tid(1, 2).)

Figure 2.12: Case 4-1: The intersected node is not in the lattice [9]. (Inserting {acd} beside abc: 1, Tid(1) creates acd: 1, Tid(2) and a new intersection node ac: 2, Tid(1, 2).)

Figure 2.13: Case 4-2: The intersected node is in the lattice [9]. (Inserting {bd} beside abc: 1, Tid(1) adds bd: 1, Tid(3) and updates the existing intersection node b: 2, Tid(1, 2) to b: 3, Tid(1, 2, 3).)

Figure 2.14: Case 5: Empty set [9]. (Inserting {cd} beside ab: 1, Tid(1) simply adds cd: 1, Tid(2).)
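Once transactions are bit patterns, the five cases can be distinguished with plain bitwise operations; the helper below is our own sketch, not code from [9].

```python
# Sketch of classifying the five Set-Checking cases with bitwise ops.
# Bit(10100) encodes {a, c} over the five items {a, b, c, d, e}.

def classify(new, old):
    inter = new & old
    if inter == 0:
        return "Case 5: empty set"
    if new == old:
        return "Case 1: equivalent"
    if inter == new:
        return "Case 2: subset"      # new transaction lies inside old
    if inter == old:
        return "Case 3: superset"    # new transaction covers old
    return "Case 4: intersection"

print(classify(0b01100, 0b11100))    # {b, c} vs {a, b, c}
print(classify(0b00110, 0b11000))    # {c, d} vs {a, b}
```

Whether Case 4 falls into 4-1 or 4-2 then depends on whether the node for `new & old` already exists in the lattice.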

CHAPTER III

The Weight-Order-Based Lattice Algorithm

In Chapter 1, we mentioned several window models for algorithms on data streams. One of the well-known window models is the sliding window model. The sliding window model can be applied in many applications, since it focuses on the recent data, which is more important and relevant than the old data. For applications that tend to capture the recent data, the sliding window model provides more informative and useful information than the other two models. In this chapter, we present our algorithm.

3.1 Data Structure

In our algorithm, we propose the weight-order-based lattice structure and the bit-pattern representation of items based on the sliding window model. We use Transaction Database TD1, as shown in Figure 3.1, to illustrate our idea. Because we need the weight descending order, we have to sort the items, shown in Figure 3.2-(a), by their weights, as shown in Figure 3.2-(b). This lattice structure has two advantages. First, using the weight-order-based lattice structure, the relationship between the new transaction and the present transactions can be easily understood. Second, we can update the support of a transaction efficiently.

Figure 3.1: Transaction Database TD1 without sorting.

TID | Transaction
1 | B, D, E
2 | C, D
3 | A, B, C, D, E
4 | B
5 | A, B, C, D
6 | D, E

(Window W1 covers TIDs 1-4, and window W2 covers TIDs 3-6.)

Figure 3.2: Sorting the items by weight: (a) before sorting: A: 4, B: 8, C: 3, D: 5, E: 6; (b) after sorting: B: 8, E: 6, D: 5, A: 4, C: 3.

The weight-order-based lattice structure contains a root, nodes, and child-links. An example is shown in Figure 3.3. The root carries no information; it is only a starting point. When a new transaction is inserted, we sort the transaction in weight descending order and search the weight-order-based lattice structure from the root. Each node records the following information:

Bit-Pattern: It represents an itemset.

Sup: It represents the support of the itemset.
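A node of this lattice can be sketched as a small record; the class and field names below are our own, while the recorded information (bit-pattern, support) plus the Tid set and child-links follows the description above and Figure 3.3.

```python
# Minimal sketch of a weight-order-based lattice node. Names are ours;
# the stored fields follow the node description in the text.

class LatticeNode:
    def __init__(self, bit_pattern):
        self.bit_pattern = bit_pattern  # e.g. "11100" for [B, E, D]
        self.sup = 1                    # support starts at 1 on insertion
        self.tids = set()               # transactions containing the itemset
        self.children = []              # child-links point to subset nodes

root = LatticeNode("00000")             # the root carries no information
node = LatticeNode("11100")             # [B, E, D]
node.tids = {1, 3}
node.sup = len(node.tids)
root.children.append(node)
print(node.bit_pattern, node.sup)
```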

The child-link points to a subset node. With the child-links, we can check the relationships between nodes and insert a node into the weight-order-based lattice structure efficiently. Moreover, we can increase the supports of the related nodes easily.

Figure 3.3: The lattice of Window W1. (root; [B, E, D, A, C]: 1, Tid(3); [B, E, D]: 2, Tid(1, 3); [D, C]: 2, Tid(2, 3); [B]: 3, Tid(1, 3, 4); [D]: 3, Tid(1, 2, 3).)

3.2 Our Proposed Algorithm

In this section, we first give some basic definitions and notations, and then discuss how to use these techniques to find and represent the weighted maximal frequent itemsets and organize them in a lattice. Our algorithm has four main steps:

1. Transform the itemset to the bit pattern.

2. Check the relation between the new transaction and the old transactions.

Table 3.1: Variables.

Variable | Meaning
minsup | Threshold
GMAX_W | Global MAX Weight
LMAX_W | Local MAX Weight
Sup | Support of the pattern
WSup | Weighted support of the pattern
WTD | Weighted Transaction Database
Ti | The i-th transaction
ITS | A set of items
ITi | Item i
WTi | Weight of item i

3. Use the pruning strategies to prune the patterns.

4. Examine which transactions are the weighted maximal frequent itemsets.

In the first step, we use the bit-pattern representation to store the transactions. We use the weight descending order to transform a transaction into its bit-pattern representation. For each transaction T in the current window, the bit pattern of transaction T is denoted as Bit(T), and the length of the bit pattern is the number of distinct items in the whole database. If item X is in the transaction and item X is at the i-th place of the weight descending order, the i-th bit of Bit(T) is set to 1; otherwise, it is set to 0. For instance, item D is the third item in Figure 3.2-(b); for this reason, the bit pattern of any transaction which contains item D has a 1 in its third place. In the second step, we check the relation between the new transaction and the old transactions. There are five cases when inserting itemsets into the lattice structure. The flowchart is shown in Figure 3.4. When the current window becomes full, we delete the two oldest transactions and insert two new transactions.
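The first step can be sketched directly from the weights of Figure 3.2; the function name is our own.

```python
# Sketch of step 1: sort the distinct items by descending weight, then
# map a transaction to its bit pattern. Weights follow Figure 3.2.

weights = {"A": 4, "B": 8, "C": 3, "D": 5, "E": 6}
order = sorted(weights, key=weights.get, reverse=True)  # [B, E, D, A, C]

def bit_pattern(transaction):
    return "".join("1" if item in transaction else "0" for item in order)

print(bit_pattern({"C", "D"}))  # TID 2 of TD1
```

For TID 2 = {C, D}, item D is third and item C is fifth in the weight order, so the sketch yields the bit pattern 00101.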

Figure 3.4: The flowchart of the algorithm. (The new transaction NewT is compared with each old transaction OldT and handled by Cases 1-5; supports and Tid sets are updated, new intersection nodes are added to the lattice, and the procedure CheckWMFP tests WSup(ITS) >= minsup to maintain the WMFP table.)

3.2.1 Data Initialization

In mining the maximal frequent itemsets from a data stream, for each new transaction we must discover its relations (equivalent, superset, subset, intersection, empty) with the present transactions. In order to check these relations efficiently, we use the bit pattern to represent the itemset. The maximal length of the bit pattern is the number of distinct items. An example of a transaction mapped to its bit pattern is shown in Figure 3.5.

Figure 3.5: The bitmap of TID 2 in Transaction Database TD1. (Items B, E, D, A, C; bitmap 00101.)

When an item appears in the transaction, we set the related bit to 1 according to the weight descending order. In Figure 3.1, Tid 2 is {C, D}, which becomes [D, C] in the weight-order-based lattice structure. We set the third and the fifth positions to 1, so the bit pattern is denoted as [00101]. The bit patterns of Window W1 are shown in Table 3.2. For every new transaction being inserted, we set the support to 1. After checking the new transaction against each present transaction, the supports may be increased and new itemsets may be created. In the next section, we introduce this process in detail.

3.2.2 The Pruning Strategy

In this section, we first build an Appearing Table AppearT to store the counts of the patterns in the lattice. After we build the Appearing Table AppearT, we use

Table 3.2: An example of the data stream mapped to the bit patterns.

Tid | Itemset | Bit-Pattern
1 | B, E, D | 11100
2 | D, C | 00101
3 | B, E, D, A, C | 11111
4 | B | 10000

two pruning strategies: the global maximal weight (GMAX_W) pruning strategy and the local maximal weight (LMAX_W) pruning strategy. Note that GMAX_W means the largest weight among the items of all of the transactions, and LMAX_W means the largest weight among the items of a certain transaction. In the global pruning strategy, we make use of GMAX_W, the maximum weight of an item X among the items appearing in all of the transactions. For any pattern Y, the average weight of pattern Y, Weight(Y), must be smaller than or equal to GMAX_W. Let Count(Y) denote the count of pattern Y. Therefore, if Count(Y) * GMAX_W is smaller than the threshold, pattern Y cannot be in the result; that is, pattern Y can be pruned. In other words, if Count(Y) < threshold / GMAX_W, pattern Y can be pruned. In our algorithm, Count(Y) = I, where P(I), the I-th entry of the Appearing Pattern table, records every pattern Y with count = I. Therefore, we prune the patterns stored in P(I) if I is smaller than threshold / GMAX_W. Note that I is an integer, while threshold / GMAX_W is a real number. The procedure of the global maximal weight pruning strategy is shown in Figure 3.6. In order to check whether a subset X of a pattern Y is frequent or not, we must record the count of subset X. Here, similar to the concept of the closed set, we only record a subset X whose count is different from that of pattern Y in the Appearing Table AppearT. Note that a closed set is a set whose supersets are not frequent or whose count is larger than the counts of its supersets. Moreover,

1: Procedure GMAX_W_pruning(AppearT, min_support, T); /* T is a transaction */
2: begin
3:   foreach (transaction T in AppearT whose count < δ / GMAX_W) do
4:   begin
5:     delete transaction T from AppearT;
6:   end;
7: end;

Figure 3.6: Procedure GMAX_W_pruning.

P(I) records the patterns with count = I in the current window. Basically, only a subset X whose count is larger than that of pattern Y will be recorded in AppearT. For instance, in Figure 3.7, the threshold, which we denote as δ, is 13. GMAX_W is set to 8, since the maximal-weight item is B. So, we delete every pattern whose count is less than 13/8. In this case, we delete the pattern [B, E, D, A, C], because the count of [B, E, D, A, C] is 1, which is less than δ / GMAX_W.

Figure 3.7: The global MAX weight (GMAX_W) pruning strategy. (P(1): [B, E, D, A, C], pruned; P(2): [B, E, D], [D, C]; P(3): [B], [D]; P(4): ø.)

In the local pruning strategy, since the items of a pattern PX (a list) are recorded according to the descending order of weights, the first item X in a pattern PX must be the one with the maximum weight among the items in PX. We use LMAX_W to record the weight of this item X. Therefore, for any other item Y in the same pattern PX, the weight of item Y must be smaller than or equal to LMAX_W. That is, the average weight of pattern PX must be smaller than or equal to LMAX_W. In the

1: Procedure LMAX_W_pruning(AppearT, min_support, T); /* T is a transaction */
2: begin
3:   foreach (transaction T in AppearT whose count < δ / LMAX_W) do
4:   begin
5:     delete transaction T from AppearT;
6:   end;
7: end;

Figure 3.9: Procedure LMAX_W_pruning.

same way, if Count(PX) * LMAX_W is smaller than or equal to the threshold, pattern PX can be pruned. Note that Count(PX) = I, where P(I) records pattern PX, which appears I times in the transactions within the current window. In Figure 3.8, LMAX_W is set to the weight of the first item of each pattern, since the patterns have been sorted. For instance, the LMAX_W of [D, C] is set to 5, since the maximal-weight item in the pattern [D, C] is D. In the local maximal weight pruning strategy, [D, C] is deleted, because multiplying 2 (the count of [D, C]) by 5 (the LMAX_W of [D, C]) gives 10, which is less than δ. The procedure of the local maximal weight pruning strategy is shown in Figure 3.9.

Figure 3.8: The local MAX weight (LMAX_W) pruning strategy. (P(1): ø; P(2): [B, E, D] and [D, C], the latter pruned; P(3): [B], [D]; P(4): ø.)

To find the result efficiently, we avoid checking too many subsets X of a pattern Y. Even though a subset X of pattern Y is not frequent, pattern Y may still be frequent. Note that Count(Y) is still smaller than or equal to Count(X); however, there is no relationship between Weight(Y) and Weight(X), and therefore no relationship between WSup(Y) and WSup(X). That is, we cannot give up checking

the subsets of pattern Y, even if pattern Y is not frequent. However, for checking the subsets of pattern Y, we can also make use of the concept of the ordered set; that is, we can apply a concept similar to the local pruning strategy. In this way, we do not have to check every subset X of pattern Y. After the pruning step ends, we can start to find the WMFPs from the candidate patterns. We start from the pattern Y which appears the fewest times among all candidate patterns, since it may have a subset X whose number of appearances is larger than that of pattern Y. In weighted maximal frequent mining, a transaction may have more than one WMFP. For instance, the WSup of the pattern [B, E, D] is less than δ, but pattern [B, E] and pattern [B, D], the subsets of [B, E, D], are WMFPs. To make sure that we will not lose any WMFP, we would have to trace every subset of a candidate pattern, but that would take a lot of time. To solve this problem, we propose a way to efficiently trace the subsets of a candidate pattern. For a candidate pattern PZ of length r, PZ = [t1, t2, ..., tr], if its ordered subset PY of length h, PY = [t1, t2, ..., th] (h < r), has WSup smaller than the threshold, then we can prune every other subset of PZ of length h, since each of them must also satisfy the condition WSup < threshold. The reason is that pattern PZ has been sorted, so PY is the subset of length h with the largest weight among all subsets of PZ. For instance, if the pattern [B, E, D, A, C] has appeared 2 times, we can use it to find WMFPs. After we calculate the WSup of [B, E, D, A, C], we find that [B, E, D, A, C] is not a WMFP. By the definition, the ordered subsets of [B, E, D, A, C] are [B, E, D, A], [B, E, D], [B, E], and [B]. Then, after calculating the WSup of [B, E, D, A], we find that [B, E, D, A] is also not a WMFP. Consequently, we know that every 4-length subset PY of [B, E, D, A, C] is not a WMFP, because [B, E, D, A] has the largest weight among them and they have the same counts. After we use LMAX_W to
find the WMFPs, we can build a WMFP-table to store the WMFPs and their weights, as shown in Figure 3.10.
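The ordered-subset pruning just described can be sketched as follows, using the item weights of Figure 3.2 and the δ = 13 threshold of the running example; as in the text, the traced subsets are assumed to share the candidate's count.

```python
# Sketch of tracing a candidate PZ's ordered subsets: because PZ is
# weight-sorted, the length-h prefix has the largest average weight among
# the length-h subsets considered, so once a prefix fails the threshold,
# every other length-h subset (with the same count) fails as well.

weights = {"B": 8, "E": 6, "D": 5, "A": 4, "C": 3}  # Figure 3.2 weights

def wsup(items, count):
    return count * sum(weights[i] for i in items) / len(items)

pz, count, delta = ["B", "E", "D", "A", "C"], 2, 13
for h in range(len(pz) - 1, 0, -1):
    prefix = pz[:h]
    if wsup(prefix, count) >= delta:
        print("length-%d subsets of %s may still hold WMFPs" % (h, pz))
        break
    # the length-h prefix failed, so skip all other length-h subsets
```

This matches the running example: the prefixes [B, E, D, A] (WSup = 11.5) and [B, E, D] (WSup ≈ 12.7) fail, while [B, E] (WSup = 14) survives, so only subsets of length 2 and below need further checking.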