國立中山大學學位論文典藏.PDF

國立中山大學資訊工程學系碩士論文
Department of Computer Science and Engineering
National Sun Yat-sen University
Master Thesis

一個於資料串流移動視窗中以權重排序晶格來探勘最大權重頻繁樣式的方法
Weight-Order-Based Lattice Algorithm for Mining Maximal Weighted Frequent Patterns over a Data Stream Sliding Window

研究生 (Student): 邱宗哲 (Tzung-Je Chiou)
指導教授 (Advisor): 張玉盈 博士 (Dr. Ye-In Chang)

中華民國 104 年 6 月
June 2015

ACKNOWLEDGMENTS (謝辭)

"Life is a long road. Rough or smooth, we have to walk it ourselves; no one can walk it for us. Fortunately, a group of family and friends has accompanied me through every stage. When the traffic jammed, I had ample time to enjoy the scenery along the way; when the road was hard to travel, I had the chance to stop and think; and the map in my hand kept me from taking detours and losing my direction." (excerpted from the encouraging words of the counseling center)

First, I would like to thank my advisor, Dr. Ye-In Chang. Thanks to her careful guidance over these two years, I was able to grow both in my studies and in dealing with people, and to finish my degree on schedule. Next, I thank the members of my oral examination committee, Prof. 陳健輝, Prof. 李建億, and Prof. 郭大維, who took time out of their busy schedules to review this thesis and offered incisive opinions and suggestions that made it more complete. "Few things in the world are more powerful than positive encouragement: a smile, an optimistic and hopeful word, a 'you can do it' when things become difficult." (DeVos) I thank my labmates 凱甯 and 心怡, who accompanied me for two years of constant mutual encouragement and growth, and the senior and junior members of the laboratory, 陳子翔, 李迦恩, 蔡孟璇, 琪智, 修培, and 嘉慧, for their encouragement and help over the past year. Studying and growing on a campus beside the mountains and the sea, and in such a warm laboratory, has been the happiest part of my two years as a master's student. Finally, I thank my family and friends for their hard work and care, which allowed me to complete my studies without worry. As I complete this thesis, I dedicate the honor to everyone who has helped me. Thank you all.

Academic Year: 104
Semester: 2
University: National Sun Yat-sen University
Department: Department of Computer Science and Engineering
Thesis Title (Chinese): 一個於資料串流移動視窗中以權重排序晶格來探勘最大權重頻繁樣式的方法
Thesis Title (English): Weight-Order-Based Lattice Algorithm for Mining Maximal Weighted Frequent Patterns over a Data Stream Sliding Window
Degree: Master
Language: English
Student ID: 2348
Open Access: No
Pages: 63
Student: Chiou, Tzung-Je (邱宗哲)
Advisor: Dr. Ye-In Chang (張玉盈)
Keywords (Chinese): 資料串流, 項目集, 移動式視窗模型, 晶格, 最大權重頻繁項目集
Keywords (English): Data Stream, Itemset, Sliding Window Model, Lattice, Weighted Maximal Frequent Itemset

摘要 (Abstract in Chinese, translated)

In real life, weighted frequent pattern mining over data streams is a very important field, for example, in supermarkets. Among these problems, mining weighted maximal frequent patterns is also an important issue. A weighted maximal frequent pattern is a pattern that is not a subset of any other weighted frequent pattern and whose weighted support is greater than or equal to the threshold. However, many previous methods cannot be applied to weighted frequent pattern mining, because most of them adopt the anti-monotone property; with weights, even if a subset X of a pattern Y is not a weighted frequent pattern, Y may still be a weighted frequent pattern. In addition, because data streams are fast, continuous, unbounded, and real-time, we can scan the database only once, so the previous algorithms for mining traditional databases are not applicable to data streams. Furthermore, many applications focus on the recent part of a data stream, and the sliding window model is exactly the model that processes the most recent data. To find weighted maximal frequent patterns in a sliding window, Ryu et al. proposed the WMFP-SW algorithm, which uses an FP-tree to mine weighted maximal frequent patterns and uses the maximal weight to reduce the candidate patterns. However, it spends too much time when the window slides, because whenever new transactions arrive, the WMFP-SW algorithm must rebuild the entire FP-tree. Moreover, the WMFP-SW algorithm may have missing cases. To solve these problems, in this thesis we propose the Weighted-Order-Based Lattice algorithm based on the sliding window model. We use a lattice to store the transaction data; the lattice structure stores the relationships between parent and child nodes, and in each lattice node we store the itemset and its count. When a new transaction arrives, we classify it into one of five set relations: (1) equivalent, (2) subset, (3) intersection, (4) empty set, and (5) superset. Based on these five relations, inserting or updating data becomes more efficient. In addition, we use a global maximal weight pruning strategy and a local maximal weight pruning strategy to avoid unnecessary computation. From the performance study, we show that on both real data and synthetic data, our Weighted-Order-Based Lattice algorithm is more efficient than the WMFP-SW algorithm.

(關鍵詞: 資料串流, 項目集, 移動式視窗模型, 晶格, 最大權重頻繁項目集)

ABSTRACT

Weighted frequent pattern mining in data streams is an important field for the real world, such as the supermarket. Moreover, mining the weighted maximal frequent patterns is also an important issue. A weighted maximal frequent pattern is a pattern which is not a subset of any other weighted frequent pattern and whose weighted support is larger than the threshold. However, many previous Apriori-like algorithms cannot be used in weighted frequent pattern mining. The reason is that even though a subset X of a pattern Y is not a weighted frequent pattern, the pattern Y may be a weighted frequent pattern. Besides, because data streams are continuous, high-speed, unbounded, and real-time, we can scan the data streams only once. Therefore, the previous algorithms for the traditional databases are not suitable for data streams. Furthermore, many applications are interested in the recent data streams, and the sliding window is the model which deals with the most recent data streams. In order to mine weighted maximal frequent patterns based on the sliding window model, Ryu et al. propose the WMFP-SW algorithm. The WMFP-SW algorithm uses the FP-tree to mine the weighted maximal frequent patterns. It also uses the maximal weight to prune the patterns. But it takes a long time in mining the weighted maximal frequent patterns, because when a new transaction comes, the WMFP-SW algorithm always has to reconstruct the FP-tree. Moreover, the WMFP-SW algorithm may have a missing case. To solve those problems, in this thesis, we propose the Weighted-Order-Based Lattice algorithm based on the sliding window model. We use the lattice structure to store the information of the transactions. The structure of the lattice stores the relationship between the child node and the father node. In each node, we record the itemset and the count. When a new transaction comes, we consider five relations: (1) equivalent, (2) subset, (3) intersection, (4) empty set, and (5) superset. With those five relations, we can add the new transactions and update the supports efficiently. Moreover, we use the global maximal weight pruning strategy and the local maximal weight pruning strategy to avoid generating invalid candidate patterns. From the performance study on both real data and synthetic data, we show that the Weighted-Order-Based Lattice algorithm provides better performance than the WMFP-SW algorithm in both cases.

(Key Words: Data Stream, Itemset, Sliding Window Model, Lattice, Weighted Maximal Frequent Itemset)

WEIGHT-ORDER-BASED LATTICE ALGORITHM FOR MINING MAXIMAL WEIGHTED FREQUENT PATTERNS OVER A DATA STREAM SLIDING WINDOW

A Thesis
Submitted to the Faculty of
National Sun Yat-sen University
by
Tzung-Je Chiou
In Partial Fulfillment of the Requirements
for the Degree of
Master of Science
June 2015

TABLE OF CONTENTS

1  Introduction
   1.1  Data Stream Mining
        1.1.1  Window Models in Data Streams
   1.2  Mining Weighted Maximal Frequent Itemsets
   1.3  Applications
   1.4  Related Work
   1.5  Motivation
   1.6  Organization of the Thesis
2  Survey of Algorithms for Mining Weighted Frequent Patterns
   2.1  The MWFIM Algorithm
   2.2  The WMFP-SW Algorithm
        2.2.1  Data Structures for WMFP-SW
        2.2.2  Pruning Patterns by MaxW
        2.2.3  Pruning Patterns in Single-Path
   2.3  The MWS Algorithm
   2.4  The Set-Checking Algorithm
        2.4.1  Data Structure
        2.4.2  The Set-Checking Algorithm
3  Weight-Order-Based Lattice Algorithm
   3.1  Data Structure
   3.2  Our Proposed Algorithm
        3.2.1  Data Initialization
        3.2.2  The Pruning Strategy
        3.2.3  Data Deletion
        3.2.4  Data Insertion
        3.2.5  Comparison
4  Performance
   4.1  The Performance Model
   4.2  Experiment Results
        4.2.1  Real Data
        4.2.2  Simulation Data
5  Conclusion
   5.1  Summary
   5.2  Future Work
BIBLIOGRAPHY

LIST OF FIGURES

1.1   Landmark window model
1.2   Tilted-time window model
1.3   Sliding-window model
1.4   Weight of items in Transaction Database TDB1
1.5   Transaction Database TDB1
1.6   The relationship of the subset X and superset Y in the Apriori algorithm
1.7   The relationship of the subset X and superset Y in weighted frequent pattern mining
1.8   An example EX1: (a) frequent patterns; (b) weighted frequent patterns
1.9   WMFP-SW-tree [4]
1.10  WOB-Lattice tree
2.1   An example of a prefix tree containing items according to the descending order of weights
2.2   An example of a vertical bitmap representation
2.3   An example of Transaction Database TDB2 in the sliding-window model [4]
2.4   WMFP-SW-tree for sliding window w1: (a) before the reconstructing operation; (b) after the reconstructing operation [4]
2.5   WMFP-SW-tree for sliding window w2: (a) Deletion: TID 1 (B,C,E) and TID 2 (D); (b) Insertion: TID 5 (D,F) and TID 6 (D,F,G); (c) after the reconstructing operation again [4]
2.6   Weight of the items in Transaction Database TDB2 [4]
2.7   WMFP-array: (a) the global WMFP-SW-array; (b) G's conditional WMFP-SW-array [4]
2.8   Single-path of the E prefix tree: (a) the path of E; (b) every pattern in the E prefix tree [2]
2.9   Case 1: Equivalent [9]
2.10  Case 2: Subset [9]
2.11  Case 3: Superset [9]
2.12  Case 4-1: The intersected node is not in the lattice [9]
2.13  Case 4-2: The intersected node is in the lattice [9]
2.14  Case 5: Empty set [9]
3.1   Transaction Database TD without sorting
3.2   Sorting the items by weight order: (a) before sorting; (b) after sorting
3.3   The lattice of Window W1
3.4   The flowchart of the algorithm
3.5   The Bitmap of TID:2 in Transaction Database TD
3.6   Procedure GMAXW pruning
3.7   Global MAX weight (GMAX-W) pruning strategy
3.8   Local MAX weight (LMAX-W) pruning strategy
3.9   Procedure LMAXW pruning
3.10  Procedure CheckWMFP
3.11  An example of the WMFP-table
3.12  Procedure Delete lattice
3.13  The deletion: (a) WMFP-table; (b) lattice
3.14  Procedure Decrease Max
3.15  Data insertion: (a) WMFP-table; (b) lattice
3.16  Procedure Insert T
3.17  Procedure Increase Max
3.18  Procedure Delete WMFP
3.19  Function Find T
3.20  Procedure CheckCase
3.21  The table of Window 2
3.22  Checking the WMFP-table
3.23  An example of Transaction Database TDB2 in the Sliding-Window model
3.24  WMFP-SW-tree for Window W1: (a) before the reconstructing operation; (b) after the reconstructing operation
3.25  WMFP-SW-tree for Window w2: (a) Deletion: TID 1 (B,C,E) and TID 2 (D); (b) Insertion: TID 5 (D,F) and TID 6 (D,F,G); (c) after the reconstructing operation again
3.26  Lattice from TID 1 to TID 4
3.27  Lattice from TID 3 to TID 4
3.28  Lattice from TID 3 to TID 6
3.29  Conditional tree of pattern {D}
3.30  Conditional tree of pattern {AD}
3.31  Conditional tree of pattern {CD}
4.1   A comparison of the processing time of the Mushroom dataset under the change of the threshold
4.2   A comparison of the processing time of the T5MTN2Dk dataset under the change of the threshold
4.3   A comparison of the processing time of the T(5-8)MTN2Dk dataset under the average size of the transactions
4.4   A comparison of the processing time of the TMT2N(2-5)Dk dataset under the change of the number of items

LIST OF TABLES

3.1  Variables
3.2  An example of the data stream mapped to the bit-pattern
4.1  Parameters used in the experiments
4.2  Details of the real datasets used in the experiments
4.3  Details of the simulation datasets used in the experiments
4.4  A comparison of the processing time for the Retail dataset under the change of the threshold
4.5  The processing time of the Mushroom dataset under the change of the threshold
4.6  The processing time of the T5MTN2Dk dataset under the change of the threshold
4.7  The processing time of the T(5-8)MTN2Dk dataset under the average size of the transactions
4.8  The processing time of the TMT2N(2-5)Dk dataset under the change of the number of items

CHAPTER I
Introduction

Data mining is the process of finding hidden and useful knowledge from large databases. Moreover, data mining has been used in many areas, including geographic [5], network [], and traffic data [6]. Recently, finding frequent patterns has become one of the most popular research topics in data mining. The most important process in frequent pattern mining is mining association rules []. By using association rules, the computer can discover new patterns from datasets. However, items have different importance in the real world. Therefore, we have to consider the importance and the count of the items at the same time. For this reason, the algorithms for mining association rules cannot be followed in weighted environments, and the Apriori algorithm [2] also cannot be used in weighted frequent pattern mining. On the other hand, frequent pattern mining has two variations, maximal frequent pattern mining [6] and closed frequent pattern mining [7, ]. Moreover, weighted frequent pattern mining has two variations, weighted maximal frequent pattern mining [8, 9] and weighted closed frequent pattern mining [2]. Besides, the concept of a data stream, which may have infinite transactions, is more appropriate than a data set for many recent applications, for example, sensor network data analysis and web click streams. By nature, a stored data set is appropriate when significant portions of the data are queried again and again, and updates are small or relatively infrequent. In contrast, a data stream is appropriate when the data is changing constantly, and it is either unnecessary or impractical to operate on large portions of the data many times.

1.1 Data Stream Mining

Data streams are continuous, unbounded, and real-time, usually coming at high speed and having a data distribution that is often time-sensitive. Hence, such data is also called streaming data. Besides, the volume of a data stream is large and usually open-ended [3, 8, 9,, 4, 5, 6, 9, 2].

Mining data streams has many new challenges. First, it is unrealistic to keep the entire stream in main memory, or even in a secondary storage area, since a data stream comes continuously and the amount of data is unbounded. Second, traditional algorithms that mine stored datasets with multiple scans are infeasible, since streaming data is passed only once. Third, mining streams requires fast and real-time processing in order to keep up with the high data arrival rate, and mining results are expected to be available within a short response time. In addition, the combinatorial explosion of itemsets exacerbates mining frequent itemsets over streams in terms of both memory consumption and processing efficiency [3, 8, 9,, 4, 5, 6, 9, 2].

1.1.1 Window Models in Data Streams

In the research on mining frequent patterns from streaming data, the models of data streams can be divided into three types: the Landmark window, the Tilted-time window, and the Sliding window [4, 5].

1. In the Landmark window model, as shown in Figure 1.1, the time range of mining frequent patterns starts from the beginning time to the present.

2. In the Tilted-time window model, as shown in Figure 1.2, the time range of mining frequent patterns still starts from the beginning time to the present. But the Tilted-time window does not treat all the data equally. The time period is divided into several time slots, where the recent time slots are kept at a finer granularity and the older time slots at a coarser granularity.

3. In the Sliding window model, as shown in Figure 1.3, the knowledge discovery is performed over a fixed number of recently generated data elements, which are the target of data mining.

[Figure 1.1 Landmark window model]
[Figure 1.2 Tilted-time window model]
[Figure 1.3 Sliding-window model]
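Of the three models, the sliding window is the one adopted throughout this thesis. A minimal sketch (the helper name `slide` is ours, not from the thesis) of keeping only the most recent transactions in a fixed-size window:

```python
from collections import deque

def slide(window, new_tx, size):
    """Maintain a fixed-size sliding window of transactions.

    When a new transaction arrives and the window is full,
    the oldest transaction is evicted before the new one enters.
    """
    if len(window) == size:
        window.popleft()       # delete the oldest transaction
    window.append(new_tx)      # insert the newest transaction
    return window

# Window size 4, fed with the six transactions of the running
# example TDB2 used later in Chapter 2.
w = deque()
for tid, tx in enumerate([{'B', 'C', 'E'}, {'D'}, {'A', 'B'},
                          {'A', 'F'}, {'D', 'F'}, {'D', 'F', 'G'}],
                         start=1):
    slide(w, (tid, tx), 4)

print([tid for tid, _ in w])  # -> [3, 4, 5, 6]
```

Note that the thesis's windows advance by batches of transactions (W1 = TID 1-4, W2 = TID 3-6); the per-transaction version above reaches the same final window after TID 6.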

1.2 Mining Weighted Maximal Frequent Itemsets

Weighted maximal frequent pattern mining is one of the most important research topics in data mining. In traditional frequent pattern mining, the support of a frequent itemset XS = {X1, X2, ..., Xn}, denoted by sup(XS), always satisfies sup(XS) ≥ minsup × |TDB|, where minsup is a user-specified minimum support threshold in the range of [0, 1] and |TDB| is the size of database TDB. If XS is a maximal frequent pattern, there is no frequent pattern which is a superset of XS. But in weighted frequent pattern mining, the minimum support minsup may not be in the range of [0, 1]. Here, we give a weighted frequent pattern PS = {P1, P2, ..., Pn}, where Pi is the i-th item in PS. Moreover, let minsup = 1.8, and let the weight of the pattern PS be defined as WP(PS), i.e., WP(PS) = (WP(P1) + WP(P2) + ... + WP(Pn)) / n, where Pi ∈ PS. Pattern PS has to satisfy WSup(PS) ≥ minsup, where WSup(PS) = count(PS) × WP(PS), and minsup can be more than 1. For instance, given a pattern {B, C}, the weight of {B, C} is (0.8 + 0.4)/2 = 0.6, where the weights of the items are shown in Figure 1.4. Since the count of {B, C} is 3, as shown in Figure 1.5, WSup({B, C}) is 0.6 × 3 = 1.8. Because WSup({B, C}) is not less than minsup, {B, C} is a weighted frequent pattern. In Figure 1.5, pattern {B, C} is also a weighted maximal frequent pattern, because there is no weighted frequent pattern which is a superset of it. In the Apriori algorithm, we combine the itemsets in Lk to find Lk+1. The relationship of the subset X and the superset Y in the Apriori algorithm is shown in Figure 1.6, and the relationship of the subset X and the superset Y in weighted frequent pattern mining is shown in Figure 1.7. For instance, in Figure 1.8-(a), we use L1, the frequent 1-itemsets {A}, {B}, and {C}, to find C2, the candidate 2-itemsets {A, B}, {A, C}, and {B, C}. Then, we check each itemset in C2 to decide whether its support is larger than the threshold. Those itemsets become L2, the frequent 2-itemsets {A, B}, {A, C}, and {B, C}. But,
in Figure 1.8-(b), itemset {B, C}, which is a weighted frequent pattern, has a subset {C} that is not weighted frequent, because the items {B} and {C} have different weights. When an infrequent itemset is added to a frequent

weighted itemset, the new itemset may become a frequent itemset. In this case, the weighted frequent patterns are {A}, {B}, and {B, C}. A weighted maximal frequent pattern is a pattern which does not have any weighted frequent super-pattern. So, the weighted maximal frequent patterns in this case are {A} and {B, C}.

Item | Weight
A    | 0.6
B    | 0.8
C    | 0.4

Figure 1.4 Weight of items in Transaction Database TDB1

TID | Transaction
1   | A, C
2   | B, C
3   | A, B, C
4   | A, B
5   | B, C

Figure 1.5 Transaction Database TDB1
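These numbers can be checked mechanically. The sketch below (plain Python; the helper names are ours, and the weights and transactions are those of the running example) enumerates every itemset of TDB1, computes its weighted support, and confirms that the weighted frequent patterns are {A}, {B}, and {B,C}, with {A} and {B,C} maximal, even though the subset {C} of {B,C} is not weighted frequent:

```python
from itertools import combinations

# Weights and transactions of TDB1 (Figures 1.4 and 1.5).
weights = {'A': 0.6, 'B': 0.8, 'C': 0.4}
tdb1 = [{'A', 'C'}, {'B', 'C'}, {'A', 'B', 'C'}, {'A', 'B'}, {'B', 'C'}]
minsup = 1.8

def wsup(pattern):
    """WSup(P) = (average item weight of P) * count(P)."""
    avg_w = sum(weights[i] for i in pattern) / len(pattern)
    count = sum(1 for t in tdb1 if set(pattern) <= t)
    return avg_w * count

# Enumerate every itemset; a small epsilon absorbs float rounding.
wfps = [frozenset(p) for n in (1, 2, 3)
        for p in combinations(sorted(weights), n)
        if wsup(p) >= minsup - 1e-9]
# {C} is not weighted frequent (WSup = 1.6) although {B,C} is (1.8),
# so Apriori-style subset pruning cannot be applied here.
maximal = [p for p in wfps if not any(p < q for q in wfps)]

print(sorted(sorted(p) for p in wfps))     # [['A'], ['B'], ['B', 'C']]
print(sorted(sorted(p) for p in maximal))  # [['A'], ['B', 'C']]
```

The epsilon matters: with exact arithmetic WSup({A}) = 0.6 × 3 = 1.8 meets the threshold, but naive float comparison could reject it.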

[Figure 1.6 The relationship of the subset X and superset Y in the Apriori algorithm]

[Figure 1.7 The relationship of the subset X and superset Y in weighted frequent pattern mining]

(a) {itemset}:count — {A,B,C}:1; {A,B}:2; {A,C}:2; {B,C}:3; {A}:3; {B}:4; {C}:4; Min_Sup: 2. The marked patterns are those whose support is larger than or equal to the threshold.

(b) {itemset}:WSup — {A,B,C}:0.6; {A,B}:1.4; {A,C}:1.0; {B,C}:1.8; {A}:1.8; {B}:3.2; {C}:1.6; Min_Sup: 1.8. The marked patterns are those whose weighted support is larger than or equal to the threshold.

Figure 1.8 An example EX1: (a) frequent patterns; (b) weighted frequent patterns

1.3 Applications

In the real world, item A may not be as important as item B. For instance, merchandises in the market, like an iPhone and a telephone, have different prices and costs, which give them different benefits. So, finding the most valuable itemsets is an important task in market analysis. Because of this reason,

weighted frequent pattern mining [3, 4] and utility frequent pattern mining [5] have been proposed. Besides, weighted frequent pattern mining can be widely used not only in market analysis, but also in web click stream analysis, and even in biological gene database analysis. Recently, the database and data mining communities have focused on a new data model, where data arrives in the form of continuous streams. It is often referred to as data streams or streaming data. Many applications generate large amounts of data streams in real time, such as sensor data generated from sensor networks, transaction flows in retail chains, Web records and click streams in Web applications, performance measurements in network monitoring and traffic management, call records in telecommunications, and so on [, 6]. In the data stream model, individual data items may be relational tuples, e.g., network measurements, call records, web page visits, sensor readings, and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable and unbounded streams appears to yield some fundamentally new research problems [3, 8, 9,, 4, 5, 6, 9, 2]. The work reported here could be viewed as a step towards enhancing data streams with functionalities to process queries as follows:

1. Sensor network data analysis: In a sensor network environment, stream data comes from multiple remote sources. Such an environment imposes excessive communication overhead and wastes computational resources when data is dynamic. In this situation, how to minimize the communication cost, how to combine frequency counts from multiple nodes, and how to mine data streams in parallel and update the associated information incrementally are additional issues which we need to consider.

2. Web click streams: In web mining, data can be collected at the server side, the client side, proxy servers, or obtained from the database of an organization which contains business data or consolidated web data. Each type of data collection differs
not only in terms of the location of the data source, but also in the kinds of data available, the segment of the population from which the data was collected, and its method of implementation.

3. Network monitoring and traffic management: Analyzing sequential patterns can discover which (time-based) sequences of audit events frequently occur together. These frequent event patterns provide guidelines for incorporating temporal statistical measures into intrusion detection models.

1.4 Related Work

In [2], Agrawal et al. proposed the Apriori algorithm for mining association rules. In the Apriori algorithm, all candidate itemsets are generated from the result of the previous iteration. The final results are only a few among all the generated candidate itemsets. However, to generate each candidate itemset, it needs to count its frequency over all transactions. Each iteration requires only one scan over the database in the Apriori algorithm. It does not have missing cases, but it takes a lot of time to find the frequent patterns.

In [3], Han et al. proposed a compact tree structure, called the Frequent-Pattern tree, or FP-tree for short. There have been many mining algorithms using the FP-tree in the mining part. Since the FP-tree does not generate candidate itemsets, it does not need to scan the database so many times. The FP-tree is a prefix-tree structure storing crucial, quantitative information about frequent patterns. To ensure the FP-tree is compact, it only stores length-1 items in the tree, and the way those nodes are stored depends on the frequency of each item. The FP-tree-based mining starts from finding each length-1 pattern, examines only its conditional pattern base, then constructs its FP-tree and performs the mining recursively. It does not have missing cases, but it takes a lot of time to construct the conditional trees.

In [22], Yun et al. proposed the Maximal Weighted Frequent Itemset Mining (MWFIM) algorithm for mining maximal weighted frequent patterns. The MWFIM algorithm represents each transaction as a vertical bitmap and uses the AND operation to find the count of itemsets. The structure of the MWFIM algorithm is the weight-descending-order prefix tree. They use
the weight-descending-order prefix tree and depth-first traversal to find the maximal weighted frequent patterns. If a node

pattern in the weight-descending-order prefix tree is not a weighted frequent pattern, then its child nodes in the sub-tree are not weighted frequent patterns either. Finally, they examine those patterns to find the maximal weighted frequent patterns. This approach can find the maximal weighted frequent patterns easily, but it takes a lot of time when the number of transactions is large.

In [4], Lee et al. proposed the Weighted Maximal Frequent Pattern mining over data streams based on the Sliding Window model (WMFP-SW) algorithm for mining maximal weighted frequent patterns in data streams within a transaction sliding window. The WMFP-SW algorithm uses the WMFP-SW-tree to store the transactions in the present window and reconstructs the tree in item-count-descending order. While the window slides, this algorithm removes the old transactions, inserts the new transactions, and reconstructs the tree again. To decrease the number of candidates, they find MaxW, the largest weight of any item in the itemset, and calculate minsup/MaxW. If the count of the itemset is smaller than minsup/MaxW, it is not a weighted frequent pattern. Finally, they traverse the single-path to find the maximal weighted frequent patterns. This algorithm can easily find weighted maximal frequent patterns based on the sliding window model.

In [2], Yun et al. proposed the maximal frequent pattern mining with weight conditions over data streams (MWS) algorithm for mining maximal weighted frequent patterns in the landmark window. The MWS algorithm is similar to the WMFP-SW algorithm, but it does not need to remove the old transactions. Besides, it shows that the number of nodes under the count-descending order is smaller than the number of nodes under the weight-descending order in its tree structure, which was never compared in [4].

1.5 Motivation

The discovery of maximal frequent itemsets is an important problem in the area of data mining. A weighted maximal frequent pattern is an itemset which is not a subset of any other weighted frequent pattern and whose weighted
support is larger than or equal to the threshold. As a result, the weighted maximal itemsets are

more compact than the complete set of weighted frequent patterns. Previous algorithms for mining maximal frequent itemsets in data streams have been based on many different models. The WMFP-SW algorithm [4] proposed by Lee et al. is an algorithm for mining weighted maximal frequent patterns based on the sliding window model. In their structure, as shown in Figure 1.9, the WMFP-SW algorithm uses a compact structure to mine the weighted maximal frequent (WMFP) patterns. In the WMFP-SW algorithm, they use the WMFP-SW-tree and the WMFP-SW-array to mine WMFP patterns, and they use MaxW pruning and single-path pruning to find the maximal weighted frequent patterns. However, the WMFP-SW algorithm takes a long time to reconstruct the WMFP-SW-tree while the window slides. Because of this, if the size of the transactions is large, the WMFP-SW algorithm has to spend a lot of time reconstructing the tree. Moreover, the WMFP-SW algorithm has missing cases when it prunes patterns by the single-path. Therefore, in this thesis, we propose the Weighted-Order-Based Lattice (WOB-Lattice) algorithm to efficiently find the result without reconstructing the whole data structure. In our data structure, as shown in Figure 1.10, we use a lattice to incrementally mine the weighted frequent patterns. We check the relation between the incoming transaction and the old transactions when the new transaction comes. There are five relations which are considered in our algorithm: (1) equivalent, (2) subset, (3) intersection, (4) empty set, and (5) superset. Because of the relations between the set and the subset, there are two advantages. First, when the support increases, the WOB structure remains the same. Second, we will not lose any result. From the simulation results, we show that our proposed WOB-Lattice algorithm provides better performance than the WMFP-SW algorithm. Because our lattice structure can decrease the time of reconstructing the tree, our WOB-Lattice algorithm can
decrease the processing time.
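The five set relations used when a new transaction arrives can be sketched with plain set operations (an illustrative sketch with our own function name; the thesis's actual procedures are given in Chapter 3):

```python
def relation(new_tx, node):
    """Classify an incoming transaction against a lattice node's itemset
    into the five cases considered when the window slides."""
    new_tx, node = set(new_tx), set(node)
    if new_tx == node:
        return 'equivalent'    # same itemset: only the count changes
    if new_tx < node:
        return 'subset'        # proper subset of the stored itemset
    if node < new_tx:
        return 'superset'      # proper superset of the stored itemset
    if new_tx & node:
        return 'intersection'  # shares some, but not all, items
    return 'empty set'         # shares no items at all

print(relation({'A', 'B'}, {'A', 'B'}))       # equivalent
print(relation({'A'}, {'A', 'B'}))            # subset
print(relation({'A', 'B', 'C'}, {'A', 'B'}))  # superset
print(relation({'A', 'F'}, {'A', 'B'}))       # intersection
print(relation({'D'}, {'A', 'B'}))            # empty set
```

The equivalent case explains the first advantage above: when an identical transaction arrives, only a counter is incremented and the lattice structure itself is untouched.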

[Figure 1.9 WMFP-SW-tree [4]: the header table (Item, Count) and the prefix tree for the items in the current window]

[Figure 1.10 WOB-Lattice tree: for TID 1 (B,C,E), TID 2 (D), TID 3 (A,B), and TID 4 (A,F), the lattice under the root contains the nodes (B,C,E):1 Tid(1); D:1 Tid(2); (A,B):1 Tid(3); (A,F):1 Tid(4); B:2 Tid(1,3); and A:2 Tid(3,4)]

1.6 Organization of the Thesis

The rest of this thesis is organized as follows. Chapter 2 gives a survey of some algorithms for problems related to mining association rules, including mining weighted frequent patterns. Chapter 3 presents the proposed Weighted-Order-Based Lattice algorithm. Chapter 4 presents the performance study of the Weighted-Order-Based Lattice algorithm. Finally, Chapter 5 gives the conclusion.

CHAPTER II
Survey of Algorithms for Mining Weighted Frequent Patterns

Weighted frequent pattern mining is a very important problem in real databases. In this chapter, we give a survey of some well-known algorithms for finding weighted maximal frequent patterns among objects in databases, including the MWFIM algorithm [22], the WMFP-SW algorithm [4], and the MWS algorithm [2]. On the other hand, it is hard to extract all frequent patterns in data streams, because data streams keep changing and large numbers of new elements are added. To solve this problem, maximal frequent patterns can be used, as in the Set-Checking algorithm [9]. In this chapter, we first introduce the MWFIM algorithm, the WMFP-SW algorithm, and the MWS algorithm. Next, we introduce the Set-Checking algorithm.

2.1 The MWFIM Algorithm

A weighted frequent pattern means that the weighted support of the pattern WP = {P1, P2, ..., Pn} must be larger than the threshold, where Pi is the i-th item in WP. Moreover, the weighted support WSup is defined as the average weight of the pattern multiplied by the count of the pattern, as shown in the following formula:

WSup(WP) = ((P1 + P2 + ... + Pn) / length(WP)) × count(WP)

If a weighted frequent pattern has no weighted frequent superset, it is called a Maximal Weighted Frequent Pattern (MWFP). However, one of the difficulties of mining maximal weighted frequent patterns is that some subsets of an MWFP might not be weighted frequent.
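Since the weighted support is the average item weight times the count, and the average weight can never exceed the largest item weight MaxW, a lower bound on the count follows; the WMFP-SW algorithm (Section 2.2.2) uses exactly this MaxW bound to prune candidates. A sketch (hypothetical function name, not from the thesis):

```python
def can_prune(count, max_w, minsup):
    """MaxW bound: WSup = avgW * count <= max_w * count, so a pattern
    whose count is below minsup / max_w can never be weighted
    frequent and may be discarded as a candidate."""
    return count < minsup / max_w

# With minsup = 1.8 and MaxW = 0.8 (the largest item weight in the
# running example), a candidate needs a count of at least
# 1.8 / 0.8 = 2.25, i.e. a count of 3 or more.
print(can_prune(2, 0.8, 1.8))  # True:  pruned
print(can_prune(3, 0.8, 1.8))  # False: kept as a candidate
```

The bound is safe but not tight: a pattern that survives it may still fail the exact WSup test once its true average weight is used.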

The MWFIM algorithm [22] uses a search strategy for MWFP mining according to the weighted descending order. Figure 2.1 shows items in weighted descending order and their prefix tree. The weight of itemset {B} is equal to or larger than the weights of itemset {B, A} and itemset {B, C}. When a new item is added to a pattern, the weight of the new item is always equal to or smaller than the weight of the last item in the pattern, since all items have been sorted in weighted descending order. For this reason, we can be sure that the weight of every parent node is equal to or larger than the weight of its child nodes in the prefix tree. Besides, the count of a subset is always equal to or larger than the count of a superset. So, the WSup of every parent node is equal to or larger than the WSup of its child nodes in the prefix tree. If the WSup of a parent node is smaller than MinSup, the WSup of any one of its child nodes is also smaller than MinSup.

Figure 2.1: An example of a prefix tree containing items according to the descending order of weights. (root; {B}, {A}, {C}; {B, A}, {B, C}, {A, C}.)

The MWFIM algorithm also uses the vertical bitmap representation to find patterns in transactions. Figure 2.2 shows that the bitmap of itemset {A, B} can be generated easily by performing an AND-operation on all of the bits in the bitmaps for itemset {A} and itemset {B}. Then, we can easily count the occurrences of itemset {A, B}. Basically, the bitmap representation is ideal for both steps of candidate pattern generation and support counting.
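The vertical bitmap idea can be written down in a few lines; the item bitmaps below are hypothetical, with the leftmost bit standing for TID 1.

```python
# Sketch of the vertical bitmap representation: one bit per transaction
# for each item. ANDing two bitmaps gives the candidate itemset's bitmap,
# and counting the set bits gives its support. Bitmaps are illustrative.

bitmaps = {
    "A": 0b10110,  # A occurs in TIDs 1, 3, 4 (leftmost bit = TID 1)
    "B": 0b11010,  # B occurs in TIDs 1, 2, 4
}

candidate = bitmaps["A"] & bitmaps["B"]  # bitmap of itemset {A, B}
support = bin(candidate).count("1")      # popcount = support of {A, B}
print(support)                           # {A, B} occurs in TIDs 1 and 4
```

Candidate generation and support counting thus both reduce to cheap bitwise operations.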

Figure 2.2: An example of a vertical bitmap representation over five transactions.

2.2 The WMFP-SW Algorithm

Ryu et al. propose the WMFP-SW algorithm [4] to mine weighted maximal frequent patterns (WMFPs) over a sliding window. Their WMFP-SW algorithm performs mining operations based on a tree structure; accordingly, they need tree structures suitable for finding WMFPs over sliding-window-based data streams.

Figure 2.3: An example of Transaction Database TDB2 in the sliding-window model [4].

TID | Transaction
1 | A, B, C, E
2 | A, D
3 | A, B
4 | A, F
5 | D, F
6 | D, F, G

(Window w1 covers TIDs 1-4, and window w2 covers TIDs 3-6.)

2.2.1 Data Structures for WMFP-SW

Their WMFP-SW algorithm [4] performs mining operations based on the WMFP-SW-tree and the WMFP-SW-array. The tree structure is similar to the FP-tree [2], but the WMFP-SW-tree has additional weight data and is constructed with

only one scan. For example, Figure 2.3 is a sliding-window data stream, and sliding window w1 has been scanned as shown in Figure 2.4-(a). Then, the tree has to be reconstructed in descending count order, as shown in Figure 2.4-(b). As the window slides, a part of the tree is removed as shown in Figure 2.5-(a), the new data items are added into the tree in Figure 2.5-(b), and then the tree reconstructing operation is performed as shown in Figure 2.5-(c).

Figure 2.4: The WMFP-SW-tree for sliding window w1: (a) before the reconstructing operation; (b) after the reconstructing operation [4].

Figure 2.5: The WMFP-SW-tree for sliding window w2: (a) Deletion: TID 1 (A,B,C,E) and TID 2 (A,D); (b) Insertion: TID 5 (D,F) and TID 6 (D,F,G); (c) after the reconstructing operation again [4].

Figure 2.6: The weights of the items in Transaction Database TDB2 [4].

Item | Weight
A | 0.5
B | 0.7
C | 0.8
D | 1.0
E | 0.4
F | 0.9
G | 0.6

The WMFP-SW-array stores a part of the node data of the WMFP-SW-tree. Figure 2.7-(a) is the WMFP-SW-array of Figure 2.5-(c). The number in entry [F, D] is 2, which means that the count of item D under item F in Figure 2.5-(c) is 2. Using conditional trees to find WMFPs requires scanning the WMFP-SW-tree many times; that is why they use the WMFP-SW-array. In data streams, mining patterns with the conditional WMFP-SW-array yields better performance. In Figure 2.7-(b), we can see that item G always appears in the same transactions as item F and item D.

Figure 2.7: The WMFP-SW-array [4]: (a) the global WMFP-SW-array; (b) G's conditional WMFP-SW-array.

2.2.2 Pruning Patterns by MaxW

Because weighted candidates do not satisfy the anti-monotone property [7] in general, a weighted infrequent pattern could be a subset of a weighted frequent pattern [22]. As a result, incorrect pruning operations based on the weights can cause weighted pattern losses. To solve this problem and perform efficient pruning procedures, they define a pruning condition applied in the sliding window model, named MaxW. MaxW is the largest value among the weights of all items included in the current tree; the weights of the items are shown in Figure 2.6. So the current MaxW of the tree in Figure 2.5-(c) is 1.0. Assume that the minimum support threshold, MinSup, is 2. Item G is weighted infrequent, because its count multiplied by MaxW would be less than MinSup. On the other hand, in the conditional tree of item B,

M axw is smaller than, since item D does not participate in the mining process any longer and the new M axw becomes 7 in the conditional tree 223 Pruning Patterns in Single-Path If any single-path is generated in the process of the mining steps, they would calculate W Sup, the weighted support, for the patterns If the W Sup value is not larger than the threshold, they would find other W M F P s which are the subsets of the single-path For example, Figure 28 shows how to extract W M F P s in a single-path and M insup is 2 In the first step, since W Sup of itemset {, C, D, B, E} is 8 and smaller than M insup So, they start to find the longest subsets of the itemset {, C, D, B, E} In length 4, the W Sup of itemset {, C, D, E} and the W Sup of itemset {, C, B, E} are larger than M insup, and the W Sup of itemset {, C, B, E} and the W Sup of itemset {, D, B, E} are smaller than M insup Thus, itemset {, C, D, E} and itemset {, C, B, E} are W M F P s 2

Figure 2.8: A single path of the E prefix tree: (a) the path of E; (b) every pattern in the E prefix tree [2].

2.3 The MWS Algorithm

The MWS algorithm [2] is also proposed by Ryu et al., and it is similar to the WMFP-SW algorithm. The two algorithms have the same structures and strategies. The difference is that the MWS algorithm is based on the landmark window model, while the WMFP-SW algorithm is based on the sliding window model; therefore, we do not discuss the data structure and the strategies of the MWS algorithm in this section. However, since items are never deleted from the structures of the MWS algorithm, the MWS algorithm is easier to implement. Moreover, their work shows that the descending count order is better than the descending weight order for the tree structure. Although with the descending weight order we would not need to compute weights to find MaxW, it would cause an inefficient and incompact tree to be built. For these reasons, the MWS algorithm and the WMFP-SW algorithm use the descending count order, not the descending weight order.

2.4 The Set-Checking Algorithm

The Set-Checking algorithm is proposed by Chang et al. It uses a lattice structure to store the transactions [9].

2.4.1 Data Structure

The lattice structure [8, 9, 22] has been widely used in weighted data mining, because the lattice structure easily presents the relation between the new transaction and the current transactions, and the supports of the patterns can be updated efficiently. Each node in the lattice structure contains two parts: (1) Bit-pattern, which represents the itemset, and (2) Sup, which represents the support of the itemset. For instance, the bit pattern of itemset {a, c} is denoted as Bit(10100), since there are five items {a, b, c, d, e}.

2.4.2 The Set-Checking Algorithm

The Set-Checking algorithm has three main steps:

1. Transform the itemset to the bit pattern.

2. Check the relation between the new transaction and the old transactions. There are five cases:

Case 1: Equivalent, as shown in Figure 2.9.
Case 2: Subset, as shown in Figure 2.10.
Case 3: Superset, as shown in Figure 2.11.
Case 4: Intersection.
  Case 4-1: The intersected node is not in the lattice, as shown in Figure 2.12.
  Case 4-2: The intersected node is in the lattice, as shown in Figure 2.13.
Case 5: Empty set, as shown in Figure 2.14.

3. Examine which transactions are the maximal frequent itemsets.

Figure 2.9: Case 1: Equivalent [9]. (Inserting {abc} into a lattice that already holds abc: 1, Tid(1) updates the node to abc: 2, Tid(1, 2).)

Figure 2.10: Case 2: Subset [9]. (Inserting {bc} under abc: 1, Tid(1) adds the child node bc: 2, Tid(1, 2).)

Figure 2.11: Case 3: Superset [9]. (Inserting {abcd} above ad: 1, Tid(1) adds abcd: 1, Tid(2) with the child ad: 2, Tid(1, 2).)

Figure 2.12: Case 4-1: The intersected node is not in the lattice [9]. (Inserting {acd} beside abc: 1, Tid(1) creates acd: 1, Tid(2) and a new intersection node ac: 2, Tid(1, 2).)

Figure 2.13: Case 4-2: The intersected node is in the lattice [9]. (Inserting {bd} beside abc: 1, Tid(1) adds bd: 1, Tid(3) and updates the existing intersection node b: 2, Tid(1, 2) to b: 3, Tid(1, 2, 3).)

Figure 2.14: Case 5: Empty set [9]. (Inserting {cd} beside ab: 1, Tid(1) simply adds cd: 1, Tid(2).)
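Once transactions are bit patterns, the five cases can be distinguished with plain bitwise operations; the helper below is our own sketch, not code from [9].

```python
# Sketch of classifying the five Set-Checking cases with bitwise ops.
# Bit(10100) encodes {a, c} over the five items {a, b, c, d, e}.

def classify(new, old):
    inter = new & old
    if inter == 0:
        return "Case 5: empty set"
    if new == old:
        return "Case 1: equivalent"
    if inter == new:
        return "Case 2: subset"      # new transaction lies inside old
    if inter == old:
        return "Case 3: superset"    # new transaction covers old
    return "Case 4: intersection"

print(classify(0b01100, 0b11100))    # {b, c} vs {a, b, c}
print(classify(0b00110, 0b11000))    # {c, d} vs {a, b}
```

Whether Case 4 falls into 4-1 or 4-2 then depends on whether the node for `new & old` already exists in the lattice.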

CHAPTER III

The Weight-Order-Based Lattice Algorithm

In Chapter 1, we mentioned several window models for algorithms on data streams. One of the well-known window models is the sliding window model. The sliding window model can be applied in many applications, since it focuses on the recent data, which is more important and relevant than the old data. For applications that tend to capture the recent data, the sliding window model provides more informative and useful information than the other two models. In this chapter, we present our algorithm.

3.1 Data Structure

In our algorithm, we propose the weight-order-based lattice structure and the bit-pattern representation of items based on the sliding window model. We use Transaction Database TD1, as shown in Figure 3.1, to illustrate our idea. Because we need the weight descending order, we have to sort the items, shown in Figure 3.2-(a), by their weights, as shown in Figure 3.2-(b). This lattice structure has two advantages. First, using the weight-order-based lattice structure, the relationship between the new transaction and the present transactions can be easily understood. Second, we can update the support of a transaction efficiently.

Figure 3.1: Transaction Database TD1 without sorting.

TID | Transaction
1 | B, D, E
2 | C, D
3 | A, B, C, D, E
4 | B
5 | A, B, C, D
6 | D, E

(Window W1 covers TIDs 1-4, and window W2 covers TIDs 3-6.)

Figure 3.2: Sorting the items by weight: (a) before sorting: A: 4, B: 8, C: 3, D: 5, E: 6; (b) after sorting: B: 8, E: 6, D: 5, A: 4, C: 3.

The weight-order-based lattice structure contains a root, nodes, and child-links. An example is shown in Figure 3.3. The root carries no information; it is only a starting point. When a new transaction is inserted, we sort the transaction in weight descending order and search the weight-order-based lattice structure from the root. Each node records the following information:

Bit-Pattern: It represents an itemset.

Sup: It represents the support of the itemset.
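A node of this lattice can be sketched as a small record; the class and field names below are our own, while the recorded information (bit-pattern, support) plus the Tid set and child-links follows the description above and Figure 3.3.

```python
# Minimal sketch of a weight-order-based lattice node. Names are ours;
# the stored fields follow the node description in the text.

class LatticeNode:
    def __init__(self, bit_pattern):
        self.bit_pattern = bit_pattern  # e.g. "11100" for [B, E, D]
        self.sup = 1                    # support starts at 1 on insertion
        self.tids = set()               # transactions containing the itemset
        self.children = []              # child-links point to subset nodes

root = LatticeNode("00000")             # the root carries no information
node = LatticeNode("11100")             # [B, E, D]
node.tids = {1, 3}
node.sup = len(node.tids)
root.children.append(node)
print(node.bit_pattern, node.sup)
```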

The child-link points to a subset node. With the child-links, we can check the relationships between nodes and insert a node into the weight-order-based lattice structure efficiently. Moreover, we can increase the supports of the related nodes easily.

Figure 3.3: The lattice of Window W1. (root; [B, E, D, A, C]: 1, Tid(3); [B, E, D]: 2, Tid(1, 3); [D, C]: 2, Tid(2, 3); [B]: 3, Tid(1, 3, 4); [D]: 3, Tid(1, 2, 3).)

3.2 Our Proposed Algorithm

In this section, we first give some basic definitions and notations, and then discuss how to use these techniques to find and represent the weighted maximal frequent itemsets and organize them in a lattice. Our algorithm has four main steps:

1. Transform the itemset to the bit pattern.

2. Check the relation between the new transaction and the old transactions.

Table 3.1: Variables.

Variable | Meaning
minsup | Threshold
GMAX_W | Global MAX Weight
LMAX_W | Local MAX Weight
Sup | Support of the pattern
WSup | Weighted support of the pattern
WTD | Weighted Transaction Database
Ti | The i-th transaction
ITS | A set of items
ITi | Item i
WTi | Weight of item i

3. Use the pruning strategies to prune the patterns.

4. Examine which transactions are the weighted maximal frequent itemsets.

In the first step, we use the bit-pattern representation to store the transactions. We use the weight descending order to transform a transaction into its bit-pattern representation. For each transaction T in the current window, the bit pattern of transaction T is denoted as Bit(T), and the length of the bit pattern is the number of distinct items in the whole database. If item X is in the transaction and item X is at the i-th place of the weight descending order, the i-th bit of Bit(T) is set to 1; otherwise, it is set to 0. For instance, item D is the third item in Figure 3.2-(b); for this reason, the bit pattern of any transaction which contains item D has a 1 in its third place. In the second step, we check the relation between the new transaction and the old transactions. There are five cases when inserting itemsets into the lattice structure. The flowchart is shown in Figure 3.4. When the current window becomes full, we delete the two oldest transactions and insert two new transactions.
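The first step can be sketched directly from the weights of Figure 3.2; the function name is our own.

```python
# Sketch of step 1: sort the distinct items by descending weight, then
# map a transaction to its bit pattern. Weights follow Figure 3.2.

weights = {"A": 4, "B": 8, "C": 3, "D": 5, "E": 6}
order = sorted(weights, key=weights.get, reverse=True)  # [B, E, D, A, C]

def bit_pattern(transaction):
    return "".join("1" if item in transaction else "0" for item in order)

print(bit_pattern({"C", "D"}))  # TID 2 of TD1
```

For TID 2 = {C, D}, item D is third and item C is fifth in the weight order, so the sketch yields the bit pattern 00101.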

Figure 3.4: The flowchart of the algorithm. (The new transaction NewT is compared with each old transaction OldT and handled by Cases 1-5; supports and Tid sets are updated, new intersection nodes are added to the lattice, and the procedure CheckWMFP tests WSup(ITS) >= minsup to maintain the WMFP table.)

3.2.1 Data Initialization

In mining the maximal frequent itemsets from a data stream, for each new transaction we must discover its relations (equivalent, superset, subset, intersection, empty) with the present transactions. In order to check these relations efficiently, we use the bit pattern to represent the itemset. The maximal length of the bit pattern is the number of distinct items. An example of a transaction mapped to its bit pattern is shown in Figure 3.5.

Figure 3.5: The bitmap of TID 2 in Transaction Database TD1. (Items B, E, D, A, C; bitmap 00101.)

When an item appears in the transaction, we set the related bit to 1 according to the weight descending order. In Figure 3.1, Tid 2 is {C, D}, which becomes [D, C] in the weight-order-based lattice structure. We set the third and the fifth positions to 1, so the bit pattern is denoted as [00101]. The bit patterns of Window W1 are shown in Table 3.2. For every new transaction being inserted, we set the support to 1. After checking the new transaction against each present transaction, the supports may be increased and new itemsets may be created. In the next section, we introduce this process in detail.

3.2.2 The Pruning Strategy

In this section, we first build an Appearing Table AppearT to store the counts of the patterns in the lattice. After we build the Appearing Table AppearT, we use

Table 3.2: An example of the data stream mapped to the bit patterns.

Tid | Itemset | Bit-Pattern
1 | B, E, D | 11100
2 | D, C | 00101
3 | B, E, D, A, C | 11111
4 | B | 10000

two pruning strategies: the global maximal weight (GMAX_W) pruning strategy and the local maximal weight (LMAX_W) pruning strategy. Note that GMAX_W means the largest weight among the items of all of the transactions, and LMAX_W means the largest weight among the items of a certain transaction. In the global pruning strategy, we make use of GMAX_W, the maximum weight of an item X among the items appearing in all of the transactions. For any pattern Y, the average weight of pattern Y, Weight(Y), must be smaller than or equal to GMAX_W. Let Count(Y) denote the count of pattern Y. Therefore, if Count(Y) * GMAX_W is smaller than the threshold, pattern Y cannot be in the result; that is, pattern Y can be pruned. In other words, if Count(Y) < threshold / GMAX_W, pattern Y can be pruned. In our algorithm, Count(Y) = I, where P(I), the I-th entry of the Appearing Pattern table, records every pattern Y with count = I. Therefore, we prune the patterns stored in P(I) if I is smaller than threshold / GMAX_W. Note that I is an integer, while threshold / GMAX_W is a real number. The procedure of the global maximal weight pruning strategy is shown in Figure 3.6. In order to check whether a subset X of a pattern Y is frequent or not, we must record the count of subset X. Here, similar to the concept of the closed set, we only record a subset X whose count is different from that of pattern Y in the Appearing Table AppearT. Note that a closed set is a set whose supersets are not frequent or whose count is larger than the counts of its supersets. Moreover,

1: Procedure GMAX_W_pruning(AppearT, min_support, T); /* T is a transaction */
2: begin
3:   foreach (transaction T in AppearT whose count < δ / GMAX_W) do
4:   begin
5:     delete transaction T from AppearT;
6:   end;
7: end;

Figure 3.6: Procedure GMAX_W_pruning.

P(I) records the patterns with count = I in the current window. Basically, only a subset X whose count is larger than that of pattern Y will be recorded in AppearT. For instance, in Figure 3.7, the threshold, which we denote as δ, is 13. GMAX_W is set to 8, since the maximal-weight item is B. So, we delete every pattern whose count is less than 13/8. In this case, we delete the pattern [B, E, D, A, C], because the count of [B, E, D, A, C] is 1, which is less than δ / GMAX_W.

Figure 3.7: The global MAX weight (GMAX_W) pruning strategy. (P(1): [B, E, D, A, C], pruned; P(2): [B, E, D], [D, C]; P(3): [B], [D]; P(4): ø.)

In the local pruning strategy, since the items of a pattern PX (a list) are recorded according to the descending order of weights, the first item X in a pattern PX must be the one with the maximum weight among the items in PX. We use LMAX_W to record the weight of this item X. Therefore, for any other item Y in the same pattern PX, the weight of item Y must be smaller than or equal to LMAX_W. That is, the average weight of pattern PX must be smaller than or equal to LMAX_W. In the

1: Procedure LMAX_W_pruning(AppearT, min_support, T); /* T is a transaction */
2: begin
3:   foreach (transaction T in AppearT whose count < δ / LMAX_W) do
4:   begin
5:     delete transaction T from AppearT;
6:   end;
7: end;

Figure 3.9: Procedure LMAX_W_pruning.

same way, if Count(PX) * LMAX_W is smaller than or equal to the threshold, pattern PX can be pruned. Note that Count(PX) = I, where P(I) records pattern PX, which appears I times in the transactions within the current window. In Figure 3.8, LMAX_W is set to the weight of the first item of each pattern, since the patterns have been sorted. For instance, the LMAX_W of [D, C] is set to 5, since the maximal-weight item in the pattern [D, C] is D. In the local maximal weight pruning strategy, [D, C] is deleted, because multiplying 2 (the count of [D, C]) by 5 (the LMAX_W of [D, C]) gives 10, which is less than δ. The procedure of the local maximal weight pruning strategy is shown in Figure 3.9.

Figure 3.8: The local MAX weight (LMAX_W) pruning strategy. (P(1): ø; P(2): [B, E, D] and [D, C], the latter pruned; P(3): [B], [D]; P(4): ø.)

To find the result efficiently, we avoid checking too many subsets X of a pattern Y. Even though a subset X of pattern Y is not frequent, pattern Y may still be frequent. Note that Count(Y) is still smaller than or equal to Count(X); however, there is no relationship between Weight(Y) and Weight(X), and therefore no relationship between WSup(Y) and WSup(X). That is, we cannot give up checking

the subsets of pattern Y, even if pattern Y is not frequent. However, for checking the subsets of pattern Y, we can also make use of the concept of the ordered set; that is, we can apply a concept similar to the local pruning strategy. In this way, we do not have to check every subset X of pattern Y. After the pruning step ends, we can start to find the WMFPs from the candidate patterns. We start from the pattern Y which appears the fewest times among all candidate patterns, since it may have a subset X whose number of appearances is larger than that of pattern Y. In weighted maximal frequent mining, a transaction may have more than one WMFP. For instance, the WSup of the pattern [B, E, D] is less than δ, but pattern [B, E] and pattern [B, D], the subsets of [B, E, D], are WMFPs. To make sure that we will not lose any WMFP, we would have to trace every subset of a candidate pattern, but that would take a lot of time. To solve this problem, we propose a way to efficiently trace the subsets of a candidate pattern. For a candidate pattern PZ of length r, PZ = [t1, t2, ..., tr], if its ordered subset PY of length h, PY = [t1, t2, ..., th] (h < r), has WSup smaller than the threshold, then we can prune every other subset of PZ of length h, since each of them must also satisfy the condition WSup < threshold. The reason is that pattern PZ has been sorted, so PY is the subset of length h with the largest weight among all subsets of PZ. For instance, if the pattern [B, E, D, A, C] has appeared 2 times, we can use it to find WMFPs. After we calculate the WSup of [B, E, D, A, C], we find that [B, E, D, A, C] is not a WMFP. By the definition, the ordered subsets of [B, E, D, A, C] are [B, E, D, A], [B, E, D], [B, E], and [B]. Then, after calculating the WSup of [B, E, D, A], we find that [B, E, D, A] is also not a WMFP. Consequently, we know that every 4-length subset PY of [B, E, D, A, C] is not a WMFP, because [B, E, D, A] has the largest weight among them and they have the same counts. After we use LMAX_W to
find the WMFPs, we can build a WMFP-table to store the WMFPs and their weights, as shown in Figure 3.10.
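The ordered-subset pruning just described can be sketched as follows, using the item weights of Figure 3.2 and the δ = 13 threshold of the running example; as in the text, the traced subsets are assumed to share the candidate's count.

```python
# Sketch of tracing a candidate PZ's ordered subsets: because PZ is
# weight-sorted, the length-h prefix has the largest average weight among
# the length-h subsets considered, so once a prefix fails the threshold,
# every other length-h subset (with the same count) fails as well.

weights = {"B": 8, "E": 6, "D": 5, "A": 4, "C": 3}  # Figure 3.2 weights

def wsup(items, count):
    return count * sum(weights[i] for i in items) / len(items)

pz, count, delta = ["B", "E", "D", "A", "C"], 2, 13
for h in range(len(pz) - 1, 0, -1):
    prefix = pz[:h]
    if wsup(prefix, count) >= delta:
        print("length-%d subsets of %s may still hold WMFPs" % (h, pz))
        break
    # the length-h prefix failed, so skip all other length-h subsets
```

This matches the running example: the prefixes [B, E, D, A] (WSup = 11.5) and [B, E, D] (WSup ≈ 12.7) fail, while [B, E] (WSup = 14) survives, so only subsets of length 2 and below need further checking.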