ISSN 000-9825, CODEN RUXUEW E-mail: jos@iscasaccn Journal of Software, Vol8, No6, June 2007, pp429 442 http://wwwjosorgcn DOI: 0360/jos8429 Tel/Fax: +86-0-62562563 2007 by Journal of Software All rights reserved F-Index: Twig,2+,,, (, 00872) 2 (, 066004) F-Index: A Flattened Structural Index for Speeding up Twig Query Processing ZHOU Jun-Feng,2+, MENG Xiao-Feng, JIANG Yu, XIE Min (Information School, Renmin University of China, Beijing 00872, China) 2 (Department of Computer Science and Technology, Yanshan University, Qinhuangdao 066004, China) + Corresponding author: Phn: +86-0-6255575, E-mail: zhoujf@ysueducn, http://wwwruceducn Zhou JF, Meng XF, Jiang Y, Xie M F-Index: A flattened structural index for speeding up twig query processing Journal of Software, 2007,8(6):429 442 http://wwwjosorgcn/000-9825/8/429htm Abstract: How to process twig query quickly and correctly has attracted much attention in research society recently Filtering query irrelevant elements before query execution is an important step for reducing elements scanned at query processing As a flattened structural index, F-Index is proposed to filter out all query irrelevant index nodes, thus query irrelevant elements can be filtered out rapidly and mostly, especially when it is processing deeply nested XML documents with a complex structure After filtering, a new efficient query algorithm based on the remaining elements is proposed to accelerate query processing Experimental results on various datasets indicate that twig query s performance can be improved significantly by using F-Index Key words: : XML; query optimization; twig query; filtering; structural index twig XML,, F-Index,,, XML,, F-Index : XML; ;twig ; ; : TP3 : A XML, XML Supported by the National Natural Science Foundation of China under Grant Nos6057309, 6027308 ( ); the National Basic Research Program of China under Grant No2003CB37000 ( (973)); the Key Project of Ministry of Education of China under Grant No03044 ( ); the Program for New Century Excellent Talents in University ( ) Received 2006-04-20; Accepted 2006-06-30
430 Journal of Software Vol8, No6, June 2007 XML,,Twig ( ) Twig Twig, (PC) (AD),XPath //book [//appendix]//figure book figure, book appendix, XML,, XML XML, XMark, -index 54 ; TreeBank,-index 338 748,, Table Comparison of different datasets features Dataset Size (M) Tags Doc nodes -index nodes Doc nodes/-index nodes XMark 5 74 666 35 54 3 242 TreeBank 84 250 2 437 666 338 748 7 AD PC,,, : () F-Index,,XML,F-Index (2), (3), 2 3 F-Index F-Index 4 F-Index 5 6 * 7 8 [ 5], ; [6 3], ; [4,5],,TwigStack [4], AD, PC, AD Twig, [5] XR-Tree, XML, [6], DataGuide [7],-index [8] F&B [9] [6,20 22] [6,20],, [2,22] TreeBank,,, [22],, / PC, // AD, Tag, A,B,, A,B, a,b
:F-Index: Twig 43 2 :, 2 a A b a 2 a 3 c 4 B A 2 C 2 b 2 c c 2 c 3 d 3 B 2 C D 2 Fig d d 2 e (a) XML-Document (a) XML D E (b) -Index (b) -Index XML document and the corresponding -index structure XML -index : //C[//D]/E, TwigStack c c 4,d d 3,e, (b), -index C,D E, 2 c,c 2,c 3,d,d 2,e, c 4,d 3,, :c,c 3,, 2,, :,,, 3 F-Index 3 F-Index start,end,level -index -index Tag, start 2, F-Index A n Tag B, B i, n 2 B i n A -index, A S={n,n 2,,n k },,n i ( i k) A,, S Tag, S=S S 2 S m (m k) S i S j = (( i m) ( j m) (i j)),,m Tag, S i ( i m), S i, n, A Tag B, 2 nb,, Si, S B, 3 3(a), 3(b) 3(a),AncestorNode A, A Tag C ;ChildNode C 2, A Tag C C 2,, ;MinDescNode A Tag C start, C,nDescCount A Tag C (b), 2 Tag C, C,C 2 A A MinDescNode 2 C, 2 3(b) A Tag C (b), Tag A Tag C, AncestorNode, 3(e), Tag A Tag C
432 Journal of Software Vol8, No6, June 2007 AncestorNode ChildNode MinDescNode ndesccount A C 2 C 2 (a) The link element (b) A link element instance (a) (b) A B C D E A B C D E A 2 B 2 C 2 D 2 ReachableTag AC (c) The link head (c) A C 2 C 2 AC (d) A link head instance (d) A 2 C C (e) A link instance (e) Fig2 Query aided structure 2 Fig3 The link structure 3 3(c), 3(d),ReachableTag -index Tag, 3(d),,AC Tag A -index Tag C, Tag A A i, A i Tag C,, Tag A A m, A m Tag C 3(a) 3(c) 3(e) (b) Tag A Tag C, Tag, 4 F-Index 2 F-Index 3 //A/B, Tag A -index, Tag B 4 //A//B, Tag A -index,, Tag B 5 Twig//A[//B]//C, AB AC, AncestorNode, Tag A,, Tag B C, Tag B C A //A[/B]//C //A[//B]/C Twig Tag A,F-Index, 5 3 4, A, //A//B //A//C 6 Twig, //A[//B]/C[//D]/E, Tag A, Tag C, Tag B,D E, B C, D E A 6 5, 2: 5 3 F-Index Q, C D, C D AD,,, C D,C 2 D 2 Q 2, CD, CE, AncestorNode, ; AncestorNode, 4 Q 2 C D E
:F-Index: Twig 433 AA AB AC AD AE CD CE A A 2 A 2 A B B 2 A C 2 C 2 A - D 2 A - E C D D C E E A 2 B 2 B 2 A 2 C C A 2 - D A 2 - E C 2 D 2 D 2 D B C D Q C Q 2 A C D Q 3 E E Fig4 Structure of F-Index 4 F-Index Fig5 Query examples 5 Q 3, 4 F-Index : () F-Index, (2) 3 :,F-Index,, F-Index,F-Index 32 F-Index Tag,,,F-Index Tag,, n Tag,,F-Index O(n index n Tag ),,n index,n Tag Tag,,, Tag 4 F-Index A, Tag A, A A 3: 5(c) Q 3, A C, A,A A B,C,D,E,,A C PC,A B,D,E AD,A B C 4 FilterData(indexInfo) : curbranchnode=filterinfogetfirst(); 2: while (curbranchnode) 3: ToTagList=curBranchNodeGetToTagList();
434 Journal of Software Vol8, No6, June 2007 4: while (not bfilterend) 5: if (FindMatchedSeq()) 6: RecordMatchedSeq(curBranchNode); 7: AdvanceAllPointer(ToTagList); 8: else 9: break; 0: curbranchnode=filterinfogetnext(); : RecordFilterResult(FilteredResult); 2: return FilteredResult; Procedure FindMatchedSeq() : while (all pointers in ToTagList are not equal) 2: ur=findminpointer(totaglist) 3: AdvancePointer(ur); 4: if (ur=null) 5: return false; 6: return true; Procedure RecordMatchSeq(curNode) : childnode=curnodefirstchild(); 2: while (childnode) 3: Q i =curnodegetchildnodevalue(); 4: P i =childnodegetnodevalue(); 5: while(q i =P i ) 6: NodeList=Node(P i )GetRecordedList(); 7: MergeNodeList(curNode,NodeList); 8: P i =childnodegetnextnodevalue(); 9: childnode=curnodenextchild() filterinfo,filterinfogetfirst(), Q 3 C 2~9 curbranchnode, 3 curbranchnodegettotaglist(), ToTagList FilterEnd(),,FindMatchedSeq(),RecordMatchedSeq(curBranchNode), 0 filterinfogetnext(),,,, 2 FindMatchedSeq(), ToTagList AncestorNode, 2, false ToTagList, true, RecordMatchSeq(curNode), curnodefirstchild() curnode, 3 curnodegetchildnodevalue() Q i, P i, 4 childnodegetnodevalue() P i, Q i =P i, 6 Node(P i )GetRecordedList() 7 MergeNodeList(curNode,NodeList) 8 curnodenextchild(), PC AD :
:F-Index: Twig 435 4: 5(c) Q 3 F-Index Q 3, A C, C, A () C: CD CE, AD PC, 4, C D E (2) A: AB,AD,AE 3, AC,4, AncestorNode, ; A B C 2,A Tag C C 2, C C 2,,4 ; 2 A 2 B 2 C,A 2 Tag C C,, A 2 B 2 C D E 42,,F-Index O(n),,n 5, (start,end,level), start q C q 6,C q q, C q C q C q C q, Advance(C q ) ;, C q start,c q end C q level,, q C q Sorted element set of q Sorted element set of q C q2 C q3 q 2 q 3 Sorted element set of q 2 Sorted element set of q 3 Sorted element set of q 3 Fig6 Cursors during execution 6 2 (solution) q, q, q : () q, C q (2) q, q C start < C start C end C end,q q q q q > q 2 S={q,q 2,,q n } Q,Q q, q,, C q Cq C 2 q n Q :, 2 AD,, PC, PC AD 5 2 FindSolution(), true, 2 RecordMatchedResult() FindSolution() q, q,
436 Journal of Software Vol8, No6, June 2007 (q C i q ) C q, FindSolution q i q i q, C q AD, true, false RecordMatchedResult() (start,end) 2 CHTwigQuery(root) : while (FindSolution(root)) 2: RecordMatchedResult(); Procedure FindSolution(q) : for q i in children(q) do 2: while ( Cq i is not a descendant of Cq) 3: while ( C start<cqstart) 4: Advance(C ); q i 5: if (end(q i )) 6: return false; 7: while ( C start> Cqend) 8: Advance(C q ); 9: if (end(q)) 0: return false; : if (!FindSolution(q i )) 2: return false; 3: return true q i q i 3 F-Index,, 3,, 4 3 FilterQuery(root) : FilteredResult=FilterData(indexInfo); 2: for each Seq in FilteredResult; 3: AppendSeqToQuery(Seq); 4: CHTwigQuery(root); Procedure RecordMatchedResult (q) : for each q i in children(q) do 2: while (!end(q i ) & C end<cqend) 3: if (RecordMatchedResult(q i )) 4: AddResultToParent(q i,q); 5: Advance(q i ) q i 5: //A//C/D, -index A C D,A C 2 D 2 A 2 C D, 7 A C D, 2,3 C A,C C,C D a,c 2 d,, 2, a c 2 d a c 2 d 2 a A C 2 D 2,3 C A,C C,C D, a,c 4 d 3,, a c 4 d 3 A 2 C D,3 C A,C C,C D a 2,c 2 d, 2 2, a 2 c 2 d a 2 c 2 d 2
:F-Index: Twig 437 A a a a 2 a 3 C c c 2 c 3 c 4 c c 2 c 3 d 2 D d d 3 d d 2 Fig7 The elements after filtering 7 52 TwigStack 2, :(), O(mn 2 ),,m,n ;(2), [2],, O(k index /n),,k index,n CHTwigQuery, O(n/m),,n CHTwigQuery,m (start,end) 6 * F-Index, (start,end,level,parentstart),,parentstart start, parentstart 7( ) Q *, Q * Q : () Q=A/*/B Q =A//B, C A level+2=c B level (2) Q=A//*//B Q=A/*//B Q=A//*/B Q =A//B, C A level+2 C B level (3) Q=A/*[/B]/C Q =A[//B]//C, C A level+2=c B level C A level+2=c C level C B parentstart=c C parentstart (4) Q=A//*[/B]/C Q =A[//B]//C, C A level+2 C B level C A level+2 C C level C B parentstart=c C parentstart (5) Q A,B n *, C A level+n+ C B level (6) Q twig * n, (3) (4) 6 * 6, [23] *,, * twig 6: //A[*[/B]/C]/D //A[//B][//C]/D, 7 (C A level+2=c B level) C A level+2=c C level) (C B parentstart=c C parentstart) 7 P4 28GHz,G RAM,80G PC, Windows XP VC++60 TwigStack [4] itwigstack [2] SegSJ [22] XMark [24] TreeBank [24] F-Index -index 72 XMark QX~QX8, Treebank QT~QT4, 2
438 Journal of Software Vol8, No6, June 2007 Table 2 Queries used in the experiment 2 Query No XPath expression Query No XPath expression QX /site/regions/africa/item/description/parlist/listitem/text/keyword QX7 //listitem[//bold]/text//emph QX2 /site/closed_auctions/closed_auction[annotation/ description[parlist/listitem/text[keyword[bold]]]]/price QX8 //listitem[//bold]/text[//emph]/keyword QX3 /site/closed_auctions//emph QT //S//ADJP[//MD] QX4 /site/people/person[//age]//education QT2 //S[/JJ]/NP QX5 //site/people/person/name QT3 //S/VP/PP[/NP/VBN]/IN QX6 //text[/bold]/emph/keyword QT4 //S/VP//PP[//NP/VBN]/IN F-Index, 5 F-Index : F-Index ; F-Index ; F-Index ; F-Index ; F-Index 73 73 F-Index 3 F-Index -index,xmark,f-index -index 22 ;Treebank,F-Index -index 25,, O(n index n Tag ) Table 3 Size of F-Index and -index 3 F-Index -index Dataset (size) XMark (M) XMark (2M) XMark (35M) XMark (57) XMark (93M) XMark (6M) Treebank (84M) Elements 7 32 67 865 50 498 832 9 337 383 666 35 2 437 666 -index nodes 42 502 54 54 54 54 338 748 F-Index nodes 039 8 30 30 30 30 846 228 F-Index nodes/-index nodes 25 22 22 22 22 22 25 732 F-Index () 8(a) 8(b), F-Index, QX, 322 630 2 632,,, Treebank /6~/3 XMark,, QX4,SegSJ, F&B, F-Index itwigstack SegSJ Pre-Filtering 00 00 QX QX2 QX3 QX4 QX5 QX6 QX7 QX8 (a) XMark F-Index 0000 00 00 QT itwigstack SegSJ QT2 QT3 (b) Treebank Pre-Filtering QT4 Fig8 Comparison of elements number 8 (2) 9(a) 9(b),F-Index, QX QX2 itwigstack, QX PC-path,,,QX2 PC-twig
:F-Index: Twig 439, SegSJ F&B, -index, FB, Treebank itwigstackitwigstack Treebank, Treebank -index, 00 00 F-Index itwigstack SegSJ F-Index itwigstack SegSJ 0000 QX QX2 QX3 QX4 QX5 QX6 QX7 QX8 (a) XMark 00 00 QT QT2 (b) Treebank Fig9 Comparison of processed index nodes number (3) QT3 9 0(a) 0(b), XMark,, F-Index, ; Treebank, Treebank, QT4, 465ms, itwigstack,xmark QX2, 4ms,Treebank QT4, 40 968ms SegSJ, F&B, -index, Treebank, F&B, QT4 00 00 F-Index itwigstack SegSJ F-Index itwigs SegSJ 0000 QX QX2 QX3 QX4 QX5 QX6 QX7 QX8 (a) XMark Fig0 00 00 Comparison of filtering time 0 QT QT2 QT3 QT4 (b) Treebank 733 F-Index (a) (b),f-index itwigstack SegSJ, TwigStack, itwigstack TwigStack, 52 00 00 F-Index itwigstack SegSJ F-Index itwigstack SegSJ 0000 00 00 QX QX2 QX3 QX4 QX5 QX6 QX7 QX8 QT QT2 QT3 QT4 (a) XMark Fig (b) Treebank Comparison of running time after filtering 734 F-Index 2(a) 2(b), F-Index, F-Index
440 Journal of Software Vol8, No6, June 2007 3, 00 00 F-Index itwigstack SegSJ TwigStack QX QX2 QX3 QX4 QX5 QX6 QX7 QX8 Fig2 (a) XMark 0000 00 00 F-Index itwigstack SegSJ TwigStack QT QT2 QT3 QT4 (b) Treebank Comparison of the running time of different query algorithms 735 F-Index 2 3 (PC-twig,AD-twig) F-Index, F-Index, 3 F-Index 00 00 itwigstack SegSJ TwigStack M 2M 35M 57M 93M 6M (a) QX6 Fig3 736 F-Index 00 00 itwigstack SegSJ TwigStack M 2M 35M 57M 93M 6M (b) QX7 Comparison of the scalability of different query algorithm 3,,, TwigStack,,,F-Index, itwigstack SegSJ, F-Index, 8, XML, F-Index,,,,F-Index,, F-Index References: [] McHugh J, Widom J Query optimization for XML In: Malcolm PA, Maria EO, Patrick V, Stanley BZ, Michael LB, eds Proc of the 25th Int l Conf on Very Large Data Bases (VLDB) Edinburgh: Morgan Kaufmann Publishers, 999 35 326
:F-Index: Twig 44 [2] Michael J, Franklin MJ Efficient filtering of XML documents for selective dissemination of information In: Amr EA, Michael LB, Sharma C, Umeshwar D, Nabil K, Gunter S, Kyu YW, eds Proc of the 26th Int l Conf on Very Large Data Bases (VLDB) Cairo: Morgan Kaufmann Publishers, 2000 53 64 [3] Gottlob G, Koch C, Pichler R Efficient algorithms for processing XPath queries In: Stéphane B, Akmal BC, Mong LL, Jeffrey XY, Zoé L, eds Proc of the 28th Int l Conf on Very Large Data Bases (VLDB) Hong Kong: Morgan Kaufmann Publishers, 2002 95 06 [4] Halvreson A, Burger J, Galanis L, Kini A, Krishnamurthy R, Rao AN, Tian F, Viglas SD, Wang Y, Naughton JF, DeWitt DJ Mixed mode XML query processing In: Johann CF, Peter CL, Serge A, Michael JC, Patricia GS, Andreas H, eds Proc of the 29th Int l Conf on Very Large Data Bases (VLDB) Berlin: Morgan Kaufmann Publishers, 2003 225 236 [5] Lu SC, Meng XF, Lin C, Wang Y Navigation implementation for XQuery in OrientX Journal of Computer Research and Development, 2004,4(0):85 822 (in Chinese with English abstract) [6] Zhang C, Naughton J, DeWitt D, Luo Q, Lohman G On supporting containment queries in relational database management systems In: Aref WG, ed Proc of the ACM SIGMOD Int l Conf on Management of Data (SIGMOD) ACM, 200 425 436 [7] Li Q, Moon B Indexing and querying XML data for regular path expressions In: Peter MGA, Paolo A, Stefano C, Stefano P, Kotagiri R, Richard TS, eds Proc of the 27th Int l Conf on Very Large Data Bases (VLDB) Rome: Morgan Kaufmann Publishers, 200 36 370 [8] Al-Khalifa S, Jagadish HV, Koudas N, Patel JM, Srivastava D, Wu Y Structural joins: A primitive for efficient XML query pattern matching In: Agrawal R, Dittrich K, Ngu AHH, eds Proc of the 8th Int l Conf on Data Engineering (ICDE) San Jose: IEEE Computer Society, 2002 4 52 [9] Chien SY, Vagena Z, Zhang D, Tsotras VJ, Zaniolo C Efficient structural joins on indexed XML documents In: Stéphane B, Akmal BC, Mong LL, Jeffrey XY, Zoé L, eds Proc of the 28th Int l Conf on Very Large Data Bases (VLDB) Hong Kong: Morgan Kaufmann Publishers, 2002 263 274 [0] Jiang H, Lu H, Wang W, Ooi BC XR-Tree: Indexing XML data for efficient structural joins In: Umeshwar D, Krithi R, Vijayaraman TM, eds Proc of the 9th Int l Conf on Data Engineering (ICDE) Bangalore: IEEE Computer Society, 2003 253 264 [] Lam F, Shui WM, Fisher DK, Wong RK Skipping strategies for efficient structural joins In: Yoonjoon L, Jianzhong L, Kyuyoung W, Doheon L, eds Proc of the Database Systems for Advances Applications (DASFAA) Jeju Island: Springer-Verlag, 2004 96 207 [2] Wang J, Meng XF, Wang S Structural join of XML based on range partitioning Journal of Software, 2004,5(5):720 729 (in Chinese with English abstract) http://wwwjosorgcn/000-9825/5/720htm [3] Wang J, Meng XF, Wang Y, Wang S Target node aimed path expression processing for XML data Journal of Software, 2005,6(5):827 837 (in Chinese with English abstract) http://wwwjosorgcn/000-9825/6/827htm [4] Bruno N, Koudas N, Srivastava D Holistic twig joins: Optimal XML pattern matching In: Michael JF, Bongki M, Anastassia A, eds Proc of the 2002 ACM SIGMOD Int l Conf on Management of Data (SIGMOD) Madison: ACM, 2002 30 32 [5] Jiang HF, Wang W, Lu H, Yu JX Holistic twig joins on indexed XML documents In: Johann CF, Peter CL, Serge A, Michael JC, Patricia GS, Andreas H, eds Proc of the 29th Int l Conf on Very Large Data Bases (VLDB) Berlin: Morgan Kaufmann Publishers, 2003 273 284 [6] Moro MM, Vagena Z, Tsotras VJ Tree-Pattern queries on a lightweight XML processor In: Klemens B, Christian SJ, Laura MH, Martin LK, Per-Ake L, Beng CO, eds Proc of the 3st Int l Conf on Very Large Data Bases (VLDB) Trondheim: ACM, 2005 205 26 [7] Goldman R, Widom J DataGuides: Enabling query formulation and optimization in semistructured databases In: Matthias J, Michael JC, Klaus RD, Frederick HL, Pericles L, Manfred AJ, eds Proc of 23rd Int l Conf on Very Large Data Bases (VLDB) Athens: Morgan Kaufmann Publishers, 997 436 445 [8] Milo T, Suciu D Index structures for path expressions In: Catriel B, Peter B, eds Proc of the Int l Conf on Database Theory (ICDT) Jerusalem: Springer-Verlag, 999 277 295
442 Journal of Software Vol8, No6, June 2007 [9] Kaushik R, Bohannon P, Naughton JF, Korth HF Covering indexes for branching path queries In: Michael JF, Bongki M, Anastassia A, eds Proc of the 2002 ACM SIGMOD Int l Conf on Management of Data Madison: ACM (SIGMOD), 2002 33 44 [20] Barta A, Consenc MP, Mendelzon AO Benefits of path summaries in an XML query optimizer supporting multiple access methods In: Klemens B, Christian SJ, Laura MH, Martin LK, Per-Ake L, Beng CO, eds Proc of the 3st Int l Conf on Very Large Data Bases (VLDB) Trondheim: ACM, 2005 33 44 [2] Chen T, Lu JH, Ling TW On boosting holism in XML twig pattern matching using structural indexing techniques In: Fatma Ö, ed Proc of the ACM SIGMOD Int l Conf on Management of Data (SIGMOD) Baltimore: ACM, 2005 455 466 [22] Wang W, Wang HZ, Lu HJ, Jiang HF, Lin XM, Li JZ Efficient processing of XML path queries using the disk-based F&B index In: Klemens B, Christian SJ, Laura MH, Martin LK, Per-Ake L, Beng CO, eds Proc of the 3st Int l Conf on Very Large Data Bases (VLDB) Trondheim: ACM, 2005 45 56 [23] Hou S, Jacobsen HA Predicate-Based filtering of XPath expressions In: Ling L, Andreas R, Kyuyoung W, Jianjun Z, eds Proc of the 22nd Int l Conf on Data Engineering (ICDE) Atlanta: IEEE Computer Society, 2006 53 64 [24] Miklau G XMLData repository 2002 http://wwwcswashingtonedu/research/xmldatasets : [5],, OrientX XQuery,2004,4(0):85 822 [2], XML,2004,5(5):720 729 http://wwwjosorgcn/000-9825/5/720htm [3],,, XML,2005,6(5):827 837 http://wwwjosorgcn/ 000-9825/6/827htm (977 ),,,,, XML (98 ),,, XML (964 ),,,,,CCF, Web,XML, (984 ),,, XML