An efficient mechanism for Matching multiple patterns on XML Streams


AN EFFICIENT MECHANISM FOR MATCHING MULTIPLE PATTERNS WITH STREAMED XML DATA

Andreas Hinnerichs, Edzard Höfig
Fraunhofer FOKUS
Kaiserin-Augusta-Allee 31, D-10625 Berlin, Germany
email: {hinnerichs|hoefig}@fokus.fraunhofer.de

ABSTRACT

Filtering XML data streams using efficient pattern matching algorithms is a fundamental ability for many data-centric applications and main purpose of the Template Matching sPecification Language (TMPL). In this paper extensions to the language are discussed that help to formulate more powerful query patterns: The declarative type system, improved predicates, template references and sequence matching operators. An optimised matching runtime based on lazy constructed automata is introduced together with an explanation of the underlying formalism. An Example, case studies and performance measurements illustrate the usage and usability of TMPL.

KEY WORDS

Programming Tools and Languages, XML, Stream Processing, Template Matching

1 Overview

The self descriptive eXtensible Markup Language (XML) is at the heart of many data-centric applications in use today. For most of these applications a timely analysis of their large quantities of data is only possible using a stream processing approach.
As often only a fraction of the data to be processed is of interest, there is a need for efficient filter mechanisms. For example, we processed a large log file to extract statistical data used in the performance assessment of a business system: out of the 1.2 gigabytes of log data only about 1/130 was of interest for the calculation of the performance values. The best practice for sorting out the important pieces of data is the use of an efficient one-pass parsing algorithm.

Our approach is based on the idea that data streams always exhibit a recurrent structure of a particular size and with characteristic patterns. Values are gathered by creating such patterns (so-called templates) in an XML-like syntax and marking parts of the pattern for later processing in a general-purpose programming language (currently Java). An analysing application would then use our template matching engine to compare these templates against a possibly infinite XML data stream and would be notified, with the name and identified values of the matched template, when a match is detected.

In this paper we describe the concepts that are the basis for an extension of TMPL and its runtime (see [1] for a description of the original concepts), enabling an efficient parallel matching of templates, matching of pattern sequences by means of template references, and better integration with the analysing applications by using a declarative type system.

The remaining text is structured as follows: first, an overview of related work is given in section 2. In section 3 extensions to TMPL are described and a new approach to the runtime architecture is introduced together with a discussion of changes in matching semantics. A new matching mechanism is introduced and its implications for runtime behaviour are detailed in section 4 along with an example. Section 5 discusses applications by presenting performance measurements and comparing our approach with alternative technology. Section 6 concludes by presenting a summary, open topics and future steps in the development of TMPL.

2 Related Work

Diverse mechanisms for efficient pattern matching on XML data streams have been researched. Usually a conventional technology used with document-based XML data (e.g. XPath, XQuery or XSLT) is adapted to the requirements of stream processing.

In [2] Olteanu, Furche, and Bry introduce the SPEX engine, which allows a subset of XPath to be evaluated on streamed data by translating XPath expressions containing only forward axes to a network of transducers. Becker proposes STX (see [3]), an XSLT-like language that allows for the creation of templates that can be matched against an XML data stream. STX has a Java implementation called joost that we compare our approach against in section 5. In [4] Fegaras explains XStreamQuery, a transducer-based implementation of an analysis engine which uses SAX. XQuery expressions may be translated to XStreamQuery expressions and could then be executed by the XStreamQuery engine.


This paper discusses XML stream data processing from the perspective of TMPL, a language that was specifically designed for stream analysis and not adapted from a document-based approach. TMPL was introduced in [1]; further information on related work in this field may be found in that paper as well.

3 Language Extensions

TMPL is a context-sensitive language describing patterns that are to be matched against a data stream. Syntactical structures of TMPL are specified in a syntax quite similar to XML itself and may consist of elements, attributes and data specifications in the form of so-called predicates. Predicates are denoted using square brackets and may restrict content (e.g. a statement like a=[>10] would match only if the corresponding attribute possesses a value greater than ten) or assign a value to a variable (e.g. [=:b] would assign the content of the element to a variable named b). There are additional symbols for marking a single unknown element or an element at an unknown depth in the XML data stream (stream depth is defined as the nesting level of elements). A TMPL file defines a single module that contains several templates, and it is possible to define constants at the beginning of a template or module scope. Variables or constants may also be referenced in predicates (e.g. [$c] matches if an unknown element is encountered that contains the value stored in the variable referenced by the name c).
Based on the experience we gained while applying the language during the last year, we extended the functionality with features we were missing, which either improve expressiveness by enabling the matching of previously unidentifiable patterns (e.g. sequence matching) or increase the usability of the language by reducing possible typos or failures on the user's side (e.g. the declarative type system).

To accommodate certain new runtime features it became necessary to change the TMPL toolchain. Figure 1 depicts the current toolchain in white; the grey part shows the old approach, which generated Java code directly from the TMPL abstract syntax tree (AST) as constructed by the parser. This has been dropped in favour of a new intermediary format that codifies the automata for the runtime engine using an extended finite state machine (EFSM) formalism, which is described in section 4.1.

[Figure 1. TMPL tool chain (components: tmpl, Parser, Automata Generation, gxl, Interpreter, Java Generation, java, Compiled Runtime)]

The basic idea was to be able to use this format in several ways. An automaton may be saved in, or loaded from, a file using the standard Graph eXchange Language (GXL) format (for more information please refer to the website [5]). Templates in the intermediary format are used directly with the TMPL Interpreter, that is the matching runtime engine, to search for patterns in the data stream. Some other options not shown here would be to generate specialised matching code directly from the intermediary format for a certain programming language or to use standard visualisation software to generate pictures of the state machines.

3.1 Syntactical Changes

The syntax of TMPL has been slightly changed to make it easier to understand. Explicit comparison and assignment operators have been introduced (e.g. [=$foo] and [=:bar]), which may be omitted when their meaning within a predicate is unambiguous. There is a shorthand syntax for other syntactical constructs as well; please refer to the website [6] for documentation about these.

A major change is the introduction of the range element, which offers a clearer definition than the former placeholder expression. One can use it to match a sequence of nested elements. As an example consider the following TMPL code:

[…]

In this case the range element stands for a sequence that has to begin with a child element of […] and that must end with an element containing attribute att and child element b. The range element is matched successfully when the starting element of the sequence closes.


3.2 Declarative Type System

The introduction of a declarative type system is motivated by several reasons. For starters, one wants to precisely specify the types to be matched as attribute values or element content, e.g. an attribute containing a numeric value should only be captured from the input stream if such a type has been specified, and it should be further processed using a suitable type in the language that the template match is reported to, removing the need to cast the data to a new type. We decided to use a static type system due to the nature of TMPL templates, which are also static and described as a closed structure. Users are forced to explicitly detail all the data that may match for a certain template, making it easier to design and maintain all but the simplest patterns. A user can infer the structure of matched data simply by referring to the structure of the data definitions, without combing through the complete template definitions.

All constants, variables or template references need to declare a name and a type before being used; this is done at the beginning of any template, right after opening the scope and before the definition of any structures. Variables may, and constants must, have a default value, which is assigned as part of the declaration. For example, the line const integer count = 10; declares a constant with the name count and the type integer and assigns a default value of 10.
The TMPL type system provides five basic types:

string: Is a character string. Any such value must be given as a sequence of characters enclosed between double quotes. This corresponds to the native (and only) type used in XML.
integer: Declares an integer number. This is represented as a simple number. Integers are of arbitrary length but may be limited by the implementation of the engine.
float: Declares a floating point type. Floating point numbers bear an arbitrary precision which may be limited by the implementation language of the engine. A float value is given as two numbers with a . between them.
bool: Holds a boolean value and may be either true or false.
template: Declares a reference to another template.

3.3 Template References

While applying the language we often came across situations where a similar pattern appeared in many templates (header data, for example). To ease the handling of these kinds of templates, references were introduced. A template can be made abstract, with the result that matches of this template are not reported to the application. Templates that are declared abstract may be referenced by other templates using the template type, enabling the matching of sequences of similar patterns and also allowing the definition of optional parts. Technically, when referencing templates one is allowed to use a quantifier symbol. If no quantifier is present, the template has to occur exactly once at the given position (for an example see section 4).
There are five valid quantifiers with the following meanings (t1 and t2 stand for template references):

t1?: The reference t1 is optional. It may appear or it may not.
t1+: The reference t1 may match one or more times.
t1*: The reference t1 may match several times or not at all.
t1 | t2 | ...: Any one of the given references (t1, t2 or ...) matches. If several references match, the leftmost template is chosen. The or quantifier allows for an arbitrary number of operands.
t1 & t2 & ...: All given references (t1, t2 and ...) have to match at the same time. The and quantifier allows for an arbitrary number of operands.

Depending on the quantifiers, template types have to be declared either as a scalar type or as a sequence. The syntax uses square brackets to denote a sequence type. For example, template foo[] t1; declares a template reference with the name t1 which will be used to reference a sequence of abstract foo templates.

3.4 Improved Predicates

In the former version of the language only trivial predicates like comparison and assignment with the currently matched token and one additional operator were possible, and even these only on values of attributes or content in elements. It was soon obvious that improving predicates with more sophisticated operations like regular expressions, range restrictions or the ability to test a variable would be of great value to the applicability of the language. Predicates have therefore been significantly extended.
They are allowed not only to specify attribute and content values, but also to define the names of elements. When a predicate is enclosed between the symbols [: and :] it is evaluated as a regular expression (the regular expression dialect in use is currently that of Java; for more information please refer to the java.util.regex.Pattern class).
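To illustrate how a predicate enclosed in [: and :] could be handed to the Java regular expression dialect, the following is a minimal sketch and not the TMPL implementation. The class name RegexPredicate and its test method are hypothetical, and the sketch assumes that the whole token has to match the expression (Matcher.matches() rather than Matcher.find()), which the text above does not specify.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: evaluates a TMPL-style regular expression predicate. */
final class RegexPredicate {
    private final Pattern pattern;

    /** Accepts predicate text such as "[:fan*:]" and compiles the part between the markers. */
    RegexPredicate(String predicate) {
        if (predicate.length() < 4 || !predicate.startsWith("[:") || !predicate.endsWith(":]")) {
            throw new IllegalArgumentException("not a regular expression predicate: " + predicate);
        }
        this.pattern = Pattern.compile(predicate.substring(2, predicate.length() - 2));
    }

    /** Assumption: the complete token must match, so matches() is used instead of find(). */
    boolean test(String token) {
        return pattern.matcher(token).matches();
    }

    public static void main(String[] args) {
        RegexPredicate p = new RegexPredicate("[:fan*:]");
        System.out.println(p.test("fan"));     // true
        System.out.println(p.test("fannn"));   // true
        System.out.println(p.test("fanclub")); // false, the trailing "club" is not covered by fan*
    }
}
```

Under this assumption the same helper could be applied to element names, attribute values or content alike, since all of them reach the predicate as plain strings.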


Predicates may also evaluate a boolean expression given by combinations of relational operators (=, !=, <, >, <=, >=), the assignment operator (=:), logical operators (|, &, !), brackets ((, )), variable references (using $) or the current value from the XML data stream (if an operand is left empty). Variable references may be complex as they also need to handle template references. For example, $foo.bar[3].ding identifies the value of the field ding of the third element in the sequence bar, where bar is a field of the scalar foo. The definition of TMPL has a set of rules that govern the exact evaluation of predicates. We will not go into more detail in this article; please refer to the documentation found at the website [6].

3.5 Allowing Mixed Content

The language allows for matching of mixed content; values are allowed to appear unrestricted among the XML tags and not only as a single child of an element. Unlike other processors, the TMPL engine treats leading and trailing white space as empty content and strips it from the data.

4 Changes in the Evaluation Runtime

The implementation of the stream analysis runtime has changed considerably, as shown in figure 1. The evaluation engine interprets EFSM template matching automata stored in a proprietary binary format or in GXL, enabling a more flexible handling of templates, e.g. it is now possible to add or remove templates at runtime and to combine automata.
The performance of this approach using only a single template is inferior to the former compiled version, but superior when several templates are matched in parallel.

4.1 Automata Generation

The formalism for generated automata had to support context (which may be values from the data stream, TMPL declared variables and constants, or internal variables used by the evaluation engine) and transitions that may depend on any of them. These premises led to the adoption of an EFSM formalism. More specifically, we chose to extend the one proposed by Henniger and Neumann (see [7]) because of the above-mentioned features, the additional support for internal ε transitions and the ability of transitions to change the complete context of an automaton.

An EFSM is defined as a tuple (S, C, I, O, T, s_0, c_0), where S is a finite set of states, C the context (all possible values of a finite set of variables), I a non-empty finite set of input events including ε, and O a non-empty finite set of output events. T denotes the transition relation from current state, context and input event to the next state, modified context and output event (T ⊆ S × C × I × O × S × C).
The last two elements s_0 and c_0 describe the initial state and context values.

Every TMPL component (element, content, attribute, template reference, ...) is translated to one or more states that are combined into a graph using forward and reverse transitions. A forward transition is understood as leading towards the end state, whereas a reverse transition is used when a path cannot be completed, e.g. when a parent XML element closes. As an example consider the following template for searching a (hypothetical) music database:

template fooMusicList {
const integer c = 10;
float ranking;
template tSingle[] single;
template tAlbum[] album;
[…]$c]>[=:ranking]single*album*
}

The template fooMusicList would be translated to an EFSM similar to the one shown in figure 2 (some details are omitted for brevity). A straightforward path through the generated automaton is marked with bold transition lines. Several details might be associated with a transition: context conditions, which are enclosed in square brackets; actions for changing the context, in curly brackets; an input event from the XML stream (e.g. START, END, ATTRIBUTE, ...); and an output event, or none in the case of an ε transition.

The first two states are used to initialise constants and output variables; the matching process starts in the START state and stops in the END state, where a matching pattern has been identified which will be reported back to the analysing application. For every element the generator constructs two states that match the corresponding XML start and end elements (e.g. MATCH_AUTHOR and MATCH_AUTHOR_END); for attributes the automaton has a state that first registers each attribute's name and then uses a single state to match all of them; content is handled similarly. Template references are translated as single states.
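The EFSM formalism described in this section can be made concrete with a small sketch. It is an illustration under stated assumptions, not the TMPL interpreter: all class, method and event names (EfsmSketch, Transition, step, START_ELEMENT, END_ELEMENT, the context key "current") are hypothetical, the context is a plain map, and the example automaton covers only a single author element with a titles attribute that has to exceed the constant c.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Minimal EFSM sketch; every name here is illustrative and not taken from the TMPL runtime. */
final class EfsmSketch {

    /** One element of T: in 'from', on 'input', if 'guard' holds, run 'action', emit 'output', go to 'to'. */
    static final class Transition {
        final String from, input, output, to;
        final Predicate<Map<String, Object>> guard;
        final Consumer<Map<String, Object>> action;

        Transition(String from, String input, Predicate<Map<String, Object>> guard,
                   Consumer<Map<String, Object>> action, String output, String to) {
            this.from = from; this.input = input; this.guard = guard;
            this.action = action; this.output = output; this.to = to;
        }
    }

    private final List<Transition> transitions = new ArrayList<>();
    private final Map<String, Object> context = new HashMap<>(); // c_0: empty in this sketch
    private String state;                                         // starts as s_0

    EfsmSketch(String initialState) { this.state = initialState; }

    void add(Transition t) { transitions.add(t); }
    String state() { return state; }
    Object get(String variable) { return context.get(variable); }

    /** Consumes one input event and returns the output event, or null if there is none. */
    String step(String input, Object value) {
        context.put("current", value); // value delivered with the stream event
        for (Transition t : transitions) {
            if (t.from.equals(state) && t.input.equals(input) && t.guard.test(context)) {
                t.action.accept(context);
                state = t.to;
                return t.output;
            }
        }
        return null; // no transition selected: stay in the current state
    }

    /** Hypothetical automaton: an author element whose titles attribute exceeds the constant c. */
    static EfsmSketch authorExample() {
        EfsmSketch m = new EfsmSketch("START");
        m.context.put("c", 10); // stands in for an initialisation state
        m.add(new Transition("START", "START_ELEMENT",
                ctx -> "author".equals(ctx.get("current")), ctx -> { }, null, "MATCH_AUTHOR"));
        m.add(new Transition("MATCH_AUTHOR", "ATTRIBUTE",
                ctx -> ctx.get("current") instanceof Integer
                        && ((Integer) ctx.get("current")) > ((Integer) ctx.get("c")),
                ctx -> ctx.put("titles", ctx.get("current")), null, "MATCH_AUTHOR_END"));
        // Reverse transition: the element closed before the attribute condition was met.
        m.add(new Transition("MATCH_AUTHOR", "END_ELEMENT", ctx -> true, ctx -> { }, null, "START"));
        m.add(new Transition("MATCH_AUTHOR_END", "END_ELEMENT",
                ctx -> true, ctx -> { }, "TEMPLATE_MATCH", "END"));
        return m;
    }

    public static void main(String[] args) {
        EfsmSketch m = authorExample();
        m.step("START_ELEMENT", "author");
        m.step("ATTRIBUTE", 12);
        System.out.println(m.step("END_ELEMENT", "author")); // TEMPLATE_MATCH
        System.out.println(m.get("titles"));                  // 12
    }
}
```

The guard plays the role of the context condition in square brackets and the action that of the context change in curly brackets; returning null from step stands for the absence of an output event. Nesting-level conditions such as [=TOP] are left out of the sketch.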


[Figure 2. Automaton for template fooMusicList (states: INIT_CONSTANT, INIT_VARIABLE, START, MATCH_MUSIC, MATCH_AUTHOR, SET_ATTRIBUTE, MATCH_2_ATTRIBUTES, MATCH_REG_ELEMENT, SET_CONTENT, MATCH_CONTENT, MATCH_REG_END, MATCH_SINGLE_STAR, MATCH_ALBUM_STAR, MATCH_AUTHOR_END, END)]

4.2 Optimisation

With the interpreting approach in place, several new optimisations have become possible. One of the most important ones was the combination of several automata into one larger EFSM.

4.2.1 Parallel Execution of Templates

The basic idea is to combine several templates; this is a standard procedure in graph theory: by using the cross product a combined automaton may be calculated (please refer to [8]). It is still necessary to check the conditions for all transitions; the performance gain is solely the difference between the time needed to execute a single state transition and the time needed to execute several.

The downside of this approach is the exponential increase in states and transitions. For example, take two EFSMs S_1 and S_2 with p_i states and q_i transitions each: a resulting EFSM for S_1 × S_2 would have p_1·p_2 states and at least p_1·q_2 + q_1·p_2 transitions. A combination of ten EFSMs with ten states each (e.g. templates having three elements and two content predicates) would end up with 10 billion (10^10) states and far more transitions. As the authors of [9] point out, a possible solution to this dilemma is to use lazy construction principles for the automata.

4.2.2 Lazy Automata Construction

Even if the hypothetical number of states of a combined EFSM seems to grow to exorbitant numbers, only a very small portion of these states would ever be used. If an automaton is built lazily by constructing new states at runtime only when they are needed, this number may decrease dramatically. For example, we combined twelve templates with altogether 350 states using a lazy construction principle and ended up with an EFSM of only 169 states for a specific data stream after further optimisations that reduced redundant or empty states and transitions.

4.2.3 State and Transition Reduction

There are several ways to further reduce the number of states and transitions when combining automata:

Merging of ε transitions: Any number of states that are connected solely by ε transitions are merged into two states with a single connecting ε transition.
Removal of initialisation states: Initialisation functionality does not need to be considered by the combination algorithm but may be executed separately.
Removal of default self-transitions: When no input event selects a transition an EFSM needs to stay in its current state. This can be made default behaviour and then does not need to be made explicit by a self-transition.
Grouping transitions: By grouping transitions with a common condition (e.g. same nesting level or


input event) <strong>on</strong>ly the group needs to be testedand not all transiti<strong>on</strong>s individually.5 Applicati<strong>on</strong> ResultsTMPL is in use <str<strong>on</strong>g>for</str<strong>on</strong>g> a number of projects and the extendedversi<strong>on</strong> of TMPL as presented in this paperhas <strong>on</strong>ly been functi<strong>on</strong>ally tested in some case studies.Results obtained in these case studies were produced<strong>on</strong> a standard desktop PC and are promising.5.1 Open Directory ProjectThe Open Directory Project 6 calls itself the “largest,most comprehensive human-edited directory of theWeb” and makes its c<strong>on</strong>tent freely available am<strong>on</strong>gother things in the <str<strong>on</strong>g>for</str<strong>on</strong>g>m of downloadable <strong>XML</strong> archivefiles. We used seven files ranging from 24 kbyte t<strong>on</strong>early 2 gigabyte size and extracted all directory topics,their assigned links and associated titles usingboth TMPL templates executing with the optimisedinterpreter and STX stylesheets using the joost runtime7 . TMPL exports the filtered values directly toan analysing Java applicati<strong>on</strong>, whereas STX creates atrans<str<strong>on</strong>g>for</str<strong>on</strong>g>med <strong>XML</strong> file with the same in<str<strong>on</strong>g>for</str<strong>on</strong>g>mati<strong>on</strong>. 
Figure 3 shows the execution times obtained using the built-in time measurement functions of STX and Java, respectively.

Figure 3. Open Directory Project archive search (time in seconds against size in megabytes, 0 to 2000, for TMPL and STX)

It can be observed that the STX engine is faster for smaller amounts of data, but after the “warm-up” of the lazy construction mechanism the optimised TMPL interpreter is up to 1/3 faster than STX. It is unlikely that still larger amounts of data would change this. Another argument in favour of TMPL is that the template definitions take 15 lines in TMPL but more than 30 lines in STX. Because TMPL syntax is not itself XML, we argue that it is more intuitive and easier to read and understand than the XSLT-based syntax that STX uses.

5.2 Further Case Studies

To test optimisations for the TMPL runtime we used a 23 megabyte log file containing more than 800 data entries of protocol messages for performance measurement purposes, and filtered information such as timestamps, message types, destination addresses, ... using 1 to 12 different templates. Figure 4 compares the time needed to analyse the stream data, depending on the number of templates used, for an unoptimised interpreter implementation, a combined-state version with lazy construction and a version with all optimisations.

Figure 4. Performance comparison of optimisations (time in seconds against number of templates, 1 to 12, for Interpreter, Combined Interpreter and Optimised Interpreter)

⁶ http://dmoz.org/
⁷ http://joost.sourceforge.net/

6 Conclusion

The extensions to TMPL introduced in this paper have been found useful in making the language more powerful when formulating patterns while preserving its intuitive and clear syntax. The utilisation of lazily constructed automata for matching multiple templates in parallel is feasible and offers good runtime performance. Experience gained during the implementation shows that processing time increases linearly with the size of the analysed data. The optimisation approaches work best for templates and data streams that bear a lot of similarity.

6.1 Open Topics

Nested patterns. The self-similarity problem (as mentioned in [1], page 116) is not solved. A template that is already matching will not start to match again; this problem also exists with abstract templates.
Therefore, abstract templates can only be used to alleviate the burden of writing the same template over and over again, but not to specify relations on a more abstract level than the natural granularity of the underlying stream.
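
The effect of a template that does not start to match again while it is already matching can be illustrated with a small Python sketch. This is not the TMPL runtime: the event format (pairs of "start"/"end" and an element name) and both function names are assumptions made purely for illustration. The first function allows only one active match instance, the second keeps a stack of open instances for comparison.

```python
def match_non_restarting(events, name):
    """Return (start_index, end_index) pairs for elements called `name`.

    Only one match instance may be active at a time. This mimics the
    restriction described in the text; it is an assumed simplification,
    not the actual TMPL implementation.
    """
    matches = []
    active_start = None  # index of the start event of the active match
    for i, (kind, tag) in enumerate(events):
        if tag != name:
            continue
        if kind == "start" and active_start is None:
            active_start = i  # begin matching
        elif kind == "end" and active_start is not None:
            matches.append((active_start, i))
            active_start = None  # the matcher is free again
        # A nested "start" while a match is active is ignored.
    return matches


def match_with_stack(events, name):
    """Same interface, but every start event opens its own instance."""
    matches = []
    open_starts = []  # indices of start events still waiting for their end
    for i, (kind, tag) in enumerate(events):
        if tag != name:
            continue
        if kind == "start":
            open_starts.append(i)
        elif kind == "end" and open_starts:
            matches.append((open_starts.pop(), i))
    return matches


# A self-similar stream: an element nested inside an element of the same name.
nested = [("start", "a"), ("start", "a"), ("end", "a"), ("end", "a")]
```

For the stream `nested`, the non-restarting matcher reports a single match that pairs the outer start event with the inner end event, whereas the stack-based variant reports both elements with correct pairing.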


This behaviour is a restriction of the TMPL runtime and makes it impossible to identify nested, self-similar patterns in an XML stream. Of course, if the structure of a certain pattern to be matched is known in sufficient detail in advance, it is always possible to create a template that will match this exact structure even if parts of it are self-similar. We do not regard this as too serious an issue, as by far most of the XML data in use is not self-similar (VALUE according to the authors of [10, ch. 3.6]), and of the remaining share only a very small fraction (VALUE) is deeply nested and therefore requires the creation of complex static templates.

Namespaces. The namespace mechanism of XML is still not supported in TMPL. The only solution to this is to explicitly prefix every name to yield its qualified representation (e.g. using ns:foo).

6.2 Outlook

Using multiple threads. When analysing XML data on machines with many CPUs or processor cores, implementing the matching runtime in a multithreaded fashion might provide significantly better runtime performance. This approach would probably work quite well, as all templates are able to run in parallel and share only the input event from the data stream.

Further improvement of predicates. Usage of predicates could still be improved, for example by adding numerical operators (+, -, *, /) or functions (e.g. to calculate the length of lists).

Template parametrisation. Templates could be parametrised and supplied with certain values at runtime, enabling matching of a greater variety of patterns using template references.

Improvement of template references. One could allow complex sequence definitions of templates, for example by adding predicates to template references or by using qualified references in combination with logical operators.

References

[1] E. Höfig, Template Matching on XML Streams. Proc. 24th IASTED Software Engineering, pp. 113-118, 2006.
[2] D. Olteanu, T. Furche, and F. Bry, An Efficient Single-Pass Query Evaluator for XML Data Streams. Proc. 19th Annual ACM Symposium on Applied Computing (SAC), 2004.
[3] O. Becker, Serielle Transformation von XML - Probleme, Methoden, Lösungen. PhD thesis, Humboldt-University, Berlin, 2004.
[4] L. Fegaras, The Joy of SAX. Proc. 1st International Workshop on XQuery Implementation, Experience and Perspectives (ACM SIGMOD), pp. 61-66, 2004.
[5] R. Holt, A. Schürr, S. E. Sim, and A. Winter, Graph eXchange Language. www.gupro.de/GXL/
[6] E. Höfig and A. Hinnerichs, Template Specification Language Website. http://sourceforge.net/projects/tmpl
[7] O. Henniger and P. Neumann, Test case generation based on formal specifications in Estelle. In J.-D. Decotignie, ed., Proc. of the 1st IEEE International Workshop on Factory Communication Systems, 1995.
[8] F. Harary, Graph theory (Reading, Mass.: Addison-Wesley Pub. Co., 1969).
[9] T.J. Green, G. Miklau, M. Onizuka, and D. Suciu, Processing XML streams with deterministic automata. Proc. 9th International Conference on Database Theory, pp. 173-189, 2003.
[10] L. Mignet, D. Barbosa, and P. Veltri, The XML Web: a First Study. Proc. 12th International Conference on World Wide Web, pp. 500-510, 2003.
