Predicates may also evaluate a boolean expression given by combinations of relational operators (=, !=, <, >, <=, >=), the assignment operator (=:), logical operators (|, &, !), brackets ((, )), variable references (using $) or the current value from the XML data stream (if an operand is left empty). Variable references may be complex, as they also need to account for template references. For example, $foo.bar[3].ding identifies the value of the field ding of the third element in the sequence bar, where bar is a field of the scalar foo. The definition of TMPL has a set of rules that govern the exact evaluation of predicates. We will not go into more detail in this article; please refer to the documentation found at the website [6].

3.5 Allowing Mixed Content

The language allows for matching of mixed content; values may appear unrestricted among the XML tags and not only as a single child of an element. Unlike other processors, the TMPL engine treats leading and trailing white space as empty content and strips it from the data.

4 Changes in the Evaluation Runtime

The implementation of the stream analysis runtime has changed considerably, as shown in figure 1. The evaluation engine now interprets EFSM template matching automata given in a proprietary binary format or GXL, which enables a more flexible handling of templates: for example, it is now possible to add or remove templates at runtime and to combine automata.
Performance of this approach is inferior to the former compiled version when only a single template is used, but superior when several templates are matched in parallel.

4.1 Automata Generation

The formalism for generated automata had to support context (which may be values from the data stream, TMPL declared variables and constants, or internal variables used by the evaluation engine) and transitions that may depend on any of them. These premises led to the adoption of an EFSM formalism. More specifically, we chose to extend the one proposed by Henniger and Neumann (see [7]) because of the above-mentioned features, the additional support for internal ɛ transitions and the ability for transitions to change the complete context of an automaton.

An EFSM is defined as a tuple (S, C, I, O, T, s₀, c₀), where S is a finite set of states, C the context (all possible values of a finite set of variables), I a non-empty finite set of input events including ɛ, and O a non-empty finite set of output events. T denotes the transition relation from current state, context and input event to the next state, modified context and output event (T ⊆ S × C × I × O × S × C).
The last two elements, s₀ and c₀, describe the initial state and the initial context values.

Every TMPL component (element, content, attribute, template reference, ...) is translated to one or more states that are combined into a graph using forward and reverse transitions. A forward transition leads towards the end state, whereas a reverse transition is used when a path cannot be completed, e.g. when a parent XML element closes. As an example, consider the following template for searching a (hypothetical) music database:

    template fooMusicList {}
    const integer c = 10;
    float ranking;
    template tSingle[] single;
    template tAlbum[] album;
    $c]>[=:ranking]single*album*

The template fooMusicList would be translated to an EFSM similar to the one shown in figure 2 (some details are omitted for brevity). A straightforward path through the generated automaton is marked with bold transition lines. Several details may be associated with a transition: context conditions, which are enclosed in square brackets; actions for changing the context, in curly brackets; an input event from the XML stream (e.g. START, END, ATTRIBUTE, ...); and an output event, or none in the case of an ɛ transition.

The first two states are used to initialise constants and output variables; the matching process starts in the START state and stops in the END state, where a matching pattern has been identified and is reported back to the analysing application. For every element the generator constructs two states that match the corresponding XML start and end elements (e.g. MATCH_AUTHOR and MATCH_AUTHOR_END). For attributes, the automaton has a state that first registers each attribute's name and then uses a single state to match all of them; content is handled similarly. Template references are translated as single states.
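As a rough illustration of the tuple (S, C, I, O, T, s₀, c₀) described in this section, the following Python sketch models guarded transitions with context-changing actions, ɛ transitions and output events. It is not the TMPL engine's implementation: the class names Transition and EFSM, the encoding of input events as a (kind, value) pair, and the reduced automaton are all assumptions made for illustration, only loosely modelled on the state names of figure 2.

```python
from dataclasses import dataclass
from typing import Callable, Optional

EPSILON = None  # stands in for the internal ɛ input event


@dataclass
class Transition:
    source: str                      # state the transition leaves
    kind: Optional[str]              # input event kind (e.g. "START"), or EPSILON
    target: str                      # state the transition enters
    guard: Callable[[dict, object], bool] = lambda ctx, value: True   # condition [..]
    action: Callable[[dict, object], None] = lambda ctx, value: None  # context change {..}
    output: Optional[str] = None     # output event, None if nothing is emitted


@dataclass
class EFSM:
    state: str          # current state, initially s0
    context: dict       # current context, initially c0
    transitions: list   # the transition relation T, as a list of Transition

    def _fire(self, kind, value=None):
        """Take the first enabled transition; return its outputs, or None if none is enabled."""
        for t in self.transitions:
            if t.source == self.state and t.kind == kind and t.guard(self.context, value):
                t.action(self.context, value)
                self.state = t.target
                return [t.output] if t.output is not None else []
        return None  # nothing enabled: the automaton stays in its current state

    def run_epsilon(self):
        """Follow ɛ transitions until none is enabled; return the collected outputs."""
        outputs = []
        while (fired := self._fire(EPSILON)) is not None:
            outputs += fired
        return outputs

    def step(self, kind, value=None):
        """Consume one input event, then any ɛ transitions it enables."""
        fired = self._fire(kind, value)
        return [] if fired is None else fired + self.run_epsilon()


# Hypothetical, heavily reduced automaton (not the one in figure 2).
TRANSITIONS = [
    Transition("INIT_CONSTANT", EPSILON, "INIT_VARIABLE",
               action=lambda ctx, v: ctx.update(c=10)),
    Transition("INIT_VARIABLE", EPSILON, "START",
               action=lambda ctx, v: ctx.update(ranking=None)),
    Transition("START", "START", "MATCH_MUSIC",
               guard=lambda ctx, v: v == "music"),
    Transition("MATCH_MUSIC", "CONTENT", "MATCH_MUSIC",
               action=lambda ctx, v: ctx.update(ranking=float(v))),
    Transition("MATCH_MUSIC", "END", "END",
               guard=lambda ctx, v: v == "music", output="TEMPLATE_MATCH"),
]

machine = EFSM("INIT_CONSTANT", {}, TRANSITIONS)
machine.run_epsilon()                      # initialise constants and variables
machine.step("START", "music")
machine.step("CONTENT", "4.5")
print(machine.step("END", "music"), machine.state, machine.context)
```

In this sketch an input event that enables no transition simply leaves the automaton in its current state, and guards are allowed to inspect the event value as well as the context, which is a simplification of the context conditions described above.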
4.2.1 Parallel Execution of Templates

The basic idea is to combine several templates; this is a standard procedure in graph theory: by using the cross product, a combined automaton may be calculated (please refer to [8]). It is still necessary to check the conditions for all transitions; the performance gain lies solely in the difference between the time needed to execute a single state transition and the time needed to execute several.

The downside of this approach is the exponential increase in states and transitions. For example, take two EFSMs S₁ and S₂ with pᵢ states and qᵢ transitions (i = 1, 2): the resulting EFSM for S₁ × S₂ would have p₁·p₂ states and at least p₁·q₂ + q₁·p₂ transitions. A combination of ten EFSMs with ten states each (e.g. templates having three elements and two content predicates) would end up with 10 billion (10¹⁰) states and far more transitions.
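The growth figures above can be checked with a small Python sketch. The helper names product_size and interleaved_product, the two toy automata and the figure of 12 transitions per automaton are assumptions chosen for illustration. The explicit construction assumes that only one component automaton moves per transition, which is one way to arrive at the lower bound p₁·q₂ + q₁·p₂; transitions in which both components move on the same input would add to it.

```python
from itertools import product


def product_size(sizes):
    """Fold (states, transitions) pairs using p1*p2 states and
    at least p1*q2 + q1*p2 transitions per pairwise combination."""
    p, q = sizes[0]
    for p_next, q_next in sizes[1:]:
        # The right-hand side is evaluated with the old p and q.
        p, q = p * p_next, p * q_next + q * p_next
    return p, q


def interleaved_product(a, b):
    """Explicit cross product of two automata given as (states, transitions),
    where each transition is (source, event, target). Only one component
    moves per combined transition."""
    states_a, trans_a = a
    states_b, trans_b = b
    states = list(product(states_a, states_b))
    transitions = [((s, x), e, (t, x)) for (s, e, t) in trans_a for x in states_b]
    transitions += [((x, s), e, (x, t)) for (s, e, t) in trans_b for x in states_a]
    return states, transitions


# Two tiny, made-up automata: 3 states / 2 transitions and 2 states / 1 transition.
A = (["a0", "a1", "a2"], [("a0", "START", "a1"), ("a1", "END", "a2")])
B = (["b0", "b1"], [("b0", "CONTENT", "b1")])

states, transitions = interleaved_product(A, B)
print(len(states), len(transitions))     # 6 7
print(product_size([(3, 2), (2, 1)]))    # (6, 7)

# Ten automata with ten states each; 12 transitions each is an assumed figure.
print(product_size([(10, 12)] * 10))
```

For the ten-automata case the fold yields 10¹⁰ states and, under the assumed 12 transitions per automaton, at least 1.2·10¹¹ transitions, which illustrates why building the full product eagerly is impractical.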
As the authors of [9] point out, a possible solution to this dilemma is to use lazy construction principles for the automata.

4.2.2 Lazy Automata Construction

Even if the hypothetical number of states of a combined EFSM grows to exorbitant numbers, only a very small portion of these states would ever be used. If an automaton is built in a lazy fashion, by constructing new states at runtime only when they are needed, this number may decrease dramatically. For example, we combined twelve templates with altogether 350 states using a lazy construction principle and ended up with an EFSM of only 169 states for a specific data stream, after further optimisations that removed redundant or empty states and transitions.

Figure 2. Automaton for template fooMusicList

4.2 Optimisation

With the interpreting approach in place, several new optimisations have become possible.
One of the most important ones was the combination of several automata into one larger EFSM.

4.2.3 State and Transition Reduction

There are several ways to further reduce the number of states and transitions when combining automata:

Merging of ɛ transitions: Any number of states that are connected solely by ɛ transitions are merged into two states with a single connecting ɛ transition.

Removal of initialisation states: Initialisation functionality does not need to be considered by the combination algorithm but may be executed separately.

Removal of default self-transitions: When no input event selects a transition, an EFSM needs to stay in its current state. This can be made the default behaviour and then does not need to be made explicit by a self-transition.

Grouping transitions: By grouping transitions with a common condition (e.g. same nesting level or