12.07.2015 Views

An efficient mechanism for Matching multiple patterns on XML Streams



This paper discusses XML stream data processing from the perspective of TMPL, a language that was specifically designed for stream analysis and not adopted from a document-based approach. TMPL was introduced in [1]; further information on related work in this field may be found in that paper as well.

3 Language Extensions

TMPL is a context-sensitive language describing patterns that are to be matched against a data stream. Syntactical structures of TMPL are specified in a syntax quite similar to XML itself and may consist of elements, attributes and data specifications in the form of so-called predicates. Predicates are denoted using square brackets and may restrict content (e.g. a statement like a=[>10] would match only if the corresponding attribute possesses a value greater than ten) or assign a value to a variable (e.g. [=:b] would assign the content of the element to a variable named b). There are additional symbols for marking a single unknown element or an element at an unknown depth (see footnote 2) in the XML data stream. A TMPL file defines a single module that contains several templates, and it is possible to define constants at the beginning of a template or module scope. Variables or constants may also be referenced in predicates (e.g. [$c] matches if an unknown element is encountered that contains the value stored in the variable referenced by the name c).
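
The three predicate kinds described above (content restriction, assignment, and variable reference) can be illustrated with a small sketch. This is not the TMPL implementation: the function name eval_predicate, the env dictionary, and the set of comparison operators beyond ">" are assumptions made purely for illustration, and only the predicate spellings [>10], [=:b], [$c] and [=$foo] are taken from the text.

```python
import operator
import re

# Assumption: only ">" appears in the text; "<" and "=" are added for illustration.
_COMPARE = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def eval_predicate(pred: str, content: str, env: dict) -> bool:
    """Evaluate one bracketed predicate against element or attribute content.

    Returns True when the predicate matches. An assignment predicate always
    matches and stores the content in env as a side effect.
    """
    text = pred.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError(f"not a predicate: {pred!r}")
    body = text[1:-1]

    # Assignment, e.g. [=:b] stores the content under the name "b".
    m = re.fullmatch(r"=:(\w+)", body)
    if m:
        env[m.group(1)] = content
        return True

    # Variable reference, e.g. [$c] or [=$c]: content must equal the stored value.
    m = re.fullmatch(r"=?\$(\w+)", body)
    if m:
        name = m.group(1)
        return name in env and env[name] == content

    # Numeric comparison, e.g. [>10].
    m = re.fullmatch(r"([<>=])\s*(-?\d+(?:\.\d+)?)", body)
    if m:
        try:
            return _COMPARE[m.group(1)](float(content), float(m.group(2)))
        except ValueError:
            return False  # non-numeric content cannot satisfy a numeric test

    raise ValueError(f"unsupported predicate: {pred!r}")


env = {}
print(eval_predicate("[>10]", "12", env))     # comparison against attribute content
print(eval_predicate("[=:b]", "hello", env))  # assignment binds b
print(eval_predicate("[$b]", "hello", env))   # reference to the bound variable
```

In this sketch a variable bound by an assignment predicate stays available to later reference predicates through the shared env dictionary; how TMPL actually scopes variables within a template or module is not modelled here.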

Based on the experience we gained while applying the language during the last year, we extended the functionality with features we were missing. These either improve expressiveness by enabling the matching of patterns that could not be identified previously (e.g. sequence matching) or increase the usability of the language by reducing possible typos or failures on the user's side (e.g. a declarative type system).

To accommodate certain new runtime features it became necessary to change the TMPL toolchain. Figure 1 depicts the current toolchain in white; the grey part shows the old approach, which generated Java code directly from the TMPL abstract syntax tree (AST) as constructed by the parser. This has been dropped in favour of a new intermediary format that codifies the automata for the runtime engine using an extended finite state machine (EFSM) formalism, which will be described in section 4.1.

The basic idea was to be able to use this format in several ways. An automaton may be saved in, or loaded from, a file using the standard Graph eXchange Language (GXL) format (for more information please refer to the website [5]). Templates in the intermediary format are used directly with the TMPL Interpreter, that is, the matching runtime engine, to search for patterns in the data stream. Other options not shown here would be to generate specialised matching code for a certain programming language directly from the intermediary format, or to use standard visualisation software to generate pictures of the state machines.

[Figure 1. TMPL tool chain. Labels: gxl, tmpl, Parser, Automata Generation, Interpreter, Java Generation, java, Compiled Runtime]

Footnote 2: Stream depth is defined as the nesting level of elements.

3.1 Syntactical Changes

The syntax of TMPL has been slightly changed to make it easier to understand. Explicit comparison and assignment operators have been introduced (e.g. [=$foo] and [=:bar]), which may be omitted when their meaning within a predicate is unambiguous. There is a shorthand syntax for other syntactical constructs as well; please refer to the website [6] for documentation about these.

A major change is the introduction of the range element, which offers a clearer definition than the former placeholder expression. One can use it to match a sequence of nested elements. As an example consider the following TMPL code:

In this case the range element stands for a sequence that has to begin with a child element of  and that must end with an element containing attribute att and child element b. The range element is matched successfully when the starting element of the sequence closes.
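
The matching rule for the range element can be sketched over a stream of start and end events. This is one possible reading of the description above, not the behaviour of the TMPL runtime: the function match_range, the event tuples, and the parent name "p" used in the usage lines are hypothetical. The sketch opens a range at a child of the given parent, marks it as satisfied once a nested element carrying the end attribute receives the required child, and reports the match only when the starting element closes.

```python
def match_range(events, parent, end_attr, end_child):
    """Yield the name of each starting element whose range matched.

    events: iterable of ("start", name, attrs) and ("end", name) tuples.
    parent: name of the element whose children may open a range (hypothetical).
    end_attr / end_child: attribute and child element that close the sequence.
    """
    stack = []          # (name, attrs) of the currently open elements
    start_depth = None  # stack depth of the element that opened the range
    satisfied = False   # True once the end condition was seen inside the range

    for event in events:
        if event[0] == "start":
            _, name, attrs = event
            # End condition: an element inside the range that carries end_attr
            # receives a child named end_child.
            if (start_depth is not None and stack
                    and name == end_child and end_attr in stack[-1][1]):
                satisfied = True
            stack.append((name, attrs))
            # A child of `parent` opens a new range if none is active.
            if start_depth is None and len(stack) >= 2 and stack[-2][0] == parent:
                start_depth = len(stack)
                satisfied = False
        else:
            name, _ = stack.pop()
            # The range is decided when its starting element closes.
            if start_depth is not None and len(stack) + 1 == start_depth:
                if satisfied:
                    yield name
                start_depth = None


events = [
    ("start", "p", {}),
    ("start", "x", {}),
    ("start", "y", {"att": "1"}),
    ("start", "b", {}),
    ("end", "b"),
    ("end", "y"),
    ("end", "x"),
    ("start", "z", {}),
    ("end", "z"),
    ("end", "p"),
]
print(list(match_range(events, "p", "att", "b")))
```

Deferring the report until the starting element closes mirrors the last sentence of the description; a runtime that reported as soon as the end condition was seen would behave differently on streams that are cut off mid-element.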
