05.03.2016 Views

Programming in Scala”

fpiscompanion

fpiscompanion

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Chapter notes 24<br />

grammars require little lookahead. Interest<strong>in</strong>gly, the grammar ends up look<strong>in</strong>g almost the same with<br />

the <strong>in</strong>troduction of an explicit commit: the leaf-level token parser is usually wrapped <strong>in</strong> commit, and<br />

higher levels of the grammar that require more lookahead use attempt.<br />

Input handl<strong>in</strong>g: Is the <strong>in</strong>put type fixed to be Str<strong>in</strong>g or can parsers be polymorphic over any sequence<br />

of tokens? The library developed <strong>in</strong> this chapter chose Str<strong>in</strong>g for simplicity and speed. Separate<br />

tokenizers are not as commonly used with parser comb<strong>in</strong>ators–the ma<strong>in</strong> reason to use a separate<br />

tokenizer would be for speed, which is usually addressed by build<strong>in</strong>g a few fast low-level primitive<br />

like regex (which checks whether a regular expression matches a prefix of the <strong>in</strong>put str<strong>in</strong>g).<br />

Hav<strong>in</strong>g the <strong>in</strong>put type be an arbitrary Seq[A] results <strong>in</strong> a library that could <strong>in</strong> pr<strong>in</strong>ciple be used for<br />

other sorts of computations (for <strong>in</strong>stance, we can “parse” a sum from a Seq[Int]). There are usually<br />

better ways of do<strong>in</strong>g these other sorts of tasks (see the discussion of stream process<strong>in</strong>g <strong>in</strong> part 4 of<br />

the book) and we have found that there isn’t much advantage <strong>in</strong> generaliz<strong>in</strong>g the library we have<br />

here to other <strong>in</strong>put types.<br />

We can def<strong>in</strong>e a similar set of comb<strong>in</strong>ators for pars<strong>in</strong>g b<strong>in</strong>ary formats. B<strong>in</strong>ary formats rarely have to<br />

deal with lookahead, ambiguity, and backtrack<strong>in</strong>g, and have simpler error report<strong>in</strong>g needs, so often<br />

a more specialized b<strong>in</strong>ary pars<strong>in</strong>g library is sufficient. Some examples of b<strong>in</strong>ary pars<strong>in</strong>g libraries<br />

are Data.B<strong>in</strong>ary¹⁰⁷ <strong>in</strong> Haskell, which has <strong>in</strong>spired similar libraries <strong>in</strong> other languages, for <strong>in</strong>stance<br />

scodec¹⁰⁸ <strong>in</strong> Scala.<br />

You may be <strong>in</strong>terested to explore generaliz<strong>in</strong>g or adapt<strong>in</strong>g the library we wrote enough to handle<br />

b<strong>in</strong>ary pars<strong>in</strong>g as well (to get good performance, you may want to try us<strong>in</strong>g the specialized<br />

annotation¹⁰⁹ to avoid the overhead of box<strong>in</strong>g and unbox<strong>in</strong>g that would otherwise occur).<br />

Stream<strong>in</strong>g pars<strong>in</strong>g and push vs. pull: It is possible to make a parser comb<strong>in</strong>ator library that can<br />

operate on huge <strong>in</strong>puts, larger than what is possible or desirable to load fully <strong>in</strong>to memory. Related<br />

to this is is the ability to produce results <strong>in</strong> a stream<strong>in</strong>g fashion–for <strong>in</strong>stance, as the raw bytes of<br />

an HTTP request are be<strong>in</strong>g read, the parse result is be<strong>in</strong>g constructed. The two general approaches<br />

here are to allow the parser to pull more <strong>in</strong>put from its <strong>in</strong>put source, and <strong>in</strong>vert<strong>in</strong>g control and<br />

push<strong>in</strong>g chunks of <strong>in</strong>put to our parsers (this is the approach taken by the popular Haskell library<br />

attoparsec¹¹⁰), which may report they are f<strong>in</strong>ished, failed, or requir<strong>in</strong>g more <strong>in</strong>put after each chunk.<br />

We will discuss this more <strong>in</strong> part 4 when discuss<strong>in</strong>g stream process<strong>in</strong>g.<br />

Handl<strong>in</strong>g of ambiguity and left-recursion: The parsers developed <strong>in</strong> this chapter scan <strong>in</strong>put from left<br />

to right and use a left-biased or comb<strong>in</strong>ator which selects the “leftmost” parse if the grammar is<br />

ambiguous–these are called LL parsers¹¹¹. This is a design choice. There are other ways to handle<br />

ambiguity <strong>in</strong> the grammar–<strong>in</strong> GLR pars<strong>in</strong>g¹¹², roughly, the or comb<strong>in</strong>ator explores both branches<br />

simultaneously and the parser can return multiple results if the <strong>in</strong>put is ambiguous. (See Daniel<br />

¹⁰⁷http://code.haskell.org/b<strong>in</strong>ary/<br />

¹⁰⁸https://github.com/scodec/scodec<br />

¹⁰⁹http://www.scala-notes.org/2011/04/specializ<strong>in</strong>g-for-primitive-types/<br />

¹¹⁰http://hackage.haskell.org/packages/archive/attoparsec/0.10.2.0/doc/html/Data-Attoparsec-ByteStr<strong>in</strong>g.html<br />

¹¹¹http://en.wikipedia.org/wiki/LL_parser<br />

¹¹²http://en.wikipedia.org/wiki/GLR_parser

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!