A Library for Processing<br />
Ad-hoc Data in Haskell<br />
Embedding a Data Description Language<br />
Yan Wang and Veronica Gaspes<br />
Halmstad University, Sweden<br />
IFL2008, Sep 12 2008<br />
Yan Wang, Halmstad University, IFL08
Data Is Everywhere<br />
• Standardized data formats<br />
– HTML<br />
– JPEG & MPEG<br />
– XML<br />
– Databases<br />
– … …<br />
• Tools<br />
– Visualizers<br />
– Languages<br />
– Standard libraries<br />
– Transformers<br />
– … …<br />
Data Is Everywhere<br />
• Ad-hoc data formats (non-standard data formats)<br />
– In geography<br />
– In chemistry<br />
– In genetics<br />
– In finance<br />
– … …<br />
• Tools not available<br />
– Parsers<br />
– Queriers<br />
– Visualizers<br />
– Transformers<br />
– … …<br />
Ad-hoc Data in Business<br />
[Figure: a train e-ticket and a flight e-ticket, with the time, date, departure and arrival, and transport fields marked]<br />
Ad-hoc Binary Data in Networks<br />
YMSG Packet -- Yahoo Instant Message<br />
Src 83.178.165.157<br />
Dst 76.13.15.53<br />
Existing Approaches<br />
• Conventional languages<br />
– C, Java, etc.<br />
– Time-consuming and error-prone<br />
• Traditional parsers<br />
– Yacc, Happy, Parsec<br />
– Heavyweight<br />
• Data description languages<br />
– PADS, DataScript, PacketTypes<br />
– Difficult to extend<br />
Data Description Calculus<br />
• DDC: the calculus of dependent types for describing data.<br />
– Base types: atomic pieces of data, e.g., intFW(3), stringUtil(ʻ.ʼ)<br />
– Type constructors: richer structures, e.g., {x:intFW(3) | x …}<br />
Our Approach<br />
• Embedding a DDL into Haskell<br />
– Data formats are described in dependent types using<br />
• Primitive parsers (Base types)<br />
• Parser combinators (Type constructors)<br />
[Diagram: a type description τ is mapped to a parser of type AdhocParser t a, which reads input [t] and yields a representation a together with a parse descriptor PD]<br />
Base Types<br />
• C(e)<br />
class Basetype t a where ... ...<br />
base :: (Basetype t a) => AdhocParser t a -- [C()]<br />
baselen :: (Basetype t a) => Int -> AdhocParser t a -- [C(n)]<br />
baseend :: (Basetype t a) => t -> AdhocParser t a -- [C(t)]<br />
• Examples<br />
– string(n), sample input: 123456<br />
stringlen :: Int -> AdhocParser Char String<br />
stringlen = baselen<br />
stringlen 5<br />
– int(t), sample input: 654,321<br />
intend :: Char -> AdhocParser Char Int<br />
intend = baseend<br />
intend ','<br />
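The two example base types can be illustrated with a minimal sketch. It assumes a simplified parser type with no parse descriptor: `Parser` and `runParser` are illustrative names, not the library's `AdhocParser`. `stringlen` and `intend` reuse the slide's names but are written out directly instead of going through the `Basetype` class, and the sketch assumes the terminator is left in the input.

```haskell
import Data.Char (isDigit)

-- Simplified stand-in for AdhocParser (assumption): no parse descriptor,
-- failure is reported as Nothing.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

-- string(n): consume exactly n characters.
stringlen :: Int -> Parser Char String
stringlen n = Parser $ \input ->
  let (taken, rest) = splitAt n input
  in if length taken == n then Just (taken, rest) else Nothing

-- int(t): read a non-empty run of digits up to the terminator t.
-- Assumption: the terminator itself is not consumed.
intend :: Char -> Parser Char Int
intend term = Parser $ \input ->
  case span (/= term) input of
    (digits, rest@(_ : _))
      | not (null digits) && all isDigit digits -> Just (read digits, rest)
    _ -> Nothing
```

With these definitions, `runParser (stringlen 5) "123456"` yields the first five characters and leaves `"6"`, and `runParser (intend ',') "654,321"` yields 654 and leaves `",321"`.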
Constraint<br />
• {x :τ | e}<br />
constrainp :: (a -> Bool) -- e<br />
-> AdhocParser t a -- [τ]<br />
-> AdhocParser t (Either a a) -- [{x :τ | e}]<br />
• Examples<br />
– sample inputs: 123,456 and 654,321<br />
{x :int() | \x -> x>0 && x<…}<br />
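A constrained type could be sketched as follows, again over a simplified parser type (`Parser` is an illustrative stand-in, not the library's `AdhocParser`). Two things are assumptions here: that `Right` marks a value satisfying the predicate and `Left` one violating it, and the upper bound 500, which is chosen only for illustration.

```haskell
import Data.Char (isDigit)

-- Simplified stand-in for AdhocParser (assumption): no parse descriptor.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

-- int(): read a non-empty run of leading digits.
int :: Parser Char Int
int = Parser $ \input ->
  case span isDigit input of
    ([], _)        -> Nothing
    (digits, rest) -> Just (read digits, rest)

-- {x : τ | e}: run the underlying parser, then tag its value.
-- Assumption: Right = predicate holds, Left = predicate violated.
constrainp :: (a -> Bool) -> Parser t a -> Parser t (Either a a)
constrainp ok (Parser p) = Parser $ \input ->
  case p input of
    Nothing        -> Nothing
    Just (x, rest) -> Just (if ok x then Right x else Left x, rest)

-- Hypothetical bound, for illustration only.
small :: Parser Char (Either Int Int)
small = constrainp (\x -> x > 0 && x < 500) int
```

Under these assumptions `small` tags 123 from `"123,456"` as `Right` and 654 from `"654,321"` as `Left`, so a constraint violation is recorded without discarding the parsed value.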
Dependent Pairs<br />
• Σ x :τ1 .τ2<br />
sigmap :: AdhocParser t a -- [τ1]<br />
-> (a -> AdhocParser t b) -- [τ2 (x)]<br />
-> AdhocParser t (a, b) -- [Σ x :τ1 .τ2]<br />
• Examples<br />
– sample inputs: 'HELLO' and "HELLO"<br />
Σ x :char().stringend(x)<br />
s1 = sigmap char stringend<br />
– sample inputs: 5HELLO and 6HELLO.<br />
Σ len :int().stringlen(len)<br />
s2 = sigmap int stringlen<br />
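The dependency between the two components of a pair can be sketched like this. `Parser` is a simplified stand-in for `AdhocParser` (an assumption, with no parse descriptor), and the primitives `char`, `int`, `stringlen` and `stringend` are written out by hand here. The sketch also assumes `stringend` leaves its terminator unconsumed.

```haskell
import Data.Char (isDigit)

-- Simplified stand-in for AdhocParser (assumption): no parse descriptor.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

-- Σ x:τ1.τ2: the second parser is built from the first parser's result.
sigmap :: Parser t a -> (a -> Parser t b) -> Parser t (a, b)
sigmap (Parser p) next = Parser $ \input -> do
  (x, rest)  <- p input
  (y, rest') <- runParser (next x) rest
  Just ((x, y), rest')

char :: Parser Char Char
char = Parser $ \input -> case input of
  (c : rest) -> Just (c, rest)
  []         -> Nothing

int :: Parser Char Int
int = Parser $ \input ->
  case span isDigit input of
    ([], _)        -> Nothing
    (digits, rest) -> Just (read digits, rest)

stringlen :: Int -> Parser Char String
stringlen n = Parser $ \input ->
  let (taken, rest) = splitAt n input
  in if length taken == n then Just (taken, rest) else Nothing

-- Read up to the terminator; assumption: the terminator stays in the input.
stringend :: Char -> Parser Char String
stringend term = Parser $ \input ->
  case break (== term) input of
    (body, rest@(_ : _)) -> Just (body, rest)
    _                    -> Nothing

s1 :: Parser Char (Char, String)
s1 = sigmap char stringend   -- the opening quote decides the terminator

s2 :: Parser Char (Int, String)
s2 = sigmap int stringlen    -- the length prefix decides how much to read
```

Here `s1` handles both quoting styles with one description, because the character read first is reused as the terminator, and `s2` reads `5HELLO` as the pair of 5 and `"HELLO"`.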
Union<br />
• τ1 + τ2<br />
orp :: AdhocParser t a -- [τ1]<br />
-> AdhocParser t b -- [τ2]<br />
-> AdhocParser t (Either a b) -- [τ1 + τ2]<br />
• Examples<br />
– sample inputs: "123.45 is a float" and "100 is not a float."<br />
float() + int()<br />
num = orp float int<br />
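A union can be sketched as a left-biased choice that retries the second parser on the same input. As before, `Parser` is a simplified stand-in for `AdhocParser` (an assumption). The `float` primitive below accepts only the narrow form "digits, dot, digits" and is written for illustration; it is not the library's definition.

```haskell
import Data.Char (isDigit)

-- Simplified stand-in for AdhocParser (assumption): no parse descriptor.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

-- τ1 + τ2: try the left branch; on failure try the right branch
-- from the same starting input.
orp :: Parser t a -> Parser t b -> Parser t (Either a b)
orp (Parser p) (Parser q) = Parser $ \input ->
  case p input of
    Just (x, rest) -> Just (Left x, rest)
    Nothing        -> case q input of
      Just (y, rest) -> Just (Right y, rest)
      Nothing        -> Nothing

int :: Parser Char Int
int = Parser $ \input ->
  case span isDigit input of
    ([], _)        -> Nothing
    (digits, rest) -> Just (read digits, rest)

-- Illustrative float(): digits, a dot, digits.
float :: Parser Char Double
float = Parser $ \input ->
  case span isDigit input of
    (whole@(_ : _), '.' : more) ->
      case span isDigit more of
        (frac@(_ : _), rest) -> Just (read (whole ++ "." ++ frac), rest)
        _                    -> Nothing
    _ -> Nothing

num :: Parser Char (Either Double Int)
num = orp float int
```

The order of the branches matters in this sketch: with `orp int float`, the input `123.45` would be accepted as the integer 123 and the fractional part left unread.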
Adding Tools<br />
[Diagram: the type description τ yields a parser AdhocParser t a over ad-hoc data [t]; a pretty printer turns the representation a into a pretty document (Doc), and an error reporter turns the parse descriptor PD into an error report (ErrRep)]<br />
Example: YMSG Packet<br />
40 fe 20 00 06 00 00 00<br />
06 00 00 00 08 00 45 00<br />
00 83 c7 f6 40 00 80 06<br />
e8 ec 53 b2 9a 9d 4c 0d<br />
0f 35 0d 25 13 ba a0 d4<br />
11 3f c7 5d 20 be 50 18<br />
f7 29 dd 17 00 00 59 4d<br />
53 47 00 0f 00 00 00 47<br />
00 06 5a 55 aa 56 00 49<br />
c6 af 31 c0 80 77 61 6e<br />
67 6b 69 74 38 36 c0 80<br />
35 c0 80 74 61 72 65 6b<br />
31 32 61 6c 79 c0 80 31<br />
34 c0 80 48 65 6c 6c 6f<br />
c0 80 39 37 c0 80 31 c0<br />
80 36 33 c0 80 3b 30 c0<br />
80 36 34 c0 80 30 c0 80<br />
32 30 36 c0 80 31 c0 80<br />
YMSG<br />
TCP<br />
IP<br />
1<br />
type HexChar = Char<br />
instance Basetype HexChar Int where<br />
... ...<br />
instance Basetype HexChar Char where<br />
... ...<br />
2<br />
intlen :: Int -> AdhocParser HexChar Int<br />
intlen = baselen<br />
charlen :: Int -> AdhocParser HexChar Char<br />
charlen = baselen<br />
… …<br />
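What fixed-length base types over hexadecimal input might do can be sketched as follows. `Parser` is a simplified stand-in for `AdhocParser`, and `times` is a hypothetical helper; neither comes from the library. The sketch assumes the length argument counts hex digits and that whitespace has already been removed from the dump.

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

type HexChar = Char

-- Simplified stand-in for AdhocParser (assumption): no parse descriptor.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

-- intlen n: read exactly n hex digits as an unsigned integer.
intlen :: Int -> Parser HexChar Int
intlen n = Parser $ \input ->
  let (digits, rest) = splitAt n input
  in if n > 0 && length digits == n && all isHexDigit digits
       then Just (foldl (\acc d -> acc * 16 + digitToInt d) 0 digits, rest)
       else Nothing

-- charlen n: read n hex digits and use the value as a character code.
charlen :: Int -> Parser HexChar Char
charlen n = Parser $ \input ->
  case runParser (intlen n) input of
    Just (code, rest) -> Just (chr code, rest)
    Nothing           -> Nothing

-- Hypothetical helper: apply a parser a fixed number of times.
times :: Int -> Parser t a -> Parser t [a]
times k p = Parser (go k)
  where
    go i input
      | i <= 0    = Just ([], input)
      | otherwise = do
          (x, rest)   <- runParser p input
          (xs, rest') <- go (i - 1) rest
          Just (x : xs, rest')
```

On bytes taken from the dump, `intlen 4` reads `13ba` as 5050, `times 4 (intlen 2)` reads `4c0d0f35` as 76, 13, 15, 53, and `times 5 (charlen 2)` reads `48656c6c6f` as `Hello`.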
3.1<br />
ippacket =<br />
  do version <- constrainp (==4) (intlen 1)<br />
     ihl <- intlen 1<br />
     ... ...<br />
     tlen <- intlen 4<br />
     src <- seqp unit unit (intlen 2)<br />
              (\xs -> length xs == 4)<br />
     dest <- … …<br />
     options <- orp<br />
                  (seqp unit unit (intlen 8)<br />
                     (\xs -> length xs == (ihl-5)))<br />
                  unit<br />
     (port, sender, receiver, msg) <- tcppacket<br />
     return (Ymsg src dest sender receiver msg)<br />
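The monadic style of `ippacket` can be tried out with a self-contained sketch of the IPv4 header prefix. Everything here is a simplification under stated assumptions: `Parser` stands in for `AdhocParser`, `satisfying` fails outright where the slides' `constrainp` returns an `Either`, and `skip` and `IpInfo` are hypothetical names. The field widths follow the standard IPv4 header layout, counted in hex digits.

```haskell
import Control.Monad (replicateM)
import Data.Char (digitToInt, isHexDigit)

-- Simplified stand-in for AdhocParser (assumption): no parse descriptor.
newtype Parser t a = Parser { runParser :: [t] -> Maybe (a, [t]) }

instance Functor (Parser t) where
  fmap f (Parser p) = Parser $ \input -> do
    (x, rest) <- p input
    Just (f x, rest)

instance Applicative (Parser t) where
  pure x = Parser $ \input -> Just (x, input)
  Parser pf <*> Parser px = Parser $ \input -> do
    (f, rest)  <- pf input
    (x, rest') <- px rest
    Just (f x, rest')

instance Monad (Parser t) where
  Parser p >>= k = Parser $ \input -> do
    (x, rest) <- p input
    runParser (k x) rest

-- Read exactly n hex digits as an unsigned integer.
intlen :: Int -> Parser Char Int
intlen n = Parser $ \input ->
  let (digits, rest) = splitAt n input
  in if n > 0 && length digits == n && all isHexDigit digits
       then Just (foldl (\acc d -> acc * 16 + digitToInt d) 0 digits, rest)
       else Nothing

-- Hypothetical: keep a result only if it satisfies the predicate.
satisfying :: (a -> Bool) -> Parser t a -> Parser t a
satisfying ok p = do
  x <- p
  if ok x then return x else Parser (const Nothing)

-- Hypothetical: skip exactly n tokens.
skip :: Int -> Parser t ()
skip n = Parser $ \input ->
  let (dropped, rest) = splitAt n input
  in if length dropped == n then Just ((), rest) else Nothing

data IpInfo = IpInfo
  { ipIhl :: Int, ipTotalLen :: Int, ipSrc :: [Int], ipDest :: [Int] }
  deriving (Eq, Show)

ipHeader :: Parser Char IpInfo
ipHeader = do
  _    <- satisfying (== 4) (intlen 1)  -- version must be 4
  ihl  <- intlen 1                      -- header length in 32-bit words
  skip 2                                -- type of service
  tlen <- intlen 4                      -- total length in bytes
  skip 16                               -- id, flags, TTL, protocol, checksum
  src  <- replicateM 4 (intlen 2)       -- four address bytes
  dest <- replicateM 4 (intlen 2)
  skip (8 * (ihl - 5))                  -- options: 8 hex digits per extra word
  return (IpInfo ihl tlen src dest)
```

Fed the dump from byte `45` onward with whitespace removed, this sketch reports a header length of 5 words, a total length of 0x83 = 131 bytes, and destination bytes 76, 13, 15, 53.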
3.2<br />
tcppacket =<br />
  do<br />
     ... ...<br />
     port <- constrainp (== 5050) (intlen 1)<br />
     ... ...<br />
     (sender, receiver, msg) <- ymsgpacket<br />
     return (port, sender, receiver, msg)<br />
ymsgpacket =<br />
  do<br />
     ... ...<br />
     return (sender, receiver, msg)<br />
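The body of `ymsgpacket` is elided on the slide, so the following is only an illustration of how the payload bytes in the dump could be split into fields. It assumes, as the dump suggests, that every field ends with the two bytes `c0 80` and that fields alternate between keys and values. `bytes`, `fields` and `pairs` are hypothetical helper names.

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- Decode an even-length string of hex digits into byte values.
bytes :: String -> Maybe [Int]
bytes [] = Just []
bytes (h : l : rest)
  | isHexDigit h && isHexDigit l =
      ((digitToInt h * 16 + digitToInt l) :) <$> bytes rest
bytes _ = Nothing

-- Split a byte list into fields, each ended by the pair 0xC0 0x80
-- (assumed separator). A trailing unterminated field is kept if non-empty.
fields :: [Int] -> [[Int]]
fields = go []
  where
    go acc (0xC0 : 0x80 : rest) = reverse acc : go [] rest
    go acc (b : rest)           = go (b : acc) rest
    go acc []                   = [reverse acc | not (null acc)]

-- Group consecutive fields into (key, value) pairs of text.
pairs :: [[Int]] -> [(String, String)]
pairs (k : v : rest) = (map chr k, map chr v) : pairs rest
pairs _              = []
```

Applied to the dump bytes `31 34 c0 80 48 65 6c 6c 6f c0 80`, the sketch gives the key `14` with the value `Hello`, which is the message text shown in the final result.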
ippacket<br />
Yahoo msg in IPv4:<br />
from Alice (83.178.165.157)<br />
to Bob (76.13.15.53)<br />
on port 5050<br />
msg is "Hello"<br />
Thanks for your attention!<br />
Questions & Suggestions?<br />
Implementation<br />
• newtype AdhocParser t a<br />
= P (([t], PD) -> (Either String a, [t], PD))<br />
• data PD = MkPD Int ErrCode Span Body<br />
data ErrCode = Ok | Err | Fail<br />
type Span = (Offset, Offset)<br />
data Body = Unit | Pair PD PD | Or (Either PD PD)<br />
| Constrain PD | Seq Int [PD]<br />
| Scan (Maybe (Int,PD)) | Struct [PD]
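How a parse descriptor might be threaded through a parser of this shape can be sketched as follows. The type declarations follow the slide; the rest is assumed for illustration: `Offset` as `Int`, the `Int` field of `MkPD` as an error count, the incoming descriptor's span end as the current position, and the helpers `item`, `pairp` and `run`.

```haskell
type Offset = Int            -- assumption: offsets count tokens from 0
type Span = (Offset, Offset)

data ErrCode = Ok | Err | Fail deriving (Eq, Show)

data Body = Unit | Pair PD PD | Or (Either PD PD)
          | Constrain PD | Seq Int [PD]
          | Scan (Maybe (Int, PD)) | Struct [PD]
          deriving (Eq, Show)

-- Assumption: the Int field counts errors in the described input.
data PD = MkPD Int ErrCode Span Body deriving (Eq, Show)

newtype AdhocParser t a = P (([t], PD) -> (Either String a, [t], PD))

errs :: PD -> Int
errs (MkPD n _ _ _) = n

code :: PD -> ErrCode
code (MkPD _ c _ _) = c

spanOf :: PD -> Span
spanOf (MkPD _ _ s _) = s

-- Hypothetical primitive: read one token and describe it.
item :: AdhocParser t t
item = P $ \(input, pd) ->
  let here = snd (spanOf pd)
  in case input of
       (x : rest) -> (Right x, rest, MkPD 0 Ok (here, here + 1) Unit)
       []         -> (Left "unexpected end of input", [], MkPD 1 Fail (here, here) Unit)

-- Hypothetical pairing combinator: both descriptors are kept under Pair,
-- and the combined span runs from the first start to the second end.
pairp :: AdhocParser t a -> AdhocParser t b -> AdhocParser t (a, b)
pairp (P p) (P q) = P $ \st ->
  case p st of
    (Left e, rest, pd1)  -> (Left e, rest, pd1)
    (Right x, rest, pd1) ->
      let (r, rest', pd2) = q (rest, pd1)
          pd = MkPD (errs pd1 + errs pd2) (code pd2)
                    (fst (spanOf pd1), snd (spanOf pd2)) (Pair pd1 pd2)
      in (fmap (\y -> (x, y)) r, rest', pd)

-- Start from an empty descriptor at offset 0.
run :: AdhocParser t a -> [t] -> (Either String a, [t], PD)
run (P p) input = p (input, MkPD 0 Ok (0, 0) Unit)
```

In this sketch a failed parse still returns a descriptor, so `run (pairp item item) "a"` reports one error with code `Fail` and records that the first component covered offsets 0 to 1.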