DNA-based Cryptography

can be compressed without loss of information. Random sequences cannot be compressed and therefore have an entropy rate of 1.
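
For reference, one standard normalization that is consistent with this statement is sketched below; the exact definition the authors use appears earlier in the paper and is not shown in this excerpt, so this is an assumption rather than their formula. For a source S over an alphabet of size b,

    H_S \;=\; \lim_{n \to \infty} \frac{H(X_1, X_2, \ldots, X_n)}{n \log_2 b},

which lies in [0, 1] and equals 1 for a uniformly random source, as claimed above.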

Lossless data compression with an algorithm such as Lempel-Ziv [51] is theoretically asymptotically optimal: for sequences whose length n is large, the compression ratio approaches the entropy rate of the source. In particular, it is of the form (1 + ε(n))H_S, where ε(n) → 0 as n → ∞. Algorithms such as Lempel-Ziv build an indexed dictionary of all parsed subsequences that cannot be constructed as a catenation of current dictionary entries. Compression is performed by sequentially parsing the input text, finding maximal-length subsequences already present in the dictionary, and outputting their index numbers instead. When a subsequence is not found in the dictionary, it is added to it (including the case of single symbols). Algorithms can achieve better compression by making assumptions about the distribution of the data [4]. It is possible to use a dictionary of bounded size, consisting of the most frequent subsequences. Experimentation on a wide variety of text sources shows that this method can be used to achieve compression within a small percentage of the ideal [43]. In particular, the compression ratio is of the form (1 + ε)H_S, for a small constant ε > 0, typically at most 1/10 if the source length is large.
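
To illustrate the dictionary-based parsing just described, here is a minimal Python sketch of an LZ78-style parse. It is an illustration only, not the exact variant the paper analyzes; the function name lz78_parse and the optional max_dict_size parameter (a stand-in for the bounded-dictionary method mentioned above) are assumptions of this sketch.

    def lz78_parse(text, max_dict_size=None):
        """LZ78-style parse: emit (index, symbol) pairs, where index refers to
        the longest previously seen dictionary entry that prefixes the
        remaining input (index 0 is the empty entry)."""
        dictionary = {"": 0}          # phrase -> index
        output = []
        phrase = ""
        for symbol in text:
            candidate = phrase + symbol
            if candidate in dictionary:
                phrase = candidate    # keep extending the current match
            else:
                output.append((dictionary[phrase], symbol))
                # Add the new phrase unless the (bounded) dictionary is full.
                if max_dict_size is None or len(dictionary) < max_dict_size:
                    dictionary[candidate] = len(dictionary)
                phrase = ""
        if phrase:                    # flush the final partial match
            output.append((dictionary[phrase], ""))
        return output

    # Example over the DNA alphabet:
    print(lz78_parse("ATATATCGCGAT"))

Each emitted pair references a dictionary index rather than the subsequence itself, which is what makes the representation shorter than the source once the dictionary captures its frequent patterns.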

Lemma 1. The expected length of a parsed word is between L/(1 + ε) and L, where L = (log_b n)/H_S.

Proof. Assume the source data has an alphabet of size b. An alphabet of the same size can be used for the compressed data. The dictionary can be limited to a constant size. We can choose an algorithm that achieves a compression ratio within 1 + ε of the asymptotic optimal, for a small ε > 0. Therefore, for large n, we can expect the compression ratio to approach (1 + ε)H_S. Each parsed word is represented by an index into the dictionary, and so its size would be log_b n if the source had no redundancy. By the choice of compression algorithm, the length of the compressed data is between H_S and (1 + ε)H_S times the length of the original data. From these two facts, it follows that the expected length of a code word will be between (log_b n)/((1 + ε)H_S) and (log_b n)/H_S.
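
As a concrete illustration of Lemma 1 (with values chosen here for the example, not taken from the paper): for a DNA alphabet, b = 4, with source length n = 10^6, entropy rate H_S = 0.5, and ε = 0.1,

    L = \frac{\log_4 10^6}{0.5} = \frac{6 \log_4 10}{0.5} \approx \frac{9.97}{0.5} \approx 19.9,
    \qquad \frac{L}{1+\epsilon} \approx \frac{19.9}{1.1} \approx 18.1,

so the expected parsed-word length would lie between roughly 18 and 20 source symbols.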

Lemma 2. A parsed word has length ≤ L/p with probability ≥ 1 − p.

Proof. The probability of a parsed word having length > L/p is at most p, by Markov's inequality together with the bound on the expected length from Lemma 1.
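
The Markov step can be written out as follows (a worked version of the argument, with ℓ denoting the length of a parsed word, so that E[ℓ] ≤ L by Lemma 1; the notation ℓ is introduced here, not in the paper):

    \Pr\!\left[\ell > \tfrac{L}{p}\right] \;\le\; \frac{\mathbb{E}[\ell]}{L/p} \;\le\; \frac{L}{L/p} \;=\; p,
    \qquad\text{hence}\qquad \Pr\!\left[\ell \le \tfrac{L}{p}\right] \;\ge\; 1 - p.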

Lemma 3. If the maximum length of a parsed word is cL for a constant c > 1, then a parsed word has length ≥ c′L with probability ≥ 1 − p, where p > 1 − 1/(c(1 + ε)) and c′ = c − (c − 1/(1 + ε))/p > 0.

Proof. The maximum length of a parsed word has an upper bound in practice. We assume that this is cL for a small constant c > 1. We use Δ to denote the difference between the maximum possible and the actual length of a parsed word, and Δ̄ to denote the difference's expected value. Applying Lemma 1,
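
The proof breaks off at this point in the excerpt. A plausible completion, consistent with the constants in Lemma 3 but not taken verbatim from the paper, applies Markov's inequality to Δ. By Lemma 1, E[ℓ] ≥ L/(1 + ε), so

    \bar{\Delta} \;=\; cL - \mathbb{E}[\ell] \;\le\; \Bigl(c - \tfrac{1}{1+\epsilon}\Bigr)L,

and Markov's inequality on the nonnegative variable Δ gives

    \Pr\!\Bigl[\Delta \ge \tfrac{1}{p}\bigl(c - \tfrac{1}{1+\epsilon}\bigr)L\Bigr]
    \;\le\; \frac{\bar{\Delta}}{\tfrac{1}{p}\bigl(c - \tfrac{1}{1+\epsilon}\bigr)L} \;\le\; p,

so with probability at least 1 − p the parsed-word length ℓ = cL − Δ exceeds

    cL - \tfrac{1}{p}\bigl(c - \tfrac{1}{1+\epsilon}\bigr)L \;=\; c'L,

with c′ > 0 exactly when p > 1 − 1/(c(1 + ε)).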
