06.12.2012 Views

Analysis of the multiplicity matching parameter in suffix trees

Analysis of the multiplicity matching parameter in suffix trees

Analysis of the multiplicity matching parameter in suffix trees

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

2005 International Conference on <strong>Analysis</strong> <strong>of</strong> Algorithms DMTCS proc. AD, 2005, 307–322<br />

<strong>Analysis</strong> <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong><br />

<strong>parameter</strong> <strong>in</strong> <strong>suffix</strong> <strong>trees</strong><br />

Mark Daniel Ward 1 and Wojciech Szpankowski 2†<br />

1 Department <strong>of</strong> Ma<strong>the</strong>matics, Purdue University, West Lafayette, IN, USA. mward@math.purdue.edu<br />

2 Department <strong>of</strong> Computer Science, Purdue University, West Lafayette, IN, USA. spa@cs.purdue.edu<br />

In a <strong>suffix</strong> tree, <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> (MMP) Mn is <strong>the</strong> number <strong>of</strong> leaves <strong>in</strong> <strong>the</strong> subtree rooted at<br />

<strong>the</strong> branch<strong>in</strong>g po<strong>in</strong>t <strong>of</strong> <strong>the</strong> (n + 1)st <strong>in</strong>sertion. Equivalently, <strong>the</strong> MMP is <strong>the</strong> number <strong>of</strong> po<strong>in</strong>ters <strong>in</strong>to <strong>the</strong> database<br />

<strong>in</strong> <strong>the</strong> Lempel-Ziv ’77 data compression algorithm. We prove that <strong>the</strong> MMP asymptotically follows <strong>the</strong> logarithmic<br />

series distribution plus some fluctuations. In <strong>the</strong> pro<strong>of</strong> we compare <strong>the</strong> distribution <strong>of</strong> <strong>the</strong> MMP <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> to its<br />

distribution <strong>in</strong> tries built over <strong>in</strong>dependent str<strong>in</strong>gs. Our results are derived by both probabilistic and analytic techniques<br />

<strong>of</strong> <strong>the</strong> analysis <strong>of</strong> algorithms. In particular, we utilize comb<strong>in</strong>atorics on words, bivariate generat<strong>in</strong>g functions, pattern<br />

<strong>match<strong>in</strong>g</strong>, recurrence relations, analytical poissonization and depoissonization, <strong>the</strong> Mell<strong>in</strong> transform, and complex<br />

analysis.<br />

Keywords: <strong>suffix</strong> <strong>trees</strong>, comb<strong>in</strong>atorics on words, pattern <strong>match<strong>in</strong>g</strong>, autocorrelation polynomial, complex asymptotics,<br />

data compression<br />

1 Introduction<br />

When transmitt<strong>in</strong>g data, <strong>the</strong> goal <strong>of</strong> source cod<strong>in</strong>g (data compression) is to represent <strong>the</strong> source with a<br />

m<strong>in</strong>imum <strong>of</strong> symbols. On <strong>the</strong> o<strong>the</strong>r hand, <strong>the</strong> goal <strong>of</strong> channel cod<strong>in</strong>g (error correction) is to represent <strong>the</strong><br />

source with a m<strong>in</strong>imum <strong>of</strong> error probability <strong>in</strong> decod<strong>in</strong>g. These goals are obviously <strong>in</strong> conflict. Traditionally,<br />

additional symbols are transmitted when perform<strong>in</strong>g error correction.<br />

In Lonardi and Szpankowski (2003), an algorithm for jo<strong>in</strong>t data compression and error correction is<br />

presented; <strong>the</strong> compression performance is not degraded because <strong>the</strong> algorithm requires no extra symbols<br />

for error correction. In this scheme, a Reed-Solomon error-correct<strong>in</strong>g code is embedded <strong>in</strong>to <strong>the</strong> Lempel-<br />

Ziv ’77 data compression algorithm (see Ziv and Lempel (1977)). Lonardi and Szpankowski utilize <strong>the</strong><br />

fact that <strong>the</strong> LZ’77 adaptive data compression algorithm is unable to remove all redundancy from <strong>the</strong><br />

source. Our goal here is to precisely determ<strong>in</strong>e <strong>the</strong> number <strong>of</strong> redundant bits that are available to be<br />

utilized <strong>in</strong> <strong>the</strong> aforementioned scheme.<br />

We recall <strong>the</strong> basic operation <strong>of</strong> <strong>the</strong> LZ’77 data compression algorithm. When n bits <strong>of</strong> <strong>the</strong> source<br />

have already been compressed, <strong>the</strong> LZ’77 encoder f<strong>in</strong>ds <strong>the</strong> longest prefix <strong>of</strong> <strong>the</strong> uncompressed data that<br />

also appears <strong>in</strong> <strong>the</strong> database (namely, <strong>the</strong> compressed portion <strong>of</strong> <strong>the</strong> data). The encoder performs <strong>the</strong><br />

compression by stor<strong>in</strong>g a po<strong>in</strong>ter <strong>in</strong>to <strong>the</strong> database (and also <strong>the</strong> length <strong>of</strong> this prefix, as well as <strong>the</strong> next<br />

character <strong>of</strong> <strong>the</strong> source). Often, this longest prefix appears more than once <strong>in</strong> <strong>the</strong> database. Each <strong>of</strong> <strong>the</strong><br />

database entries are equally eligible for use by <strong>the</strong> encoder; thus, any <strong>of</strong> <strong>the</strong> analogous po<strong>in</strong>ters <strong>in</strong>to <strong>the</strong><br />

database is suitable. In practice, <strong>the</strong> choice <strong>of</strong> po<strong>in</strong>ter among <strong>the</strong>se candidates has no significance. On <strong>the</strong><br />

o<strong>the</strong>r hand, by judiciously select<strong>in</strong>g <strong>the</strong> po<strong>in</strong>ter, some error correction can be performed. For <strong>in</strong>stance, if<br />

two po<strong>in</strong>ters are available, <strong>the</strong> encoder could easily perform a parity check by choos<strong>in</strong>g <strong>the</strong> first po<strong>in</strong>ter for<br />

“0” and <strong>the</strong> second po<strong>in</strong>ter for “1”. Lonardi and Szpankowski’s scheme for perform<strong>in</strong>g error correction is<br />

very elaborate. We refer <strong>the</strong> reader to <strong>the</strong>ir paper for more details.<br />

We let Mn denote <strong>the</strong> number <strong>of</strong> po<strong>in</strong>ters <strong>in</strong>to <strong>the</strong> database when n bits have already been compressed<br />

(as described above). Throughout this paper, we are primarily <strong>in</strong>terested <strong>in</strong> precisely determ<strong>in</strong><strong>in</strong>g <strong>the</strong><br />

asymptotics <strong>of</strong> Mn. A thorough analysis <strong>of</strong> Mn yields a characterization <strong>of</strong> <strong>the</strong> degree to which error<br />

3074.<br />

† This research was supported by NSF Grant CCR-0208709, NIH grant R01 GM068959-01, and AFOSR Grant FA8655-04-1-<br />

1365–8050 c○ 2005 Discrete Ma<strong>the</strong>matics and Theoretical Computer Science (DMTCS), Nancy, France


308 Mark Daniel Ward and Wojciech Szpankowski<br />

S 1<br />

S 2<br />

S 5<br />

Mn<br />

Fig. 1: F<strong>in</strong>d<strong>in</strong>g M I n <strong>in</strong> a trie.<br />

correction can be performed <strong>in</strong> <strong>the</strong> scheme discussed above. We note that ⌊log 2 Mn⌋ bits are available to<br />

be used for correct<strong>in</strong>g errors.<br />

Tries, especially <strong>suffix</strong> <strong>trees</strong>, provide a natural way to study Mn. We work here with str<strong>in</strong>gs <strong>of</strong> characters<br />

drawn <strong>in</strong>dependently from <strong>the</strong> b<strong>in</strong>ary alphabet A := {0, 1}. We let p denote <strong>the</strong> probability <strong>of</strong> “0” and<br />

q = 1 − p denotes <strong>the</strong> probability <strong>of</strong> “1”; without loss <strong>of</strong> generality, we assume that q ≤ p throughout <strong>the</strong><br />

discussion.<br />

We first recall <strong>the</strong> def<strong>in</strong>ition <strong>of</strong> a b<strong>in</strong>ary trie built over a set Y <strong>of</strong> n str<strong>in</strong>gs. The construction is recursive.<br />

If |Y| = 0, <strong>the</strong>n <strong>the</strong> trie is empty. If |Y| = 1, <strong>the</strong>n trie(Y) is a s<strong>in</strong>gle node. F<strong>in</strong>ally, if |Y| > 1, <strong>the</strong>n Y is<br />

partitioned <strong>in</strong>to two subsets, Y0 and Y1, such that a str<strong>in</strong>g is <strong>in</strong> Y0 if its first symbol is 0, and a str<strong>in</strong>g is <strong>in</strong><br />

Y1 if its first symbol is 1. Then trie(Y0) and trie(Y1) are each constructed <strong>in</strong> <strong>the</strong> same way, except that<br />

<strong>the</strong> splitt<strong>in</strong>g <strong>of</strong> sets at <strong>the</strong> kth step is based on <strong>the</strong> kth symbol <strong>of</strong> <strong>the</strong> str<strong>in</strong>g. This completes <strong>the</strong> def<strong>in</strong>ition<br />

<strong>of</strong> a b<strong>in</strong>ary trie.<br />

Now we briefly recall <strong>the</strong> construction <strong>of</strong> a b<strong>in</strong>ary <strong>suffix</strong> tree built over a str<strong>in</strong>g X = X1X2X3 . . .. The<br />

word X (i) = XiXi+1Xi+2 . . . is <strong>the</strong> ith <strong>suffix</strong> <strong>of</strong> X, which beg<strong>in</strong>s at <strong>the</strong> ith position <strong>of</strong> X. Then a b<strong>in</strong>ary<br />

<strong>suffix</strong> tree is precisely a b<strong>in</strong>ary trie built over <strong>the</strong> first n <strong>suffix</strong>es <strong>of</strong> X, namely X (1) , X (2) , . . . , X (n) .<br />

In a <strong>suffix</strong> tree, Mn is exactly <strong>the</strong> number <strong>of</strong> leaves <strong>in</strong> <strong>the</strong> subtree rooted at <strong>the</strong> branch<strong>in</strong>g po<strong>in</strong>t <strong>of</strong> <strong>the</strong><br />

(n + 1)st <strong>in</strong>sertion (cf. Figure 1). The str<strong>in</strong>gs <strong>in</strong> a <strong>suffix</strong> tree are highly dependent on each o<strong>the</strong>r, which<br />

apparently makes a precise analysis <strong>of</strong> Mn quite difficult; <strong>the</strong>refore, we also consider <strong>the</strong> analogous (but<br />

simpler) situation <strong>in</strong> a trie built over <strong>in</strong>dependent str<strong>in</strong>gs; namely, we study M I n, which is <strong>the</strong> number <strong>of</strong><br />

leaves <strong>in</strong> <strong>the</strong> subtree rooted at <strong>the</strong> branch<strong>in</strong>g po<strong>in</strong>t <strong>of</strong> <strong>the</strong> (n + 1)st <strong>in</strong>sertion <strong>in</strong> a trie built over n + 1<br />

<strong>in</strong>dependent b<strong>in</strong>ary str<strong>in</strong>gs. For <strong>in</strong>stance, <strong>in</strong> Figure 1, we have n = 4 and M I n = 2 because <strong>the</strong>re are two<br />

leaves (namely, S1 and S2) <strong>in</strong> <strong>the</strong> subtree rooted at <strong>the</strong> branch<strong>in</strong>g po<strong>in</strong>t <strong>of</strong> <strong>the</strong> 5th <strong>in</strong>sertion.<br />

We are primarily concerned with compar<strong>in</strong>g <strong>the</strong> distribution <strong>of</strong> Mn (a <strong>parameter</strong> <strong>of</strong> <strong>suffix</strong> <strong>trees</strong>) to <strong>the</strong><br />

distribution <strong>of</strong> M I n (a <strong>parameter</strong> <strong>of</strong> tries built over <strong>in</strong>dependent str<strong>in</strong>gs). Our approach to <strong>the</strong> pro<strong>of</strong> beg<strong>in</strong>s<br />

with <strong>the</strong> observation that a variety <strong>of</strong> <strong>parameter</strong>s have <strong>the</strong> same asymptotic behavior regardless <strong>of</strong> whe<strong>the</strong>r<br />

<strong>the</strong>y correspond to <strong>suffix</strong> <strong>trees</strong> or to tries built over <strong>in</strong>dependent str<strong>in</strong>gs. This was observed <strong>in</strong> Szpankowski<br />

(1993) and <strong>the</strong>n made precise <strong>in</strong> Jacquet and Szpankowski (1994), where <strong>the</strong> typical depth <strong>in</strong> a <strong>suffix</strong> tree<br />

is proven to be asymptotically <strong>the</strong> same as <strong>the</strong> typical depth <strong>in</strong> a trie built over <strong>in</strong>dependent str<strong>in</strong>gs when<br />

<strong>the</strong> underly<strong>in</strong>g source is i.i.d. An extension <strong>of</strong> such results to an underly<strong>in</strong>g Markovian model is presented<br />

<strong>in</strong> Fayolle and Ward (2005).<br />

The limit<strong>in</strong>g distribution <strong>of</strong> several trie <strong>parameter</strong>s is given <strong>in</strong> Jacquet and Régnier (1986) and Jacquet<br />

and Régnier (1987). More results about trie <strong>parameter</strong>s are found <strong>in</strong> Kirschenh<strong>of</strong>er and Prod<strong>in</strong>ger (1988).<br />

The variance <strong>of</strong> <strong>the</strong> external path <strong>in</strong> a symmetric trie is given <strong>in</strong> Kirschenh<strong>of</strong>er et al. (1989). The depth <strong>of</strong><br />

a digital trie with an underly<strong>in</strong>g Markovian dependency is analyzed <strong>in</strong> Jacquet and Szpankowski (1991).<br />

Many results about a variety <strong>of</strong> tree structures are collected <strong>in</strong> Szpankowski (2001). Average-case studies<br />

<strong>of</strong> several <strong>parameter</strong>s <strong>of</strong> <strong>suffix</strong> <strong>trees</strong> are found <strong>in</strong> Fayolle (2004) and Szpankowski (1993).<br />

We briefly summarize <strong>the</strong> methodology <strong>of</strong> our pro<strong>of</strong>. Our goal is to compare <strong>the</strong> distribution <strong>of</strong> Mn (<strong>the</strong><br />

<strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>of</strong> a <strong>suffix</strong> tree) to <strong>the</strong> distribution <strong>of</strong> M I n (<strong>the</strong> MMP <strong>of</strong> a trie built over<br />

<strong>in</strong>dependent str<strong>in</strong>gs). Our pro<strong>of</strong> that <strong>the</strong>se two <strong>parameter</strong>s have <strong>the</strong> same asymptotic distribution consists<br />

S 3<br />

S 4


<strong>Analysis</strong> <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> 309<br />

<strong>of</strong> several steps. We first derive bivariate generat<strong>in</strong>g functions for Mn and M I n, denoted as M(z, u) and<br />

M I (z, u), respectively. We noted above that a <strong>suffix</strong> tree is built over <strong>the</strong> <strong>suffix</strong>es X (1) , X (2) , . . . , X (n)<br />

<strong>of</strong> a str<strong>in</strong>g X. These <strong>suffix</strong>es are highly dependent on each o<strong>the</strong>r. Therefore, <strong>in</strong> deriv<strong>in</strong>g <strong>the</strong> bivariate<br />

generat<strong>in</strong>g function M(z, u), an <strong>in</strong>terest<strong>in</strong>g obstacle arises: We need to determ<strong>in</strong>e <strong>the</strong> degree to which<br />

a <strong>suffix</strong> <strong>of</strong> X can overlap with itself. Fortunately, <strong>the</strong> autocorrelation polynomial Sw(z) <strong>of</strong> a word w<br />

measures <strong>the</strong> amount <strong>of</strong> overlap <strong>of</strong> a word w with itself. The autocorrelation polynomial was <strong>in</strong>troduced <strong>in</strong><br />

Guibas and Odlyzko (1981) and was utilized extensively <strong>in</strong> Régnier and Szpankowski (1998) and Lothaire<br />

(2005). The autocorrelation polynomial is def<strong>in</strong>ed as<br />

Sw(z) = �<br />

k∈P(w)<br />

P(w m k+1)z m−k<br />

where m = |w| and where P(w) denotes <strong>the</strong> set <strong>of</strong> positions k <strong>of</strong> w satisfy<strong>in</strong>g w1 . . . wk = wm−k+1 . . . wm,<br />

that is, w’s prefix <strong>of</strong> length k is equal to w’s <strong>suffix</strong> <strong>of</strong> length k. Us<strong>in</strong>g <strong>the</strong> autocorrelation polynomial,<br />

we can overcome <strong>the</strong> difficulties <strong>in</strong>herent <strong>in</strong> <strong>the</strong> fact that <strong>suffix</strong>es <strong>of</strong> a word X overlap with each o<strong>the</strong>r.<br />

By utiliz<strong>in</strong>g Sw(z), we are able to obta<strong>in</strong> a succ<strong>in</strong>ct way <strong>of</strong> describ<strong>in</strong>g <strong>the</strong> bivariate generat<strong>in</strong>g function<br />

M(z, u). Fortunately, <strong>the</strong> autocorrelation polynomial is well-understood. Note that <strong>the</strong> autocorrelation<br />

polynomial Sw(z) has a P(w m k+1 )zm−k term if and only if w has an overlap with itself <strong>of</strong> length k. All<br />

words w overlap with <strong>the</strong>mselves trivially, so all autocorrelation polynomials have a constant term (i.e.,<br />

z m−m = z 0 = 1 term). On <strong>the</strong> o<strong>the</strong>r hand, with high probability, w has very few large nontrivial overlaps<br />

with itself. Therefore, with high probability, all nontrivial overlaps <strong>of</strong> w with itself are small; such<br />

overlaps correspond to high-degree terms <strong>of</strong> Sw(z).<br />

In order to compare M(z, u) and M I (z, u), we utilize complex analysis. Specifically, we take advantage<br />

<strong>of</strong> Cauchy’s <strong>the</strong>orem, which allows us to analyze <strong>the</strong> poles <strong>of</strong> <strong>the</strong> generat<strong>in</strong>g functions M(z, u) and<br />

M I (z, u) <strong>in</strong> order to obta<strong>in</strong> precise <strong>in</strong>formation about <strong>the</strong> distributions <strong>of</strong> Mn and M I n. Dur<strong>in</strong>g this residue<br />

analysis, it is necessary that <strong>the</strong> generat<strong>in</strong>g function for Mn is analytically cont<strong>in</strong>ued from <strong>the</strong> unit disk to<br />

a larger disk.<br />

Our ultimate conclusion is that <strong>the</strong> distribution <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> Mn is asymptotically<br />

<strong>the</strong> same <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> and <strong>in</strong>dependent tries, i.e., Mn and M I n have asymptotically <strong>the</strong> same<br />

distribution.<br />

The asymptotics for <strong>the</strong> distribution and factorial moments <strong>of</strong> M I n were given <strong>in</strong> Ward and Szpankowski<br />

(2004). Specifically, M I n asymptotically follows <strong>the</strong> logarithmic series distribution (plus some fluctuations<br />

when ln p/ ln q is rational). S<strong>in</strong>ce we prove here that Mn and M I n have asymptotically <strong>the</strong> same distribution,<br />

<strong>the</strong>n as a consequence, we see that Mn also asymptotically follows <strong>the</strong> logarithmic series distribution.<br />

One strik<strong>in</strong>g property <strong>of</strong> this distribution is <strong>the</strong> high concentration around <strong>the</strong> mean. We see that E[Mn] is<br />

asymptotically 1<br />

h (where h denotes <strong>the</strong> entropy <strong>of</strong> <strong>the</strong> source) and also Mn is highly concentrated around<br />

this average value; this property <strong>of</strong> Mn is very desirable for <strong>the</strong> error correction scheme described <strong>in</strong><br />

Lonardi and Szpankowski (2003).<br />

This paper is a concise version <strong>of</strong> <strong>the</strong> first author’s Ph.D. <strong>the</strong>sis; see Ward (2005).<br />

2 Ma<strong>in</strong> Results<br />

We consider <strong>the</strong> str<strong>in</strong>g X = X1X2X3 . . ., where <strong>the</strong> Xi’s are i.i.d. random variables on A := {0, 1} with<br />

P(Xi = 0) = p and P(Xi = 1) = q. (Without loss <strong>of</strong> generality, we assume throughout <strong>the</strong> discussion<br />

that q ≤ p.) Let X (i) denote <strong>the</strong> ith <strong>suffix</strong> <strong>of</strong> X. In o<strong>the</strong>r words, X (i) = XiXi+1Xi+2 . . .. Consider <strong>the</strong><br />

longest prefix w <strong>of</strong> X (n+1) such that X (i) also has w as a prefix, for some i with 1 ≤ i ≤ n. Then Mn is<br />

def<strong>in</strong>ed as <strong>the</strong> number <strong>of</strong> X (i) ’s (with 1 ≤ i ≤ n) that have w as a prefix. So<br />

Mn = #{1 ≤ i ≤ n | X (i) = XiXi+1Xi+2 . . . has w as a prefix} . (2)<br />

An alternate def<strong>in</strong>ition <strong>of</strong> Mn is available via <strong>suffix</strong> <strong>trees</strong>. First, consider a <strong>suffix</strong> tree built from <strong>the</strong> first<br />

n + 1 <strong>suffix</strong>es <strong>of</strong> X. Next, consider <strong>the</strong> <strong>in</strong>sertion po<strong>in</strong>t <strong>of</strong> <strong>the</strong> (n + 1)st <strong>suffix</strong>. Then Mn is exactly <strong>the</strong><br />

number <strong>of</strong> leaves <strong>in</strong> <strong>the</strong> subtree rooted at <strong>the</strong> branch<strong>in</strong>g po<strong>in</strong>t <strong>of</strong> <strong>the</strong> (n + 1)st <strong>in</strong>sertion. For <strong>in</strong>stance,<br />

suppose that <strong>the</strong> (n + 1)st <strong>suffix</strong> starts with wβ for some w ∈ A ∗ and β ∈ A. Then, exam<strong>in</strong><strong>in</strong>g <strong>the</strong> first n<br />

<strong>suffix</strong>es, if <strong>the</strong>re are exactly k <strong>suffix</strong>es that beg<strong>in</strong> with wα (where α = 1 − β, i.e., {α, β} = {0, 1}), and<br />

<strong>the</strong> o<strong>the</strong>r n − k <strong>suffix</strong>es do not beg<strong>in</strong> with w, we conclude that Mn = k.<br />

Unfortunately, <strong>the</strong> str<strong>in</strong>gs <strong>in</strong> a <strong>suffix</strong> tree are highly dependent on each o<strong>the</strong>r; thus, a precise analysis<br />

<strong>of</strong> Mn is quite difficult. On <strong>the</strong> o<strong>the</strong>r hand, <strong>the</strong> asymptotic behavior <strong>of</strong> M I n, an analogous <strong>parameter</strong> <strong>of</strong><br />

(1)


310 Mark Daniel Ward and Wojciech Szpankowski<br />

tries built over <strong>in</strong>dependent str<strong>in</strong>gs, is well-understood. Specifically, M I n asymptotically follows <strong>the</strong> logarithmic<br />

series distribution (plus some fluctuations when ln p/ ln q is rational). In Ward and Szpankowski<br />

(2004), a precise analysis <strong>of</strong> M I n is given via <strong>the</strong> analysis <strong>of</strong> <strong>in</strong>dependent tries, us<strong>in</strong>g recurrence relations,<br />

analytical poissonization and depoissonization, <strong>the</strong> Mell<strong>in</strong> transform, and complex analysis.<br />

To def<strong>in</strong>e M I n, we consider <strong>the</strong> situation described above, but we build a trie from n + 1 <strong>in</strong>dependent<br />

str<strong>in</strong>gs from A ∗ . So we consider <strong>in</strong>dependent X(i)’s; specifically, we def<strong>in</strong>e X(i) = X1(i)X2(i)X3(i) . . .,<br />

where {Xj(i) | i, j ∈ N} is a collection <strong>of</strong> i.i.d. random variables. We let w denote <strong>the</strong> longest prefix <strong>of</strong><br />

X(n + 1) such that X(i) also has w as a prefix, for some i with 1 ≤ i ≤ n. Then M I n is def<strong>in</strong>ed as <strong>the</strong><br />

number <strong>of</strong> X(i)’s (with 1 ≤ i ≤ n) that have w as a prefix. So<br />

M I n = #{1 ≤ i ≤ n | X(i) = X1(i)X2(i)X3(i) . . . has w as a prefix} . (3)<br />

To def<strong>in</strong>e M I n via tries, first consider a trie built from <strong>the</strong> n + 1 <strong>in</strong>dependent str<strong>in</strong>gs from A ∗ . Next,<br />

consider <strong>the</strong> <strong>in</strong>sertion po<strong>in</strong>t <strong>of</strong> <strong>the</strong> (n + 1)st str<strong>in</strong>g. Then Mn is exactly <strong>the</strong> number <strong>of</strong> leaves <strong>in</strong> <strong>the</strong> subtree<br />

rooted at <strong>the</strong> branch<strong>in</strong>g po<strong>in</strong>t <strong>of</strong> <strong>the</strong> (n + 1)st <strong>in</strong>sertion. As above, suppose that <strong>the</strong> (n + 1)st str<strong>in</strong>g starts<br />

with wβ. Then, exam<strong>in</strong><strong>in</strong>g <strong>the</strong> first n str<strong>in</strong>gs, if <strong>the</strong>re are exactly k str<strong>in</strong>gs that beg<strong>in</strong> with wα (aga<strong>in</strong><br />

α = 1 − β), and <strong>the</strong> o<strong>the</strong>r n − k str<strong>in</strong>gs do not beg<strong>in</strong> with w, we conclude that M I n = k.<br />

S<strong>in</strong>ce we know from Ward and Szpankowski (2004) that M I n follows <strong>the</strong> logarithmic series distribution<br />

plus some fluctuations, <strong>the</strong>n it suffices to prove that Mn has a similar asymptotic distribution. To accomplish<br />

this goal, we compare <strong>the</strong> distribution <strong>of</strong> Mn <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> to <strong>the</strong> distribution <strong>of</strong> M I n <strong>in</strong> <strong>in</strong>dependent<br />

tries.<br />

Briefly, our pro<strong>of</strong> technique is <strong>the</strong> follow<strong>in</strong>g: We let M(z, u) = �<br />

1≤k,n≤∞ P(Mn = k)u k z n and<br />

M I (z, u) = �<br />

1≤k,n≤∞ P(M I n = k)u k z n denote <strong>the</strong> bivariate generat<strong>in</strong>g functions for Mn and M I n,<br />

respectively. To study <strong>the</strong>se generat<strong>in</strong>g functions, we consider <strong>the</strong> w’s def<strong>in</strong>ed above. Specifically, for<br />

M(z, u), we recall from (2) that if w denotes <strong>the</strong> longest prefix <strong>of</strong> X (n+1) = Xn+1Xn+2Xn+3 . . . that<br />

appears as a prefix <strong>of</strong> any X (i) = XiXi+1Xi+2 . . ., <strong>the</strong>n Mn enumerates <strong>the</strong> number <strong>of</strong> such occurrences<br />

<strong>of</strong> w. This approach to M(z, u) allows us to sum over all w ∈ A ∗ <strong>in</strong>stead <strong>of</strong> summ<strong>in</strong>g over k, n ∈ N.<br />

Similarly, for M I (z, u), we utilize (3) to see that if w denotes <strong>the</strong> longest prefix <strong>of</strong> X(n + 1) = X1(n +<br />

1)X2(n + 1)X3(n + 1) . . . that appears as a prefix <strong>of</strong> any X1(i)X2(i)X3(i) . . ., <strong>the</strong>n M I n is precisely <strong>the</strong><br />

number <strong>of</strong> such occurrences <strong>of</strong> w. Therefore, to determ<strong>in</strong>e M I (z, u), we can sum over all w ∈ A ∗ <strong>in</strong>stead<br />

<strong>of</strong> summ<strong>in</strong>g over <strong>the</strong> <strong>in</strong>tegers k and n.<br />

We note that <strong>the</strong> X (i) ’s are highly dependent on each o<strong>the</strong>r. In fact, if i ≥ j, <strong>the</strong>n X (i) = XiXi+1Xi+2 . . .<br />

is a substr<strong>in</strong>g <strong>of</strong> X (j) = XjXj+1Xj+2 . . .. This apparently makes <strong>the</strong> derivation <strong>of</strong> <strong>the</strong> bivariate generat<strong>in</strong>g<br />

function M(z, u) quite difficult. We overcome this hurdle by succ<strong>in</strong>ctly describ<strong>in</strong>g <strong>the</strong> degree to<br />

which a <strong>suffix</strong> <strong>of</strong> X can overlap with itself. We accomplish this by utiliz<strong>in</strong>g <strong>the</strong> autocorrelation polynomial<br />

Sw(z) <strong>of</strong> a word w, which measures <strong>the</strong> amount <strong>of</strong> overlap <strong>of</strong> a word w with itself. As mentioned<br />

above, <strong>the</strong> autocorrelation polynomial is def<strong>in</strong>ed as<br />

Sw(z) = �<br />

k∈P(w)<br />

P(w m k+1)z m−k<br />

where P(w) denotes <strong>the</strong> set <strong>of</strong> positions k <strong>of</strong> w satisfy<strong>in</strong>g w1 . . . wk = wm−k+1 . . . wm, that is, w’s prefix<br />

<strong>of</strong> length k is equal to w’s <strong>suffix</strong> <strong>of</strong> length k. Via <strong>the</strong> autocorrelation polynomial, we are able to surmount<br />

<strong>the</strong> difficulties <strong>in</strong>herent <strong>in</strong> <strong>the</strong> overlapp<strong>in</strong>g <strong>suffix</strong>es. Thus, us<strong>in</strong>g Sw(z), we obta<strong>in</strong> a succ<strong>in</strong>ct description <strong>of</strong><br />

<strong>the</strong> bivariate generat<strong>in</strong>g function M(z, u). The autocorrelation polynomial is well-understood; we utilize<br />

several results about Sw(z) from Régnier and Szpankowski (1998) and Lothaire (2005). In particular,<br />

when compar<strong>in</strong>g M(z, u) and M I (z, u), it is extremely useful to note that <strong>the</strong> autocorrelation polynomial<br />

Sw(z) is close to 1 with high probability (for |w| large).<br />

In order to obta<strong>in</strong> <strong>in</strong>formation about <strong>the</strong> difference <strong>of</strong> <strong>the</strong> two BGFs as Q(z, u) = M(z, u) − M I (z, u),<br />

we utilize residue analysis. We make a comparison <strong>of</strong> <strong>the</strong> poles <strong>of</strong> M(z, u) and M I (z, u) us<strong>in</strong>g Cauchy’s<br />

<strong>the</strong>orem (<strong>in</strong>tegrat<strong>in</strong>g with respect to z). As a result, we prove that Qn(u) := [z n ]Q(z, u) = O(n −ɛ )<br />

uniformly for |u| ≤ p −1/2 as n → ∞. Then we use ano<strong>the</strong>r application <strong>of</strong> Cauchy’s <strong>the</strong>orem (<strong>in</strong>tegrat<strong>in</strong>g<br />

with respect to u). Specifically, we extract <strong>the</strong> coefficient P(Mn = k) − P(M I n = k) = [u k z n ]Q(z, u) <strong>in</strong><br />

order to prove our ma<strong>in</strong> result.<br />

Theorem 2.1 There exist ɛ > 0 and b > 1 such that<br />

for large n.<br />

P(Mn = k) − P(M I n = k) = O(n −ɛ b −k ) (5)<br />

(4)


<strong>Analysis</strong> <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> 311<br />

Therefore, <strong>the</strong> distributions <strong>of</strong> Mn and M I n are asymptotically <strong>the</strong> same. We conclude that Mn also asymptotically<br />

follows <strong>the</strong> logarithmic series distribution (plus some fluctuations when ln p/ ln q is rational).<br />

Theorem 2.2 There exist ɛ > 0 and ɛj > 0 (for each j ∈ N) depend<strong>in</strong>g on p such that <strong>the</strong> jth factorial<br />

moment <strong>of</strong> Mn is<br />

E[(Mn) j ] = Γ(j) q(p/q)j + p(q/p) j<br />

h<br />

+ γj(log 1/p n) + O(n −ɛj ) (6)<br />

where γj is a periodic function with mean 0 and small modulus if ln p/ ln q is rational, and o<strong>the</strong>rwise<br />

γj(x) → 0 as x → ∞. Also h = −p log p − q log q denotes <strong>the</strong> entropy <strong>of</strong> <strong>the</strong> source. The probability<br />

generat<strong>in</strong>g function <strong>of</strong> Mn is<br />

E[u Mn q ln (1 − pu) + p ln (1 − qu)<br />

] = − + γ(log1/p n, u) + O(n<br />

h<br />

−ɛ ) , (7)<br />

for |u| ≤ p−1/2 where γ(·, u) is a periodic function with mean 0 and small modulus if ln p/ ln q is rational,<br />

and o<strong>the</strong>rwise γj(u, x) → 0 (uniformly for |u| ≤ p−1/2 ) as x → ∞. More precisely,<br />

E[u Mn ⎡<br />

∞�<br />

] = ⎣ pjq + qjp +<br />

jh<br />

�<br />

− e2krπi log1/p n Γ(zk)(pjq + qjp)(zk) j<br />

j!(p−zk+1 ⎤<br />

⎦ u<br />

ln p + q−zk+1 ln q)<br />

j + O(n −ɛ ) (8)<br />

j=1<br />

k∈Z\{0}<br />

when ln p/ ln q = r/t for some r, t ∈ Z, we have zk = 2krπi/ ln p. Therefore, as n → ∞, we conclude<br />

that Mn follows <strong>the</strong> logarithmic series distribution plus some fluctuations if ln p/ ln q = r/t is rational,<br />

i.e.,<br />

P(Mn = j) = pj q + q j p<br />

jh<br />

+ �<br />

k�=0<br />

− e2krπi log 1/p n Γ(zk)(p j q + q j p)(zk) j<br />

j!(p −zk+1 ln p + q −zk+1 ln q)<br />

+ O(n −ɛ ) . (9)<br />

If ln p/ ln q is irrational, <strong>the</strong>n Mn asymptotically follows <strong>the</strong> logarithmic series distribution, without fluctuations.<br />

Note that <strong>the</strong> average value <strong>of</strong> Mn is asymptotically 1<br />

h , and also Mn is highly concentrated around <strong>the</strong><br />

mean; this property <strong>of</strong> Mn is very desirable for <strong>the</strong> error correction scheme described <strong>in</strong> Lonardi and<br />

Szpankowski (2003).<br />

3 Pro<strong>of</strong>s<br />

We first derive <strong>the</strong> bivariate generat<strong>in</strong>g functions for Mn and M I n, denoted as M(z, u) and M I (z, u),<br />

respectively. Then we prove a few useful lemmas concern<strong>in</strong>g <strong>the</strong> autocorrelation polynomial. Next,<br />

we prove that M(z, u) can be analytically cont<strong>in</strong>ued from <strong>the</strong> unit disk to a larger disk. Afterwards, we<br />

determ<strong>in</strong>e <strong>the</strong> poles <strong>of</strong> M(z, u) and M I (z, u). We write Q(z, u) = M(z, u)−M I (z, u); we use Cauchy’s<br />

<strong>the</strong>orem to that Qn(u) := [z n ]Q(z, u) → 0 uniformly for u ≤ p −1/2 as n → ∞. Then we apply Cauchy’s<br />

<strong>the</strong>orem aga<strong>in</strong> to prove that P(Mn = k) − P(M I n = k) = [u k z n ]Q(z, u) = O(n −ɛ b −k ) for some ɛ > 0<br />

and b > 1.<br />

We conclude that <strong>the</strong> distribution <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> Mn is asymptotically <strong>the</strong> same<br />

<strong>in</strong> <strong>suffix</strong> <strong>trees</strong> as <strong>in</strong> tries built over <strong>in</strong>dependent str<strong>in</strong>gs, i.e., Mn and M I n have asymptotically <strong>the</strong> same<br />

distribution. Therefore, Mn also follows <strong>the</strong> logarithmic series distribution plus some fluctuations.<br />

3.1 BGF for <strong>the</strong> Multiplicity Match<strong>in</strong>g Parameter <strong>of</strong> Independent Tries<br />

First we obta<strong>in</strong> <strong>the</strong> bivariate generat<strong>in</strong>g function for M I n, which is <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> for<br />

a trie built over <strong>the</strong> <strong>in</strong>dependent str<strong>in</strong>gs X(1), . . . , X(n + 1), where X(i) = X1(i)X2(i)X3(i) . . . and<br />

{Xj(i) | i, j ∈ N} is a collection <strong>of</strong> i.i.d. random variables with P(Xj(i) = 0) = p and P(Xj(i) = 1) =<br />

q = 1 − p. We let w denote <strong>the</strong> longest prefix <strong>of</strong> both X(n + 1) and at least one o<strong>the</strong>r str<strong>in</strong>g X(i) for some<br />

1 ≤ i ≤ n. We write β to denote <strong>the</strong> (|w| + 1)st character <strong>of</strong> X(n + 1). When M I n = k, we conclude that<br />

exactly k str<strong>in</strong>gs X(i) have wα as a prefix, and <strong>the</strong> o<strong>the</strong>r n − k str<strong>in</strong>gs X(i) do not have w as a prefix at<br />

all. Thus <strong>the</strong> generat<strong>in</strong>g function for M I n is exactly<br />

M I ∞� ∞�<br />

(z, u) := P(M I n = k)u k z n ∞� ∞� �<br />

� �<br />

n<br />

=<br />

P(wβ) (P(wα))<br />

k<br />

k (1 − P(w)) n−k u k z n .<br />

n=1 k=1<br />

n=1 k=1<br />

w∈A ∗<br />

α∈A<br />

(10)


312 Mark Daniel Ward and Wojciech Szpankowski<br />

After simplify<strong>in</strong>g, it follows immediately that<br />

M I (z, u) = �<br />

w∈A ∗<br />

α∈A<br />

uP(β)P(w)<br />

zP(w)P(α)<br />

. (11)<br />

1 − z(1 − P(w)) 1 − z(1 + uP(w)P(α) − P(w))<br />

Our reason<strong>in</strong>g about M I (z, u) can be applied when we derive generat<strong>in</strong>g function M(z, u) for Mn <strong>in</strong> <strong>the</strong><br />

next section, but <strong>the</strong> situation will be more complicated, because <strong>the</strong> occurrences <strong>of</strong> w can overlap.<br />

3.2 BGF for <strong>the</strong> Multiplicity Match<strong>in</strong>g Parameter <strong>of</strong> Suffix Trees<br />

Now we obta<strong>in</strong> <strong>the</strong> bivariate generat<strong>in</strong>g function for Mn, which is <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong><br />

for a <strong>suffix</strong> tree built over <strong>the</strong> first n + 1 <strong>suffix</strong>es X (1) , . . . , X (n+1) <strong>of</strong> a str<strong>in</strong>g X (i.e., X (i) =<br />

XiXi+1Xi+2 . . .). The bivariate generat<strong>in</strong>g function for <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> is much<br />

more difficult to derive <strong>in</strong> <strong>the</strong> dependent (<strong>suffix</strong> tree) case than <strong>in</strong> <strong>the</strong> <strong>in</strong>dependent (trie) case, because <strong>the</strong><br />

<strong>suffix</strong>es <strong>of</strong> X are dependent on each o<strong>the</strong>r. We let w denote <strong>the</strong> longest prefix <strong>of</strong> both X (n+1) and at least<br />

one X (i) for some 1 ≤ i ≤ n. We write β to denote <strong>the</strong> (|w| + 1)st character <strong>of</strong> X (n+1) ; when Mn = k,<br />

we conclude that exactly k <strong>suffix</strong>es X (i) have wα as a prefix, and <strong>the</strong> o<strong>the</strong>r n − k str<strong>in</strong>gs X (i) do not have<br />

w as a prefix at all. Thus, we are <strong>in</strong>terested <strong>in</strong> f<strong>in</strong>d<strong>in</strong>g str<strong>in</strong>gs with exactly k occurrences <strong>of</strong> wα, ended on<br />

<strong>the</strong> right by an occurrence <strong>of</strong> wβ, with no o<strong>the</strong>r occurrences <strong>of</strong> w at all. This set <strong>of</strong> words is exactly <strong>the</strong><br />

language Rwα(T (α)<br />

w α) k−1T (α)<br />

w β, where<br />

Rw = {v | v conta<strong>in</strong>s exactly one occurrence <strong>of</strong> w, located at <strong>the</strong> right end}<br />

T (α)<br />

w = {v | wαv conta<strong>in</strong>s exactly two occurrences <strong>of</strong> w, located at <strong>the</strong> left and right ends}(12)<br />

So, <strong>the</strong> generat<strong>in</strong>g function for Mn is<br />

M(z, u) =<br />

∞�<br />

k=1<br />

�<br />

w∈A ∗<br />

α∈A<br />

�<br />

s∈Rw<br />

P(sα)z |s|+1 �<br />

�<br />

u<br />

After simplify<strong>in</strong>g <strong>the</strong> geometric sum, this yields<br />

M(z, u) = �<br />

w∈A ∗<br />

α∈A<br />

t∈T (α)<br />

w<br />

uP(β) Rw(z)<br />

z |w|<br />

P(tα)z |t|+1 �k−1 �<br />

u<br />

v∈T (α)<br />

w<br />

P(α)zT (α)<br />

w (z)<br />

1 − P(α)zuT (α)<br />

w (z)<br />

P(vβ)z |v|+1−|w|−1 . (13)<br />

. (14)<br />

We note that Rw(z)/z |w| = P(w)/Dw(z) (Régnier and Szpankowski (1998)), where Dw(z) = (1 −<br />

z)Sw(z) + zmP(w) and where Sw(z) denotes <strong>the</strong> autocorrelation polynomial for w. Recall that Sw(z)<br />

measures <strong>the</strong> degree to which a word w overlaps with itself, and specifically<br />

Sw(z) = �<br />

P(w m k+1)z m−k<br />

(15)<br />

k∈P(w)<br />

where P(w) denotes <strong>the</strong> set <strong>of</strong> positions k <strong>of</strong> w satisfy<strong>in</strong>g w1 . . . wk = wm−k+1 . . . wm, that is, w’s prefix<br />

<strong>of</strong> length k is equal to w’s <strong>suffix</strong> <strong>of</strong> length k; also, m = |w|. Return<strong>in</strong>g to (14), it follows that<br />

M(z, u) = �<br />

w∈A ∗<br />

α∈A<br />

uP(β)P(w)<br />

Dw(z)<br />

P(α)zT (α)<br />

w (z)<br />

1 − P(α)zuT (α)<br />

w (z)<br />

In order to derive an explicit form <strong>of</strong> M(z, u), we still need to f<strong>in</strong>d T (α)<br />

w (z). If we def<strong>in</strong>e<br />

. (16)<br />

Mw = {v | wv conta<strong>in</strong>s exactly two occurrences <strong>of</strong> w, located at <strong>the</strong> left and right ends}(17)<br />

<strong>the</strong>n we observe that αT (α)<br />

w<br />

is exactly <strong>the</strong> subset <strong>of</strong> words <strong>of</strong> Mw that beg<strong>in</strong> with α; We use H (α)<br />

w to<br />

denote this subset (i.e., H (α)<br />

w = Mw ∩ (αA∗ )), and thus αT (α)<br />

w<br />

M(z, u) = �<br />

w∈A ∗<br />

α∈A<br />

uP(β)P(w)<br />

Dw(z)<br />

= H (α)<br />

w . So (16) simplifies to<br />

H (α)<br />

w (z)<br />

1 − uH (α)<br />

w (z)<br />

. (18)


<strong>Analysis</strong> <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> 313<br />

In order to compute H (α)<br />

w (z), we write Mw = H (α)<br />

w + H (β)<br />

w , where H (β)<br />

w is <strong>the</strong> subset <strong>of</strong> words from Mw<br />

that start with β (i.e., H (β)<br />

w = Mw ∩ (βA∗ )). (Note that every word <strong>of</strong> Mw beg<strong>in</strong>s with ei<strong>the</strong>r α or β,<br />

because <strong>the</strong> empty word ε /∈ Mw.) The follow<strong>in</strong>g useful lemma is <strong>the</strong> last necessary <strong>in</strong>gredient to obta<strong>in</strong><br />

an explicit formula for M(z, u) from (18).<br />

Lemma 3.1 Let H (α)<br />

w denote <strong>the</strong> subset <strong>of</strong> words from Mw that start with α. Then<br />

H (α)<br />

w (z) = Dwα(z) − (1 − z)<br />

. (19)<br />

Dw(z)<br />

Pro<strong>of</strong> We use <strong>the</strong> concepts and notation from Régnier and Szpankowski (1998) and Lothaire (2005)<br />

throughout. In particular, we def<strong>in</strong>e<br />

Uw = {v | wv conta<strong>in</strong>s exactly one occurrence <strong>of</strong> w (located at <strong>the</strong> left end)} (20)<br />

and we recall from (12) and (17) above that<br />

Rw = {v | v conta<strong>in</strong>s exactly one occurrence <strong>of</strong> w, located at <strong>the</strong> right end}<br />

Mw = {v | wv conta<strong>in</strong>s exactly two occurrences <strong>of</strong> w, located at <strong>the</strong> left and right ends}(21)<br />

The follow<strong>in</strong>g notation is similar but slightly adapted for our pro<strong>of</strong>.<br />

U (α)<br />

w = {v | v starts with α, and wv has exactly 1 occurrence <strong>of</strong> wα and no occurrences <strong>of</strong> wβ} .<br />

(22)<br />

We note that <strong>the</strong> set <strong>of</strong> words with no occurrences <strong>of</strong> wβ is exactly A ∗ \ Rwβ(Mwβ) ∗ Uwβ, which has<br />

generat<strong>in</strong>g function<br />

1 Rwβ(z)Uwβ(z)<br />

− . (23)<br />

1 − z 1 − Mwβ(z)<br />

Now we describe <strong>the</strong> set <strong>of</strong> words with no occurrences <strong>of</strong> wβ <strong>in</strong> a different way. The set <strong>of</strong> words with no<br />

occurrences <strong>of</strong> wβ and at least one occurrence <strong>of</strong> wα is exactly Rw(H (α)<br />

w ) ∗U (α)<br />

w , which has generat<strong>in</strong>g<br />

function Rw(z)U (α)<br />

w (z)/(1 − H (α)<br />

w (z)). The set <strong>of</strong> words with no occurrences <strong>of</strong> wβ and no occurrences<br />

<strong>of</strong> wα is exactly Rw + (A∗ \ Rw(Mw) ∗U). (Note that <strong>the</strong> set <strong>of</strong> such words that end <strong>in</strong> w is exactly Rw;<br />

on <strong>the</strong> o<strong>the</strong>r hand, <strong>the</strong> set <strong>of</strong> such words that do not end <strong>in</strong> w is exactly A∗ \ Rw(Mw) ∗U.) So <strong>the</strong> set<br />

<strong>of</strong> words with no occurrences <strong>of</strong> wα and no occurrences <strong>of</strong> wβ has generat<strong>in</strong>g function Rw(z) + 1/(1 −<br />

z) − Rw(z)Uw(z)/(1 − Mw(z)). So <strong>the</strong> set <strong>of</strong> words with no occurrences <strong>of</strong> wβ has generat<strong>in</strong>g function<br />

Comb<strong>in</strong><strong>in</strong>g (23) and (24), it follows that<br />

1 Rwβ(z)Uwβ(z)<br />

−<br />

1 − z 1 − Mwβ(z)<br />

Rw(z)U (α)<br />

w (z)<br />

1 − H (α)<br />

w (z) + Rw(z) + 1 Rw(z)Uw(z)<br />

− . (24)<br />

1 − z 1 − Mw(z)<br />

(α)<br />

Rw(z)U w (z)<br />

=<br />

1 − H (α)<br />

w (z) + Rw(z) + 1 Rw(z)Uw(z)<br />

− . (25)<br />

1 − z 1 − Mw(z)<br />

Now we f<strong>in</strong>d <strong>the</strong> generat<strong>in</strong>g function for U (α)<br />

w . For each word v ∈ U (α)<br />

w , ei<strong>the</strong>r wv has exactly one or two<br />

occurrences <strong>of</strong> w. The subset <strong>of</strong> U (α)<br />

w <strong>of</strong> <strong>the</strong> first type is exactly V (α)<br />

w<br />

:= Uw ∩ (αA ∗ ), i.e., <strong>the</strong> subset <strong>of</strong><br />

words from Uw that start with α. The subset <strong>of</strong> U (α)<br />

w <strong>of</strong> <strong>the</strong> second type is exactly H (α)<br />

w . We observe that<br />

V (α)<br />

w · A = (H (α)<br />

w + V (α)<br />

w ) \ {α} (26)<br />

(see Ward (2005)), so V (α)<br />

w (z) = (H (α)<br />

w (z) − P(α)z)/(z − 1). S<strong>in</strong>ce U (α)<br />

w = V (α)<br />

w + H (α)<br />

w , it follows that<br />

Recall<strong>in</strong>g equation (25), we see that<br />

1 Rwβ(z)Uwβ(z)<br />

−<br />

1 − z 1 − Mwβ(z)<br />

U (α)<br />

w (z) = H(α) w (z) − P(α)z<br />

+ H<br />

z − 1<br />

(α)<br />

w (z) = zH(α)<br />

w (z) − P(α)z<br />

. (27)<br />

z − 1<br />

(α)<br />

Rw(z)(zH w (z) − P(α)z)<br />

=<br />

(1 − H (α)<br />

w (z))(z − 1) + Rw(z) + 1 Rw(z)Uw(z)<br />

− . (28)<br />

1 − z 1 − Mw(z)


314 Mark Daniel Ward and Wojciech Szpankowski<br />

Simplify<strong>in</strong>g, and us<strong>in</strong>g Uw(z) = (1−Mw(z))/(1−z) and Uwβ(z) = (1−Mwβ(z))/(1−z) (see Régnier<br />

and Szpankowski (1998)), it follows that<br />

Rwβ(z)<br />

Rw(z) =<br />

zP(β)<br />

1 − H (α)<br />

w (z)<br />

. (29)<br />

Solv<strong>in</strong>g for H (α)<br />

w (z) and <strong>the</strong>n us<strong>in</strong>g Rw(z) = z m P(w)/Dw(z) and Rwβ(z) = z m+1 P(w)P(β)/Dwβ(z)<br />

(see Régnier and Szpankowski (1998)), it follows that<br />

H (α)<br />

w (z) = Dw(z) − Dwβ(z)<br />

. (30)<br />

Dw(z)<br />

Note Dw(z)−Dwβ(z) = (1−z)Sw(z)+z m P(w)−(1−z)Swβ(z)−z m+1 P(w)P(β) = (1−z)(Swα(z)−<br />

1) + z m+1 P(w)P(α) = Dwα(z) − (1 − z). Thus, (30) completes <strong>the</strong> pro<strong>of</strong> <strong>of</strong> <strong>the</strong> lemma.<br />

✷<br />

Us<strong>in</strong>g <strong>the</strong> lemma above, we f<strong>in</strong>ally observe a form <strong>of</strong> M(z, u) that we summarize below.<br />

Theorem 3.1 Let M(z, u) := �∞ �∞ n=1 k=1 P(Mn = k)ukz n denote <strong>the</strong> bivariate generat<strong>in</strong>g function<br />

for Mn, <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>of</strong> a <strong>suffix</strong> tree built over <strong>the</strong> first n+1 <strong>suffix</strong>es X (1) , . . . , X (n+1)<br />

<strong>of</strong> a str<strong>in</strong>g X. Then<br />

M(z, u) = �<br />

w∈A ∗<br />

α∈A<br />

uP(β)P(w)<br />

Dw(z)<br />

Dwα(z) − (1 − z)<br />

Dw(z) − u(Dwα(z) − (1 − z))<br />

for |u| < 1 and |z| < 1. Here Dw(z) = (1 − z)Sw(z) + z m P(w), and Sw(z) denotes <strong>the</strong> autocorrelation<br />

polynomial for w, def<strong>in</strong>ed <strong>in</strong> (1).<br />

3.3 On <strong>the</strong> Autocorrelation Polynomial<br />

Throughout <strong>the</strong> rest <strong>of</strong> our analysis we assume that, without loss <strong>of</strong> generality, p ≥ q. Note that p ≤<br />

√ √ √<br />

p < 1, so <strong>the</strong>re exists ρ > 1 such that ρ p < 1 (and thus ρp < 1 too). F<strong>in</strong>ally, def<strong>in</strong>e δ = p.<br />

We establish a few lemmas about <strong>the</strong> autocorrelation polynomial that will be important for our analysis.<br />

Recall that <strong>the</strong> autocorrelation polynomial is Sw(z) = �<br />

k∈P(w) P(wm k+1 )zm−k , where P(w) denotes <strong>the</strong><br />

set <strong>of</strong> positions k <strong>of</strong> w satisfy<strong>in</strong>g w1 . . . wk = wm−k+1 . . . wm, that is, w’s prefix <strong>of</strong> length k is equal to<br />

w’s <strong>suffix</strong> <strong>of</strong> length k.<br />

The autocorrelation polynomial Sw(z) has a P(wm k+1 )zm−k term if and only if w has an overlap with<br />

itself <strong>of</strong> length k. S<strong>in</strong>ce each word w overlaps with itself trivially, <strong>the</strong>n every autocorrelation polynomial<br />

has a constant term (i.e., zm−m = z0 = 1 term). With high probability, however, w has very few large<br />

nontrivial overlaps with itself. Therefore, with high probability, all nontrivial overlaps <strong>of</strong> w with itself are<br />

small; such overlaps correspond to high-degree terms <strong>of</strong> Sw(z). Therefore, when w is a randomly chosen<br />

long word, <strong>the</strong>n Sw(z) is very close to 1 with very high probability. The first lemma makes this notion<br />

ma<strong>the</strong>matically precise.<br />

Lemma 3.2 If θ = (1 − pρ) −1 > 1, <strong>the</strong>n<br />

�<br />

w∈A k<br />

where [A] = 1 if A holds, and [A] = 0 o<strong>the</strong>rwise.<br />

(31)<br />

[|Sw(ρ) − 1| ≤ (ρδ) k θ ]P(w) ≥ 1 − δ k θ (32)<br />

Pro<strong>of</strong> Our pro<strong>of</strong> is <strong>the</strong> one given <strong>in</strong> Fayolle and Ward (2005). Note that Sw(z) − 1 has a term <strong>of</strong> degree<br />

i ≤ j if and only if m − i ∈ P(w) with 1 ≤ i ≤ j. Therefore, for each such i and each w1 . . . wi, <strong>the</strong>re is<br />

exactly one word wi+1 . . . wk such that Sw(z) − 1 has a term <strong>of</strong> degree i. Therefore, for fixed j and k,<br />

�<br />

[Sw(z) − 1 has a term <strong>of</strong> degree ≤ j ]P(w)<br />

w∈A k<br />

≤ �<br />

�<br />

1≤i≤j w1,...,wi∈Ai ≤ �<br />

�<br />

1≤i≤j w1,...,wi∈Ai P(w1 . . . wi)<br />

�<br />

wi+1,...,wk∈A k−i<br />

P(w1 . . . wi)p k−i = �<br />

1≤i≤j<br />

[Sw(z) − 1 has a term <strong>of</strong> degree i]P(wi+1 . . . wk)<br />

p k−i ≤ pk−j<br />

1 − p<br />

(33)


<strong>Analysis</strong> <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> 315<br />

We use j = ⌊k/2⌋. Thus �<br />

w∈Ak [all terms <strong>of</strong> Sw(z) − 1 have degree > ⌊k/2⌋]P(w) ≥ 1 − δkθ. Note that, if all terms <strong>of</strong> Sw(z) − 1 have degree > ⌊k/2⌋, <strong>the</strong>n<br />

|Sw(ρ) − 1| ≤ �<br />

i>⌊k/2⌋<br />

(ρp) i = (ρp)⌊k/2⌋+1<br />

1 − ρp<br />

≤ (ρp)k/2<br />

1 − ρp ≤ ρk p k/2<br />

1 − ρp = (ρδ)k θ . (34)<br />

This completes <strong>the</strong> pro<strong>of</strong> <strong>of</strong> <strong>the</strong> lemma. ✷<br />

Us<strong>in</strong>g this lemma, we can quickly obta<strong>in</strong> ano<strong>the</strong>r result that is similar but slightly stronger.<br />

First consider words w such that |Sw(ρ) − 1| ≤ (ρδ) kθ. Write Sw(z) = �k−1 i=0 aizi and Swα(z) =<br />

�k i=0 bizi . Observe that ei<strong>the</strong>r bi = 0 or bi = ai. The follow<strong>in</strong>g lemma follows immediately:<br />

Lemma 3.3 If θ = (1 − pρ) −1 + 1 and α ∈ A, <strong>the</strong>n<br />

�<br />

w∈A k<br />

[max{|Sw(ρ) − 1|, |Swα(ρ) − 1|} ≤ (ρδ) k θ ]P(w) ≥ 1 − δ k θ . (35)<br />

Also, <strong>the</strong> autocorrelation polynomial is never too small. In fact<br />

Lemma 3.4 Def<strong>in</strong>e c = 1 − ρ √ p > 0. Then <strong>the</strong>re exists an <strong>in</strong>teger K ≥ 1 such that, for |w| ≥ K and<br />

|z| ≤ ρ and |u| ≤ δ −1 ,<br />

|Sw(z) − uSwα(z) + u| ≥ c . (36)<br />

Pro<strong>of</strong> The pro<strong>of</strong> consists <strong>of</strong> consider<strong>in</strong>g several cases. The only condition for K is (1 + δ −1 ) (pρ)K/2<br />

1−pρ ≤<br />

c/2. The analysis is not difficult; all details are presented <strong>in</strong> Ward (2005). ✷<br />

3.4 Analytic Cont<strong>in</strong>uation<br />

Our goal <strong>in</strong> this section is to prove <strong>the</strong> follow<strong>in</strong>g:<br />

Theorem 3.2 The generat<strong>in</strong>g function M(z, u) can be analytically cont<strong>in</strong>ued for |u| ≤ δ −1 and |z| < 1.<br />

The pro<strong>of</strong> requires several lemmas and observations. We always assume |u| ≤ δ −1 .<br />

Lemma 3.5 If 0 < r < 1, <strong>the</strong>n <strong>the</strong>re exists C > 0 and an <strong>in</strong>teger K1 (both depend<strong>in</strong>g on r) such that<br />

for |w| ≥ K1 and |z| ≤ r (and, as before, |u| ≤ δ −1 ).<br />

|Dw(z) − u(Dwα(z) − (1 − z))| ≥ C (37)<br />

Pro<strong>of</strong> Consider <strong>the</strong> K and c def<strong>in</strong>ed <strong>in</strong> Lemma 3.4, which tells us that, for all |w| ≥ K, we have<br />

|Sw(z) − uSwα(z) + u| ≥ c (38)<br />

for |z| ≤ ρ. So, for |w| ≥ K, we have |Dw(z) − u(Dwα(z) − (1 − z))| ≥ (1 − r)c − r m p m (1 − δ −1 rp).<br />

Note that r m p m (1 − δ −1 rp) → 0 as m → ∞. Therefore, replac<strong>in</strong>g K by a larger K1 if necessary, we can<br />

without loss <strong>of</strong> generality assume that r m p m (1 − δ −1 rp) ≤ (1 − r)c/2. So we def<strong>in</strong>e C = (1 − r)c/2,<br />

and <strong>the</strong> result follows immediately. ✷<br />

Now we can streng<strong>the</strong>n <strong>the</strong> previous lemma by dropp<strong>in</strong>g <strong>the</strong> “K1”, i.e., by not requir<strong>in</strong>g w to be a long<br />

word:<br />

Lemma 3.6 If 0 < r < 1, <strong>the</strong>n <strong>the</strong>re exists C > 0 (depend<strong>in</strong>g on r) such that<br />

for |z| ≤ r (and, as before, |u| ≤ δ −1 ).<br />

|Dw(z) − u(Dwα(z) − (1 − z))| ≥ C (39)<br />

Pro<strong>of</strong> Consider <strong>the</strong> K1 def<strong>in</strong>ed <strong>in</strong> Lemma 3.5. Let C0 denote <strong>the</strong> “C” from Lemma 3.5. There are<br />

only f<strong>in</strong>itely many w’s with |w| < K1, say w1, . . . , wi. For each such wj (with 1 ≤ j ≤ i), we note<br />

that Dwj (z) − u(Dwjα(z) − (1 − z)) �= 0 for |z| ≤ r and |u| ≤ δ −1 , so <strong>the</strong>re exists Cj > 0 such<br />

that |Dwj (z) − u(Dwjα(z) − (1 − z))| ≥ Cj for all |z| ≤ r and |u| ≤ δ −1 . F<strong>in</strong>ally, we def<strong>in</strong>e C =<br />

m<strong>in</strong>{C0, C1, . . . , Ci}. ✷<br />

F<strong>in</strong>ally, we prove Theorem 3.2.


316 Mark Daniel Ward and Wojciech Szpankowski<br />

Pro<strong>of</strong> Consider |z| ≤ r < 1. We proved <strong>in</strong> Lemma 3.6 <strong>the</strong>re exists C > 0 depend<strong>in</strong>g on r such that, for<br />

all |u| ≤ δ−1 1<br />

, we have<br />

. Sett<strong>in</strong>g u = 0, we also have<br />

. Thus<br />

1<br />

|Dw(z)−u(Dwα(z)−(1−z))|<br />

|M(z, u)| ≤ P(β)δ−1<br />

C 2<br />

�<br />

≤ 1<br />

C<br />

�<br />

α∈A w∈A∗ |Dw(z)|<br />

≤ 1<br />

C<br />

P(w)|Dwα(z) − (1 − z)| . (40)<br />

Now we use Lemma 3.3. Consider w and α with max{|Sw(ρ) − 1|, |Swα(ρ) − 1|} ≤ (ρδ) m θ. It follows<br />

immediately that<br />

|Dwα(z)−(1−z)| = |(1−z)(Swα(z)−1)+z m+1 P(w)P(α)| ≤ (1+r)(ρδ) m θ+r m+1 p m p = O(s m ) ,<br />

(41)<br />

where s = max{ρδ, rp}. Now consider <strong>the</strong> o<strong>the</strong>r w’s and α’s. We have<br />

|Dwα(z)−(1−z)| = |(1−z)(Swα(z)−1)+z m+1 P(w)P(α)| ≤<br />

so we def<strong>in</strong>e C1 = (1+r)pρ<br />

1−pρ<br />

|M(z, u)| ≤ P(β)δ−1<br />

C 2<br />

≤ P(β)δ−1<br />

C 2<br />

(1 + r)pρ<br />

1 − pρ +rm+1 p m p ≤<br />

(1 + r)pρ<br />

+1 ,<br />

1 − pρ<br />

(42)<br />

+ 1 to be a value which only depends on r (recall that r is fixed here). Thus<br />

� �<br />

�<br />

α∈A m≥0 w∈Am � �<br />

α∈A m≥0<br />

|P(w)(Dwα(z) − (1 − z))|<br />

|(1 − δ m θ)O(s m ) + δ m θC1| ≤ P(β)δ−1<br />

C 2<br />

� �<br />

α∈A m≥0<br />

O(s m ) = O(1) (43)<br />

and this completes <strong>the</strong> pro<strong>of</strong> <strong>of</strong> <strong>the</strong> <strong>the</strong>orem. ✷<br />

3.5 S<strong>in</strong>gularity <strong>Analysis</strong><br />

We first determ<strong>in</strong>e (for |u| ≤ δ −1 ) <strong>the</strong> zeroes <strong>of</strong> Dw(z) − u(Dwα(z) − (1 − z)) and (<strong>in</strong> particular) <strong>the</strong><br />

zeroes <strong>of</strong> Dw(z).<br />

Lemma 3.7 There exists an <strong>in</strong>teger K2 ≥ 1 such that, for u fixed (with |u| ≤ δ −1 ) and |w| ≥ K2, <strong>the</strong>re<br />

is exactly one root <strong>of</strong> Dw(z) − u(Dwα(z) − (1 − z)) <strong>in</strong> <strong>the</strong> closed disk {z | |z| ≤ ρ}.<br />

Pro<strong>of</strong> Let K and c be def<strong>in</strong>ed as <strong>in</strong> Lemma 3.4. Without loss <strong>of</strong> generality (replac<strong>in</strong>g K by a larger K2,<br />

if necessary), we can also assume that 2(pρ) K2 < c(ρ − 1) and K2 ≥ K1 (where K1 is def<strong>in</strong>ed <strong>in</strong> Lemma<br />

3.5). Also, we can choose K2 large enough (for use later) such that ∃c2 > 0 with<br />

ρ(1 − p K2 (1 + δ −1 p)) − 1 > c2 and thus ρ(1 − p K2 ) − 1 > c2 . (44)<br />

We recall 0 < pρδ −1 < 1, and thus 0 < 1 − pρδ −1 < 1. S<strong>in</strong>ce |u| < δ −1 and |z| ≤ ρ, <strong>the</strong>n for<br />

|w| ≥ K2 we have |P(w)z m (1 − uzP(α))| ≤ (pρ) m (1 + δ −1 ρp) ≤ 2(pρ) m < c(ρ − 1) ≤ |(Sw(z) −<br />

uSwα(z) + u)(ρ − 1)|. Therefore, for z on <strong>the</strong> circle {z | |z| = ρ}, we have |P(w)z m (1 − uzP(α))| <<br />

|(Sw(z) − uSwα(z) + u)(z − 1)|. Equivalently,<br />

|(Dw(z) − u(Dwα(z) − (1 − z))) − ((Sw(z) − uSwα(z) + u)(z − 1))| < |(Sw(z)−uSwα(z)+u)(z−1)| .<br />

(45)<br />

Therefore, by Rouché’s Theorem, Dw(z) − u(Dwα(z) − (1 − z)) and (Sw(z) − uSwα(z) + u)(z − 1)<br />

have <strong>the</strong> same number <strong>of</strong> zeroes <strong>in</strong>side <strong>the</strong> disk {z | |z| ≤ ρ}. S<strong>in</strong>ce |Sw(z) − uSwα(z) + u| ≥ c <strong>in</strong>side<br />

this disk, we conclude that (Sw(z) − uSwα(z) + u)(z − 1) has exactly one root <strong>in</strong> <strong>the</strong> disk. It follows that<br />

Dw(z) − u(Dwα(z) − (1 − z)) also has exactly one root <strong>in</strong> <strong>the</strong> disk. ✷<br />

When u = 0, this lemma implies (for |w| ≥ K2) that Dw(z) has exactly one root <strong>in</strong> <strong>the</strong> disk {z | |z| ≤ ρ}.<br />

Let Aw denote this root, and let Bw = D ′ w(Aw). Also let Cw(u) denote <strong>the</strong> root <strong>of</strong> Dw(z) − u(Dwα(z) −<br />

(1 − z)) <strong>in</strong> <strong>the</strong> closed disk {z | |z| ≤ ρ}. F<strong>in</strong>ally, we def<strong>in</strong>e<br />

�<br />

d<br />

Ew(u) :=<br />

dz (Dw(z)<br />

��<br />

���z=Cw<br />

− u(Dwα(z) − (1 − z))) = D ′ w(Cw) − u(D ′ wα(Cw) + 1) . (46)<br />

We have precisely determ<strong>in</strong>ed <strong>the</strong> s<strong>in</strong>gularities <strong>of</strong> M(z, u). Next, we make a comparison <strong>of</strong> M(z, u) to<br />

M I (z, u), <strong>in</strong> order to show that Mn and M I n have asymptotically similar behaviors.


<strong>Analysis</strong> <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> 317<br />

3.6 Compar<strong>in</strong>g Suffix Trees to Tries<br />

Now we def<strong>in</strong>e<br />

Us<strong>in</strong>g <strong>the</strong> notation from (11) and (31), if we write<br />

<strong>the</strong>n we have proven that<br />

Q(z, u) = M(z, u) − M I (z, u) . (47)<br />

M I w,α(z, u) =<br />

uP(β)P(w)<br />

zP(w)P(α)<br />

1 − z(1 − P(w)) 1 − z(1 + uP(w)P(α) − P(w))<br />

Mw,α(z, u) = uP(β)P(w) Dwα(z) − (1 − z)<br />

Dw(z) Dw(z) − u(Dwα(z) − (1 − z))<br />

(48)<br />

Q(z, u) = �<br />

(Mw,α(z, u) − M I w,α(z, u)) . (49)<br />

w∈A ∗<br />

α∈A<br />

We also def<strong>in</strong>e Qn(u) = [zn ]Q(z, u). We denote <strong>the</strong> contribution to Qn(u) from a specific w and α as<br />

Q (w,α)<br />

n (u) = [zn ](Mw,α(z, u) − M I w,α(z, u)). Then we observe that<br />

Q (w,α)<br />

n (u) = 1<br />

�<br />

(Mw,α(z, u) − M<br />

2πi<br />

I w,α(z, u)) dz<br />

zn+1 (50)<br />

where <strong>the</strong> path <strong>of</strong> <strong>in</strong>tegration is a circle about <strong>the</strong> orig<strong>in</strong> with counterclockwise orientation.<br />

We def<strong>in</strong>e<br />

Iw,α(ρ, u) = 1<br />

�<br />

2πi<br />

(Mw,α(z, u) − M I w,α(z, u)) dz<br />

.<br />

zn+1 (51)<br />

|z|=ρ<br />

By Cauchy’s <strong>the</strong>orem, we observe that <strong>the</strong> contribution to Qn(u) from a specific w and α is exactly<br />

Q (w,α)<br />

Mw,α(z, u)<br />

n (u) = Iw,α(ρ, u) − Res<br />

z=Aw<br />

+ Res<br />

z=1/(1−P(w))<br />

To simplify this expression, note that<br />

z n+1 − Res<br />

z=Cw(u)<br />

Mw,α(z, u)<br />

z n+1<br />

M I w,α(z, u)<br />

z n+1 + Res<br />

z=1/(1+uP(w)P(α)−P(w))<br />

M I w,α(z, u)<br />

z n+1 . (52)<br />

Mw,α(z, u)<br />

Res<br />

z=Aw zn+1 = − P(β)P(w) 1<br />

Bw A n+1<br />

Mw,α(z, u)<br />

Res<br />

z=Cw(u) z<br />

w<br />

n+1 = P(β)P(w) 1<br />

Ew(u) Cw(u) n+1<br />

M<br />

Res<br />

z=1/(1−P(w))<br />

I w,α(z, u)<br />

zn+1 = P(β)P(w)(1 − P(w)) n<br />

M<br />

Res<br />

z=1/(1+uP(w)P(α)−P(w))<br />

I w,α(z, u)<br />

zn+1 = −P(β)P(w)(1 + uP(w)P(α) − P(w)) n<br />

It follows from (52) that<br />

Q (w,α)<br />

n (u) = Iw,α(ρ, u) + P(β)P(w) 1<br />

Bw A n+1 −<br />

w<br />

P(β)P(w) 1<br />

Ew(u) Cw(u) n+1<br />

+ P(β)P(w)(1 − P(w)) n − P(β)P(w)(1 + uP(w)P(α) − P(w)) n . (54)<br />

We next determ<strong>in</strong>e <strong>the</strong> contribution <strong>of</strong> <strong>the</strong> z = Aw terms <strong>of</strong> M(z, u) and <strong>the</strong> z = 1/(1 − P(w)) terms <strong>of</strong><br />

M I (z, u) to <strong>the</strong> difference Qn(u) = [z n ](M(z, u) − M I (z, u)).<br />

Lemma 3.8 The “Aw terms” and <strong>the</strong> “1/(1−P(w)) terms” (for |w| ≥ K2) altoge<strong>the</strong>r have only O(n−ɛ )<br />

contribution to Qn(u), i.e.,<br />

�<br />

�<br />

Mw,α(z, u)<br />

− Res<br />

z=Aw zn+1 M<br />

+ Res<br />

z=1/(1−P(w))<br />

I w,α(z, u)<br />

zn+1 �<br />

= O(n −ɛ ) , (55)<br />

for some ɛ > 0.<br />

|w|≥K 2<br />

α∈A<br />

(53)


318 Mark Daniel Ward and Wojciech Szpankowski<br />

Pro<strong>of</strong> We def<strong>in</strong>e<br />

fw(x) =<br />

for x real. So by (53) it suffices to prove that<br />

Note that � |w|≥K 2<br />

α∈A<br />

�<br />

|w|≥K 2<br />

α∈A<br />

1<br />

A x+1 + (1 − P(w))<br />

w Bw<br />

x<br />

(56)<br />

P(β)P(w)fw(x) = O(x −ɛ ) . (57)<br />

P(β)P(w)fw(x) is absolutely convergent for all x. Also ¯ fw(x) = fw(x) − fw(0)e −x<br />

is exponentially decreas<strong>in</strong>g when x → +∞ and is O(x) when x → 0 (notice that we utilize <strong>the</strong> fw(0)e−x term <strong>in</strong> order to make sure that ¯ fw(x) = O(x) when x → 0; this provides a fundamental strip for <strong>the</strong><br />

Mell<strong>in</strong> transform <strong>in</strong> <strong>the</strong> next step). Therefore, its Mell<strong>in</strong> transform ¯ f ∗ w(s) = � ∞ ¯fw(x)x 0<br />

s−1 dx is welldef<strong>in</strong>ed<br />

for ℜ(s) > −1 (see Flajolet et al. (1995) and Szpankowski (2001)). We compute<br />

¯f ∗ �<br />

(log Aw)<br />

w(s) = Γ(s)<br />

−s − 1<br />

+ (− log(1 − P(w)))<br />

AwBw<br />

−s �<br />

− 1<br />

where Γ denotes <strong>the</strong> Euler gamma function, and we note that<br />

Also<br />

We def<strong>in</strong>e g ∗ (s) = � |w|≥K 2<br />

α∈A<br />

(58)<br />

(log Aw) −s � �−s P(w)<br />

=<br />

(1 + O(P(w)))<br />

Sw(1)<br />

(− log(1 − P(w))) −s = P(w) −s (1 + O(P(w))) (59)<br />

Aw = 1 + 1<br />

Sw(1) P(w) + O(P(w)2 )<br />

�<br />

Bw = −Sw(1) + − 2S′ �<br />

w(1)<br />

+ m P(w) + O(P(w)<br />

Sw(1) 2 ) (60)<br />

Therefore<br />

1<br />

AwBw<br />

= − 1<br />

+ O(|w|P(w))<br />

Sw(1)<br />

(61)<br />

So ¯ f ∗ ��<br />

w(s) = Γ(s) − 1<br />

Sw(1) +O(|w|P(w))<br />

��� �−s �<br />

P(w)<br />

Sw(1) (1+O(P(w)))−1 +P(w) −s (1+O(P(w)))−<br />

� �<br />

1 = Γ(s) P(w) −s � −Sw(1) s−1 + 1 + O(|w|P(w)) � + 1<br />

�<br />

Sw(1) − 1 + O(|w|P(w)) .<br />

P(β)P(w) ¯ f ∗ w(s). Then we compute<br />

g ∗ (s) = �<br />

P(β) �<br />

α∈A<br />

|w|≥K2<br />

P(w) ¯ f ∗ w(s) = �<br />

P(β)Γ(s)<br />

α∈A<br />

∞�<br />

m=K2<br />

�<br />

sup{q −ℜ(s) �m , 1}δ O(1) , (62)<br />

where <strong>the</strong> last equality is true because 1 ≥ p −ℜ(s) ≥ q −ℜ(s) when ℜ(s) is negative, and also because<br />

q −ℜ(s) ≥ p −ℜ(s) ≥ 1 when ℜ(s) is positive. We always have δ < 1. Also, <strong>the</strong>re exists c > 0 such<br />

that q −c δ < 1. Therefore, g ∗ (s) is analytic <strong>in</strong> ℜ(s) ∈ (−1, c). Work<strong>in</strong>g <strong>in</strong> this strip, we choose ɛ with<br />

0 < ɛ < c. Then we have<br />

�<br />

|w|≥K 2<br />

α∈A<br />

P(β)P(w)fw(x) = 1<br />

2πi<br />

� ɛ+i∞<br />

g<br />

ɛ−i∞<br />

∗ (s)x −s ds + �<br />

|w|≥K2 α∈A<br />

P(β)P(w)fw(0)e −x . (63)<br />

Majoriz<strong>in</strong>g under <strong>the</strong> <strong>in</strong>tegral, we see that <strong>the</strong> first term is O(x −ɛ ) s<strong>in</strong>ce g ∗ (s) is analytic <strong>in</strong> <strong>the</strong> strip<br />

ℜ(s) ∈ (−1, c) (and −1 < ɛ < c). Also, <strong>the</strong> second term is O(e −x ). This completes <strong>the</strong> pro<strong>of</strong> <strong>of</strong> <strong>the</strong><br />

lemma. ✷<br />

Now we bound <strong>the</strong> contribution to Qn(u) from <strong>the</strong> Cw(u) terms <strong>of</strong> M(z, u) and <strong>the</strong> z = 1/(1 +<br />

uP(w)P(α) − P(w)) terms <strong>of</strong> M I (z, u).


<strong>Analysis</strong> <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> 319<br />

Lemma 3.9 The “Cw(u) terms” and <strong>the</strong> “1/(1+uP(w)P(α)−P(w)) terms” (for |w| ≥ K2) altoge<strong>the</strong>r<br />

have only O(n−ɛ ) contribution to Qn(u), for some ɛ > 0. More precisely,<br />

�<br />

�<br />

Mw,α(z, u)<br />

− Res<br />

z=Cw(u) zn+1 M<br />

+ Res<br />

z=1/(1+uP(w)P(α)−P(w))<br />

I w,α(z, u)<br />

zn+1 �<br />

= O(n −ɛ ) . (64)<br />

|w|≥K 2<br />

α∈A<br />

Pro<strong>of</strong> The pro<strong>of</strong> technique is <strong>the</strong> same as <strong>the</strong> one for Lemma 3.8 above. ✷<br />

Next we prove that <strong>the</strong> Iw,α(ρ, u) terms <strong>in</strong> (54) have O(n −ɛ ) contribution to Qn(u).<br />

Lemma 3.10 The “Iw,α(ρ, u) terms” (for |w| ≥ K2) altoge<strong>the</strong>r have only O(n−ɛ ) contribution to<br />

Qn(u), for some ɛ > 0. More precisely,<br />

�<br />

Iw,α(ρ, u) = O(n −ɛ ) . (65)<br />

|w|≥K 2<br />

α∈A<br />

Pro<strong>of</strong> Here we only sketch <strong>the</strong> pro<strong>of</strong>. A rigorous pro<strong>of</strong> is given <strong>in</strong> Ward (2005). Recall that<br />

Iw,α(ρ, u) =<br />

�<br />

�<br />

1<br />

1 Dwα(z) − (1 − z)<br />

uP(β)P(w)<br />

2πi |z|=ρ<br />

Dw(z) Dw(z) − u(Dwα(z) − (1 − z))<br />

�<br />

1<br />

zP(w)P(α)<br />

dz<br />

−<br />

.<br />

1 − z(1 − P(w)) 1 − z(1 + uP(w)P(α) − P(w)) zn+1 (66)<br />

By Lemma 3.7, K2 was selected to be sufficiently large such that (ρp) m (1 − δ −1 ρp) ≤ (ρ − 1)c/2. Thus,<br />

writ<strong>in</strong>g C1 = (ρ−1)c/2, we have 1/|Dw(z)−u(Dwα(z)−(1−z))| ≤ 1/C1 and thus 1/|Dw(z)| ≤ 1/C1.<br />

Also 1/|1 − z(1 − P(w))| ≤ 1/c2 and 1/|1 − z(1 + uP(w)P(α) − P(w))| ≤ 1/c2 by (44). So we obta<strong>in</strong><br />

|Iw,α(ρ, u)| = O(ρ −n )P(w)(Swα(ρ) − 1) + O(ρ −n )P(w)O((pρ) m ) . (67)<br />

Thus, by Lemma 3.3, � �<br />

α∈A |w|=m |Iw,α(ρ, u)| = O(ρ−n )O((ρδ) m ). We conclude � |w|≥K2 |Iw,α(ρ, u)| =<br />

α∈A<br />

O(ρ−n ), and <strong>the</strong> lemma follows. ✷<br />

F<strong>in</strong>ally, we consider <strong>the</strong> contribution to Qn(u) from small words |w|. Basically, we prove that |w| has<br />

a normal distribution with mean 1<br />

h log n and variance θ log n, where h = −p log p − q log q denotes <strong>the</strong><br />

entropy <strong>of</strong> <strong>the</strong> source, and θ is a constant. Therefore, |w| ≤ K2 is extremely unlikely, and as a result, <strong>the</strong><br />

contribution to Qn(u) from words w with |w| ≤ K2 is very small.<br />

Lemma 3.11 The terms � |w|


320 Mark Daniel Ward and Wojciech Szpankowski<br />

In Jacquet and Szpankowski (1994), <strong>the</strong> typical depth D T n+1 <strong>in</strong> a trie built over n+1 <strong>in</strong>dependent str<strong>in</strong>gs<br />

was shown to be asymptotically normal with mean 1<br />

h log(n + 1) and variance θ log(n + 1). We observe<br />

that DI n (def<strong>in</strong>ed <strong>in</strong> (69)) and DT n+1 have <strong>the</strong> same distribution; to see this, observe that P(DI �<br />

n < k) =<br />

|w|=k P(w)(1 − P(w))n = P(DT n+1 < k). Therefore, DI n is also asymptotically normal with mean<br />

1<br />

h log n and variance θ log n. In Ward (2005), we rigorously prove that DI n and Dn have asymptotically<br />

<strong>the</strong> same distribution, namely, a normal distribution with mean 1<br />

h log(n + 1) and variance θ log(n + 1).<br />

Therefore, consider<strong>in</strong>g (71) (and not<strong>in</strong>g that K2 is a constant), it follows that<br />

[z n ] �<br />

|Mw,α(z, u) − M I w,α(z, u)| = O(n −ɛ ) . (72)<br />

|w| 1.<br />

For ease <strong>of</strong> notation, we def<strong>in</strong>e b = δ −1 . F<strong>in</strong>ally, we apply Cauchy’s <strong>the</strong>orem aga<strong>in</strong>. We compute<br />

P(Mn = k) − P(M I n = k) = [u k z n ]Q(z, u) = [u k ]Qn(u) = 1<br />

�<br />

2πi |u|=b<br />

S<strong>in</strong>ce Qn(u) = O(n −ɛ ), it follows that<br />

Qn(u)<br />

du . (73)<br />

uk+1 |P(Mn = k) − P(M I n = k)| ≤ 1<br />

|2πi| (2πb)O(n−ɛ )<br />

b k+1 = O(n−ɛ b −k ) . (74)<br />

So Theorem 2.1 holds. It follows that Mn and M I n have asymptotically <strong>the</strong> same distribution, and <strong>the</strong>refore<br />

Mn and M I n asymptotically have <strong>the</strong> same factorial moments. The ma<strong>in</strong> result <strong>of</strong> Ward and Szpankowski<br />

(2004) gives <strong>the</strong> asymptotic distribution and factorial moments <strong>of</strong> M I n. As a result, Theorem 2.2 follows<br />

immediately. Therefore, Mn follows <strong>the</strong> logarithmic series distribution, i.e., P(Mn = j) = pjq+q j p<br />

jh (plus<br />

some small fluctuations if ln p/ ln q is rational).<br />

Acknowledgements<br />

The second author warmly thanks Philippe Flajolet for his hospitality dur<strong>in</strong>g his recent stay at INRIA.<br />

References<br />

J. Fayolle. An average-case analysis <strong>of</strong> basic <strong>parameter</strong>s <strong>of</strong> <strong>the</strong> <strong>suffix</strong> tree. In M. Drmota, P. Flajolet,<br />

D. Gardy, and B. Gittenberger, editors, Ma<strong>the</strong>matics and Computer Science, pages 217–227, Vienna,<br />

Austria, 2004. Birkhäuser. Proceed<strong>in</strong>gs <strong>of</strong> a colloquium organized by TU Wien.<br />

J. Fayolle and M. D. Ward. <strong>Analysis</strong> <strong>of</strong> <strong>the</strong> average depth <strong>in</strong> a <strong>suffix</strong> tree under a markov model. In<br />

International Conference on <strong>the</strong> <strong>Analysis</strong> <strong>of</strong> Algorithms, Barcelona, 2005.<br />

P. Flajolet, X. Gourdon, and P. Dumas. Mell<strong>in</strong> transforms and asymptotics: Harmonic sums. Theoretical<br />

Computer Science, 144:3–58, 1995.<br />

L. Guibas and A. M. Odlyzko. Periods <strong>in</strong> str<strong>in</strong>gs. J. Comb<strong>in</strong>atorial Theory, 30:19–43, 1981.<br />

P. Jacquet and M. Régnier. Limit<strong>in</strong>g distributions for trie <strong>parameter</strong>s. In Lecture Notes <strong>in</strong> Computer<br />

Science, volume 214, pages 196–210. Spr<strong>in</strong>ger, 1986.<br />

P. Jacquet and M. Régnier. Normal limit<strong>in</strong>g distributions <strong>of</strong> <strong>the</strong> size <strong>of</strong> tries. In Proc. Performance ’87,<br />

pages 209–223, Amsterdam, 1987. North Holland.<br />

P. Jacquet and W. Szpankowski. <strong>Analysis</strong> <strong>of</strong> digital tries with markovian dependency. IEEE Transactions<br />

on Information Theory, 37:1470–1475, 1991.


<strong>Analysis</strong> <strong>of</strong> <strong>the</strong> <strong>multiplicity</strong> <strong>match<strong>in</strong>g</strong> <strong>parameter</strong> <strong>in</strong> <strong>suffix</strong> <strong>trees</strong> 321<br />

P. Jacquet and W. Szpankowski. Autocorrelation on words and its applications. analysis <strong>of</strong> <strong>suffix</strong> <strong>trees</strong> by<br />

str<strong>in</strong>g-ruler approach. Journal <strong>of</strong> Comb<strong>in</strong>atorial Theory, A66:237–269, 1994.<br />

P. Kirschenh<strong>of</strong>er and H. Prod<strong>in</strong>ger. Fur<strong>the</strong>r results on digital search tries. Theoretical Computer Science,<br />

58:143–154, 1988.<br />

P. Kirschenh<strong>of</strong>er, H. Prod<strong>in</strong>ger, and W. Szpankowski. On <strong>the</strong> variance <strong>of</strong> <strong>the</strong> external path <strong>in</strong> a symmetric<br />

digital trie. Discrete Applied Ma<strong>the</strong>matics, 25:129–143, 1989.<br />

S. Lonardi and W. Szpankowski. Jo<strong>in</strong>t source–channel LZ’77 cod<strong>in</strong>g. In IEEE Data Compression Conference,<br />

pages 273–282, Snowbird, 2003.<br />

M. Lothaire, editor. Applied Comb<strong>in</strong>atorics on Words, chapter 7. Cambridge, 2005. Analytic Approach<br />

to Pattern Match<strong>in</strong>g by P. Jacquet and W. Szpankowski.<br />

M. Régnier and W. Szpankowski. On pattern frequency occurrences <strong>in</strong> a Markovian sequence. Algorithmica,<br />

22:631–649, 1998.<br />

W. Szpankowski. A generalized <strong>suffix</strong> tree and its (un)expected asymptotic behaviors. SIAM J. Comput<strong>in</strong>g,<br />

22:1176–1198, 1993.<br />

W. Szpankowski. Average Case <strong>Analysis</strong> <strong>of</strong> Algorithms on Sequences. Addison-Wesley, Wiley, 2001.<br />

M. D. Ward. <strong>Analysis</strong> <strong>of</strong> <strong>the</strong> Multiplicity Match<strong>in</strong>g Parameter <strong>in</strong> Suffix Trees. PhD <strong>the</strong>sis, Purdue University,<br />

West Lafayette, IN, May 2005.<br />

M. D. Ward and W. Szpankowski. <strong>Analysis</strong> <strong>of</strong> a randomized selection algorithm motivated by <strong>the</strong> LZ’77<br />

scheme. In 1st Workshop on Analytic Algorithmics and Comb<strong>in</strong>atorics, pages 153–160, New Orleans,<br />

2004.<br />

J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on<br />

Information Theory, 23:337–343, 1977.


322 Mark Daniel Ward and Wojciech Szpankowski

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!