Construction of weak and strong similarity measures for ordered sets ...


L. Egghe, C. Michel / Information Processing and Management 39 (2003) 771–807

Jaccard's index $J$

\[ J(A,B) = \frac{|A \cap B|}{|A \cup B|} \tag{1} \]

Dice's index $D$

\[ D(A,B) = \frac{2|A \cap B|}{|A| + |B|} \tag{2} \]

Cosine function $\mathrm{Cos}$

\[ \mathrm{Cos}(A,B) = \frac{|A \cap B|}{\sqrt{|A|\,|B|}} \tag{3} \]

The measure $N$

\[ N(A,B) = \frac{\sqrt{2}\,|A \cap B|}{\sqrt{|A|^2 + |B|^2}} \tag{4} \]

Overlap measures $O_1$ and $O_2$

\[ O_1(A,B) = \frac{|A \cap B|}{\min(|A|,|B|)} \tag{5} \]

\[ O_2(A,B) = \frac{|A \cap B|}{\max(|A|,|B|)} \tag{6} \]

Recall $R$ and precision $P$

\[ R(A,B) = \frac{|A \cap B|}{|B|} \tag{7} \]

\[ P(A,B) = \frac{|A \cap B|}{|A|} \tag{8} \]

Of course here one must add that $A = \mathrm{ret}$, the set of retrieved documents, and $B = \mathrm{rel}$, the set of relevant documents.

Most of these measures are classical and well-known. For readers who wish a longer introduction to these measures we refer to van Rijsbergen (1979), Salton and McGill (1987), Boyce, Meadow, and Kraft (1995), Tague-Sutcliffe (1995), Grossman and Frieder (1998), Losee (1998) or Egghe and Michel (2002).

All of the above measures are of the form

\[ F(A,B) = \frac{w_1(|A \cap B|)}{w_2(|A|,|B|,|A \cup B|)} \tag{9} \]

where $w_1$ is a strictly increasing function and $w_2$ is an increasing function of three variables.

Let us consider the following general properties for a general measure $F$: for all $A, B \subseteq X$:

\[ (F_1)\quad 0 \le F(A,B) \le 1 \tag{10} \]

\[ (F_2)\quad F(A,B) = 1 \iff A = B \tag{11} \]
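As a small illustration of Eqs. (1)–(8), the following Python sketch evaluates all eight classical measures on ordinary finite sets. The function name `similarity_measures` and the document identifiers are hypothetical and chosen only for this example; the sketch assumes both sets are non-empty so that no denominator vanishes.

```python
import math


def similarity_measures(A, B):
    """Evaluate the classical measures of Eqs. (1)-(8) for two non-empty finite sets."""
    a, b = set(A), set(B)
    inter = len(a & b)          # |A ∩ B|
    union = len(a | b)          # |A ∪ B|
    x, y = len(a), len(b)       # |A|, |B|
    return {
        "J": inter / union,                                   # Eq. (1)
        "D": 2 * inter / (x + y),                             # Eq. (2)
        "Cos": inter / math.sqrt(x * y),                      # Eq. (3)
        "N": math.sqrt(2) * inter / math.sqrt(x**2 + y**2),   # Eq. (4)
        "O1": inter / min(x, y),                              # Eq. (5)
        "O2": inter / max(x, y),                              # Eq. (6)
        "R": inter / y,                                       # Eq. (7), B = relevant
        "P": inter / x,                                       # Eq. (8), A = retrieved
    }


# Hypothetical example data: A = retrieved documents, B = relevant documents.
ret = {"d1", "d2", "d3", "d4"}
rel = {"d3", "d4", "d5"}
m = similarity_measures(ret, rel)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these example sets, $|A \cap B| = 2$, $|A \cup B| = 5$, $|A| = 4$ and $|B| = 3$, so for instance $J = 2/5$ and $P = 2/4$.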


very similar purposes, albeit in distinct technological segments, form a cutting-edge operational base that IT specialists will very soon regard as a first-order competitive differentiator.

This observation, although it follows from the technological evolution of the IT segment itself and from the new demands of stock markets worldwide, is not trivial. The operational and informational-technology infrastructure of organizations, especially those whose shares are traded on stock exchanges, has long been aligned with heavy and recurrent investment in Information Technology, which may give the impression that companies' current technological configurations in this area have come close to the limit of operational efficiency.

The motivation for this research therefore lies in the present and future (and inexorable) adoption of these technologies by companies that have already completed the first technological cycle of implementing ERPs and the other specialist systems that today make up their informational arsenal. The study starts from an immersion in the theories related to the new technologies and adds current examples of their applicability. It is hoped that this research will have the merit of associating concepts of new technologies (even if promoted by the software industry) with the concept of business competitiveness, from the perspective of controllership.

2. Literature Review

The continuous evolution of IT resources, in both software and hardware, means that in the great majority of companies the information-generating area is made up of distinct technological platforms, operating in a "natural" way without adequate levels of integration among their corporate systems. This is one of the recurring problems in corporate governance, since the integrity and speed with which information reaches company management and the stock market are diverse and heterogeneous. For this reason the stock market, controllership departments and even regulatory bodies, such as the SEC (Sarbanes-Oxley Act) and BACEN (Resolution 3380), among others, have been gradually setting standards and consequently generating demand for a new operational, informational and technological dynamic in organizations.

To meet new service levels of informational quality and efficiency, both inside and outside companies, controllership departments have been seeking to simplify processes and adopt technology-integrating tools capable of reducing the operational complexity of IT, which often constitutes a real obstacle to informational reliability and efficiency. In parallel and in the same direction, the simplification and homogenization of accounting publication methods, and the consequent online sharing of information that can be reworked according to the particular interest of each information user, is already a reality; its full adoption by all companies is only a matter of time.

As Murray (2007, p. 01) puts it:

While many IT professionals are quick to extol the strategic value of IT, the business world views this subject in a rather


Then we have

(I) All strong similarity measures $J$, $D$, $\mathrm{Cos}$, $N$, $O_2$ satisfy the above requirements.
(II) $F$ is a good strong similarity measure, i.e. satisfies the requirements ($F_1$), ($F_2$), ($F_3$), ($F_4$).

Proof

(I) As mentioned already, all measures $J$, $D$, $\mathrm{Cos}$, $N$, $O_2$ are of the type (14). We will investigate whether the functions $w_1$ and $w_2$ (in each case) satisfy the properties announced in this theorem.

$J$: Here $w_1(a) = a$ and $w_2(x,y,z) = z$, hence (i) is satisfied. Let now ($a \le x$ and $a < y$) or ($a < x$ and $a \le y$). Then, since for $J$: $z = |A \cup B|$, with $|A \cup B| \ge |A| = x$ and $|A \cup B| \ge |B| = y$, we have $a < z$ and hence $w_1(a) < w_2(x,y,z)$, proving (ii). The proof of (iii) and (iv) is similar.

$D$: Here $w_1(a) = a$ and $w_2(x,y,z) = (x+y)/2$. Hence (i) is satisfied. Let now ($a \le x$ and $a < y$) or ($a < x$ and $a \le y$). This implies $a < (x+y)/2$, hence $w_1(a) < w_2(x,y,z)$, proving (ii). The proof of (iii) and (iv) is similar.

$\mathrm{Cos}$: Here $w_1(a) = a$ and $w_2(x,y,z) = \sqrt{xy}$. Hence (i) is satisfied. Let now ($a \le x$ and $a < y$) or ($a < x$ and $a \le y$). This implies $a < \sqrt{xy}$, proving (ii). The proof of (iii) and (iv) is similar.

$N$: Here $w_1(a) = a$ and $w_2(x,y,z) = \sqrt{(x^2+y^2)/2}$. Hence (i) is satisfied. Let now ($a \le x$ and $a < y$) or ($a < x$ and $a \le y$). This implies $2a^2 < x^2 + y^2$, hence $a < \sqrt{(x^2+y^2)/2}$, proving (ii). The proof of (iii) and (iv) is similar.

$O_2$: Here $w_1(a) = a$ and $w_2(x,y,z) = \max(x,y)$. Hence (i) is satisfied. Let now ($a \le x$ and $a < y$) or ($a < x$ and $a \le y$). Then $a < \max(x,y)$, implying (ii). The proof of (iii) and (iv) is similar.

(II) We now show that $F$ satisfies ($F_1$), ($F_2$), ($F_3$), ($F_4$).

($F_1$) (iii) implies, since $|A \cap B| \le |A|$ and $|A \cap B| \le |B|$, that $w_1(|A \cap B|) \le w_2(|A|,|B|,z)$ for all $z$ with $z \ge |A|$ and $z \ge |B|$. Hence also for $z = |A \cup B|$. This proves $F(A,B) \le 1$. Of course, since $w_1$ and $w_2$ are positive, so is $F$.

($F_2$) $F(A,B) = 1$ implies

\[ w_1(|A \cap B|) = w_2(|A|,|B|,|A \cup B|) \tag{15} \]

If $|A \cap B| < |A|$ or $|A \cap B| < |B|$ we see, by (ii), that $w_1(|A \cap B|) < w_2(|A|,|B|,|A \cup B|)$, since $|A \cup B| \ge |A|$ and $|A \cup B| \ge |B|$, contradicting (15). Hence $|A \cap B| = |A| = |B|$, hence $A = B$.

Conversely, if $A = B$ then $|A \cap B| = |A| = |B|$, hence, by (iv),

\[ w_1(|A \cup B|) = w_2(|A|,|B|,|A \cup B|) \]

since $A = B$ implies $|A \cup B| = |A| = |B|$. Hence $F(A,B) = 1$.
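The general form $F = w_1(|A \cap B|) / w_2(|A|,|B|,|A \cup B|)$ and the properties ($F_1$)–($F_3$) can be checked by brute force on a small universe. The sketch below is only a numerical illustration, not a substitute for the proof: it takes $w_1(a) = a$ and the five $w_2$ functions listed in part (I), and enumerates all pairs of non-empty subsets. The helper names (`W2_STRONG`, `check_strong`, `nonempty_subsets`) and the tolerance are assumptions made for this example.

```python
import math
from itertools import combinations

# w2(x, y, z) for each strong measure, with w1(a) = a, as in part (I) of the proof.
W2_STRONG = {
    "J": lambda x, y, z: z,
    "D": lambda x, y, z: (x + y) / 2,
    "Cos": lambda x, y, z: math.sqrt(x * y),
    "N": lambda x, y, z: math.sqrt((x**2 + y**2) / 2),
    "O2": lambda x, y, z: max(x, y),
}


def F(A, B, w2, w1=lambda a: a):
    """General measure w1(|A ∩ B|) / w2(|A|, |B|, |A ∪ B|) for non-empty sets."""
    return w1(len(A & B)) / w2(len(A), len(B), len(A | B))


def nonempty_subsets(universe):
    """Yield every non-empty subset of a finite universe as a frozenset."""
    items = sorted(universe)
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def check_strong(w2, universe, tol=1e-12):
    """Return True if (F1), (F2) and (F3) hold for all pairs of non-empty subsets."""
    subsets = list(nonempty_subsets(universe))
    for A in subsets:
        for B in subsets:
            f = F(A, B, w2)
            if not (-tol <= f <= 1 + tol):          # (F1): 0 <= F <= 1
                return False
            if (abs(f - 1) <= tol) != (A == B):     # (F2): F = 1 iff A = B
                return False
            if (f == 0) != (not (A & B)):           # (F3): F = 0 iff A ∩ B is empty
                return False
    return True


results = {name: check_strong(w2, range(5)) for name, w2 in W2_STRONG.items()}
print(results)
```

Running the same check with $w_2(x,y,z) = \min(x,y)$, i.e. the overlap measure $O_1$, fails at ($F_2$), because $O_1(A,B) = 1$ whenever one set contains the other.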


($F_3$) $F(A,B) = 0$ iff $w_1(|A \cap B|) = 0$ iff $|A \cap B| = 0$ ($w_1$ is injective and $w_1(0) = 0$) iff $A \cap B = \emptyset$.

($F_4$) This is clear from (14) and the fact that $w_1$ strictly increases.

Theorem 2.2. Let $F$ be as in (14), with $w_1$, $w_2$ functions $\mathbb{R}^+ \to \mathbb{R}^+$ such that

(i) $w_1 \ge 0$, $w_2 \ge 0$, $w_1$ strictly increasing, $w_2$ increasing (in its three variables), $w_1(0) = 0$ (i.e. the same condition as (i) in the previous theorem),
(ii) $a < \min(x,y,z)$ implies $w_1(a) < w_2(x,y,z)$ for all $a$, $x$, $y$, $z$,
(iii) $a \le \min(x,y,z)$ implies $w_1(a) \le w_2(x,y,z)$ for all $a$, $x$, $y$, $z$.

Then we have

(I) All the weak similarity measures $O_1$, $R$ and $P$ satisfy the above requirements.
(II) $F$ is a good weak similarity measure, i.e. satisfies the requirements ($F_1$), ($F_2'$), ($F_3$), ($F_4$).

Proof

(I) Again, the measures $O_1$, $R$, $P$ are of the type (14). We will now check whether the functions $w_1$, $w_2$ (in each case) satisfy the requirements in this theorem.

$O_1$: Here $w_1(a) = a$, $w_2(x,y,z) = \min(x,y)$. Hence (i) is satisfied. Let now $a < \min(x,y,z)$. Hence $a < \min(x,y)$, implying $w_1(a) < w_2(x,y,z)$, hence proving (ii). The proof of (iii) is similar.

$R$: Here $w_1(a) = a$, $w_2(x,y,z) = y$, hence (i) is satisfied. Let now $a < \min(x,y,z)$. Hence $a < y$ and hence (ii) is satisfied. The proof of (iii) is similar.

$P$: This proof is similar to the one of $R$.

(II) We will show that $F$ satisfies ($F_1$), ($F_2'$), ($F_3$), ($F_4$).

($F_1$) $F \ge 0$ since $w_1$ and $w_2$ are positive. Further, since $|A \cap B| \le \min(|A|,|B|,|A \cup B|)$ we have, by (iii), that

\[ w_1(|A \cap B|) \le w_2(|A|,|B|,|A \cup B|) \]

hence $F(A,B) \le 1$.

($F_2'$) Let $F(A,B) = 1$. Hence $w_1(|A \cap B|) = w_2(|A|,|B|,|A \cup B|)$. By (ii) and (iii) and since $|A \cap B| \le \min(|A|,|B|,|A \cup B|)$, we have that

\[ |A \cap B| = \min(|A|,|B|,|A \cup B|) = \min(|A|,|B|) \]

implying that $A \subseteq B$ or $B \subseteq A$.

($F_3$) $F(A,B) = 0$ iff $w_1(|A \cap B|) = 0$ iff $|A \cap B| = 0$ ($w_1$ is injective and $w_1(0) = 0$) iff $A \cap B = \emptyset$.

($F_4$) This is clear from (14) and the fact that $w_1$ strictly increases.
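The weaker requirement ($F_2'$) can be illustrated in the same brute-force way. The following sketch, under the assumption $w_1(a) = a$ and with hypothetical helper names (`W2_WEAK`, `check_weak`), verifies on a five-element universe that $O_1$, $R$ and $P$ never reach the value 1 unless one set is contained in the other, and that they remain within $[0,1]$ and vanish exactly on disjoint sets.

```python
from itertools import combinations

# w2(x, y, z) for the weak measures, with w1(a) = a, as in part (I) of the proof.
W2_WEAK = {
    "O1": lambda x, y, z: min(x, y),
    "R": lambda x, y, z: y,
    "P": lambda x, y, z: x,
}


def F(A, B, w2):
    """General measure |A ∩ B| / w2(|A|, |B|, |A ∪ B|) for non-empty sets."""
    return len(A & B) / w2(len(A), len(B), len(A | B))


def check_weak(w2, universe):
    """Return True if (F1), (F2') and (F3) hold for all pairs of non-empty subsets."""
    items = sorted(universe)
    subsets = [frozenset(c)
               for r in range(1, len(items) + 1)
               for c in combinations(items, r)]
    for A in subsets:
        for B in subsets:
            f = F(A, B, w2)
            if not (0 <= f <= 1):                        # (F1)
                return False
            if f == 1 and not (A <= B or B <= A):        # (F2'): F = 1 implies inclusion
                return False
            if (f == 0) != (not (A & B)):                # (F3)
                return False
    return True


weak_results = {name: check_weak(w2, range(5)) for name, w2 in W2_WEAK.items()}
print(weak_results)

# A proper subset already gives O1 = 1, which is why (F2) is replaced by (F2').
small, large = frozenset({1}), frozenset({1, 2})
print(F(small, large, W2_WEAK["O1"]), F(small, large, W2_WEAK["R"]))
```

The last line shows the asymmetry between the weak measures: for $A = \{1\} \subset B = \{1,2\}$, $O_1(A,B) = 1$ while $R(A,B) = 1/2$.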


for both weak and strong similarity measures (and in fact is applicable to every measure that we encountered).

In fact, defining the fuzzy variants of the weak and strong similarity measures is very easy: they are simply (14), interpreted in the fuzzy way. Let $A$ and $B$ be fuzzy sets. We define

\[ F(A,B) = \frac{w_1(|A \cap B|)}{w_2(|A|,|B|,|A \cup B|)} \tag{20} \]

where we use definitions (16), (17), (19) and where $w_1$ and $w_2$ are as in the previous section. Hence (20) is

\[ F(A,B) = \frac{w_1\left(\sum_{x \in A \cap B} \min(P_A(x), P_B(x))\right)}{w_2\left(\sum_{x \in A} P_A(x),\ \sum_{x \in B} P_B(x),\ \sum_{x \in A \cup B} \max(P_A(x), P_B(x))\right)} \tag{21} \]

We say that $F$, read in this fuzzy sense, is the fuzzy similarity measure derived from the ordinary-set measure $F$. We will prove that it is a good weak or strong similarity measure under the same conditions as in Theorems 2.1 and 2.2 for ordinary sets. The definitions of good strong and weak similarity measures for fuzzy sets are the same as in the case of ordinary sets:

Definition 3.1. We say that $F$ is a strong similarity measure for fuzzy sets if, for all fuzzy sets $A$, $B$:

\[ (F_1)\quad 0 \le F(A,B) \le 1 \tag{22} \]

\[ (F_2)\quad F(A,B) = 1 \iff A = B \tag{23} \]

\[ (F_3)\quad F(A,B) = 0 \iff A \cap B = \emptyset \tag{24} \]

($F_4$) If the denominator of $F$ is constant then $F$ is strictly increasing with $|A \cap B|$.

Definition 3.2. We say that $F$ is a weak similarity measure for fuzzy sets if $F$ satisfies ($F_1$), ($F_3$), ($F_4$) above and

\[ (F_2')\quad F(A,B) = 1 \implies A \subseteq B \text{ or } B \subseteq A \]

All these notations have to be interpreted in the fuzzy way.

We have the following results.

Theorem 3.3. Let $F$ be as in (20) and let $w_1$, $w_2$ have the same properties as in Theorem 2.1. Then $F$ is a good strong similarity measure for fuzzy sets.

Proof

($F_1$) Obviously $F \ge 0$. Since $|A \cap B| \le |A|$ and $|A \cap B| \le |B|$, we have that

\[ w_1(|A \cap B|) \le w_2(|A|,|B|,|A \cup B|) \]

since $|A \cup B| \ge |A|$ and $|A \cup B| \ge |B|$. We use here the definitions (16)–(19). Hence $F \le 1$.

($F_2$) Let $F(A,B) = 1$. Hence $w_1(|A \cap B|) = w_2(|A|,|B|,|A \cup B|)$


Hence

\[ |A \cap B| = |A| = |B| \tag{25} \]

by (ii) and (iii) in Theorem 2.1. (25) means

\[ \sum_{x \in A \cap B} \min(P_A(x), P_B(x)) = \sum_{x \in A} P_A(x) = \sum_{x \in B} P_B(x). \]

Since

\[ \forall x \in A \cap B:\ \min(P_A(x), P_B(x)) \le P_A(x) \ \text{and}\ \min(P_A(x), P_B(x)) \le P_B(x), \]

\[ \forall x \in A \setminus (A \cap B):\ P_A(x) \ge 0, \]

\[ \forall x \in B \setminus (A \cap B):\ P_B(x) \ge 0, \]

we have that

\[ \forall x \in A \cap B:\ \min(P_A(x), P_B(x)) = P_A(x) = P_B(x) \tag{26} \]

and

\[ \forall x \in A \setminus (A \cap B):\ P_A(x) = 0 \tag{27} \]

\[ \forall x \in B \setminus (A \cap B):\ P_B(x) = 0 \tag{28} \]

Eqs. (27) and (28) imply that $A \cap B = A = B$ as ordinary sets and then (26) implies that $A = B$ as fuzzy sets.

Conversely, if $A = B$ as fuzzy sets then $|A \cap B| = |A| = |B|$ in the fuzzy sense. Hence (iv) in Theorem 2.1 yields

\[ w_1(|A \cap B|) = w_2(|A|,|B|,|A \cup B|); \]

hence $F(A,B) = 1$.

($F_3$) $F(A,B) = 0$ iff $w_1(|A \cap B|) = 0$ iff $|A \cap B| = 0$ ($w_1$ injective and $w_1(0) = 0$) iff

\[ \sum_{x \in A \cap B} \min(P_A(x), P_B(x)) = 0. \]

This is equivalent with $\forall x \in A \cap B$: $\min(P_A(x), P_B(x)) = 0 \iff A \cap B = \emptyset$.

($F_4$) is trivial.

Theorem 3.4. Let $F$ be as in (20) and let $w_1$, $w_2$ have the same properties as in Theorem 2.2. Then $F$ is a good weak similarity measure for fuzzy sets.

Proof

($F_1$) Obviously $F \ge 0$. Since $|A \cap B| \le \min(|A|,|B|,|A \cup B|)$, (iii) in Theorem 2.2 implies

\[ w_1(|A \cap B|) \le w_2(|A|,|B|,|A \cup B|) \]

hence $F \le 1$.

($F_2'$) $F(A,B) = 1$ implies $w_1(|A \cap B|) = w_2(|A|,|B|,|A \cup B|)$. Properties (ii) and (iii) in Theorem 2.2 imply
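The fuzzy reading of the general measure in Eq. (21) can be sketched in a few lines of Python. The sketch assumes that a fuzzy set is represented as a dictionary mapping each element of its support to a membership value in $(0,1]$ (written $P_A(x)$ in the text), and that the fuzzy cardinalities are the sums appearing in Eq. (21); the function names and the example data are hypothetical.

```python
def fuzzy_counts(PA, PB):
    """Return (|A ∩ B|, |A|, |B|, |A ∪ B|) as the sums used in Eq. (21).

    PA and PB map elements to membership values; an element that is absent
    from a dictionary is treated as having membership 0 in that fuzzy set.
    """
    inter = sum(min(PA[x], PB[x]) for x in PA.keys() & PB.keys())
    card_a = sum(PA.values())
    card_b = sum(PB.values())
    union = sum(max(PA.get(x, 0.0), PB.get(x, 0.0)) for x in PA.keys() | PB.keys())
    return inter, card_a, card_b, union


def fuzzy_measure(PA, PB, w2, w1=lambda a: a):
    """Fuzzy similarity w1(|A ∩ B|) / w2(|A|, |B|, |A ∪ B|), following Eq. (21)."""
    inter, card_a, card_b, union = fuzzy_counts(PA, PB)
    return w1(inter) / w2(card_a, card_b, union)


def jaccard_w2(x, y, z):
    """w2 for Jaccard (strong measure): the cardinality of the union."""
    return z


def recall_w2(x, y, z):
    """w2 for recall (weak measure): the cardinality of B."""
    return y


# Hypothetical fuzzy sets with dyadic membership values.
A = {"d1": 1.0, "d2": 0.5, "d3": 0.25}
B = {"d1": 0.5, "d2": 0.5, "d4": 1.0}

print(fuzzy_counts(A, B))                 # intersection, |A|, |B|, union
print(fuzzy_measure(A, B, jaccard_w2))    # fuzzy Jaccard
print(fuzzy_measure(A, B, recall_w2))     # fuzzy recall
print(fuzzy_measure(A, A, jaccard_w2))    # identical fuzzy sets
```

For the example data the four sums are 1.0, 1.75, 2.0 and 2.75, so the fuzzy Jaccard value is $1/2.75$ and the fuzzy recall is $1/2$; comparing a fuzzy set with itself gives 1, in line with ($F_2$).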


Hence, for all $x \in C'_j \setminus \bigcup_{i=1}^{\infty} C_i$ as in (iv) we have

\[ P_{U_C \cup U_{C'}}(x) = \frac{1}{2^{j-1}} \tag{46} \]

Since in (42) all unions are disjoint, all these values add up, leading to

\[ |U_C \cup U_{C'}| = \sum_{i=1}^{\infty} \sum_{j=i}^{\infty} |C_i \cap C'_j| \frac{1}{2^{i-1}} + \sum_{i=1}^{\infty} \sum_{j=1}^{i-1} |C_i \cap C'_j| \frac{1}{2^{j-1}} + \sum_{i=1}^{\infty} \left| C_i \setminus \bigcup_{j=1}^{\infty} C'_j \right| \frac{1}{2^{i-1}} + \sum_{j=1}^{\infty} \left| C'_j \setminus \bigcup_{i=1}^{\infty} C_i \right| \frac{1}{2^{j-1}} \tag{47} \]

This concludes the proof of the lemma.

We can now present concrete (but intricate) formulae of good weak or strong similarity measures for ordered sets (considered as fuzzy sets), based on (35) and the lemma.

I. Weak measures

I.1. Recall

We have

\[ R(C,C') = \frac{|U_C \cap U_{C'}|}{|U_{C'}|} \]

\[ R(C,C') = \frac{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} |C_i \cap C'_j| \frac{1}{2^{\max(i,j)-1}}}{\sum_{j=1}^{\infty} |C'_j| \frac{1}{2^{j-1}}} \tag{48} \]

I.2. Precision

\[ P(C,C') = \frac{|U_C \cap U_{C'}|}{|U_C|} \]

\[ P(C,C') = \frac{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} |C_i \cap C'_j| \frac{1}{2^{\max(i,j)-1}}}{\sum_{i=1}^{\infty} |C_i| \frac{1}{2^{i-1}}} \tag{49} \]

I.3. Overlap measure $O_1$

\[ O_1(C,C') = \frac{|U_C \cap U_{C'}|}{\min(|U_C|, |U_{C'}|)} \]

\[ O_1(C,C') = \frac{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} |C_i \cap C'_j| \frac{1}{2^{\max(i,j)-1}}}{\min\left( \sum_{i=1}^{\infty} |C_i| \frac{1}{2^{i-1}},\ \sum_{j=1}^{\infty} |C'_j| \frac{1}{2^{j-1}} \right)} \tag{50} \]
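Eqs. (48)–(50) translate directly into code once an ordered set is given a concrete representation. The sketch below assumes that an ordered set $C = (C_1, C_2, \ldots)$ is a finite Python list of pairwise disjoint sets of document identifiers, ordered by rank, and that a document in class $i$ carries the weight $1/2^{i-1}$ that appears in the formulas. Function names and example data are hypothetical.

```python
def fuzzy_intersection(C, Cp):
    """Numerator shared by Eqs. (48)-(50): sum over i, j of |C_i ∩ C'_j| / 2^(max(i,j)-1)."""
    return sum(len(Ci & Cj) / 2 ** (max(i, j) - 1)
               for i, Ci in enumerate(C, start=1)
               for j, Cj in enumerate(Cp, start=1))


def fuzzy_cardinality(C):
    """|U_C| as used in the denominators: sum over i of |C_i| / 2^(i-1)."""
    return sum(len(Ci) / 2 ** (i - 1) for i, Ci in enumerate(C, start=1))


def recall(C, Cp):
    """Eq. (48): the denominator is the fuzzy cardinality of C'."""
    return fuzzy_intersection(C, Cp) / fuzzy_cardinality(Cp)


def precision(C, Cp):
    """Eq. (49): the denominator is the fuzzy cardinality of C."""
    return fuzzy_intersection(C, Cp) / fuzzy_cardinality(C)


def overlap1(C, Cp):
    """Eq. (50): the denominator is the smaller of the two fuzzy cardinalities."""
    return fuzzy_intersection(C, Cp) / min(fuzzy_cardinality(C), fuzzy_cardinality(Cp))


# Hypothetical ordered sets: rank 1 first, classes pairwise disjoint.
C = [{"d1"}, {"d2", "d3"}, {"d4"}]
Cp = [{"d2"}, {"d1"}]

print(fuzzy_intersection(C, Cp))                     # 1/2 (for d1) + 1/2 (for d2)
print(fuzzy_cardinality(C), fuzzy_cardinality(Cp))   # 2.25 and 1.5
print(recall(C, Cp), precision(C, Cp), overlap1(C, Cp))
```

In this example the document d1 sits in class 1 of $C$ and class 2 of $C'$, so it contributes $1/2^{\max(1,2)-1} = 1/2$ to the numerator, and likewise for d2. Because $|U_{C'}| < |U_C|$ here, $O_1$ coincides with the recall value.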


II. Strong measures

II.1. Jaccard $J$

\[ J(C,C') = \frac{|U_C \cap U_{C'}|}{|U_C \cup U_{C'}|} \]

\[ J(C,C') = \frac{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} |C_i \cap C'_j| \frac{1}{2^{\max(i,j)-1}}}{a} \tag{51} \]

where

\[ a = \sum_{i=1}^{\infty} \sum_{j=i}^{\infty} |C_i \cap C'_j| \frac{1}{2^{i-1}} + \sum_{i=1}^{\infty} \sum_{j=1}^{i-1} |C_i \cap C'_j| \frac{1}{2^{j-1}} + \sum_{i=1}^{\infty} \left| C_i \setminus \bigcup_{j=1}^{\infty} C'_j \right| \frac{1}{2^{i-1}} + \sum_{j=1}^{\infty} \left| C'_j \setminus \bigcup_{i=1}^{\infty} C_i \right| \frac{1}{2^{j-1}} \]

II.2. Dice $D$

\[ D(C,C') = \frac{2|U_C \cap U_{C'}|}{|U_C| + |U_{C'}|} \]

\[ D(C,C') = \frac{2 \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} |C_i \cap C'_j| \frac{1}{2^{\max(i,j)-1}}}{\sum_{i=1}^{\infty} |C_i| \frac{1}{2^{i-1}} + \sum_{j=1}^{\infty} |C'_j| \frac{1}{2^{j-1}}} \tag{52} \]

II.3. Cosine $\mathrm{Cos}$

\[ \mathrm{Cos}(C,C') = \frac{|U_C \cap U_{C'}|}{\sqrt{|U_C|\,|U_{C'}|}} \]

\[ \mathrm{Cos}(C,C') = \frac{\sum_{i=1}^{\infty} \sum_{j=1}^{\infty} |C_i \cap C'_j| \frac{1}{2^{\max(i,j)-1}}}{\sqrt{\left( \sum_{i=1}^{\infty} |C_i| \frac{1}{2^{i-1}} \right) \left( \sum_{j=1}^{\infty} |C'_j| \frac{1}{2^{j-1}} \right)}} \tag{53} \]

II.4. $N$

\[ N(C,C') = \frac{\sqrt{2}\,|U_C \cap U_{C'}|}{\sqrt{|U_C|^2 + |U_{C'}|^2}} \]

\[ N(C,C') = \frac{\sqrt{2} \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} |C_i \cap C'_j| \frac{1}{2^{\max(i,j)-1}}}{\sqrt{\left( \sum_{i=1}^{\infty} |C_i| \frac{1}{2^{i-1}} \right)^2 + \left( \sum_{j=1}^{\infty} |C'_j| \frac{1}{2^{j-1}} \right)^2}} \tag{54} \]
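The strong measures of Eqs. (51)–(54) can be sketched in the same way. The code below again assumes, for illustration only, that an ordered set is a finite list of pairwise disjoint sets ranked from 1 and that class $i$ carries the weight $1/2^{i-1}$. It computes the union cardinality twice: once document by document (taking the larger of the two weights) and once with the four sums that make up the denominator $a$ of Eq. (51), so that the two can be compared. All names and data are hypothetical.

```python
import math


def weight(rank):
    """Weight 1 / 2^(rank-1) attached to a class of the given rank."""
    return 1 / 2 ** (rank - 1)


def membership(C):
    """Map each document to its weight in U_C (classes are assumed disjoint)."""
    return {doc: weight(i) for i, Ci in enumerate(C, start=1) for doc in Ci}


def strong_measures(C, Cp):
    """Fuzzy J, D, Cos and N for two ordered sets, following Eqs. (51)-(54)."""
    m, mp = membership(C), membership(Cp)
    inter = sum(min(m[d], mp[d]) for d in m.keys() & mp.keys())
    union = sum(max(m.get(d, 0.0), mp.get(d, 0.0)) for d in m.keys() | mp.keys())
    x, y = sum(m.values()), sum(mp.values())
    return {
        "J": inter / union,
        "D": 2 * inter / (x + y),
        "Cos": inter / math.sqrt(x * y),
        "N": math.sqrt(2) * inter / math.sqrt(x * x + y * y),
    }


def union_four_terms(C, Cp):
    """Denominator a of Eq. (51), written as the four sums of Eq. (47)."""
    all_C, all_Cp = set().union(*C), set().union(*Cp)
    total = 0.0
    for i, Ci in enumerate(C, start=1):
        for j, Cj in enumerate(Cp, start=1):
            # First two sums: weight 1/2^(i-1) if j >= i, else 1/2^(j-1).
            total += len(Ci & Cj) * weight(min(i, j))
        total += len(Ci - all_Cp) * weight(i)      # documents only in C
    for j, Cj in enumerate(Cp, start=1):
        total += len(Cj - all_C) * weight(j)       # documents only in C'
    return total


# Hypothetical ordered sets: rank 1 first, classes pairwise disjoint.
C = [{"d1"}, {"d2", "d3"}, {"d4"}]
Cp = [{"d2"}, {"d1"}]

sm = strong_measures(C, Cp)
print(sm)
print(union_four_terms(C, Cp))   # 2.75, the same union cardinality as used for J
```

For this example the intersection sum is 1 and both ways of computing the union give 2.75, so $J = 1/2.75$. Since the quadratic mean of two numbers is never smaller than their arithmetic mean, which is never smaller than their geometric mean, the values always satisfy $N \le D \le \mathrm{Cos}$.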


Fig. 2. A type of curve of similarity results.

We draw curves of similarity results of the type shown in Fig. 2. Query numbers are read on the X-axis and the corresponding similarity results are read on the Y-axis. The similarity is calculated according to the following measures.

5.2. Measures tested

5.2.1. Fuzzy approach and classical measures

We test the five strong OS measures considered in the fuzzy context and presented before in Eqs. (51)–(55). The measures derived from Jaccard, Dice, Cosine, $N$ and Overlap $O_2$ are denoted $J_F$, $D_F$, $\mathrm{Cos}_F$, $N_F$ and $O_{2F}$ respectively.

We also test the three weak OS measures considered in the fuzzy context and presented in Eqs. (48)–(50). Recall, Precision and Overlap $O_1$ are denoted $R_F$, $P_F$ and $O_{1F}$ respectively.

Each strong fuzzy OS measure is compared with two other measures:
• its classical counterpart: Jaccard, Dice, Cosine, $N$, Overlap $O_2$, denoted $J$, $D$, $\mathrm{Cos}$, $N$, $O_2$ respectively and presented before in Eqs. (1)–(4), (6);
• one strong "simple" OS measure: the one derived from Jaccard with an exponential weight (Egghe & Michel (2002)).

Each weak fuzzy OS measure is compared with:
• its classical counterpart: Recall, Precision and Overlap $O_1$, denoted $R$, $P$, $O_1$ respectively and presented before in Eqs. (5), (7) and (8);
• one strong "simple" OS measure: the one derived from Jaccard with an exponential weight.

Why did we choose only this "simple" OS measure in all the comparisons?

5.2.2. The choice of one strong "simple" OS measure

We have seen in Egghe and Michel (2002) how to construct 10 strong "simple" OS measures, derived from the five classical ones, by using an exponential or a linear weight. For example, the strong "simple" OS measure derived from Jaccard with an exponential weight is:


\[ J^P_S(C,C') = \sum_{i=1}^{m} \sum_{j=1}^{m'} J(C_i, C'_j)\, \varphi_P(i,j) \tag{56} \]

with

\[ \varphi_P : \mathbb{N} \times \mathbb{N} \to \mathbb{R}^+, \qquad \varphi_P(i,j) = \frac{3 \cdot 4^{m_0}}{(4^{m_0} - 1)\, 2^i\, 2^j\, 2^{|i-j|}} \tag{57} \]

where $m = |C|$, $m' = |C'|$ and $m_0 = \max(m, m')$.

Strong "simple" OS measures derived from Jaccard, the $N$ measure, Dice, Cosine and $O_2$ are called $J_0^P$, $N_0^P$, $D_0^P$, $\mathrm{Cos}_0^P$, $O_{2,0}^P$ respectively when they have an exponential weight. The weak "simple" OS measure derived from Recall with an exponential weight is called $R_0^P$. The corresponding measures with a linear weight are called $J_0^\ell$, $N_0^\ell$, $D_0^\ell$, $\mathrm{Cos}_0^\ell$, $O_{2,0}^\ell$ and $R_0^\ell$ respectively.

We briefly repeat some results of Egghe and Michel (2002). We proved that

\[ J_0^P = D_0^P \quad \text{and} \quad J_0^\ell = D_0^\ell \]

After experimentation we have seen that:

\[ J_0^P \approx \mathrm{Cos}_0^P \approx R_0^P \quad \text{and} \quad J_0^\ell \approx \mathrm{Cos}_0^\ell \approx R_0^\ell \]

Since $J \ne \mathrm{Cos} \ne R \ne D$, one of the conclusions of the experimentation was that the weight function suppresses the particularity of the classical indicators. So all exponential (resp. linear) "simple" OS measures are roughly similar to $J_0^P$ (resp. $J_0^\ell$). Because the exponential weight $\varphi_P$ (Eq. (57)) is of the same nature as the concrete membership function $\varphi$ chosen in this paper (Eq. (34)), we choose $J_0^P$. To simplify notation, we call it $J_0$ in the sequel. Similarly we write $\mathrm{Cos}_0$, $N_0$, $D_0$, $O_{2,0}$, $R_0$, $P_0$ and $O_{1,0}$ for the "simple" OS measures $\mathrm{Cos}_0^P$, $N_0^P$, $D_0^P$, $O_{2,0}^P$, $R_0^P$, $P_0^P$ and $O_{1,0}^P$. So we compare all strong and weak measures, considered in a classical and in a "fuzzy" way, with $J_0$.

For each of the three profiles and for each of the 600 queries, we made similarity computations as explained in Fig. 2. We draw curves and group them in several ways in order to make comparisons. Firstly, we group the measures by construction in order to reveal any general tendency; that is, for each profile, we compare in one figure all the classical strong measures (i.e. $N$, $D$, $J$, $O_2$, $\mathrm{Cos}$) and in another all the strong fuzzy OS measures (i.e. $N_F$, $D_F$, $J_F$, $O_{2F}$, $\mathrm{Cos}_F$). Secondly, we group the measures according to the native indicator considered (Jaccard's, the $N$ measure's, Dice's, ...) in order to see whether the type of construction has an influence.

5.3. Results

5.3.1. Comparison of strong measures by construction

Classical measures $N$, $D$, $J$, $O_2$, $\mathrm{Cos}$ are presented together in Fig. 3 for profile T1, in Fig. 5 for profile T2 and in Fig. 7 for profile T3. In order to make the curves readable, queries on the X-axis are ranked by increasing values of $J$. Similarly, measures $N_F$, $D_F$, $J_F$, $O_{2F}$, $\mathrm{Cos}_F$ are presented together in Fig. 4 for profile T1, in Fig. 6 for profile T2 and in Fig. 8 for profile T3. Queries on the X-axis are


ranked by increasing values of $J_F$. The ranking of the queries makes comparisons possible only between curves drawn on the same figure, not between curves coming from different figures.

Fig. 3. Fig. 4.

We can notice that the measures take specific values because of the different normalization functions $w_2$ (see Eq. (14)) but share the same shape for a given profile. Indeed, we can see a J-shaped curve for profile T1 (Figs. 3, 4), an inverse J-shaped curve for profile T2 (Figs. 5, 6) and an S-shaped curve for profile T3 (Figs. 7, 8).

Moreover, we can see for most values that the measures are ranked as $J \le O_2 \le N \le D \le \mathrm{Cos}$ in the classical case and $J_F \le O_{2F} \le N_F \le D_F \le \mathrm{Cos}_F$ in the fuzzy case. Let us remember that, in the strong case (Egghe & Michel (2002)), we had $J_0 = D_0 \approx \mathrm{Cos}_0$. So we can say that fuzzy OS measures are


more sensitive than the "simple" OS one because they respect the natural properties of the classical measures, i.e. the normalization function $w_2$.

Fig. 5. Fig. 6.

Let us now compare each strong "fuzzy" OS measure with its classical counterpart and with the strong "simple" OS measure $J_0$.

5.3.2. Comparison of strong measures grouped by native indicator

In the five following figures, curves of measures computed in the T1 profile case are grouped according to their native indicator. Curves of $J$, $J_F$ and $J_0$ are presented together in Fig. 9; curves of $N$, $N_F$ and $J_0$ (given the experimentation results of Egghe & Michel (2002), we suppose that $N_0 \approx J_0$) are presented in Fig. 10; curves of $D$, $D_F$ and $J_0$ (indeed $D_0 = J_0$) are presented in Fig. 11; curves of $O_2$, $O_{2F}$ and $J_0$ (as before, we suppose that $O_{2,0} \approx J_0$) are presented in Fig. 12; and curves of $\mathrm{Cos}$, $\mathrm{Cos}_F$ and $J_0$ (indeed $\mathrm{Cos}_0 \approx J_0$) are presented in Fig. 13. In all figures, queries


numbers are ranked by increasing $J_F$. So Figs. 3, 4, 9–13 have the same rank for queries and can be compared together.

Fig. 7. Fig. 8.

In the annexes, the same curves, computed for profiles T2 (Figs. 33–37) and T3 (Figs. 38–42), are presented.

The shape of the curve of Fig. 4 is naturally found in each figure in the "fuzzy" case (i.e. for $J_F$, $N_F$, $D_F$, $O_{2F}$ and $\mathrm{Cos}_F$). On the contrary, $J$, $N$, $D$, $O_2$, $\mathrm{Cos}$ and $J_0$ do not have a regular shape. Indeed, we see that the classical curves and the "simple" OS curves oscillate strongly around the fuzzy OS one. The same observations can be made by looking at the figures relating to profiles T2 (Figs. 33–37) and T3 (Figs. 38–42). So we can say that "fuzzy" OS, "simple" OS and classical strong measures are really different.

Let us now observe the fundamental characteristics, i.e. the characteristics noticed for all three profiles. We begin the comparisons with the three measures $J$, $J_F$ and $J_0$ computed for the profile T1,


Fig. 9. Fig. 10. Fig. 11.


Fig. 12. Fig. 13. Fig. 14.


T2 and T3. They are presented in Figs. 9, 14 and 15 respectively. In the sequel we will see whether the conclusions extend to measures other than the ones derived from $J$.

Fig. 15.

The $J_F$ curve is always very regular; indeed queries are ranked according to increasing values of $J_F$. The amplitude of the oscillations of the $J_0$ and $J$ curves differs considerably depending on the profile: they are most of the time very small in T1 and T2 and larger in T3. Nevertheless, we do not think that this is a characteristic of the profile but rather a characteristic depending on the distribution function of $J_F$. Indeed, let us define four zones:
• Zone 1: $J_F$ tends towards 0
• Zone 2: $J_F$ takes variable values from 0 to 1
• Zone 3: $J_F$ tends towards 1
• Zone 4: $J_F$ is equal to 1

Vertical lines in Fig. 15 make zone 2, zone 3 and zone 4 visible; there is no zone 1 in this case. We let the reader do the same in Fig. 9 (zones 1 and 2 are noticeable) and Fig. 14 (zones 2, 3 and 4 are noticeable).

In zones 1 and 3, the amplitude of the oscillations of $J_0$ and $J$ is really small, and no oscillation is observed in zone 4. $J_F$ values usually lie between $J_0$ and $J$. More precisely, in zone 1 we observe most of the time that the three types of values are ranked as $J > J_F > J_0$ and, on the contrary, in zone 3 they are ranked as $J < J_F < J_0$, the permutation taking place in zone 2 (see Fig. 15). This observation is really interesting if we take into account the fact that the number of common documents is the first similarity criterion in the classical case and the rank of presentation of common documents is the first one in the strong "simple" OS case. So this observation makes us believe that $J_F$ takes into account the criteria of rank and number of common documents in a more precise and representative way than do $J$ and $J_0$.


This observation is noticeable also in the figures presented in the annexes. So we can say in a general way that fuzzy OS measures are more precise and representative than classical and simple OS ones on the criteria of rank and number of common documents.

Let us now observe the curves of the weak measures: Recall, Precision and $O_1$.

5.3.3. General comparison of weak measures by construction

In the six following figures we can see curves of $R$, $P$ and $O_1$ computed for profile T1 in Fig. 16, for profile T2 in Fig. 17 and for profile T3 in Fig. 18. Curves of $R_F$, $P_F$ and $O_{1F}$ are presented for profile T1 in Fig. 19, for profile T2 in Fig. 20 and for profile T3 in Fig. 21.

We cannot notice any regularity linked with the profile or the measures. $P$ and $O_1$ oscillate wildly around $R$. $P_F$ and $O_{1F}$ do the same around $R_F$.

Previous studies have shown that recall and precision often vary inversely (see Fig. 22 or 23). This is not true in our case. Indeed we give (recall, precision) curves for each profile in the annexes (Figs. 43–48) and none of them is like Fig. 22 or 23.

Fig. 16. Fig. 17.


Fig. 18. Fig. 19. Fig. 20.


Fig. 21. Fig. 22. Fig. 23.

This phenomenon is explained by the context of the experimentation. Indeed, recall and precision are usually computed in a test-collection context. This is not the case in our experimentation


L. Egghe, C. Michel / Information Processing <strong>andstrong> Management 39 (2003) 771–807 797where precision is calculated by taking the answers without filtering (i.e. the answer given bypr<strong>ofstrong>ile T0) as set <strong>ofstrong> ‘‘retrieved documents’’ (corresponding to A in Eq. (8) <strong>andstrong> to C in Eq. (49))<strong>andstrong> recall is calculated by taking the answers with filtering as set <strong>ofstrong> ‘‘relevant documents’’(corresponding to B in Eq. (7) <strong>andstrong> to C 0 in Eq. (48)). The fact that C 0 <strong>andstrong> B are not real sets <strong>ofstrong>relevant documents in a semantic sense (regarding for example expertsÕ judgement) is a source <strong>ofstrong>biases <strong>andstrong> explain partially why we do not refind the shape <strong>ofstrong> curves shown in Fig. 22 or 23. Wehave no conclusion by comparing R with P <strong>andstrong> R F with P F .Moreover we can notice that P is very similar to O 1 <strong>andstrong> P F is very similar to O 1F . This particularitycannot be seen as a characteristic <strong>ofstrong> P <strong>andstrong> O 1 because <strong>ofstrong> the biases induced by data <strong>ofstrong>experimentation. Indeed, if we place ourselves in the fuzzy case, C corresponds to answers linkedwith the neutral pr<strong>ofstrong>ile T0 <strong>andstrong> C 0 to answers obtained by the filtering process <strong>ofstrong> pr<strong>ofstrong>iles T1, T2 orT3. So in most <strong>ofstrong> the cases we have jCj > jC 0 j, which means that jC 0 j¼minðjCj; jC 0 jÞ <strong>andstrong> soP F ¼ O 1F . Because <strong>ofstrong> the experimentation biases, we are not able to give any conclusions about<strong>weakstrong> fuzzy OS measures construction.Nevertheless, comparisons <strong>ofstrong> R with R F <strong>andstrong> R 0 ; P with P F <strong>andstrong> R 0 , <strong>andstrong> finally O 1 with O 1F <strong>andstrong>R 0 are possible by observing natural characteristics <strong>ofstrong> measures. 
(The choice R 0 in the P or O 1 caseis justified by conclusion <strong>ofstrong> Egghe & Michel (2002)). Indeed, we suppose that P 0 R 0 <strong>andstrong>O 10 R 0 .5.3.4. Comparison <strong>ofstrong> <strong>weakstrong> measures grouped by native indicatorCurves <strong>ofstrong> measures R <strong>andstrong> R F <strong>andstrong> R 0 are presented in Figs. 24–26 (for the pr<strong>ofstrong>iles T1, T2 <strong>andstrong>T3 respectively). Queries are presented by increasing R F s. Curves <strong>ofstrong> measures P <strong>andstrong> P F arepresented in Figs. 27–29 (for the pr<strong>ofstrong>iles T1, T2 <strong>andstrong> T3 respectively). Queries are presented byincreasing P F s. Curves <strong>ofstrong> measures O 1 <strong>andstrong> O 1g are presented in Figs. 30–32 (for the pr<strong>ofstrong>iles T1, T2<strong>andstrong> T3 respectively). Queries are presented by increasing O 1F s.In general, curves <strong>ofstrong> R, R 0 <strong>andstrong> R F have the same characteristics as the strong measures presentedbefore. Zones 1–4, noticed in Figs. 9, 14, 15 are present in Figs. 24–26.Fig. 24.
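The coincidence with $O_1$ noted above rests on a simple fact: whenever a ratio of the form $|A \cap B| / |\cdot|$ is normalized by the smaller of the two sets, it equals $O_1$ by definition. The sketch below illustrates this with classical (crisp) sets only. It does not reproduce the fuzzy definitions of Eqs. (48) and (49), and the function names and document identifiers are hypothetical, chosen purely for illustration.

```python
def overlap_o1(a: set, b: set) -> float:
    """Classical overlap measure O1: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))


def ratio_over(a: set, b: set, denominator: set) -> float:
    """|A ∩ B| normalized by the size of a chosen set (recall- or precision-style ratio)."""
    return len(a & b) / len(denominator)


# Hypothetical document identifiers (assumption, not data from the experiment).
# `unfiltered` plays the role of C (neutral profile T0),
# `filtered` the role of C' (profiles T1, T2 or T3), with |C| > |C'|.
unfiltered = {"d1", "d2", "d3", "d4", "d5", "d6"}
filtered = {"d2", "d4", "d5", "d9"}

o1 = overlap_o1(unfiltered, filtered)
by_smaller = ratio_over(unfiltered, filtered, filtered)    # normalized by the smaller set
by_larger = ratio_over(unfiltered, filtered, unfiltered)   # normalized by the larger set

print(o1, by_smaller, by_larger)
```

With these sets the ratio normalized by the smaller set is identical to $O_1$, while the ratio normalized by the larger set differs. This is why a systematic size imbalance between the two answer sets makes one of the weak measures indistinguishable from $O_1$, whatever the retrieval quality.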


[Figs. 25–30.]


[Figs. 31 and 32.]

On the contrary, the measures $P$ and $O_1$ do not have the characteristics of strong measures. Indeed, the oscillations of $P$ and $R'$ around $P_F$ are larger and more irregular than those of $R$ and $R'$ around $R_F$. The same observation holds for the measure $O_1$.

5.4. Conclusion

The experiment analyzes the characteristics of fuzzy OS measures by comparing them to classical measures and to "simple" OS ones. Comparisons between all the classical and all the fuzzy OS measures (i.e. comparisons by construction) are made to see general tendencies. We observe that, in the strong measures case, the general shape of the curves is conserved when the profile varies, so the fuzzy process is stable and representative of the profile or of the IR system used. More precisely, if we look at the values, we notice that the natural properties of classical strong measures, defined by the normalization function $W_2$, are lost in the "simple" OS case (conclusion of Egghe & Michel (2002)) but conserved in the fuzzy case. Indeed, $J \le O_2 \le D \le N \le \mathrm{Cos}$ in the classical case and $J_F \le O_{2F} \le D_F \le N_F \le \mathrm{Cos}_F$ in the fuzzy case, but $J' = D' \approx \mathrm{Cos}'$ in the "simple" OS case. So the strong "fuzzy" OS measures are more sensitive than the strong "simple" OS ones.

Comparisons of classical and weak fuzzy OS measures are not valid, for two different reasons. Firstly, in the analysis of recall and precision, we attempted to find a usual regularity but did not observe any for $R$ and $P$ or for $R_F$ and $P_F$. We suppose that this "anomaly" is a bias of the experiment. Indeed, recall and precision regularities are observed in test collection evaluations, which is not our context. Secondly, in the analysis of precision and $O_1$, we have shown that, for most values, $P_F = O_{1F}$ and $P = O_1$, which is not normal and is due to the experimental context. So, to draw pertinent conclusions on the general tendency of weak fuzzy OS measures, we need to work in a real test collection evaluation context.

In the second step, we group the measures according to the native indicator considered, in order to see whether the type of construction has an influence on the measure. These comparisons show that, for the five strong measures and the three weak ones, fuzzy OS values lie most of the time between "simple" OS values and classical values. Considering that "simple" OS measures depend principally on the rank of common documents and that classical measures depend only on the number of common documents, we can believe that strong (resp. weak) fuzzy OS measures are more precise and representative on the two criteria of rank and number of common documents than classical and "simple" OS strong (resp. weak) measures.

Finally, we can say that strong fuzzy OS measures seem to be better than classical and "simple" OS strong ones: they are stable, more precise and more representative. In the weak case, we must be more prudent. We are convinced that the preceding conclusion can be extended, but we cannot confirm this before carrying out complementary tests in a real test collection evaluation context.


[Fig. 48.]

Appendix A

See Figs. 33–48.

References

Boyce, B. R., Meadow, C. T., & Kraft, D. H. (1995). Measurement in information science. New York: Academic Press.
Buell, D. A., & Kraft, D. H. (1981a). Evaluation of fuzzy retrieval systems. In Proceedings of the American society for information science (pp. 298–300).
Buell, D. A., & Kraft, D. H. (1981b). Performance measurement in a fuzzy retrieval environment. ACM SIGIR Forum 16(1). In Proceedings of the fourth international conference on information storage and retrieval, Oakland, California, May 31–June 2, 1981 (pp. 56–61).
Egghe, L. (1994). A theory of continuous rates and applications to the theory of growth and obsolescence rates. Information Processing and Management, 30(2), 279–292.
Egghe, L., & Michel, C. (2002). Strong similarity measures for ordered sets of documents in information retrieval. Information Processing and Management, 38(6), 823–848.
Egghe, L., & Rousseau, R. (1990). Introduction to informetrics. Quantitative methods in library, documentation and information science. Amsterdam: Elsevier.
Fluhr, C. (1997). SPIRIT.W3: A distributed Cross Lingual Indexing and Search Engine. In: Proceedings of the INET97: The seventh annual conference of the internet society, Kuala Lumpur, Malaysia, June 24–27, 1997.
Grossman, D. A., & Frieder, O. (1998). Information retrieval. Algorithms and heuristics. Boston: Kluwer Academic Publishers.
Laine-Cruzel, S., Lafouge, T., Lardy, J. P., & Abdallah, N. (1996). Improving information retrieval by combining user profile and document segmentation. Information Processing and Management, 32(3), 305–315.
Losee, R. M. (1998). Text retrieval and filtering. Analytic models of performance. Boston: Kluwer Academic Publishers.
Michel, C. (2000). Ordered similarity measures taking into account the rank of documents. Information Processing and Management, 37(4), 603–622.
Salton, G., & McGill, M. J. (1987). Introduction to modern information retrieval. New York: McGraw-Hill.
Tague-Sutcliffe, J. (1995). Measuring information. An information services perspective. New York: Academic Press.
van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.
Zadeh, L. (1975). Fuzzy sets and their applications to cognitive and decision processes. New York: Academic Press.
