
Jornada de Seguimiento de Proyectos en Tecnologías del Software


Jornadas de Seguimiento de Proyectos, 2007

Programa Nacional de Tecnologías Informáticas

jspTIN@CEDI2007: Jornada de Seguimiento de Proyectos de Tecnologías Informáticas

TIN2005-08832-C03

Rafael Morales-Bueno *
Coordinator and principal investigator (IP), C03-01

Ricard Gavaldà ***
IP, C03-03

Elvira Mayordomo **
IP, C03-02

Abstract

This project will develop, implement, and experimentally validate formal concepts and algorithms that can be applied to the analysis, modelling, and prediction or learning from very large data sets represented as sequences.

Some of the models that will be considered are:

1) The Data Stream model, as far as it can be applied to Data Mining and Business Intelligence tasks,

2) Models for lossless data compression and on-line prediction,

3) Event sequences and time series,

4) Proximity-based models (e.g., Nearest-Neighbours-like),

5) Models based on decision and regression trees, and

6) Grammatical or generative models.

In particular, we are interested in models that can learn drifting concepts and that can learn in the presence of noise.

For most of these models there will be a first phase of theoretical development, a second phase of implementation and experimental testing of the algorithms, and a third phase of application to real data.

Keywords: Data mining, knowledge discovery, computational learning, symbolic sequences, algorithms, complexity, Hausdorff dimension.

* Email: rmorales@uma.es

** Email: elvira@unizar.es

*** Email: gavalda@lsi.upc.edu


1 Project objectives

The objectives of this project continue those of the previous, linked projects FRESCO (2000-2002) and MOISES (2003-2005):

1. Design of algorithms to handle Data Streams that vary over time (Data Mining on Time Varying Data Streams), using extensions of sequential sampling methods. Theoretical analysis of the efficiency and accuracy of these algorithms, and evaluation on empirical and real data. We will give higher priority to the learning techniques in objectives 5 and 8.

2. Continuation of work on complex pattern handling in event sequences and relational dependencies. Development of new theoretical characterizations that help in the design of algorithms that work for patterns more complex than those feasible today and/or that have better computational efficiency.

3. Proposal, implementation and validation of new lossless compression methods that are less sensitive to small input variations than known algorithms (Lempel-Ziv and others).

4. Validation of the efficiency and performance of the four existing compression algorithms for biological sequences. Application to DNA sequence classification.

5. Design of new decision tree and regression learning algorithms, with the possibility of including evolutionary computation and with the following characteristics: adaptability to concept change, incrementality, and noise robustness.

6. Improvement of event prediction algorithms through generation of attributes computed from the set of differential equations that describe parts of the phenomenon, incrementality, adaptability to time pattern change, and noise robustness.

7. Definition, analysis and inclusion of techniques for selecting significant variables in time series using Fourier models, sliding window models, and spline adjustment.

8. Development of learning models based on sample grouping and classification by the nearest neighbour method; study of grouping methods, group representation, and distance functions.

9. Continuation of work on the identification of dimension, prediction, and compression. Improvement of techniques for computing dimension.

10. Study of different performance evaluation measures for sequence compression and prediction algorithms.

11. Study of grammatical models. Extension of present models based on proof nets (and in particular non-commutative logic nets) to the study of language syntax. Extension of the present theories on parsing of mildly context-sensitive grammars to more general cases.

12. Definition and study of Data Streams as a computation model. Investigation of the complexity of computing functions in this model, using methods from Complexity Theory and connections with data compression models and succinct representation of structures.

13. Study of models for constructing common-sense temporal and spatial reasoning from the sensory and motor experience of intelligent agents, for instance from collected event sequences. Connected with objective 5.

14. Application of the developed algorithms to real data on labor disability, gender violence, students, DNA, and data from private enterprises that act as “EPO” (promoting and observing organizations).

The resources to achieve these goals are people, computers and data sets.

Initially there were 23 people participating in this project (6 from Zaragoza and Valladolid, 7 from Barcelona and 10 from Málaga). A participant from Málaga (Manuel Baena) obtained an FPI grant from MEC starting in May 2007. Recently (June 2007) another participant from Málaga (José del Campo), who had completed his FPI grant from the previous coordinated project (2003-2005), obtained his Ph.D. degree, and we expect 2 or 3 more students to obtain theirs during the project. Two new people have been working in the Málaga subproject since March 2007: Amparo Ruiz and Francisco Cantalejo.

Since February 2006, Pilar Albert has been a member of the Zaragoza subproject, funded by a grant from the Aragón Government to work on topics related to the project. Since January 2007, Philippe Moser has been a member of the Zaragoza subproject under a contract in the “Juan de la Cierva” program.

We emphasize the participation in this project of Professor Jack Lutz, from Iowa State University, USA. He is known worldwide for his contributions to Computational Complexity and Information Theory, especially on topics related to Effective Hausdorff Dimension, a line of work he started in 2000. His collaboration is very important for attaining the results of this project. We have collaborated with him since 1991, funded by different sources, including the National Science Foundation of the USA government and the Spanish Ministry of Education.

Since January 2007, Albert Bifet has been a member of the Barcelona subproject under a direct contract. In April 2007, he obtained an FPI grant from the Generalitat de Catalunya (regional government) to work on topics related to the project.

Regarding computing resources, every subproject has sufficient equipment to process the data sets.

We are working with data sets from public repositories, automatically generated data sets, and real data sets: a survey of tourists in Andalusia and mental health data.

The chronogram proposed in the project was the following (each X marks a period in which the activity is scheduled):

Activities/Tasks    First year    Second year    Third year

ACTIVITY 1 X| | | | | |X| | | | X| | | | |X| | | | | X| | | | |X| | | |X|
WEB X|X|X|X|X|X| | | | |X| | | | | | | | | |X| | | | | | | | |
ST-T X|X|X|X|X|X|X|X|X| | | | | | | | | | | | | | | | | | | | |
PR-ST-T | | | | | | | | | | X|X|X|X|X|X|X|X|X| | | | | | | | | | |
AP-ST-T | | | | | | | | | | | | | | | | | | | | X|X|X|X|X|X|X|X|X|
VC-T X|X|X|X|X|X|X|X|X| | | | | | | | | | | | | | | | | | | | |
PR-VC-T | | | | | | | | | | X|X|X|X|X|X|X|X|X| | | | | | | | | | |
AP-VC-T | | | | | | | | | | | | | | | | | | | | X|X|X|X|X|X|X|X|X|
AD-T X|X|X|X|X|X|X|X|X| | | | | | | | | | | | | | | | | | | | |
PR-AD-T | | | | | | | | | | X|X|X|X|X|X|X|X|X| | | | | | | | | | |
AP-AD-T | | | | | | | | | | | | | | | | | | | | X|X|X|X|X|X|X|X|X|
PC-T X|X|X|X|X|X|X|X|X| | | | | | | | | | | | | | | | | | | | |
PR-PC-T | | | | | | | | | | X|X|X|X|X|X|X|X|X| | | | | | | | | | |
AP-PC-T | | | | | | | | | | | | | | | | | | | | X|X|X|X|X|X|X|X|X|
DPC X|X|X|X|X|X|X|X|X| X|X|X|X|X|X|X|X|X| X|X|X|X|X|X|X|X|X|
TCD X|X|X|X|X|X|X|X|X| X|X|X|X|X|X|X|X|X| X|X|X|X|X|X|X|X|X|
RCP X|X|X|X|X|X|X|X|X| X|X|X|X|X|X|X|X|X| X|X|X|X|X|X|X|X|X|


ACTIVITY 1: To coordinate and document (participants: Elvira Mayordomo-EM, Ricard Gavaldà-RG, Rafael Morales-RM). This will be done throughout the project. At the beginning of every activity block the coordinators mentioned above will hold a basic check to synchronize the different activities and to produce documentation stating the results achieved so far. The IP of each subproject is responsible.

ACTIVITY WEB: To develop the web site to be used by the project members and by the scientific community (participants: Rafael Morales-RM, Elvira Mayordomo-EM, Ricard Gavaldà-RG, José del Campo-JC, Raúl Fidalgo-RF, Manuel Baena-MB). The technician required for the subproject MOISES-MA will do this task and will later participate in the activities related to programming and applications.

- DEVELOPMENT OF FORMALISMS AND ALGORITHMS

ACTIVITY ST-T: To study new models for time series (participants: Llanos Mora-LM, Inmaculada Fortes-IF, Francisco Triguero-FT, José Luis Triviño-JLT, Marlon Núñez-MN). Definition, analysis and incorporation of methods for selecting meaningful variables, so that the model can learn from the meaningful information and ignore useless data. We propose to study Fourier methods, spline adjustment and sliding windows. We will use the developed models in the analysis of stationary and non-stationary time series.

ACTIVITY VC-T: To define new models based on the nearest neighbour classification method (participants: José Luis Triviño-JLT, Rafael Morales-RM, Gonzalo Ramos-GR, Francisco Triguero-FT). Definition of a new learning model based on grouping of samples and the nearest neighbour classification method. To specify the features of the new model. The grant holder (FPI) required for subproject MOISES-MA will participate in this activity.

ACTIVITY AD-T: To develop new algorithms to obtain decision trees (participants: Gonzalo Ramos-GR, José del Campo-JC, Manuel Baena-MB, Inmaculada Fortes-IF, Alejandra Cabaña-AC). We propose to include evolutionary computing concepts in voting algorithms, to build sets of trees, and to include the adaptive sampling method in order to detect concept changes and support incrementality. The company “AXPE Consulting” has shown its interest in this activity (see attached letter).

ACTIVITY PC-T: To study models for learning patterns that change over time (participants: Marlon Núñez-MN, Raúl Fidalgo-RF, Llanos Mora-LM, José del Campo-JC, Alejandra Cabaña-AC, FPI required for MOISES-MA). To define new learning methods to build incremental regression and decision trees without parameter adjustment. To extend the BPL method by including incrementality and adaptability to changing patterns. The company “AXPE Consulting” has shown its interest in this activity (see attached letter).

ACTIVITY DPC: To study the identification between dimension, prediction and compression in polynomial time (participants: Elvira Mayordomo-EM, María López-ML, Argimiro Arratia-AA). To define a compressor that “doesn’t work from zero”. To adjust the parameters that guarantee the equivalence between compressors and “on-line” predictors. To study the case of compressors that parse the input. To study other parameters used in computational learning, such as the Vapnik-Chervonenkis (VC) dimension.

ACTIVITY TCD: To improve methods for computing dimension (participants: Jack Lutz-JL, Elvira Mayordomo-EM, María López-ML, Alejandra Cabaña-AC). To build an effective version of Jarník’s theorem on Diophantine approximation, and its applications. To apply the equivalence between compression and dimension to compute the dimension of the Takagi function. To study the influence of changes of alphabet and base on effective dimension and compression.

ACTIVITY RCP: To study several performance measures for compression and sequence prediction algorithms (participants: Jack Lutz-JL, Elvira Mayordomo-EM, María López-ML). To study non-linear compression measures (that is, measures that give a finer distinction in the reachable dimension). To study non-logarithmic loss functions for prediction. To study scaled dimension.

ACTIVITY LZ: To study the difficulties that arise in the Lempel-Ziv algorithms, in particular the one-bit catastrophe (participants: Elvira Mayordomo-EM, Jack Lutz-JL, María López-ML, Alejandra Cabaña-AC). To study the effect of small variations in the input on compression. To study the problem statistically. To study the same problem for other variants of the algorithm. To look for a formal and satisfactory explanation. The grant holder (FPI) required for the subproject MOISES-ZVA will participate in this activity.

ACTIVITY FD: To design algorithms for mining Data Streams that change over time (participants: Ricard Gavaldà-RG, Computer Science engineer and FPI required for MOISES-BAR). To review known techniques. To select concrete algorithms (including at least naive Bayes, the nearest neighbour classification method and decision trees). Theoretical design of the algorithms and analysis of their performance. José del Campo-JC will collaborate in this activity.

ACTIVITY SE: To design algorithms for mining temporal patterns in sequences of events, and to study complex dependencies in the relational model (participants: Jaume Baixeries-JG, G. Casas-GC and Computer Science engineer required for MOISES-BAR). This activity will use tools from Logic, Database theory and Formal Concept Analysis.

ACTIVITY MG: To study grammatical models and applications for the syntax of symbolic sequences (participants: Glyn Morrill-GM, Mario Fadda-MF, María Luisa González-MLG). To study the applications of proof nets in generalized non-commutative logic. To study mildly context-sensitive grammars.

ACTIVITY RT: To define methods to build common-sense concepts of temporal and spatial reasoning based on the sensory and motor experience of intelligent agents (participants: Josefina Sierra-JS). Raúl Fidalgo-RF collaborates in this activity.

ACTIVITY MC-FD: To study the Data Stream as a computing model (participants: Ricard Gavaldà-RG, Elvira Mayordomo-EM, Antoni Lozano-AL, Argimiro Arratia-AA, Mª Luisa González-MLG). To research the complexity of computing functions in this model, using tools from Complexity Theory, and to study relations with the model of data compression and succinct representation of structures.

- PROGRAMMING AND EMPIRICAL EVALUATION OF ALGORITHMS

For each of the activities ST-T, VC-T, AD-T, PC-T, FD, SE, MG and RT, there will be an activity with the same name prefixed with “PR-”. These new activities are devoted to programming the algorithms and evaluating them.

ACTIVITY EVAL-DNA: To validate the performance of the four known algorithms for compression of biological sequences. Application to algorithms for classifying DNA sequences (participants: Elvira Mayordomo-EM, Computer Science engineer required in MOISES-ZVA). The grant holder (FPI) required for the subproject MOISES-ZVA will participate in this activity, and Manuel Baena-MB will also collaborate. The company “DECODE genetics” has shown its interest in this activity (see attached letter).

ACTIVITY MLZ: To define, program and validate new methods based on the basic Lempel-Ziv algorithm (participants: Elvira Mayordomo-EM, Computer Science engineer required in MOISES-ZVA). The grant holder (FPI) required for the subproject MOISES-ZVA will participate in this activity. Two companies related to “Software development” have shown their interest in this activity.

ACTIVITY REC: To collect data sets (participants: all members of the project). To collect large real data sets from companies (public and private).

- APPLICATIONS

For each activity that develops new programs we propose an activity with the same name prefixed with “AP-”, devoted to applying these programs to real data.

- COORDINATION

There will be two meetings every year to coordinate the activities: one in person and the other by videoconference. The objectives are twofold:

1) To exchange results, implementations and data

2) To adjust, synchronize and refine the subsequent tasks.

Before collecting the data sets, we will hold a session to unify the representation of data (XML, gzip, etc.), so that the data sets can be shared among all members of the project.

2 Level of success achieved in the project

Overall, about 60% of the objectives can be considered complete. More precisely, objectives 1-7 and 9-12 are at a good level of development (more than 50%) and objectives 8, 13 and 14 are at a low level (however, 50% of the project time remains). Moderate difficulties have arisen when the experimental results did not match the theoretical quality of the methods; in these cases we have reconsidered our estimations and approximations and modified some aspects of the algorithms. The relevance of the results can be verified by the impact factor of the journals or by the acceptance rate of the international conferences (for example, the most recent IDA conference had an acceptance rate of 34%).

Regarding objective 14 in particular, we are collecting data on labor disability, gender violence, mental health and a tourist survey. From September until project completion we will apply the developed algorithms to these data sets and analyze the results with experts in each scientific/economic field.

3 Results indicators

3.1 ZARAGOZA Subproject (with Valladolid and Iowa)

Objective 3.

María López [7] has studied the sensitivity of known algorithms to finite variations in the input, showing that such sensitivity can appear only in some cases. As future work, we aim to obtain a family of gamblers (compression algorithms) without this sensitivity.

Pilar Albert, Elvira Mayordomo and Philippe Moser [1] have recently studied compression algorithms implemented with pushdown automata, which are widely used to compress XML files. A comparison of their compression ratio with that of the Lempel-Ziv family of algorithms is in progress.
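The sensitivity to small input variations discussed above can be explored empirically with a toy parser. The following is a minimal sketch, assuming LZ78-style incremental parsing as one representative of the Lempel-Ziv family (the report does not say which variant was analysed). The function names are hypothetical, and the number of phrases is used only as a rough proxy for compressed size.

```python
def lz78_phrases(text):
    """Split text into LZ78-style phrases.

    Each phrase is the shortest prefix of the remaining input that has
    not already been produced as a phrase.
    """
    seen = set()
    phrases = []
    current = ""
    for symbol in text:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Trailing phrase: it may repeat an earlier one because the input ended.
        phrases.append(current)
    return phrases


def phrase_count_change(text, prefix_bit="1"):
    """Return the phrase counts for text and for text with one extra leading symbol."""
    return len(lz78_phrases(text)), len(lz78_phrases(prefix_bit + text))
```

For example, `lz78_phrases("aababc")` yields `["a", "ab", "abc"]`. Calling `phrase_count_change` on longer inputs gives a quick way to see how much a single prepended symbol changes the parse for a given sequence.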

Objective 4.

Pablo Urcola [11] has completed his Computer Engineering final-year project (PFC), supervised by Elvira Mayordomo. This work experimentally compares existing compression algorithms for biological sequences and proposes a new one. It also studies how the choice of compression algorithm influences the construction of phylogenetic trees, and reports experiments with mitochondrial DNA.
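The report does not state how compression is turned into a similarity measure for tree building. One widely used compression-based measure is the normalized compression distance (NCD), sketched below as an illustration only. It assumes `zlib` as a stand-in compressor (not one of the specialised biological-sequence compressors mentioned above), and the function names are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the sequences share most of their structure;
    values near 1 suggest they share little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

A matrix of pairwise `ncd` values over a set of sequences could then be passed to any standard distance-based tree-building method; a better compressor for the data generally gives a more informative distance.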

Objective 9.

Jack Lutz and Elvira Mayordomo have started [8] a new research line on computing fractal dimension in situations that have not previously been considered.

Gu, Lutz and Mayordomo [3] have studied the points that lie on computable curves of finite length (the points that can be identified by a nanorobot), using dimension and Computational Geometry techniques. Lutz and Weihrauch [9] have studied connectivity properties of sets of points with a specific dimension.

Lathrop, Lutz and Summers [5] have studied relations between fractal dimension and self-assembling structures (widely used in nanotechnology).

Objective 10.

In contrast to the best-case algorithmic compression rate, the worst-case compression rate is studied in [2], together with its relation to on-line prediction algorithms, fractal dimension and "gambling".

A wide variety of performance measures for compression and prediction (other than log-loss) have been studied in [4] and [6]. The relation between compression and prediction has been completely characterized for constant-memory algorithms; the case of polynomial-time algorithms is still open.

The state of the art is summarized by Elvira Mayordomo in the book chapter [10], which includes a generalisation that allows the study of virtually every performance measure (in gambling, compression and prediction).
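The link between prediction and compression mentioned above can be made concrete with the log-loss. An on-line predictor that assigns probability p to the symbol that actually occurs incurs a loss of -log2(p) bits, and the cumulative loss approximates the code length an ideal arithmetic coder driven by the same predictor would need. The sketch below uses the add-one (Laplace) estimator for binary sequences purely as an example predictor; it is not one of the predictors studied in [2], [4] or [6], and the function name is hypothetical.

```python
import math


def laplace_log_loss(bits):
    """Cumulative log-loss, in bits, of the add-one (Laplace) predictor.

    bits is an iterable of 0/1 integers, consumed in one pass.
    """
    counts = [0, 0]
    loss = 0.0
    for b in bits:
        # Probability the predictor assigned to the symbol that occurred.
        p = (counts[b] + 1) / (counts[0] + counts[1] + 2)
        loss += -math.log2(p)
        counts[b] += 1
    return loss
```

On the sequence `[0, 0, 0, 0]` the assigned probabilities are 1/2, 2/3, 3/4 and 4/5, whose product is 1/5, so the cumulative loss is log2(5), about 2.32 bits for 4 input bits. Replacing -log2(p) with another loss function gives the kind of alternative performance measure the objective refers to.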

3.2 BARCELONA Subproject

Objective 1.

Albert Bifet and Ricard Gavaldà have developed the algorithm ADWIN (Adaptive Windowing), which encapsulates the statistics and data structures needed to detect drift in the data stream and to maintain an up-to-date data sample. This method yields variants of known learning algorithms that work with data streams, such as Naïve Bayes and k-means [14,15] and decision trees [12].

Another algorithm to detect concept drift has been developed as joint work between the Barcelona and Málaga teams [13].

In both cases, the methods have been evaluated from a theoretical point of view and using artificial and/or natural data sets.

This work continues in two directions: first, evaluating the methods on real data in order to refine them and obtain a practical perspective; second, extending their use to new learning algorithms.
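The adaptive windowing idea described above can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it stores every value and tests every split point, whereas the published algorithm relies on a compressed bucket structure. It assumes input values in [0, 1] and uses a Hoeffding-style threshold; the class name and the parameters `delta` and `max_size` are hypothetical.

```python
import math
from collections import deque


class SimpleAdaptiveWindow:
    """Keeps recent values and drops the older part when the mean changes."""

    def __init__(self, delta=0.01, max_size=1000):
        self.delta = delta                      # confidence parameter (assumed)
        self.window = deque(maxlen=max_size)    # oldest value on the left

    def _cut_threshold(self, n0, n1):
        # Hoeffding-style bound for two sub-windows of sizes n0 and n1.
        m = 1.0 / (1.0 / n0 + 1.0 / n1)
        delta_prime = self.delta / len(self.window)
        return math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))

    def add(self, value):
        """Add a value in [0, 1]; return True if drift was detected."""
        self.window.append(value)
        drift = False
        shrunk = True
        while shrunk and len(self.window) >= 2:
            shrunk = False
            values = list(self.window)
            total = sum(values)
            head_sum = 0.0
            for i in range(1, len(values)):
                head_sum += values[i - 1]
                n0, n1 = i, len(values) - i
                mean_old = head_sum / n0
                mean_new = (total - head_sum) / n1
                if abs(mean_old - mean_new) > self._cut_threshold(n0, n1):
                    # The two parts differ too much: forget the older part.
                    for _ in range(i):
                        self.window.popleft()
                    drift = shrunk = True
                    break
        return drift

    def mean(self):
        """Mean of the values currently kept in the window."""
        return sum(self.window) / len(self.window) if self.window else 0.0
```

Feeding the window a run of 0.0 values followed by a run of 1.0 values makes `add` return `True` shortly after the change, after which `mean()` reflects only the recent values. A learner such as Naïve Bayes could read its statistics from windows of this kind instead of from fixed counters.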

Objective 2.

An in-depth study of pattern extraction from trees (ordered and unordered) has been carried out by Balcázar, Bifet and Lozano [20,19,18,17,16]. Previously, only very simple cases had been solved. The theoretical definition of the problem has been completed and the implementation of the algorithms is in progress.

Objective 11.

The "survey" paper [22] presents a perspective on the state of the art and open problems in this area. [23,21] present other results in this area of the project.

Objective 12.

This work is in an initial phase. Members of the Zaragoza-Valladolid and Barcelona groups are studying the relevant bibliography in order to select relevant topics. As a first step, they are working to classify the computational complexity of regular languages in the data stream model, and the first results are currently being verified.
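As background for the data stream model mentioned above, a well-known baseline is that a regular language can be recognized in a single pass using memory that does not grow with the stream, by simulating a deterministic finite automaton. The sketch below illustrates only this baseline, not the project's results; the function name and the example automaton are hypothetical.

```python
def accepts_stream(symbols, transitions, start, accepting):
    """Run a DFA over an iterable of symbols in one pass.

    Only the current state is stored, so memory use does not depend
    on the length of the stream.
    """
    state = start
    for symbol in symbols:
        state = transitions[(state, symbol)]
    return state in accepting


# Example automaton: binary strings containing an even number of 1s.
EVEN_ONES = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}
```

For instance, `accepts_stream(iter("1010"), EVEN_ONES, "even", {"even"})` returns `True`, while the stream `"111"` is rejected.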

3.3 MÁLAGA Subproject

Objectives 5, 6 and 7.

We are working to improve the CIDIM algorithm by means of ensembles of classifiers [24] and by using correction filters [29]. In addition, the initial version of an algorithm that does not store examples in memory was presented in [25] and improved in [26] and [27]; the latter publication is a collaboration with the University of Porto. The method to detect concept drift [26] arises from a collaboration between Barcelona and Málaga.

Regarding data stream mining, the OnlineTree method has been made incremental, reducing the processing time needed per example while obtaining the same accuracy. These improvements have produced a new version, OnlineTree 2, and we have submitted a paper to JMLR [31]. A program to predict solar flares has been developed and evaluated with real ESA data sets from 2002-2005. The results have been published in [32], [33] and [34].

Two final-year projects (PFC) have been developed with real data sets: [35] is the project of Miguel Ángel Zafra Repiso, advised by Manuel Baena García and Rafael Morales Bueno, and [36] is the project of Carlos Rafael Morales Becerra, advised by José del Campo Ávila and Gonzalo Ramos Jiménez.

In parallel with this activity, three doctoral students of the Málaga group have made research visits to external universities:

- University of Porto (Portugal): José del Campo (3 months in 2006)

- University of Aveiro (Portugal): Manuel Baena (3 months in 2007)

- University of Ljubljana (Slovenia): Raúl Fidalgo is now there (3 months in 2007)

José del Campo has obtained his Ph.D. degree with European Mention [29]. Manuel Baena and Raúl Fidalgo are working toward the same goal, with a target date in 2008 (before the completion of this project).

We have applied for a complementary international action of MEC to continue the research with members of the University of Ljubljana.

4 References

[1] P. Albert, E. Mayordomo, and P. Moser, Bounded pushdown dimension vs Lempel Ziv information density. ECCC: Electronic Colloquium on Computational Complexity, ISSN 1433-8092, TR07-051 (2007)

[2] K. B. Athreya, J. M. Hitchcock, J. H. Lutz, and E. Mayordomo, Effective strong dimension in algorithmic information and computational complexity, SIAM Journal on Computing, 37, 671-705 (2007)

[3] X. Gu, J. H. Lutz, and E. Mayordomo, Points on computable curves. Proceedings of the Forty-Seventh Annual IEEE Symposium on Foundations of Computer Science (Berkeley, CA, October 22-24, 2006), IEEE Computer Society Press, pp. 469-474 (2006)

[4] J. M. Hitchcock, M. López-Valdés, and E. Mayordomo, Scaled Dimension and the Kolmogorov Complexity of Turing-Hard Sets, Theory of Computing Systems, to appear (2007)

[5] J. I. Lathrop, J. H. Lutz, and S. M. Summers, Strict self-assembly of discrete Sierpinski triangles, Computation and Logic in the Real World: Proceedings of the Third Conference on Computability in Europe (Siena, Italy, June 18-23, 2007), Springer-Verlag Lecture Notes in Computer Science 4497, pp. 455-464 (2007)

[6] M. López-Valdés, Scaled dimension of individual strings. In Logical Approaches to Computational Barriers, Local Proceedings of the Second Conference on Computability in Europe (CiE 2006). Report # CSR 7-2006, University of Wales Swansea, pp. 206-215 (2006)


[7] M. López-Valdés, Lempel-Ziv Dimension for Lempel-Ziv compression. Proceedings of the Thirtieth International Symposium on Mathematical Foundations of Computer Science (Bratislava, Slovakia, August 28 - September 1, 2006), Springer-Verlag, pp. 471-479 (2006)

[8] J. H. Lutz and E. Mayordomo, Dimensions of points in self-similar fractals, submitted (2007)

[9] J. H. Lutz and K. Weihrauch, Connectivity properties of dimension level sets, Proceedings of the Fourth International Conference on Computability and Complexity in Analysis (Siena, Italy, June 16-18, 2007)

[10] E. Mayordomo, Effective fractal dimension in algorithmic information theory. In "New Computational Paradigms: Changing Conceptions of What is Computable", Springer-Verlag, to appear (2007)

[11] P. Urcola, Algoritmos de compresión para secuencias biológicas y su aplicación en árboles filogénicos construidos a partir de ADN mitocondrial, final-year project (proyecto fin de carrera) supervised by Elvira Mayordomo, Universidad de Zaragoza, 2006.

[12] A. Bifet and R. Gavaldà: Learning decision trees adaptively from data streams with time drift. Conference submission, June 2007.

[13] Manuel Baena-García, José del Campo-Ávila, Raúl Fidalgo, Albert Bifet, Ricard Gavaldà and Rafael Morales-Bueno: "Early Drift Detection Method". ECML-PKDD Workshop on Knowledge Discovery from Data Streams 2006.

[14] Albert Bifet and Ricard Gavaldà: "Kalman Filters and Adaptive Windows for Learning in Data Streams". Proc. 9th International Conference on Discovery Science (DS 2006). Springer-Verlag Lecture Notes in Artificial Intelligence 4265, 29-40.

[15] Albert Bifet and Ricard Gavaldà: "Learning from Time-Changing Data with Adaptive Windowing". In 2007 SIAM International Conference on Data Mining (SDM'07), Minneapolis, Minnesota.

[16] José Luis Balcázar, Albert Bifet and Antoni Lozano: "Closed and maximal tree mining using natural representations". Submitted.

[17] José Luis Balcázar, Albert Bifet and Antoni Lozano: "Mining Frequent Closed Rooted Trees". Submitted.

[18] José Luis Balcázar, Albert Bifet and Antoni Lozano: "Subtree Testing and Closed Tree Mining Through Natural Representations". "Advances in Conceptual Knowledge Engineering" 2007, Regensburg, Germany.

[19] José Luis Balcázar, Albert Bifet and Antoni Lozano: "Mining Frequent Closed Unordered Trees Through Natural Representations". In 2007 International Conference on Conceptual Structures, Sheffield, UK.


[20] José Luis Balcázar, Albert Bifet and Antoni Lozano: "Intersection Algorithms and a Closure Operator on Unordered Trees". Workshop Mining and Learning with Graphs MLG 2006.

[21] G. Morrill, M. Fadda and O. Valentín: "Nondeterministic Discontinuous Lambek Calculus", in Proceedings of the Seventh International Workshop on Computational Semantics, IWCS7, Tilburg, 2007.

[22] G. Morrill: "Categorial Grammars: Deductive Approaches", Keith Brown (ed.) Encyclopedia of Language and Linguistics, 2nd Edition, Elsevier, Oxford, Volume 2, 242-248.

[23] Glyn Morrill and Mario Fadda: "Proof Nets for Basic Discontinuous Lambek Calculus", accepted for Logic and Computation, special issue proceedings of Lambda Calculus, Type Theory and Natural Language, LCTTNL05, King's College, London, September 2005.

[24] G. Ramos, J. del Campo-Ávila, R. Morales: "FE-CIDIM: Fast Ensemble of CIDIM Classifiers". International Journal of Systems Science, 37, 13 (2006) 939-947.

[25] G. Ramos, J. del Campo-Ávila, R. Morales: "Incremental Algorithm Driven by Error Margins". Lecture Notes in Artificial Intelligence, 4265, (2006) 358-362.

[26] M. Baena, José del Campo-Ávila, Albert Bifet, R. Fidalgo, Ricard Gavaldà, R. Morales: "Early Drift Detection Method". Fourth International Workshop on Knowledge Discovery from Data Streams, Berlin (Germany) (2006) 77-86.

[27] J. del Campo-Ávila, G. Ramos, R. Morales: "Improving prediction accuracy of an incremental algorithm driven by error margins". Fourth Workshop on Knowledge Discovery from Data Streams, Berlin (Germany) (2006) 57-66.

[28] J. del Campo-Ávila, G. Ramos-Jiménez, J. Gama and R. Morales-Bueno: "Improving the performance of an incremental algorithm driven by error margins". Intelligent Data Analysis. 2007. To appear.

[29] J. del Campo-Ávila, G. Ramos-Jiménez and R. Morales-Bueno: "Incremental learning with multiple classifier systems using correction filters for classification". Lecture Notes in Computer Science, Intelligent Data Analysis Conference 2007, to appear.

[30] J. del Campo-Ávila: "Nuevos Enfoques en Aprendizaje Incremental". PhD thesis, supervised by R. Morales-Bueno and G. Ramos-Jiménez.

[31] Marlon Núñez, Raúl Fidalgo, Rafael Morales: "Learning in Environments with Unknown Dynamics: Towards more Robust Concept Learners". JMLR (submitted).

[32] M. Núñez and R. Morales, On forecasting the onset of SEP events, Proceedings of the International Astronomical Union - Symposium 233 (CU Press, 2006).

[33] M. Núñez and R. Morales, Early Warning of Solar Proton Events, in: Proceedings of the Third European Space Weather Week, European Space Agency Press, Brussels (2006).


[34] M. Núñez and R. Morales, Extreme Value Dependence in Problems with a Changing Causation Structure, Lecture Notes in Computer Science (ADMA 2006), 4093, Springer, 2006.

[35] Miguel Ángel Zafra Repiso: "Herramienta para clasificación de correo electrónico basada en Redes Bayesianas". Final-year project (PFC), supervised by M. Baena-García and R. Morales-Bueno. January 2007.

[36] Carlos Rafael Morales Becerra: "Minería de datos aplicada a la salud mental". Final-year project (PFC), supervised by J. del Campo Ávila and G. Ramos Jiménez. July 2007.
