Multiresolution Motif Discovery in Time Series

Tenth SIAM International Conference on Data Mining
Columbus, Ohio, USA

Multiresolution Motif Discovery in Time Series

NUNO CASTRO
PAULO AZEVEDO

Department of Informatics
University of Minho
Portugal

April 30th, 2010


Roadmap

I. Motif definition
II. Motivation
III. Related work limitations
IV. Our algorithm
V. Experimental Analysis
VI. Future work
VII. Conclusion

Nuno Castro and Paulo Azevedo

04/30/2010


I – Motif Definition

• Motifs, also known as “recurrent patterns”, “frequent patterns”, “repeated subsequences”, or “typical shapes”, are previously unknown patterns in time series



II – Motivation

• Finding motifs is an important task:
  • Describe the time series at hand
  • Help summarize/represent the database
  • Provide useful insight to the domain expert
• Examples of motifs:
  • Patterns that typically precede a seizure in EEG
  • DNA subsequences preserved through evolution
  • Bursts in telecommunication traffic



III – Related work limitations

• Computational complexity
• Quadratic algorithms are clearly not the solution
• Disk inefficient (use expensive random disk accesses)
• Memory inefficient (assume data can fit into main memory)
• Assume all data are available



III – Related work limitations (cont.)

• Consider motifs at a single resolution
• Are not suited to interactivity
• Large number of unintuitive parameters to set:
  • Motif length
  • Range (distance threshold)
  • Number of columns in the subsequence matrix
• Limited to finding motifs in univariate time series



IV – Our algorithm

• We propose an algorithm:
  • Multiresolution Motif Discovery in Time Series: MrMotif
• Time efficient:
  • One single sequential disk scan
  • Clever representation technique (iSAX)
  • Use of constant access time structures
• Memory efficient:
  • Combine our approach with the Space-Saving algorithm
  • Adjustable amount of memory to use



IV – Our algorithm

Problem definition

• We follow a Top-K frequent pattern approach:
  • i.e. finding the Top-K motifs
• A time series can be counted as a repetition of another if they have the same symbolic representation
• We use the indexable Symbolic Aggregate approXimation (iSAX*)

* Shieh, J. and Keogh, E., iSAX: indexing and mining terabyte sized time series, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp. 623-631.



IV – Our algorithm – Problem definition

iSAX

• State-of-the-art time series representation technique
• Widely used in time series data mining
• Converts a time series to a sequence of symbols (word)
  • Given a resolution (alphabet size) and word size

Image generated by MATLAB and code provided by iSAX authors<br />
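
The conversion described on this slide can be sketched as follows. This is a minimal SAX-style discretisation written for illustration, not the code provided by the iSAX authors: the function name `sax_word`, the use of equiprobable Gaussian breakpoints, and the requirement that the series length be a multiple of the word size are all assumptions.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, word_size, resolution):
    """Return `word_size` integer symbols, each in [0, resolution)."""
    mu, sigma = mean(series), pstdev(series)
    # z-normalise; a constant series gets z = 0 everywhere
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Piecewise aggregate approximation: mean of each equal-length segment
    # (assumes len(series) is a multiple of word_size)
    seg = len(z) // word_size
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_size)]
    # Breakpoints that split a standard normal into `resolution` equiprobable bins
    cuts = [NormalDist().inv_cdf(i / resolution) for i in range(1, resolution)]
    # A symbol is the index of the bin its segment mean falls into
    return [bisect_right(cuts, v) for v in paa]
```

For instance, `sax_word([0, 0, 1, 1, 2, 2, 3, 3], word_size=4, resolution=4)` maps the four segment means to four different bins, one symbol per segment.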



IV – Our algorithm – Problem definition

iSAX (cont.)

• Ability to easily move between different resolutions

Resolution  Decimal word                  Binary word
2           {0, 1, 1, 1, 0, 0, 0, 0}      {0,1,1,1,0,0,0,0}
4           {1, 2, 3, 2, 1, 0, 1, 1}      {01,10,11,10,01,00,01,01}
8           {2, 5, 7, 5, 3, 0, 3, 3}      {010,101,111,101,011,000,011,011}
16          {5, 11, 15, 11, 6, 1, 6, 6}   {0101,1011,1111,1011,0110,0001,0110,0110}

[Figure with two panels, "Resolution = 4" and "Resolution = 16". Image generated by MATLAB and code provided by iSAX authors]
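
In the table on this slide, each lower-resolution binary word is the next higher-resolution word with the last bit of every symbol dropped. A short sketch of that step, using the table's own values (the function name `halve_resolution` is hypothetical):

```python
def halve_resolution(word):
    """Map a word at resolution 2g to resolution g by dropping each symbol's last bit."""
    return [symbol >> 1 for symbol in word]


w16 = [5, 11, 15, 11, 6, 1, 6, 6]   # resolution 16 row of the table
w8 = halve_resolution(w16)          # resolution 8
w4 = halve_resolution(w8)           # resolution 4
w2 = halve_resolution(w4)           # resolution 2
```

This only works in the coarsening direction: moving to a higher resolution needs the original series, because the dropped bits cannot be recovered from the word.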



IV – Our algorithm

Problem definition (cont.)

• Example of 3 time series that form a motif
• Our motif is the word: { 1, 1, 3, 8, 11, 12, 13, 13 }



IV – Our algorithm

MrMotif

• Perform one traversal of the time series database
• For each resolution
  • Convert each time series to an iSAX word
  • Maintain and update a counter of the current Top-K motifs, indexed by iSAX word
• e.g. resolution 2

Motif               Count
{2,5,7,5,3,0,3,3}   54
{4,7,0,0,0,1,5,5}   32
{0,0,0,4,5,2,0,0}   25
...
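
The counting loop on this slide can be sketched as below. This is an illustrative full-memory version under stated assumptions, not the authors' implementation: `to_word(series, resolution)` is a hypothetical callable standing in for the iSAX conversion, and the default resolutions are arbitrary.

```python
from collections import Counter


def mrmotif_counts(database, to_word, resolutions=(4, 8, 16)):
    """Scan `database` once and count words separately at every resolution."""
    counts = {g: Counter() for g in resolutions}
    for series in database:                      # single sequential pass
        for g in resolutions:
            word = tuple(to_word(series, g))     # tuples are hashable dict keys
            counts[g][word] += 1                 # constant-time counter update
    return counts


def top_k(counts, resolution, k):
    """Return the k most frequent words (the Top-K motifs) at one resolution."""
    return counts[resolution].most_common(k)
```

Two series count as repetitions of the same motif exactly when they produce the same word, so no pairwise distance is computed.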



IV – Our algorithm

Properties

• Multiresolution
• Interactivity
• Space-Saving



IV – Our algorithm – Properties

Multiresolution

• Our intuition is that at the larger resolutions, it is harder for two different time series to match
• Each interval narrows considerably each time we double the resolution
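
To see how the intervals narrow, the sketch below lists the symbol boundaries at a few resolutions. It assumes the equiprobable Gaussian breakpoints commonly used with SAX-style representations; the slide itself does not state how the intervals are chosen, and the function name `breakpoints` is hypothetical.

```python
from statistics import NormalDist


def breakpoints(resolution):
    """Boundaries that split a standard normal into `resolution` equiprobable intervals."""
    return [NormalDist().inv_cdf(i / resolution) for i in range(1, resolution)]


for g in (2, 4, 8):
    print(g, [round(b, 2) for b in breakpoints(g)])
```

Under this assumption every boundary at resolution g is also a boundary at resolution 2g, so doubling the resolution splits each existing interval in two rather than moving it.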



IV – Our algorithm – Properties

Multiresolution (cont.)

• At the largest resolutions, we are working closer to the level of raw data
• This assumption spares us from performing expensive distance calculations
• The multiresolution capability allows interactive visual tools to be developed



IV – Our algorithm – Properties

Interactivity

• Feed a tree-like structure with our motifs at different resolutions
• This allows the user to navigate the motif hierarchy
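
One way such a hierarchy could be built is sketched below. It is an assumption rather than the authors' structure: resolutions are taken to be powers of two, and a word's parent is taken to be the word obtained by dropping the last bit of each symbol. The name `build_hierarchy` is hypothetical.

```python
from collections import defaultdict


def build_hierarchy(motifs_by_resolution):
    """Group each motif under its parent word at half the resolution.

    `motifs_by_resolution` maps a resolution g to an iterable of words.
    Returns {(g, parent_word): [(2g, child_word), ...]}.
    """
    children = defaultdict(list)
    for g, words in motifs_by_resolution.items():
        if g // 2 not in motifs_by_resolution:
            continue                              # no coarser level to attach to
        for word in words:
            parent = tuple(symbol >> 1 for symbol in word)
            children[(g // 2, parent)].append((g, tuple(word)))
    return dict(children)
```

A visual tool could then start from the coarsest words and expand a node to show the finer-resolution motifs beneath it.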



IV – Our algorithm – Properties

Space-Saving (SS)

• Proposed* to efficiently compute frequent elements in data streams
• Monitor only m words
• For each new word e
  • If e is already monitored, increment its count
  • If not, replace the least frequent monitored element by e, and increment it
• Experimentally shown to yield very small errors, with known upper bounds on the over-estimation errors

Reference***<br />
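
The update rule on this slide can be sketched as follows. This is a simplified illustration, not the code used in MrMotif: it finds the least frequent element with a linear scan over the m counters, and the class and attribute names are hypothetical.

```python
class SpaceSaving:
    """Keep approximate counts for at most `m` monitored elements."""

    def __init__(self, m):
        self.m = m
        self.count = {}   # element -> estimated count
        self.error = {}   # element -> upper bound on the over-estimation

    def add(self, e):
        if e in self.count:
            self.count[e] += 1                    # already monitored
        elif len(self.count) < self.m:
            self.count[e] = 1                     # free slot: count is exact
            self.error[e] = 0
        else:
            # Evict the least frequent element (O(m) scan in this sketch)
            victim = min(self.count, key=self.count.get)
            floor = self.count.pop(victim)
            self.error.pop(victim)
            self.count[e] = floor + 1             # inherit the count, then increment
            self.error[e] = floor                 # e may be over-counted by up to `floor`

    def top(self, k):
        return sorted(self.count.items(), key=lambda kv: -kv[1])[:k]
```

Until m distinct words have been seen, the counts are exact; eviction and over-estimation only begin once the structure is full.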



IV – Our algorithm – Properties

Space-Saving (cont.)

• We start MrMotif with Space-Saving disabled, in order to make m large enough to further reduce errors
• Activate Space-Saving when the memory threshold is reached (e.g. 128MB guarantees m = 10000 elements) or memory is about to run out



V – Experimental Analysis

• Scalability experiments (synthetic data)
  • Execution time
  • Memory
• Experiments with noise
• Real applications



V – Experimental Analysis

Scalability Experiments

• Dataset:
  • Reproduced from Mueen et al., 2009*.
  • 10 different sets of random walk time series
  • Each set with 10000 up to 100000 series of length 1024
  • About 8GB of time series data
• We compare MrMotif to Random Projection (Chiu et al., 2003)
  • Due to its popularity
  • It is the basis of many current motif discovery approaches
• We also compare Space-Saving (SS) and Full Memory (FM) versions of MrMotif


**Ref


V – Experimental Analysis – Scalability Experiments

Execution time

• Algorithms are executed 10 times for each of the ten increasingly larger datasets
• Execution times for each dataset are averaged
• Top-10 motifs are recorded
• Maximum amount of memory set to 128MB



V – Experimental Analysis – Scalability Experiments

Execution time (results)

DB size   MrMotif (SS)   MrMotif (FM)   Random Projection
10000     16.43          13.91          53.54
20000     32.68          26.85          193.88
30000     49.60          40.34          404.41
40000     62.92          51.87          705.02
50000     79.26          66.13          1221.13
60000     98.15          78.44          1613.53
70000     114.35         89.33          2139.20
80000     127.27         106.40         2708.53
90000     149.40         116.08         3468.50
100000    158.76         133.11         4357.39
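
The table's figures can be turned into speedup and growth ratios with a few lines of arithmetic. The values below are copied from the table, reading decimal commas as decimal points; the helper names are hypothetical and the time unit is left as the slide gives it (unstated).

```python
# (database size, MrMotif SS, MrMotif FM, Random Projection)
rows = [
    (10000, 16.43, 13.91, 53.54),
    (20000, 32.68, 26.85, 193.88),
    (30000, 49.60, 40.34, 404.41),
    (40000, 62.92, 51.87, 705.02),
    (50000, 79.26, 66.13, 1221.13),
    (60000, 98.15, 78.44, 1613.53),
    (70000, 114.35, 89.33, 2139.20),
    (80000, 127.27, 106.40, 2708.53),
    (90000, 149.40, 116.08, 3468.50),
    (100000, 158.76, 133.11, 4357.39),
]


def speedup_over_rp(row):
    """How many times faster MrMotif (SS) is than Random Projection on one row."""
    _, ss, _, rp = row
    return rp / ss


def growth(column):
    """Time on the largest database divided by time on the smallest, for one column."""
    return rows[-1][column] / rows[0][column]


for row in rows:
    print(row[0], round(speedup_over_rp(row), 1))
print("SS growth:", round(growth(1), 1), "RP growth:", round(growth(3), 1))
```

For a tenfold increase in database size, the MrMotif (SS) time grows by a factor of about 9.7 while the Random Projection time grows by a factor of about 81, which is what roughly linear and roughly quadratic scaling would look like.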



V – Experimental Analysis – Scalability Experiments

Memory

• We compare memory usage of the FM and SS versions of MrMotif on the dataset of 100000 series
• Observe the impact of SS (memory limit set to 128MB)



V – Experimental Analysis

Experiments with noise

• We apply MrMotif to the dataset of 10000 series and record the Top-10 patterns for resolution 4
• MrMotif is executed on each variation of the series
• Precision/recall with respect to the original series are calculated
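
The slide does not say exactly how precision and recall are computed. One plausible reading, sketched here as an assumption, treats the Top-K words found on the original series as the reference set and the Top-K words found on a noisy variation as the retrieved set; the function name is hypothetical.

```python
def precision_recall(reference, found):
    """Set-based precision and recall of `found` motifs against `reference` motifs."""
    reference, found = set(reference), set(found)
    hits = len(reference & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

When both sets hold exactly K distinct words, as with two Top-10 lists, precision and recall coincide under this definition.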



V – Experimental Analysis

Experiments with noise (cont.)



V – Experimental Analysis

Real applications

• We have applied MrMotif to real data from:
  • Protein unfolding
  • Sensor networks monitoring
  • Telecommunication network operator



VI – Conclusions

• We have introduced MrMotif to find motifs in time series:
  • Fast
  • Space-efficient
  • Intuitive
  • Robust to noise
  • Easy to use
  • Straightforward
  • Reproducible



VII – Future work

• Motif evaluation and significance measures:
  • Motifs are typically evaluated in a subjective way by humans
  • Objective evaluation measures that rank motifs in terms of significance are necessary
• Motifs as building blocks:
  • As motifs can be used to describe the time series, they can be used as “building blocks” for other data mining tasks:
    • Classification
    • Abnormality detection
    • Forecasting



Thank you for your attention!

• Contact: castro@di.uminho.pt
• MrMotif Web site (executable, source code and datasets): www.di.uminho.pt/~castro/mrmotif



On similarity and multiresolution<br />



On similarity