Multiresolution Motif Discovery in Time Series

Tenth SIAM International Conference on Data Mining
Columbus, Ohio, USA

Multiresolution Motif Discovery in Time Series

NUNO CASTRO
PAULO AZEVEDO

Department of Informatics
University of Minho
Portugal

April 30th, 2010


Roadmap

I. Motif definition
II. Motivation
III. Related work limitations
IV. Our algorithm
V. Experimental Analysis
VI. Future work
VII. Conclusion

Nuno Castro and Paulo Azevedo

04/30/2010


I – Motif Definition

• Motifs, also known as “recurrent patterns”, “frequent patterns”, “repeated subsequences”, or “typical shapes”, are previously unknown patterns in time series



II – Motivation

• Finding motifs is an important task:
  • Describe the time series at hand
  • Help summarize/represent the database
  • Provide useful insight to the domain expert
• Examples of motifs:
  • Patterns that typically precede a seizure in EEG
  • DNA subsequence preserved through evolution
  • Bursts in telecommunication traffic



III – Related work limitations

• Computational complexity
  • Quadratic algorithms are clearly not the solution
• Disk inefficient (use expensive random disk accesses)
• Memory inefficient (assume data can fit into main memory)
• Assume all data are available



III – Related work limitations (cont.)

• Consider motifs at a single resolution
• Are not suited to interactivity
• Large number of unintuitive parameters to set:
  • Motif length
  • Range (distance threshold)
  • Number of columns in the subsequence matrix
• Limited to finding motifs in univariate time series



IV – Our algorithm

• We propose an algorithm:
  • Multiresolution Motif Discovery in Time Series: MrMotif
• Time efficient:
  • One single sequential disk scan
  • Clever representation technique (iSAX)
  • Use of constant access time structures
• Memory efficient:
  • Combine our approach with the Space-Saving algorithm
  • Adjustable amount of memory to use



IV – Our algorithm

Problem definition

• We follow a Top-K frequent pattern approach:
  • i.e. finding the Top-K motifs
• A time series can be counted as a repetition of another if they have the same symbolic representation
• We use the indexable Symbolic Aggregate approXimation (iSAX*)

* Shieh, J. and Keogh, E., iSAX: indexing and mining terabyte sized time series, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp. 623-631.



IV – Our algorithm – Problem definition

iSAX

• State-of-the-art time series representation technique
• Widely used in time series data mining
• Converts a time series to a sequence of symbols (word)
  • Given a resolution (alphabet size) and word size

[Figure: image generated by MATLAB and code provided by iSAX authors]
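The conversion from a series to a word can be illustrated with a minimal sketch in the style of SAX, assuming z-normalization, a piecewise mean per segment, and breakpoints that split a standard normal distribution into equally likely regions. The function name `sax_word`, the symbol ordering (0 for the lowest region), and the requirement that the series length be divisible by the word size are assumptions made for this illustration, not details taken from the slides.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax_word(series, word_size, resolution):
    """Convert a numeric series to a SAX-style word of integer symbols."""
    # z-normalize so that the Gaussian breakpoints below are meaningful
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0] * len(series)
    else:
        normalized = [(x - mu) / sigma for x in series]

    # Piecewise aggregate approximation: one mean per equal-length segment
    # (assumes len(series) is divisible by word_size)
    seg = len(normalized) // word_size
    paa = [mean(normalized[i * seg:(i + 1) * seg]) for i in range(word_size)]

    # Breakpoints cut the standard normal into `resolution` equiprobable regions
    gaussian = NormalDist()
    breakpoints = [gaussian.inv_cdf(k / resolution) for k in range(1, resolution)]

    # Each segment mean becomes the index of the region it falls into
    return [bisect_right(breakpoints, value) for value in paa]


print(sax_word([0, 0, 0, 0, 10, 10, 10, 10], word_size=2, resolution=4))
```

A larger `resolution` gives more, narrower regions per symbol, while `word_size` controls how many symbols describe the series.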



IV – Our algorithm – Problem definition

iSAX (cont.)

• Ability to easily move between different resolutions

Resolution  Decimal word                 Binary word
2           {0, 1, 1, 1, 0, 0, 0, 0}     {0,1,1,1,0,0,0,0}
4           {1, 2, 3, 2, 1, 0, 1, 1}     {01,10,11,10,01,00,01,01}
8           {2, 5, 7, 5, 3, 0, 3, 3}     {010,101,111,101,011,000,011,011}
16          {5, 11, 15, 11, 6, 1, 6, 6}  {0101,1011,1111,1011,0110,0001,0110,0110}

[Figure panels: Resolution = 4, Resolution = 16. Image generated by MATLAB and code provided by iSAX authors.]
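In the table above, each binary symbol at one resolution is a prefix of the corresponding symbol at the next finer resolution. Assuming resolutions are powers of two, moving to a coarser resolution can therefore be sketched as a right shift of each symbol. The helper name `demote` is hypothetical.

```python
def demote(word, from_resolution, to_resolution):
    """Reduce a word to a coarser resolution by dropping low-order bits."""
    # Both resolutions are assumed to be powers of two
    shift = from_resolution.bit_length() - to_resolution.bit_length()
    if shift < 0:
        raise ValueError("can only move to a coarser resolution")
    return [symbol >> shift for symbol in word]


word16 = [5, 11, 15, 11, 6, 1, 6, 6]  # resolution-16 row of the table
for resolution in (8, 4, 2):
    print(resolution, demote(word16, 16, resolution))
```

Applied to the resolution-16 word, this reproduces the other three rows of the table.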



IV – Our algorithm

Problem definition (cont.)

• Example of 3 time series that form a motif
• Our motif is the word: {1, 1, 3, 8, 11, 12, 13, 13}



IV – Our algorithm

MrMotif

• Perform one traversal of the time series database
• For each resolution
  • Convert each time series to an iSAX word
  • Maintain and update a counter of the current Top-K motifs, indexed by iSAX word
• e.g. resolution 2

Motif                Count
{2,5,7,5,3,0,3,3}    54
{4,7,0,0,0,1,5,5}    32
{0,0,0,4,5,2,0,0}    25
...
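The traversal described above could be sketched roughly as follows. This is an illustration, not the authors' implementation. It assumes each series has already been converted to a word at the highest resolution, that coarser words are obtained by dropping low-order bits, and that exact counts fit in memory. The function name `count_motifs` and its parameters are hypothetical.

```python
from collections import Counter


def count_motifs(words_at_max, max_resolution, min_resolution=2, k=3):
    """Count words at every resolution in one pass and return the Top-K per resolution."""
    # Resolutions are assumed to be powers of two: max, max/2, ..., min
    resolutions = []
    r = max_resolution
    while r >= min_resolution:
        resolutions.append(r)
        r //= 2

    counters = {r: Counter() for r in resolutions}
    for word in words_at_max:  # single sequential scan of the data
        for r in resolutions:
            shift = max_resolution.bit_length() - r.bit_length()
            key = tuple(symbol >> shift for symbol in word)  # hashable key
            counters[r][key] += 1  # constant-time dictionary update

    return {r: counters[r].most_common(k) for r in resolutions}


words = [[5, 11], [4, 10], [5, 11], [0, 15]]
for resolution, top in count_motifs(words, max_resolution=16).items():
    print(resolution, top)
```

Two series that differ at resolution 16, such as `[5, 11]` and `[4, 10]`, fall into the same bucket at resolution 8 and below.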



IV – Our algorithm

Properties

• Multiresolution
• Interactivity
• Space-Saving



IV – Our algorithm – Properties

Multiresolution

• Our intuition is that at larger resolutions, it is harder for two different time series to match
• Each interval narrows considerably each time we double the resolution



IV – Our algorithm – Properties

Multiresolution (cont.)

• At the largest resolutions, we are working closer to the level of raw data
• This assumption lets us avoid performing expensive distance calculations
• The multiresolution capability allows us to develop interactive visual tools



IV – Our algorithm – Properties

Interactivity

• Feed a tree-like structure with our motifs at different resolutions
• This allows the user to navigate the motif hierarchy
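One way such a hierarchy could be assembled is sketched below. It assumes that a motif's parent is the same word at the next coarser resolution, obtained by dropping low-order bits. The slides do not specify the tree structure, so this linking rule and the name `build_hierarchy` are assumptions for illustration only.

```python
from collections import defaultdict


def build_hierarchy(motifs_by_resolution):
    """Map each (resolution, word) node to its children at the next finer resolution."""
    children = defaultdict(list)
    resolutions = sorted(motifs_by_resolution)
    for coarse, fine in zip(resolutions, resolutions[1:]):
        # Resolutions are assumed to be powers of two
        shift = fine.bit_length() - coarse.bit_length()
        for word in motifs_by_resolution[fine]:
            parent = tuple(symbol >> shift for symbol in word)
            children[(coarse, parent)].append((fine, word))
    return children


motifs = {2: [(0, 1)], 4: [(1, 2), (0, 3)], 8: [(2, 5), (3, 5)]}
tree = build_hierarchy(motifs)
print(tree[(2, (0, 1))])
```

Expanding a node then amounts to looking up its key in `tree`. A parent is linked even if it is not itself among the recorded motifs at the coarser resolution.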



IV – Our algorithm – Properties

Space-Saving (SS)

• Proposed* to efficiently compute frequent elements in data streams
• Monitor only m words
• For each new word e
  • If e is already monitored, increment its count
  • If not, replace the least frequent monitored element with e, and increment its count
• Experimentally shown to guarantee very small errors, with known upper bounds on the over-estimation errors

Reference***
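The update rule above can be sketched in a few lines. This version follows the usual formulation of Space-Saving, in which a newcomer inherits the evicted element's count before being incremented, and that inherited count is recorded as the maximum over-estimation. For brevity it finds the least frequent element with a linear scan rather than a dedicated summary structure. The class and attribute names are hypothetical.

```python
class SpaceSaving:
    """Approximate frequency counter that monitors at most m distinct words."""

    def __init__(self, m):
        self.m = m
        self.counts = {}  # monitored word -> estimated count
        self.errors = {}  # monitored word -> maximum over-estimation

    def add(self, word):
        if word in self.counts:
            # Already monitored: increment its count
            self.counts[word] += 1
        elif len(self.counts) < self.m:
            # Free slot: start an exact count
            self.counts[word] = 1
            self.errors[word] = 0
        else:
            # Replace the least frequent monitored word (linear scan for clarity)
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            del self.errors[victim]
            self.counts[word] = floor + 1
            self.errors[word] = floor

    def top(self, k):
        return sorted(self.counts.items(), key=lambda item: -item[1])[:k]


ss = SpaceSaving(m=2)
for w in "aabac":
    ss.add(w)
print(ss.top(2), ss.errors)
```

Counts are exact until more than `m` distinct words have been seen. After that, each estimate exceeds the true count by at most the stored error.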



IV – Our algorithm – Properties

Space-Saving (cont.)

• We start MrMotif with Space-Saving disabled, in order to make m large enough to further reduce errors
• Activate Space-Saving when the memory threshold is reached (e.g. 128MB guarantees m = 10000 elements)
  • or memory is about to run out



V – Experimental Analysis

• Scalability experiments (synthetic data)
  • Execution time
  • Memory
• Experiments with noise
• Real applications



V – Experimental Analysis

Scalability Experiments

• Dataset:
  • Reproduced from Mueen et al., 2009*.
  • 10 different sets of random walk time series
  • Sets range from 10000 to 100000 series, each series of length 1024
  • About 8GB of time series data
• We compare MrMotif to Random Projection (Chiu et al., 2003)
  • Due to its popularity
  • It is the basis of many current motif discovery approaches
• We also compare Space-Saving (SS) and Full Memory (FM) versions of MrMotif


**Ref
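Random walk series of the kind described above could be generated along these lines. The slides do not state the step distribution, so independent standard-normal steps, the seeding scheme, and the function names are assumptions made for this sketch.

```python
import random


def random_walk(length, rng):
    """Cumulative sum of independent standard-normal steps (assumed step model)."""
    position, walk = 0.0, []
    for _ in range(length):
        position += rng.gauss(0.0, 1.0)
        walk.append(position)
    return walk


def make_dataset(n_series, length=1024, seed=0):
    """Build n_series random walks of the given length, reproducibly from a seed."""
    rng = random.Random(seed)
    return [random_walk(length, rng) for _ in range(n_series)]


dataset = make_dataset(n_series=5)
print(len(dataset), len(dataset[0]))
```

Fixing the seed makes repeated timing runs operate on identical data.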


V – Experimental Analysis – Scalability Experiments

Execution time

• Algorithms are executed 10 times for each of the ten increasingly larger datasets
• Execution times for each dataset are averaged
• Top-10 motifs are recorded
• Maximum amount of memory set to 128MB



V – Experimental Analysis – Scalability Experiments

Execution time (results)

DB size  MrMotif (SS)  MrMotif (FM)  Random Projection
10000    16.43         13.91         53.54
20000    32.68         26.85         193.88
30000    49.60         40.34         404.41
40000    62.92         51.87         705.02
50000    79.26         66.13         1221.13
60000    98.15         78.44         1613.53
70000    114.35        89.33         2139.20
80000    127.27        106.40        2708.53
90000    149.40        116.08        3468.50
100000   158.76        133.11        4357.39
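To make the gap in the table easier to read, the ratio of Random Projection's time to each MrMotif variant can be computed per dataset size. The values below are copied from the table, with decimal commas written as points. The slide does not state the time unit, and the ratios do not depend on it. The function name `speedups` is hypothetical.

```python
# (DB size, MrMotif SS, MrMotif FM, Random Projection), copied from the table
rows = [
    (10000, 16.43, 13.91, 53.54),
    (20000, 32.68, 26.85, 193.88),
    (30000, 49.60, 40.34, 404.41),
    (40000, 62.92, 51.87, 705.02),
    (50000, 79.26, 66.13, 1221.13),
    (60000, 98.15, 78.44, 1613.53),
    (70000, 114.35, 89.33, 2139.20),
    (80000, 127.27, 106.40, 2708.53),
    (90000, 149.40, 116.08, 3468.50),
    (100000, 158.76, 133.11, 4357.39),
]


def speedups(table):
    """Return (size, RP time / SS time, RP time / FM time) for each row."""
    return [(n, rp / ss, rp / fm) for n, ss, fm, rp in table]


for n, vs_ss, vs_fm in speedups(rows):
    print(f"{n:>6}  RP/SS = {vs_ss:5.1f}  RP/FM = {vs_fm:5.1f}")
```

On these figures, Random Projection takes roughly 3 times as long as MrMotif (SS) at 10000 series and roughly 27 times as long at 100000 series.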



V – Experimental Analysis – Scalability Experiments

Memory

• We compare memory usage of the FM and SS versions of MrMotif on the 100000-series dataset
• Observe the impact of SS (memory limit set to 128MB)



V – Experimental Analysis

Experiments with noise

• We apply MrMotif to the 10000-series dataset and record the Top-10 patterns for resolution 4
• MrMotif is executed on each variation of the series
• Precision/recall with respect to the original series are calculated
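The slides do not spell out how precision and recall are computed. One plausible reading, sketched here as an assumption, treats the Top-K words found on the original series as the reference set and compares it with the Top-K words found on a variation. The function name `precision_recall` is hypothetical.

```python
def precision_recall(reference, found):
    """Precision and recall of `found` motifs against the `reference` motifs."""
    reference, found = set(reference), set(found)
    hits = len(reference & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


original_top = [(1, 2, 3), (0, 3, 1), (2, 2, 0)]  # motifs on the original series
noisy_top = [(1, 2, 3), (3, 3, 3)]                # motifs on one variation
print(precision_recall(original_top, noisy_top))
```

When both lists hold the same number of motifs, as with a fixed Top-10, precision and recall coincide.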



V – Experimental Analysis

Experiments with noise (cont.)



V – Experimental Analysis

Real applications

• We have applied MrMotif to real data from:
  • Protein unfolding
  • Sensor networks monitoring
  • Telecommunication network operator



VI – Conclusions

• We have introduced MrMotif to find motifs in time series:
  • Fast
  • Space-efficient
  • Intuitive
  • Robust to noise
  • Easy to use
  • Straightforward
  • Reproducible



VII – Future work

• Motif evaluation and significance measures:
  • Motifs are typically evaluated in a subjective way by humans
  • Objective evaluation measures that rank motifs in terms of significance are necessary
• Motifs as building blocks:
  • As motifs can be used to describe the time series, they can be used as “building blocks” for other data mining tasks:
    • Classification
    • Abnormality detection
    • Forecasting



Thank you for your attention!

• Contact: castro@di.uminho.pt
• MrMotif Web site (executable, source code and datasets): www.di.uminho.pt/~castro/mrmotif



On similarity and multiresolution



On similarity
