Multiresolution Motif Discovery in Time Series
Multiresolution Motif Discovery in Time Series: MrMotif - ALFA
Tenth SIAM International Conference on Data Mining
Columbus, Ohio, USA
Multiresolution Motif Discovery in Time Series
NUNO CASTRO
PAULO AZEVEDO
Department of Informatics
University of Minho
Portugal
April 30th, 2010
Roadmap
I. Motif definition
II. Motivation
III. Related work limitations
IV. Our algorithm
V. Experimental Analysis
VI. Future work
VII. Conclusion
Nuno Castro and Paulo Azevedo
04/30/2010
I – Motif Definition
• Motifs, also known as “recurrent patterns”, “frequent patterns”, “repeated subsequences”, or “typical shapes”, are previously unknown patterns that occur repeatedly in time series
II – Motivation
• Finding motifs is an important task:
  • Describe the time series at hand
  • Help summarize/represent the database
  • Provide useful insight to the domain expert
• Examples of motifs:
  • Patterns that typically precede a seizure in EEG
  • DNA subsequence preserved through evolution
  • Bursts in telecommunication traffic
III – Related work limitations
• Computational complexity
  • Quadratic algorithms are clearly not the solution
• Disk inefficient (use expensive random disk accesses)
• Memory inefficient (assume data can fit into main memory)
• Assume all data are available
III – Related work limitations (cont.)
• Consider motifs at a single resolution
• Are not suited to interactivity
• Large number of unintuitive parameters to set:
  • Motif length
  • Range (distance threshold)
  • Number of columns in the subsequence matrix
• Limited to finding motifs in univariate time series
IV – Our algorithm
• We propose an algorithm:
  • Multiresolution Motif Discovery in Time Series: MrMotif
• Time efficient:
  • One single sequential disk scan
  • Clever representation technique (iSAX)
  • Use of constant access time structures
• Memory efficient:
  • Combine our approach with the Space-Saving algorithm
  • Adjustable amount of memory to use
IV – Our algorithm
Problem definition
• We follow a Top-K frequent pattern approach:
  • i.e. finding the Top-K motifs
• A time series can be counted as a repetition of another if they have the same symbolic representation
• We use the indexable Symbolic Aggregate approXimation (iSAX*)
* Shieh, J. and Keogh, E., iSAX: indexing and mining terabyte sized time series, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp. 623-631.
IV – Our algorithm – Problem definition
iSAX
• State-of-the-art time series representation technique
• Widely used in time series data mining
• Converts a time series to a sequence of symbols (word)
  • Given a resolution (alphabet size) and word size
Image generated by MATLAB and code provided by iSAX authors
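The conversion from a series to a word can be sketched as follows. This is a minimal illustration of the standard SAX steps (z-normalization, piecewise aggregate approximation, discretization against equiprobable Gaussian breakpoints), not the iSAX authors' code. The function names `breakpoints` and `sax_word` are hypothetical, and the sketch assumes the series length is a multiple of the word size.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def breakpoints(resolution):
    """Cut points that split N(0, 1) into `resolution` equiprobable intervals."""
    return [NormalDist().inv_cdf(i / resolution) for i in range(1, resolution)]


def sax_word(series, word_size, resolution):
    """Convert a series into `word_size` symbols in the range 0..resolution-1."""
    mu, sigma = mean(series), pstdev(series)
    # z-normalize; a constant series is mapped to all zeros
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]
    # Piecewise aggregate approximation: mean of each equal-length segment
    # (assumption: len(series) is a multiple of word_size)
    seg = len(z) // word_size
    paa = [mean(z[i * seg:(i + 1) * seg]) for i in range(word_size)]
    cuts = breakpoints(resolution)
    # Symbol = number of breakpoints at or below the segment mean
    return [bisect_right(cuts, v) for v in paa]


if __name__ == "__main__":
    ramp = [0, 1, 2, 3, 4, 5, 6, 7]
    print(sax_word(ramp, word_size=4, resolution=4))
    print(sax_word(ramp, word_size=4, resolution=2))
```

Numbering symbols from 0 at the lowest interval matches the decimal words used in these slides.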
IV – Our algorithm – Problem definition
iSAX (cont.)
• Ability to easily move between different resolutions
Resolution  Decimal word                  Binary word
2           {0, 1, 1, 1, 0, 0, 0, 0}      {0,1,1,1,0,0,0,0}
4           {1, 2, 3, 2, 1, 0, 1, 1}      {01,10,11,10,01,00,01,01}
8           {2, 5, 7, 5, 3, 0, 3, 3}      {010,101,111,101,011,000,011,011}
16          {5, 11, 15, 11, 6, 1, 6, 6}   {0101,1011,1111,1011,0110,0001,0110,0110}
[Figure: the same series at Resolution = 4 and Resolution = 16. Image generated by MATLAB and code provided by iSAX authors]
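In the table above, each lower-resolution binary word is the higher-resolution word with its trailing bits removed. A small sketch of that observation, assuming power-of-two resolutions (the function name `to_lower_resolution` is hypothetical):

```python
def to_lower_resolution(word, from_res, to_res):
    """Re-express a word at a coarser power-of-two resolution.

    Each symbol loses its low-order bits, one bit per halving of the resolution.
    """
    shift = from_res.bit_length() - to_res.bit_length()
    if shift < 0:
        raise ValueError("to_res must not exceed from_res")
    return [symbol >> shift for symbol in word]


if __name__ == "__main__":
    word16 = [5, 11, 15, 11, 6, 1, 6, 6]  # resolution-16 row of the table
    for res in (8, 4, 2):
        print(res, to_lower_resolution(word16, 16, res))
```

Going the other way (to a higher resolution) is not possible from the word alone, because the dropped bits carry information that only the original series can restore.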
IV – Our algorithm
Problem definition (cont.)
• Example of 3 time series that form a motif
• Our motif is the word: {1, 1, 3, 8, 11, 12, 13, 13}
IV – Our algorithm
MrMotif
• Perform one traversal of the time series database
• For each resolution
  • Convert each time series to an iSAX word
  • Maintain and update a counter of the current Top-K motifs, indexed by iSAX word
• e.g. resolution 8 (the words below use symbols 0 to 7)
Motif               Count
{2,5,7,5,3,0,3,3}   54
{4,7,0,0,0,1,5,5}   32
{0,0,0,4,5,2,0,0}   25
...
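The counting loop can be sketched as below. This is an illustrative sketch, not the authors' implementation: it keeps exact counts (no memory bound), it assumes the words have already been computed at the highest resolution, and it derives the coarser words by dropping low-order bits. The names `mr_motif_counts` and `top_k` are hypothetical, and the second input word is made-up example data.

```python
from collections import Counter


def mr_motif_counts(words, max_res, resolutions=(2, 4, 8, 16)):
    """Count, in a single pass, how often each word occurs at every resolution.

    `words` holds one word per time series, expressed at `max_res`
    (assumption: all resolutions are powers of two and <= max_res).
    """
    counters = {res: Counter() for res in resolutions}
    for word in words:  # one traversal of the database
        for res in resolutions:
            shift = max_res.bit_length() - res.bit_length()
            counters[res][tuple(s >> shift for s in word)] += 1
    return counters


def top_k(counters, resolution, k):
    """Return the k most frequent words at one resolution with their counts."""
    return counters[resolution].most_common(k)


if __name__ == "__main__":
    words = [
        (5, 11, 15, 11, 6, 1, 6, 6),
        (4, 10, 14, 10, 7, 0, 7, 7),  # hypothetical second series
        (5, 11, 15, 11, 6, 1, 6, 6),
    ]
    counts = mr_motif_counts(words, max_res=16)
    print(top_k(counts, 16, 1))
    print(top_k(counts, 8, 1))
```

In this example the two distinct resolution-16 words collapse to the same resolution-8 word, so the motif count grows as the resolution drops.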
IV – Our algorithm
Properties
• Multiresolution
• Interactivity
• Space-Saving
IV – Our algorithm – Properties
Multiresolution
• Our intuition is that at larger resolutions, it is harder for two different time series to match
• Each interval narrows considerably each time we double the resolution
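The narrowing can be illustrated numerically. The sketch below assumes the usual SAX convention of breakpoints that split a standard normal distribution into equiprobable intervals; the helper names are hypothetical. It measures the interval just above zero (the outermost intervals are unbounded) and checks that doubling the resolution only adds breakpoints, never moves existing ones.

```python
from statistics import NormalDist


def breakpoints(resolution):
    """Cut points that split N(0, 1) into `resolution` equiprobable intervals."""
    return [NormalDist().inv_cdf(i / resolution) for i in range(1, resolution)]


def central_width(resolution):
    """Width of the interval just above zero (assumes an even resolution >= 4)."""
    cuts = breakpoints(resolution)
    mid = resolution // 2 - 1  # index of the breakpoint located at 0
    return cuts[mid + 1] - cuts[mid]


def refines(coarse, fine, tol=1e-12):
    """True if every breakpoint of `coarse` is also a breakpoint of `fine`."""
    fine_cuts = breakpoints(fine)
    return all(any(abs(c - f) <= tol for f in fine_cuts)
               for c in breakpoints(coarse))


if __name__ == "__main__":
    for res in (4, 8, 16):
        print(res, round(central_width(res), 3))
    print(refines(4, 8), refines(8, 16))
```

Because the breakpoints are nested, two series that share a word at a high resolution necessarily share the corresponding word at every lower resolution.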
IV – Our algorithm – Properties
Multiresolution (cont.)
• At the largest resolutions, we are working closer to the level of raw data
• This assumption spares us from performing expensive distance calculations
• The multiresolution capability makes it possible to develop interactive visual tools
IV – Our algorithm – Properties
Interactivity
• Feed a tree-like structure with our motifs at different resolutions
• This allows the user to navigate the motif hierarchy
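One possible way to build such a hierarchy is sketched below. The slides do not specify the actual structure, so this is an assumption: a motif at resolution r is attached to the motif at resolution r/2 obtained by dropping the last bit of every symbol, following the relation visible in the table of binary words earlier in the slides. The name `build_hierarchy` and the second resolution-16 word are hypothetical.

```python
from collections import defaultdict


def build_hierarchy(motifs_by_resolution):
    """Link each motif to its parent at half the resolution.

    `motifs_by_resolution` maps a power-of-two resolution to a list of words
    (tuples). Returns {(resolution, word): [(child_resolution, child_word), ...]}.
    """
    children = defaultdict(list)
    for res, words in motifs_by_resolution.items():
        if res // 2 not in motifs_by_resolution:
            continue  # coarsest level present: these words are roots
        for word in words:
            parent = tuple(symbol >> 1 for symbol in word)
            children[(res // 2, parent)].append((res, word))
    return dict(children)


if __name__ == "__main__":
    motifs = {
        4: [(1, 2, 3, 2, 1, 0, 1, 1)],
        8: [(2, 5, 7, 5, 3, 0, 3, 3)],
        16: [(5, 11, 15, 11, 6, 1, 6, 6), (4, 10, 14, 10, 7, 0, 7, 7)],
    }
    tree = build_hierarchy(motifs)
    for node, kids in tree.items():
        print(node, "->", kids)
```

Navigating downwards from a coarse motif then amounts to following the child links, which is the kind of drill-down an interactive tool could expose.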
IV – Our algorithm – Properties
Space-Saving (SS)
• Proposed* to efficiently compute frequent elements in data streams
• Monitor only m words
• For each new word e
  • If e is already monitored, increment its count
  • If not, replace the least frequent monitored element with e, which inherits that element's count, and increment it
• Experimentally shown to yield very small errors, with known upper bounds on the over-estimation errors
Reference***
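The update rule above can be sketched in a few lines. This is a simplified illustration: it finds the least frequent element with a linear scan, whereas an efficient implementation would use a dedicated structure for constant-time updates. The class and attribute names are hypothetical.

```python
class SpaceSaving:
    """Approximate frequency counter that monitors at most `m` elements."""

    def __init__(self, m):
        self.m = m
        self.count = {}  # element -> estimated count
        self.error = {}  # element -> upper bound on the over-estimation

    def add(self, element):
        if element in self.count:
            self.count[element] += 1
        elif len(self.count) < self.m:
            self.count[element] = 1
            self.error[element] = 0
        else:
            # Evict the least frequent monitored element (linear scan here)
            victim = min(self.count, key=self.count.get)
            floor = self.count.pop(victim)
            del self.error[victim]
            # The newcomer inherits the evicted count, then is incremented
            self.count[element] = floor + 1
            self.error[element] = floor

    def top(self, k):
        """Return the k elements with the highest estimated counts."""
        return sorted(self.count.items(), key=lambda kv: -kv[1])[:k]


if __name__ == "__main__":
    ss = SpaceSaving(m=2)
    for word in ["a", "a", "a", "b", "c", "a"]:
        ss.add(word)
    print(ss.count, ss.error)
```

For any monitored element, the true count lies between `count - error` and `count`, which is the known upper bound on over-estimation mentioned above.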
IV – Our algorithm – Properties
Space-Saving (cont.)
• We start MrMotif with Space-Saving disabled, in order to make m large enough to further reduce errors
• Activate Space-Saving when the memory threshold is reached (e.g. 128MB guarantees m = 10000 elements)
  • or memory is about to run out
V – Experimental Analysis
• Scalability experiments (synthetic data)
  • Execution time
  • Memory
• Experiments with noise
• Real applications
V – Experimental Analysis
Scalability Experiments
• Dataset:
  • Reproduced from Mueen et al., 2009*.
  • 10 different sets of random walk time series
  • Sets range from 10000 to 100000 series, each series of length 1024
  • About 8GB of time series data
• We compare MrMotif to Random Projection (Chiu et al., 2003)
  • Due to its popularity
  • It is the basis of many current motif discovery approaches
• We also compare Space-Saving (SS) and Full Memory (FM) versions of MrMotif
**Ref
V – Experimental Analysis – Scalability Experiments
Execution time
• Algorithms are executed 10 times for each of the ten increasingly large datasets
• Execution times for each dataset are averaged
• Top-10 motifs are recorded
• Maximum amount of memory set to 128MB
V – Experimental Analysis – Scalability Experiments
Execution time (results)
DB size   MrMotif (SS)   MrMotif (FM)   Random Projection
10000     16.43          13.91          53.54
20000     32.68          26.85          193.88
30000     49.60          40.34          404.41
40000     62.92          51.87          705.02
50000     79.26          66.13          1221.13
60000     98.15          78.44          1613.53
70000     114.35         89.33          2139.20
80000     127.27         106.40         2708.53
90000     149.40         116.08         3468.50
100000    158.76         133.11         4357.39
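The figures in the table can be turned into growth factors and speedups with a few lines of arithmetic. The values below are copied from the table; the slide does not state the time unit, which does not matter because only ratios are computed. The function names are hypothetical.

```python
db_sizes = [10000, 20000, 30000, 40000, 50000,
            60000, 70000, 80000, 90000, 100000]
mrmotif_ss = [16.43, 32.68, 49.60, 62.92, 79.26,
              98.15, 114.35, 127.27, 149.40, 158.76]
mrmotif_fm = [13.91, 26.85, 40.34, 51.87, 66.13,
              78.44, 89.33, 106.40, 116.08, 133.11]
random_projection = [53.54, 193.88, 404.41, 705.02, 1221.13,
                     1613.53, 2139.20, 2708.53, 3468.50, 4357.39]


def growth(times):
    """How much the running time grows from the smallest to the largest dataset."""
    return times[-1] / times[0]


def speedups(baseline, times):
    """Per-dataset ratio of the baseline time to the compared time."""
    return [b / t for b, t in zip(baseline, times)]


if __name__ == "__main__":
    for name, times in [("MrMotif (SS)", mrmotif_ss),
                        ("MrMotif (FM)", mrmotif_fm),
                        ("Random Projection", random_projection)]:
        print(f"{name}: x{growth(times):.2f} for a x10 larger database")
    for size, s in zip(db_sizes, speedups(random_projection, mrmotif_ss)):
        print(size, f"{s:.1f}")
```

For a tenfold increase in database size, the MrMotif times grow by a factor of a little under 10, while the Random Projection time grows by a factor of about 81, which is consistent with roughly linear versus roughly quadratic scaling.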
V – Experimental Analysis – Scalability Experiments
Memory
• We compare memory usage of the FM and SS versions of MrMotif on the 100000 sized dataset
• Observe the impact of SS (memory limit set to 128MB)
V – Experimental Analysis
Experiments with noise
• We apply MrMotif to the 10000 sized dataset and record the Top-10 patterns for resolution 4
• MrMotif is executed on each variation of the series
• Precision/recall with respect to the original series are calculated
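A minimal sketch of the precision/recall computation follows. It rests on an assumption the slides do not spell out: a motif from the original data counts as recovered only if exactly the same word appears among the motifs found in the modified data. The function name and the example words are hypothetical.

```python
def precision_recall(reference, retrieved):
    """Compare two collections of motif words.

    `reference`: motifs found on the original series.
    `retrieved`: motifs found on a modified version of the series.
    """
    reference, retrieved = set(reference), set(retrieved)
    hits = len(reference & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


if __name__ == "__main__":
    original = [(0, 1), (1, 2), (2, 3), (3, 3)]  # made-up words
    noisy = [(0, 1), (1, 2), (3, 0)]
    print(precision_recall(original, noisy))
```

When both lists hold the same number of distinct motifs, as with two Top-10 lists, precision and recall coincide.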
V – Experimental Analysis
Experiments with noise (cont.)
V – Experimental Analysis
Real applications
• We have applied MrMotif to real data from:
  • Protein unfolding
  • Sensor networks monitoring
  • Telecommunication network operator
VI – Conclusions
• We have introduced MrMotif to find motifs in time series:
  • Fast
  • Space-efficient
  • Intuitive
  • Robust to noise
  • Easy to use
  • Straightforward
  • Reproducible
VII – Future work
• Motif evaluation and significance measures:
  • Motifs are typically evaluated in a subjective way by humans
  • Objective evaluation measures that rank motifs in terms of significance are necessary
• Motifs as building blocks:
  • As motifs can be used to describe the time series, they can be used as “building blocks” for other data mining tasks:
    • Classification
    • Abnormality detection
    • Forecasting
Thank you for your attention!
• Contact: castro@di.uminho.pt
• MrMotif Web site (executable, source code and datasets): www.di.uminho.pt/~castro/mrmotif
On similarity and multiresolution
On similarity