Multiresolution Motif Discovery in Time Series

Tenth SIAM International Conference on Data Mining 

Columbus, Ohio, USA 

NUNO CASTRO 

PAULO AZEVEDO 

Department of Informatics 

University of Minho 

Portugal 

April 30th, 2010

Roadmap 

I. Motif definition 

II. Motivation 

III. Related work limitations 

IV. Our algorithm 

V. Experimental Analysis 

VI. Conclusions 

VII. Future work 

I – Motif Definition 

• Motifs, also known as “recurrent patterns”, “frequent patterns”, “repeated subsequences”, or “typical shapes”, are previously unknown patterns that recur in time series 


II – Motivation 

• Finding motifs is an important task: 

• Describe the time series at hand 

• Help summarize/represent the database 

• Provide useful insight to the domain expert 

• Examples of motifs: 

• Patterns that typically precede a seizure in EEG 

• DNA subsequences preserved through evolution 

• Bursts in telecommunication traffic 


III – Related work limitations 

• Computational complexity 

• Quadratic-time algorithms do not scale to large databases 

• Disk inefficient (rely on expensive random disk accesses) 

• Memory inefficient (assume the data fit into main memory) 

• Assume all data are available 


III – Related work limitations (cont.) 

• Consider motifs at a single resolution 

• Are not suited to interactivity 

• Large number of unintuitive parameters to set: 

• Motif length 

• Range (distance threshold) 

• Number of columns in the subsequence matrix 

• Limited to finding motifs in univariate time series 


IV – Our algorithm 

• We propose an algorithm: 

• Multiresolution Motif Discovery in Time Series: MrMotif 

• Time efficient: 

• One single sequential disk scan 

• Clever representation technique (iSAX) 

• Use of constant access time structures 

• Memory efficient: 

• Combine our approach with the Space-Saving algorithm 

• Adjustable amount of memory to use 



Problem definition 

• We follow a Top-K frequent pattern approach: 

• i.e. finding the Top-K motifs 

• A time series can be counted as a repetition of another if they have the same symbolic representation 

• We use the indexable Symbolic Aggregate approXimation (iSAX*) 

* Shieh, J. and Keogh, E., iSAX: indexing and mining terabyte sized time series, in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008), pp. 623-631. 


IV – Our algorithm – Problem definition 

iSAX 

• State of the art time series representation technique 

• Widely used in time series data mining 

• Converts a time series to a sequence of symbols (word) 

• Given a resolution (alphabet size) and word size 

Image generated by MATLAB and code provided by iSAX authors 
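
The conversion described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the usual SAX-style pipeline (z-normalization, piecewise averaging into word-size segments, and breakpoints that split a standard normal distribution into equiprobable regions). The function name isax_word, the symbol ordering (0 for the lowest region), and the requirement that the series length be a multiple of the word size are assumptions made for this example.

```python
from statistics import NormalDist, mean, pstdev


def isax_word(series, word_size, resolution):
    """Convert a numeric series to a list of integer symbols in [0, resolution)."""
    # z-normalize so that Gaussian breakpoints are meaningful
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        normalized = [0.0 for _ in series]
    else:
        normalized = [(x - mu) / sigma for x in series]

    # Piecewise averaging: mean of each of word_size equal-length segments
    # (assumption: len(series) is a multiple of word_size)
    seg = len(normalized) // word_size
    paa = [mean(normalized[i * seg:(i + 1) * seg]) for i in range(word_size)]

    # Breakpoints split the standard normal into `resolution` equiprobable regions
    gauss = NormalDist()
    breakpoints = [gauss.inv_cdf(k / resolution) for k in range(1, resolution)]

    # Symbol = number of breakpoints at or below the segment mean
    return [sum(1 for b in breakpoints if v >= b) for v in paa]


example = [0, 0, 1, 1, 2, 2, 3, 3]
word_at_4 = isax_word(example, word_size=4, resolution=4)
word_at_2 = isax_word(example, word_size=4, resolution=2)
```

Calling the same function with a different resolution value yields the word for that alphabet size, which is the only parameter besides the word size.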


IV – Our algorithm – Problem definition 

iSAX (cont.) 

• Ability to easily move between different resolutions 

Resolution   Decimal word                   Binary word
2            {0, 1, 1, 1, 0, 0, 0, 0}       {0,1,1,1,0,0,0,0}
4            {1, 2, 3, 2, 1, 0, 1, 1}       {01,10,11,10,01,00,01,01}
8            {2, 5, 7, 5, 3, 0, 3, 3}       {010,101,111,101,011,000,011,011}
16           {5, 11, 15, 11, 6, 1, 6, 6}    {0101,1011,1111,1011,0110,0001,0110,0110}

[Figure panels: Resolution = 4, Resolution = 16. Image generated by MATLAB and code provided by iSAX authors]
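
In the table, each row can be obtained from the row below it by dropping the last bit of every binary symbol. A small sketch of that operation, assuming power-of-two resolutions; the helper name demote is hypothetical and not part of iSAX or MrMotif:

```python
def demote(word, from_resolution, to_resolution):
    """Re-express a word at a lower power-of-two resolution by dropping low-order bits."""
    if from_resolution < to_resolution:
        raise ValueError("can only move to an equal or lower resolution")
    # Number of bits to drop, e.g. 16 -> 4 drops two bits per symbol
    shift = from_resolution.bit_length() - to_resolution.bit_length()
    return [symbol >> shift for symbol in word]


# The resolution-16 word from the table
word_16 = [5, 11, 15, 11, 6, 1, 6, 6]
word_8 = demote(word_16, 16, 8)
word_4 = demote(word_16, 16, 4)
word_2 = demote(word_16, 16, 2)
```

Applied to the resolution-16 word, this reproduces the other three rows of the table.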



Problem definition (cont.) 

• Example of 3 time series that form a motif 

• Our motif is the word: { 1, 1, 3, 8, 11, 12, 13, 13 } 



MrMotif 

• Perform one traversal of the time series database 

• For each resolution 

• Convert each time series to an iSAX word 

• Maintain and update a counter of the current Top-K motifs, indexed by iSAX word 

• e.g. resolution 8 

Motif                 Count
{2,5,7,5,3,0,3,3}     54
{4,7,0,0,0,1,5,5}     32
{0,0,0,4,5,2,0,0}     25
...
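
The steps on this slide can be sketched as a single pass with one counter per resolution. This is an illustrative sketch of the Full Memory case (no Space-Saving), not the authors' code: the names mr_motif_full_memory and isax_word, the default resolutions and word size, and the SAX-style conversion (z-normalization, segment means, Gaussian breakpoints) are assumptions for this example.

```python
from collections import Counter
from statistics import NormalDist, mean, pstdev


def isax_word(series, word_size, resolution):
    """Assumed SAX-style conversion; returns a hashable tuple of symbols."""
    mu, sigma = mean(series), pstdev(series)
    normalized = [(x - mu) / sigma if sigma else 0.0 for x in series]
    seg = len(normalized) // word_size  # assumes length is a multiple of word_size
    paa = [mean(normalized[i * seg:(i + 1) * seg]) for i in range(word_size)]
    gauss = NormalDist()
    breakpoints = [gauss.inv_cdf(k / resolution) for k in range(1, resolution)]
    return tuple(sum(1 for b in breakpoints if v >= b) for v in paa)


def mr_motif_full_memory(database, resolutions=(2, 4, 8, 16), word_size=8, k=10):
    """One sequential pass; returns the Top-K (word, count) pairs per resolution."""
    counters = {r: Counter() for r in resolutions}
    for series in database:  # single traversal of the database
        for r in resolutions:
            counters[r][isax_word(series, word_size, r)] += 1
    return {r: counters[r].most_common(k) for r in resolutions}


# Two rising series with different offset and scale, and one falling series
rising = list(range(16))
rising_scaled = [2 * x + 5 for x in range(16)]
falling = list(range(15, -1, -1))
top = mr_motif_full_memory([rising, rising_scaled, falling])
```

In this toy database the two rising series share a word at every resolution, so that word is reported with a count of 2, while the falling series forms its own word with a count of 1.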



Properties 

• Multiresolution 

• Interactivity 

• Space-Saving 


IV – Our algorithm – Properties 

Multiresolution 

• Our intuition is that at higher resolutions it is harder for two different time series to match 

• Each interval narrows considerably each time we double the resolution 
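
To make the narrowing concrete, the sketch below prints the interval boundaries for several resolutions. It assumes the standard SAX convention that breakpoints divide a standard normal distribution into equiprobable regions; the helper name breakpoints is hypothetical.

```python
from statistics import NormalDist


def breakpoints(resolution):
    """Boundaries that split a standard normal into `resolution` equiprobable regions."""
    gauss = NormalDist()
    return [gauss.inv_cdf(k / resolution) for k in range(1, resolution)]


for r in (2, 4, 8, 16):
    print(r, [round(b, 2) for b in breakpoints(r)])
```

Under this assumption every boundary at one resolution is kept at the next, and one new boundary is added inside each existing interval, so two series that share a symbol at a high resolution fall within a much narrower band of values.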



Multiresolution (cont.) 

• At the highest resolutions, we are working closer to the level of the raw data 

• This assumption lets us avoid expensive distance calculations 

• The multiresolution capability makes it possible to develop interactive visual tools 



Interactivity 

• Feed a tree-like structure with our motifs at different resolutions 

• This makes it possible to navigate the motif hierarchy 
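
One possible way to build such a hierarchy is to link each motif to the word it maps to at the next coarser resolution. The slides do not specify the structure, so the sketch below is only an assumption: build_hierarchy is a hypothetical helper, resolutions are taken to be powers of two, and the parent of a word is obtained by dropping low-order bits from each symbol.

```python
from collections import defaultdict


def build_hierarchy(motifs_by_resolution):
    """Map each (resolution, word) to its child motifs at the next finer resolution."""
    children = defaultdict(list)
    resolutions = sorted(motifs_by_resolution)
    for coarse, fine in zip(resolutions, resolutions[1:]):
        shift = fine.bit_length() - coarse.bit_length()
        for word in motifs_by_resolution[fine]:
            # The parent is the same word re-expressed at the coarser resolution
            parent = tuple(symbol >> shift for symbol in word)
            children[(coarse, parent)].append((fine, word))
    return children


motifs = {
    4: [(1, 2, 3, 2, 1, 0, 1, 1)],
    8: [(2, 5, 7, 5, 3, 0, 3, 3)],
    16: [(5, 11, 15, 11, 6, 1, 6, 6)],
}
tree = build_hierarchy(motifs)
```

Note that in this sketch a child is attached to its coarser word even if that word is not itself among the reported Top-K motifs at the coarser resolution; a visual tool would need to decide how to display such cases.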



Space-Saving (SS) 

• Proposed* to efficiently compute frequent elements in data streams 

• Monitor only m words 

• For each new word e 

• If e is already monitored, increment its count 

• If not, replace the least frequent monitored element with e and increment its count (e inherits the count of the element it replaces) 

• Shown experimentally to yield very small errors, with known upper bounds on the over-estimation error 

Reference*** 
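
The update rule above can be sketched in a few lines. This is a simplified illustration: the function name space_saving is an assumption, and the linear scan for the least frequent element stands in for the more efficient data structure a real implementation would use.

```python
def space_saving(stream, m):
    """Approximate counts for a stream while monitoring at most m elements."""
    counts = {}  # element -> estimated count
    errors = {}  # element -> maximum possible over-estimation of that count
    for e in stream:
        if e in counts:
            counts[e] += 1
        elif len(counts) < m:
            # Still room: start monitoring e with an exact count
            counts[e] = 1
            errors[e] = 0
        else:
            # Evict the least frequent element; e inherits its count
            victim = min(counts, key=counts.get)
            inherited = counts.pop(victim)
            errors.pop(victim)
            counts[e] = inherited + 1
            errors[e] = inherited
    return counts, errors


counts, errors = space_saving("aabacad", m=2)
```

For each monitored element, the estimated count minus its recorded error is a lower bound on the true count, which is where the upper bound on over-estimation comes from.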



Space-Saving (cont.) 

• We start MrMotif with Space-Saving disabled, so that m can grow large enough to further reduce errors 

• Activate Space-Saving when the memory threshold is reached (e.g. 128MB guarantees m = 10000 elements) 

• or when memory is about to run out 


V – Experimental Analysis 

• Scalability experiments (synthetic data) 

• Execution time 

• Memory 

• Experiments with noise 

• Real applications 



Scalability Experiments 

• Dataset: 

• Reproduced from Mueen et al., 2009*. 

• 10 different sets of random walk time series 

• Set sizes range from 10000 to 100000 series, each series of length 1024 

• About 8GB of time series data 

• We compare MrMotif to Random Projection (Chiu et al., 2003) 

• It is popular 

• It is the basis of many current motif discovery approaches 

• We also compare the Space-Saving (SS) and Full Memory (FM) versions of MrMotif 


**Ref

V – Experimental Analysis – Scalability Experiments 

Execution time 

• Each algorithm is executed 10 times on each of the ten increasingly large datasets 

• Execution times for each dataset are averaged 

• Top-10 motifs are recorded 

• Maximum amount of memory set to 128MB 



Execution time (results) 

DB size   MrMotif (SS)   MrMotif (FM)   Random Projection
10000     16.43          13.91          53.54
20000     32.68          26.85          193.88
30000     49.60          40.34          404.41
40000     62.92          51.87          705.02
50000     79.26          66.13          1221.13
60000     98.15          78.44          1613.53
70000     114.35         89.33          2139.20
80000     127.27         106.40         2708.53
90000     149.40         116.08         3468.50
100000    158.76         133.11         4357.39
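
A quick way to read the table is to compare its first and last rows. The short calculation below uses only numbers copied from the table; the variable names are illustrative.

```python
# Execution times copied from the first and last rows of the table
smallest = {"SS": 16.43, "FM": 13.91, "RP": 53.54}     # 10000 series
largest = {"SS": 158.76, "FM": 133.11, "RP": 4357.39}  # 100000 series

# How much slower each method gets when the database grows 10x
growth = {name: largest[name] / smallest[name] for name in smallest}

# How many times faster MrMotif (SS) is than Random Projection at each size
speedup_small = smallest["RP"] / smallest["SS"]
speedup_large = largest["RP"] / largest["SS"]

for name, factor in growth.items():
    print(f"{name}: {factor:.1f}x slower for 10x more data")
print(f"SS vs RP speedup: {speedup_small:.1f}x at 10000, {speedup_large:.1f}x at 100000")
```

Both MrMotif versions slow down by a little under 10x for 10x more data, which is consistent with roughly linear scaling, whereas Random Projection slows down by about 81x, so the gap between the two widens as the database grows.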



Memory 

• We compare the memory usage of the FM and SS versions of MrMotif on the dataset with 100000 series 

• Observe the impact of SS (memory limit set to 128MB) 



Experiments with noise 

• We apply MrMotif to the dataset with 10000 series and record the Top-10 patterns for resolution 4 

• MrMotif is executed on each variation of the series 

• Precision and recall are calculated with respect to the original series 
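
The slides do not state exactly how precision and recall are computed. One plausible reading, sketched below as an assumption, treats the Top-K motif words found on the original series as the reference set and the words found on a noisy variation as the retrieved set. The function name and the toy words are hypothetical.

```python
def precision_recall(reference_motifs, found_motifs):
    """Precision and recall of the found motif words against a reference set."""
    reference, found = set(reference_motifs), set(found_motifs)
    hits = len(reference & found)
    precision = hits / len(found) if found else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy example with made-up two-symbol words
original_top = [(0, 1), (1, 1), (1, 0)]
noisy_top = [(0, 1), (1, 1), (0, 0), (1, 2)]
precision, recall = precision_recall(original_top, noisy_top)
```

When both sets contain exactly K distinct words, precision and recall coincide under this definition; they differ only if the two runs report different numbers of motifs.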



Experiments with noise (cont.) 



Real applications 

• We have applied MrMotif to real data from: 

• Protein unfolding 

• Sensor network monitoring 

• Telecommunication network operator 


VI – Conclusions 

• We have introduced MrMotif to find motifs in time series: 

• Fast 

• Space-efficient 

• Intuitive 

• Robust to noise 

• Easy to use 

• Straightforward 

• Reproducible 


VII – Future work 

• Motif evaluation and significance measures: 

• Motifs are typically evaluated in a subjective way by humans 

• Objective evaluation measures that rank motifs in terms of significance are necessary 

• Motifs as building blocks: 

• As motifs can be used to describe the time series, they can be used as “building blocks” for other data mining tasks: 

• Classification 

• Abnormality detection 

• Forecasting 


Thank you for your attention! 

• Contact: castro@di.uminho.pt 

• MrMotif Web site (executable, source code and datasets): 

www.di.uminho.pt/~castro/mrmotif 


On similarity and multiresolution 


On similarity 

