ACCELERATING THE MIXING PHASE IN STUDIO RECORDING PRODUCTIONS BY AUTOMATIC AUDIO ALIGNMENT

Nicola Montecchio
University of Padova, Department of Information Engineering
nicola.montecchio@dei.unipd.it

Arshia Cont
Institut de Recherche et Coordination Acoustique/Musique (IRCAM)
arshia.cont@ircam.fr

ABSTRACT

We propose a system for accelerating the mixing phase in a recording production, by making use of audio alignment techniques to automatically align multiple takes of excerpts of a music piece against a performance of the whole work. We extend the approach of our previous work, based on sequential Monte Carlo inference techniques, which was targeted at real-time alignment for score/audio following. The proposed approach is capable of producing partial alignments as well as identifying relevant regions in the partial results with regard to the reference, for better integration within a studio mix workflow. The approach is evaluated using data obtained from two recording sessions of classical music pieces, and we discuss its effectiveness for reducing manual work in a production chain.

1. INTRODUCTION

The common practice in productions of studio recordings consists of several phases. At first the raw audio material is captured and stored on a support.
This material is subsequently combined and edited in order to produce a mix, which is finalized in the mastering phase for commercial release. Nowadays, the whole process revolves around a computer Digital Audio Workstation (DAW).

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. © 2011 International Society for Music Information Retrieval.

In the case of instrumental recording, the initial task involves capturing a complete reference run-through of the entire piece, after which additional takes of specific sections are recorded to allow the mixing engineer to mask performance mistakes or reduce possible environmental noise. The role of a mixing engineer is to integrate these takes within the global reference in order to achieve a seamless final mix [2]. The first step in preparing a mix session consists in arranging the takes with regard to the global reference. Figure 1 shows a typical DAW session prepared out of a reference run-through (the top track) and additional takes aligned appropriately.
Those takes usually require further cleanup, as they commonly include noise or conversation that is not useful for the final mix. This means that, in addition to alignment, the mixing engineer identifies cut points for each take that correspond to relevant regions in the reference. The additional takes are finally blended with the reference by crossfading short overlapping audio regions to avoid perceptual discontinuities.

Figure 1. A typical DAW mixing session.

The purpose of this work is to facilitate the process of mixing by integrating automatic (audio-to-audio) alignment techniques into the production chain. Special care is taken to consider existing practices within the workflow, such as automatic identification of interest points. In contrast to most literature on audio alignment, we are concerned with two essential aspects: the ability to identify a partial alignment with an unknown starting position and the detection of regions of interest inside the alignment.
Moreover, our approach allows different degrees of accuracy to be achieved depending on efficiency requirements.

Using audio material collected from two real-life recording sessions, we show that it is possible to optimize the operations of sound engineers by automating time-consuming tasks. We further discuss how such a framework can be integrated pragmatically within common DAW software.


2. RELATED WORK

At the application level, alignment techniques were already introduced in the literature in [3]. Alignment of audio to the symbolic representation of a piece was integrated into the workflow, permitting the automation of the editing process through operations such as pitch and timing corrections. The application of these approaches is precluded in the present context by the requirement of accessing a symbolic representation of the music. Nonetheless, despite this limitation, the work provides important insights into the integration within a DAW setup.

At the technological level, audio alignment has often been the subject of extensive research; an overview of classical approaches in the literature can be found in [6]. In contrast to traditional methods, an important aspect of this work is the consideration of partial results and the detection of interest regions. An audio alignment method with similar aims, which explicitly deals with the synchronization of recordings that have different structural forms, was introduced in [7].

3. GENERAL ARCHITECTURE

The proposed methodology was devised assuming that a generic algorithm is available that is capable of aligning audio sequences without a known starting position.
Even though methods such as HMM or DTW [4] could have been used for this aim, we chose to exploit our previous work [6] on sequential Monte Carlo inference because of its straightforward applicability to the present context, its flexibility regarding the degree of accuracy given by the availability of smoothing algorithms, and the possibility to trade accuracy for computational efficiency in a direct way.

In the first phase a rough alignment is produced as in Figure 2(a); the initial uncertainty in the alignment is due to the fact that the initial position is not known a priori. In a second phase we identify a sufficiently long region of the alignment that can be reasonably approximated by a straight line, as in Figure 2(b); this region intuitively corresponds to the “correct” section of the alignment. These two phases solve the task of placing the takes along the reference (Figure 1).

The remaining steps address the tasks in which a more accurate alignment is required.
In the third phase, the initial portion of the alignment is corrected, starting from a position inside the region found in the previous phase and using a reversed variant of the alignment algorithm (Figure 2(c)). Finally, a refined alignment is produced by exploiting a smoothing algorithm for sequential Monte Carlo inference, as shown in Figure 2(d).

4. METHODOLOGY

The four phases described in the previous section are highlighted in Figure 2 and described below in detail.

[Figure 2: four panels, each plotting reference time [s] against take time [s].
(a) Initial alignment, using sequential Monte Carlo inference; prior $p(x_0) = U(0, L)$.
(b) Identification of the interest region of the alignment; the convergence region and the interest region are marked.
(c) Correction of the beginning of the alignment; prior $p(x^{(b)}_B) = \mathrm{diag}(1, -1)\, p(x_B \mid z_0 \dots z_B)$.
(d) Final alignment obtained using smoothed inference; prior $p(x^{(fb)}_0) = \mathrm{diag}(1, -1)\, p(x^{(b)}_0 \mid z_B \dots z_0)$.]

Figure 2. Alignment methodology.


4.1 Initial Alignment

The alignment problem is formulated as the tracking of an input data stream along a reference, using motion equations.

4.1.1 System State Representation

The system state is modeled as a two-dimensional random variable $x = (s, t)$, representing the current position in the reference audio and the tempo, respectively; $s$ is measured in seconds and $t$ is the speed ratio of the performances. The incoming signal processing frontend is based on spectral features extracted from the FFT analysis of an overlapping, windowed signal representation, with hop size $\Delta T$. In order to use sequential Monte Carlo methods to estimate the hidden variable $x_k = (s_k, t_k)$ using observation $z_k$ at time frame $k$, we assume that the state evolution is Markovian.

4.1.2 Observation Modeling

Let $p(z_k \mid x_k)$ denote the likelihood of observing an audio frame $z_k$ of the take given the current position along the reference performance $s_k$.
We consider a simple spectral similarity measure, defined as the Kullback-Leibler divergence between the power spectra at frame $k$ of the take and at time $s_k$ in the reference.

4.1.3 System State Transition Modeling

Let $p(x_k \mid x_{k-1})$ denote the pdf for the state transition; we make use of the tempo estimate from the previous frame, assuming that it does not change too quickly:

\[
p(x_k \mid x_{k-1}) = \mathcal{N}\!\left( \begin{bmatrix} s_k \\ t_k \end{bmatrix} \,\middle|\, \mu_k, \Sigma \right), \qquad
\mu_k = \begin{bmatrix} s_{k-1} + \Delta T \, t_{k-1} \\ t_{k-1} \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \sigma_s^2 \Delta T & 0 \\ 0 & \sigma_t^2 \Delta T \end{bmatrix}
\]

Intuitively, this corresponds to a performance where tempo is rather steady but can fluctuate; the parameters $\sigma_t^2$ and $\sigma_s^2$ control, respectively, the variability of tempo and the possibility of local mismatches that do not affect the overall tempo estimate.

4.1.4 Inference Algorithm

Sequential Monte Carlo inference methods work by recursively approximating the current distribution of the system state using the technique of Sequential Importance Sampling: a random measure $\{x_k^i, w_k^i\}_{i=1}^{N_s}$ is used to characterize the posterior pdf with a set of $N_s$ particles over the state domain and associated weights, and is updated at each time step as in Algorithm 1. In particular, $q(x_k \mid x_{k-1}, z_k)$ is the particle sampling function.
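As an illustration, one update step of such a particle filter can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`kl_likelihood`, `sis_update`), the mapping from divergence to likelihood as `exp(-KL)`, and the nearest-frame lookup of reference spectra are all hypothetical choices. It also assumes that particles are sampled from the transition density, in which case the importance weight update reduces to a multiplication by the observation likelihood.

```python
import numpy as np


def kl_likelihood(take_spec, ref_specs, eps=1e-12):
    """Assumed likelihood: exp(-KL) between normalised power spectra.

    take_spec: (bins,) spectrum of the take frame.
    ref_specs: (n, bins) reference spectra, one row per particle.
    """
    p = take_spec / (take_spec.sum() + eps)
    q = ref_specs / (ref_specs.sum(axis=-1, keepdims=True) + eps)
    kl = np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)
    return np.exp(-kl)


def sis_update(particles, weights, take_frame, ref_spectra, hop,
               sigma_s, sigma_t, threshold, rng):
    """One SIS update with optional resampling.

    particles: (n, 2) array of (position s [s], tempo t).
    ref_spectra: (frames, bins) reference spectra with hop size `hop` [s].
    """
    n = len(particles)
    s_prev, t_prev = particles[:, 0], particles[:, 1]
    # Sample from the Gaussian transition density (mean mu_k, covariance Sigma).
    s = s_prev + hop * t_prev + rng.normal(0.0, np.sqrt(sigma_s**2 * hop), n)
    t = t_prev + rng.normal(0.0, np.sqrt(sigma_t**2 * hop), n)
    # Assumption: look up the reference frame nearest to each position.
    idx = np.clip(np.round(s / hop).astype(int), 0, len(ref_spectra) - 1)
    # Sampling from the transition density: weights are scaled by the likelihood.
    w = weights * kl_likelihood(take_frame, ref_spectra[idx])
    w = w / w.sum()
    new_particles = np.column_stack([s, t])
    # Effective sample size; resample when it falls below the threshold.
    n_eff = 1.0 / np.sum(w**2)
    if n_eff < threshold:
        chosen = rng.choice(n, size=n, p=w)
        new_particles = new_particles[chosen]
        w = np.full(n, 1.0 / n)
    return new_particles, w
```

With a uniform prior, the particles would be initialised with positions drawn uniformly over the reference duration, and the position/tempo estimate at each frame would be the weighted mean `weights @ particles`.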
In our implementation the sampling function corresponds to the transition probability density function; in this case the algorithm is known as the condensation algorithm.

An optional resampling step is used to address the degeneracy problem, common to particle filtering approaches; this is discussed in detail in [1, 5] and later in this section.

The decoding of position and tempo is carried out by computing the expected value of the resulting random measure (which is efficiently computed as $E[x_k] = \sum_{i=1}^{N_s} x_k^i w_k^i$).

Algorithm 1: SIS Particle Filter - Update step
for $i = 1 \dots N_s$ do
    sample $x_k^i$ according to $q(x_k^i \mid x_{k-1}^i, z_k)$
    $\hat{w}_k^i \leftarrow w_{k-1}^i \, \dfrac{p(z_k \mid x_k^i)\, p(x_k^i \mid x_{k-1}^i)}{q(x_k^i \mid x_{k-1}^i, z_k)}$
$w_k^i \leftarrow \hat{w}_k^i \,/\, \sum_j \hat{w}_k^j \quad \forall i = 1 \dots N_s$
$N_{eff} \leftarrow \left( \sum_{i=1}^{N_s} (w_k^i)^2 \right)^{-1}$
if $N_{eff}$ < resampling threshold then
    resample $x_k^1 \dots x_k^{N_s}$ according to ddf $w_k^1 \dots w_k^{N_s}$
    $w_k^i \leftarrow N_s^{-1} \quad \forall i = 1 \dots N_s$

4.1.5 Initialization

Initialization plays a central role in the performance of the algorithm; in a probabilistic context this corresponds to an appropriate choice of the prior distribution $p(x_0)$.

In a real-time setup the player is expected to start the performance at a well-known point of the reference; this fact is exploited in the design of the algorithm by setting an appropriately shaped prior distribution, typically a low-variance one around the beginning.

In the proposed situation, however, the initial point is not known (finding it is indeed the aim of this work). To cope with this, the prior distribution $p(x_0)$ is set to be uniform over the whole duration $L$ of the reference performance; the algorithm is expected to “converge” to the correct position after a few iterations. Figure 3 shows the evolution of the probability distribution for the position of the input at different moments of the alignment.

[Figure 3: four panels plotting the distribution against ref. time [s]: (a) prior distribution (k = 0), (b) k = 1, (c) k = 10, (d) k = 50.]

Figure 3. Evolution of $p(s_k \mid z_1 \dots z_k)$.

4.1.6 Degeneracy Issues w.r.t. Realtime Alignment

A relevant parameter of Algorithm 1 is the resampling threshold. The variable $N_{eff}$, commonly known as effective sample size, is used to estimate the degree of degeneracy which affects the random measure; degeneracy is related to the variance of the weights $\{w_k^i\}_{i=1}^{N_s}$, and it is proven to be always increasing in the absence of resampling. In a degenerate situation most particles have close-to-zero weight, resulting in most of the computation being spent in updating particles which are subject to numerical approximation errors.

Resampling is introduced to obviate this issue. Intuitively, resampling replaces a random measure of the true distribution with an equivalent one (in the limit of $N_s \to \infty$) that is better suited for the inference algorithm. Since resampling introduces other problems (in particular, sample impoverishment, i.e., a small number of particles is selected multiple times), its usage should be limited, hence the need for a threshold on the effective sample size.

In the real-time score following case [6] the mass of the distribution is always concentrated around a small region of the domain, thus allowing the resampling threshold to be relatively low. In contrast, in a situation such as the one depicted in Figure 3, the sparsity of the distribution in the initial phases of the alignment imposes a much higher resampling threshold; otherwise many relevant hypotheses are soon lost in the resampling phase and cannot be recovered.

[Figure 4: reference time [s] plotted against take time [s].]

Figure 4. Identification of the interest region.

4.2 Identification of the Interest Region

This phase aims at identifying a region of the alignment obtained previously where it is certain that the alignment is indeed “correct”.
As depicted in Figure 2(b), a typical alignment can be subdivided into two regions, the first one being characterized by irregular oscillations (because not enough data has been observed yet to select the most probable hypothesis with enough confidence) and the second one resembling a straight line; we will refer to the former as the convergence region and to the latter as the interest region.

As can be inferred by observing the plot in Figure 2(b), the most important characteristic of the interest region is its slope. From a technical point of view, the slope should be as constant as possible for the alignment region to be significant. From a musical perspective it should be roughly unitary, implying that the performance tempos of the single take and the reference are approximately the same. In addition to that, the duration of the interest region should be long enough to discard noisy sections of the alignment.

The interest region is identified in the following manner: each of many initial candidate regions $w_1 \dots w_W$ is iteratively expanded as long as it meets the criteria set out above; the longest of the resulting intervals is elected as the interest region, unless none of them matches the requirements, in which case the alignment is identified as incorrect. The process described above is depicted in Figure 4 (dashed horizontal lines represent the regions progressively examined by the algorithm) and formalized in Algorithm 2.

Algorithm 2: Identification of interest region
$w_1, \dots, w_W \leftarrow$ regularly spaced intervals in $[0, L]$
candidates $\leftarrow \emptyset$
for $i = 1 \dots W$ do
    while $|w_i| < L$ do
        $w_i \leftarrow [\max(0, w_i^{start} - \Delta T),\ \min(L, w_i^{end} + \Delta T)]$
        $a_i \leftarrow$ slope of LS-fit line for points in $w_i$
        $e_i \leftarrow$ mean difference with LS-fit line in $w_i$
        if $a_i \in [1 - \Delta A, 1 + \Delta A] \wedge e_i < \Delta E$ then
            candidates $\leftarrow$ candidates $\cup \{i\}$
        else
            break
if |candidates| > 0 then
    interest region $\leftarrow \max_{i \in \text{candidates}} w_i$
else
    alignment is incorrect

4.3 Correction of the Convergence Region

In order to fix the convergence region of the alignment, we again exploit the sequential Monte Carlo inference methodology of Section 4.1, with some adaptations.
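The interest-region search of Section 4.2 (Algorithm 2) can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: the function name `find_interest_region`, the default parameter values and the use of the mean absolute residual as the "mean difference" are assumptions; `step`, `slope_tol` and `err_tol` play the roles of $\Delta T$, $\Delta A$ and $\Delta E$, and the default `min_len` of 15 s is the minimum interest-region length quoted in Section 5.2.1.

```python
import numpy as np


def find_interest_region(take_t, ref_t, n_windows=10, init_len=2.0, step=0.5,
                         slope_tol=0.1, err_tol=0.5, min_len=15.0):
    """Return (start, end) in take time of the longest near-linear region, or None.

    take_t, ref_t: 1-D arrays with the alignment path (take time -> reference time).
    """
    total = take_t[-1]
    best = None
    # Regularly spaced initial candidate windows over the take.
    for start in np.linspace(0.0, total - init_len, n_windows):
        lo, hi = start, start + init_len
        accepted = None
        while hi - lo < total:
            # Expand the window on both sides, clipped to the take duration.
            lo, hi = max(0.0, lo - step), min(total, hi + step)
            mask = (take_t >= lo) & (take_t <= hi)
            if mask.sum() < 2:
                break
            # Least-squares line fit and mean absolute deviation from it.
            slope, intercept = np.polyfit(take_t[mask], ref_t[mask], 1)
            err = np.mean(np.abs(ref_t[mask] - (slope * take_t[mask] + intercept)))
            if abs(slope - 1.0) <= slope_tol and err < err_tol:
                accepted = (lo, hi)
            else:
                break
        if accepted is None or accepted[1] - accepted[0] < min_len:
            continue
        if best is None or accepted[1] - accepted[0] > best[1] - best[0]:
            best = accepted
    return best  # None means the alignment is considered incorrect
```

A return value of `None` corresponds to the "alignment is incorrect" branch; in the evaluation setting, such a take would be discarded.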
The general idea is to run the algorithm “backwards”, i.e., to align the time-reversed audio streams, starting from a point in the previous alignment that is known to be correct.

The starting point $B$ is chosen inside the region of interest. The prior distribution for the backward alignment is equal to that of the forward alignment at $B$, but with the value of the velocity for each particle inverted: $p(x^{(b)}_B) = \mathrm{diag}(1, -1)\, p(x_B \mid z_0 \dots z_B)$. The audio stream of the take is then reversed and processed by Algorithm 1, as in Figure 2(c). Experimentation shows that a narrow uniform or Gaussian prior centered at $(B, -1)^T$ is for practical purposes equivalent to the form of $p(x^{(b)}_B)$ mentioned above.

4.4 Smoothing Inference

Sequential Monte Carlo inference algorithms are typically used for online estimation; this implies that at each instant only the information about the past is exploited, instead of the whole observation sequence. In the context of an offline application, however, these real-time constraints can be dropped. Both the Forward/Backward and Viterbi inference algorithms can be derived, respectively estimating the probability distribution at each instant given the full observation sequence and the Maximum A Posteriori alignment. The running time of both algorithms is quadratic in the number of particles; however, this issue can be mitigated by an appropriate choice of the prior distribution $p(x^{(fb)}_0)$, such as a resampling of $\mathrm{diag}(1, -1)\, p(x^{(b)}_0)$ with a smaller number of samples.

5. EVALUATION

An ideal evaluation of the efficacy of the proposed methodology in the context discussed in Section 1 should aim at measuring the amount of work saved in production with respect to the current workflow.
A discussion of our current work in this area is presented in Section 6.

Below we evaluate the efficacy of the proposed approach regarding the initial phase of laying out the takes as in Figure 1. The accuracy of the alignment in terms of latency and average error was evaluated in our previous work [6]; a similar analysis could not be performed in this case, due to the lack of a (manually annotated) reference linking the timings of each musical event for all takes to the reference recording. Moreover, in this situation the aim is to correctly position the highest number of takes against the reference, rather than to align them with the highest possible precision.

5.1 Dataset description

We collected the recordings produced in two real-life sessions by different groups of sound engineers, consisting of approximately 3 hours of audio data. The first one is a recording session of the second movement of J. Brahms' sextet op. 18; the second one was produced shortly after the premiere of P. Manoury's “Tensio”, for string quartet and live electronics, in December 2010.
Table 1 summarizes their characteristics.

                          duration [s]
dataset   n. of rec.   ref.     takes (avg, std)   total
Brahms    20 + ref.    588.8    112.8, 92.0        2844.0
Manoury   49 + ref.    2339.4   113.5, 94.0        7900.4

Table 1. Datasets used for evaluation.

5.2 Experimental Results

We performed the alignment of each take in the two databases according to the procedure introduced in Section 4. We select the center point of the interest region identified in the second phase as the alignment reference for the whole take (we do not perform the optional last two steps).

In all the tests we executed, we set the number of particles $N_s$ to be proportional to the duration of the reference (60 particles per second). Our implementation aligns a minute of audio in 2.29 s for $N_s = 10^5$ on a laptop computer with a 2.4 GHz Intel i5 processor (a single core is used).

5.2.1 Brahms Dataset

For this dataset, a manual placement of all the takes with respect to the reference recording was performed using a musical score, in order to evaluate the correctness of the automatic procedure. Aural inspection of the data showed that none of the recordings but one presented undesired noises. All the takes but one were correctly aligned.
In the unsuccessful case, the length of the recording itself was one second shorter than the minimum length for an interest region (15 s); using the last alignment point as a reference, the placement of this take also turns out to be correct.

5.2.2 Manoury Dataset

The dataset contains a complete run-through and 49 separate takes. The particularity of this dataset is the presence, in many of the individual takes, of material that is undesired in the final mix (such as speech, practice sessions, volume and calibration tests). Out of 49 takes, 14 contain exclusively noise and 21 contain it partially. In the former case we consider the alignment correct if the file is discarded; in the latter we aim at correctly aligning the interesting portion of the take. This is in sharp contrast with the “cleanness” of the Brahms set and presents difficulties that were not foreseen when formulating the alignment procedure.

In contrast to the Brahms dataset, the evaluation of the alignment precision was done a posteriori: instead of performing a manual alignment in advance, the results of the automatic alignment were checked.
The reason for this lies in the length (approximately 40 minutes) and complexity of the music: even with the score at our disposal, it was immediately evident that a manual alignment would have taken a very long time. It is precisely this difficulty that sound engineers had to face.

Our first experiment aligning this dataset yielded rather poor results on the 21 files containing noise regions of significant length (in some cases more than one minute); since in almost all cases the noisy portion was at the beginning, we decided to directly align the reversed audio streams in the first phase. With this simple adaptation the results are as follows: of the 35 files containing interesting regions, 26 were correctly aligned; all of the 14 takes that contained exclusively noise were correctly discarded by the algorithm.

The absence of false positives (no noise-only takes were mistakenly aligned) and the correct positioning of all the aligned files suggest that the simple algorithm for identification of the interest region is robust enough to be applied to rather short audio segments. This opens the possibility of repeating the alignment algorithm multiple times on different subregions of the audio in order to avoid noisy sections.

6. WORKFLOW ADAPTATION

Over the years, the audio industry has established common standards for mixing that are adopted in most professional studio recording productions worldwide. Integrating new technologies within existing workflows therefore requires special attention to existing practices within the community. To this end, we conducted several interviews with sound engineers.

From an R&D standpoint, an ideal integration would be a direct implementation of this technology into the graphical user interface of common DAW software, to maximize usability.
Such integration would allow novel possibilities, such as linking two tracks by means of their alignment and defining the placement of transition points between them for crossfading, avoiding any destructive editing of the discarded audio regions. It requires, however, direct contact with software houses, which are mostly closed to public domain development.

An alternative solution is represented by standalone alignment tools, whose outputs should be directly importable into a commercial DAW. Virtually all the major DAWs and video post-production systems support the Open Media Framework (OMF) and the Advanced Authoring Format (AAF), respectively owned by Avid Technology, Inc. and by the Advanced Media Workflow Association (AMWA)¹. These are employed as interchange formats to allow interoperability between different software. The alignment software that we are currently developing could automatically construct an initial session using an interchange format, which audio engineers can then open in their DAW to start the mixing process.

7. CONCLUSIONS AND FUTURE WORK

In this paper, we attempted to address two issues: introducing novel tools that generalize audio matching algorithms to partial alignment with relevant region detection, and integrating them within a realistic studio mixing procedure to accelerate the preparation of mixing sessions for audio engineers. The first task involves adapting audio alignment techniques to situations where there is no specific prior knowledge of the starting point of the alignment. Such considerations would allow audio engineers to automatically obtain a global view of many different individual takes with regard to a reference run-through recording in a typical recording session, as well as providing access to relevant parts within each take; this is a time-consuming task if done manually. We further discussed how this procedure can realistically be integrated into common mixing workflows.

Applications of the proposed technology are not limited to the preparation of the initial mixing session: mid-level information obtained during the alignment task can in fact be further integrated in a studio mixing workflow.
For example, our audio alignment provides useful information about the tempo of a performance with regard to the reference, which can be employed as an important factor by the mixing engineer. Such integration requires further collaboration with audio engineers to determine an optimal exploitation of this information in the context of existing practices.

8. REFERENCES

[1] M. S. Arulampalam, S. Maskell, and N. Gordon. A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking. IEEE Transactions on Signal Processing, 50:174–188, 2002.
[2] B. Bartlett and J. Bartlett. Practical Recording Techniques: The step-by-step approach to professional audio recording. Focal Press, 2008.
[3] R. Dannenberg and N. Hu. Polyphonic Audio Matching for Score Following and Intelligent Audio Editors. In Proc. of the International Computer Music Conference (ICMC), 2003.
[4] S. Dixon and G. Widmer. Match: a Music Alignment Tool Chest. In Proc. of the 6th International Conference on Music Information Retrieval (ISMIR), 2005.
[5] R. Douc, O. Cappe, and E. Moulines. Comparison of Resampling Schemes for Particle Filtering. In Proc. of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA), 2005.
[6] N. Montecchio and A. Cont. A Unified Approach to Real Time Audio-to-Score and Audio-to-Audio Alignment Using Sequential Montecarlo Inference Techniques. In Proc. of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2011.
[7] M. Müller and D. Appelt. Path-Constrained Partial Music Synchronization. In Proc. of the 34th International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2008.

¹ http://www.avid.com, http://www.amwa.tv
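
As a worked illustration of the experimental setup in Section 5.2, the following sketch recomputes the particle budget implied by the rule of 60 particles per second of reference audio, together with the success rates on the Manoury dataset. The durations and counts are taken from Table 1 and Section 5.2.2; the function names, the rounding to the nearest integer, and the "overall" rate (correctly aligned plus correctly discarded takes over all 49) are assumptions introduced here for illustration and are not part of the original implementation.

```python
# Figures taken from Section 5.2 and Table 1 of the text.
PARTICLES_PER_SECOND = 60  # particles per second of reference audio

# Duration of each reference recording, in seconds (Table 1).
REFERENCE_DURATION_S = {"Brahms": 588.8, "Manoury": 2339.4}


def particle_count(reference_duration_s: float,
                   particles_per_second: int = PARTICLES_PER_SECOND) -> int:
    """Number of particles N_s, proportional to the reference duration.

    Rounding to the nearest integer is an assumption; the text only
    states the proportionality constant.
    """
    return round(reference_duration_s * particles_per_second)


def success_rate(correct: int, total: int) -> float:
    """Fraction of takes that were handled correctly."""
    return correct / total


if __name__ == "__main__":
    for name, duration in REFERENCE_DURATION_S.items():
        print(f"{name}: N_s = {particle_count(duration)}")

    # Manoury dataset: 26 of the 35 takes with interesting regions were
    # aligned correctly, and all 14 noise-only takes were discarded.
    aligned = success_rate(26, 35)
    discarded = success_rate(14, 14)
    overall = success_rate(26 + 14, 49)  # hypothetical combined measure
    print(f"aligned: {aligned:.1%}, discarded: {discarded:.1%}, "
          f"overall: {overall:.1%}")
```

Under these assumptions the Brahms reference would use about 35 thousand particles and the Manoury reference about 140 thousand, so the timing quoted for N_s = 10^5 falls between the two budgets.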
