11.07.2015 Views

the perception of muslim consumers towards corporate social ...

the perception of muslim consumers towards corporate social ...

the perception of muslim consumers towards corporate social ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

While <strong>the</strong> probability parameters (a ij), π 1 and π 2 are initializedconventionally using uniform distributions [14], <strong>the</strong> state means areinitialized as µ 1 = max(E n) and µ 2 = min(E n). For <strong>the</strong> varianceparameters, <strong>the</strong> initialization σ 2 1 = σ 2 2 = R · var(E n) with R = 0.1has been found to yield satisfactory results. From <strong>the</strong>se initial values,<strong>the</strong> HMM parameters are estimated in a typical fashion usingan implementation <strong>of</strong> <strong>the</strong> EM re-estimation principle [14].During each iteration, <strong>the</strong> EM algorithm estimates <strong>the</strong> probabilitydistribution (γ n(1), γ n(2)) for being in state 1 or state 2 attime instant n. To convert <strong>the</strong>se values to a segmentation in terms<strong>of</strong> <strong>the</strong> two states, we simply take X n = 1 if γ n(1) ≥ γ n(2) andX n = 0 o<strong>the</strong>rwise, for all n. The EM re-estimation is deemed tohave converged and <strong>the</strong> iteration is stopped once this segmentationX n does not change between two successive EM iterations. Sincestate 1 was initialized with <strong>the</strong> maximal value and state 2 with <strong>the</strong>minimal value, we can safely assume that state 1 will be <strong>the</strong> “highstate” for <strong>the</strong> frame energy time series. The additional benefit <strong>of</strong> usingan HMM for this purpose instead <strong>of</strong> simple unsupervised thresholddetermination is that <strong>the</strong> HMM smoo<strong>the</strong>s <strong>the</strong> segmentation intime. All successive processing for <strong>the</strong> block is done using only<strong>the</strong> high-state frames for which X n = 1. Modeling and recognizingonly <strong>the</strong>se high-energy frames is justifiable because <strong>the</strong>se framespresumably have <strong>the</strong> best local SNR. Fig. 3 shows <strong>the</strong> evolution <strong>of</strong><strong>the</strong> state segmentation X n with EM iterations for a noisy (SNR 0 dBfactory noise) two-second speech segment.E nX niteration 1X niteration 2X niteration 3Fig. 3. Evolution <strong>of</strong> HMM segmentation with EM iterations for <strong>the</strong>log energy sequence <strong>of</strong> a noisy speech segment.2.4. GMM classificationThe classifier uses a specialized GMM to model each <strong>of</strong> <strong>the</strong> threeprimary classes: <strong>the</strong> noise environment, normal speech (both maleand female) and shouting (both male and female). These GMMsλ environment, λ speech and λ shout have 8 mixture components witha diagonal covariance structure [15]. For initializing <strong>the</strong> componentmean vectors <strong>of</strong> <strong>the</strong> GMMs, <strong>the</strong> simple heuristic approach proposedin [16] is used. The variances <strong>of</strong> each variable in each componentare initialized by 0.1 times <strong>the</strong> variable’s global variance over <strong>the</strong>training data. The mixture weights are initialized with uniform distributions.Each GMM is trained by running four EM iterations [15].For <strong>the</strong> number <strong>of</strong> mixture components in each model, both 8 and16 were evaluated and <strong>the</strong> former was selected because it providedslightly better performance.When <strong>the</strong> GMMs are used in detection, <strong>the</strong> audio signal is processedin blocks <strong>of</strong> two seconds, with a block shift <strong>of</strong> one second.After <strong>the</strong> high-energy frames have been selected using <strong>the</strong> HMMframe dropping detailed in Section 2.3, <strong>the</strong> averaged log likelihoods<strong>of</strong> <strong>the</strong>m having been produced by each <strong>of</strong> <strong>the</strong> three GMMs are computedand denoted as L shout , L speech and L environment. The shoutingdetection is considered as a binary classification problem andtreated according to <strong>the</strong> Bayes rule. The logarithmic likelihood ratiodecision statistic is defined as L = L shout − L nonshout . The shoutscore is <strong>the</strong> shout GMM likelihood and <strong>the</strong> non-shout score is obtainedas <strong>the</strong> maximum <strong>of</strong> <strong>the</strong> speech and noise GMM likelihoodsas L nonshout = max(L speech , L environment). For each detectionblock, <strong>the</strong> statistic L can be recorded and used in evaluating <strong>the</strong> systemperformance with variable detection threshold.3. EXPERIMENTAL EVALUATION3.1. Test material and setupTo represent different types <strong>of</strong> acoustic environments, two types <strong>of</strong>noise from <strong>the</strong> NOISEX-92 database were used. The factory1 noisecontains machine noise with frequent transient impulsive sounds.The babble noise contains many people talking simultaneously ina cafeteria-like environment.The speech and shouting was recorded with high quality equipmentin an anechoic chamber. The data consists <strong>of</strong> 11 male and 11female speakers, each speaking 24 Finnish sentences, both in a normalfashion and by shouting. The shouting was controlled both bylistening and by monitoring <strong>the</strong> sound pressure level. A mere raisedvoice was not accepted as shouting. Twelve <strong>of</strong> <strong>the</strong> sentences aresentences in <strong>the</strong> imperative mood, consisting <strong>of</strong> one to four Finnishwords, with a message that could plausibly be uttered in a potentiallythreatening situation, such as “anna se kamera tänne” (“give me <strong>the</strong>camera”), “älkää liikkuko” (“don’t move”) and “lopettakaa” (“stopit”). The o<strong>the</strong>r 12 sentences consist <strong>of</strong> three Finnish words, are in<strong>the</strong> indicative mood and have a neutral, abstract information content.The experiments were carried out as leave-one-out cross validation.Each speaker in turn was selected as <strong>the</strong> test speaker, and datafrom <strong>the</strong> o<strong>the</strong>r 21 speakers was used to train <strong>the</strong> speech and shoutmodels. The test material for each speaker consisted <strong>of</strong> that speaker’sspeech and shout material, both corrupted by noise with a given segmentalor frame-averaged SNR, as well as a segment <strong>of</strong> noise equalin length to <strong>the</strong> speaker’s combined speech and shout material. Thenoise model was trained using two minutes <strong>of</strong> <strong>the</strong> noise material,while <strong>the</strong> remaining portion <strong>of</strong> <strong>the</strong> noise recording was used for testing.The primary measure to assess <strong>the</strong> performance is <strong>the</strong> equalerror rate (EER), a common metric to assess <strong>the</strong> quality <strong>of</strong> a twoclassdetector. The EER corresponds to <strong>the</strong> decision threshold forwhich <strong>the</strong> miss and false alarm rates are equal.3.2. ResultsTables 1 and 2 show <strong>the</strong> shout detection results for <strong>the</strong> NOISEX-92factory1 and babble noises, respectively. In <strong>the</strong> case <strong>of</strong> 12 MFCCs,results are also shown by using only <strong>the</strong> LP or WLP spectrum envelopewithout <strong>the</strong> excitation spectrum. The usefulness <strong>of</strong> including<strong>the</strong> excitation spectrum is easily observed. Several different types<strong>of</strong> features give good performance at low to moderate noise levels.However, at SNR levels -10 dB and -20 dB, at which <strong>the</strong> systemperformance is degrading rapidly, <strong>the</strong> most resistant features are 30MFCCs obtained using LP or WLP envelope combined with <strong>the</strong> excitationspectrum.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!