Tuesday, June 12, 2012

Signal Denoising using Wavelet-based Methods

Signal Denoising using Wavelet-based Methods


The basic idea which lies behind wavelets is the representation of an arbitrary function as a combination of simpler functions, generated as scaled and dilated versions of a particular oscillatory “mother” function.
Late Jean Morlet, a geophysical engineer, introduced the term “wavelet” while attempting to analyze signals related to seismic data. The mathematical formulation of the wavelet transform and its inverse was rigorously established by Grossman and Morlet citep([5]). Since then, ideas from diverse scientific fields have resulted in developing wavelets into a powerful analysis tool.
The term wavelet is often used to denote a signal located in time with a concentrated amount of energy citep([1]). This “mother” wavelet is used to generate a set of “daughter”functions through the operations of scaling and dilation applied to the mother wavelet. This set forms an orthogonal basis that allows, using inner products, to decompose any given signal much like in the case of Fourier analysis. Wavelets, however, are superior to Fourier analysis for time information is not lost when moving to the frequency domain. This property makes them suitable for applications from diverse fields where the frequency content of a signal as well as the energy's temporal location is valuable.
The wavelets application of interest for this work is their use for data analysis, specifically for signals denoising. Denoising stands for the process of removing noise, i.e unwanted information, present in an unknown signal. The use of wavelets for noise removal was first introduced by Donoho and Johnstone citep([3]). The shrinkage methods for noise removal, first introduced by Donoho citep([4]), have led to a variety of approaches to signal denoising.
This work presents a revision of the wavelets basic theory. Definitions of mother wavelets, wavelets decomposition and its advantage over Fourier analysis are discussed in Section 1. Section 2 presents a summary of basic methods developed for noise removal. Their main features and limitations are presented, and a comparison study taken into account computational efficiency is performed. Section 3 introduces an example application for denoising methods. A given function contaminated with Gaussian additive noise is used as testbed for the described methods. Conclusions about the performance of the denoising procedures and the utility of using wavelet decomposition for this type of problem are presented.

Wavelet Decomposition Basics

Historical note

The term wavelet was first introduced by Jean Morlet while working on the analysis of signals for seismic analysis on oil-related projects. Before Morlet's work remarkable contributions were developed by Haar citep([6]) and Zweig in 1975. After the work of Morlet and Grossmann on the definition of the continous wavelet transform (CWT), several developings have followed. The work of researchers as Stromberg, Duabechies, Mallat and Newland, among others, has pushed forward the theoretical frontiers of wavelets-based orthogonal decomposition and also augmented the scope of possible application fields.

Basic concepts

A review of basic concepts in the wavelets framework is presented in next lines. This review is based upon burrus1988.
The term wavelet is mostly used for denoting a particular wave whose energy is concentrated in a specific temporal location. A wavelet is therefore a known signal with some peculiar characteristics that allow it to be employed for studying the properties of other signals simultaneously in the frequency and time domains. An typical plot of a wavelet is shown in Figure 1
Figure 1: Daubechies wavelet ψD20 citep([1])
Figure 1 (wavelet1.png)
Based on a particular wavelet, it is possible to define a wavelet expansion. A wavelet expansion is the representation of a signal in terms of an orthogonal collection of real-valued functions generated by applying suitable transformations to the original given wavelet. These functions are called “daughter” wavelets while the original wavelet is dubbed “mother” wavelet, acknowledging its function as source of the orthogonal collection. If f(t) is a given signal to be decomposed, then it is possible to represent the signal as:
In equation Equation 1ψi(t) are the orthogonal basis functions, and the coefficients ai can be found through the inner product of f(t) and the functions ψi(t) (EquationEquation 2).
The previous equations represent the general formulation for orthogonal decomposition, and so they are the same equations discussed in the particular case of Fourier analysis. In the case of wavelets expansion, and consequently with their definition, the wavelets basis functions have two integer indexes, and Equation Equation 3 must be rewritten as
Equation Equation 3 is the wavelet expansion of f(t) while the collection of coefficients aj,k is the discrete wavelet transform (DWT) of f(t).

Characteristic of wavelet expansions

The properties and hence advantages of a familiy of wavelets depend upon the mother wavelet features. However, a common set of features are shared by the most useful of them citep([1]).
  1. A wavelet expansion is formed by a two-dimensional expansion of a signal. It should be noticed that the dimension of the signal itself is not determinant in the wavelet representation.
  2. A wavelet expansion provides a dual time-frequency localization of the input signal. This implies that most of the energy of the signal will be captured by a few coefficients.
  3. The computational complexity of the discrete wavelet transform is at most O(nlog(n)) i.e. as bad as for the discrete Fourier transform (DFT) when calculated using the Fast Fourier Transform (FFT). For some particular types of wavelets, the complexity can be as low as O(n).
  4. The basis functions in a wavelet expansion are generated from the mother wavelet by scaling and translation operations. The indexing in two dimensions is achieved using this expression:
  5. Most wavelets basis functions satisfy multiresolution conditions. This property guarantees that if a set of signals can be represented by basis functions generated from a translation ψ(tk) of the mother wavelet, then a larger set of functions, including the original, can be represented by a new set of basis functions ψ(2tk). This feature is used in the algorithm of the fast wavelet transform, FWT the equivalent of the FFT algorithm for wavelets decomposition.
  6. The lower resolution coefficients can be calculated from the higher resolution coefficients using a filter bank algorithm. This property contributes to the efficiency of the calculation of the DWT.

Denoising with Wavelets

If y(t) is an empirically recorded signal with and underlying description, g(t), a model for the noise addition process transforming g(t) into y(t) is described by Equation Equation 5.
where ϵi are independent normal random variables N(0,1) and σ represents the intensity of the noise in y(t). Using this model, it follows that the objective of noise removal is, given a finite set of yi values, reconstruct the original signal g without assuming a particular structure for the signal.
The usual approach to noise removal models noise as a high frequency signal added to an original signal. Fourier transform could be used to track this high frequency, ultimately removing it by adequate filtering. This noise removal strategy is conceptually clear and efficient since depends only on calculating the DFT of a given signal. However, there are some issues that must be taken into account. The most prominent of such issues ocurrs when the original signal has important information associated to the same frequency as the noise. When a frequency domain representation of the signal is obtained, filtering out this frequency will induce noticeable loss of information of the target signal.
In cases as the one described, the wavelets approach is more appropiated due to the fact that the signal will be studied using a “dual” frequency-time representation, which allows separating noise frequencies from valuable signal frequencies. Under this approach, noise will be represented as a consistent high frequency signal in the entire time scope and so its identification will be easier than using Fourier analysis.
Once the noise representation is identified, the removal process starts. It has been proven that a suitable strategy for noise removal consists in making the coefficients associated to the noise frequency equal to zero. This statement represents a global perspective for noise removal, different methods for denoising differ in the way these coefficients are tracked and taken out from the representation. The conceptual details of several of these methods are presented in the next sections. The main reference for the methods discussed here is antoniadis2001
Before attempting to describe the methods is convenient to discuss an alternative definition for wavelet representation used for noise removal. First, the description assumes that the representation is achieved using periodised wavelets bases on [0,1]. Also, the basis functions are generated by dilation and translation of a compactely supported scaling function φ, also called father wavelet and a familiar mother wavelet function, ψψ must be associated with an r-regular multiresolution analysis of L2(R). An advantage of this approach is that generated wavelets families allow integration of different kinds of smoothness and vanishing moments. This features lead to the fact that many signals in practice can be represented sparsely (with few wavelets coefficients) and uniquely under wavelets decomposition. The decomposition expresion using a father and a mother wavelet is depicted in Equation Equation 6.
where j0 is a primary resolution level, and αj0k and βjk are calculated as the inner products shown in Equations Equation 7 and Equation 8
When the discrete wavelet transform is used, the coefficients cj0kdiscrete scaling coefficients and djkdiscrete wavelet coefficients are used instead of the continous parameters αj0kand βjk. The discrete parameters can be approximately calculated by applying a \n factor to the continous coefficients.
Finally, when the DWT is applied to Equation Equation 5, these expressions are obtained (Equations Equation 9 and Equation 10):

Classical approach to wavelet thresholding

The original and simpler way to remove noise from a contaminated signal consists in modifying the wavelets coefficients in a smart way such that the “small” coefficients associated to the noise are basically neglected. The updated coefficients can thus be used to reconstruct the original underlying function free from the effects of noise. It is implicit in the strategy that only a few “large” wavelets coefficients djk are associated with the original signal, and that their identification and elimination of any other coefficients will allow a perfect reconstruction of the underlying signal g. Several methods use this idea and implement it in different ways. When attempting to descrease the influence of noise wavelets coefficients it is possible to do it in particular ways, also the need of information of the underlying signal leads to different statistical treatments of the available information.
In the linear penalization method every wavelet coefficient is affected by a linear shrinkage particular associated to the resolution level of the coefficient. A mathematical expression for this type of approach using linear shrinkage is shown in Equation Equation 11.
In Equation Equation 11, parameter s is the known smoothness index of the underlying signal g, while parameter λ is a smooting factor whose determination is critical for this type of analysis.
It must be said that linear thresholding is adequate only for spatially homogenous signal with important levels of regularity. When homegeneity and regularity conditions are not met nonlinear wavelet thresholding or shrinkage methods are usually more suitable.
donoho1995 and donoho1995b proposed a nonlinear strategy for thresholding. Under their approach, the thresholding can be done by implementing either a hard or a soft thresholding rule. Their mathematical expressions are shown in Equation Equation 12 and Equation Equation 13 respectively.
In both methods, the role of the parameter λ as a threshold value is critical as the estimator leading to destruction, reduction, or increase in the value of a wavelet coefficient.
Several authors have discussed the properties and limitations of these two strategies; hard thresholding, due to its induced discontinuity, can be unstable and sensitive even to small changes in the data. On the other hand, soft thresholding can create unnecessary bias when the true coefficients are large. Although more sophisticated methods has been introduced to account for the drawbacks of the described nonlinear strategies, the discussion in this report is limited to the hard and soft approaches.

Term-by-Term Thresholding

One apparent problem in applying wavelet thresholding methods is the way of selecting an appropriate value for the threshold, λ. There are indeed several approaches for specifying the value of the parameter in question. In a general sense, these strategies can be classified in two groups: global thresholds and level-dependent thresholds. Global threshold implies the selection of one λ value, applied to all the wavelet coefficients. Level-dependent thresholds implies that a (possibly) different threshold value lambdaj is applied for each resolution level. All the alternatives require an estimate of the noise level σ. The standard deviation of the data values is clearly not a good estimator, unless the underlying response function g is reasonably flat. donoho1995 considered estimating σ in the wavelet domain by using the expression in Equation Equation 14.

The minimax threshold

donoho1995 obtained an optimal threshold value λM by minimizing the risk involved in estimating a function. The porposed minimax threshold depends of the available data and also takes into account the noise level contaminating the signal (Equation Equation 15).
Where, λn* is equal to the value of λ satisfying Equation Equation 16
In Equation Equation 16Rλ(d) is calculated following Equation Equation 17.
while Roracle(d) is an operators called oracle used to account for the risk associated to the modification of the value of a given wavelet coeficient. Two of these oracles were introduced by donoho1995: diagonal linear projection (DLP), and diagonal linear shrinker (DLS). The Equations Equation 18 and Equation 19 show the expressions for the two oracles.
antoniadis2001 provided values of the minimax threshold for both the hard and soft nonlinear thresholding rules. For the soft rule, 1.669 and 2.226 for n equal to 128 and 1024; for the hard rule, 2.913 and 3.497 again for n equal to 128 and 1024.

The universal threshold

donoho1995 proposed this threshold as an alternative to the minimax thresholds, applied to all the wavelet coefficients. The universal threshold is defined in Equation Equation 20.
This threshold is easy to remember and its implementation in software is simpler and the optimization problem implicit in the minimax method are avoided. Also, the universal threshold ensures, with high probability, that every sample in the wavelet transform in which the underlying function is exactly zero will be estimated as zero, although the convergance rate (depending in the size of the sample) is slow.

The translation invariant method

It has been noted that wavelet thresholding with either minimax thresholds or the universal threshold presents some inconvenient features. In particular, in the vicinity of discontinuities, these wavelet thresholds can exhibit pseudo-Gibbs phenomena. While these phenomena are less pronounced than in the case of Fourier analysis and also are present in a local scale, this situation represents a challenge for the thresholding methods.
coifman1995 proposed the use of the translation invariant wavelet thresholding scheme. The idea is to correct mis-alignments between features in the studied signal and features in the basis used for the decomposition. When the signal contains an important number of discontinuities, the method applies a range of shifts to it, and average the results obtained after such transformations.
If a empirical contaminated signal y[i],(i=1,...,n) is provided, the tranlation invariant wavelet thresholding estimator is calculated as (Equation Equation 21):
where δλ is either the hard of soft thresholding rule, W is the size n orthogonal matrix associated to the DWT, and Sk is the shift kernel defined as:
In Equation Equation 22I is the identity matrix and O stands for a zero matrix with dimensions as indicated in the expression.

The SureShrink Threshold

donoho1995 proposed a procedure to select a threshold value λj for every resolution level j. The method uses Stein’s unbiased risk criterion citep([7]) to get an unbiased estimate of the l2-risk.
In mathematical terms, given a set X1,...,Xs of variables distributed as N(μi,1) with i=1,...,s, the problem consists in estimate the vector of means with minimum l2risk. It turns out an estimator of μ, can be describes as μ(X)=X+g(X), with g a function from Rs to Rs is weakly differentiable. With this information, the risk of the estimation can be described as (Equation Equation 23):
If an operator SURE is defined as (Equation Equation 25):
where the operator #{A{ returns the cardinality of the set A, it is found that SURE is an unbiased estimate of the l2risk, i.e,
Now, the threshold λ is found by minimizing SURE over the set X of give data. Extending this principle to the entire set of resolution levels, an expression for λjs is found (EquationEquation 27):
where λU is the universal threshold, and σ is the estimator of the noise level (Equation Equation 14).

Classical Methods: Block Thresholding

Thresholding approaches resorting to term-by-term modification on the wavelets coefficients attempt to balance variance and bias contribution to the mean squared error in the estimation of the underlying signal g. However, it has been proven that such balance is not optimal. Term-by-term thresholding end sup removing to many terms leading to estimation prone to bias and with a slower convergance rate due to the number of operations involved.
A useful resource to improve the quality of the aforementioned balanced is by using information of the set of data associated to a particular wavelet coefficient. In order to do so, a block strategy for thresholded is proposed. The main idea consists in isolating a block of wavelet coefficients and based upon the information collected about the entire set make a decision about decreasing or even entirely discard the group. This procedure will allow faster manipulation of the information and accelerated convergence rates.

Overlapping Block Thresholding Estimator

cai2001 considered an overlapping block thresholding estimator by modifying the nonoverlapping block thresholding estimator citep([2]). The effect is that the treatment of empirical wavelet coefficients in the middle of each block depends on the data in the whole block. At each resolution level, this method packs wavelet coefficients djk into nonoverlapping blocks (jb) of length L0. Following this, the blocks are extended in each direction an amount L1=max(1;[
 ]), generating overlapping blocks (jB) of augmented length L=L0+2L1.
If S(jB)2 is the L2energy of the empirical signal in the augmented block jB, the wavelet coefficients in the blocks jb will be estimated simultaneously using the expression in Equation Equation 28
Once the estimated wavelet coefficients djk(jb) have been calculated, an estimation of the underlying signal g can be obtained through using the new wavelet coefficients and the unmodified scaling coefficients cj0k in the IDWT. The results from this method (NeigBlock) presented in this document used a value of L0=[log(
 )] and a value of λ=4.50524 as suggested by cai2001.

Bayesian Approach to wavelet shrinkage and thresholding

From Equations Equation 9 and Equation 10 it can be established that the empirical scaling and wavelet coefficients, conditional on their respective underlying coefficients, are independently distributed, i.e:
The Bayesian approach imposes an apriori model for the wavelets coefficients designed to capture the sparseness of the wavelet expansion common to most applications. An usual prior model for each wavelet coefficient djk is a mixture of two distributions, one of them associated to negligable coefficients and the other to significant coefficients. Two types of mixtures have been widely used. One of them employs two normal distributions while the other uses one normal distribution and one point mass at zero.
After mathematical manipulation, it can be shown that an estimator for the underlying signal can be written as (Equation Equation 31):
i.e. the scaling coefficients are estimated by the empirical scaling coefficients while the wavelet coefficients are estimated by a Bayesian rule (BR), taking into account the obtained empirical wavelet coefficient and the noise level.

Shrinkage estimates based on deterministic/stochastic decompositions

huang2000 proposed a method that takes into account the value of the prior mean for each wavelet coefficient, by introducing a estimator for the parameter into the general wavelet shrinkage model. These authors assumed thatthe undelying signal is composed of a piecewise deterministic portion with an added zero mean stochastic part.
If cj0 is the vector of empirical scaling coefficients, dj the vector of empirical wavelet coefficients, cj0 the vector of underlying scaling coefficients, and dj the vector of underlying wavelet coefficients, then the Bayesian model (Equation Equation 32):
with ω=(cj0,dj0,...,dJ1')' and the underlying signal β=(cj0',dj0',...,dJ1')' is assumed to follow an apriori distribution (Equation Equation 33)
where μ is the deterministic mean structure and Σ(θ) accounts for the uncertainty and value correlation in the underlying signal. Notice that if η following a distributionN(0,Σ(θ)) is defined as the stochastic component representing small variation (high frequency) in the signal, then μ can be interpretated as the stochastic component accounting for the large-scale variation in β. So, it is possible to rewrite β as (Equation Equation 34),
Using this model, a shrinkage rule can be established by calculating the mean of β conditional on σ2 which is expressed as (Equation Equation 35),

Numerical Simulations

Description of the Scheme

In order to assess the efficiency and accuracy of the proposed methods, a number of simulations have been conducted. To this aim, data have been generated according to the following scheme
where the data {xi{ are considered equally spaced in the interval [0,1]. The signal-to-noise ratio has been taken equal to 3. In these simulations the Symmlet 8 wavelet basis has been used. Given the random nature of {ϵi{, 100 realizations of the function {yi{ have been produced. This has been done in order to apply the comparison criteria to the ensemble average of the realizations. Since the primary goal of the simulations is the comparison of the different denoising methods, the following criteria are introduced:
  1. Root Mean Squared Error: The mean squared error defined as (Equation Equation 37)
    is computed for each realization and averaged over the 100 samples. Then, its square root is taken.
  2. Maximum Deviation: The average over the 100 samples of max|f,(xi),,fdn,(xi)|
Computational efficiency has not been chosen as one of the criteria, since it is greatly depended on the individual programming skills of the individual. Therefore, in order to avoid a non-uniform programming approach which could possibly result in misleading conclusions, time efficiency has not been considered.
The test functions f(x) and the sample sizes N have been chosen as the factors of the comparison studies. To this aim, two samples, one of moderate moderate size (N=128) and another of larger size (N=1024) have been considered.
As far as the test functions are concerned, two smooth signals (Figures Figure 2 and Figure 3) and two discontinuous ones (Figures Figure 4 and Figure 5) were taken into account. In Figure 2, the function consists of the sum of two sinusoids, whereas in Figure 3, a time shifted sine is illustrated. Since the signals are smooth, linear methods are expected to be comparable to the nonlinear ones. On the other hand, nonlinear wavelet estimators are expected to perform better for the functions in (Figure 4Figure 5). These highly discontinuous signals have been used as examples in donoho1993
Figure 2: Original function with added Gaussian White noise (Wave function)
Figure 2 (waveG.png)
Figure 3: Original function with added Gaussian White noise (Time shifted sine function)
Figure 3 (TSsineG.png)
Figure 4: Original function with added Gaussian White noise (Blocks function)
Figure 4 (BlocksG.png)
Figure 5: Original function with added Gaussian White noise (Bumps function)
Figure 5 (BumpsG.png)


The following plots, (Figures Figure 6 - Figure 13), illustrate the denoising performance for the 10 methods used. Each integer corresponds to a particular method as follows
  1. VisuShrink-Hard: Universal threshold with hard thresholding rule
  2. VisuShrink-Soft: Universal threshold with soft thresholding rule
  3. SureShrink: SureShrink threshold
  4. Translation-Invariant-Hard: Translation invariant threshold with hard thresholding rule
  5. Translation-Invariant-Soft: Translation invariant threshold with soft thresholding rule
  6. Minimax-Hard: Minimax threshold with hard thresholding rule
  7. Minimax-Soft: Minimax threshold with soft thresholding rule
  8. NeighBlock: Overlapping block thresholding (with L0=[logn/2]λ=4.50524)
  9. Linear Penalization: Term-by-term thresholding using linear shrinking
  10. Deterministic/Stochastic: Bayesian thresholding method for shrinkage estimates
Figure 6: Comparison Study using Wave function. N=128
Figure 6 (Wave128G.png)
Figure 7: Comparison Study using Wave function. N=1024
Figure 7 (Wave1024G.png)
Figure 8: Comparison Study using Time-shifted sine function. N=128
Figure 8 (TSsine128G.png)
Figure 9: Comparison Study using Time-shifted sine function. N=1024
Figure 9 (TSsine1024G.png)
Figure 10: Comparison Study using Blocks function. N=128
Figure 10 (Blocks128G.png)
Figure 11: Comparison Study using Blocks function. N=1024
Figure 11 (Blocks1024G.png)
Figure 12: Comparison Study using Bumps function. N=128
Figure 12 (Bumps128G.png)
Figure 13: Comparison Study using Bumps function. N=1024
Figure 13 (Bumps1024G.png)


A general comment can be made related to the Root Mean Squared Error (RMSE). As expected, the bigger the sample size the lower the value of the RMSE. It is readily seen that this is true for the same test function and denoising procedure.
Focusing on the smooth Wave function, the bayesian method performs well. However, the linear penalization method and the Translation-Invariant-Hard method are very competitive. The performance of the penalization method should not be surprising since the linear estimators are expected to achieve good results in smooth functions such as the Wave signal. Similar remarks can be made about the Time-Shifted Sine signal, a function that shares with the Wave signal the smoothnes feature.
As far as the Bumps function and the Blocks function are concerned, the Bayesian method outperform the classical ones in terms of RMSE. This leads to the conclusion that using Bayesian methods for such type of functions is preferable if computational efficiency is not an issue. In fact, it is well established that non-Bayesian methods uniformly outperform Bayesian methods in terms of CPU time.
Finally, as a general remark, larger values of MaxDeviation occur for functions with many spikes and discontinuities.


  1. Burrus, C.S. and Gopinath, R.A. and Guo, H. (1988). Introduction to Wavelets and Wavelet Transforms: A Primer. Prentice-Hall.
  2. Cai, Tony. (1999). Adaptive wavelet estimation: a block thresholding and oracle inequality approach. The Annals of Statistics27, 898–924.
  3. Donoho, David L. and Johnstone, Iain M. (1995). Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association90, 1200–1224.
  4. Donoho, D. (1993). Unconditional bases are optimal bases for data compression and for statistical estimation. Applied and Computational Harmonic Analysis1, 100–115.
  5. Grossmann, A. and Morlet, J. (1984). Decomposition of Hardy Functions into Square Integrable Wavelets of Constant Shape. SIAM Journal on Mathematical Analysis15(4), 723-736.
  6. Haar, A. (1910). Zur Theorie der orthogonalen Funktionensysteme. Mathematische Annalen69, 331–371.
  7. Stein, Charles M. (1981). Estimation of the Mean of a Multivariate Normal Distribution. The Annals of Statistics9(6), 1135–1151.

No comments:

Post a Comment