Proteomics Informatics – Signal processing I: analysis of mass spectra (Week 3)
Example data – MALDI-TOF: peptide intensity vs m/z
Example data – ESI-LC-MS/MS: peptide intensity vs m/z vs time (MS); fragment intensity vs m/z (MS/MS)
Sine wave: amplitude and wavelength (figure labels: a, b, c)
Sine and cosine (figure labels: a, b, c)
Two Frequencies
Fourier Transform
    from numpy import *
    x = 2.0*pi*arange(1000.0)/
    sin1 = sin(1000.0*x)
    sin2 = 0.2*sin( *x)
    sin12 = sin1 + sin2
    fft12 = fft.rfft(sin12)
(Plot axis: frequency)
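A minimal runnable sketch of the same idea. The divisor and the second frequency are missing from the slide text, so the sample count `n` and the frequencies `f1` and `f2` below are assumed values chosen for illustration, not the lecture's numbers.

```python
import numpy as np

n = 1000                      # number of samples (assumed)
f1, f2 = 50, 120              # cycles per record (assumed values)
t = np.arange(n) / n          # one unit-length record
sin12 = np.sin(2 * np.pi * f1 * t) + 0.2 * np.sin(2 * np.pi * f2 * t)

spectrum = np.fft.rfft(sin12)
amplitude = 2 * np.abs(spectrum) / n      # scale so a unit sine gives 1.0

# The two largest bins sit at the two input frequencies.
top_two = np.sort(np.argsort(amplitude)[-2:])
```

With whole numbers of cycles in the record, each sine falls into exactly one frequency bin, so `amplitude` recovers the two input amplitudes.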
Inverse Fourier Transform
    from numpy import *
    x = 2.0*pi*arange(1000.0)/
    sin1 = sin(1000.0*x)
    sin2 = 0.2*sin( *x)
    sin12 = sin1 + sin2
    fft12 = fft.rfft(sin12)
    sin12_ = fft.irfft(fft12, len(sin12))
(Plot axis: frequency)
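The round trip through frequency space can be checked numerically. This is a small sketch with assumed frequencies (50 and 120 cycles per record), since the slide's own values are not preserved.

```python
import numpy as np

n = 1000
t = np.arange(n) / n
signal = np.sin(2 * np.pi * 50 * t) + 0.2 * np.sin(2 * np.pi * 120 * t)

spectrum = np.fft.rfft(signal)
# Pass the original length so an odd-length signal is also restored exactly.
restored = np.fft.irfft(spectrum, n=len(signal))

max_error = np.max(np.abs(restored - signal))
```

`max_error` is at the level of floating-point rounding, which is why filtering can be done by editing `spectrum` before transforming back.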
A Peak: centroid, full width at half maximum (FWHM), area, height, maximum, mean, variance, skewness, kurtosis (plot axis: intensity)
Mean and variance: definitions of the mean and the variance of a peak
Skewness and kurtosis: definitions of the skewness and the kurtosis of a peak
A Gaussian Peak
    def gaussian(x, x0, s):
        return exp(-(x-x0)**2/(2*s**2))
    x = linspace(-1, 1, 1000)
    y = gaussian(x, 0, 0.1)
    ffty = fft.rfft(y)
A Gaussian Peak: skewness = 0, excess kurtosis = 0 (plot axis: frequency)
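The slide equations for the moments are not preserved, so the sketch below uses the standard intensity-weighted definitions (an assumption about what the slides showed). The helper name `peak_moments` is hypothetical. Kurtosis is reported as excess kurtosis, which is zero for a Gaussian.

```python
import numpy as np

def peak_moments(x, y):
    """Intensity-weighted mean, variance, skewness and excess kurtosis."""
    w = y / y.sum()                       # normalise intensities to weights
    mean = np.sum(w * x)
    var = np.sum(w * (x - mean) ** 2)
    skew = np.sum(w * (x - mean) ** 3) / var ** 1.5
    kurt = np.sum(w * (x - mean) ** 4) / var ** 2 - 3.0   # excess kurtosis
    return mean, var, skew, kurt

x = np.linspace(-1, 1, 1000)
y = np.exp(-(x - 0.0) ** 2 / (2 * 0.1 ** 2))   # Gaussian peak, sigma = 0.1
mean, var, skew, kurt = peak_moments(x, y)
```

For this peak the variance comes out close to `0.1 ** 2`, and both skewness and excess kurtosis are close to zero.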
Peak with a longer tail (plot axis: frequency)
A skewed peak (plot axis: frequency)
    from scipy.special import erf   # erf is not provided by numpy
    def pdf(x):
        return 1/sqrt(2*pi) * exp(-x**2/2)
    def cdf(x):
        return (1 + erf(x/sqrt(2))) / 2
    def skew(x, e=0, w=1, a=0):
        t = (x-e) / w
        return 2 / w * pdf(t) * cdf(a*t)
Normal noise (plot axis: frequency)
    x = linspace(-1, 1, 1000)
    y = 0.2*random.normal(size=len(x))
If the noise is not normally distributed, try to find a transform that makes it normal.
Lognormal noise (plot axis: frequency)
    x = linspace(-1, 1, 1000)
    y = 0.2*random.lognormal(size=len(x))
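As an illustration of finding a transform that makes the noise normal: taking the logarithm of lognormal noise gives normally distributed values. The sketch below checks this through the sample skewness; the sample size and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

def skewness(v):
    """Sample skewness: third central moment over variance**1.5."""
    d = v - v.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

skew_raw = skewness(noise)          # strongly right-skewed
skew_log = skewness(np.log(noise))  # close to zero after the log transform
```

Other noise distributions need other transforms; the log is only the right choice when the noise is (approximately) lognormal.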
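Skewed noise (plot axis: frequency)
    x = linspace(-1, 1, 1000)
    # Rejection sampling: keep the uniform draws that fall under the skewed peak
    x_test = random.uniform(-1.0, 1.0, size=10*len(x))
    y = random.uniform(0.0, 1.0, size=10*len(x))
    yskew = skew(x_test, -0.1, 0.2, 10)
    yskew = yskew/max(yskew)
    yn_skew = x_test[y<yskew][:len(x)]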
Gaussian peak with normal noise (plot axis: frequency)
Removing High Frequencies (plot axis: frequency)
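Removing high frequencies can be sketched as zeroing the upper part of the spectrum and transforming back. The cut-off bin below is an assumed value picked for this simulated peak, not a parameter from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 1000)
clean = np.exp(-x ** 2 / (2 * 0.1 ** 2))          # Gaussian peak
noisy = clean + 0.2 * rng.normal(size=x.size)     # add normal noise

spectrum = np.fft.rfft(noisy)
cutoff = 20                                       # assumed cut-off bin
spectrum[cutoff:] = 0.0                           # remove high frequencies
filtered = np.fft.irfft(spectrum, n=x.size)

err_noisy = np.sqrt(np.mean((noisy - clean) ** 2))
err_filtered = np.sqrt(np.mean((filtered - clean) ** 2))
```

The broad peak lives in the lowest few bins while the noise is spread over all of them, so the filtered trace is much closer to the clean peak. A cut-off that is too low would start to distort the peak itself.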
Convolution: describes the response of a linear and time-invariant system to an input signal. It is equal to the inverse Fourier transform of the pointwise product in frequency space.
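The relation between convolution and the pointwise product in frequency space can be verified on a small example. The signal and kernel below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
signal = rng.normal(size=64)
kernel = np.array([0.25, 0.5, 0.25])

# Direct (linear) convolution.
direct = np.convolve(signal, kernel, mode="full")

# Same result via the frequency domain: zero-pad both inputs to the
# full output length, multiply pointwise, then transform back.
n = signal.size + kernel.size - 1
product = np.fft.rfft(signal, n) * np.fft.rfft(kernel, n)
via_fft = np.fft.irfft(product, n)
```

The zero-padding matters: without it the FFT route computes a circular convolution in which the ends of the signal wrap around.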
Smoothing by convolution (plot axes: frequency, intensity)
    w = ones(2*width+1, 'd')
    convolve(w/w.sum(), y, 'valid')
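A self-contained version of the flat-window smoothing, applied to simulated normal noise. The function name `moving_average` and the chosen `width` are illustrative.

```python
import numpy as np

def moving_average(y, width):
    """Smooth y with a flat window of 2*width + 1 points."""
    w = np.ones(2 * width + 1)
    # 'valid' keeps only points where the window fully overlaps the data,
    # so the result is 2*width points shorter than y.
    return np.convolve(w / w.sum(), y, mode="valid")

rng = np.random.default_rng(3)
y = 0.2 * rng.normal(size=1000)
smoothed = moving_average(y, width=10)
```

Averaging 21 independent noise values reduces the noise standard deviation by roughly a factor of `sqrt(21)`.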
Adaptive Background Correction (unsharp masking)
(Figure panels: original, unsharp masking)
    wi = linspace(1, window_len, window_len)
    w = 1 / ( 2*r_[wi[::-1],0,wi] + 1 )
    # 'same' keeps the result the same length as x so the subtraction is valid
    x_ = x - d*convolve(w/w.sum(), x, 'same')
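A runnable sketch of the unsharp-masking step on a simulated peak sitting on a constant offset. It reflects the ends of the signal before convolving so that the edges are not pulled towards zero; that padding, the function name `unsharp_mask`, and the values of `window_len` and `d` are assumptions made for this example.

```python
import numpy as np

def unsharp_mask(y, window_len, d):
    """Subtract d times a broadly smoothed copy of y from y."""
    wi = np.linspace(1, window_len, window_len)
    w = 1.0 / (2.0 * np.r_[wi[::-1], 0.0, wi] + 1.0)   # peaked at the centre
    # Reflect the ends so the 'valid' output has the same length as y.
    padded = np.r_[y[window_len:0:-1], y, y[-2:-window_len - 2:-1]]
    background = np.convolve(w / w.sum(), padded, mode="valid")
    return y - d * background

x = np.linspace(-1, 1, 1000)
peak = np.exp(-x ** 2 / (2 * 0.01 ** 2))
y = 5.0 + peak                                   # constant offset (assumed)
corrected = unsharp_mask(y, window_len=100, d=1.0)
```

Far from the peak the offset is removed completely. The peak itself is also reduced, because the smoothed copy still contains part of it, so peak heights after this correction are not directly comparable with the raw ones.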
Adaptive Background Correction
Smoothing and Adaptive Background Correction
Savitzky-Golay smoothing
(Figure panels: polynomial order = 3, 5, 7; bin size = 25, 75, 150)
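One way to try different bin sizes is `scipy.signal.savgol_filter`; whether the lecture figures were produced with it is not stated. The simulated peak and noise level are assumptions, and the largest window is set to 151 rather than 150 to keep the window length odd.

```python
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(4)
x = np.linspace(-1, 1, 1000)
clean = np.exp(-x ** 2 / (2 * 0.1 ** 2))
noisy = clean + 0.2 * rng.normal(size=x.size)

# window_length is the bin size in points; polyorder is the polynomial order.
smoothed = {
    window: savgol_filter(noisy, window_length=window, polyorder=3)
    for window in (25, 75, 151)
}
rms_error = {
    window: np.sqrt(np.mean((y - clean) ** 2)) for window, y in smoothed.items()
}
```

Comparing `rms_error` across windows shows the usual trade-off: larger bins remove more noise but start to flatten peaks that are narrower than the bin.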
Background (plot axis: frequency)
Background Subtraction Using Smoothing
(Figure panels: smoothing and background subtraction for bin size = 100, 200, 300)
Root Mean Square Deviation (RMSD): The RMSD is roughly constant where only noise is present and larger across a peak, provided the window size is approximately the width of the peak.
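A sliding-window RMSD can be sketched with two moving averages, using the identity that the variance is the mean of the squares minus the square of the mean. The function name `moving_rmsd`, the window of 21 points and the simulated data are assumptions.

```python
import numpy as np

def moving_rmsd(y, window):
    """RMSD of y about the local mean in a sliding window of `window` points."""
    w = np.ones(window) / window
    mean = np.convolve(y, w, mode="same")
    mean_sq = np.convolve(y ** 2, w, mode="same")
    # Clip tiny negative values caused by rounding before taking the root.
    return np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))

rng = np.random.default_rng(5)
x = np.linspace(-1, 1, 1000)
y = 0.2 * rng.normal(size=x.size) + 4.0 * np.exp(-x ** 2 / (2 * 0.01 ** 2))

rmsd = moving_rmsd(y, window=21)
peak_region = np.abs(x) < 0.03
```

Away from the peak `rmsd` stays near the noise level of 0.2, while inside `peak_region` it rises well above it. The first and last few values are affected by zero-padding at the edges.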
Background Subtraction using RMSD
(Figure panels: RMSD and intensity for bin size = 100, 200, 300)
Convolution, Cross-correlation, and Autocorrelation
Convolution describes the response of a linear and time-invariant system to an input signal. It is equal to the inverse Fourier transform of the pointwise product in frequency space.
Cross-correlation is a measure of the similarity of two signals. It can be used for finding a shift between two signals.
Autocorrelation is the cross-correlation of a signal with itself. It can be used for finding periodic signals obscured by noise.
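Finding a shift between two signals with cross-correlation can be sketched as follows. The random test signal and the shift of 17 points are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, true_shift = 500, 17                   # true_shift is an assumed test value
signal = rng.normal(size=n)
shifted = np.roll(signal, true_shift)     # shifted[i] = signal[i - true_shift]

# Full cross-correlation; output index k corresponds to lag k - (n - 1).
xcorr = np.correlate(shifted, signal, mode="full")
lags = np.arange(-(n - 1), n)
estimated_shift = lags[np.argmax(xcorr)]
```

The lag at which the cross-correlation is largest is the estimated shift. Passing the same array twice to `np.correlate` gives the autocorrelation.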
Cross-correlation and autocorrelation
Autocorrelation (figure panels: signal, same signal)
Cross-correlation (figure panels: signal, shifted signal)
Cross-correlation (figure panels: signal, half of the peaks shifted)
How similar are two signals? The dot product: for identical vectors it equals the squared length of the vector, and for perpendicular vectors it is zero. The dot product is the same as the cross-correlation at zero lag.
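A small sketch of the dot product as a similarity measure. The normalisation by the vector lengths (so that identical vectors score 1) is a common convention assumed here; the slide's own formulas are not preserved. The vectors are arbitrary examples.

```python
import numpy as np

def similarity(a, b):
    """Normalised dot product (cosine of the angle between a and b)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0, 3.0])
b = np.array([0.0, 0.0, 5.0, 0.0])      # no overlap with a

same = similarity(a, a)                 # identical vectors
perp = similarity(a, b)                 # perpendicular vectors

# The plain dot product equals the cross-correlation at zero lag.
c = np.array([2.0, 1.0, 1.0, 0.0])
zero_lag = np.correlate(a, c, mode="valid")[0]
```

For two spectra binned on the same m/z grid, `similarity` is large only when peaks fall in the same bins.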
What are the characteristics of the dot product? (Figure labels: S/N, dimensions, signal + noise, noise)
Autocorrelation (figure panels: signal, shifted signal, sum of signal and shifted signal)
Coincidence – enhances the signal
The signal-to-noise ratio can be dramatically increased by measuring several independent signals of the same phenomenon and combining these signals.
(Figure panels: ideal signal, four measurements, product of the four measurements)
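The effect of combining independent measurements by multiplication can be simulated. The peak shape, the noise level of 0.2 and the peak-to-background measure used here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(-1, 1, 1000)
ideal = np.exp(-x ** 2 / (2 * 0.02 ** 2))                  # peak of height 1
measurements = ideal + 0.2 * rng.normal(size=(4, x.size))  # 4 independent runs
product = np.prod(measurements, axis=0)

background = np.abs(x) > 0.2                               # region without signal
centre = np.argmax(ideal)

def peak_to_background(y):
    """Peak value divided by the RMS of the signal-free region."""
    return y[centre] / np.sqrt(np.mean(y[background] ** 2))

ratio_single = peak_to_background(measurements[0])
ratio_product = peak_to_background(product)
```

Where there is no signal, the product multiplies four small noise values and collapses towards zero, while at the peak all four factors are close to 1. The noise in the product is no longer normally distributed.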
Coincidence – suppresses and transforms the noise (figure panels: original noise, noise in product)
Coincidence – suppresses interference (figure panels: ideal signal, four measurements with interference, product of the four measurements)
Peak Finding: The derivative of a function is zero at its minima and maxima. The second derivative is negative at maxima and positive at minima.
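On sampled data the derivative is rarely exactly zero, so a practical version looks for a sign change of the first derivative together with a negative second derivative. The sketch below uses a noise-free signal with two assumed peaks; on noisy data the signal would need smoothing first.

```python
import numpy as np

x = np.linspace(-1, 1, 2001)
y = np.exp(-(x + 0.4) ** 2 / 0.005) + 0.5 * np.exp(-(x - 0.3) ** 2 / 0.005)

dy = np.gradient(y, x)        # first derivative
d2y = np.gradient(dy, x)      # second derivative

# A maximum: the first derivative changes sign from + to -,
# and the second derivative is negative there.
crossing = (dy[:-1] > 0) & (dy[1:] <= 0)
is_max = crossing & (d2y[:-1] < 0)
maxima = x[:-1][is_max]
```

`maxima` holds the positions of the two peaks to within one grid step.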
Peak Finding (plot axis: intensity)
1. Characterize the signal and the noise
2. Make a model of the data
3. Select detection method
4. Select parameters using simulations
Peak Finding: Characterizing the noise. Let's first try without removing the peaks. (Plot axis: intensity)
Peak Finding: Characterizing the noise. Removing the peaks by looking for outliers in the root mean square deviation (RMSD). (Plot axes: intensity, RMSD)
Peak Finding: Characterizing the peaks (plot axis: intensity)
Peak Finding: Model of data (figure panels: S/N = 1, 2, 4)
    points = 1000
    x = linspace(-1, 1, points)
    y = noise*random.normal(size=len(x))
    y += signal*gaussian(x, 0, 0.01)
Peak Finding: Detection method (figure panels: S/N = 1, 2, 4)
Peaks can be detected by finding maxima in the moving average with a window size similar to the peak width.
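A self-contained sketch of the model and the moving-average detection. The values `noise = 1.0` and `signal = 4.0` (S/N = 4), the window of 11 points and the seed are assumptions for this example.

```python
import numpy as np

def gaussian(x, x0, s):
    return np.exp(-(x - x0) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(8)
points = 1000
noise, signal = 1.0, 4.0                       # S/N = 4 (assumed values)
x = np.linspace(-1, 1, points)
y = noise * rng.normal(size=x.size) + signal * gaussian(x, 0.0, 0.01)

# Moving average with a window close to the peak FWHM (about 12 points here).
window = 11
smoothed = np.convolve(y, np.ones(window) / window, mode="same")
detected_at = x[np.argmax(smoothed)]
```

Repeating this for many random seeds and several S/N values is one way to choose the window size and a detection threshold by simulation.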
Peak Finding: Detection method – moving average (figure panels: signal and bin size = 5, 20, 80 for S/N = 1, 2, 4)
Peak Finding: Detection method – RMSD (figure panels: signal and bin size = 5, 20, 80 for S/N = 1, 2, 4)
Peak Finding: Information about the Peak: centroid (mean), full width at half maximum (FWHM), area, height, maximum, mean, variance, skewness, kurtosis (plot axis: intensity)
Information about a Peak: centroid or mean. To calculate any of these measures we need to know where the peak starts and ends.
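Once peak boundaries are chosen, the measures follow directly. The boundary rule used below (points above 1% of the maximum) is an assumption made for this sketch, as are the simulated peak's position, height and width.

```python
import numpy as np

x = np.linspace(-1, 1, 2001)
sigma = 0.05
y = 3.0 * np.exp(-(x - 0.2) ** 2 / (2 * sigma ** 2))     # peak at x = 0.2

# Assumed boundary rule: the peak spans the points above 1% of its maximum.
inside = y > 0.01 * y.max()
xs, ys = x[inside], y[inside]

height = ys.max()
dx = xs[1] - xs[0]
area = ys.sum() * dx                      # rectangle rule on a uniform grid
centroid = np.sum(xs * ys) / np.sum(ys)   # intensity-weighted mean position
above_half = xs[ys >= height / 2]
fwhm = above_half[-1] - above_half[0]     # resolution limited by the grid step
```

For a Gaussian the expected values are an area of `height * sigma * sqrt(2*pi)` and an FWHM of about `2.355 * sigma`. A different boundary rule changes the area and centroid slightly, which is why the start and end of the peak matter.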
Where does a peak start and end?
Estimating peptide quantity: peak height, peak area, curve fitting (plot axes: m/z, intensity)
Time dimension (plot axes: m/z, intensity, time)
Sampling (plot axes: retention time, intensity)
Sampling (figure labels: 5%, acquisition time = 0.05)
What is the best way to estimate quantity?
Peak height: resistant to interference; poor statistics
Peak area: better statistics; more sensitive to interference
Curve fitting: better statistics; needs to know the peak shape; slow
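The three estimates can be compared on a simulated peak. The Gaussian shape, the peak parameters, the noise level and the integration window of three standard deviations are all assumptions; `scipy.optimize.curve_fit` is used as one possible fitting routine.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, height, x0, s):
    return height * np.exp(-(x - x0) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(9)
x = np.linspace(-1, 1, 1000)
true_height, true_sigma = 5.0, 0.05                      # assumed peak
y = gaussian(x, true_height, 0.0, true_sigma) + 0.2 * rng.normal(size=x.size)
true_area = true_height * true_sigma * np.sqrt(2 * np.pi)

# 1. Peak height: a single data point, so it carries the full noise.
height_estimate = y.max()

# 2. Peak area: sum over an assumed peak window of +/- 3 sigma.
window = np.abs(x) <= 3 * true_sigma
area_estimate = y[window].sum() * (x[1] - x[0])

# 3. Curve fitting: needs the peak shape (Gaussian assumed here).
params, _ = curve_fit(gaussian, x, y, p0=[y.max(), x[np.argmax(y)], 0.1])
fit_area = params[0] * abs(params[2]) * np.sqrt(2 * np.pi)
```

Taking the maximum tends to overestimate the height because it picks the point with the largest positive noise. The area and the fit both average over many points, but an interfering peak inside the window would add directly to the area.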
Homework: Background Subtraction Using Smoothing
Summary
Fourier transform: transformation to frequency space and back
Signal: how do we detect and characterize signals?
Noise: how do we characterize noise?
Modeling signal and noise
Simulation to select thresholds and parameters
Filters: filtering by low-pass (i.e. smoothing) and high-pass filters (e.g. adaptive background correction)
Detection methods based on moving average and RMSD
Convolution describes the response of a linear and time-invariant system to an input signal
Cross-correlation is a measure of the similarity of two signals
Autocorrelation can be used for finding periodic signals obscured by noise
The dot product can be used to determine how similar two signals are
Coincidence measurements enhance the signal and suppress noise
The quantity associated with a peak: height and area
Sampling: how often do we need to sample a peak to get a good estimate of its area?