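Introduction:
A few days ago I attended a talk by Prof. DeLiang Wang (汪德亮) and ran into a strange problem. Under low SNR and strong reverberation, the time-frequency magnitude spectrum of the original signal is modified, and then an istft followed by an stft is applied. The resulting time-frequency spectrum differs from the modified spectrum, and the time-domain signal obtained from the istft is not dereverberated at all; it just sounds very strange. My colleagues were puzzled by this at the time. My own understanding had been that for any complex-valued $H \in \mathbb{C}^{MN}$, where $M$ is the number of frames and $N$ is the number of frequency bins, the relation $\mathrm{stft}(\mathrm{istft}(H)) = H$ holds. If it does not hold, then the standard recipe of most current audio enhancement algorithms (modify the magnitude spectrum, then run an istft with the phase of the noisy signal to obtain the enhanced time-domain speech) carries some risk. This post explains the issue.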
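Code:
% STFT and ISTFT are custom helper functions that are not listed in this post.

% Test 1: start from an arbitrary complex spectrogram and go ISTFT -> STFT.
realData = rand(257,100);
%realData = [realData;realData(end-1:-1:2,:)];
imgData = rand(257,100);
%imgData = [imgData;-imgData(end-1:-1:2,:)];
comData = realData + 1i*imgData;
overLap = 0.5;
frameSize = 512;
y = ISTFT(comData, frameSize, overLap);
[ftbin,Nframe,Nbin,Lspeech,speechFrame] = STFT(y, frameSize, overLap, frameSize);
error0 = squeeze(ftbin) - comData;   % renamed from "error" to avoid shadowing the built-in

% Test 2: start from a real time-domain signal and go STFT -> ISTFT -> STFT.
data = ones(10240,1);
[ftbin1,Nframe,Nbin,Lspeech,speechFrame] = STFT(data, frameSize, overLap, frameSize);
y1 = ISTFT(squeeze(ftbin1), frameSize, overLap);
[ftbin2,Nframe,Nbin,Lspeech,speechFrame] = STFT(y1, frameSize, overLap, frameSize);
error1 = data - y1;                           % round-trip error on the signal
error2 = squeeze(ftbin1) - squeeze(ftbin2);   % round-trip error on its spectrogram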

$H \in \mathbb{C}^{MN}$: an arbitrary complex matrix
$F$: an operator
$G$: an operator
$F(H) = G(H) - H$
$G(H) = \mathrm{STFT}(\mathrm{iSTFT}(H))$
The usual assumption is that $F(H) = 0$ holds. As described above, however, this equality does not hold for every $H$.
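One way to see why $F(H) = 0$ can fail is to count degrees of freedom. This is a rough sketch, not the derivation from the papers: the signal length $T$, window length $L$ and frame shift $R$ are symbols introduced here for illustration, and the analysis and synthesis windows are assumed to satisfy perfect reconstruction, so that $\mathrm{iSTFT} \circ \mathrm{STFT}$ is the identity on signals.

$$
\dim_{\mathbb{R}} \operatorname{Im}(\mathrm{STFT}) \le T \approx M R \;<\; M L \le \dim_{\mathbb{R}} \mathbb{C}^{MN} \qquad (R < L)
$$

$$
G \circ G = \mathrm{STFT} \circ (\mathrm{iSTFT} \circ \mathrm{STFT}) \circ \mathrm{iSTFT} = \mathrm{STFT} \circ \mathrm{iSTFT} = G
$$

With overlapping frames the spectrogram holds more numbers than the signal, so the spectrograms of actual signals form a proper subspace of $\mathbb{C}^{MN}$. Under these assumptions $G$ is a projection onto that subspace, and $F(H) = 0$ exactly when $H$ already lies in it.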
Here is the definition, quoted directly from the paper:
The set of ==consistent spectrograms== can thus be described as the kernel (or null space) of the $\mathbb{R}$-linear operator from $\mathbb{C}^{MN}$ to itself defined by
$F(H) = G(H) - H$
$G(H) = \mathrm{STFT}(\mathrm{iSTFT}(H))$
Let $H(m,n)$ be a set of complex numbers, where $m$ will correspond to the frame index and $n$ to the frequency band index, and $W$ and $S$ be analysis and synthesis windows verifying the perfect reconstruction conditions for a frame shift $S$. For the set $H$ to be a consistent STFT spectrogram, it needs to be the STFT spectrogram of a signal $X(t)$. But by consistency, this signal can be none other than the result of the inverse STFT of the set $H(m,n)$. A necessary and sufficient condition for $H$ to be a consistent spectrogram is thus for it to be equal to the STFT of its inverse STFT. The point here is that, for a given window length $N$ and a given frame shift, if we denote the inverse STFT by iSTFT, the operation iSTFT–STFT from the space of real signals to itself is the identity, while STFT–iSTFT from $\mathbb{C}^{MN}$ to itself is not.
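The asymmetry described in the quoted passage can be checked numerically. The following is a minimal Python sketch, not the MATLAB code from this post: the frame length of 8, the 50% overlap, the sqrt-Hann window, the zero padding at the signal edges and all function names are assumptions chosen for illustration. A naive DFT is used so that the example needs only the standard library; a real implementation would use an FFT.

```python
import cmath
import math
import random

FRAME = 8                # window length (the MATLAB snippet uses 512)
HOP = FRAME // 2         # 50% overlap
BINS = FRAME // 2 + 1    # one-sided spectrum, like 257 bins for 512 samples
# sqrt-Hann window used for analysis and synthesis:
# WINDOW[n]**2 + WINDOW[n + HOP]**2 == 1, so overlap-add reconstructs exactly.
WINDOW = [math.sin(math.pi * n / FRAME) for n in range(FRAME)]


def rdft(frame):
    """One-sided DFT of a real frame (naive O(N^2) sum)."""
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / FRAME)
                for n in range(FRAME)) for k in range(BINS)]


def irdft(spec):
    """Hermitian-extend a one-sided spectrum and return the real samples."""
    full = list(spec) + [spec[FRAME - k].conjugate() for k in range(BINS, FRAME)]
    return [sum(full[k] * cmath.exp(2j * math.pi * k * n / FRAME)
                for k in range(FRAME)).real / FRAME for n in range(FRAME)]


def stft(x):
    """STFT of a real signal whose length is a multiple of HOP."""
    padded = [0.0] * HOP + list(x) + [0.0] * HOP   # cover the edges twice too
    n_frames = len(padded) // HOP - 1
    return [rdft([WINDOW[n] * padded[m * HOP + n] for n in range(FRAME)])
            for m in range(n_frames)]


def istft(spec):
    """Windowed overlap-add inverse of stft(); removes the edge padding."""
    out = [0.0] * ((len(spec) + 1) * HOP)
    for m, frame_spec in enumerate(spec):
        for n, sample in enumerate(irdft(frame_spec)):
            out[m * HOP + n] += WINDOW[n] * sample
    return out[HOP:-HOP]


def max_abs_diff(a, b):
    """Largest element-wise distance between two spectrograms."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


random.seed(0)
M = 6  # number of frames

# Arbitrary complex "spectrogram": in general not the STFT of any signal.
H = [[complex(random.random(), random.random()) for _ in range(BINS)]
     for _ in range(M)]
G_H = stft(istft(H))
err_arbitrary = max_abs_diff(G_H, H)                 # F(H) != 0
err_projected = max_abs_diff(stft(istft(G_H)), G_H)  # F(G(H)) == 0

# Real signal: iSTFT(STFT(x)) returns x.
x = [random.random() for _ in range((M - 1) * HOP)]
err_signal = max(abs(p - q) for p, q in zip(x, istft(stft(x))))

print("max |G(H) - H|       :", err_arbitrary)
print("max |G(G(H)) - G(H)| :", err_projected)
print("max |iSTFT(STFT(x)) - x| :", err_signal)
```

In this sketch the first error is clearly non-zero, while the other two are at rounding level: applying STFT(iSTFT(·)) once lands on a consistent spectrogram, and applying it again changes nothing.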
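The takeaway is that after speech enhancement, the time-domain signal recovered from the enhanced magnitude spectrum does not map back to that same magnitude spectrum when it is transformed to the time-frequency domain again. So when a signal-processing front end finishes its frequency-domain processing and hands a time-domain signal to the recognizer, the MFCC features the recognizer extracts from it may not be optimal. For a more rigorous derivation of this issue, see the papers below.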
References:
1. Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction.
2. Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency.
author:longtaochen
email:[email protected]