MFCC and delta coefficients in 3 Python libraries

python audio speech-recognition

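I recently did a homework assignment on MFCCs, and I can't work out what the differences are between these libraries.

The 3 libraries I used are python_speech_features (called `pyspeech` in the code below), speechpy, and librosa.

Part 1: Mel filter bank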

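# Assumed imports; `pyspeech` is taken to be an alias for python_speech_features
import numpy as np
import librosa
import speechpy
import python_speech_features as pyspeech

temp1_fb = pyspeech.get_filterbanks(nfilt=NFILT, nfft=NFFT, samplerate=sample1)
# speechpy does not halve fftpoints and add 1 when initializing, so its filters have NFFT columns
temp2_fb = speechpy.feature.filterbanks(num_filter=NFILT, fftpoints=NFFT, sampling_freq=sample1)
temp3_fb = librosa.filters.mel(sr=sample1, n_fft=NFFT, n_mels=NFILT)
# undo librosa's normalization so that every filter peaks at 1
temp3_fb /= np.max(temp3_fb, axis=-1)[:, None]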

Only the speechpy filter bank has shape (, 512); all the others have shape (, 257). The librosa filters also look slightly distorted.

Part 2: MFCC
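The 512 versus 257 column counts are consistent with the code comment above: for a real-valued frame, only NFFT // 2 + 1 bins of the spectrum are non-redundant. The following is a minimal NumPy sketch of that relationship. It assumes NFFT = 512 (inferred from the quoted shapes, not stated in the question) and uses random data as a stand-in for a real frame.

```python
import numpy as np

NFFT = 512  # assumed FFT size, chosen to match the shapes quoted above

# One frame of a real-valued signal (random data as a stand-in).
frame = np.random.default_rng(0).standard_normal(NFFT)

full_spectrum = np.fft.fft(frame, n=NFFT)        # all NFFT bins
one_sided_spectrum = np.fft.rfft(frame, n=NFFT)  # only the non-redundant bins

n_full = full_spectrum.shape[0]
n_one_sided = one_sided_spectrum.shape[0]
print(n_full, n_one_sided)

# For real input, bin k is the complex conjugate of bin NFFT - k,
# so the upper half of the full spectrum adds no information.
mirrored = np.allclose(
    full_spectrum[1:NFFT // 2],
    np.conj(full_spectrum[:NFFT // 2:-1]),
)
print(mirrored)
```

If a filter bank is built over all NFFT columns, it has to be multiplied with a full-length spectrum (or cut down to the first NFFT // 2 + 1 columns) before it can be compared with the other two.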

# pyspeech: no liftering (ceplifter=0), Hamming window
temp1_mfcc = pyspeech.mfcc(speaker1, samplerate=sample1, winlen=0.025, winstep=0.01, numcep=NCEPT, nfilt=NFILT, nfft=NFFT,
                           preemph=0.97, ceplifter=0, winfunc=np.hamming, appendEnergy=False)
# speechpy: needs an already pre-emphasized signal; its window is fixed to rectangular and its Mel filter bank differs
temp2_mfcc = speechpy.feature.mfcc(emphasized_speaker1, sampling_frequency=sample1, frame_length=0.025, frame_stride=0.01,
                                   num_cepstral=NCEPT, num_filters=NFILT, fft_length=NFFT)
# librosa: needs an already pre-emphasized signal and takes log energy; its STFT uses a Hann window and its framing differs
# y is passed by keyword (required in recent librosa releases); it is not used when S is given
temp3_energy = librosa.feature.melspectrogram(y=emphasized_speaker1, sr=sample1, S=temp3_pow.T, n_fft=NFFT,
                                          hop_length=frame_step, n_mels=NFILT).T
temp3_energy = np.log(temp3_energy)
temp3_mfcc = librosa.feature.mfcc(y=emphasized_speaker1, sr=sample1, S=temp3_energy.T, n_mfcc=13, dct_type=2, n_fft=NFFT,
                                  hop_length=frame_step).T

I did my best to make the comparison fair. The speechpy image still comes out darker.

Part 3: Delta coefficients
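Since two of the three calls above expect a signal that has already been pre-emphasized, it may help to see that step in isolation. This is a sketch of the usual first-order filter y[n] = x[n] - a * x[n-1] with a = 0.97, the same coefficient passed as `preemph` above. The function name `pre_emphasize` is hypothetical, and the question does not show how `emphasized_speaker1` was actually computed.

```python
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    """Return y[n] = x[n] - coeff * x[n-1], keeping the first sample unchanged."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

# A constant (DC) signal is almost entirely removed after the first sample,
# which is the high-pass behaviour pre-emphasis is meant to have.
x = np.array([1.0, 1.0, 1.0, 1.0])
y = pre_emphasize(x)
print(y)
```

If one library applies this filter internally and another receives a signal filtered by hand, any mismatch in the coefficient or in the handling of the first sample would already make the inputs differ before framing starts.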

temp1 = pyspeech.delta(mfcc_speaker1, 2)
temp2 = speechpy.processing.derivative_extraction(mfcc_speaker1.T, 1).T
# librosa along the frame axis
temp3 = librosa.feature.delta(mfcc_speaker1, width=5, axis=0, order=1)

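I can't pass the MFCC array directly as the argument in speechpy, or the result looks very strange. The parameters also seem to behave differently from what I expected.

I'd like to know which factors cause these differences. Is it only the things I mentioned above, or did I make a mistake somewhere? I'd appreciate a detailed explanation, thanks.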
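For reference, here is a sketch of the common regression formula for delta coefficients, d[t] = sum_{k=1..n} k * (c[t+k] - c[t-k]) / (2 * sum_{k=1..n} k^2), with the edge frames repeated as padding. The function `regression_delta` is hypothetical and is not taken from any of the three libraries. Implementations can differ in the half-width n, in how the first and last frames are handled, and in which axis is treated as time, and any of these is enough to change the output.

```python
import numpy as np

def regression_delta(features, n=2):
    """Delta along axis 0 of a (num_frames, num_coeffs) array.

    Frames beyond either end are filled by repeating the edge frame.
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((n, n), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, n + 1))
    delta = np.zeros_like(features)
    for k in range(1, n + 1):
        ahead = padded[n + k : n + k + num_frames]   # c[t + k]
        behind = padded[n - k : n - k + num_frames]  # c[t - k]
        delta += k * (ahead - behind)
    return delta / denom

# Two coefficients that grow linearly over 8 frames with slopes 1 and 2.
ramp = np.arange(8, dtype=float)[:, None] * np.array([1.0, 2.0])
d = regression_delta(ramp, n=2)
print(d)
```

On this ramp the interior frames recover the slopes exactly, while the first and last two frames are pulled toward zero by the edge padding. That boundary effect is one place where otherwise equivalent delta routines can visibly disagree.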

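There are many MFCC implementations, and they usually differ in small details: the shape of the window function, how the Mel filter bank is computed, and the DCT can differ as well. It is hard to find libraries that are fully compatible with each other. In the long run it does not matter, as long as you use the same implementation everywhere. These differences do not affect the results.

I cut off the first coefficient (energy) when plotting.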
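To illustrate how a single such detail changes the numbers, here is a minimal NumPy-only MFCC sketch in which the window function is a parameter. It is not the code of any of the three libraries. The sample rate (16 kHz), frame length (400 samples), frame step (160 samples), filter count, and the test tone are all assumptions chosen for the example, and every function name is hypothetical.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):    # rising slope
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):   # falling slope
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def dct2_matrix(n_out, n_in):
    """Unnormalized DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))

def simple_mfcc(signal, sample_rate, window_fn, n_fft=512, n_filters=26,
                n_ceps=13, frame_len=400, frame_step=160):
    """Frame, window, power spectrum, Mel energies, log, DCT-II."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    idx = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = signal[idx] * window_fn(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energy = np.log(mel_energy + 1e-10)  # small offset avoids log(0)
    return log_energy @ dct2_matrix(n_ceps, n_filters).T

# One second of a 440 Hz tone with a little noise (assumed test signal).
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.standard_normal(sr)

mfcc_rect = simple_mfcc(tone, sr, np.ones)        # rectangular window
mfcc_hamming = simple_mfcc(tone, sr, np.hamming)
mfcc_hann = simple_mfcc(tone, sr, np.hanning)

print(mfcc_rect.shape)
print(np.max(np.abs(mfcc_rect - mfcc_hamming)))
print(np.max(np.abs(mfcc_hamming - mfcc_hann)))
```

Everything except the window is identical across the three calls, yet the coefficients differ. The same kind of shift would be expected from changing the filter bank normalization, the DCT scaling, or the framing, which is why plots from different libraries rarely match exactly.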
