Scipy, Numpy: audio classifier, voice/speech activity detection


I am writing a program to automatically classify recorded phone-call audio files (wav files) as containing at least some human voice or not (only DTMF, dial tones, ring tones, noise).

My first approach was to implement a simple VAD (voice activity detector) using ZCR (zero-crossing rate) and energy, but both of those features were confused by DTMF, dial tones, and voice alike, so the technique failed (a generic sketch of the two features is shown below). I then implemented a simple method of computing the variance of the FFT between 200 Hz and 300 Hz; my numpy code follows the sketch.
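For reference, a minimal generic sketch of those two features, assuming frame is a 1-D numpy array of samples (an illustration, not the asker's original code):

import numpy as np

def zcr(frame):
    # Fraction of consecutive sample pairs whose sign changes.
    signs = np.sign(frame)
    return np.mean(signs[:-1] != signs[1:])

def short_time_energy(frame):
    # Mean squared amplitude of the frame.
    return np.mean(frame.astype(float) ** 2)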

import numpy as np
from numpy.fft import fft

# (Wrapped in a function here so the bare `return` is runnable.)
def fft_band_variance(frame, fs):
    # Magnitude spectrum of the frame.
    wavefft = np.abs(fft(frame))
    n = len(frame)
    # Frequency in Hz of each FFT bin.
    fx = np.arange(0, fs, float(fs) / float(n))
    # First bin at or above 200 Hz / 300 Hz.
    stx = np.where(fx >= 200)[0][0]
    endx = np.where(fx >= 300)[0][0]
    # Scaled standard deviation of the magnitudes in the 200-300 Hz band.
    return np.sqrt(np.var(wavefft[stx:endx])) / 1000
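A minimal usage sketch, assuming mono wav input and non-overlapping 30 ms frames (the frame length is an assumption; the question does not state it):

from scipy.io import wavfile

fs, samples = wavfile.read('profiles/voice.wav')
frame_len = int(fs * 0.03)  # 30 ms frames (assumed)
features = [fft_band_variance(samples[i:i + frame_len], fs)
            for i in range(0, len(samples) - frame_len + 1, frame_len)]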
This resulted in 60% accuracy.

Next, I tried a machine-learning approach, using an SVM (support vector machine) with MFCC (Mel-frequency cepstral coefficient) features. The results were completely wrong; almost all samples were labeled incorrectly. How should one train an SVM with MFCC feature vectors? My rough code using scikit-learn is below:

from scipy.io import wavfile
from sklearn import svm

# MFCC() is assumed to be an external function that returns one
# feature vector per audio frame.
samplerate, sample = wavfile.read('profiles/noise.wav')
noiseProfile = MFCC(samplerate, sample)
samplerate, sample = wavfile.read('profiles/ring.wav')
ringProfile = MFCC(samplerate, sample)
samplerate, sample = wavfile.read('profiles/voice.wav')
voiceProfile = MFCC(samplerate, sample)

# Stack the per-frame feature vectors of both classes into one dataset.
machineData = []
for noise in noiseProfile:
    machineData.append(noise)

for voice in voiceProfile:
    machineData.append(voice)

# Label each frame: 0 for noise, 1 for voice.
dataLabel = []
for i in range(0, len(noiseProfile)):
    dataLabel.append(0)
for i in range(0, len(voiceProfile)):
    dataLabel.append(1)

clf = svm.SVC()
clf.fit(machineData, dataLabel)
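A minimal sketch of applying the trained classifier to an unseen recording, assuming the same MFCC() helper; the file name and the majority-vote threshold are assumptions:

samplerate, sample = wavfile.read('profiles/unknown.wav')  # hypothetical file
frames = MFCC(samplerate, sample)
predictions = clf.predict(frames)  # one 0/1 label per frame
# Treat the file as containing voice if most frames are classified 1.
contains_voice = predictions.mean() > 0.5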

I would like to know what alternative approaches I could implement.

If you don't have to use scipy/numpy, you can check out webrtcvad, a Python wrapper around Google's excellent WebRTC Voice Activity Detection code. WebRTC uses Gaussian Mixture Models (GMMs), works well, and is very fast.

Here's an example of how you could use it:

import webrtcvad

# Audio must be 16-bit mono PCM, at 8 kHz, 16 kHz, or 32 kHz.
def audio_contains_voice(audio, sample_rate, aggressiveness=0, threshold=0.5):
    # Frames must be 10, 20, or 30 ms long.
    frame_duration_ms = 30

    # Assuming split_audio is a function that will split audio into
    # frames of the correct size.
    frames = split_audio(audio, sample_rate, frame_duration_ms)

    # aggressiveness tells the VAD how aggressively to filter out non-speech:
    # 0 yields the most false positives for speech, 3 the fewest.
    vad = webrtcvad.Vad(aggressiveness)

    num_voiced = len([f for f in frames if vad.is_speech(f, sample_rate)])
    return float(num_voiced) / len(frames) > threshold
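The example above assumes a split_audio helper; here is a minimal sketch of one, assuming audio is raw 16-bit mono PCM bytes:

def split_audio(audio, sample_rate, frame_duration_ms):
    # Bytes per frame: samples per frame * 2 bytes per 16-bit sample.
    frame_bytes = int(sample_rate * frame_duration_ms / 1000) * 2
    # Keep only complete frames; webrtcvad rejects partial ones.
    return [audio[i:i + frame_bytes]
            for i in range(0, len(audio) - frame_bytes + 1, frame_bytes)]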

You may need to tweak the SVC parameters or try a different kernel; it's hard to say. You can run a grid search to find the best parameters. Before fitting, I would also suggest shuffling machineData and dataLabel together (keeping matching indices).
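A minimal sketch of both suggestions with scikit-learn; the parameter grid values are assumptions, not tuned for this problem:

from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.utils import shuffle

# Shuffle features and labels together so the indices stay aligned.
machineData, dataLabel = shuffle(machineData, dataLabel, random_state=0)

# Search over kernel and regularization strength for the best SVC.
param_grid = {'kernel': ['rbf', 'linear'], 'C': [0.1, 1, 10, 100]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(machineData, dataLabel)
clf = search.best_estimator_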