Tensorflow: how to prepare a 2D spectrogram from a .wav file as input to a neural network?
I've been asked to build a speech recognition system for a conversion master's course, which is a bit beyond my abilities. I need to prepare .wav files so they can be analysed with an RNN, but I'm stuck on the processing part. I tried using thinkdsp to convert the .wav files into spectrograms with time segments of about 23 ms, but I can't see how to use the output:
times: [0.011564625850340135, 0.023174603174603174, 0.034784580498866215, 0.046394557823129248, 0.058004535147392289, 0.069614512471655329, 0.08122448979591837, 0.092834467120181396, 0.10444444444444445, 0.11605442176870749, 0.12766439909297053, 0.13927437641723356, 0.15088435374149661, 0.16249433106575964, 0.17410430839002267, 0.18571428571428572, 0.19732426303854878, 0.20893424036281177, 0.22054421768707483, 0.23215419501133788, 0.24376417233560094, 0.25537414965986394, 0.26698412698412699, 0.27859410430839004, 0.29020408163265304, 0.3018140589569161, 0.31342403628117915, 0.32503401360544215, 0.33664399092970521, 0.34825396825396826, 0.35986394557823131, 0.37147392290249431, 0.38308390022675737, 0.39469387755102042, 0.40630385487528342, 0.41791383219954648, 0.42952380952380953, 0.44113378684807258, 0.45274376417233558, 0.46435374149659864, 0.47596371882086169, 0.48757369614512469, 0.49918367346938775, 0.5107936507936508, 0.52240362811791385, 0.53401360544217691, 0.54562358276643996, 0.55723356009070291, 0.56884353741496596, 0.58045351473922902, 0.59206349206349207, 0.60367346938775512, 0.61528344671201818, 0.62689342403628112, 0.63850340136054418, 0.65011337868480723, 0.66172335600907028, 0.67333333333333334, 0.68494331065759639, 0.69655328798185945, 0.7081632653061225, 0.71977324263038545, 0.7313832199546485, 0.74299319727891155, 0.75460317460317461, 0.76621315192743766, 0.77782312925170072, 0.78943310657596366, 0.80104308390022672, 0.81265306122448977, 0.82426303854875282, 0.83587301587301588, 0.84748299319727893, 0.85909297052154199, 0.87070294784580504, 0.88231292517006799]
{0.011564625850340135: <thinkdsp.Spectrum object at 0x101a5ecf8>, 0.023174603174603174: <thinkdsp.Spectrum object at 0x101a5ee80>, 0.034784580498866215: <thinkdsp.Spectrum object at 0x10ba04e10>, 0.046394557823129248: <thinkdsp.Spectrum object at 0x10ba04eb8>, 0.058004535147392289: <thinkdsp.Spectrum object at 0x10ba04ef0>, 0.069614512471655329: <thinkdsp.Spectrum object at 0x10ba04f28>, 0.08122448979591837: <thinkdsp.Spectrum object at 0x10ba04f60>, 0.092834467120181396: <thinkdsp.Spectrum object at 0x10ba04f98>, 0.10444444444444445: <thinkdsp.Spectrum object at 0x10ba04fd0>, 0.11605442176870749: <thinkdsp.Spectrum object at 0x10ba21048>, 0.12766439909297053: <thinkdsp.Spectrum object at 0x10ba21080>, 0.13927437641723356: <thinkdsp.Spectrum object at 0x10ba210b8>, 0.15088435374149661: <thinkdsp.Spectrum object at 0x10ba210f0>, 0.16249433106575964: <thinkdsp.Spectrum object at 0x10ba21128>, 0.17410430839002267: <thinkdsp.Spectrum object at 0x10ba21160>, 0.18571428571428572: <thinkdsp.Spectrum object at 0x10ba21198>, 0.19732426303854878: <thinkdsp.Spectrum object at 0x10ba211d0>, 0.20893424036281177: <thinkdsp.Spectrum object at 0x10ba21208>, 0.22054421768707483: <thinkdsp.Spectrum object at 0x10ba21240>, 0.23215419501133788: <thinkdsp.Spectrum object at 0x10ba21278>, 0.24376417233560094: <thinkdsp.Spectrum object at 0x10ba212b0>, 0.25537414965986394: <thinkdsp.Spectrum object at 0x10ba212e8>, 0.26698412698412699: <thinkdsp.Spectrum object at 0x10ba21320>, 0.27859410430839004: <thinkdsp.Spectrum object at 0x10ba21358>, 0.29020408163265304: <thinkdsp.Spectrum object at 0x10ba21390>, 0.3018140589569161: <thinkdsp.Spectrum object at 0x10ba213c8>, 0.31342403628117915: <thinkdsp.Spectrum object at 0x10ba21400>, 0.32503401360544215: <thinkdsp.Spectrum object at 0x10ba21438>, 0.33664399092970521: <thinkdsp.Spectrum object at 0x10ba21470>, 0.34825396825396826: <thinkdsp.Spectrum object at 0x10ba214a8>, 0.35986394557823131: <thinkdsp.Spectrum object at 0x10ba214e0>, 
0.37147392290249431: <thinkdsp.Spectrum object at 0x10ba21518>, 0.38308390022675737: <thinkdsp.Spectrum object at 0x10ba21550>, 0.39469387755102042: <thinkdsp.Spectrum object at 0x10ba21588>, 0.40630385487528342: <thinkdsp.Spectrum object at 0x10ba215c0>, 0.41791383219954648: <thinkdsp.Spectrum object at 0x10ba215f8>, 0.42952380952380953: <thinkdsp.Spectrum object at 0x10ba21630>, 0.44113378684807258: <thinkdsp.Spectrum object at 0x10ba21668>, 0.45274376417233558: <thinkdsp.Spectrum object at 0x10ba216a0>, 0.46435374149659864: <thinkdsp.Spectrum object at 0x10ba216d8>, 0.47596371882086169: <thinkdsp.Spectrum object at 0x10ba21710>, 0.48757369614512469: <thinkdsp.Spectrum object at 0x10ba21748>, 0.49918367346938775: <thinkdsp.Spectrum object at 0x10ba21780>, 0.5107936507936508: <thinkdsp.Spectrum object at 0x10ba217b8>, 0.52240362811791385: <thinkdsp.Spectrum object at 0x10ba217f0>, 0.53401360544217691: <thinkdsp.Spectrum object at 0x10ba21828>, 0.54562358276643996: <thinkdsp.Spectrum object at 0x10ba21860>, 0.55723356009070291: <thinkdsp.Spectrum object at 0x10ba21898>, 0.56884353741496596: <thinkdsp.Spectrum object at 0x10ba218d0>, 0.58045351473922902: <thinkdsp.Spectrum object at 0x10ba21908>, 0.59206349206349207: <thinkdsp.Spectrum object at 0x10ba21940>, 0.60367346938775512: <thinkdsp.Spectrum object at 0x10ba21978>, 0.61528344671201818: <thinkdsp.Spectrum object at 0x10ba219b0>, 0.62689342403628112: <thinkdsp.Spectrum object at 0x10ba219e8>, 0.63850340136054418: <thinkdsp.Spectrum object at 0x10ba21a20>, 0.65011337868480723: <thinkdsp.Spectrum object at 0x10ba21a58>, 0.66172335600907028: <thinkdsp.Spectrum object at 0x10ba21a90>, 0.67333333333333334: <thinkdsp.Spectrum object at 0x10ba21ac8>, 0.68494331065759639: <thinkdsp.Spectrum object at 0x10ba21b00>, 0.69655328798185945: <thinkdsp.Spectrum object at 0x10ba21b38>, 0.7081632653061225: <thinkdsp.Spectrum object at 0x10ba21b70>, 0.71977324263038545: <thinkdsp.Spectrum object at 0x10ba21ba8>, 
0.7313832199546485: <thinkdsp.Spectrum object at 0x10ba21be0>, 0.74299319727891155: <thinkdsp.Spectrum object at 0x10ba21c18>, 0.75460317460317461: <thinkdsp.Spectrum object at 0x10ba21c50>, 0.76621315192743766: <thinkdsp.Spectrum object at 0x10ba21c88>, 0.77782312925170072: <thinkdsp.Spectrum object at 0x10ba21cc0>, 0.78943310657596366: <thinkdsp.Spectrum object at 0x10ba21cf8>, 0.80104308390022672: <thinkdsp.Spectrum object at 0x10ba21d30>, 0.81265306122448977: <thinkdsp.Spectrum object at 0x10ba21d68>, 0.82426303854875282: <thinkdsp.Spectrum object at 0x10ba21da0>, 0.83587301587301588: <thinkdsp.Spectrum object at 0x10ba21dd8>, 0.84748299319727893: <thinkdsp.Spectrum object at 0x10ba21e10>, 0.85909297052154199: <thinkdsp.Spectrum object at 0x10ba21e48>, 0.87070294784580504: <thinkdsp.Spectrum object at 0x10ba21e80>, 0.88231292517006799: <thinkdsp.Spectrum object at 0x10ba21eb8>}
frequencies: [ 0. 43.06640625 86.1328125 129.19921875 172.265625
215.33203125 258.3984375 301.46484375 344.53125 387.59765625
430.6640625 473.73046875 516.796875 559.86328125 602.9296875
645.99609375 689.0625 732.12890625 775.1953125 818.26171875
861.328125 904.39453125 947.4609375 990.52734375 1033.59375
1076.66015625 1119.7265625 1162.79296875 1205.859375 1248.92578125
1291.9921875 1335.05859375 1378.125 1421.19140625 1464.2578125
1507.32421875 1550.390625 1593.45703125 1636.5234375 1679.58984375
1722.65625 1765.72265625 1808.7890625 1851.85546875 1894.921875
1937.98828125 1981.0546875 2024.12109375 2067.1875 2110.25390625
2153.3203125 2196.38671875 2239.453125 2282.51953125 2325.5859375
2368.65234375 2411.71875 2454.78515625 2497.8515625 2540.91796875
2583.984375 2627.05078125 2670.1171875 2713.18359375 2756.25
2799.31640625 2842.3828125 2885.44921875 2928.515625 2971.58203125
3014.6484375 3057.71484375 3100.78125 3143.84765625 3186.9140625
3229.98046875 3273.046875 3316.11328125 3359.1796875 3402.24609375
3445.3125 3488.37890625 3531.4453125 3574.51171875 3617.578125
3660.64453125 3703.7109375 3746.77734375 3789.84375 3832.91015625
3875.9765625 3919.04296875 3962.109375 4005.17578125 4048.2421875
4091.30859375 4134.375 4177.44140625 4220.5078125 4263.57421875
4306.640625 4349.70703125 4392.7734375 4435.83984375 4478.90625
4521.97265625 4565.0390625 4608.10546875 4651.171875 4694.23828125
4737.3046875 4780.37109375 4823.4375 4866.50390625 4909.5703125
4952.63671875 4995.703125 5038.76953125 5081.8359375 5124.90234375
5167.96875 5211.03515625 5254.1015625 5297.16796875 5340.234375
5383.30078125 5426.3671875 5469.43359375 5512.5 ]
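The printed values already encode the analysis parameters. Assuming the last frequency bin (5512.5 Hz) is the Nyquist frequency, i.e. half the sample rate, the short calculation below recovers a sample rate of 11025 Hz, 256-sample segments (about 23 ms) and a step of 128 samples between segments. So each time step is a 129-bin spectrum, and the 76 time steps stack into a 76 × 129 array. This is an inference from the numbers shown, not something thinkdsp reports directly.

```python
# Values copied from the thinkdsp output above
first_time = 0.011564625850340135
second_time = 0.023174603174603174
freq_step = 43.06640625  # spacing between frequency bins (Hz)
max_freq = 5512.5        # highest frequency bin (Hz)

# Assumption: the top bin is the Nyquist frequency, i.e. half the sample rate
framerate = 2 * max_freq                   # 11025.0 Hz
seg_length = round(framerate / freq_step)  # samples per FFT segment
n_bins = seg_length // 2 + 1               # one-sided spectrum size
hop = round((second_time - first_time) * framerate)  # samples between segments

print(framerate, seg_length, n_bins, hop)  # 11025.0 256 129 128
print(f"window = {1000 * seg_length / framerate:.1f} ms, step = {1000 * hop / framerate:.1f} ms")
# window = 23.2 ms, step = 11.6 ms
```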
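What I need is to turn this into a useful 2D input for the RNN. From what I've read, I expected to end up with a sequence of spectral peaks over consecutive time steps. Can someone give me an example of what a good input should actually look like?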
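One way to read the thinkdsp output: the dict maps each segment's time to a Spectrum, so stacking the magnitude arrays in time order gives a (time steps, frequency bins) matrix, one row per segment, which is the per-utterance layout an RNN consumes. The sketch below assumes the thinkdsp Spectrogram keeps that dict in a `spec_map` attribute and that each Spectrum exposes its magnitudes as `amps`; check those names against your copy of thinkdsp.py. The helper `spec_map_to_matrix` is hypothetical, and the demo uses random arrays as stand-ins for real spectra.

```python
import numpy as np

def spec_map_to_matrix(spec_map, log=True, eps=1e-10):
    """Stack {time: magnitude array} into a (n_times, n_bins) float32 matrix.

    Rows are sorted by time, so row i is the spectrum of the i-th segment.
    """
    times = sorted(spec_map)
    matrix = np.stack([np.asarray(spec_map[t], dtype=np.float64) for t in times])
    if log:
        # Log compression keeps loud low-frequency bins from dominating
        matrix = np.log(matrix + eps)
    return np.array(times), matrix.astype(np.float32)

# With thinkdsp (assumed attribute names, see the lead-in):
#   amps_by_time = {t: s.amps for t, s in spectrogram.spec_map.items()}
#   times, features = spec_map_to_matrix(amps_by_time)

# Stand-in data with the same layout as the output above: 76 segments x 129 bins
rng = np.random.default_rng(0)
step = 128 / 11025
fake = {0.011564625850340135 + i * step: rng.random(129) for i in range(76)}
times, features = spec_map_to_matrix(fake)
print(features.shape)  # (76, 129)
```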
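import librosa

# Compute a short-time Fourier transform (STFT) spectrogram with librosa
window_size_sec = 0.025    # 25 ms analysis window
window_shift_sec = 0.0125  # 12.5 ms hop between frames (50% overlap)
sample_rate = 8000

# Load the file as mono, resampled to 8 kHz
data, sampling_rate = librosa.load('audio.wav', sr=sample_rate, mono=True)

win_length = int(sample_rate * window_size_sec)   # 200 samples
hop_length = int(sample_rate * window_shift_sec)  # 100 samples
n_fft = win_length  # must be >= win_length

# Complex array of shape (1 + n_fft // 2, n_frames)
spectrogram = librosa.stft(data, n_fft=n_fft, hop_length=hop_length, win_length=win_length)
print(spectrogram.shape)
# (101, 338)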
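The librosa result is a complex array laid out as (frequency bins, frames): 101 = n_fft // 2 + 1 bins, with 338 frames for that particular file. A recurrent layer such as `tf.keras.layers.LSTM` expects (batch, time steps, features), so the usual steps are: take the magnitude, compress it with a log, transpose so time comes first, normalise each frequency bin, and zero-pad utterances of different lengths to a common length. The sketch below does this with NumPy only. The helper names `stft_to_features` and `pad_batch` are hypothetical, random arrays stand in for real STFTs, and per-utterance normalisation is one common choice (statistics computed over the whole training set are another).

```python
import numpy as np

def stft_to_features(stft_matrix, eps=1e-6):
    """Complex STFT (n_bins, n_frames) -> normalised log-magnitude (n_frames, n_bins)."""
    log_mag = np.log(np.abs(stft_matrix) + eps)  # magnitude, then log compression
    feats = log_mag.T                            # time first: one row per frame
    # Per-utterance mean/variance normalisation of each frequency bin
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + eps)
    return feats.astype(np.float32)

def pad_batch(feature_list, pad_value=0.0):
    """Pad (n_frames_i, n_bins) arrays into (batch, max_frames, n_bins); also return lengths."""
    lengths = np.array([f.shape[0] for f in feature_list])
    n_bins = feature_list[0].shape[1]
    batch = np.full((len(feature_list), lengths.max(), n_bins), pad_value, dtype=np.float32)
    for i, f in enumerate(feature_list):
        batch[i, : f.shape[0]] = f
    return batch, lengths

# Stand-in complex STFTs shaped like the librosa output (101 bins, different frame counts)
rng = np.random.default_rng(0)
stfts = [rng.standard_normal((101, n)) + 1j * rng.standard_normal((101, n)) for n in (338, 250)]
batch, lengths = pad_batch([stft_to_features(s) for s in stfts])
print(batch.shape, lengths)  # (2, 338, 101) [338 250]
```

In Keras, placing a `tf.keras.layers.Masking(mask_value=0.0)` layer in front of the recurrent layer lets it skip the zero-padded frames.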