Python: compare two text files and write the matching values to a text file
Tags: python, string-comparison
I have two text files: Speech.txt and Script.txt. Speech.txt contains a list of audio file names, and Script.txt contains the corresponding transcripts. Script.txt holds the transcripts for all characters and items, but I only want the transcripts for one specific character. I would like to write a Python script that compares the file names against the transcripts and returns a text file containing the file path, file name, extension and transcript, separated by |.
Example of Speech.txt:
Example of Script.txt:
Expected output:
Code in progress:
The code above only seems to work for the first line of Speech.txt and then stops. I want it to run through the whole file, i.e. line 2, line 3, and so on. I also haven't figured out how to write the results to a text file; for now I can only print them. Any help would be greatly appreciated.
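You can use the readlines method to load the lines into lists and then iterate over those lists. This avoids the problem, correctly identified by Kuldeep Singh Sidhu, of the file pointer having already reached the end of the file.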
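f1 = open(r'C:/Speech.txt', "r", encoding='utf8')
f2 = open(r'C:/script.txt', "r", encoding='utf8')
# Load every line into memory so both lists can be iterated repeatedly
lines1 = f1.readlines()
lines2 = f2.readlines()
f1.close()
f2.close()
with open("output.txt", "w", encoding='utf8') as outfile:
    for line1 in lines1:
        for line2 in lines2:
            # The first 10 characters are the file name without its extension
            if line1[0:10] == line2[0:10]:
                # write() accepts a single string, so the newline is concatenated
                outfile.write('C:/Speech/' + line2[0:10] + '.wav' + '|' + line2[26:-1] + "\n")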
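A small sketch of the point about writing lines: file.write() takes exactly one string and does not add a newline, whereas print(..., file=...) appends one itself. The sample pairs below are placeholders taken from the expected output, and io.StringIO stands in for a real output file; both are assumptions made only so the snippet runs without touching the disk.

```python
import io

# Hypothetical matched pairs; in the real script they come from the two files.
matches = [
    ("0x000f4a03", "Thinking long-term, then. Think she'll succeed?"),
    ("0x000f4a07", "Son's King of Skellige. Congratulations to you."),
]

# io.StringIO stands in for a file opened with open("output.txt", "w").
with_write = io.StringIO()
with_print = io.StringIO()

for name, text in matches:
    line = "C:/Speech/" + name + ".wav|" + text
    # write() takes a single string, so the newline must be part of it.
    with_write.write(line + "\n")
    # print() adds the trailing newline on its own when given file=...
    print(line, file=with_print)

# Both approaches should leave identical content in the "file".
same_output = with_write.getvalue() == with_print.getvalue()
print(same_output)
```

Either form works; the choice is mostly a matter of taste.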
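I would read the contents of Script.txt into a dictionary, then use that dictionary as a lookup while iterating over the lines of Speech.txt, printing only the lines that exist in it. This avoids iterating over the file multiple times, which can be very slow if you have large files.
Demo:
Output:
It is also easier to open the files with a with statement, because you don't need to call .close() to close them; it handles that for you.
I also used Path.stem to get the file name from your .wav files. I find this easier to use than the os.path functions, although that is personal preference and either will work.
If you want to write the output to a text file, you can open another output file in write mode with mode="w":
output.txt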
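To illustrate the comparison, here is a short sketch of both ways of dropping the extension. The file names are taken from the expected output; the directory prefix in the second half is only there to show where the two calls differ.

```python
import os.path
from pathlib import Path

name = "0x000f4a03.wav"

# For a bare file name both approaches give the same result.
stem_pathlib = Path(name).stem
stem_os = os.path.splitext(name)[0]

# With a directory in front they differ: Path.stem drops the directory,
# os.path.splitext only removes the extension and keeps the rest.
full = "C:/Speech/0x000f4a03.wav"
full_pathlib = Path(full).stem
full_os = os.path.splitext(full)[0]

print(stem_pathlib, stem_os)
print(full_pathlib, full_os)
```

Since Speech.txt lists bare file names, the two are interchangeable for this task.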
You can find more information on how to read and write files in Python in the documentation.
Using pandas is another approach, since this looks like a typical join problem.
import pandas as pd
df = pd.read_csv('speech.txt', header=None, names=['name'])
df1 = pd.read_csv('script.txt', sep='|', header=None, names=['name', 'blank', 'description'])
df1['name'] = df1.name.str.strip() + '.wav'
final = pd.merge(df, df1, how='left', left_on='name', right_on='name')
final['name'] = 'C:/Speech/' + final['name']
print(final)
name blank description
0 C:/Speech/0x000f4a03.wav Thinking long-term, then. Think she'll succeed?
1 C:/Speech/0x000f4a07.wav Son's King of Skellige. Congratulations to you.
2 C:/Speech/0x000f4a0f.wav And unites the clans against Nilfgaard?
Then just select the columns you need and save the result.
final = final[['name', 'description']]
final.to_csv('some_name.csv', index=False, sep='|')
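For each line of the Speech.txt file, you need to check whether it exists in the Script.txt file. Assuming the contents of Script.txt fit in memory, you should load them once to avoid re-reading the file every time.
Once the contents of Script.txt are loaded, you just process each line of Speech.txt, look it up in the dictionary, and print it when needed.
The code follows. Note that:
I have added debug information. You can hide it by running python -O script.py.
I use os.path.splitext(var)[0] to remove the extension from the file name.
I strip every processed line to remove spaces and newline characters.
Code:
Debug output:
Output:
Output written to a file: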
Comments:
You are reading the first line of f1 and comparing it with all the lines in f2. After doing this, the pointer of f2 is already at the end of the file, so for every line of f1 except the first it does nothing.
Please provide links to the files you are using for further help!
For each entry of speech, you need to check whether that entry exists in the script file. That means reading the script file again and again. Does the content of script.txt fit in memory? In that case, I'd suggest loading the data into a dictionary structure to avoid multiple reads, and then just searching the dictionary for each entry.
@Kuldeepsingsidhu Thanks for your help, I updated the post with links to the text files.
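The pointer behaviour described in the first comment can be reproduced with a minimal sketch. The temporary file and its three lines are made up for the demonstration; they stand in for Script.txt.

```python
import os
import tempfile

# Create a small temporary file (a stand-in for Script.txt).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf8") as tmp:
    tmp.write("line a\nline b\nline c\n")
    path = tmp.name

try:
    with open(path, encoding="utf8") as f:
        # The first loop consumes the file and leaves the pointer at the end.
        first_pass = [line.strip() for line in f]
        # A second loop over the same file object therefore yields nothing.
        second_pass = [line.strip() for line in f]
        # Rewinding with seek(0) makes the lines available again.
        f.seek(0)
        after_seek = [line.strip() for line in f]
finally:
    os.remove(path)

print(first_pass)
print(second_pass)
print(after_seek)
```

Calling seek(0) inside the outer loop would fix the original nested loop, but it re-reads the file for every entry, which is why the answers load the data into a list or dictionary instead.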
C:\Speech\0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:\Speech\0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:\Speech\0x000f4a0f.wav|And unites the clans against Nilfgaard?
from pathlib import Path

with open("Speech.txt") as speech_file, open("Script.txt") as script_file, open("output.txt", mode="w") as output_file:
    # Map each file name (without extension) to its transcript
    script_dict = {}
    for line in script_file:
        key, _, text = map(str.strip, line.split("|"))
        script_dict[key] = text

    # Write only the speech files that have a transcript
    for line in map(str.strip, speech_file):
        filename = Path(line).stem
        if filename in script_dict:
            # Backslashes are doubled so they are not read as escape sequences
            output_file.write(f"C:\\Speech\\{line}|{script_dict[filename]}\n")
C:\Speech\0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:\Speech\0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:\Speech\0x000f4a0f.wav|And unites the clans against Nilfgaard?
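One caveat with unpacking line.split("|") into exactly three names is that it raises ValueError if a transcript itself contains a | or a line is malformed. The sketch below is one possible way to guard against that; the function names, the sample lines and the assumed layout "name | | transcript" are illustrative assumptions, not taken from the real files.

```python
from pathlib import Path


def build_lookup(script_lines):
    """Map file name (without extension) to transcript, skipping bad lines."""
    lookup = {}
    for line in script_lines:
        # maxsplit=2 keeps any further "|" inside the transcript text.
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not in the assumed "name | | transcript" layout
        key, _, text = (part.strip() for part in parts)
        lookup[key] = text
    return lookup


def match_lines(speech_lines, lookup, prefix="C:/Speech/"):
    """Return (matched output lines, speech entries without a transcript)."""
    matched, missing = [], []
    for name in map(str.strip, speech_lines):
        if not name:
            continue  # ignore blank lines
        key = Path(name).stem
        if key in lookup:
            matched.append(f"{prefix}{name}|{lookup[key]}")
        else:
            missing.append(name)
    return matched, missing


# Hypothetical sample data in the assumed layout.
script = [
    "0x000f4a03 | | Thinking long-term, then. Think she'll succeed?\n",
    "this line has no separators\n",
    "0x000f4a07 | | A transcript | with a pipe\n",
]
speech = ["0x000f4a03.wav\n", "0x000f4a07.wav\n", "0x00000000.wav\n"]

matched, missing = match_lines(speech, build_lookup(script))
print(matched)
print(missing)
```

Because the functions accept any iterable of lines, open file objects can be passed in directly in place of the sample lists.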
#!/usr/bin/python
# -*- coding: utf-8 -*-

# For better print formatting
from __future__ import print_function

# Imports
import sys
import os


#
# HELPER METHODS
#
def load_script_file(script_file_path):
    # Parse each line of the script file and load to a dictionary
    d = {}
    with open(script_file_path, "r") as f:
        for transcript_info in f:
            if __debug__:
                print("Loading line: " + str(transcript_info))
            speech_filename, _, transcription = transcript_info.split("|")
            speech_filename = speech_filename.strip()
            transcription = transcription.strip()
            d[speech_filename] = transcription

    if __debug__:
        print("Loaded values: " + str(d))
    return d


#
# MAIN METHODS
#
def main(speech_file_path, script_file_path, output_file):
    # Load the script data into a dictionary
    speech_to_transcript = load_script_file(script_file_path)

    # Check each speech entry
    with open(speech_file_path, "r") as f:
        for speech_audio_file in f:
            speech_audio_file = speech_audio_file.strip()
            if __debug__:
                print()
                print("Checking speech file: " + str(speech_audio_file))

            # Remove extension
            speech_code = os.path.splitext(speech_audio_file)[0]
            if __debug__:
                print(" + Obtained filename: " + speech_code)

            # Find entry in transcript
            if speech_code in speech_to_transcript.keys():
                if __debug__:
                    print(" + Filename registered. Loading transcript")
                transcript = speech_to_transcript[speech_code]
                if __debug__:
                    print(" + Transcript: " + str(transcript))

                # Print information
                output_line = "C:/Speech/" + speech_audio_file + "|" + transcript
                if output_file is None:
                    print(output_line)
                else:
                    with open(output_file, 'a') as fw:
                        fw.write(output_line + "\n")
            else:
                if __debug__:
                    print(" + Filename not registered")


#
# ENTRY POINT
#
if __name__ == '__main__':
    # Parse arguments
    args = sys.argv[1:]
    speech = str(args[0])
    script = str(args[1])
    if len(args) == 3:
        output = str(args[2])
    else:
        output = None

    # Log arguments if required
    if __debug__:
        print("Running with:")
        print(" - SPEECH FILE = " + str(speech))
        print(" - SCRIPT FILE = " + str(script))
        print(" - OUTPUT FILE = " + str(output))
        print()

    # Execute main
    main(speech, script, output)
$ python speech_transcript.py ./Speech.txt ./Script.txt
Running with:
- SPEECH FILE = ./Speech.txt
- SCRIPT FILE = ./Script.txt
Loaded values: {'0x000f4a03': "Thinking long-term, then. Think she'll succeed?", '0x000f4a11': "Of course. He's already decreed new longships be built.", '0x000f4a05': "She's got a powerful ally. In me.", '0x000f4a07': "Son's King of Skellige. Congratulations to you.", '0x000f4a0f': 'And unites the clans against Nilfgaard?'}
Checking speech file: 0x000f4a03.wav
+ Obtained filename: 0x000f4a03
+ Filename registered. Loading transcript
+ Transcript: Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
Checking speech file: 0x000f4a07.wav
+ Obtained filename: 0x000f4a07
+ Filename registered. Loading transcript
+ Transcript: Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
Checking speech file: 0x000f4a0f.wav
+ Obtained filename: 0x000f4a0f
+ Filename registered. Loading transcript
+ Transcript: And unites the clans against Nilfgaard?
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?
$ python -O speech_transcript.py ./Speech.txt ./Script.txt
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?
$ python -O speech_transcript.py ./Speech.txt ./Script.txt ./output.txt
$ more output.txt
C:/Speech/0x000f4a03.wav|Thinking long-term, then. Think she'll succeed?
C:/Speech/0x000f4a07.wav|Son's King of Skellige. Congratulations to you.
C:/Speech/0x000f4a0f.wav|And unites the clans against Nilfgaard?
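The -O flag used in the commands above works because the interpreter sets the built-in constant __debug__ to False when it is started with -O, so every "if __debug__:" block is skipped. A minimal sketch of that behaviour follows; the one-line snippet and the helper name run are made up for the demonstration, and the child interpreter is launched through sys.executable.

```python
import os
import subprocess
import sys

# One-line program that reports the value of __debug__ in the child process.
snippet = "print('debug on' if __debug__ else 'debug off')"

# Drop PYTHONOPTIMIZE so that only the command-line flag decides the result.
env = {k: v for k, v in os.environ.items() if k != "PYTHONOPTIMIZE"}


def run(flags):
    """Run the snippet in a fresh interpreter with the given flags."""
    result = subprocess.run(
        [sys.executable, *flags, "-c", snippet],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout.strip()


normal = run([])         # __debug__ is True by default
optimized = run(["-O"])  # -O sets __debug__ to False

print(normal)
print(optimized)
```

For anything beyond a short script, the standard logging module is the more usual way to switch diagnostic output on and off.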