Python: tagging English words only on lines starting with *CHI:
I'm trying to write a Python script that tags every English word, only on lines beginning with *CHI:, by appending "@s:eng" to the end of the word, but the code doesn't seem to work. Currently, the code looks like this:
import re
with open("transcript 0623.cha", encoding='utf8') as f:
    text = f.read()
new_text = re.sub("A-Za-z", "A-Za-z@s:eng", text)
with open("transcript 0623_out.cha", "w", encoding='utf8') as result:
    result.write(new_text)
Could you suggest how I can improve the code?
The sample content of transcript 0623 is as follows:
@Begin
@Languages: zho , eng
@Participants: TEA Teacher , CHI Child
@ID: zho,|change_me_later|TEA|||||Teacher|||
@ID: zho,|change_me_later|CHI|||||Child|||
@Transcriber: CKX
@Activities: Storytelling
@Comment: child used the malay word sayang
*TEA: ok , 来 , 开始 .
*CHI: 呃 , the boy@s .
*TEA: 嗯 .
*CHI: have a frog@s .
*TEA: ok .
*TEA: ok do you know what is boy in chinese ?
*TEA: can you help me tell the story in chinese ?
*TEA: ok then do you know what is a frog in chinese ?
*TEA: ok , come .
*TEA: go to the next page .
*CHI: when the boy sleeping , then the frog come out@s .
*TEA: ok .
*TEA: 还有 吗 ?
*CHI: the cat also sleeping@s .
*TEA: ok .
*TEA: do you know what is cat in chinese ?
*TEA: 嗯 , what is it ?
*CHI: 猫 .
*TEA: ok .
*TEA: so can you use your chinese for cat to help me tell the story ?
*TEA: 嗯 ?
*CHI: 猫 睡觉 .
*TEA: 啊 , 很 好 .
*TEA: 还有 吗 ?
*CHI: frog come out@s .
*TEA: ok .
*TEA: 很 好 .
*TEA: 还有 吗 ?
*CHI: next one@s .
*TEA: ok .
*CHI: the boy wake up@s .
*CHI: and , the frog is gone@s .
*TEA: 嗯 .
*CHI: then , maybe , the frog went out the window@s .
*TEA: 嗯 , ok .
*CHI: the boy is looking for the frog@s .
*TEA: 嗯 .
*CHI: the cat is looking for the frog@s .
*TEA: ok what is cat in chinese again ?
*CHI: what@s ?
*TEA: what is cat in chinese again ?
*CHI: 猫 .
*TEA: 嗯 .
*TEA: ok can you use the chinese word for cat to tell me the story again ?
*TEA: 嗯 ?
*CHI: 猫 looking for the@s .
*TEA: 啊 .
*CHI: for the@s .
*TEA: 嗯 .
*CHI: frog@s .
*TEA: ok .
*TEA: very good .
*TEA: anything else ?
*TEA: ok .
*CHI: the@s 猫 go in@s .
*CHI: and put the bottle in here@s .
*TEA: 嗯 .
*CHI: the boy has do this@s .
*TEA: 嗯 .
*CHI: the cat fall down@s .
*TEA: ok what is cat in chinese again ?
*CHI: 猫 fall down@s .
*TEA: 嗯 .
*CHI: and get the bottle@s .
*CHI: get the bottle@s .
*TEA: ok .
*TEA: very good .
*TEA: ok anything else ?
*TEA: anything else ?
*TEA: ok .
*CHI: the boy go and sayang the cat@s .
*TEA: 嗯 .
*TEA: what is cat in chinese ?
*CHI: the , the boy go and sayang the@s 猫 .
*TEA: 啊 , ok .
*TEA: very good .
*CHI: and then the bottle break@s .
*TEA: ok .
*TEA: very good .
*TEA: anything else ?
*TEA: come .
*TEA: ok this whole thing is together .
*CHI: the boy is calling for the frog@s .
*TEA: 嗯 .
*CHI: the cat is looking underneath the table@s .
*TEA: ok what is cat in chinese again ?
*CHI: the@s 猫 looking for the frog underneath@s .
*TEA: 嗯 , ok .
*CHI: they looking inside the hole if the frog is here@s .
*TEA: 嗯 .
*TEA: anything else ?
*CHI: then the boy is here@s .
*TEA: 啊 , ok very good .
*TEA: anything else ?
*CHI: the boy fall down into the water@s .
*CHI: and the cat also@s .
*CHI: and then the log break@s .
*TEA: 嗯 .
*TEA: do you know what is water in chinese ?
*TEA: what is it ?
*CHI: 水 .
*TEA: ok can you tell me the story again with the word , with the , with
the chinese word for water ?
*TEA: 嗯 ?
*CHI: the boy fall down@s .
*CHI: and the@s 猫 too@s .
*CHI: and both of them fall in the@s 水 .
*TEA: ok , very good .
*CHI: and then they all get wet@s .
*TEA: 嗯 .
*TEA: ok .
*CHI: they found some water on the log@s .
*TEA: 嗯 .
*CHI: they found so many frogs@s .
*TEA: 嗯 .
*CHI: and is this the frog that they have@s ?
*TEA: 嗯 .
*TEA: ok .
*CHI: then they say bye bye .
*TEA: 嗯 .
*TEA: you know how to say bye bye in chinese ?
*CHI: 再见 .
*TEA: ok .
*TEA: can you repeat this part again in chinese ?
*CHI: and then the boy and the cat and the frog
say@s 再见 .
*TEA: ok what is cat in chinese again ?
*CHI: 猫 .
*TEA: 啊 .
*TEA: can you repeat the whole thing ?
*CHI: the boy and the@s 猫 and the , and the frog@s .
*TEA: 嗯 .
*CHI: say@s 再见 .
*TEA: ok .
*TEA: very good .
*TEA: thank you for telling me the story ok ?
@End
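Your regex is incorrect:
The search pattern is looking for the literal text "uppercase A, hyphen, uppercase Z, lowercase a, hyphen, lowercase z". If you only want to check lines that start with "*CHI:", then "*CHI:" should be part of your search pattern.
The replacement pattern replaces the whole matched text with "A-Za-z@s:eng". You need to capture the parts of the text you want to keep, then reuse them and append "@s:eng" to the end of the word.
Here's what you can use: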
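import re

i_path = "transcript 0623.cha"
o_path = "transcript 0623_out.cha"

mark_pattern = re.compile("\\*CHI:.*")
word_pattern = re.compile("([A-Za-z]+)")

with open(i_path, encoding='utf8') as i_file, open(o_path, "w", encoding='utf8') as o_file:
    for line in i_file:
        # Split into possible words
        parts = line.split()
        # Blank lines have no parts; write them (and non-CHI lines) unchanged
        if not parts or mark_pattern.match(parts[0]) is None:
            o_file.write(line)
            continue
        # Got a CHI line
        new_line = line
        for word in parts[1:]:
            matches = word_pattern.match(word)
            if matches:
                old = f"\\b{word}\\b"
                new = f"{matches.group(1)}@s:eng"
                new_line = re.sub(old, new, new_line, count=1)
        o_file.write(new_line)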
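Explanation:
mark_pattern = re.compile("\\*CHI:.*")
- The pattern for matching lines that start with "*CHI:". You need to escape the * at the start, because an unescaped * is a regex repetition operator.
- The re docs say that using re.compile() and "saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program."
word_pattern = re.compile("([A-Za-z]+)")
- The pattern for matching words. You need to use [] to indicate a set of characters, then + to indicate one or more repetitions of the preceding pattern.
for line in i_file
- It's easier (and more memory-efficient) to process the file line by line. You can easily debug the regex search and replace for each line. It's probably possible to do all this with a single read()/readlines(), but I prefer readability.
parts = line.split()
- To find the words, split the line into possible words.
.match(..)
- See the documentation for match.
- It returns a match object, and if you have captures (the parentheses) in your regex pattern, you can access them with .group(). This is used to change word into word@s:eng.
Check whether the first part (parts[0]) matches the "CHI" pattern. If not, just write the line as-is to the output file. If it does, continue processing word by word.
For each possible word, check whether it matches the word pattern. If it does, use re.sub to replace the old word in the line with word@s:eng. Repeat this match-then-replace for each word, accumulating the replacements in new_line. Note that by using matches.group(1), I'm replacing the @s in the original line (as in "frog@s" becomes "frog@s:eng").
I used f-strings for old and new. If you're not on Python 3.6+, you can use regular string concatenation/formatting.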
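As a small illustration of the building blocks described above (a sketch, not part of the original answer; the variable names are made up for the example), the snippet below contrasts the question's literal pattern with a bracketed character class, and shows the escaped * and the capture group in isolation:

```python
import re

text = "*CHI: have a frog@s ."

# Without brackets, "A-Za-z" is just the literal six-character string "A-Za-z",
# which never occurs in the transcript, so nothing is found.
literal_hits = re.findall("A-Za-z", text)

# With brackets it becomes a character class; + means "one or more letters".
class_hits = re.findall("[A-Za-z]+", text)

# The leading * must be escaped to match a literal asterisk.
mark = re.compile(r"\*CHI:")
is_chi = mark.match("*CHI:") is not None
is_tea = mark.match("*TEA:") is not None

# A capture group keeps only the letters, dropping the trailing "@s".
m = re.match("([A-Za-z]+)", "frog@s")
captured = m.group(1)

print(literal_hits)    # []
print(class_hits)      # ['CHI', 'have', 'a', 'frog', 's']
print(is_chi, is_tea)  # True False
print(captured)        # frog
```

Note that the bare character class also picks up "CHI" and the "s" of "@s", which is why the answer's code checks the speaker prefix separately and works on whitespace-separated words.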
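Results:
I: *CHI: have a frog@s .
O: *CHI: have@s:eng a@s:eng frog@s:eng .
I: *CHI: the cat is looking underneath the table@s .
O: *CHI: the@s:eng@s:eng cat@s:eng is@s:eng looking@s:eng underneath@s:eng the table@s:eng .
(ignores punctuation)
I: *CHI: what@s ?
O: *CHI: what@s:eng ?
(ignores non-English words in the line)
I: *CHI: 猫 fall down@s .
O: *CHI: 猫 fall@s:eng down@s:eng .
(unaffected if the line doesn't start with CHI)
I: *TEA: ok this whole thing is together .
O: *TEA: ok this whole thing is together .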
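One thing to watch in the second result: the first "the" ends up tagged twice and the later "the" is left untagged, because re.sub with count=1 always replaces the first match in the line. A possible way around this is to substitute token by token with a callback instead of searching the whole line again. The following is only a sketch under some assumptions: the names CHI_PREFIX, tag_token and tag_line are hypothetical, a "word" is taken to be a whitespace-separated token made purely of ASCII letters (optionally already ending in @s or @s:eng), and each physical line is handled on its own.

```python
import re

CHI_PREFIX = "*CHI:"  # speaker marker used in the sample transcript

def tag_token(m):
    """Tag one whitespace-separated token if it is a plain English word."""
    token = m.group(0)
    # Letters only, optionally followed by an existing "@s" or "@s:eng" marker.
    word = re.fullmatch(r"([A-Za-z]+)(?:@s(?::eng)?)?", token)
    if word:
        return f"{word.group(1)}@s:eng"
    return token  # punctuation, Chinese text, etc. are left unchanged

def tag_line(line):
    """Return the line with English words tagged if it is a *CHI: line."""
    if not line.startswith(CHI_PREFIX):
        return line
    rest = line[len(CHI_PREFIX):]
    # \S+ visits every token exactly once, so repeated words are each tagged once.
    return CHI_PREFIX + re.sub(r"\S+", tag_token, rest)

sample = [
    "*CHI: the cat is looking underneath the table@s .\n",
    "*CHI: 猫 fall down@s .\n",
    "*TEA: ok this whole thing is together .\n",
]
for line in sample:
    print(tag_line(line), end="")
```

In the file loop from the answer, this would amount to writing tag_line(line) for every line. Like the answer's code, it leaves wrapped continuation lines that do not begin with *CHI: untouched.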