Python: replace single quotes, except apostrophes
I am performing the following operations on lists of words. I read lines from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitution, and then print each word and punctuation tag on its own line for further processing later. I am unsure how to replace every single quote with a tag while leaving every apostrophe alone. My current method is to use a compiled regex:
apo = re.compile("[A-Za-z]'[A-Za-z]")
and do the following:
if "'" in word and not apo.search(word):
    word = word.replace("'", "\n<singlequote>")
Sample output (after processing and printing to file):
don't
<opensingle>
George
ma'am
end
<period>
<closesingle>
didn't
<period>
<closesingle>
<opensingle>
Won't
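I have a further question concerning this task: since it seems rather difficult to distinguish between <opensingle> and <closesingle>, would it be wiser to perform substitutions like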
word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')
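after performing the replacement operation?
I suggest working smart here: use NLTK or another NLP toolkit instead, and tokenize the words like this: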
import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
You may not like that contractions such as don't are split apart. Actually, this is expected behavior.
However, TweetTokenizer can help with that:
from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize("The code didn't work!")
If it gets more involved, a RegexpTokenizer could be helpful:
from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York. Please don't buy me\njust one of them."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # raw string avoids invalid-escape warnings
tokenizer.tokenize(s)
After that, it should be much easier to annotate the tokenized words correctly.
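For comparison, a similar token pattern can be sketched with only the standard re module. This is a minimal sketch, not NLTK's behavior: the pattern and the helper name simple_tokenize are assumptions made for illustration. It treats letters joined by internal apostrophes or hyphens as one word and emits every other non-space character as its own token.

```python
import re

# Hypothetical pattern (an assumption, not taken from NLTK):
#   - a word is a run of letters, optionally continued by ' or - plus more letters
#   - any other single non-space, non-letter character is its own token
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|[^\sA-Za-z]")


def simple_tokenize(text):
    """Return the word and punctuation tokens found in text, in order."""
    return TOKEN_RE.findall(text)


tokens = simple_tokenize("'Won't you come, ma'am?' said well-bred George.")
print(tokens)
```

Contractions and hyphenated words stay whole, while the quote marks come out as separate tokens. The pattern does not decide whether a separated quote is opening or closing, and a leading elision such as 'em would lose its apostrophe to a separate token.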
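What you need in order to replace the starting and ending ' properly is a regex.
To match them, you should use:
^' for a starting ' (opensingle)
'$ for an ending ' (closesingle)
The replace method does not support regexes, so you should use re.sub instead.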
Below is a sample program that prints the desired output (in Python 3):
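import re

text = "don't 'George ma'am end.' didn't.' 'Won't"
words = text.split(" ")
for word in words:
    # A quote at the very start of a word is an opening quote.
    word = re.sub(r"^'", '<opensingle>\n', word)
    # A quote at the very end of a word is a closing quote.
    word = re.sub(r"'$", '\n<closesingle>', word)
    word = word.replace('.', '\n<period>')
    word = word.replace(',', '\n<comma>')
    print(word)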
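One case the '$ anchor does not cover is a closing quote that is followed by punctuation, as in end'. (quote first, then period). The following is a minimal sketch of one way to allow for that, assuming only periods and commas can trail the quote; the helper name tag_word and the lookahead pattern are illustrative assumptions, not part of the program above.

```python
import re


def tag_word(word):
    """Split one whitespace-delimited word into a list of tokens and tags."""
    # A quote at the very start of the word is an opening quote.
    word = re.sub(r"^'", "<opensingle>\n", word)
    # A quote followed only by periods/commas up to the end of the word
    # is a closing quote (assumption: no other trailing punctuation).
    word = re.sub(r"'(?=[.,]*$)", "\n<closesingle>", word)
    word = word.replace(".", "\n<period>")
    word = word.replace(",", "\n<comma>")
    # Drop empty pieces, e.g. when the word was a lone quote.
    return [token for token in word.split("\n") if token]


for w in "don't 'George end.' end'. 'Won't".split():
    print(tag_word(w))
```

Returning a list instead of printing keeps the tags available for later processing. A plural possessive such as dogs' would still be tagged as a closing quote, which is the same limitation as in the anchored version.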
I think this can benefit from lookahead and lookbehind assertions, which are described in the Python re documentation.
Your data:
words = ["don't",
"'George",
"ma'am",
"end.'",
"didn't.'",
"'Won't",]
Now I will define a tuple of regexes and their replacements:
In [230]: apo = (
(re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>",),
(re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>",),
(re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>", ),
(re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), "<period>",),
)
In [231]: words = ["don't",
"'George",
"ma'am",
"end.'",
"didn't.'",
"'Won't",]
In [232]: reduce(lambda w2,x: [ x[0].sub(x[1], w) for w in w2], apo, words)
Out[232]:
['don<apostrophe>t',
'<opensingle>George',
'ma<apostrophe>am',
'end<period><closesingle>',
'didn<apostrophe>t<period><closesingle>',
'<opensingle>Won<apostrophe>t']
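(reduce is used as a convenient way to apply one regex's .sub to the words/strings, then pass that output on to the next regex's .sub, and so on. In Python 3, reduce has to be imported from functools.)
How do you define a word?
It is just a string in the array generated by words = line.split(). When printing to file, I simply split a line on spaces and use the \n character to split punctuation into tokens on new lines. But I don't want to cut out apostrophes, because I think contractions are properly definable "words" in the literal sense. Another example (which I forgot to mention) is hyphens: I don't want to break up hyphenated words.
I appreciate the introduction to a new tool (and I promise I will use it once I start tokenizing books more complex than Joseph Conrad's Heart of Darkness), but as things stand, NLP feels like overkill. Then again, I tried to do this task with sed, gave up, and moved to Python, so maybe I will be using NLP soon enough.
@malan You're welcome. Sure you will; arguably, if you are processing literature without an NLP toolkit, you are doing it wrong.
I am also interested in deploying the toolkit now: can it recognize the difference between an opening quote and an elision single quote (e.g. the one in "let's get 'em")?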
The same substitutions also work on one long string:
In [233]: onelong = """
don't
'George
ma'am
end.'
didn't.'
'Won't
"""
In [235]: print(
reduce(lambda sentence,x: x[0].sub(x[1], sentence), apo, onelong)
)
don<apostrophe>t
<opensingle>George
ma<apostrophe>am
end<period><closesingle>
didn<apostrophe>t<period><closesingle>
<opensingle>Won<apostrophe>t
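For readers who prefer a plain script to an interactive session, the lookaround approach can be sketched as below. It reuses the four patterns from the session unchanged and swaps reduce for an explicit loop; the names RULES and tag_text are illustrative assumptions.

```python
import re

# The same four (pattern, tag) pairs as in the session, applied in this order.
RULES = (
    (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),     # letter ' letter
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),     # ' before a letter
    (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),   # ' after a letter or period
    (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), "<period>"),        # period after a letter
)


def tag_text(text):
    """Apply every rule in turn, feeding each result to the next rule."""
    for pattern, tag in RULES:
        text = pattern.sub(tag, text)
    return text


words = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't"]
tagged = [tag_text(w) for w in words]
print(tagged)
```

The loop does the same chaining that reduce performs, so the same function works on a single word or on a whole multi-line string. The order of the rules matters: the apostrophe rule runs first, so the later quote rules only see quotes that are not between two letters.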