
Python: replacing single quotes, apostrophes excepted


I'm performing the following operations on a list of words. I read lines from a Project Gutenberg text file, split each line on whitespace, perform general punctuation substitutions, and then print each word and punctuation token on its own line for further processing later. I'm unsure how to replace every single quote with a tag while excluding apostrophes. My current approach uses a compiled regex:

apo = re.compile("[A-Za-z]'[A-Za-z]")
and does the following:

if "'" in word and not apo.search(word):
    word = word.replace("'", "\n<singlequote>")
Example words (after processing and printing to file):

don't
'George
ma'am
end.'
didn't.'
'Won't
One further question about this task: since distinguishing the cases seems quite
difficult, would it be wiser to do substitutions like

word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')

after performing the quote substitution?
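A minimal runnable sketch of the check described above (using the `apo` regex and the `<singlequote>` tag from the snippet) illustrates the difficulty: a word like "didn't.'" contains both a contraction apostrophe and a closing quote, so the `apo` guard skips it entirely.

```python
import re

# The approach described above: replace ' only when the word does not
# contain a letter-apostrophe-letter sequence (i.e., a contraction).
apo = re.compile("[A-Za-z]'[A-Za-z]")

def tag_quotes(word):
    if "'" in word and not apo.search(word):
        word = word.replace("'", "\n<singlequote>")
    return word

print(tag_quotes("'George"))   # quote replaced: no internal apostrophe
print(tag_quotes("didn't.'"))  # quote kept: apo matches the contraction
```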

I suggest working smart here: use NLTK or another NLP toolkit.

Like this:

import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)
You may not like that contractions such as don't get split apart; this is in fact the expected behavior.

However, TweetTokenizer can help with that:

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize("The code didn't work!")
If things get more involved, a RegexpTokenizer may help:

from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York.  Please don't buy me\njust one of them."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)
Correctly annotating the tokenized words should then be much easier.
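As a stdlib-only aside (an illustrative assumption, not a substitute for NLTK's API): a RegexpTokenizer with the pattern above behaves essentially like re.findall with the same regex, so the tokenization can be previewed without installing NLTK:

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please don't buy me\njust one of them."

# Approximating RegexpTokenizer(r'\w+|\$[\d\.]+|\S+') with re.findall:
# word characters, or dollar amounts, or any other non-space run.
tokens = re.findall(r'\w+|\$[\d\.]+|\S+', s)
print(tokens)
```

Note that this pattern splits "don't" into "don" and "'t", just like the NLTK example above.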

What you really need to replace the opening and closing quotes properly
is a regex. To match them you should use:

  • ^' to match an opening quote (<opensingle>)
  • '$ to match a closing quote (<closesingle>)
Unfortunately, the replace method does not support regular expressions, so you should use re.sub instead.

Here is a sample program that prints the desired output (in Python 3):

import re
str = "don't 'George ma'am end.' didn't.' 'Won't"
words = str.split(" ")
for word in words:
    word = re.sub(r"^'", '<opensingle>\n', word)
    word = re.sub(r"'$", '\n<closesingle>', word)
    word = word.replace('.', '\n<period>')
    word = word.replace(',', '\n<comma>')
    print(word)

I think this can benefit from lookahead or lookbehind assertions. See the Python re reference, and a general regex site I often consult.

Your data:

words = ["don't",
         "'George",
         "ma'am",
         "end.'",
         "didn't.'",
         "'Won't",]
Now I'll define a tuple of regexes and their replacements:

In [230]: apo = (
    (re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>",),
    (re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>",),
    (re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>", ),
    (re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), "<period>",),
)
In [231]: words = ["don't",
         "'George",
         "ma'am",
         "end.'",
         "didn't.'",
         "'Won't",]
In [232]: reduce(lambda w2,x: [ x[0].sub(x[1], w) for w in w2], apo, words)
Out[232]: 
['don<apostrophe>t',
 '<opensingle>George',
 'ma<apostrophe>am',
 'end<period><closesingle>',
 'didn<apostrophe>t<period><closesingle>',
 '<opensingle>Won<apostrophe>t']

(The use of reduce is just a convenient way to apply one regex's .sub to the words/strings, then feed that output into the next regex's .sub, and so on; note that in Python 3, reduce must be imported from functools.)
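For readers who find reduce opaque, the same pipeline can be written as an explicit loop (a sketch reusing the apo tuple from above):

```python
import re

# Same (pattern, replacement) pairs as in the answer above.
apo = (
    (re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),
    (re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),
    (re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),
    (re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), "<period>"),
)

words = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't"]

# Explicit-loop equivalent of the reduce() call: each regex's .sub
# output feeds the next regex.
for pattern, replacement in apo:
    words = [pattern.sub(replacement, w) for w in words]

print(words)
```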

How do you define word? It's just a string in the array produced by words = line.split(). When printing to the file, I simply split each line on whitespace and use \n characters to break punctuation out into tokens on their own lines. But I don't want to strip out apostrophes, because I consider contractions to be properly defined "words" in the literal sense. Another case (which I forgot to mention) is hyphens: I don't want to break up hyphenated words. I appreciate the introduction to a new tool (I promise I'll use NLTK once I start tokenizing books more complex than Joseph Conrad's Heart of Darkness), but for the present case NLP feels like overkill. Admittedly, I tried to do this task with sed first and gave up in favor of Python, so maybe I'll get to NLP soon.

@malan You're welcome. Of course you will; arguably, if you're processing literature without an NLP toolkit, you're doing it wrong. I'm also interested in deploying the toolkit now: can it recognize the difference between an opening quote and an elided single quote (e.g. the one in "let's get 'em")?
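On that closing question: a purely positional regex cannot make the distinction. A quick check with the <opensingle> pattern from the answer above (the phrase is just an illustrative example) tags the elided apostrophe of 'em as an opening quote:

```python
import re

# The <opensingle> rule from the answer: an apostrophe not preceded
# by a letter but followed by one.
opensingle = re.compile("(?<![A-Za-z])'(?=[A-Za-z])")

# "'em" (an elided "them") looks exactly like an opening quote here;
# the contraction in "let's" is untouched because 't' precedes it.
tagged = opensingle.sub("<opensingle>", "let's get 'em")
print(tagged)
```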
The same regexes can also be applied to one long string:

In [233]: onelong = """
don't
'George
ma'am
end.'
didn't.'
'Won't
"""

In [235]: print(
    reduce(lambda sentence,x: x[0].sub(x[1], sentence), apo, onelong)
)
don<apostrophe>t
<opensingle>George
ma<apostrophe>am
end<period><closesingle>
didn<apostrophe>t<period><closesingle>
<opensingle>Won<apostrophe>t