Python: replace single quotes, except apostrophes


python regex


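I am performing the following operations on lists of words. I read lines from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitution, and then print each word and punctuation tag on its own line for further processing later. I am unsure how to replace every single quote with a tag while leaving every apostrophe alone. My current method is to use a compiled regex: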

apo = re.compile("[A-Za-z]'[A-Za-z]")

and perform the following operation:

if "'" in word and not apo.search(word):
    word = word.replace("'","\n<singlequote>")
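To see where this check falls short, here is a small sketch. The helper name `tag_quotes_naive` is made up for illustration, and the sample words are the ones used elsewhere in this thread. Any word containing an in-word apostrophe is skipped entirely, even when it also carries a real quote, and the tag does not say whether the quote opens or closes.

```python
import re

apo = re.compile("[A-Za-z]'[A-Za-z]")

def tag_quotes_naive(word):
    """Hypothetical helper wrapping the check from the question."""
    if "'" in word and not apo.search(word):
        word = word.replace("'", "\n<singlequote>")
    return word

for w in ["don't", "'George", "'Won't", "didn't.'"]:
    # repr() makes the inserted newline visible
    print(repr(tag_quotes_naive(w)))
```

Here `'George` gets tagged, but `'Won't` and `didn't.'` pass through unchanged because the apostrophe inside them makes `apo.search` succeed.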

Example output (after processing and printing to file):

don't
<opensingle>
George
ma'am
end
<period>
<closesingle>
didn't
<period>
<closesingle>
<opensingle>
Won't

I have one more question about this task: since distinguishing <opensingle> from <closesingle> seems rather difficult, would it be wiser to perform substitutions like

word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')

after performing the quote substitution?

I suggest working smart here: use NLTK or another NLP toolkit instead.

Tokenize the words like this:

import nltk
sentence = """At eight o'clock on Thursday morning
Arthur didn't feel very good."""
tokens = nltk.word_tokenize(sentence)

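You may not like that contractions such as don't are split apart. This is actually the expected behavior.

However, TweetTokenizer can help with that:

from nltk.tokenize import TweetTokenizer
tknzr = TweetTokenizer()
tknzr.tokenize("The code didn't work!")

If things get more involved, RegexpTokenizer may help: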

from nltk.tokenize import RegexpTokenizer
s = "Good muffins cost $3.88\nin New York.  Please don't buy me\njust one of them."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
tokenizer.tokenize(s)

After that, annotating the tokenized words correctly should be much easier.
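For readers without NLTK at hand, roughly the same splitting can be sketched with the standard `re` module, since the pattern above is an ordinary regular expression. The second pattern below is an assumption of this sketch, not part of the answer: it keeps in-word apostrophes attached and emits every other punctuation mark as its own token.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please don't buy me\njust one of them."

# Same pattern as above: contractions are split ("don", "'t").
split_pattern = re.compile(r"\w+|\$[\d\.]+|\S+")

# Assumed variant: letters joined by an apostrophe stay one token,
# and any other punctuation character becomes a token of its own.
keep_pattern = re.compile(r"\w+(?:'\w+)*|\$[\d\.]+|[^\w\s]")

print(split_pattern.findall(s))
print(keep_pattern.findall(s))
print(keep_pattern.findall("'Won't end.'"))
```

With the variant, a leading or trailing quote ends up as a separate `'` token next to an intact `Won't`, which leaves only the open/close decision to a later step.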


What you really need in order to replace the starting and ending `'` properly is a regex. To match them, you should use:

```
^'
```
for a starting `'` (opensingle), and
```
'$
```
for an ending `'` (closesingle).

Unfortunately, the `replace` method does not support regexes, so you should use `re.sub` instead.

Below is a sample program that prints your desired output (in Python 3):

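import re
text = "don't 'George ma'am end.' didn't.' 'Won't"
words = text.split(" ")
for word in words:
    # tag a quote at the start or end of the word, then the other punctuation
    word = re.sub(r"^'", '<opensingle>\n', word)
    word = re.sub(r"'$", '\n<closesingle>', word)
    word = word.replace('.', '\n<period>')
    word = word.replace(',', '\n<comma>')
    print(word)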
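The same idea can be wrapped in a function that returns the tokens as a list instead of printing them, which may be easier to test. This is only a sketch: the name `tag_word` is made up here, and it assumes at most one leading and one trailing quote per word.

```python
import re

def tag_word(word):
    """Split one whitespace-delimited word into tokens and punctuation tags."""
    word = re.sub(r"^'", "<opensingle>\n", word)   # quote at the start of the word
    word = re.sub(r"'$", "\n<closesingle>", word)  # quote at the end of the word
    word = word.replace(".", "\n<period>")
    word = word.replace(",", "\n<comma>")
    # drop empty pieces, e.g. when the word is only punctuation
    return [token for token in word.split("\n") if token]

text = "don't 'George ma'am end.' didn't.' 'Won't"
for word in text.split():
    for token in tag_word(word):
        print(token)
```

Note that a word-final apostrophe that is not a quote (a plural possessive such as `boys'`) would also be tagged `<closesingle>` by this approach.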

I think this could benefit from lookahead and lookbehind assertions. The Python reference for these is the `re` module documentation.

Your data:

words = ["don't",
         "'George",
         "ma'am",
         "end.'",
         "didn't.'",
         "'Won't",]

Now I'll define a tuple of regexes and their replacements:

In [230]: apo = (
     ...:     (re.compile("(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>",),
     ...:     (re.compile("(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>",),
     ...:     (re.compile("(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>", ),
     ...:     (re.compile("(?<=[A-Za-z])\\.(?![A-Za-z])"), "<period>",),
     ...: )

In [231]: words = ["don't",
     ...:          "'George",
     ...:          "ma'am",
     ...:          "end.'",
     ...:          "didn't.'",
     ...:          "'Won't",]

In [232]: reduce(lambda w2,x: [ x[0].sub(x[1], w) for w in w2], apo, words)
Out[232]: 
['don<apostrophe>t',
 '<opensingle>George',
 'ma<apostrophe>am',
 'end<period><closesingle>',
 'didn<apostrophe>t<period><closesingle>',
 '<opensingle>Won<apostrophe>t']

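(`reduce` is used as a convenient way to apply one regex's `.sub` to the words/strings and then carry that output into the next regex's `.sub`, and so on. On Python 3, import it first with `from functools import reduce`.)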
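As a sketch of the same pipeline as a plain Python 3 script instead of an IPython session: the patterns are copied from the answer above, while the names `RULES` and `annotate` are made up for illustration.

```python
import re
from functools import reduce

# (pattern, replacement) pairs, applied in this order
RULES = (
    (re.compile(r"(?<=[A-Za-z])'(?=[A-Za-z])"), "<apostrophe>"),    # letter ' letter
    (re.compile(r"(?<![A-Za-z])'(?=[A-Za-z])"), "<opensingle>"),    # ' before a letter
    (re.compile(r"(?<=[.A-Za-z])'(?![A-Za-z])"), "<closesingle>"),  # ' after a letter or period
    (re.compile(r"(?<=[A-Za-z])\.(?![A-Za-z])"), "<period>"),       # period after a letter
)

def annotate(text):
    """Apply every rule in order, feeding each result into the next rule."""
    return reduce(lambda acc, rule: rule[0].sub(rule[1], acc), RULES, text)

words = ["don't", "'George", "ma'am", "end.'", "didn't.'", "'Won't"]
print([annotate(w) for w in words])
```

The order matters: the `<closesingle>` rule looks behind for a literal period, so it has to run before the period is replaced by its tag.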

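How do you define a word?

It's just a string in the array produced by words = line.split(). When printing to the file, I simply split a line on whitespace and use the \n character to break punctuation out into tokens on new lines. But I don't want to strip out apostrophes, because I think contractions can properly be defined as "words" in the literal sense. Another example (which I forgot to mention) is hyphens: I don't want to break up hyphenated words.

I appreciate the introduction to a new tool. (I promise I will use it once I start tokenizing books more complex than Joseph Conrad's Heart of Darkness.) As things stand, though, I feel NLP is overkill. Then again, I tried to do this task with sed, gave up, and switched to Python, so maybe I'll be using NLP soon enough.

@malan You're welcome. Sure you will; arguably, if you're processing literature without an NLP toolkit, you're doing it wrong.

I'm also interested in deploying the toolkit now: can it tell the difference between an opening quote and a single quote that marks an elision (for example, the one in "let's get 'em")?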


The `apo` rules from the lookaround answer can also be applied to one long multi-line string instead of a list of words:

In [233]: onelong = """
     ...: don't
     ...: 'George
     ...: ma'am
     ...: end.'
     ...: didn't.'
     ...: 'Won't
     ...: """

In [235]: print(
     ...:     reduce(lambda sentence,x: x[0].sub(x[1], sentence), apo, onelong)
     ...: )

don<apostrophe>t
<opensingle>George
ma<apostrophe>am
end<period><closesingle>
didn<apostrophe>t<period><closesingle>
<opensingle>Won<apostrophe>t