Python: string to a tuple of words


python regex python-3.x


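I define a word as a sequence of characters (from a to Z) that may also contain apostrophes. I want to split a sentence into words, with the apostrophes removed.

I am currently doing the following to get the words from a piece of text: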

import re
text = "Don't ' thread \r\n on \nme ''\n "
words_iter = re.finditer(r'(\w|\')+', text)
words = (word.group(0).lower() for word in words_iter)
for i in words:
    print(i)

This gives me:

don't
'
thread
on
me
''

But what I want is:

dont
thread
on
me

How do I change the code to achieve this?

Note that there is no ' in my output.

I would also like words to be a generator.

This looks like a job for regex:

import re

text = "Don't ' thread \r\n on \nme ''\n "

# Define a function so as to make a generator
def get_words(text):

    # Find each block, separated by spaces
    for section in re.finditer(r"[^\s]+", text):

        # Get the text from the selection, lowercase it
        # (`.lower()` for Python 2 or if you hate people who use Unicode)
        section = section.group().casefold()

        # Filter so only letters are kept and yield
        section = "".join(char for char in section if char.isalpha())
        if section:
            yield section

list(get_words(text))
#>>> ['dont', 'thread', 'on', 'me']

Explanation of the regex:

[^    # An "inverse set" of characters, matches anything that isn't in the set
\s    # Any whitespace character
]+    # One or more times

So it matches any block of non-whitespace characters.
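The question defines a word more narrowly, as ASCII letters plus apostrophes. A minimal sketch under that assumption follows; the names WORD_RE and iter_words are made up for illustration and do not come from any of the answers.

```python
import re

# Assumption: a "word" is a run of ASCII letters and apostrophes,
# as stated in the question.
WORD_RE = re.compile(r"[A-Za-z']+")

def iter_words(text):
    """Lazily yield lowercase words with apostrophes removed."""
    for match in WORD_RE.finditer(text):
        word = match.group().replace("'", "").lower()
        if word:  # skip chunks that consisted only of apostrophes
            yield word

text = "Don't ' thread \r\n on \nme ''\n "
print(list(iter_words(text)))
#>>> ['dont', 'thread', 'on', 'me']
```

Unlike the whitespace-based pattern above, this variant treats every other character as a separator, so "ab-cd" produces two words instead of one.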


Using str.translate and re.finditer:

>>> import re
>>> from string import punctuation
>>> text = "Don't ' thread \r\n on \nme ''\n "
>>> tab = dict.fromkeys(map(ord, punctuation))
>>> def solve(text):
...     for m in re.finditer(r'\b(\S+)\b', text):
...         x = m.group(1).translate(tab).lower()
...         if x:
...             yield x
...
>>> list(solve(text))
['dont', 'thread', 'on', 'me']
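To make the translation table less opaque, the short sketch below shows what tab contains and how str.translate uses it. The sample strings are invented for illustration.

```python
from string import punctuation

# Map every ASCII punctuation code point to None, which tells
# str.translate to delete that character.
tab = dict.fromkeys(map(ord, punctuation))

print(tab[ord("'")])                  # None
print("Don't".translate(tab))         # Dont
print("it's--fine!".translate(tab))   # itsfine

# string.punctuation is ASCII-only, so a typographic apostrophe
# (U+2019) is not in the table and survives the translation.
print("don\u2019t".translate(tab) == "don\u2019t")  # True
```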

Timing comparison:
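One way to reproduce such a comparison is the standard timeit module. This is only a sketch: the two functions are restated from the answers above, while the input size and the repeat count are arbitrary assumptions, and the results will vary by machine.

```python
import re
import timeit
from string import punctuation

# Arbitrary test input: the sample text repeated many times.
text = "Don't ' thread \r\n on \nme ''\n " * 1000

def get_words(text):
    # isalpha-based filtering, as in the first answer
    for section in re.finditer(r"[^\s]+", text):
        section = "".join(c for c in section.group().casefold() if c.isalpha())
        if section:
            yield section

tab = dict.fromkeys(map(ord, punctuation))

def solve(text):
    # str.translate-based filtering, as in the second answer
    for m in re.finditer(r"\b(\S+)\b", text):
        x = m.group(1).translate(tab).lower()
        if x:
            yield x

for func in (get_words, solve):
    seconds = timeit.timeit(lambda: list(func(text)), number=20)
    print(f"{func.__name__}: {seconds:.3f} s for 20 runs")
```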


…only iterates over the split data once.

If the data set is large, use re.finditer instead of str.split() to avoid reading the whole data set into memory:

# re.finditer yields match objects, so take the matched text with .group()
words = (m.group().replace("'", '') for m in re.finditer(r'[^\s]+', text))
result = tuple(x for x in words if x)

…though tuple()-ing the data will read everything into memory anyway.
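If the text comes from a file, the words can also be produced one line at a time, so that neither the whole file nor the full result has to be held in memory. The sketch below assumes that words never span a line break; words_from_lines is a hypothetical helper name, and io.StringIO merely stands in for an open text file.

```python
import io
import re

WORD_CHUNK = re.compile(r"\S+")

def words_from_lines(lines):
    """Yield cleaned words from any iterable of lines, one line at a time."""
    for line in lines:
        for match in WORD_CHUNK.finditer(line):
            word = match.group().replace("'", "").lower()
            if word:  # skip chunks that were only apostrophes
                yield word

# io.StringIO stands in for an open text file here.
fake_file = io.StringIO("Don't ' thread \r\n on \nme ''\n ")
print(list(words_from_lines(fake_file)))
#>>> ['dont', 'thread', 'on', 'me']
```

Consuming the generator in a for loop, rather than wrapping it in list() or tuple(), keeps memory use bounded by the longest line.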


text.split() processes the whole shebang at once; Baz wants to process the text incrementally.

Nice, but what about dashes? "word1---word2 word3"

@hcwhsa text.lower() will create a copy of the text, which I want to avoid because it could be very big.

@hcwhsa As I said in my question, I define a word as containing only a to Z plus apostrophes. Your code will let through letters from non-English alphabets, digits, and so on.

@Baz You can use a regex:

x = re.sub(r'[^a-z]+', '', m.group(1), flags=re.I).lower()
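As a sketch of how that suggestion might slot into the str.translate answer, the function below replaces the translate call with the re.sub call from the comment. The name solve_ascii is made up for illustration.

```python
import re

def solve_ascii(text):
    for m in re.finditer(r"\b(\S+)\b", text):
        # Drop everything that is not an ASCII letter, then lowercase.
        x = re.sub(r"[^a-z]+", "", m.group(1), flags=re.I).lower()
        if x:
            yield x

print(list(solve_ascii("Don't ' thread \r\n on \nme ''\n ")))
#>>> ['dont', 'thread', 'on', 'me']
print(list(solve_ascii("word1---word2 word3")))
#>>> ['wordword', 'word']
```

The second call shows the dash case raised above: because the outer pattern splits only on whitespace, "word1---word2" is treated as one block and its letters are merged into a single word.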

As with hcwhsa's, text.split() processes the whole shebang at once; Baz wants to process the text incrementally.

Aren't you almost there? Just add i = i.replace("'", "") in your for loop, then yield the string if it isn't empty.

How much input do you want to parse?

@Tritium21 I'm building a corpus, so I'm processing text files of varying lengths.

Doesn't remove the ' in "don't".

Note that this is about 4 times slower than my version.

I'll accept that timing after you add .casefold or .lower to your version. My approach is technically more correct, because I check isalpha rather than deleting ASCII punctuation, but I'll let that slide.

I think you could use .lower() in the re.finditer call to lowercase it just once. That would act on the whole string at once, though, which is what we want to avoid.

Updated my solution with new timings.
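The differences these comments refer to can be seen in a few lines. This is an illustrative sketch only; letters_only is a hypothetical helper and the sample strings are invented.

```python
# str.casefold is more aggressive than str.lower for caseless matching.
print("Straße".lower())      # straße
print("Straße".casefold())   # strasse

def letters_only(word):
    """Keep only characters that str.isalpha accepts (any Unicode letter)."""
    return "".join(c for c in word if c.isalpha())

# isalpha-based filtering also drops digits and non-ASCII punctuation,
# while keeping accented letters.
print(letters_only("abc123"))        # abc
print(letters_only("don\u2019t"))    # dont
print(letters_only("naïve"))         # naïve
```

Note that isalpha also accepts non-English letters, which is broader than the a-to-Z definition given in the question.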