Python: return the word count for each text in a list of 2,000+ texts

python regex pandas

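I'm almost certain I'm overlooking something very obvious, so I'm asking this question fully expecting to be embarrassed: I have a pandas dataframe with more than 2,000 texts in one column. My original goal was (and still is) to count the words in each text and use that count to create a new column in the dataframe.

To simplify matters, I pulled the text column out into a list of strings with: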

texts = data.text.tolist()

The type is list, and the len of the list is 2113, which is the number of rows in the dataframe. My current attempt is:

word_counts = []
for text in texts:
    count = len(re.findall(r"[a-zA-Z_]+", text))
    word_counts.append(count)

What I get:

TypeError: expected string or buffer

If I evaluate a single text:

len(re.findall(r"[a-zA-Z_]+", texts[0]))

I get the expected result: 2176

What am I not seeing?

Edited to add a sample:

texts[0].split()[:10]

['Thank', 'you', 'so', 'much', 'Chris.', 'And', 
"it's", 'truly', 'a', 'great']

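These are transcripts of talks, so there is some punctuation and perhaps a few numbers.

You can create a function that returns the word count of each string (the len of the list of matches), and apply that function to the pd.Series containing the strings: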

import re
import pandas as pd

data = pd.DataFrame(
    {'text': ["This is-four words.", "This is five whole words."]})
data
#   text
# 0 This is-four words.
# 1 This is five whole words.

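def count_words(cell):
    try:
        return len(re.findall(r"[a-zA-Z_]+", cell))
    except TypeError:
        # re.findall raises TypeError for non-strings such as NaN or None;
        # pass those values through unchanged
        return cell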

data['word_count'] = data['text'].apply(count_words)
data

#   text                        word_count
# 0 This is-four words.         4
# 1 This is five whole words.   5

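However, if you know that the words in each text are separated only by whitespace (i.e., not by underscores or dashes), then I'd suggest the following instead: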

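def count_words2(cell):
    try:
        return len(cell.split())
    except AttributeError:
        # non-strings such as NaN or None have no .split();
        # pass those values through unchanged
        return cell

# str(x) never raises, but note that a NaN becomes "nan" and counts as 1 word
count_words3 = lambda x: len(str(x).split())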

This is much faster than using the regex. In a Jupyter notebook:

test_str = "test " * 1000
%timeit count_words(test_str)
%timeit count_words2(test_str)
%timeit count_words3(test_str)
# 10000 loops, best of 3: 158 µs per loop
# 10000 loops, best of 3: 29.8 µs per loop
# 10000 loops, best of 3: 28.7 µs per loop
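As a possible alternative to apply, pandas also has vectorized string methods that propagate missing values instead of raising. This is only a sketch: the third row (a NaN) is a hypothetical stand-in for a missing transcript, and the column names regex_count and split_count are made up for illustration. I haven't timed these against the functions above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the NaN row stands in for a missing transcript.
data = pd.DataFrame(
    {'text': ["This is-four words.", "This is five whole words.", np.nan]})

# Regex-based count: Series.str.count counts non-overlapping pattern matches.
data['regex_count'] = data['text'].str.count(r"[a-zA-Z_]+")

# Whitespace-based count: split each string, then take the length of each list.
data['split_count'] = data['text'].str.split().str.len()

print(data)
```

Both new columns hold a missing value in the row where the text is missing, so you can decide afterwards whether to fill those with 0 or drop them.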

I don't think you need a regex, and you don't need to dump the values into a list either. You could try a lambda function instead:

import re
import pandas as pd

df = pd.DataFrame({'col1': ['Hello world', 'Hello, there world', 'Hello']})
df
#                  col1
# 0         Hello world
# 1  Hello, there world
# 2               Hello

Then you can use a lambda function:

df['count'] = df['col1'].apply(lambda x: len(str(x).split()))
df
#                  col1  count
# 0         Hello world      2
# 1  Hello, there world      3
# 2               Hello      1

Or, if you do want to use a regex, you can still use a lambda function:

df['count'] = df['col1'].apply(lambda x: len(re.findall(r"[a-zA-Z_]+", x)))
df
#                  col1  count
# 0         Hello world      2
# 1  Hello, there world      3
# 2               Hello      1
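The two lambdas above agree on this sample, but they can disagree on text with hyphens, apostrophes, or digits. A minimal sketch of the difference, using a made-up sentence (not taken from the asker's transcripts):

```python
import re

# Made-up sentence mixing an apostrophe, hyphens, a digit and an underscore.
sample = "It's a well-known, 2-step snake_case example."

# Whitespace splitting keeps punctuation attached and hyphenated words whole.
by_whitespace = sample.split()

# The regex keeps only runs of ASCII letters and underscores, so it splits
# at apostrophes and hyphens and drops digits entirely.
by_regex = re.findall(r"[a-zA-Z_]+", sample)

print(len(by_whitespace), by_whitespace)
# 6 ["It's", 'a', 'well-known,', '2-step', 'snake_case', 'example.']
print(len(by_regex), by_regex)
# 8 ['It', 's', 'a', 'well', 'known', 'step', 'snake_case', 'example']
```

Which count is "right" depends on whether you want a hyphenated compound or a contraction to count as one word or two.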

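Can you paste some examples of your texts here? Sorry, I don't see why you would get "TypeError: expected string or buffer". Do your texts contain underscores or dashes, or is every word separated by whitespace?

This only works if all words are separated by whitespace. If there is a hyphen between two words, this method treats them as one word.

I'm no expert in English grammar, but I'm fairly sure hyphenated words (I think they're called compound adjectives) should count as one word, not two. I think the same goes for underscores. In any case, my answer also includes the regex solution in a lambda function.

I'm looking at this and trying it right now... in my notebook. (Are you in my study?) I get the following error:

--> 210     return _compile(pattern, flags).findall(string)
    211
    212 if sys.hexversion >= 0x02020000:

TypeError: expected string or buffer

... I'm thinking I may have null values for some of the texts, and that's causing the problem. (Could that be?)

Are you sure string is actually of type str? It looks like it's something else. Maybe a list?

Could be. Do you have any NaN or None values?

About the strings: type(data.text[0]) returns str.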
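The suspicion raised in the comments, that some rows hold NaN or None rather than a string, can be checked directly. Here is a rough sketch under a few assumptions: Python 3, a small made-up dataframe standing in for the real one (the middle row is deliberately missing), and a count_words helper that returns 0 for non-strings, which is my choice and not something from the question.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: the middle row has no transcript.
data = pd.DataFrame(
    {'text': ["Thank you so much Chris.", np.nan, "And it's truly a great"]})

# Which Python types actually occur in the column?
print(data['text'].map(type).value_counts())

# How many rows are missing?
n_missing = data['text'].isna().sum()
print("missing rows:", n_missing)

# Passing a non-string (here NaN, a float) to re.findall raises TypeError.
try:
    re.findall(r"[a-zA-Z_]+", np.nan)
except TypeError as exc:
    print("TypeError:", exc)

def count_words(cell):
    """Return the regex word count, or 0 for anything that is not a string."""
    if not isinstance(cell, str):
        return 0
    return len(re.findall(r"[a-zA-Z_]+", cell))

data['word_count'] = data['text'].apply(count_words)
print(data)
```

Checking only type(data.text[0]) tells you about the first row; a single missing value anywhere else in the 2113 rows is enough to make the loop fail.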