Creating and appending 2 columns to a dataframe via a custom function - Python 3.x

I have a dataframe named csv_table that looks like this:

      class                      ID                                               text
0         2  BIeDBg4MrEd1NwWRlFHLQQ  Decent but terribly inconsistent food. I've ha...
1         4  NJHPiW30SKhItD5E2jqpHw  Looks aren't everything.......  This little di...
2         2  nnS89FMpIHz7NPjkvYHmug  Being a creature of habit anytime I want good ...
3         2   FYxSugh9PGrX1PR0BHBIw  I recently told a friend that I cant figure ou...
4         4  ScViKtQ2xq6i5AyN4curYQ  Chevy's five years ago was crisp and fresh and...
5         2  vz8Q37FSlypZlgy5N7Ym0A  Every time I go to this Jack In The Box I get ...
6         4   OJuG2EvItSZXbu8KowI9A  I've been going to Cluckers for years. Every t...
7         4   k9ci6SfI5RZT3smNdnvSg  .                                             ...
8         4  qq6bQbrBZyd4lOBd8KSCoA  Well, after their remodel the place no longer ...
9         4     FldFfwfuk9T8kvkp8iw  Beer selection was good, but they were out of ...
10        4  63ufCUqbPcnl6abC1SBpvQ  Ihop is my favorite breakfast chain, and the s...
11        4   nDYCZDIAvdcx77EcmYz0Q  A very good Jewish deli tucked in and amongst ...
12        4  uoC1llZumwFKgXAMlDbZIg  Went here for lunch with Rand H. and this plac...
13        2   BBs1rbz75dDifvoQyVMDg  Picture the least attractive person you'd sett...
14        4    2t9znjapzhioLqb4Pf1Q  Really really really strong Margaritas!   The ...
15        4  GqLgixGcbWh51IzkwsiswA  I would not have known about this place had it...
[1999 rows x 3 columns]

I'm trying to add 2 columns to csv_table: one giving the number of words in the text column (a "word" being anything delimited by spaces), and one giving the number of "clean" words as defined by a custom function.

I am able to compute the total clean and dirty word counts, but how do I apply these functions to each row of the dataframe and append the results as columns?

The code is as follows:

import nltk, re, pandas as pd
from nltk.corpus import stopwords
import sklearn, string
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from itertools import islice

# This function removes numbers from an array
def remove_nums(arr): 
    # Declare a regular expression
    pattern = '[0-9]'  
    # Remove the pattern, which is a number
    arr = [re.sub(pattern, '', i) for i in arr]    
    # Return the array with numbers removed
    return arr

# This function cleans the passed in paragraph and parses it
def get_words(para):   
    # Create a set of stop words
    stop_words = set(stopwords.words('english'))
    # Split it into lower case    
    lower = para.lower().split()
    # Remove punctuation
    no_punctuation = (nopunc.translate(str.maketrans('', '', string.punctuation)) for nopunc in lower)
    # Remove integers
    no_integers = remove_nums(no_punctuation)
    # Remove stop words
    dirty_tokens = (data for data in no_integers if data not in stop_words)
    # Ensure it is not empty
    tokens = [data for data in dirty_tokens if data.strip()]
    # Ensure there is more than 1 character to make up the word
    tokens = [data for data in tokens if len(data) > 1]

    # Return the tokens
    return tokens 

def main():

    tsv_file = "filepath"
    csv_table=pd.read_csv(tsv_file, sep='\t')
    csv_table.columns = ['class', 'ID', 'text']

    print(csv_table)

    s = pd.Series(csv_table['text'])
    new = s.str.cat(sep=' ')
    clean_words = get_words(new)
    dirty_words = [word for word in new if word.split()]
    clean_length = len(clean_words)
    dirty_length = len(dirty_words)
    print("Clean Length: ", clean_length)
    print("Dirty Length: ", dirty_length)


main()
This currently produces:

Clean Length:  125823
Dirty Length:  1091370
I did try:

csv_table['clean'] = csv_table['text'].map(get_words(csv_table['text']))

which resulted in:

AttributeError: 'Series' object has no attribute 'lower'

How do I apply the dirty/clean logic to each row and append these two columns to the dataframe?

Use .apply to run a function on every row. For the dirty word count, you can split the string with pandas and then apply len to get the count. For the clean word count, apply the custom function directly:

csv_table['dirty'] = csv_table['text'].str.split().apply(len)
csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))
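
As a side note on the AttributeError the asker hit: get_words(csv_table['text']) evaluates the function once, with the whole Series as its argument, so para.lower() fails before map ever runs; passing the function itself lets pandas call it once per row. A minimal sketch of the difference, using a hypothetical two-row frame rather than the asker's data (it assumes the question's get_words and its nltk data are already available):

import pandas as pd

demo = pd.DataFrame({'text': ["Decent but terribly inconsistent food",
                              "Looks aren't everything"]})

# Broken: get_words(...) runs immediately with the whole Series as `para`,
# so para.lower() raises AttributeError before map() is ever applied.
# demo['clean'] = demo['text'].map(get_words(demo['text']))

# Working: hand the function (or a lambda wrapping it) to apply/map,
# and pandas calls it once per cell.
demo['dirty'] = demo['text'].str.split().apply(len)
demo['clean'] = demo['text'].apply(lambda s: len(get_words(s)))
print(demo)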

In general, if you have a function myFunc that takes the data of a single cell and returns the desired result, you can do df['new_col'] = df['text'].map(myFunc), which gives you the result of the function for each row in new_col.
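
As a minimal, self-contained illustration of that pattern (the frame and myFunc below are made up for this sketch, not the asker's data):

import pandas as pd

# A made-up frame and per-cell function, purely for illustration
df = pd.DataFrame({'text': ['one two three', 'four five']})

def myFunc(s):
    # Count space-delimited words in a single string
    return len(s.split())

# map() calls myFunc once per cell; each return value lands in new_col
df['new_col'] = df['text'].map(myFunc)
print(df)  # new_col holds 3 for the first row and 2 for the second
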
How do I do that and then add it back into this dataframe? Do you have an example you could post? @aryerez

When you do that, your dataframe will have a new column named 'new_col'. If you want to replace the existing 'text' column instead, do df['text'] = df['text'].map(myFunc).
I don't want to replace the existing column. I'm also unclear on what should be passed to get_words, since passing csv_table['text'] passes the whole column... not just a row.

Actually, csv_table['clean'] = csv_table['text'].map(get_words(csv_table['text'])) is a mistake; you have to .apply(len) to the second line as well.

@jorijnsmit: You're right, I mistakenly assumed that that is what get_words returns.
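
For completeness, a sketch of the variant discussed in the last two comments: mapping get_words itself (rather than calling get_words(csv_table['text'])) yields a Series of token lists, so the length still has to be taken per row, for example with .apply(len). This assumes the csv_table and get_words from the question and is just an equivalent alternative to the one-liner in the answer:

# One token list per row, then one length per row
csv_table['clean'] = csv_table['text'].map(get_words).apply(len)

# Equivalent to the answer's one-liner:
# csv_table['clean'] = csv_table['text'].apply(lambda s: len(get_words(s)))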