How do I define a variable that is used on every iteration of a function in a class in Python?

python pandas class

I have a class that defines a method, define_stop_words, which returns a list of string tokens. I then apply another method, remove_stopwords, to a pandas DataFrame df that contains text. The code looks like this:

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question

    def define_stop_words(self):
        names = ['john','sally','billy','sarah']
        stops = ['if','the','in','then','an','a']
        return stops+names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words()]


import pandas as pd
df = pd.read_csv('data.csv')
parse = ProcessText(flag=True)
df['text'] = df['text'].apply(parse.remove_stopwords())

My question is: will remove_stopwords call define_stop_words and rebuild the list it returns every time, that is, for every word in text and for every row in df (basically on every iteration)?

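If that is the case, I don't want it to run like this, because it will be very slow and inefficient. I would like to define the value returned by define_stop_words once, like a "global variable" inside the ProcessText class, and then reuse it many times in remove_stopwords (for every word and row in df).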

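Is there a way to do this, and should it be done? What is the best practice in this situation?

You can cache the word lists by setting them in __init__, so that the operation runs only once. Then, instead of a define_stop_words() function, expose the result as a property: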

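class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question
        self._names = ['john','sally','billy','sarah']
        self._stops = ['if','the','in','then','an','a']

    @property
    def define_stop_words(self):
        return self._stops + self._names

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.define_stop_words]

Note that Python has no real concept of private variables (I assume you would want private variables here, since you don't want users to be able to overwrite these lists after they are created?). This means a malicious user of the code can still update the _names and _stops attributes of a ProcessText object after the initializer has run, which would give you unexpected results.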

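Another thing to consider is using a set instead of a list (especially if performance is a concern), because hash lookups are faster. And if you want to be even pickier, merging the lists and caching the merged set would be faster than performing the "addition" on every access to the property, so that the property simply returns a cached set.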
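A minimal sketch of that last suggestion, assuming the stop words never change after construction: the two lists are merged once in __init__ and kept as a frozenset. The names _stop_words and stop_words are hypothetical and do not appear in the original code, and text is assumed to be an iterable of word tokens.

```python
class ProcessText:

    def __init__(self, flag):
        self.flag = flag  # not important for this question
        names = ['john', 'sally', 'billy', 'sarah']
        stops = ['if', 'the', 'in', 'then', 'an', 'a']
        # Merge once. A frozenset gives hash-based lookups and cannot be
        # mutated in place (the attribute itself can still be rebound).
        self._stop_words = frozenset(stops + names)

    @property
    def stop_words(self):
        # Returns the cached set; nothing is rebuilt here.
        return self._stop_words

    def remove_stopwords(self, text):
        # text is assumed to be an iterable of word tokens.
        stop_words = self.stop_words  # look the set up once per call
        return [word for word in text if word not in stop_words]


parse = ProcessText(flag=True)
print(parse.remove_stopwords(['if', 'john', 'runs', 'in', 'the', 'park']))
# ['runs', 'park']
```

Binding the set to a local name before the comprehension also avoids repeating the property access for every word.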

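As the code is written, define_stop_words is called once for every word on each call to remove_stopwords, because the condition of a list comprehension is evaluated again for every element.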
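One way to check how often define_stop_words actually runs is to count the calls. The sketch below is a stripped-down version of the question's class (the flag argument is left out), and the calls counter is a hypothetical addition made only for this demonstration.

```python
class ProcessText:
    calls = 0  # hypothetical counter, added only for this demonstration

    def define_stop_words(self):
        ProcessText.calls += 1
        names = ['john', 'sally', 'billy', 'sarah']
        stops = ['if', 'the', 'in', 'then', 'an', 'a']
        return stops + names

    def remove_stopwords(self, text):
        # The condition below runs once per word, so the method call does too.
        return [word for word in text if word not in self.define_stop_words()]


parse = ProcessText()
rows = [['if', 'john', 'runs'], ['sally', 'sings', 'in', 'the', 'rain']]
cleaned = [parse.remove_stopwords(row) for row in rows]
print(cleaned)            # [['runs'], ['sings', 'rain']]
print(ProcessText.calls)  # 8, one call per word across both rows
```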
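One way to have it computed only once per instance, but not when the instance is initialized (because you may have many such methods, all of them expensive, and you don't always need all of them), is to use the following: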

class ProcessText:

    def __init__(self, flag):
        self.flag = flag # not important for this question
        self._stop_words = None

    @property
    def stop_words(self):
        # Build the set on first access only, then reuse it.
        if self._stop_words is None:
            self._stop_words = set(['john','sally','billy','sarah'])
            self._stop_words |= set(['if','the','in','then','an','a'])
        return self._stop_words

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.stop_words]
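The same lazy, once-per-instance behaviour can probably be had with less code through functools.cached_property from the standard library. This is a sketch of an alternative, not the answer's own method, and it assumes Python 3.8 or newer.

```python
from functools import cached_property


class ProcessText:

    def __init__(self, flag):
        self.flag = flag  # not important for this question

    @cached_property
    def stop_words(self):
        # Runs on first access only; the result is then stored on the instance.
        names = ['john', 'sally', 'billy', 'sarah']
        stops = ['if', 'the', 'in', 'then', 'an', 'a']
        return frozenset(stops + names)

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.stop_words]


parse = ProcessText(flag=True)
print('stop_words' in vars(parse))  # False: nothing computed yet
print(parse.remove_stopwords(['the', 'cat']))  # ['cat']
print('stop_words' in vars(parse))  # True: cached on the instance
```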

You can assign these names to class variables, like this:

class ProcessText:
    names = ['john','sally','billy','sarah']
    stops = ['if','the','in','then','an','a']

    def __init__(self, flag):
        self.flag = flag # not important for this question

    def remove_stopwords(self, text):
        return [word for word in text if word not in self.names + self.stops]

import pandas as pd
df = pd.read_csv('data.csv')
parse = ProcessText(flag=True)
# Pass the method itself to apply; do not call it here.
df['text'] = df['text'].apply(parse.remove_stopwords)

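These class variables are shared by all instances. Assignments inside the __init__() method, by contrast, are executed again every time a new instance is created.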
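A short sketch of the class-variable approach that also merges the lists once, when the class body is executed. The upper-case names NAMES, STOPS and STOP_WORDS are hypothetical and chosen only for illustration.

```python
class ProcessText:
    NAMES = ['john', 'sally', 'billy', 'sarah']
    STOPS = ['if', 'the', 'in', 'then', 'an', 'a']
    # Built a single time, when the class body runs, and shared by all instances.
    STOP_WORDS = frozenset(STOPS + NAMES)

    def __init__(self, flag):
        self.flag = flag  # not important for this question

    def remove_stopwords(self, text):
        # text is assumed to be an iterable of word tokens.
        return [word for word in text if word not in self.STOP_WORDS]


a = ProcessText(flag=True)
b = ProcessText(flag=False)
print(a.STOP_WORDS is b.STOP_WORDS)  # True: one shared object
print(a.remove_stopwords(['billy', 'walks', 'the', 'dog']))  # ['walks', 'dog']
```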
Set it on self or, better, on the class. Totally unrelated, but ProcessText is a verb, so it is not a proper name for a class: you usually use nouns for classes, objects and variables ("things") and verbs for functions/methods ("actions"). TextProcessor might be a better choice. The same goes for the parse variable (what about processor instead?). Also, I would make remove_stopwords a generator and let the caller take care of turning it into a list or whatever container they want.

Show a few rows from a sample DataFrame. I'm pretty sure your use of apply is completely incorrect.

Why not define it as a class attribute? Why use a method to compute a constant?

@MadPhysicast You are right, if it really is as simple as the OP shows. If it is actually some expensive function, then doing it this way might be better.
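The renaming and generator suggestions from the comments could look roughly like this. It is only a sketch: the STOP_WORDS class attribute is a hypothetical name, and each text is assumed to be an iterable of word tokens rather than a raw string.

```python
class TextProcessor:
    # Hypothetical class attribute holding the merged stop words.
    STOP_WORDS = frozenset(
        ['if', 'the', 'in', 'then', 'an', 'a', 'john', 'sally', 'billy', 'sarah']
    )

    def remove_stopwords(self, text):
        # Generator: yields the kept words lazily, one at a time.
        for word in text:
            if word not in self.STOP_WORDS:
                yield word


processor = TextProcessor()
words = ['if', 'sarah', 'reads', 'a', 'book']
print(list(processor.remove_stopwords(words)))   # ['reads', 'book']
print(tuple(processor.remove_stopwords(words)))  # ('reads', 'book')
```

With a DataFrame whose text column holds token lists, the call would presumably be something like df['text'].apply(lambda tokens: list(processor.remove_stopwords(tokens))), which passes a function to apply instead of calling remove_stopwords with no arguments.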