Python: filter English words of a specific length

python pandas

I have this list (from a column: df['Text'].tolist()). I only want to keep words that are longer than 2 characters and are English. My attempt is as follows:

- Apply a filter that keeps words longer than 2 characters:

new_corpus = list( map(lambda words: list(filter(lambda word: len(word)> 2, words)), my_list))

- Then apply detect() to each element of the list:

def det(x):
    lang = detect(x)
    return lang

new_corpus.apply(det)

The problem is that the first snippet gives me nothing but [] (empty elements), so I cannot apply any detect function to the list.

My expected output is:

my_list = ['came',
 'moreover',
 'sah', # it depends on detect function, if it selects this element as English or not
 'esketamine',
 'accredited',
 'condition',
 'tailored',
 'acts',
 'terms',

 'demonstrate',
 'amidst',    # it depends on detect function, if it selects this element as English or not
 'atotxa',    # it depends on detect function, if it selects this element as English or not
 'design',
 'ante',
 'ebsite',    # it depends on detect function, if it selects this element as English or not
 'problems',
 'oncosomes', # it depends on detect function, if it selects this element as English or not
 'gradient',
 'tenable',
 'processing',
 'elemental',
 'card',
 'spreads',
 'airlines',
 'desde',      # it depends on detect function, if it selects this element as English or not
 'retains'
]
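
The empty lists from the first attempt can be reproduced without any language detection. A minimal sketch, assuming my_list is a flat list of strings (the sample words here are only illustrative): the outer map hands each string to the inner filter, which then iterates over single characters, and no single character has a length greater than 2.

```python
# Assumption: my_list is a flat list of strings, as df['Text'].tolist() would
# return for a column holding one word per row.
my_list = ["came", "moreover", "sah", "of"]

# The original attempt: each element of my_list is treated as a list of words,
# so the inner filter walks over the characters of each string.
nested = list(map(lambda words: list(filter(lambda word: len(word) > 2, words)), my_list))
print(nested)  # [[], [], [], []]

# Filtering the strings themselves keeps the words longer than 2 characters.
long_words = [word for word in my_list if len(word) > 2]
print(long_words)  # ['came', 'moreover', 'sah']
```

The nested map/filter form would only make sense if each element were itself a list of words, for example after tokenizing whole sentences.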

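The simplest way to achieve this is a list comprehension, which lets you reduce the whole logic to a single line. An implementation could be: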

new_corpus = [word for word in my_list if len(word) > 2 and detect(word)]

This approach can also be used to build the filtered list directly from the DataFrame. An implementation could be:

new_corpus = [word for word in df['Text'].tolist() if len(word) > 2 and detect(word)]

Note, however, that this does not keep df['Text'].tolist() around for later use.
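
Two details of the one-liner depend on what detect actually is. The sketch below assumes that detect returns a language code string (as langdetect.detect does) and may raise an exception on input it cannot classify. Under that assumption the result has to be compared with 'en' explicitly, and the length check has to run first. The names filter_english, detect_fn, min_len, and the stand-in detector fake_detect are hypothetical and exist only to make the example runnable without extra packages.

```python
def filter_english(words, detect_fn, min_len=3, lang="en"):
    """Keep words with at least min_len characters that detect_fn labels as lang."""
    kept = []
    for word in words:
        # Check the length first so the detector never sees very short input.
        if len(word) < min_len:
            continue
        try:
            # Compare explicitly: a non-empty code such as 'es' is truthy on its own.
            if detect_fn(word) == lang:
                kept.append(word)
        except Exception:
            # Assumption: the detector raises on text it cannot handle; skip such words.
            continue
    return kept


def fake_detect(text):
    """Hypothetical stand-in for a real detector; not a real language model."""
    if not text.isalpha():
        raise ValueError("no usable features in text")
    return "es" if text in {"desde", "ante"} else "en"


my_list = ["came", "of", "desde", "design", "123"]
new_corpus = filter_english(my_list, fake_detect)
print(new_corpus)  # ['came', 'design']
```

If detect is instead a custom function that already returns True or False for English, the plain `len(word) > 2 and detect(word)` condition is sufficient.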


Let's do it like this:

cond1 = df['Text'].str.len() > 2
cond2 = df['Text'].map(detect)=='en'

df_sub = df[cond1 & cond2]
#df.loc[cond1 & cond2, 'Text'].tolist()
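
In the mask-based version, map(detect) runs on every row, including the short ones that cond1 would discard, so the length check does not shield detect from short input. Below is a sketch of one way around that, assuming pandas is available. detect_stub is a hypothetical stand-in for the real detect function, and the sample frame is made up for illustration.

```python
import pandas as pd


def detect_stub(text):
    """Hypothetical stand-in for detect(); raises on short input to mimic a failure."""
    if len(text) <= 2:
        raise ValueError("text too short")
    return "es" if text == "desde" else "en"


df = pd.DataFrame({"Text": ["came", "of", "desde", "design"]})

# Length mask first, as in cond1.
long_enough = df["Text"].str.len() > 2

# Run the detector only on rows that passed the length check; other rows stay False.
is_english = pd.Series(False, index=df.index)
is_english.loc[long_enough] = df.loc[long_enough, "Text"].map(detect_stub) == "en"

df_sub = df[long_enough & is_english]
print(df_sub["Text"].tolist())  # ['came', 'design']
```

Keeping the two masks separate also makes it easy to inspect which rows were dropped for length and which for language.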


Thank you both for your answers; both approaches work well. One suggestion, @bbnumber2: detect can raise errors depending on the length of the word, so it may be better to swap detect and len in the code. That way the code runs without errors ;) One more thing: does it automatically detect "en" as the language? It works, but I would like to understand whether that is automatic and/or whether I need to do anything to change it or detect a different language.

@still_learning That depends on your implementation of the detect() function.

Thank you very much, @BEN_YO. I chose the other answer only because it was closer to what I wanted to do, but your approach also works well and takes a different route. Thanks.