Python Pandas: filter out DataFrame rows that contain non-English text

python pandas algorithm nlp

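I have a pandas DataFrame, df, with 6 columns; the last column is input_text. I want to remove from df every row whose value in that column is non-English text. I would like to use the detect function from langdetect.

Here is a template: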

from langdetect import detect
import pandas as pd

def filter_nonenglish(df):
    new_df = None  # Do some magical operations here to create the filtered df
    return new_df

df = pd.read_csv('somecsv.csv')
df_new = filter_nonenglish(df)
print('New df is: ', df_new)

Note: it does not matter what the other 5 columns contain. Also note that detect is very simple to use:

t = 'I am very cool!'
print(detect(t))

The output is:

en

You can do the following on df to get all rows whose input_text column contains English text:

df_new = df[df.input_text.apply(detect).eq('en')]

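So basically, you just apply the langdetect.detect function to the values in the input_text column and keep every row whose text is detected as "en".

It turns out langdetect is slow on large documents, so any approach is welcome!

What exactly is the question?

"I want to remove from df every row whose value in that column is non-English text. I would like to use the detect function from langdetect." As I pointed out, the other values in the DataFrame are irrelevant. The problem can be solved using only the last column, input_text, which, as I specified, holds strings.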
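The one-liner in the answer assumes every value in input_text is a non-empty string that detect can handle. As far as I know, detect raises a LangDetectException when the text has no usable features (for example, only digits or punctuation), and missing values would also break it. Below is a minimal sketch of the filter_nonenglish template that guards against those cases. The helper safe_detect and the parameters column, detector and max_chars are hypothetical names introduced here, not part of langdetect or pandas. The max_chars option is one possible way to address the speed concern raised in the comments: it only looks at the start of each document, on the assumption that a prefix is enough to identify the language.

```python
import pandas as pd


def safe_detect(text, detector, max_chars=None):
    """Return a language code, or None when the text cannot be classified."""
    # Skip NaN, None, numbers and blank strings instead of passing them on.
    if not isinstance(text, str) or not text.strip():
        return None
    # Optionally look at only the first max_chars characters to save time.
    if max_chars is not None:
        text = text[:max_chars]
    try:
        return detector(text)
    except Exception:
        # langdetect raises LangDetectException for text without features;
        # catching Exception also covers other detector callables.
        return None


def filter_nonenglish(df, column="input_text", detector=None, max_chars=None):
    """Keep only the rows whose `column` text is detected as English."""
    if detector is None:
        # Imported lazily so a different detector can be injected for tests.
        from langdetect import DetectorFactory, detect
        DetectorFactory.seed = 0  # langdetect is randomized; fix the seed
        detector = detect
    langs = df[column].map(lambda t: safe_detect(t, detector, max_chars))
    # Rows where detection failed (None) compare unequal to "en" and are dropped.
    return df[langs.eq("en")]
```

With this in place, the template call df_new = filter_nonenglish(df) should work unchanged, and something like filter_nonenglish(df, max_chars=500) could be tried on large documents. Note that rows whose language cannot be determined are dropped along with the non-English ones, which may or may not be what you want.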