Python: counting the words in a string column that match a predefined list
Tags: python, pandas, dataframe, scikit-learn, dataset
I have a DataFrame with an index column and a text column, for example:
index | text
1 | "I have a pen, but I lost it today."
2 | "I have pineapple and pen, but I lost it today."
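Now I have a very long list, and I want to match every word in the text against it. For example:
long_list = ['pen', 'pineapple']
I want to create a FunctionTransformer that matches the words in long_list against each word of the column value and, where there are matches, returns the count: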
index | text | count
1 | "I have a pen, but I lost it today." | 1
2 | "I have pineapple and pen, but I lost it today." | 2
This is what I tried:
def count_words(df):
    long_list = ['pen', 'pineapple']
    count = 0
    for c in df['tweet_text']:
        if c in long_list:
            count = count + 1
    df['count'] = count
    return df
count_word = FunctionTransformer(count_words, validate=False)
For reference, this is how I wrote one of my other FunctionTransformers:
def convert_twitter_datetime(df):
    df['hour'] = pd.to_datetime(df['created_at'], format='%a %b %d %H:%M:%S +0000 %Y').dt.strftime('%H').astype(int)
    return df
convert_datetime = FunctionTransformer(convert_twitter_datetime, validate=False)
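pandas has str.count:
import re
# Match any of the words as a whole word. The non-capturing group makes \b
# apply to every alternative; re.escape protects regex metacharacters.
pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, long_list)))
df['count'] = df.text.str.count(pattern)
Output: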
index text count
0 1 "I have a pen, but I lost it today." 1
1 2 "I have pineapple and pen, but I lost it today." 2
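One detail worth checking is how the word boundaries combine with the alternation. In Python's re syntax `|` has the lowest precedence, so without a group the `\b` anchors attach only to the first and last alternative. A minimal sketch of the difference follows; the sample sentences and the names `loose` and `strict` are made up for illustration and are not from the question.

```python
import re

import pandas as pd

long_list = ['pen', 'pineapple']
# Illustrative sentences chosen to expose partial-word matches.
s = pd.Series(['I like pens and a pen', 'a pencil and a pineapple'])

# Ungrouped: parsed as (\bpen) | (pineapple\b), so "pens" and "pencil" match.
loose = r'\b{}\b'.format('|'.join(long_list))
# Grouped: \b applies on both sides of every alternative.
strict = r'\b(?:{})\b'.format('|'.join(map(re.escape, long_list)))

loose_counts = s.str.count(loose).tolist()
strict_counts = s.str.count(strict).tolist()
print(loose_counts)   # [2, 2]
print(strict_counts)  # [1, 1]
```

On the two rows from the question both patterns give the same counts, so the difference only shows up once the text contains words that start with "pen" or end with "pineapple".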
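Join the list elements with |, use .str.findall() to find the matches, and apply .str.len() to count them. Note that this pattern has no word boundaries, so it also counts substrings such as the "pen" in "pencil".
p = '|'.join(long_list)
df = df.assign(count=df.text.str.findall(p).str.len())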
text count
0 "I have a pen, but I lost it today." 1
1 "I have pineapple and pen, but I lost it today." 2
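Since the question mentions a very long list, a regex alternation may become slow to build and match. One alternative, which is a swapped-in technique and not part of the answers here, is to split each text into word tokens and test every token against a set. In this sketch the helper name `count_listed_words` and the `\w+` tokenisation are assumptions; the tokeniser would need adjusting if the listed words can contain hyphens or apostrophes.

```python
import re

import pandas as pd

long_list = ['pen', 'pineapple']
word_set = set(long_list)        # set membership checks are O(1) on average
token_re = re.compile(r'\w+')    # assumption: a word is a run of \w characters


def count_listed_words(text):
    """Count the tokens in `text` that appear in word_set (case-sensitive)."""
    return sum(token in word_set for token in token_re.findall(text))


df = pd.DataFrame({
    'text': ['I have a pen, but I lost it today.',
             'I have pineapple and pen, but I lost it today.'],
})
df['count'] = df['text'].apply(count_listed_words)
print(df['count'].tolist())  # [1, 2]
```

Because whole tokens are compared, "pencil" or "pens" would not be counted, which matches the whole-word behaviour of the `\b(?:...)\b` pattern.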
Inspired by @Quang Hoang's answer:
import re
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
y = ['pen', 'pineapple']
def count_strings(X, y):
    # Whole-word match for any entry in y
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, y)))
    return X['text'].str.count(pattern)
# kw_args forwards the word list to count_strings
string_transformer = FunctionTransformer(count_strings, kw_args={'y': y})
df['count'] = string_transformer.fit_transform(X=df)
which gives
text count
1 "I have a pen, but I lost it today." 1
2 "I have pineapple and pen, but I lost it today." 2
For the following df2:
#df2
text
1 "I have a pen, but I lost it today. pen pen"
2 "I have pineapple and pen, but I lost it today."
we get
string_transformer.transform(X=df2)
#result
1 3
2 2
Name: text, dtype: int64
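This shows that the function has been turned into an sklearn-style object. To take this further, the column name could also be passed to count_strings as a keyword argument.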
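The keyword-argument idea could look roughly like this. It is only a sketch: the parameter names `words` and `column` are my own choices (the answer above calls the list `y`), and returning a copy of the frame with an added 'count' column, rather than a bare Series, is an assumption about what is convenient inside a pipeline.

```python
import re

import pandas as pd
from sklearn.preprocessing import FunctionTransformer


def count_strings(X, words, column='text'):
    """Return a copy of X with a 'count' column of whole-word matches."""
    pattern = r'\b(?:{})\b'.format('|'.join(map(re.escape, words)))
    X = X.copy()  # leave the caller's DataFrame untouched
    X['count'] = X[column].str.count(pattern)
    return X


# kw_args forwards both the word list and the column name to count_strings.
string_transformer = FunctionTransformer(
    count_strings,
    kw_args={'words': ['pen', 'pineapple'], 'column': 'text'},
)

df2 = pd.DataFrame({
    'text': ['I have a pen, but I lost it today. pen pen',
             'I have pineapple and pen, but I lost it today.'],
})
result = string_transformer.fit_transform(df2)
print(result['count'].tolist())  # [3, 2]
```

With the column passed in through kw_args, the same function can be reused for a frame whose text lives in a differently named column, such as the 'tweet_text' column used in the question's attempt.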
Comments:
Why not use the count() function in pandas?
@CeliusStingher I'm working on a pipeline, so my plan was to create a FunctionTransformer for it, but I'm open to any solution! I'm still new :3
Can you clarify what your question is?
But I'm not quite sure about this approach, since the OP said he wants a FunctionTransformer, so I think a function has to be created.
Thanks @Quang Hoang! Your answer inspired others to solve my problem! Worth an upvote!