Counting word frequency in a Python DataFrame using dictionaries
I have a DataFrame consisting of text job descriptions and 3 empty columns:
index job_description level_1 level_2 level_3
0 this job requires masters in.. 0 0 0
1 bachelor degree needed for.. 0 0 0
2 ms is preferred or phd.. 0 0 0
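I am trying to go through each job description string and count the frequency of each degree level mentioned in it. A sample output should look like this: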
index job_description level_1 level_2 level_3
0 this job requires masters in.. 0 1 0
1 bachelor degree needed for.. 1 0 0
2 ms is preferred or phd.. 0 1 1
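I created dictionaries for the comparison, as shown below, but I am somewhat lost on how to look for these words in the strings of the DataFrame's job_description column and populate the level columns depending on whether the words are present.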
my_dict_1 = dict.fromkeys(['bachelors', 'bachelor', 'ba', 'science degree', 'bs', 'engineering degree'], 1)
my_dict_2 = dict.fromkeys(['masters', 'ms', 'master'], 1)
my_dict_3 = dict.fromkeys(['phd','p.h.d'], 1)
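I would really appreciate any support on this.
How about something like this?
Since each of the three dictionaries corresponds to a different column to be created, we can create another dictionary mapping, with the upcoming column names as keys and the strings to search for at each level as values (in fact, you don't even need a dictionary to store the my_dict_N items, a set would do, but that is not a big deal).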
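The code that defines this mapping does not appear in the text above. A minimal sketch of what it could look like, assuming the name `lookup` (the name used by the loop that follows) and the three `my_dict_N` dictionaries from the question; the sample rows are the ones shown in the answer's output:

```python
import pandas as pd

# Keyword dictionaries from the question (the values are unused placeholders).
my_dict_1 = dict.fromkeys(
    ['bachelors', 'bachelor', 'ba', 'science degree', 'bs', 'engineering degree'], 1
)
my_dict_2 = dict.fromkeys(['masters', 'ms', 'master'], 1)
my_dict_3 = dict.fromkeys(['phd', 'p.h.d'], 1)

# Assumed reconstruction: target column name -> keywords to search for.
# Iterating over a dict yields its keys, so the loop sees the keyword strings.
lookup = {'level_1': my_dict_1, 'level_2': my_dict_2, 'level_3': my_dict_3}

# Sample frame matching the rows in the answer's output.
df = pd.DataFrame({'job_description': [
    'masters degree required',
    "bachelor's degree required",
    'bachelor degree required',
    'phd required',
]})
```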
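Then go through each column proposed in the dictionary just created and assign a new column that produces the desired output, checking for each row whether at least one of the strings specified for that level in the my_dict_N objects appears in the job description: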
>>> for level, values in lookup.items():
... df[level] = df['job_description'].apply(lambda x: 1 if any(v in x for v in values) else 0)
...
>>> df
job_description level_1 level_2 level_3
0 masters degree required 0 1 0
1 bachelor's degree required 1 0 0
2 bachelor degree required 1 0 0
3 phd required 0 0 1
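One caveat: `v in x` is a substring test, so short keywords such as 'ba' or 'ms' also match inside longer words like 'database' or 'terms'. A hedged sketch of a stricter variant that only accepts whole-word matches; the helper name `build_pattern` and the third sample row are assumptions made for illustration, not part of the original answer:

```python
import re

import pandas as pd

lookup = {
    'level_1': ['bachelors', 'bachelor', 'ba', 'science degree', 'bs', 'engineering degree'],
    'level_2': ['masters', 'ms', 'master'],
    'level_3': ['phd', 'p.h.d'],
}


def build_pattern(keywords):
    """Compile one regex that matches any keyword as a whole word (hypothetical helper)."""
    # Longest keywords first, so 'bachelors' is tried before 'bachelor' and 'ba'.
    escaped = [re.escape(k) for k in sorted(keywords, key=len, reverse=True)]
    # The lookarounds reject matches that sit inside a longer word.
    return re.compile(r'(?<!\w)(?:' + '|'.join(escaped) + r')(?!\w)', flags=re.IGNORECASE)


df = pd.DataFrame({'job_description': [
    'masters degree required',
    "bachelor's degree required",
    'database skills and good terms',  # contains 'ba' and 'ms' only as substrings
    'phd required',
]})

for level, keywords in lookup.items():
    pattern = build_pattern(keywords)
    # 1 if any keyword occurs as a whole word in the description, else 0.
    df[level] = df['job_description'].apply(lambda text: int(bool(pattern.search(text))))
```

With the plain substring test, the third row would be flagged for both level_1 and level_2; with the whole-word pattern it stays at 0 for every level.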
Another solution is to use scikit-learn's CountVectorizer class, which counts the frequency of tokens (basically words) occurring in strings:
>>> from sklearn.feature_extraction.text import CountVectorizer
Specify a fixed vocabulary, so that every word that is not a "degree" keyword is ignored:
>>> vec = CountVectorizer(vocabulary={value for level, values in lookup.items() for value in values})
>>> vec.vocabulary
{'master', 'p.h.d', 'ba', 'ms', 'engineering degree', 'masters', 'phd', 'bachelor', 'bachelors', 'bs', 'science degree'}
Fit the transformer to the text iterable, df['job_description']:
>>> result = vec.fit_transform(df['job_description'])
A closer look at the result:
>>> pd.DataFrame(result.toarray(), columns=vec.get_feature_names_out())
ba bachelor bachelors bs engineering degree master masters ms p.h.d phd science degree
0 0 0 0 0 0 0 1 0 0 0 0
1 0 1 0 0 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 0
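If you want to get back to the level_N column structure, this last approach may take a bit more work, but I thought I would show it as another way to think about encoding these data points.
A slightly different approach is to store the keywords and the job descriptions as sets and then compute the set intersections. You can generate the intersection matrix compactly by vectorizing set.intersection: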
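import io

import numpy as np
import pandas as pd

# Columns in the sample text are separated by two or more spaces.
df = pd.read_csv(
    io.StringIO(
        """index  job_description  level_1  level_2  level_3
0  this job requires masters in..  0  0  0
1  bachelor degree needed for..  0  0  0
2  ms is preferred or phd ..  0  0  0"""
    ),
    sep=r"\s{2,}",
    engine="python",
)
# One keyword set per level, held in a 1-D object array so it broadcasts.
levels = np.array(
    [
        {"bachelors", "bachelor", "ba", "science degree", "bs", "engineering degree"},
        {"masters", "ms", "master"},
        {"phd", "p.h.d"},
    ]
)
# Intersect every description's word set with every level's keyword set,
# then map non-empty intersections to 1 and empty ones to 0.
df[["level_1", "level_2", "level_3"]] = (
    np.vectorize(set.intersection)(
        df.job_description.str.split().apply(set).values[:, None], levels
    )
    .astype(bool)
    .astype(int)
)
print(df)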
index job_description level_1 level_2 level_3
0 0 this job requires masters in.. 0 1 0
1 1 bachelor degree needed for.. 1 0 0
2 2 ms is preferred or phd .. 0 1 1
I think we can do it like this:
# create a level based mapper dict
mapper = {'level_1':['bachelors', 'bachelor', 'ba','science degree','bs','engineering degree'],
'level_2': ['masters', 'ms', 'master'],
'level_3': ['phd','p.h.d']}
# convert list to set
mapper = {k: set(v) for k, v in mapper.items()}
# remove dots from description
df['description'] = df['description'].str.replace('.', '', regex=False)
# check if any word of description is available in the mapper dict
df['flag'] = df['description'].str.split(' ').apply(set).apply(lambda x: [k for k,v in mapper.items() if any([y for y in x if y in v])])
# convert the list into new rows
df1 = df.set_index(['index','description'])['flag'].apply(pd.Series).stack().reset_index().drop('level_2', axis=1)
df1.rename(columns={0:'flag'}, inplace=True)
# add a value column; it supplies the cell values when reshaping to wide format
df1['val'] = 1
# convert the data into wide format
df1 = df1.set_index(['index','description','flag'])['val'].unstack(fill_value=0).reset_index()
df1.columns.name = None
print(df1)
index description level_1 level_2 level_3
0 0 this job requires masters in 0 1 0
1 1 bachelor degree needed for 1 0 0
2 2 ms is preferred or phd 0 1 1
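If only the 0/1 flags are needed, the long-to-wide reshape can be skipped. A minimal sketch, assuming the same `index` and `description` columns as above; it tests each level with `set.isdisjoint`, and it keeps rows that match no keyword, which the stack/unstack route can lose. The fourth sample row is made up for illustration. Like any approach based on single-word tokens, it cannot match two-word entries such as 'science degree':

```python
import pandas as pd

mapper = {
    'level_1': {'bachelors', 'bachelor', 'ba', 'science degree', 'bs', 'engineering degree'},
    'level_2': {'masters', 'ms', 'master'},
    'level_3': {'phd', 'p.h.d'},
}

df = pd.DataFrame({
    'index': [0, 1, 2, 3],
    'description': [
        'this job requires masters in..',
        'bachelor degree needed for..',
        'ms is preferred or phd ..',
        'no degree needed..',  # hypothetical row without any keyword
    ],
})

# Strip literal dots, then turn each description into a set of words.
tokens = df['description'].str.replace('.', '', regex=False).str.split().apply(set)

for level, keywords in mapper.items():
    # isdisjoint is True when nothing overlaps, so negate it to get the flag.
    df[level] = tokens.apply(lambda words, kw=keywords: int(not kw.isdisjoint(words)))
```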