Python: How do I count the occurrences of strings in a DataFrame that also exist in a separate list of dicts?
I have a DataFrame that looks like this:
     ngram
--------------------------
0    []
1    [_ting, tingk, ...]
2    [_pend, pendi, ...]
3    [_teat, teate, ...]
...  ...
999  []
I also have a list of dicts that looks like this:
[
    {
        "label": "Academic",
        "gram": "_ting"
    },
    {
        "label": "Facility",
        "gram": "_pend"
    },
    ....,
    {
        "label": "Others",
        "gram": "meing"
    },
]
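How can I count how often the n-grams in the DataFrame occur, by checking whether each one exists in the list? My expected output is as follows, and it will be used in the next calculation: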
academic_count = 1
facility_count = 1
services_count = 0
others_count = 0
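I managed to implement it, but only with nested loops, which is very slow given the length of the DataFrame (1,000 rows) and of the list (more than 4,000 entries). Here is my code: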
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['ngram'] = data

academic_chance = []
facility_chance = []
services_chance = []
others_chance = []

for idx, ngram_words in enumerate(df['ngram']):
    # Per-row conditional probabilities, one entry per n-gram
    academic_probs = []
    facility_probs = []
    services_probs = []
    others_probs = []

    for ngram in ngram_words:
        academic_count = 0
        facility_count = 0
        services_count = 0
        others_count = 0

        # Scan the whole list for every n-gram (this is the slow part)
        for item in list_of_dicts:
            if ngram == item["gram"] and item["label"] == "Academic":
                academic_count += 1
            elif ngram == item["gram"] and item["label"] == "Facility":
                facility_count += 1
            elif ngram == item["gram"] and item["label"] == "Services":
                services_count += 1
            elif ngram == item["gram"] and item["label"] == "Others":
                others_count += 1

        # Add-one smoothing on each count
        academic_cond_prob = (academic_count + 1) / academic_denominator
        facility_cond_prob = (facility_count + 1) / facility_denominator
        services_cond_prob = (services_count + 1) / services_denominator
        others_cond_prob = (others_count + 1) / others_denominator

        academic_probs.append(academic_cond_prob)
        facility_probs.append(facility_cond_prob)
        services_probs.append(services_cond_prob)
        others_probs.append(others_cond_prob)

    academic_chance.append(np.prod(academic_probs) * academic_cat_probs)
    facility_chance.append(np.prod(facility_probs) * facility_cat_probs)
    services_chance.append(np.prod(services_probs) * services_cat_probs)
    others_chance.append(np.prod(others_probs) * others_cat_probs)
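Any ideas on how to make this more efficient?

The idea is to first convert the list of dicts into a single dictionary with a dict comprehension: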
L = [
    {
        "label": "Academic",
        "gram": "_ting"
    },
    {
        "label": "Facility",
        "gram": "_pend"
    },
    {
        "label": "Services",
        "gram": "_aaa"
    },
    {
        "label": "Others",
        "gram": "meing"
    },
]
d = {x['gram']: x['label'] for x in L}
print(d)
{'_ting': 'Academic', '_pend': 'Facility', '_aaa': 'Services', 'meing': 'Others'}
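One caveat with the comprehension above: a Python dict holds one value per key, so if the same gram appears in the list more than once (with the same or a different label), only the last label survives. If the real list of 4,000+ entries contains such repeats, counting (gram, label) pairs may be closer to what the nested loop in the question computes. A minimal sketch, where the repeated "_ting" entries and the names `entries`, `last_label` and `pair_counts` are made up for illustration:

```python
from collections import Counter

# Hypothetical entries: "_ting" is repeated on purpose to show the effect.
entries = [
    {"label": "Academic", "gram": "_ting"},
    {"label": "Academic", "gram": "_ting"},
    {"label": "Facility", "gram": "_ting"},
    {"label": "Others", "gram": "meing"},
]

# A plain dict comprehension keeps only the last label seen for each gram.
last_label = {x["gram"]: x["label"] for x in entries}

# Counting (gram, label) pairs preserves repeats and multiple labels per gram.
pair_counts = Counter((x["gram"], x["label"]) for x in entries)

print(last_label["_ting"])                  # Facility
print(pair_counts[("_ting", "Academic")])   # 2
print(pair_counts[("_ting", "Services")])   # 0, Counter returns 0 for missing keys
```

Each lookup in `pair_counts` takes constant time on average, so it could stand in for the innermost `for item in list_of_dicts` loop without scanning the list for every n-gram.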
Then use MultiLabelBinarizer to build a new DataFrame of indicator columns, one per distinct n-gram:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['ngram']), columns=mlb.classes_)
print(df1)
   _pend  _teat  _ting  pendi  teate  tingk
0      0      0      0      0      0      0
1      0      0      1      0      0      1
2      1      0      0      1      0      0
3      0      1      0      0      1      0
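Finally, keep only the columns whose names are keys of the dictionary, sum each column, rename the index to labels using the dictionary, sum again per label, and add the labels that had no match with reindex: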
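# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
s = (df1.loc[:, df1.columns.isin(list(d.keys()))]
        .sum()
        .rename(d)
        .groupby(level=0)
        .sum()
        .reindex(list(d.values()), fill_value=0))
print(s)
Academic    1
Facility    1
Services    0
Others      0
dtype: int64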
If you need a separate variable for each value:
academic_count = s['Academic']
facility_count = s['Facility']
services_count = s['Services']
others_count = s['Others']
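If you would rather not depend on scikit-learn, the same totals can probably be obtained with pandas alone by flattening the lists and mapping each n-gram to its label. This is a sketch under assumptions, not part of the approach above: the sample rows stand in for the truncated data in the question, and `labels` is a name introduced here. Note one difference: the indicator approach counts a gram at most once per row, while this version counts every occurrence.

```python
import pandas as pd

# Sample rows standing in for the (truncated) data in the question.
df = pd.DataFrame(
    {"ngram": [[], ["_ting", "tingk"], ["_pend", "pendi"], ["_teat", "teate"]]}
)
L = [
    {"label": "Academic", "gram": "_ting"},
    {"label": "Facility", "gram": "_pend"},
    {"label": "Services", "gram": "_aaa"},
    {"label": "Others", "gram": "meing"},
]
d = {x["gram"]: x["label"] for x in L}
labels = ["Academic", "Facility", "Services", "Others"]

s = (
    df["ngram"]
    .explode()                       # one row per n-gram; empty lists become NaN
    .map(d)                          # n-gram -> label; unknown n-grams become NaN
    .value_counts()                  # count per label; NaN is dropped by default
    .reindex(labels, fill_value=0)   # add labels that never matched
)
print(s)
```

With the sample rows this gives 1 for Academic and Facility and 0 for Services and Others, matching the expected output in the question.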