Python: How do I count occurrences of strings in a DataFrame that also exist in a list of dicts?

Tags: python, pandas, dataframe

I have a DataFrame that looks like this:

                   ngram
--------------------------
  0                    []
  1   [_ting, tingk, ...]
  2   [_pend, pendi, ...]
  3   [_teat, teate, ...]
...                   ...
999                    []
I also have a list of dicts that looks like this:

[
  {
    "label": "Academic",
    "gram": "_ting"
  },
  {
    "label": "Facility",
    "gram": "_pend"
  },
  ....,
  {
    "label": "Others",
    "gram": "meing"
  },
]
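How can I count the occurrences of words in the DataFrame by checking whether each word exists in the list? My expected output, which will be used in the next calculation, is: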

academic_count = 1
facility_count = 1
services_count = 0
others_count   = 0
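I managed to implement it, but only with nested loops, which is very slow given the length of the DataFrame (1,000 rows) and of the list (more than 4,000 entries). Here is my code: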

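import numpy as np
import pandas as pd

# data, list_of_dicts, the *_denominator values and the *_cat_probs values
# are defined elsewhere and are not shown here.
df = pd.DataFrame()
df['ngram'] = data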

academic_chance = []
facility_chance = []
services_chance = []
others_chance   = []

for idx, ngram_words in enumerate(df['ngram']):
    academic_probs = []
    facility_probs = []
    services_probs = []
    others_probs   = []

    for ngram in ngram_words:
        academic_count = 0
        facility_count = 0
        services_count = 0
        others_count   = 0

        for item in list_of_dicts:
            if ngram == item["gram"] and item["label"] == "Academic":
                academic_count += 1
            elif ngram == item["gram"] and item["label"] == "Facility":
                facility_count += 1
            elif ngram == item["gram"] and item["label"] == "Services":
                services_count += 1
            elif ngram == item["gram"] and item["label"] == "Others":
                others_count += 1

        academic_cond_prob = (academic_count + 1) / academic_denominator
        facility_cond_prob = (facility_count + 1) / facility_denominator
        services_cond_prob = (services_count + 1) / services_denominator
        others_cond_prob   = (others_count   + 1) / others_denominator

        academic_probs.append(academic_cond_prob)
        facility_probs.append(facility_cond_prob)
        services_probs.append(services_cond_prob)
        others_probs.append(others_cond_prob)

    academic_chance.append(np.prod(academic_probs) * academic_cat_probs)
    facility_chance.append(np.prod(facility_probs) * facility_cat_probs)
    services_chance.append(np.prod(services_probs) * services_cat_probs)
    others_chance.append(np.prod(others_probs)     * others_cat_probs)

Any ideas on how to make this more efficient?

The idea is to first convert the list of dicts into a single dictionary with a dict comprehension:

L = [
  {
    "label": "Academic",
    "gram": "_ting"
  },
  {
    "label": "Facility",
    "gram": "_pend"
  },
  {
    "label": "Services",
    "gram": "_aaa"
  },
  {
    "label": "Others",
    "gram": "meing"
  },
]

d = {x['gram']:x['label'] for x in L}
print (d)
{'_ting': 'Academic', '_pend': 'Facility', '_aaa': 'Services', 'meing': 'Others'}
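One caveat with a plain gram-to-label dictionary: if the same gram appears more than once in the list, or under several labels, only the last entry survives. The sketch below is one possible alternative, not the answer's method. It builds a lookup that keeps a count per label for each gram, so the innermost loop from the question becomes a single dictionary access. The repeated `_ting` entries and the helper name `label_counts` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Sample data in the shape used by the question. "_ting" is repeated on
# purpose (an assumption) to show that duplicates and multiple labels
# per gram are preserved.
list_of_dicts = [
    {"label": "Academic", "gram": "_ting"},
    {"label": "Academic", "gram": "_ting"},
    {"label": "Facility", "gram": "_ting"},
    {"label": "Facility", "gram": "_pend"},
    {"label": "Others", "gram": "meing"},
]

# Build the lookup once: gram -> Counter({label: occurrences}).
lookup = defaultdict(Counter)
for item in list_of_dicts:
    lookup[item["gram"]][item["label"]] += 1


def label_counts(ngram, labels=("Academic", "Facility", "Services", "Others")):
    """Return {label: count} for one n-gram (hypothetical helper)."""
    # .get() avoids inserting unseen n-grams into the defaultdict.
    counts = lookup.get(ngram, Counter())
    return {label: counts[label] for label in labels}


print(label_counts("_ting"))
print(label_counts("tingk"))
```

Building the lookup is one pass over the list, and each n-gram then costs a constant-time access instead of a scan over all 4,000+ entries.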
Then use MultiLabelBinarizer to build a new DataFrame of indicator columns:

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df1 = pd.DataFrame(mlb.fit_transform(df['ngram']),columns=mlb.classes_)
print (df1)
   _pend  _teat  _ting  pendi  teate  tingk
0      0      0      0      0      0      0
1      0      0      1      0      0      1
2      1      0      0      1      0      0
3      0      1      0      0      1      0
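Finally, keep only the columns whose names are keys of the dictionary, sum each column, rename the index with the dictionary, sum again per label, and add the labels that had no match with reindex: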

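s = (df1.loc[:, df1.columns.isin(list(d.keys()))]
        .sum()
        .rename(d)
        # Series.sum(level=0) was removed in pandas 2.0; group by the index instead
        .groupby(level=0).sum()
        # dict.fromkeys drops repeated labels while keeping their order
        .reindex(list(dict.fromkeys(d.values())), fill_value=0))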
print (s)
Academic    1
Facility    1
Services    0
Others      0
dtype: int64
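If scikit-learn is not available, roughly the same totals can be obtained with pandas alone. This is a sketch under assumptions, not the answer's method: it uses a small sample frame shaped like the one in the question, and unlike the indicator matrix it counts a gram every time it appears in a row rather than at most once per row.

```python
import pandas as pd

# Sample data shaped like the question's DataFrame (assumed values).
df = pd.DataFrame(
    {"ngram": [[], ["_ting", "tingk"], ["_pend", "pendi"], ["_teat", "teate"]]}
)
d = {"_ting": "Academic", "_pend": "Facility", "_aaa": "Services", "meing": "Others"}

# Unique labels in first-seen order, so reindex gets no duplicates.
labels = list(dict.fromkeys(d.values()))

counts = (
    df["ngram"]
    .explode()        # one n-gram per row; empty lists become NaN
    .map(d)           # n-gram -> label; unknown n-grams become NaN
    .value_counts()   # NaN is dropped by default
    .reindex(labels, fill_value=0)  # add labels that never matched
)
print(counts)
```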
If you need a separate variable for each value:

academic_count = s['Academic']
facility_count = s['Facility']
services_count = s['Services']
others_count   = s['Others']