Python: find the percentage of each unique value for every column in a table (tags: python, python-3.x, pandas)
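I know that to count each unique value in a column and convert the counts to percentages, I can use:

df['name_of_the_column'].value_counts(normalize=True)*100

I would like to know how to do this for all columns as a function, and then drop every column in which a single unique value makes up more than 95% of all values. Note that the function should also count NaN values.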

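You can try the following:

# Iterate over a snapshot of the column labels, because columns are deleted inside the loop.
for col in list(df.columns):
    # dropna=False counts NaN as a value of its own; the result is sorted in descending order.
    res = df[col].value_counts(normalize=True, dropna=False) * 100
    # res.iloc[0] is the percentage of the most frequent value in this column.
    if res.iloc[0] >= 95:
        del df[col]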
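
The loop above modifies the DataFrame in place. As a minimal sketch of the "as a function" form the question asks for, the same check can be wrapped so that it returns a new DataFrame instead. The function name `drop_dominant_columns`, the parameter `threshold_pct`, and the example data are assumptions made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd


def drop_dominant_columns(df, threshold_pct=95.0):
    """Return a copy of df without columns dominated by one value (NaN included)."""
    keep = []
    for col in df.columns:
        # Percentage of each distinct value; dropna=False counts NaN as a value.
        shares = df[col].value_counts(normalize=True, dropna=False) * 100
        # Keep the column unless its most frequent value reaches the threshold.
        if shares.empty or shares.max() < threshold_pct:
            keep.append(col)
    return df[keep].copy()


# Hypothetical example data: 40 rows per column.
example = pd.DataFrame({
    "constant": [1] * 40,                 # one value, 100%
    "mostly_nan": [np.nan] * 39 + [1.0],  # NaN makes up 97.5%
    "mixed": [1, 2] * 20,                 # two values, 50% each
})

result = drop_dominant_columns(example)
print(list(result.columns))
```

Here only the "mixed" column is kept. The "mostly_nan" column is dropped only because `dropna=False` is passed; without it, NaN would be ignored and the single non-NaN value would count as 100% of the column.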

You can write a small wrapper around value_counts that returns False if the share of any value is at or above a given threshold, and True if the counts look fine:

Sample data

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": [1] * 20,                   # should NOT survive
    "B": [1, 0] * 10,                # should survive
    "C": [np.nan] * 20,              # should NOT survive
    "D": [1,2,3,4] * 5,              # should survive
    "E": [0] * 18 + [np.nan, np.nan] # should survive
})

print(df.head())

Implementation

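def threshold_counts(s, threshold=0):
    # Share (0 to 1) of each distinct value in the column; dropna=False counts NaN too.
    counts = s.value_counts(normalize=True, dropna=False)
    # Return False (drop the column) if any single value reaches the threshold.
    # With the default threshold=0 every non-empty column is rejected,
    # so pass an explicit threshold such as 0.95.
    return not (counts >= threshold).any()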

column_mask = df.apply(threshold_counts, threshold=0.95)
clean_df = df.loc[:, column_mask]

print(clean_df.head())
   B  D    E
0  1  1  0.0
1  0  2  0.0
2  1  3  0.0
3  0  4  0.0
4  1  1  0.0