Python: checking a DataFrame for filled (repeated) data in specific columns
Tags: python, python-2.7, pandas, numpy, dataframe


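I have a DataFrame that looks like this:

import numpy as np
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
print(df)

I only want to run a test on certain columns of this DataFrame, namely the column names held in check:

check = {'1M', 'SP'}
print(check)

For these columns, I want to know when a value is the same as the previous day's value. The output DataFrame should therefore return the series date and a comment for each such row.

Can you offer some help with this?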

I'm not sure this is the cleanest way to do it, but it works:

check = {'1M', 'SP'}
# Last value seen for each checked column; None until the first row is processed.
prev_dict = {c: None for c in check}

def check_prev_value(row):
    # Collect one message per column so that two repeats on the same day are both reported.
    msgs = []
    for column in sorted(check):
        if row[column] == prev_dict[column]:
            msgs.append('Value for %s data is same as previous day' % column)
        prev_dict[column] = row[column]
    return '; '.join(msgs)

# Note: this relies on apply visiting each row exactly once, in order.
df['comment'] = df.apply(check_prev_value, axis=1)

output_data_df = df[df['comment'] != ""]
output_data_df = output_data_df[["Series_Date", "comment"]].reset_index(drop=True)
Input:

  Series_Date    SP    1M  3M
0  2017-03-10  35.6  -7.8  24
1  2017-03-13  56.7  56.0 -31
2  2017-03-14  41.0  56.0  53
3  2017-03-15  41.0  -3.4   5
Output:

  Series_Date                                    comment
0  2017-03-14  Value for 1M data is same as previous day
1  2017-03-15  Value for SP data is same as previous day
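
The row-by-row function above keeps state in prev_dict between calls. As an alternative, here is a minimal sketch (not part of the original answer) that compares each checked column with its own shift(), so no shared state is needed. The helper name build_comment and the sorted column order are assumptions made for illustration:

```python
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])

check = {'1M', 'SP'}
cols = sorted(check)  # assumption: fixed order so the comments are deterministic

# True where a value equals the value in the previous row (the first row is always False).
same = df[cols].eq(df[cols].shift())

def build_comment(flags):
    # flags is one row of booleans indexed by column name (hypothetical helper).
    return '; '.join('Value for %s data is same as previous day' % c
                     for c in cols if flags[c])

result = pd.DataFrame({'Series_Date': df['Series_Date'],
                       'comment': same.apply(build_comment, axis=1)})
result = result[result['comment'] != ''].reset_index(drop=True)
print(result)
```

If two checked columns repeat on the same date, both messages appear in one comment, separated by a semicolon.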

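The following does more or less what you asked for. A column named item_ok (one per column in check, for example SP_ok) is added to the original DataFrame, indicating whether the value is the same as on the previous day: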

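from datetime import timedelta

# Gap between each row's date and the previous row's date.
df['Date_diff'] = pd.to_datetime(df['Series_Date']).diff()
for item in check:
    # True when the value is unchanged and the previous row is exactly one calendar day earlier.
    df[item + '_ok'] = (df[item].diff() == 0) & (df['Date_diff'] == timedelta(1))
# Keep the rows where at least one checked column repeated.
df_output = df.loc[df[[item + '_ok' for item in check]].any(axis=1)]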
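
The timedelta(1) test only accepts rows exactly one calendar day apart, so a repeat across a weekend (for example from 2017-03-10 to 2017-03-13) would not be flagged. If "previous day" is meant as the previous business day, one possible variant uses np.busday_count with its default Monday-to-Friday week and no holidays. This is an assumption about the intent, not something the answer states, and the name prev_is_busday is hypothetical:

```python
import numpy as np
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
check = ['1M', 'SP']

# Dates as day-resolution datetime64 values, which np.busday_count accepts.
dates = pd.to_datetime(df['Series_Date']).values.astype('datetime64[D]')

# Business days from each row's previous date up to (not including) its own date.
# Assuming every date is itself a business day, a count of 1 means the
# previous row is the previous business day.
gap = np.busday_count(dates[:-1], dates[1:])
prev_is_busday = np.concatenate(([False], gap == 1))

# One boolean column per checked column: unchanged value and adjacent business day.
flags = pd.DataFrame({col: (df[col].diff() == 0).values & prev_is_busday
                      for col in check})
df_output = df.loc[flags.any(axis=1)]
print(df_output)
```

On the sample data this selects the same two rows as the calendar-day version, because neither repeat spans the weekend.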

When duplicates are found, the output column holds an integer greater than zero.
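
The code that fills the _dup columns does not appear at this point. Here is a minimal sketch, assuming it uses the same run-counting expression as the find_repeats function in this answer (a groupby on run labels followed by cumcount); the name run_id is introduced here for illustration:

```python
import pandas as pd

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])
check = ['1M', 'SP']

for col in check:
    # Start a new run label every time the value differs from the previous row.
    run_id = (df[col] != df[col].shift()).cumsum()
    # Position within the run: 0 for the first value, 1 for the first repeat, and so on.
    df[col + '_dup'] = df[col].groupby(run_id).cumcount()

print(df)
```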

df:

  Series_Date    SP    1M  3M  1M_dup  SP_dup
0  2017-03-10  35.6  -7.8  24       0       0
1  2017-03-13  56.7  56.0 -31       0       0
2  2017-03-14  41.0  56.0  53       1       0
3  2017-03-15  41.0  -3.4   5       0       1
Slice to find the duplicates:

col = 'SP'
dup_df = df[df[col + '_dup'] > 0][['Series_Date', col + '_dup']]

dup_df:

  Series_Date  SP_dup
3  2017-03-15       1
Here is a function version of the above (with added handling for multiple columns):

import pandas as pd
import numpy as np

def find_repeats(df, col_list, date_col='Series_Date'):
    # Work on a copy restricted to the date column and the columns under test.
    dummy_df = df[[date_col] + list(col_list)].copy()
    dates = dummy_df[date_col]
    date_series = []
    code_series = []
    if len(col_list) > 1:
        for col in col_list:
            # Label each run of equal consecutive values, then count the position
            # within the run; any position > 0 repeats the previous row's value.
            these_repeats = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount().values
            repeat_idx = list(np.where(these_repeats > 0)[0])
            date_arr = dates.iloc[repeat_idx]
            code_arr = [col] * len(date_arr)
            date_series.extend(list(date_arr))
            code_series.extend(code_arr)
        return pd.DataFrame({date_col: date_series, 'col_dup': code_series}).sort_values(date_col).reset_index(drop=True)
    else:
        # Single column: return the dates, the values and the repeat counter.
        col = col_list[0]
        dummy_df[col + '_dup'] = df[col].groupby((df[col] != df[col].shift()).cumsum()).cumcount()
        return dummy_df[dummy_df[col + '_dup'] > 0].reset_index(drop=True)

find_repeats(df, ['1M'])

  Series_Date    1M  1M_dup
0  2017-03-14  56.0       1

find_repeats(df, ['1M', 'SP'])

  Series_Date col_dup
0  2017-03-14      1M
1  2017-03-15      SP

Here is another approach using pandas diff:

def find_repeats(df, col_list, date_col='Series_Date'):
    code_list = []
    dates = []

    for col in col_list:
        # A difference of exactly zero means the value equals the previous row's value.
        these_dates = df[date_col].iloc[np.where(df[col].diff().values == 0)[0]].values
        code_arr = [col] * len(these_dates)
        dates.extend(list(these_dates))
        code_list.extend(code_arr)
    return pd.DataFrame({date_col: dates, 'val_repeat': code_list}).sort_values(date_col).reset_index(drop=True)

Thanks, but what if I want to look at other columns, such as just SP, or SP and 3M? I want to specify the columns to test based on the columns in the "check" list.

I updated the code. It now searches the columns that appear in check.
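
To address the comment about choosing the columns from check, the diff-based find_repeats can be called with the contents of check. Below is a minimal usage sketch on the sample data. The function is restated in condensed form so that the block runs on its own, and sorted(check) is an assumption made to get a stable column order from the set:

```python
import numpy as np
import pandas as pd

def find_repeats(df, col_list, date_col='Series_Date'):
    # Condensed restatement of the diff-based function from this thread.
    dates, codes = [], []
    for col in col_list:
        # Row positions where the value equals the previous row's value.
        idx = np.where(df[col].diff().values == 0)[0]
        dates.extend(df[date_col].iloc[idx].tolist())
        codes.extend([col] * len(idx))
    out = pd.DataFrame({date_col: dates, 'val_repeat': codes})
    return out.sort_values(date_col).reset_index(drop=True)

raw_data = {'Series_Date': ['2017-03-10', '2017-03-13', '2017-03-14', '2017-03-15'],
            'SP': [35.6, 56.7, 41, 41],
            '1M': [-7.8, 56, 56, -3.4],
            '3M': [24, -31, 53, 5]}
df = pd.DataFrame(raw_data, columns=['Series_Date', 'SP', '1M', '3M'])

# The commenter's case: test only SP and 3M.
check = {'SP', '3M'}
repeats = find_repeats(df, sorted(check))
print(repeats)

# The original case from the question.
repeats_orig = find_repeats(df, ['1M', 'SP'])
print(repeats_orig)
```

Passing a sorted list instead of the set itself also keeps the single-column variant of find_repeats usable, since it indexes col_list[0].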