Python: inconsistent results when concatenating parsed csv files

python pandas


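I am puzzled by the following problem. I have a set of csv files that I parse iteratively. Before collecting the data frames in a list, I apply some function (as simple as tmp_df*2) to each tmp_df. At first glance this all worked fine, until I realized that the results are inconsistent from run to run. For example, when I apply df.std(), a first run might give: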

In[2]:  df1.std()
Out[2]:
  some_int      15281.99
  some_float    5.302293

And a second run afterwards:

In[3]:  df2.std()
Out[3]:
  some_int      15281.99
  some_float    6.691013

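What puzzles me even more is that I do not observe inconsistencies like this when I leave the parsed data untouched (simply comment out tmp_df = tmp_df*2). I also noticed that for columns whose dtype is int the results are consistent from run to run, which does not hold for float columns. I suspect it has something to do with floating-point precision. I cannot establish a pattern in how the results vary either; two or three consecutive runs may give the same result. Maybe someone knows whether I am missing something. I am working on a reproducible example and will edit as soon as possible, since I cannot share the underlying data. Maybe someone can shed some light on this in the meantime. I am using win 8.1, pandas 0.17.1 and python 3.4.3.

Code example: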

import pandas as pd
import numpy as np

data_list = list()
csv_files = ['a.csv', 'b.csv', 'c.csv']

for csv_file in csv_files:

    # load csv_file
    tmp_df = pd.read_csv(csv_file, index_col='ID', dtype=np.float64)

    # replace infs by na
    tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)

    # manipulate tmp_df
    tmp_df = tmp_df*2

    data_list.append(tmp_df)

df = pd.concat(data_list, ignore_index=True)
df.reset_index(inplace=True)

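Update:

Running the same code and data on a Unix (ux) system works perfectly fine.

Edit: I have managed to recreate the problem with a script that should run on both win and ux. On win8.1 I hit the same problem when with_function=True (typically after 1-5 runs); on ux it runs without any problems. With with_function=False there are no differences on either win or ux. I can also reject the hypothesis that this is an int versus float issue, since the simulated int values differ as well.

Here is the code: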

import pandas as pd
import numpy as np
from pathlib import Path
from tempfile import gettempdir


def simulate_csv_data(tmp_dir,num_files=5):
    """ Simulate csv files
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to simulate
    :return:
    """

    rows = 20000
    columns = 5
    np.random.seed(1282)

    for file_num in range(num_files):

        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))
        simulated_df = pd.DataFrame(np.random.standard_normal((rows, columns)))
        simulated_df['some_int'] = np.random.randint(0,100)
        simulated_df.to_csv(str(file_path))


def get_csv_data(tmp_dir,num_files=5, with_function=True):
    """ Collect various csv files and return one concatenated df
    :param tmp_dir: Path, csv files are saved to
    :param num_files: int, how many csv files to read
    :param with_function: Bool, apply function to tmp_dataframe
    :return:
    """

    data_list = list()

    for file_num in range(num_files):
        # current file path
        file_path = tmp_dir.joinpath(''.join(['df_', str(file_num), '.csv']))

        #  load csv_file
        tmp_df = pd.read_csv(str(file_path), dtype=np.float64)

        # replace infs by na
        tmp_df.replace([np.inf, -np.inf], np.nan, inplace=True)

        # apply function to tmp_dataframe
        if with_function:
            tmp_df = tmp_df*2

        data_list.append(tmp_df)

    df = pd.concat(data_list, ignore_index=True)
    df.reset_index(inplace=True)

    return df

def main():

    # INPUT ----------------------------------------------
    num_files = 5
    with_function = True
    max_comparisons = 50
    # ----------------------------------------------------

    tmp_dir = gettempdir()
    # use temporary "non_existing" dir for new file
    tmp_csv_folder = Path(tmp_dir).joinpath('csv_files_sdfs2eqqf')

    # if exists already don't simulate data/files again
    if tmp_csv_folder.exists() is False:
        tmp_csv_folder.mkdir()
        print('Simulating temp files...')
        simulate_csv_data(tmp_csv_folder, num_files)

    print('Getting benchmark data frame...')
    df1 = get_csv_data(tmp_csv_folder, num_files, with_function)
    df_is_same = True
    count_runs = 0

    # Run until different df is found or max runs exceeded
    print('Comparing data frames...')
    while df_is_same:
        # get another data frame
        df2 = get_csv_data(tmp_csv_folder, num_files, with_function)
        count_runs += 1
        # compare data frames
        if df1.equals(df2) is False:
            df_is_same = False
            print('Found unequal df after {} runs'.format(count_runs))
            # print the standard deviations (arbitrary example)
            print('Std Run1: \n {}'.format(df1.std()))
            print('Std Run2: \n {}'.format(df2.std()))

        if count_runs > max_comparisons:
            df_is_same = False
            print('No unequal df found after {} runs'.format(count_runs))

    print('Delete the following folder if no longer needed: "{}"'.format(
            str(tmp_csv_folder)))


if __name__ == '__main__':
    main()
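To narrow down where two runs diverge, rather than only comparing summary statistics such as std(), the mismatching cells can be listed directly. The following is a minimal sketch, not part of the original script: the helper name locate_differences is hypothetical, and it assumes both frames have identical index and column labels (as df1 and df2 from get_csv_data would).

```python
import numpy as np
import pandas as pd


def locate_differences(left, right):
    """List the cells where two identically labeled frames disagree.

    Hypothetical helper for illustration. NaN in the same position on
    both sides is treated as equal, which mirrors DataFrame.equals.
    """
    # isnull() is used because it exists in old and new pandas versions
    both_nan = left.isnull() & right.isnull()
    mismatch = (left != right) & ~both_nan
    rows, cols = np.where(mismatch.values)
    return pd.DataFrame({
        'row': left.index.values[rows],
        'column': left.columns.values[cols],
        'left': left.values[rows, cols],
        'right': right.values[rows, cols],
    })


# Tiny demonstration with one deliberately changed cell
a = pd.DataFrame({'x': [1.0, np.nan, 3.0], 'y': [4.0, 5.0, 6.0]})
b = a.copy()
b.loc[2, 'y'] = 6.5
diff = locate_differences(a, b)
print(diff)
```

Calling locate_differences(df1, df2) inside the comparison loop would show whether the differences are spread over the whole frame or confined to particular rows or columns.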

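Your variations are caused by something else, such as the input data changing between executions, or source code changes.

Floating-point precision never gives different results between executions.

By the way, clean up your example and you will find the bug. At the moment you say something about an int but display a decimal value instead.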
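The claim about floating-point determinism can be checked with a small experiment. This is only a sketch under stated assumptions: the helper make_frame is hypothetical, the data is simulated with the same seed and shape as in the question, and the check only covers repeated runs of the same operations on identical input within one process and one installation.

```python
import numpy as np
import pandas as pd


def make_frame(seed=1282, rows=20000, columns=5):
    """Hypothetical helper: rebuild the same simulated frame from a fixed seed."""
    rng = np.random.RandomState(seed)
    return pd.DataFrame(rng.standard_normal((rows, columns)))


# Same input and same operations, executed twice
first = (make_frame() * 2).std()
second = (make_frame() * 2).std()

# On a correctly working installation the two results are expected to match exactly
identical = first.equals(second)
print(identical)
```

If such a check fails on a given machine, the cause is more likely a faulty component in the computation path than floating-point rounding as such.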

Update numexpr to 2.4.6 (or later), as numexpr 2.4.4 had some bugs on Windows. After running the update, it works for me.
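For context, pandas can hand element-wise arithmetic on sufficiently large frames to numexpr when it is installed, which would be consistent with the problem appearing only when tmp_df*2 is applied. The sketch below shows one way to check the installed numexpr version and, as a temporary workaround before upgrading, to switch pandas back to plain NumPy evaluation through the compute.use_numexpr option. Whether this avoids the issue on an affected installation is an assumption, not something verified here.

```python
import pandas as pd

# Report the installed numexpr version, if any
try:
    import numexpr
    numexpr_version = numexpr.__version__
except ImportError:
    numexpr_version = None  # without numexpr, pandas uses plain NumPy anyway

print('numexpr version:', numexpr_version)

# Ask pandas not to use numexpr for arithmetic such as tmp_df * 2
pd.set_option('compute.use_numexpr', False)
print('use_numexpr:', pd.get_option('compute.use_numexpr'))
```

pd.show_versions() prints the versions of pandas and its optional dependencies in one go, which is handy when reporting an issue like this one.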

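Without reproducible data it is hard to tell; it could be a bug in pandas. Do you have the same problem if you operate on df rather than on each tmp_df?

I simplified the example. In this case I could also apply the 2x function to df, but in reality I need to apply some tmp_df-dependent function. Since the problem already shows up with 2x, I wanted to keep it simple...

I just added an example that reproduces the problem.

Not sure what you expect from this if you think it is a bug in the Windows implementation of pandas?

I added it to the pandas issue tracker. At first I did not realize it was a Windows issue.

The original data types are different (int and float, although I parse them as float); in this case I am showing the standard deviation, which is not an int. I do not see how the input data could have changed, since the problem only occurs when I apply a function to the raw data, and the function does not change...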