Python: a faster way to parse specific Excel data


python pandas dataframe parsing


I have an Excel file in which every column looks like this, but with 5,000 or more rows:

ColumnName1
4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678
4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678
4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678
4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678
4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678
4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678
4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678
4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678

What I want to do is parse this Excel file and convert it into a new Excel file in which newColumn1, newColumn2, newColumn3 and newColumn4 are the headers and the data looks like this:

newColumn1 newColumn2 newColumn3 newColumn4
  1.66      2.6265     2.2656      2.9678
  1.66      2.6265     2.2656      2.9678
  1.66      2.6265     2.2656      2.9678
  1.66      2.6265     2.2656      2.9678
  1.66      2.6265     2.2656      2.9678
  1.66      2.6265     2.2656      2.9678

My code looks like this, but it is rather slow. Is there a faster way?

    for row in dfSpecificColumn:
        allTest = row.split("|")
        allTest.pop(0) #remove the 4| in the beginning of each line
        count = 0
        columnName = ''
        dict = OrderedDict()
        # for each test name and value, insert into the dictionary; for every line in the csv, append it to the dataframe
        for text in allTest:
            if count % 2 == 1:
                dict[columnName] = text
            else:
                columnName = text
            count = count + 1
        dfOutputWithTestThatFailed = dfOutputWithTestThatFailed.append(dict, ignore_index=True)
    return dfOutputWithTestThatFailed

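What I do is split on |, add the pairs to a dictionary, and then append that to the DataFrame. I'm pretty sure there is a faster way to do this. Thanks in advance.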

Since, as you pointed out, the data is in .csv format, you could do something like this:

with open('myinput.csv', 'r') as f_in:
    # skip the header
    next(f_in)
    first_row = next(f_in).rstrip('\n').split('|')
    # not assuming 4 columns: the names are read from the first data line
    with open('myoutput.csv', 'w') as f_out:
        # write output header (comma-separated output is an assumption)
        f_out.write(','.join(first_row[1::2]) + '\n')
        # write first line of data
        f_out.write(','.join(first_row[2::2]) + '\n')
        # loop over the rest of the lines, split and only take the data
        for line in f_in:
            f_out.write(','.join(line.rstrip('\n').split('|')[2::2]) + '\n')


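You can use Series.str.split to split the series ColumnName1 around the delimiter |, which produces a series of lists of elements. You can then use Series.apply with a custom function that converts each list in the series into a pd.Series in the desired format: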

result = (
    df['ColumnName1'].str.split('|')
    .apply(lambda x: pd.Series(x[2::2], index=x[1::2]))
)

Output:

# print(result)

  newColumn1 newColumn2 newColumn3 newColumn4
0       1.66     2.6265     2.2656     2.9678
1       1.66     2.6265     2.2656     2.9678
2       1.66     2.6265     2.2656     2.9678
3       1.66     2.6265     2.2656     2.9678
4       1.66     2.6265     2.2656     2.9678
5       1.66     2.6265     2.2656     2.9678
6       1.66     2.6265     2.2656     2.9678
7       1.66     2.6265     2.2656     2.9678

Edit (in response to the comments): without using a lambda function:

def fx(x):
    # Example of x = [4, newColumn1, 1.66, newColumn2, 2.6265, newColumn3, 2.2656, newColumn4, 2.9678]
    return pd.Series(x[2::2], index=x[1::2]) # Instantiate a pandas series from the list `x` and returns it.

result = df['ColumnName1'].str.split('|').apply(fx)
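The comment thread raises two further points: appending to a DataFrame row by row copies it each time, and some rows carry a different set of names or are empty. A minimal sketch of one way to cope with both, assuming each cell is either a string in the count|name|value|... layout or an empty value, is to collect plain dicts first. The helper name parse_cells and the sample rows are hypothetical, not part of the original answers.

```python
def parse_cells(cells, sep="|"):
    """Turn 'n|name1|value1|name2|value2|...' strings into a list of dicts."""
    records = []
    for cell in cells:
        if not isinstance(cell, str):
            # Empty cells (None/NaN) cannot be split; keep an empty row instead.
            records.append({})
            continue
        tokens = cell.strip().split(sep)
        # tokens[0] is the leading count; names and values alternate after it.
        records.append(dict(zip(tokens[1::2], tokens[2::2])))
    return records


# Hypothetical sample: a regular row, an empty cell, and a row with other names.
sample = [
    "4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678",
    None,
    "2|xxx|23.23|YYY|4.23",
]
records = parse_cells(sample)
print(records[0]["newColumn1"])  # 1.66
print(records[1])                # {}
```

Passing the list to pd.DataFrame(records) once, after the loop, builds the frame in a single step; pandas takes the union of the names as columns and leaves NaN where a row lacks a name. The values are still strings at that point, so converting them to numbers would be a separate step.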


Is the "Excel" file an .xls or .xlsx file, or just a plain .csv?

It is an Excel file, but I convert it to csv, which makes it faster. My problem is with this specific part of the parsing, though. Basically, I run a lot of functions on this data, and they all happen in one DataFrame.

What is dfOutputWithTestThatFailed? A pandas DataFrame? A DataFrame is copied on every append, which is slower than just using Python lists. Do you know the column names in advance, or do you extract them from the data? Is it consistent? Once the names have been grabbed from the second row, can we assume they stay the same throughout?

dfOutputWithTestThatFailed is a pandas DataFrame, yes. I don't know the column names in advance, but I go over the column in another function and collect all the names before running this one. The data (the numbers) differ from cell to cell. The names in the second row do not represent all the columns, which is why I have to go over everything. Sometimes, at row 2300 for example, there may be 9|xxx|23.23|YYY|4.23... and so on, with 9 names instead of 4, so I need to go through the column from start to end.

I used this, and instead of 75 seconds for one column it took 11! Thanks. I would like to know what exactly the lambda expression is doing, if you can explain? Or if you wrote it as a loop instead of a lambda, I could work it out myself. Thanks again!

df['ColumnName1'].apply iterates over the column row by row, so for a given row the lambda function simply converts that row into a Series. Note: each row in the column is a list of tokens such as [4, newColumn1, 1.66, newColumn2, 2.6265, newColumn3, 2.2656, newColumn4, 2.9678], and these tokens are obtained by using str.split.

@Itzik, if this answers your question, you may accept the answer.

Thanks, yes. I just don't understand the (x[2::2], index=x[1::2]) part: how does it skip the first value? I also have another problem: when a cell is None it crashes, so I tried to add an "if" statement but ran into some trouble. For example, if I have a row like this: [...]

x[2::2] is just slice notation for lists; it returns [1.66, 2.6265, 2.2656, 2.9678]. Similarly, x[1::2] returns the headers [newColumn1, newColumn2, newColumn3, newColumn4].
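To make the slicing concrete, here is a small standalone illustration using one sample row from the question. The variable names tokens, names and values are chosen here for illustration only.

```python
row = "4|newColumn1|1.66|newColumn2|2.6265|newColumn3|2.2656|newColumn4|2.9678"
tokens = row.split("|")

# Slice syntax is [start:stop:step]; an empty stop means "up to the end of the list".
names = tokens[1::2]   # start at index 1, then every 2nd item: the column names
values = tokens[2::2]  # start at index 2, then every 2nd item: the numbers, as strings

# Index 0 (the leading "4") is never visited, because both slices start after it.
print(names)
print(values)
```

Both slices have the same length as long as every name is followed by a value, which is what lets them be paired up as index and data.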
