Selecting (non-index) columns based on the text content of cells in a Python/pandas dataframe

python pandas dataframe

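TL;DR: How do I create a dataframe/series from one or more columns of an existing, non-indexed dataframe, based on those columns containing a specific piece of text?

I'm relatively new to Python and data analysis. This is my first question on Stack Overflow; I have been searching for an answer for a while (and I'm used to writing code regularly), but without any success.

I've imported a dataframe from an Excel file that has no named/indexed columns. I'm trying to extract data from nearly 2000 of these files, which all have slightly different columns of data (of course, because why make it easy... or follow a template... or just use something other than badly formatted Excel spreadsheets...).

The raw dataframe (from the poorly structured XLS file) looks a bit like this: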

0                                       NaN             RIGHT      NaN   
1                                      Date              UCVA      Sph   
2                       2007-01-13 00:00:00              6/38  [-2.00]   
3                       2009-11-05 00:00:00               6/9      NaN   
4                       2009-11-18 00:00:00              6/12      NaN   
5                       2009-12-14 00:00:00               6/9  [-1.25]   
6                       2018-04-24 00:00:00           worn CL  [-5.50]   

           3     4      5                 6     7     8        9   \
0         NaN   NaN    NaN               NaN   NaN   NaN      NaN   
1         Cyl  Axis  BSCVA  Pentacam remarks    K1    K2  K2 back   
2     [-2.75]    65    6/9               NaN   NaN   NaN      NaN   
3         NaN   NaN    NaN               NaN   NaN   NaN      NaN   
4         NaN   NaN    6/5         Pentacam     46  43.9     -6.6   
5     [-5.75]    60  6/6-1               NaN   NaN   NaN      NaN   
6     [+7.00}   170  6/7.5               NaN   NaN   NaN      NaN   

           ...              17                18    19    20       21     22  \
0          ...             NaN               NaN   NaN   NaN      NaN    NaN   
1          ...           BSCVA  Pentacam remarks    K1    K2  K2 back  K max   
2          ...             6/5               NaN   NaN   NaN      NaN    NaN   
3          ...             NaN               NaN   NaN   NaN      NaN    NaN   
4          ...             NaN          Pentacam  44.3  43.7     -6.2   45.5   
5          ...           6/4-4               NaN   NaN   NaN      NaN    NaN   
6          ...             6/5               NaN   NaN   NaN      NaN    NaN

I want to extract a set of dataframes/series that I can then combine to get a "tidy" dataframe, for example:

1                                      Date              R-UCVA      R-Sph   
2                       2007-01-13 00:00:00              6/38  [-2.00]   
3                       2009-11-05 00:00:00               6/9      NaN   
4                       2009-11-18 00:00:00              6/12      NaN   
5                       2009-12-14 00:00:00               6/9  [-1.25]   
6                       2018-04-24 00:00:00           worn CL  [-5.50]   

1       R-Cyl R-Axis R-BSCVA  R-Penta          R-K1   R-K2  R-K2 back   
2     [-2.75]    65    6/9               NaN   NaN   NaN      NaN   
3         NaN   NaN    NaN               NaN   NaN   NaN      NaN   
4         NaN   NaN    6/5         Pentacam     46  43.9     -6.6   
5     [-5.75]    60  6/6-1               NaN   NaN   NaN      NaN   
6     [+7.00}   170  6/7.5               NaN   NaN   NaN      NaN

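And so on. So I'm trying to write some code that extracts a range of columns I define by looking for the word "Date" or "UCVA". I then plan to stitch them back together into a single dataframe, with the patient identifier as an extra column, and then loop through all the XLS files, appending the whole lot to a single CSV file that I can do useful things with (such as putting it into an Access database (yes, I know, but it has to be easy to use and is already installed on NHS computers) and statistical analysis).

Any suggestions? I hope this is enough information.

Many thanks

Regards

Vicky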

Here is something to get you started. I have prepared a `text.xlsx` file that I can read like this:

    import pandas as pd

    path = 'text.xlsx'

    df = pd.read_excel(path, header=[0,1])

    # Deal with two levels of headers, here I just join them together crudely 
    df.columns = df.columns.map(lambda h: '  '.join(h))

    # Slight hack because I messed with the column names
    # I create two dataframes, one with the first column, one with the second column
    df1 = df[[df.columns[0],df.columns[1]]]
    df2 = df[[df.columns[0], df.columns[2]]]

    # Stacking them on top of each other
    result = pd.concat([df1, df2])
    print(result)

    #Merging them on the Date column
    result = pd.merge(left=df1, right=df2, on=df1.columns[0])
    print(result)

This gives the output (shown here for the first `print`, the stacked frame):

  RIGHT  Sph RIGHT  UCVA       Unnamed: 0_level_0  Date
0        NaN              6/38      2007-01-13 00:00:00
1        NaN              6/37      2009-11-05 00:00:00
2        NaN              9/56      2009-11-18 00:00:00
0    [-2.00]               NaN      2007-01-13 00:00:00
1        NaN               NaN      2009-11-05 00:00:00
2        NaN               NaN      2009-11-18 00:00:00
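The crudely joined names above (such as `Unnamed: 0_level_0  Date`) could be tidied into the `R-UCVA` style labels the question asks for. The following is only a minimal sketch: the small frame is a hypothetical stand-in for the result of `pd.read_excel(path, header=[0, 1])`, `tidy_label` is an illustrative helper name, and it assumes the top header level already reads `RIGHT` for every right-eye column, as in the output above.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by pd.read_excel(path, header=[0, 1]).
columns = pd.MultiIndex.from_tuples([
    ("Unnamed: 0_level_0", "Date"),
    ("RIGHT", "UCVA"),
    ("RIGHT", "Sph"),
])
df = pd.DataFrame(
    [["2007-01-13", "6/38", "[-2.00]"],
     ["2009-11-05", "6/9", None]],
    columns=columns,
)


def tidy_label(top, bottom):
    """Build 'R-UCVA' style labels; ignore pandas' 'Unnamed: ...' placeholders."""
    if str(top).startswith("Unnamed"):
        return str(bottom)
    # Use the first letter of the top-level header (e.g. 'R' for 'RIGHT') as a prefix.
    return f"{str(top)[0]}-{bottom}"


# Iterating over a MultiIndex yields (top, bottom) tuples.
df.columns = [tidy_label(top, bottom) for top, bottom in df.columns]
print(list(df.columns))
```

With flat, predictable labels like these, selecting columns by a text fragment becomes a simple string test on the column names.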

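Some pointers:

How do you merge the two header rows? See the existing question and its answers on that topic.

How do you select columns conditionally? There are existing questions and answers covering that as well.

How do you merge dataframes? The pandas documentation has a very good guide on merging. I hope this gets you started.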
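As a rough illustration of the last two pointers, the sketch below keeps only the columns whose label contains a given text fragment, adds a patient identifier, and stacks the per-file results. It is an assumption-laden example, not a finished pipeline: `extract_columns`, the `patient_id` column name, and the two small frames are all hypothetical, and in practice each frame would come from one Excel file.

```python
import pandas as pd


def extract_columns(df, wanted, patient_id):
    """Keep columns whose label contains any of the wanted text fragments."""
    mask = [any(text in str(label) for text in wanted) for label in df.columns]
    out = df.loc[:, mask].copy()
    # Record which patient (file) the rows came from.
    out.insert(0, "patient_id", patient_id)
    return out


# Two hypothetical per-patient frames with slightly different column layouts.
frames = {
    "P001": pd.DataFrame({"Date": ["2007-01-13"], "R-UCVA": ["6/38"], "R-K1": [None]}),
    "P002": pd.DataFrame({"R-UCVA": ["6/9"], "Notes": ["x"], "Date": ["2009-11-05"]}),
}

# pd.concat aligns columns by name, so differing column order per file does not matter.
tidy = pd.concat(
    [extract_columns(df, ["Date", "UCVA"], pid) for pid, df in frames.items()],
    ignore_index=True,
)
print(tidy)
```

The combined frame could then be written out once with `tidy.to_csv(...)` after the loop over all files, rather than appending to the CSV file by file.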


Do I understand the question correctly: you want to extract a set of columns from one dataframe into a new dataframe, and then combine a large number of dataframes together? Do you want to merge them on a column or stack them? A good start would be to use the `header` and `skiprows` parameters when reading the file, assuming each file is formatted similarly. That will index the columns, and you can select the ones you want from there.
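Since the files reportedly differ slightly in layout, the header row may not sit at the same position in each one. One possible approach, sketched below under that assumption, is to read a file without any header and search the cells for a known label such as "Date". The `raw` frame here is a hypothetical stand-in for `pd.read_excel(path, header=None)`.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(path, header=None): labels live in the cells.
raw = pd.DataFrame([
    [None, "RIGHT", None],
    ["Date", "UCVA", "Sph"],
    ["2007-01-13", "6/38", "[-2.00]"],
    ["2009-11-05", "6/9", None],
])

# Mark every row in which at least one cell is exactly "Date".
is_header = raw.eq("Date").any(axis=1)
if not is_header.any():
    raise ValueError("no row containing 'Date' was found")

# idxmax on a boolean Series returns the label of the first True entry.
header_row = int(is_header.idxmax())

# Promote that row to column labels and keep only the rows below it.
data = raw.iloc[header_row + 1:].reset_index(drop=True)
data.columns = raw.iloc[header_row].tolist()
print(header_row)
print(data)
```

The detected `header_row` could alternatively be passed back to `pd.read_excel` as the `header` argument in a second read, which is closer to what the comment above suggests.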