Python: convert a list of strings to a data frame based on the values in the strings


python python-3.x pandas


I have a list of strings like this:

input = ["number__128_alg__hello_min_n__7_max_n__9_full_seq__True_random_color__False_shuffle_shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__hi_min_n__7_max_n__9_full_seq_embedding__False_random_color__False_shuffle_shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__what_random_color__False_shuffle_shapes__False.pkl"]

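Each string has the format: a parameter name, followed by "__", then the parameter value. After the value, a single "_" precedes the next parameter name. Note that some parameter names themselves contain "_" (for example "shuffle_shapes"). Each string has a different set of parameters, but the sets overlap. I therefore want to build a data frame with one column per parameter name, where each row holds the values for the corresponding element of the input list. If a given element of the list has no value for a parameter, the data frame should contain NA, NaN, or something similar.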

How can I do this?

Thanks.

Edit: how would this be handled if the original list were edited to the following?

input = ["number__128_alg__hello_min.n__7_max.n__9_full.seq__True_random.color__False_shuffle.shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__hi_min.n__7_max.n__9_full.seq__False_random.color__False_shuffle.shapes__False.pkl", "k__9_window__10_number__128_overlap__True_alg__what_random.color__False_shuffle.shapes__False.pkl"]

This is possible if you assume that values cannot contain the "_" character (and also that you want to drop the trailing ".pkl" in the end).

A simple regular expression should do it:

import re
# Drop the trailing ".pkl" ([:-4]), then capture (name, value) pairs joined by "__".
data = [dict(re.findall(r"([^_].*?)__([^_]+)", s[:-4])) for s in input]
print(data)

Result:

[{'number': '128',
  'alg': 'hello',
  'min_n': '7',
  'max_n': '9',
  'full_seq': 'True',
  'random_color': 'False',
  'shuffle_shapes': 'False'},
 {'k': '9',
  'window': '10',
  'number': '128',
  'overlap': 'True',
  'alg': 'hi',
  'min_n': '7',
  'max_n': '9',
  'full_seq_embedding': 'False',
  'random_color': 'False',
  'shuffle_shapes': 'False'},
 {'k': '9',
  'window': '10',
  'number': '128',
  'overlap': 'True',
  'alg': 'what',
  'random_color': 'False',
  'shuffle_shapes': 'False'}]
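
To see why the pattern copes with single underscores inside parameter names, here is a minimal sketch on a shortened, made-up input string. The name `PAIR` and the sample string are illustrative assumptions, not part of the original answer. The name group expands lazily until the first "__", and the value group stops at the next underscore, which is why values must not contain "_".

```python
import re

# Name: starts with a non-underscore, then expands lazily up to the first "__".
# Value: one or more non-underscore characters (so it stops at the next "_").
PAIR = re.compile(r"([^_].*?)__([^_]+)")

sample = "alg__hi_min_n__7"  # hypothetical shortened input
pairs = PAIR.findall(sample)
print(pairs)  # [('alg', 'hi'), ('min_n', '7')]
```

Under that assumption the string is read as alg = "hi" and min_n = "7", never as alg = "hi_min".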

As a data frame:

import pandas as pd
pd.DataFrame(data)
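
As a self-contained sketch of what the data frame looks like, the example below uses two shortened, hypothetical file names in the same format. Parameters that are absent from a string end up as NaN in that row. All parsed values are strings, so the optional conversion step at the end (an addition, not part of the original answer) turns the numeric-looking columns into numbers.

```python
import re
import pandas as pd

# Shortened, hypothetical file names in the "name__value_name__value.pkl" format.
names = [
    "number__128_alg__hello_min_n__7.pkl",
    "k__9_number__128_alg__hi.pkl",
]

# Same regex as in the answer; values are assumed to contain no underscore.
records = [dict(re.findall(r"([^_].*?)__([^_]+)", s[:-4])) for s in names]

# Keys missing from a record become NaN in that row.
df = pd.DataFrame(records)

# Every parsed value is a string; convert the numeric-looking columns.
for col in ["number", "min_n", "k"]:
    df[col] = pd.to_numeric(df[col])

print(df)
```

Columns such as min_n and k become floating point here because they contain NaN.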

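This would be easy if parameter names contained no "_". Given that requirement, it is almost impossible to tell which underscores are separators and which are simply part of a parameter name.

Would it be possible if the parameter names contained only periods instead of underscores?

Yes, then it is simple. The separator just needs to be something that cannot appear in a column name.

Some inputs are ambiguous. For example, alg__hi_min_n__7 could mean {"alg": "hi", "min_n": 7} or {"alg": "hi_min", "n": 7}. It is hard to tell, right?

Can you suggest an answer based on my edit? The underscores in the parameter names have been replaced with periods.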
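
For the edited input, where parameter names use periods instead of underscores, one possible sketch is to split on every single underscore and then split each piece on "__". The helper name `parse_name` is hypothetical, and the approach assumes that neither names nor values contain "_".

```python
import re

def parse_name(filename):
    """Parse 'a__1_b.c__2.pkl' into {'a': '1', 'b.c': '2'}.

    Assumption: neither names nor values contain "_", so every single
    underscore separates two name/value pairs.
    """
    stem = filename[:-4] if filename.endswith(".pkl") else filename
    # Split on an underscore that is not part of a double underscore.
    pairs = re.split(r"(?<!_)_(?!_)", stem)
    return dict(pair.split("__", 1) for pair in pairs)

example = "k__9_alg__what_random.color__False_shuffle.shapes__False.pkl"
print(parse_name(example))
```

The resulting dictionaries can be passed to pd.DataFrame in the same way as in the answer above.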
