Python: How to convert a pipe-delimited key=value file into a proper CSV file with a header row

Tags: python, python-2.7, csv


Hi, I have a file in this format:

key1=abc||key2=ajdskj||name=ankush||contact=123444
key1=def||name=reddy||contact=456778
key1=aef||address=ashaskawe||name=john

How can I use Python to convert it into a delimited file with a header row? For example:

key1||key2||name||contact||address
abc||ajdskj||ankush||123444||NULL
def||NULL||reddy||456778||NULL
aef||NULL||john||NULL||ashaskawe

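Please also let me know what approach would work if the file has more fields than these.

I tried reading the file with the csv reader and with pandas, but I don't know how to separate the keys from the values.

Thanks for your help.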

I'm not sure whether pandas can do this directly, but separating the keys by hand took me a while (it wasn't that bad).

Code:

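Here is one way to do it that uses only tools from the standard library and preserves the column order. The file messy_data.txt contains the raw data, and cleaner_data.txt is where the cleaned data is saved: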

from collections import defaultdict, OrderedDict

with open('messy_data.txt') as infile, open('cleaner_data.txt', 'w') as outfile:
    # Split every line into its "key=value" fields
    whole_data = [x.strip().split("||") for x in infile]
    # Collect the keys in order of first appearance
    headers = []
    for x in whole_data:
        for k in [y.split("=")[0] for y in x]:
            if k not in headers:
                headers.append(k)
    # Turn each line into a dict of key -> value
    whole_data = [dict(y.split("=") for y in x) for x in whole_data]
    # Build one column per header, filling missing keys with 'NULL'
    output = defaultdict(list)
    for header in headers:
        for d in whole_data:
            output[header].append(d.get(header, 'NULL'))
    output = OrderedDict((x, output.get(x)) for x in headers)
    # Write the header row, then transpose the columns back into rows
    outfile.write("||".join(list(output.keys())) + "\n")
    for row in zip(*output.values()):
        outfile.write("||".join(row) + "\n")

This produces:

key1||key2||name||contact||address
abc||ajdskj||ankush||123444||NULL
def||NULL||reddy||456778||NULL
aef||NULL||john||NULL||ashaskawe
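If a standard comma-separated file is acceptable as the output, the same idea can be sketched with `csv.DictWriter`, whose `restval` argument fills in keys that a row is missing. This is only a sketch under some assumptions: it targets Python 3.7+ (it relies on dicts keeping insertion order), the helper names `parse_line` and `convert` are made up for illustration, and a field with no `=` is kept as a key with an empty value rather than rejected.

```python
import csv
import io


def parse_line(line, sep="||"):
    """Turn 'k1=v1||k2=v2' into a dict, splitting on the first '=' only."""
    record = {}
    for field in line.strip().split(sep):
        if not field:
            continue  # skip empty fields, e.g. from a trailing separator
        # Assumption: a field without '=' becomes a key with an empty value
        key, _, value = field.partition("=")
        record[key] = value
    return record


def convert(lines, out, missing="NULL"):
    """Write the records to `out` as comma-separated CSV with a header row."""
    records = [parse_line(line) for line in lines if line.strip()]
    headers = []
    for record in records:
        for key in record:
            if key not in headers:
                headers.append(key)  # keep order of first appearance
    writer = csv.DictWriter(out, fieldnames=headers, restval=missing,
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return headers


sample = [
    "key1=abc||key2=ajdskj||name=ankush||contact=123444",
    "key1=def||name=reddy||contact=456778",
    "key1=aef||address=ashaskawe||name=john",
]
buf = io.StringIO()
convert(sample, buf)
print(buf.getvalue())
```

The `csv` module only accepts a single-character delimiter, so this variant cannot write `||` between fields; for that, joining the values by hand as in the answer above is still needed.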

Edit: a version of the script that is easier to debug:

from collections import defaultdict, OrderedDict

with open('messy_data.txt') as infile, open('cleaner_data.txt', 'w') as outfile:
    whole_data = [x.strip().split("||") for x in infile]
    # Collect the keys in order of first appearance
    headers = []
    for x in whole_data:
        for k in [y.split("=")[0] for y in x]:
            if k not in headers:
                headers.append(k)
    #whole_data = [dict(y.split("=") for y in x) for x in whole_data]
    # Same conversion as the line above, unrolled so bad lines can be inspected
    whole_data2 = []
    for x in whole_data:
        temp_list = [y.split("=") for y in x]
        try:
            temp_dict = dict(temp_list)
            whole_data2.append(temp_dict)
        except ValueError:
            # dict() raises ValueError when a field is not exactly key=value;
            # print the offending line and skip it
            print(temp_list)
            continue
    output = defaultdict(list)
    for header in headers:
        for d in whole_data2:
            output[header].append(d.get(header, 'NULL'))
    output = OrderedDict((x, output.get(x)) for x in headers)
    print(output)
    outfile.write("||".join(list(output.keys())) + "\n")
    for row in zip(*output.values()):
        outfile.write("||".join(row) + "\n")

I hope this proves useful.
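To pin down which fields are malformed rather than only printing the whole line, something along these lines might help. It is a rough sketch, not part of the original answer: `split_pairs` and `find_bad_lines` are hypothetical helper names, and it treats any non-empty field without an `=` (or with an empty key) as bad.

```python
def split_pairs(line, sep="||"):
    """Return (pairs, bad): valid (key, value) tuples and malformed fields."""
    pairs, bad = [], []
    for field in line.strip().split(sep):
        if not field:
            continue  # ignore empty fields such as those on blank lines
        # partition splits on the first '=' only, so values may contain '='
        key, eq, value = field.partition("=")
        if eq and key:
            pairs.append((key, value))
        else:
            bad.append(field)
    return pairs, bad


def find_bad_lines(lines):
    """Return a list of (line_number, bad_fields) for lines with problems."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        _, bad = split_pairs(line)
        if bad:
            problems.append((lineno, bad))
    return problems


sample = [
    "key1=abc||name=ankush",
    "key1=def||oops||name=reddy",   # 'oops' has no '='
    "key1=x||url=a=b",              # value contains '=', still valid here
]
for lineno, bad in find_bad_lines(sample):
    print("line %d: malformed fields %r" % (lineno, bad))
```

Running `find_bad_lines` over the real file first gives the line numbers to look at before attempting the full conversion.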

Solution:

Reading the file:

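import numpy as np
import pandas as pd

# Splitting '||' on '|' leaves an empty column between fields; dropna() removes them
df = pd.read_csv('data.csv', delimiter='|', header=None)
dfu = df.unstack().dropna()
# Split every "key=value" cell into parallel arrays of keys and values
keys, values = np.array(dfu.apply(lambda s: str.split(s, '=')).tolist()).T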

Building the DataFrame:

data = dfu.to_frame()
data['keys'] = keys
data['values'] = values
# After reset_index(), level_1 holds the original row number
final = data.reset_index().pivot(index='level_1', columns='keys', values='values')

keys       address contact key1    key2    name
level_1                                        
0             None  123444  abc  ajdskj  ankush
1             None  456778  def    None   reddy
2        ashaskawe    None  aef    None    john

A simple regular-expression approach that is easy to maintain:

import re

# Uses the Python 3 print function; on Python 2.7 add
# "from __future__ import print_function" at the top of the file
with open('test.txt', 'r') as f:
    lines = f.readlines()

print('key1', 'key2', 'name', 'contact', 'address', sep='||')
for line in lines:
    # For each known key, capture its value or fall back to 'NULL'
    if re.search(r'key1=(\w+)', line):
        k1 = re.search(r'key1=(\w+)', line).group(1)
    else:
        k1 = 'NULL'
    if re.search(r'key2=(\w+)', line):
        k2 = re.search(r'key2=(\w+)', line).group(1)
    else:
        k2 = 'NULL'
    if re.search(r'address=(\w+)', line):
        a = re.search(r'address=(\w+)', line).group(1)
    else:
        a = 'NULL'
    if re.search(r'name=(\w+)', line):
        n = re.search(r'name=(\w+)', line).group(1)
    else:
        n = 'NULL'
    if re.search(r'contact=(\w+)', line):
        c = re.search(r'contact=(\w+)', line).group(1)
    else:
        c = 'NULL'
    print(k1, k2, n, c, a, sep=' || ')

Output:

 key1||key2||name||contact||address
 abc || ajdskj || ankush || 123444 || NULL
 def || NULL || reddy || 456778 || NULL
 aef || NULL || john || NULL || ashaskawe
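The per-key searches can also be collapsed into a single pattern that pulls out every `key=value` pair on a line, so a new column only means adding a name to a list. This is a sketch under assumptions, not the answerer's code: the names `PAIR` and `to_row` are made up, keys are assumed to match `\w+`, and values are assumed never to contain a `|`.

```python
import re

# One key=value pair: a word-like key, '=', then everything up to '|' or newline
PAIR = re.compile(r"(\w+)=([^|\n]*)")


def to_row(line, columns, missing="NULL", sep="||"):
    """Extract all pairs from a line and lay them out in column order."""
    found = dict(PAIR.findall(line))
    return sep.join(found.get(col, missing) for col in columns)


columns = ["key1", "key2", "name", "contact", "address"]
lines = [
    "key1=abc||key2=ajdskj||name=ankush||contact=123444\n",
    "key1=def||name=reddy||contact=456778\n",
    "key1=aef||address=ashaskawe||name=john\n",
]
print("||".join(columns))
for line in lines:
    print(to_row(line, columns))
```

Unlike `\w+` for the value, `[^|\n]*` also accepts values with spaces or punctuation, which may or may not be what the real data needs.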

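If you'd like, I think a dictionary/defaultdict could solve this more cleanly. Let me know.

@everestial007 Hi, I'm open to using anything; there is no restriction on which functions to use.

Hi @Abdou, is there a way to unroll whole_data = [dict(y.split("=") for y in x) for x in whole_data] into nested for loops? I have some problems with my data, and to debug them I need to split this line up and print the values when an error occurs. I'm getting an error that says: dictionary update sequence element #0 has length 1; 2 is required. Thanks for your help.

I made some edits, but where do you think the problem comes from?

The error is thrown when we convert from the list to a dict: dictionary update sequence element #0 has length 1; 2 is required.

It looks like your actual data differs from the mock data you posted here. It seems to me that something in it is not of the form string=string.

Thanks Abdou, your code works fine; there is no problem with the code. The problem is in my data. I'm modifying the code to handle it. Thanks.

Hi, thanks for your answer. Please tell me how to handle "_csv.Error: line contains NULL byte". It is raised on df=pd.read_csv('data.csv', delimiter='[|]+', header=None, engine='python').

Maybe not all of the file follows the format you showed. Can you share a link to your file (or a sample of it) so it can be analyzed?

It's a file with nearly a million lines, close to 2 GB in size, and I don't know exactly where the problem is.

You can try df=pd.read_csv('data.csv', delimiter='|', header=None, error_bad_lines=False).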
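For a file of that size with stray NUL bytes, one option that avoids holding everything in memory is to read the file twice with the standard library: once to collect the headers, once to write the rows. The following is only a sketch under assumptions: it targets Python 3.7+, assumes the file can be decoded as UTF-8 (undecodable bytes are replaced), drops NUL characters outright, and silently ignores fields that are not `key=value`. The function names are made up for illustration.

```python
def iter_records(path, sep="||"):
    """Yield one dict per non-empty line, dropping NUL characters."""
    # Assumption: UTF-8 input; errors="replace" keeps going on bad bytes
    with open(path, encoding="utf-8", errors="replace") as infile:
        for line in infile:
            line = line.replace("\x00", "").strip()
            if not line:
                continue
            record = {}
            for field in line.split(sep):
                key, eq, value = field.partition("=")
                if eq:  # ignore fields that are not key=value
                    record[key] = value
            yield record


def convert_streaming(src, dst, sep="||", missing="NULL"):
    """Two passes over src: collect headers, then write the rows to dst."""
    headers, seen = [], set()
    for record in iter_records(src, sep):
        for key in record:
            if key not in seen:
                seen.add(key)
                headers.append(key)  # order of first appearance
    with open(dst, "w", encoding="utf-8") as outfile:
        outfile.write(sep.join(headers) + "\n")
        for record in iter_records(src, sep):
            row = [record.get(h, missing) for h in headers]
            outfile.write(sep.join(row) + "\n")
    return headers
```

A call such as `convert_streaming('messy_data.txt', 'cleaner_data.txt')` would then keep only one line in memory at a time, at the cost of reading the input twice.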
