Python: extracting and sorting data from a file
I am trying to extract data from a large CSV file in the following format, where "x" stands for data in text or integer form. Each group has a unique id, but the number of rows per group (that is, the set of colors present) is not always the same. The data is separated from the color by a comma.
id, x
red, x
green, x
blue, x
black, x
id, x
yellow, x
green,
blue, x
black, x
id, x
red, x
green, x
blue, x
black, x
id, x
red, x
green, x
blue, x
id, x
red, x
green, x
blue, x
black, x
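I want to rearrange the data into column format. The ID should be the first column, with every value separated by commas. My goal is to have the script read the first word of each line and put that line's value in the matching column.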
line 0 - ID - red - green - blue - yellow - black
line 1 - x, x, x, , x,
line 2 - , x, x, x, x,
line 3 - x, x, x, , x,
line 4 - x, x, x, , ,
line 5 - x, x, x, , x,
This is what I was looking for:
readfile = open("db-short.txt", "r")
datafilelines = readfile.readlines()
writefile = open("sample.csv", "w")
temp_data_list = ["",]*7
td_index = 0
for line_with_return in datafilelines:
    line = line_with_return.replace('\n','')
    if not line == '':
        if not (line.startswith("ID") or
                line.startswith("RED") or
                line.startswith("GREEN") or
                line.startswith("BLUE") or
                line.startswith("YELLOW") or
                line.startswith("BLACK") ):
            temp_data_list[td_index] = line
            td_index += 1
            temp_data_list[6] = line
        if (line.startswith("BLACK") or line.startswith("BLACK")):
            temp_data_list[5] = line
        if (line.startswith("YELLOW") or line.startswith("YELLOW")):
            temp_data_list[4] = line
        if (line.startswith("BLUE") or line.startswith("BLUE")):
            temp_data_list[3] = line
        if (line.startswith("GREEN") or line.startswith("GREEN")):
            temp_data_list[2] = line
        if (line.startswith("RED") or line.startswith("RED")):
            temp_data_list[1] = line
        if (line.startswith("ID") or line.find("ID") > 0):
            temp_data_list[0] = line
    if line == '':
        temp_data_str = ""
        for temp_data in temp_data_list:
            temp_data_str += temp_data + ","
        temp_data_str = temp_data_str[0:-1] + "\n"
        writefile.write(temp_data_str)
        temp_data_list = ["",]*7
        td_index = 0
if temp_data_list[0]:
    temp_data_str = ""
    for temp_data in temp_data_list:
        temp_data_str += temp_data + ","
    temp_data_str = temp_data_str[0:-1] + "\n"
    writefile.write(temp_data_str)
readfile.close()
writefile.close()
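The chain of startswith checks above only matches upper-case keywords, while the sample data uses lower-case ones, so every line would fall through to the temp_data_list[td_index] branch. That is one plausible source of an IndexError once td_index passes the end of the list. A minimal sketch of a table-driven alternative follows; the names COLUMNS and place are hypothetical and not part of the original script.

```python
# Hypothetical mapping from keyword to column index (not from the original post).
COLUMNS = {"id": 0, "red": 1, "green": 2, "blue": 3, "yellow": 4, "black": 5}


def place(line, slots):
    """Store `line` in the slot for its leading keyword.

    Returns True if the keyword was recognised, False otherwise,
    so unknown lines never index past the end of `slots`.
    """
    # Take the text before the first comma and normalise case and spacing.
    keyword = line.split(",", 1)[0].strip().lower()
    index = COLUMNS.get(keyword)
    if index is None:
        return False
    slots[index] = line
    return True
```

Because unrecognised lines are reported rather than appended at a running index, the list never needs more slots than there are known columns.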
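This assumes Python < 2.7, so it does not take advantage of opening multiple files in a single with statement, using the built-in writeheader to write the header, and so on. Note that to get this working I removed the spaces between the commas in your CSV. As @JamesHenstridge mentioned, it is definitely worth reading up on the csv module so that this makes more sense.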
import csv
with open('testfile', 'rb') as f:
    with open('outcsv.csv', 'wb') as o:
        # Specify your field names
        fieldnames = ('id', 'red', 'green', 'blue', 'yellow', 'black')
        # Here we create a DictWriter, since your data is suited for one
        writer = csv.DictWriter(o, fieldnames=fieldnames)
        # Write the header row
        writer.writerow(dict((h, h) for h in fieldnames))
        # General idea here is to build a row until we hit a blank line,
        # at which point we write our current row and continue
        new_row = {}
        for line in f.readlines():
            # This will split the line on a comma/space combo and then
            # strip off any commas/spaces that end a word
            row = [x.strip(', ') for x in line.strip().split(', ')]
            if not row[0]:
                writer.writerow(new_row)
                new_row = {}
            else:
                # Here we write a blank string if there is no corresponding value;
                # otherwise, write the value
                new_row[row[0]] = '' if len(row) == 1 else row[1].strip()
        # Check new_row - if not blank, it hasn't been written (so write)
        if new_row:
            writer.writerow(new_row)
Using the data above (plus some random comma-separated numbers), this writes:
id,red,green,blue,yellow,black
x,"2,8","2,4",x,,x
x,,,"4,3",x,x
x,x,x,x,,x
x,x,x,x,,
x,x,x,x,,x
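For readers on Python 3, the same idea can be written with a single with statement and writeheader. This is only a sketch under stated assumptions: the function name convert and the constant FIELDNAMES are made up for illustration, groups are assumed to be separated by blank lines, and the line parsing is kept the same as in the code above.

```python
import csv

FIELDNAMES = ("id", "red", "green", "blue", "yellow", "black")


def convert(in_path, out_path):
    """Collapse blank-line-separated "key, value" groups into CSV rows."""
    # newline="" on the output file lets the csv module control line endings.
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDNAMES)
        writer.writeheader()
        new_row = {}
        for line in src:
            # Same parsing as above: split on ", " and trim stray commas/spaces.
            row = [x.strip(", ") for x in line.strip().split(", ")]
            if not row[0]:
                # Blank line: the current group is complete.
                if new_row:
                    writer.writerow(new_row)
                    new_row = {}
            else:
                new_row[row[0]] = "" if len(row) == 1 else row[1].strip()
        # Write the last group if the file does not end with a blank line.
        if new_row:
            writer.writerow(new_row)
```

Columns missing from a group are left empty because DictWriter fills absent keys with its default restval of an empty string. A key that is not listed in FIELDNAMES would raise ValueError, which is DictWriter's default behaviour.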
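What have you tried so far? The standard library csv module might be a good place to start.
I know you said you want a Python solution, but have you considered R? It is built for tasks like this.
I'll admit I am new to programming. I tried using it... but I keep running into this error: IndexError: list assignment index out of range. I now understand that this is because of the format of the data. I will take a look.
Are you missing the start of the if statement inside your for statement?
@JamesHenstridge Ha, yes, not sure how that didn't get pasted. Will update in a bit, thanks for pointing it out.
There are random spaces between the text and the commas. Is there a way to have it detect the spaces and remove them?
@thedave Yes, I shouldn't have changed the data from the way you showed it. I will add an update to address this. The fix is to strip the whitespace from the strings, so in the meantime you can try row[1].strip() on that line (I will update shortly).
I just realized there is another problem: there are commas to the right of the delimiter, such as red, 2,7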
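Both remaining issues raised in these comments (stray spaces around the comma, and extra commas inside the value) can be handled by splitting each line on the first comma only and stripping both halves. The helper below is a hedged sketch; parse_line is a hypothetical name, not something from the answer above.

```python
def parse_line(line):
    """Split a "key, value" line on the FIRST comma only.

    Whitespace around the key and value is removed, and any further
    commas stay inside the value. A blank line yields ("", "").
    """
    key, _, value = line.partition(",")
    return key.strip(), value.strip()
```

For example, parse_line("red, 2,7") gives ("red", "2,7") and parse_line("green,") gives ("green", ""). Inside the reading loop, an empty key can then serve as the signal that a group has ended.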