Using Python (or R) to clean up a table with missing data

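I have a table (curves.csv) that is organised like this ("disorganised" would be a better word):

CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
A,5,k,B,6,l,C,5,m,D,8,n,E,5,o

I want to turn this table into: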

,A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,
What I currently have:

celllines=["A","B","C","D","E"]
sorted_days=["1","2","3","4","5","8"]
for d in sorted_days:
    curves=open("curves.csv","rU")
    for line in curves:
        line=line.rstrip().rsplit(",")
        if line[0]!="CL":#removes header
            for x in range(0,len(line),3):
                if line[x] in celllines:
                    if line[x+1] == d:
                        print d,line[x],line[x+2]
                    else:
                        print d, line[x],""



    curves.close()
I just feel like I'm getting further from the answer rather than closer!
As usual, any pointers would be very welcome.

How about using the csv module to do something like this:

import csv

# make a dictionary to store the data
data = {}

# first, read it in
with open("curves.csv", "rb") as fp:

    # make a csv reader object
    reader = csv.reader(fp)

    # skip initial line
    next(reader)

    for row in reader:
        # for each triplet, store it in the dictionary
        for i in range(len(row)//3):
            CL, D, PD = row[3*i:3*i+3]
            data[D, CL] = PD

# see what we've got
print data

with open("newcurves.csv", "wb") as fp:
    # get the labels in order
    row_labels = sorted(set(k[0] for k in data), key=int)
    col_labels = sorted(set(k[1] for k in data))

    writer = csv.writer(fp)
    # write header
    writer.writerow([''] + col_labels)

    # write data rows
    for row_label in row_labels:
        # start with the label
        row = [row_label]

        # then extend a list of the data in order, using the empty string '' if
        # there's no such value
        row.extend([data.get((row_label, col_label), '') for col_label in col_labels])

        # dump it out
        writer.writerow(row)
This gives us a dictionary

{('1', 'D'): 'd', ('1', 'E'): 'e', ('5', 'C'): 'm', ('1', 'B'): 'b', ('2', 'E'): 'j', ('1', 'C'): 'c', ('5', 'A'): 'k', ('6', 'B'): 'l', ('2', 'C'): 'h', ('1', 'A'): 'a', ('4', 'D'): 'i', ('8', 'D'): 'n', ('2', 'A'): 'f', ('3', 'B'): 'g', ('5', 'E'): 'o'}
and an output file like

~/coding$ cat newcurves.csv 
,A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,
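
The code above is written for Python 2 (print statement, binary file modes for csv). As a rough sketch only, the same idea might look like this on Python 3. The function name reshape and the in-memory SAMPLE data are assumptions made for illustration and are not part of the original answer:

```python
import csv
import io

# Sample input in the (CL, D, PD) triplet layout; stands in for curves.csv.
SAMPLE = (
    "CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD\n"
    "A,1,a,B,1,b,C,1,c,D,1,d,E,1,e\n"
    "A,2,f,B,3,g,C,2,h,D,4,i,E,2,j\n"
    "A,5,k,B,6,l,C,5,m,D,8,n,E,5,o\n"
)

def reshape(src, dst):
    """Read (CL, D, PD) triplets from src and write a day-by-cell-line table to dst."""
    reader = csv.reader(src)
    next(reader)  # skip the header row

    data = {}
    for row in reader:
        # walk over the row three fields at a time
        for i in range(0, len(row) - 2, 3):
            cl, d, pd = row[i:i + 3]
            data[d, cl] = pd

    # days sorted numerically, cell lines alphabetically
    row_labels = sorted({d for d, _ in data}, key=int)
    col_labels = sorted({cl for _, cl in data})

    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow([""] + col_labels)
    for d in row_labels:
        # '' marks a (day, cell line) pair that has no value
        writer.writerow([d] + [data.get((d, cl), "") for cl in col_labels])

out = io.StringIO()
reshape(io.StringIO(SAMPLE), out)
print(out.getvalue())
```

With real files, open both with newline="" as the csv documentation recommends, for example open("curves.csv", newline="") for reading and open("newcurves.csv", "w", newline="") for writing.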

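I find the best way to approach a problem like this is to keep tearing apart the old format separate from building the new one. Rather than converting directly, first break the old format down into a sensible data structure that is easy to work with in Python, then build the new format from that clean, extensible structure.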

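Wherever we work with comma-separated values, we can make our lives easier by using the csv module from the standard library.

This solution also makes heavy use of list comprehensions and related constructs, so if you are not familiar with them, I suggest reading up on them.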

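We first open the file using the with statement (best practice, as it ensures the file gets closed), then skip the header row and parse the data. To do this, we take each row of the data and group it into chunks of length 3 (using the grouper() function, a recipe from the itertools documentation). Each chunk gives us the column, the row and the value, which we then use as the key and value of a dictionary.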
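
A minimal sketch of the parsing step described above, assuming Python 3. grouper() is adapted from the recipe in the itertools documentation; the names parse and SAMPLE, and the choice to convert the row number to int, are assumptions made for illustration:

```python
import csv
import io
from itertools import zip_longest

# Sample input in the (CL, D, PD) triplet layout; stands in for curves.csv.
SAMPLE = (
    "CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD\n"
    "A,1,a,B,1,b,C,1,c,D,1,d,E,1,e\n"
    "A,2,f,B,3,g,C,2,h,D,4,i,E,2,j\n"
    "A,5,k,B,6,l,C,5,m,D,8,n,E,5,o\n"
)

def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length chunks, padding the last one with fillvalue."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

def parse(lines):
    """Return a dict mapping (column, row) to value."""
    reader = csv.reader(lines)
    next(reader)  # skip the header row
    parsed = {}
    for line in reader:
        for column, row, value in grouper(line, 3):
            if value is None:
                continue  # incomplete trailing chunk, nothing to store
            parsed[column, int(row)] = value
    return parsed

parsed = parse(io.StringIO(SAMPLE))
print(parsed["A", 1], parsed["D", 8])
```

In a real script the with statement mentioned above would wrap the file, for example with open("curves.csv", newline="") as fp: parsed = parse(fp).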

This gives us a dictionary of the form {("A", 1): "a", ...}. That is a nice format to work with, so now we rebuild the file in the desired format.

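First, we need to know which rows and columns we need. We simply take them from the parsed data, build a set from each (since sets cannot contain duplicates), and finally sort each set back into a list so that everything is in the right order.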
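
To illustrate this step, here is a small sketch. The dictionary literal is a made-up excerpt of the parsed data, and the variable names are assumptions:

```python
# A hypothetical excerpt of the parsed data, keyed by (column, row).
parsed = {("A", 1): "a", ("B", 1): "b", ("B", 3): "g", ("A", 5): "k", ("D", 8): "n"}

# Sets drop the duplicates; sorted() turns each set back into an ordered list.
columns = sorted({column for column, _ in parsed})
rows = sorted({row for _, row in parsed})

print(columns)  # ['A', 'B', 'D']
print(rows)     # [1, 3, 5, 8]

# If the row numbers are kept as strings, sort them numerically with key=int,
# otherwise "10" would come before "2".
string_rows = sorted({"10", "2", "1"}, key=int)
print(string_rows)  # ['1', '2', '10']
```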

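Then we open the output file and write the columns to it (remembering to add a None for the row-header column), followed by the data. For each row, we write the row number, then fetch the value for each column from the parsed data, using dict.get() so that we get None where there is no value. This produces the desired output.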
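
A sketch of the writing step, under the same assumptions as before (Python 3, a dictionary keyed by (column, row)). The function name write_table and the tiny example dictionary are hypothetical. It relies on csv.writer writing None as an empty field:

```python
import csv
import io

def write_table(parsed, out):
    """Write a dict keyed by (column, row) as a table with one line per row."""
    columns = sorted({column for column, _ in parsed})
    rows = sorted({row for _, row in parsed})

    writer = csv.writer(out, lineterminator="\n")
    # None fills the header cell above the row labels
    writer.writerow([None] + columns)
    for row in rows:
        # dict.get() returns None for a missing (column, row) pair
        writer.writerow([row] + [parsed.get((column, row)) for column in columns])

# A hypothetical excerpt of the parsed data.
example = {("A", 1): "a", ("B", 1): "b", ("B", 3): "g"}
buffer = io.StringIO()
write_table(example, buffer)
print(buffer.getvalue())
```

For a real file, pass a handle opened with open("newcurves.csv", "w", newline="") instead of the StringIO buffer.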


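Note: you appear to be using Python 2.x in your question, while my answer is written for 3.x. The only difference should be that itertools.izip_longest() in 2.x is called itertools.zip_longest() in 3.x.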
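
If the same script has to run on both versions, one common way to handle the renamed function is a guarded import. This is a sketch, not part of the original answer:

```python
try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

# Shorter inputs are padded with fillvalue.
pairs = list(zip_longest("ab", "xyz", fillvalue="-"))
print(pairs)  # [('a', 'x'), ('b', 'y'), ('-', 'z')]
```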

A more compact alternative, with the cell lines and days listed explicitly:

celllines=["","A","B","C","D","E"]
days=["1","2","3","4","5","6","7","8"]

curves = sum([line.split(',') for line in open("curves.csv","rU").read().split()[1:]], [])

group = {(d,cl): pd for (cl,d,pd) in [curves[i:i+3] for i in range(0,len(curves),3)]}
table = [[d if not x else '' for x in celllines] for d in days]

for (d,cl),pd in group.items():
    table[days.index(d)][celllines.index(cl)] = pd

with open("curves2.csv", "w") as f:
    f.write('\n'.join(','.join(line) for line in [celllines]+table))
Just to show (a bit late) that it can also be done in R:

curves <- read.csv("curves.csv", as.is = TRUE)
stack  <- data.frame(CL = unlist(curves[, c(TRUE, FALSE, FALSE)]),
                     D  = unlist(curves[, c(FALSE, TRUE, FALSE)]),
                     PD = unlist(curves[, c(FALSE, FALSE, TRUE)]),
                     stringsAsFactors = FALSE)
library(reshape2)
output <- acast(stack, D ~ CL, value.var = "PD", fill = "")
write.csv(output, "new_curves.csv", quote = FALSE)

An R solution using tapply with the concatenation function c:

crvs <- read.table(text="CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD,CL,D,PD
A,1,a,B,1,b,C,1,c,D,1,d,E,1,e
A,2,f,B,3,g,C,2,h,D,4,i,E,2,j
A,5,k,B,6,l,C,5,m,D,8,n,E,5,o", header=TRUE, sep=",", check.names=FALSE,
                   stringsAsFactors=FALSE)

# stack the five (CL, D, PD) column groups into one long data frame
long <- rbind(crvs[, 1:3], crvs[, 4:6], crvs[, 7:9], crvs[, 10:12], crvs[, 13:15])
# cross-tabulate PD by day (rows) and cell line (columns)
out <- with( long, tapply(PD, list(D, CL), FUN=c) )
#-----------------
write.table(out, quote=FALSE, sep=",", na="")
A,B,C,D,E
1,a,b,c,d,e
2,f,,h,,j
3,,g,,,
4,,,,i,
5,k,,m,,o
6,,l,,,
8,,,,n,

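I'm looking at this and I really can't tell how you get from the first table to the second. Can you explain it a bit more clearly?

The groups of three columns need to be stacked, and then cross-tabulated by "CL" and "D".

Why wouldn't you want to use the module that is designed to do what you're trying to do?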