Python: keep the rows with the maximum value of a specific column


python arrays csv sorting


I'm new to Python and would like to do the following. I have a csv file, input.csv, with a header row and 4 columns. Part of the file looks like this:

gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
IFIT1 0.000439152 0.145488 81.499
IFIT3 5.87135E-005 0.0838258 77.1737
RSAD2 6.7615E-006 0.0685623 141.898
RSAD2 3.98875E-005 0.0760279 136.772
IFITM1  0.00176673 0.230063 72.0445

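For each gene name I want to keep only the row with the highest fold-change value and remove all other rows that have the same gene name and a lower fold-change value. For example, in this case I need a csv output file in the following format: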

gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
RSAD2 6.7615E-006 0.0685623 141.898   
IFIT3 5.87135E-005 0.0838258 77.1737

I would be grateful if you could provide a solution to this problem. Thank you very much.

Try using pandas:

import pandas as pd

df = pd.read_csv('YOUR_PATH_HERE')

print(df.loc[(df['gene-name'] != df.loc[(df['fold-change'] == df['fold-change'].max())]['gene-name'].tolist()[0])])

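The code is long because I chose to do it in one line, but here is what it does: I get the gene name with the highest fold change, then use the != operator to select every row whose gene name differs from the one we just found.

Breakdown: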

# gets the max value in fold-change
max_value = df['fold-change'].max()

# gets the gene name of that max value
gene_name_max = df.loc[df['fold-change'] == max_value]['gene-name']

# reassigning so you see the progression of grabbing the name
gene_name_max = gene_name_max.values[0]

# the final output
df.loc[(df['gene-name'] != gene_name_max)]

Output:

gene-name   p-value stepup(p-value) fold-change
0   IFIT1   0.000068    0.087431    96.0464
1   IFITM1  0.003044    0.290752    86.3192
2   IFIT1   0.000439    0.145488    81.4990
3   IFIT3   0.000059    0.083826    77.1737
6   IFITM1  0.001767    0.230063    72.0445

Edit:

To get the expected output, use groupby (the `df.groupby` snippet further down):

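Naive solution: iterate over every line in the file and compare manually. Assumptions:

- Each column is separated by a space.
- The number of result rows fits in memory, since the whole search and comparison has to finish before the results are flushed to the file.
- There is no presorting, so it scales poorly: it walks the result list for every input line.
- If the same fold-change value shows up again later for a gene, you want to keep the first row you saw.

The script is the `inputfile.csv` listing at the end.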

One advantage of this approach is that it preserves the order in which genes are first encountered in the input file.

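Have you tried anything? Post your code…

I tried sorting by name first and then using df.sort to keep the first, highest fold-change value for each gene, but it didn't work.

Sorry, this is not what I want. I need a distinct gene name on each row, with the highest fold-change value. Your script doesn't remove all the rows with the same gene name and a lower fold-change value. Is that clear, or do you need more information?

A bit confused... do you need the max value for each gene name?

@ManolisSemidalas updated according to your expected output.

So could you post the complete script with the new groupby command, since the order of the commands isn't clear to me? Thanks.

@ManolisSemidalas, updated. Let me know if this helps. You may need to wrap the last line of code in a print call.

Unfortunately, I got the error: temp_a[0].append(temp_a) AttributeError: 'str' object has no attribute 'append'. Why are we getting this error, @cowbert?

Reload the page; that was due to a typo.

Unfortunately, I got a new error: if float(temp_a[3]) > float(out_a[pos][3]): IndexError: list index out of range. How can we fix it?

I don't know; I just edited the code and it works for the case in your original question. Is your input file missing a field somewhere?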

import pandas as pd

df = pd.read_csv('YOUR_PATH_HERE')
df.groupby(['gene-name'], sort=False)['fold-change'].max()

# output below
gene-name
IFIT1      96.0464
IFITM1     86.3192
IFIT3      77.1737
RSAD2     141.8980
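
The groupby call above returns only the maximum fold-change per gene, not the whole rows. If the full rows are needed, one option is to ask each group for the index label of its maximum with `idxmax` and then select those rows. This is a minimal sketch, not the original answer's code. It assumes the file is separated by whitespace, as the sample in the question suggests, hence `sep=r"\s+"`. The inline `SAMPLE` text is a stand-in for input.csv.

```python
import io

import pandas as pd

# Inline copy of the sample rows from the question (stands in for input.csv).
SAMPLE = """gene-name p-value stepup(p-value) fold-change
IFIT1 6.79175E-005 0.0874312 96.0464
IFITM1 0.00304362 0.290752 86.3192
IFIT1 0.000439152 0.145488 81.499
IFIT3 5.87135E-005 0.0838258 77.1737
RSAD2 6.7615E-006 0.0685623 141.898
RSAD2 3.98875E-005 0.0760279 136.772
IFITM1  0.00176673 0.230063 72.0445
"""

# Assumption: columns are separated by runs of whitespace, not commas.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")

# For each gene, find the index label of the row with the largest
# fold-change, then select those whole rows from the original frame.
best_idx = df.groupby("gene-name", sort=False)["fold-change"].idxmax()
best_rows = df.loc[best_idx]

# Render the result space-separated, like the input; pass a path as the
# first argument to to_csv to write a file instead.
print(best_rows.to_csv(sep=" ", index=False))
```

With `sort=False` the genes come out in the order they first appear in the input. On ties, `idxmax` returns the first occurrence of the maximum.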

with open('inputfile.csv', 'r') as fi: # read; 'with' closes the file for us
    header = fi.readline()
    # capture the header line ("gene-name p-value stepup(p-value) fold-change")

    out_a = [] # we will store the results in here

    for line in fi: # we can read a line this way too
        temp_a = line.split()
        # split the line into a list on any run of whitespace; this also
        # drops the newline and tolerates doubled or trailing spaces

        if len(temp_a) < 4:
            continue
            # skip blank or incomplete lines, which would otherwise
            # raise IndexError below

        try:
            pos = [gene[0] for gene in out_a].index(temp_a[0])
            # see whether the gene has been seen before
            # [0] is the first column (gene-name)
            # returns the position in out_a where the existing gene is
        except ValueError: # raised by list.index if the value is not found
            out_a.append(temp_a)
            # add it to the list initially
        else: # we found an existing gene
            if float(temp_a[3]) > float(out_a[pos][3]):
                # new line has higher fold-change (column 4)
                out_a[pos] = temp_a
                # so we replace

with open('outfile.csv', 'w') as fo: # prepare to write to output
    fo.write(header) # don't forget about our header
    for result in out_a:
        # iterate through out_a and write each line to fo
        fo.write(' '.join(result) + '\n')
        # result is a list of fields, e.g. [XXXX,...,1234]
        # ' '.join(result) turns it back into a line
        # the '\n' puts each result on its own line
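
The script above rescans the result list for every input line, which is the poor scaling the answer mentions. A possible variant keeps the best row per gene in a dict, so each lookup takes constant time on average. This is a sketch under the same assumptions (whitespace-separated columns, gene name in the first column, fold-change in the fourth). The function name `keep_max_rows` and its parameters are made up for illustration.

```python
def keep_max_rows(lines, key_col=0, value_col=3):
    """Return the header row plus, per key, the row with the largest value.

    `lines` is any iterable of text lines whose first line is the header.
    (Hypothetical helper, not part of the original answer.)
    """
    lines = iter(lines)
    header = next(lines).split()
    best = {}  # gene name -> list of fields; dicts keep insertion order
    for line in lines:
        fields = line.split()  # split on any run of whitespace
        if len(fields) <= max(key_col, value_col):
            continue  # skip blank or incomplete lines
        gene = fields[key_col]
        # Strict ">" keeps the first row seen when fold-changes tie.
        if gene not in best or float(fields[value_col]) > float(best[gene][value_col]):
            best[gene] = fields
    return [header] + list(best.values())


sample = [
    "gene-name p-value stepup(p-value) fold-change",
    "IFIT1 6.79175E-005 0.0874312 96.0464",
    "IFITM1 0.00304362 0.290752 86.3192",
    "IFIT1 0.000439152 0.145488 81.499",
    "IFIT3 5.87135E-005 0.0838258 77.1737",
    "RSAD2 6.7615E-006 0.0685623 141.898",
    "RSAD2 3.98875E-005 0.0760279 136.772",
    "IFITM1  0.00176673 0.230063 72.0445",
]

rows = keep_max_rows(sample)
for row in rows:
    print(" ".join(row))
```

Because overwriting the value of an existing dict key does not change that key's position, the genes still come out in first-encountered order, as in the list-based script. To process a file, pass the open file object as `lines`.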