Python: split a row into multiple cells and keep the maximum of the second value for each gene

python arrays csv optimization


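I'm new to Python, and I've prepared a script to modify the following so that:

1) Every row that contains multiple gene entries separated by ///, for example:

C16orf52 /// LOC102725138 1.00551

is turned into:

C16orf52 1.00551  
LOC102725138 1.00551

2) The same gene may appear with different ratio values:

AASDHPPT 0.860705  
AASDHPPT 0.983691

We only want to keep the pair with the highest ratio value (that is, delete the pair AASDHPPT 0.860705).

This is the script I wrote, but it does not assign the correct ratio values to the genes: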

import csv
import pandas as pd

with open('2column.csv','rb') as f:
    reader = csv.reader(f)
    a = list(reader)
gene = []
ratio = []
for t in range(len(a)):
    if '///' in a[t][0]:
        s = a[t][0].split('///')
        gene.append(s[0])
        gene.append(s[1])
        ratio.append(a[t][1])
        ratio.append(a[t][1])
    else:
        gene.append(a[t][0])
        ratio.append(a[t][1])
    gene[t] = gene[t].strip()

newgene = []
newratio = []
for i in range(len(gene)):
    g = gene[i]
    r = ratio[i]
    if g not in newgene:
        newgene.append(g)
    for j in range(i+1,len(gene)):
        if g==gene[j]:
            if ratio[j]>r:
                r = ratio[j]
    newratio.append(r)

for i in range(len(newgene)):
    print newgene[i] + '\t' + newratio[i]

if len(newgene) > len(set(newgene)):
    print 'missionfailed'

Thank you very much for any help or suggestions.

Try this:

import csv

with open('2column.csv') as f:
    lines = f.read().splitlines()

# Map each gene symbol to the highest ratio seen for it.
new_lines = {}
for line in lines:
    cols = line.split(',')
    if len(cols) < 2:
        continue  # skip blank lines or lines without a comma
    for part in cols[0].split('///'):
        part = part.strip()
        # Compare as floats so the ordering is numeric.
        if part not in new_lines or float(cols[1]) > float(new_lines[part]):
            new_lines[part] = cols[1]

# In Python 3 the csv module expects a text-mode file opened with newline=''.
with open('clean_2column.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=' ',
                        quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for k, v in new_lines.items():
        writer.writerow([k, v])
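The dictionary idea can also be tried on the sample rows from the question without touching any file. This is only a sketch: the function name `max_ratio_per_gene` and the in-memory `sample` list are made up for illustration, and it stores the ratios as floats rather than as the original strings.

```python
def max_ratio_per_gene(rows):
    """Return {gene: highest ratio} for (gene_field, ratio) pairs."""
    best = {}
    for gene_field, ratio in rows:
        value = float(ratio)  # compare numerically, not as text
        # A field without '///' simply yields a one-element list.
        for gene in gene_field.split('///'):
            gene = gene.strip()
            if gene not in best or value > best[gene]:
                best[gene] = value
    return best


# Hypothetical sample, copied from the rows shown in the question.
sample = [
    ('C16orf52 /// LOC102725138', '1.00551'),
    ('AASDHPPT', '0.860705'),
    ('AASDHPPT', '0.983691'),
]

result = max_ratio_per_gene(sample)
for gene, ratio in result.items():
    print(gene, ratio)
```

Each gene ends up in the dictionary exactly once, so no separate de-duplication pass is needed.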

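First of all, if you are importing pandas anyway, be aware that you have to read the CSV file with it.

So let's start by loading it like this: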

df = pd.read_csv('2column.csv')

Then you can extract the indices of the rows that contain the '///' pattern:

l = list(df[df['Gene Symbol'].str.contains('///')].index)

Then you can create the new rows:

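for i in l:
    for sub in df['Gene Symbol'][i].split('///'):
        # One new row per gene symbol, stripped of the spaces around '///'.
        new_row = pd.DataFrame([[sub.strip(), df['Ratio(ifna vs. ctrl)'][i]]], columns=df.columns)
        # ignore_index keeps the labels unique, so the later drop removes only the old rows.
        df = pd.concat([df, new_row], ignore_index=True)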

Then drop the old rows:

df=df.drop(df.index[l])

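Then I use a small trick to get rid of the duplicates with the smaller values. First I sort the rows by 'Ratio(ifna vs. ctrl)' in descending order, then I keep only the first row for each gene: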

df = df.sort_values('Ratio(ifna vs. ctrl)', ascending=False).drop_duplicates('Gene Symbol', keep='first')
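The split, drop and de-duplicate steps above can also be sketched in a single pass. This is an alternative to the loop, not the method used in this answer; it assumes a pandas version that provides `DataFrame.explode` (0.25 or later) and builds the sample rows from the question in memory instead of reading the CSV.

```python
import pandas as pd

# Sample rows from the question, built in memory for illustration.
df = pd.DataFrame({
    'Gene Symbol': ['C16orf52 /// LOC102725138', 'AASDHPPT', 'AASDHPPT'],
    'Ratio(ifna vs. ctrl)': [1.00551, 0.860705, 0.983691],
})

# Split each symbol field on '///' into a list, then give every
# list element its own row (the ratio is repeated for each of them).
df['Gene Symbol'] = df['Gene Symbol'].str.split('///')
df = df.explode('Gene Symbol')
df['Gene Symbol'] = df['Gene Symbol'].str.strip()

# Keep the largest ratio per gene; groupby sorts by gene symbol by default.
result = df.groupby('Gene Symbol', as_index=False)['Ratio(ifna vs. ctrl)'].max()
print(result)
```

Taking the group maximum replaces the sort-then-drop trick, and the result already comes back ordered by gene symbol with a fresh index.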

If you want to keep the rows sorted by gene symbol and reset the index to a simpler one, just do:

df = df.sort_values('Gene Symbol').reset_index(drop=True)

If you want to export the modified data back to CSV, do:

df.to_csv('2column.csv')

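Edit: I edited my answer to fix a syntax error. I have tested this solution with your CSV and it works fine :)

This should work.

It uses Peter's dictionary suggestion.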

import csv

with open('2column.csv', 'r') as f:
    reader = csv.reader(f)
    original_file = list(reader)
    # get rid of the header
    original_file = original_file[1:]

# create an empty dictionary
genes_ratio = {}

# loop over every row in the original file
for row in original_file:
    gene_name = row[0]
    # convert to float so that the comparison below is numeric
    gene_ratio = float(row[1])
    # split on ///; a name without /// yields a single component
    for gene in gene_name.split('///'):
        # remove the spaces left around each component
        gene = gene.strip()
        # store the ratio if the gene is not in the dictionary yet,
        # or if the value in the dictionary is smaller
        if gene not in genes_ratio or genes_ratio[gene] < gene_ratio:
            genes_ratio[gene] = gene_ratio

# loop over the dictionary and print gene names and their ratio values
for key in genes_ratio:
    print(key, genes_ratio[key])
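One detail worth checking in any dictionary-based version: `csv.reader` returns every field as a string, and strings are compared character by character rather than by numeric value. The short sketch below uses made-up rows (`GENE1` and its ratios are not from the original data) to show how the two comparisons can disagree.

```python
# Hypothetical rows, as csv.reader would return them: all fields are strings.
rows = [['GENE1', '9.5'], ['GENE1', '10.2']]

# Comparing the raw strings: '9' sorts after '1', so '9.5' wins.
max_as_text = max(row[1] for row in rows)

# Converting first gives the numeric maximum.
max_as_number = max(float(row[1]) for row in rows)

print(max_as_text)    # 9.5
print(max_as_number)  # 10.2
```

When all ratios have the same number of digits before the decimal point the two orderings agree, which can hide the problem on a small sample.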


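Hi Manolis, maybe you should have a look at […]. I think ideally you would store the genes in a dict and, when assigning a value, ignore it if the key already exists and the new value is not greater than the current one.

Thank you for your help. However, I got the following error:

Traceback (most recent call last):
  File "gsea.py", line 10, in <module>
    new_lines[part] = cols[1]
IndexError: list index out of range

What do you suggest?

That is probably because you tested on a CSV file different from the one you shared. (Check that it uses the same delimiter: , )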
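To check the delimiter theory, one option is to list the lines that do not split into at least two columns, since those are the ones that make `cols[1]` fail. This is a rough diagnostic sketch; the helper name `find_bad_lines` and the sample lines are invented for illustration.

```python
def find_bad_lines(lines, delimiter=','):
    """Return (line_number, line) pairs that yield fewer than two columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        if len(line.split(delimiter)) < 2:
            bad.append((number, line))
    return bad


# Hypothetical input: one blank line and one tab-separated line.
sample = [
    'Gene Symbol,Ratio(ifna vs. ctrl)',
    'AASDHPPT,0.860705',
    '',
    'C16orf52 /// LOC102725138\t1.00551',
]

bad = find_bad_lines(sample)
for number, line in bad:
    print(number, repr(line))
```

On a real file the same function could be called with `open('2column.csv').read().splitlines()`; an empty result would suggest the delimiter is not the cause.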