Python: printing the unique elements of lines to a separate .txt file


I have a huge input file:

con1    P1  140 602
con1    P2  140 602
con2    P5  642 732
con3    P8  17  348
con3    P9  17  348

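I want to iterate within each con, remove the lines whose line[2] and line[3] values (the third and fourth columns) are duplicated, and print the result to a new .txt file, so that my output file looks like this (note: for each con, my second column may differ)

The script I tried (not sure how to finish it)

Update: an additional example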

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT24081    975 1655
con20   EMT19916    975 1652
con20   EMT23831    975 1655
con20   EMT19915    975 1652
con20   EMT09010    975 1649
con20   EMT29525    975 1655
con20   EMT19914    975 1652
con20   EMT19913    975 1652
con20   EMT23832    975 1652
con20   EMT09009    975 1637
con20   EMT16812    975 1649

Expected output

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT19916    975 1652
con20   EMT09010    975 1649
con20   EMT09009    975 1637

You can simply do it as follows:

my_list = list(set(open(file_name, 'r')))

Then write it to another file.
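The "write it to another file" step could be sketched as follows. This is only an illustration: the helper names `unique_lines` and `write_unique` and any file paths passed to them are assumptions, not part of the original answer. It swaps `set` for `dict.fromkeys` so that the first-seen order of the lines is kept. Like the `set` version, it drops only lines that are identical character for character.

```python
def unique_lines(lines):
    """Return lines with exact duplicates removed, keeping first-seen order."""
    # Dict keys are unique and (Python 3.7+) keep insertion order.
    return list(dict.fromkeys(lines))


def write_unique(in_path, out_path):
    """Copy in_path to out_path, skipping lines that were already written."""
    # in_path and out_path are hypothetical file names chosen by the caller.
    with open(in_path) as src, open(out_path, 'w') as dst:
        dst.writelines(unique_lines(src))
```

Two lines such as `con1 P1 140 602` and `con1 P2 140 602` differ as strings, so both survive this kind of whole-line deduplication.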

A simple example you can use here:

Output:

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT19916    975 1652
con20   EMT09010    975 1649
con20   EMT09009    975 1637

I'd say:

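# Read the file into a list of rows, each row a list of columns.
# Blank lines are skipped so that the column lookups below cannot fail.
with open('example.txt') as f:
  array = [line.split() for line in f if line.strip()]


def func(array, j):
  # For each row j-1, delete every later row that has the same
  # columns 0, 2 and 3. A loop is used instead of recursion so that a
  # huge file cannot exceed Python's recursion limit.
  while j < len(array):
    offset = []
    firstRow = array[j-1]
    for i in range(j, len(array)):
      if (firstRow[3] == array[i][3] and firstRow[2] == array[i][2]
          and firstRow[0] == array[i][0]):
        offset.append(i)

    for item in offset[::-1]:# Q. Why offset[::-1] and not offset?
      del array[item]

    j += 1

func(array, 1)

for e in array:
  print('%s\t\t%s\t\t%s\t%s' % (e[0], e[1], e[2], e[3]))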

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT19916    975 1652
con20   EMT09010    975 1649
con20   EMT09009    975 1637
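The question left in the code comment, why the indices are walked as `offset[::-1]`, can be illustrated with a small sketch. The helper name `delete_indices` and the sample list are assumptions made up for this illustration. The point is that deleting an element shifts every later element one position to the left, so the collected indices stay valid only if the highest one is deleted first.

```python
def delete_indices(rows, indices):
    """Delete the given positions from rows in place, highest index first."""
    for i in sorted(indices, reverse=True):
        del rows[i]
    return rows


rows = ['a', 'b', 'c', 'd', 'e']
delete_indices(rows, [1, 3])
print(rows)  # ['a', 'c', 'e']

# Deleting in ascending order goes wrong: after `del wrong[1]` the old
# index 3 ('d') has moved to index 2, so `del wrong[3]` removes 'e' instead.
wrong = ['a', 'b', 'c', 'd', 'e']
for i in [1, 3]:
    del wrong[i]
print(wrong)  # ['a', 'c', 'd']
```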

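@Downvoter, could you explain what is wrong with my answer, so that I can improve or delete it? Thanks.

I didn't downvote, but I think it's because this only removes lines that are completely identical. In his example, this would keep both the P1 and P2 lines.

@KirkStrauser, why would it keep both P1 lines? Aren't they the same string (even with the \n)? I don't understand.

Thanks for looking. Consider ['con1 P1 140 602', 'con1 P2 140 602']. According to the question these are duplicates (because "140 602" is the same in both), if I'm reading it right, but in your algorithm they are distinct.

No, I didn't; the strings are deliberately different, to illustrate it better.

Is (line[2], line[3]) unique per con, or globally? That is, can you have con1 P1 140 602 and con2 P2 140 602?

I need to have con1 P1 140 602 and con1 P2 140 602 in order to remove one, so basically line[2] and line[3] should belong to the same conX, while P can differ. In the example given above I can't remove anything, because the two lines have different cons.

@user3224522 What if you have a line con4 P3 140 602 at the end?

I don't remove it, because its con differs from con1. Basically columns 0, 2 and 3 should match in order to remove a line.

@user3224522 Then please post a better example, one for which my code lets duplicates through.

I first tried using groupby, but it did not remove all the duplicates. I have checked your script on my bigger file: same problem. Would you mind commenting on this line, please: columns = tuple(line.rsplit(None, 2)[-2:])

@user3224522 I've added some explanation.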
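The expression asked about in the comments can be tried on its own. The following is a small sketch; the wrapper name `last_two_columns` is an assumption introduced for the illustration. `str.rsplit(None, 2)` splits on runs of whitespace starting from the right and performs at most two splits, so the last two pieces are the third and fourth columns, whatever the first two columns contain.

```python
def last_two_columns(line):
    """Return the last two whitespace-separated fields of line as a tuple."""
    # rsplit(None, 2): split on whitespace from the right, at most twice.
    # The slice [-2:] keeps only the two rightmost pieces.
    return tuple(line.rsplit(None, 2)[-2:])


print(last_two_columns('con1    P1  140 602\n'))  # ('140', '602')

# Lines that differ only in the second column produce the same key,
# which is what lets a set of these tuples detect the repeats.
print(last_two_columns('con1    P1  140 602\n') ==
      last_two_columns('con1    P2  140 602\n'))  # True
```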

from itertools import groupby

with open('input.txt') as f1, open('f_out', 'w') as f2:
    # First, group the data by the first column. Note that groupby only
    # groups consecutive lines, so the input must already be ordered by
    # that column.
    for k, g in groupby(f1, key=lambda x: x.split()[0]):
        # While iterating over each group, keep only those lines whose
        # 3rd and 4th columns have not been seen before. A `set()` of
        # tuples records the column pairs seen so far; repeats are ignored.
        seen = set()
        for line in g:
            columns = tuple(line.rsplit(None, 2)[-2:])
            if columns not in seen:
                # This (3rd, 4th) column pair is new for the group, so
                # record it and also write the line to the file.
                seen.add(columns)
                f2.write(line.rstrip() + '\n')
                print(line.rstrip())

con20   EMT20540    951 1580
con20   EMT14935    975 1655
con20   EMT19916    975 1652
con20   EMT09010    975 1649
con20   EMT09009    975 1637
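`itertools.groupby` only groups consecutive lines, so if the same con appeared in several separate blocks of the file, each block would get its own `seen` set and repeats could slip through. Whether that explains the remaining duplicates reported in the comments is not confirmed by the thread. As a hedged alternative, the sketch below keeps one set for the whole file, keyed on columns 0, 2 and 3, which is the rule stated in the comments. The names `dedupe_rows` and `dedupe_file`, and the choice to skip lines with fewer than four columns, are assumptions.

```python
def dedupe_rows(lines):
    """Yield the lines whose (column 0, column 2, column 3) key is new."""
    seen = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # assumption: skip blank or malformed lines
        key = (fields[0], fields[2], fields[3])
        if key not in seen:
            seen.add(key)
            yield line


def dedupe_file(in_path, out_path):
    """Write the deduplicated lines of in_path to out_path."""
    # in_path and out_path are hypothetical file names chosen by the caller.
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in dedupe_rows(src):
            dst.write(line.rstrip() + '\n')
```

The file is read line by line and only one small tuple per kept line is stored, so memory use grows with the number of unique rows and not with the file size.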
