Python: how to build pairwise combinations from each row


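I have an .xls file with 4 rows but many columns. I saved it as a tab-delimited .txt file, as shown below. The first column is the important one; each string in it is separated by a comma (,). Sample data can be found here

I want to build pairwise combinations from the first column of each row. If a row yields more than one pair, the rest of that row should be repeated for every pair.

This is what I am looking for: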

A       B        A13    This is India
AFD    DNGS      3TR    This is how it is
AFD    SGDH      3TR    This is how it is
DNGS   SGDH      3TR    This is how it is
NHYG    QHD      TRD    Where to go
NHYG    lkd      TRD    Where to go
NHYG    uyete    TRD    Where to go
QHD     lkd      TRD    Where to go
QHD     uyete    TRD    Where to go
lkd     uyete    TRD    Where to go
AFD     TTT      YTR    What to do
Let's call my original data
data

I tried reading it line by line:

import itertools


lines = open("data.txt").readlines()
for line in lines:
    myrows = line.split(",") 
out_list = []
for i in range(1, len(myrows)+1):
    out_list.extend(itertools.combinations(lines, i))

I think your idea of using
itertools.combinations()
is right, but you only need to run it over the elements of the first column, not over whole lines.
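As a small illustration of that point, here is a sketch of what itertools.combinations yields for one first-column value. The variable names are made up for this example; the sample value is taken from the data in the question.

```python
import itertools

# One first-column value, including the quotes and padding from the file.
first_column = '"AFD,DNGS,SGDH   "'

# Strip the quotes, split on commas, and trim each element.
elements = [e.strip() for e in first_column.replace('"', '').split(',')]

# combinations(..., 2) yields every unordered pair exactly once,
# in the order the elements appear.
pairs = list(itertools.combinations(elements, 2))
for a, b in pairs:
    print(a, b)
```

A row with n elements produces n * (n - 1) / 2 pairs, so a row with a single element produces none.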

Here is my solution:

import io
import itertools

# Columns are written with explicit \t escapes so the separators survive copy and paste.
data = """"A,B     "\tA13\tThis is India
"AFD,DNGS,SGDH   "\t3TR\tThis is how it is
"NHYG,QHD,lkd,uyete"\tTRD\tWhere to go
"AFD,TTT"\tYTR\tWhat to do"""

for line in io.StringIO(data):
    e1, e2 = line.split('\t', 1)  # extract the first column (e1) and the rest of the line (e2)
    es = e1.replace('"', '').strip().split(',')  # remove the quotes and padding,
                                                 # then split into individual elements
    for i in itertools.combinations(es, 2):  # iterate over all combinations of 2 elements
        print('{}\t{}'.format('\t'.join(i), e2))  # e2 keeps its newline, hence the blank lines below
Result:

A   B   A13 This is India

AFD DNGS    3TR This is how it is

AFD SGDH    3TR This is how it is

DNGS    SGDH    3TR This is how it is

NHYG    QHD TRD Where to go

NHYG    lkd TRD Where to go

NHYG    uyete   TRD Where to go

QHD lkd TRD Where to go

QHD uyete   TRD Where to go

lkd uyete   TRD Where to go

AFD TTT YTR What to do
Edit

Here is the modified version. Note the use of
enumerate()
around
f.readlines()
, which returns the index of the current line.

import itertools

with open('data.txt') as f:
    header = f.readline()
    with open('result.txt', 'w') as w:
        w.write(header)
        for n, line in enumerate(f.readlines()):
            # drop the trailing newline so the output rows are not double-spaced
            elems = line.rstrip('\n').split('\t')
            e0 = elems[0].split(',')
            e0 = [e.replace('"', '').strip() for e in e0]
            for pairs in itertools.combinations(e0, 2):
                # n + 1 records which data row each pair came from
                w.write('{:d}\t{}\t{}\n'.format(n + 1, '\t'.join(pairs), '\t'.join(elems[1:])))
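If the file really is tab-delimited with a quoted first column, the standard csv module can take care of the quotes. The following is only a sketch under that assumption: pair_rows is a hypothetical helper name, the sample lines are taken from the question, and there is no header row here, unlike in the file-based version above.

```python
import csv
import io
import itertools


def pair_rows(lines):
    """Yield [row_number, id_a, id_b, rest...] for every pair in the first column."""
    reader = csv.reader(lines, delimiter='\t')  # the default quotechar '"' removes the quotes
    for n, row in enumerate(reader, start=1):
        if not row:
            continue  # csv.reader returns an empty list for a blank line
        ids = [part.strip() for part in row[0].split(',') if part.strip()]
        for a, b in itertools.combinations(ids, 2):
            yield [str(n), a, b] + row[1:]


# In-memory stand-in for the tab-delimited file (assumed layout).
sample = io.StringIO(
    '"AFD,DNGS,SGDH"\t3TR\tThis is how it is\n'
    '"AFD,TTT"\tYTR\tWhat to do\n'
)
rows = list(pair_rows(sample))
for row in rows:
    print('\t'.join(row))
```

Because csv.reader understands quoting, commas inside a quoted description column stay in one field instead of being split.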
With you_data.txt:

"Q92828, O60907, O75376"    15  NCOR complex        Human   MI:0004- affinity chromatography technologies | MI:0019- coimmunoprecipitation | MI:0069- mass spectrometry studies of complexes    12628926    "By using specific small interference RNAs (siRNAs), the authors demonstrate that HDAC3 is essential, whereas TBL1 and TBLR1 are functionally redundant but essential for repression by unliganded thyroid hormone receptor."
"O15143, O15144, O15145, P61158, P61160, P59998, O15511"    27  Arp2/3 protein complex      Human   MI:0027- cosedimentation | MI:0071- molecular sieving   9359840
"Q9UL46, Q06323"    30  PA28 complex     11S REG    Human   MI:0071- molecular sieving | MI:0226- ion exchange chromatography   9325261 "PA28 is a regulatory complex of the 20S proteasome. It acts as proteasome activator and stimulates cleavage after basic, acidic, and most hydrophobic residues in many peptides."
"P55036, P62333, O43242, P35998, P48556, P62191, Q13200, Q99460, O00232, P17980, P62195, O00487, P51665, Q15008, O75832, O00233"    32  PA700 complex    19S complex    Human   MI:0226- ion exchange chromatography | MI:0071- molecular sieving   9148964 "The proteasome is an essential component of the ATP-dependent proteolytic pathway in eukaryotic cells and is responsible for the degradation of most cellular proteins (for reviews see PMID:8811196 and PMID:10872471). It contains a barrel-shaped proteolytic core complex (the 20S proteasome), and is capped at one or both ends by regulatory complexes like the 19S complex (PMID:11812135), modulator (PMID:8621709), PA28 (PMID:9325261) and PA28gamma (PMID:9325261). Interferon-gamma (IFN-gamma) alters the peptide-degrading specificity of proteasomes and produces an immunoproteasome responsible for accelerated processing of nonself endogenous antigens by inducing the replacement of subunits Psmb5, Psmb6 and Psmb7 by Psmb8, Psmb9 and Psmb10, respectively."
Code:


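This reminds me of
flatmap

import itertools

def toList(s):
    return s.split(',')

def toString(l):
    return [','.join([l[i], l[j]]) for i in range(len(l)) for j in range(len(l)) if i < j]

@nik I have added it.

No, the first column you added is a number. It should be 1, 2 2 2, 3 3 3 and 4, indicating which row each combination came from.

Thanks, but it still doesn't work. I get this error; why are you using {:? File "test2.py", line 11

The code works fine for me; check whether your text file is the same as mine.

The idea is clear and you should know how to do it; just write your code.

@nik Now you can copy it, I have tested it.

I modified my answer based on your comment. Just open a file for writing and write to that file instead of stdout. See the documentation:

It looks like your data is not formatted the way you described. Are you sure there are no blank or truncated lines in the data file?

Your problem is that, unlike in the initial data.txt file, the columns are separated by repeated spaces rather than by tabs. I have modified the code, but it only works when 4 or more spaces separate the columns. If possible, you should really try to generate the initial data file with a unique separator, such as a tab, a semicolon, or something else. In addition, I have added the part that writes to the result file (
result.txt
). This is because I assumed your original data file was already tab-delimited. Again, if you can separate the initial data file with a unique delimiter, things become much simpler. Do you generate the data file yourself, and can you change the delimiter it uses?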
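The separator problem discussed in these comments can be sketched as follows. This is not the answerer's modified code, which is not shown here; it is a minimal illustration that assumes the first column is always wrapped in double quotes and that columns are separated either by tabs or by runs of 4 or more spaces, as the last comment describes. The names FIRST_FIELD, GAP and split_columns are hypothetical.

```python
import re

# Assumption: the first column is quoted, e.g. "AFD,DNGS,SGDH   "
FIRST_FIELD = re.compile(r'\s*"([^"]*)"\s*(.*)$')
# Assumption: a column gap is one or more tabs, or at least 4 spaces.
GAP = re.compile(r'\t+| {4,}')


def split_columns(line):
    """Return (ids, other_columns) for one line, or None if it has no quoted first column."""
    match = FIRST_FIELD.match(line.rstrip('\n'))
    if match is None:
        return None
    ids = [p.strip() for p in match.group(1).split(',') if p.strip()]
    rest = [c for c in GAP.split(match.group(2)) if c]
    return ids, rest


spaced = '"AFD,DNGS,SGDH   "    3TR    This is how it is\n'
tabbed = '"AFD,TTT"\tYTR\tWhat to do\n'
print(split_columns(spaced))
print(split_columns(tabbed))
```

Reading the quoted field first keeps the padding spaces inside the quotes from being mistaken for a column gap, and single spaces inside a description are left alone. A unique delimiter in the source file remains the more robust option.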
Code for you_data.txt, followed by its output:
import itertools

with open('you_data.txt') as f:
    index = 1
    for line in f:
        split_line = line.split('"')
        # split on every comma; a maxsplit of 2 would leave the remaining IDs fused into the third element
        key = split_line[1].strip().split(',')
        value = split_line[2].strip().replace('\t', ' ')

        for pair in itertools.combinations(key, 2):
            pair = [i.strip() for i in pair]
            print('{:<4}{:8}{:8}{:20}'.format(index, *pair, value))
        index += 1
Result (abbreviated):
1   Q92828  O60907  15 NCOR complex  Human MI:0004- affinity chromatography technologies | MI:0019- coimmunoprecipitation | MI:0069- mass spectrometry studies of complexes 12628926
1   Q92828  O75376  15 NCOR complex  Human MI:0004- affinity chromatography technologies | MI:0019- coimmunoprecipitation | MI:0069- mass spectrometry studies of complexes 12628926
1   O60907  O75376  15 NCOR complex  Human MI:0004- affinity chromatography technologies | MI:0019- coimmunoprecipitation | MI:0069- mass spectrometry studies of complexes 12628926
2   O15143  O15144  27 Arp2/3 protein complex  Human MI:0027- cosedimentation | MI:0071- molecular sieving 9359840
2   O15143  O15145  27 Arp2/3 protein complex  Human MI:0027- cosedimentation | MI:0071- molecular sieving 9359840
2   O15143  P61158  27 Arp2/3 protein complex  Human MI:0027- cosedimentation | MI:0071- molecular sieving 9359840
... (18 more pairs from row 2)
3   Q9UL46  Q06323  30 PA28 complex  11S REG Human MI:0071- molecular sieving | MI:0226- ion exchange chromatography 9325261
4   P55036  P62333  32 PA700 complex  19S complex Human MI:0226- ion exchange chromatography | MI:0071- molecular sieving 9148964
4   P55036  O43242  32 PA700 complex  19S complex Human MI:0226- ion exchange chromatography | MI:0071- molecular sieving 9148964
4   P55036  P35998  32 PA700 complex  19S complex Human MI:0226- ion exchange chromatography | MI:0071- molecular sieving 9148964
... (117 more pairs from row 4)