How do I calculate element frequencies for pairwise comparisons of a list in Python?


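I have stored my samples in the following list:

    sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']

To illustrate the problem, I refer to them below as "sets" (Set1 to Set5, in list order).

  • Eliminate every set in which all elements are identical to one another
  • Output:

     Set2 CGCG
     Set4 AT-T
     Set5 CATC
    
  • Perform pairwise comparisons between the sets (Set2 v Set4, Set2 v Set5, Set4 v Set5)

  • Each pairwise comparison may contain only two types of pair combinations; if it contains more, the comparison is eliminated. For example: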

    Set2    Set5
    C       C
    G       A
    C       T 
    G       C
    
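  • Here there are more than two types of pairs: (CC), (GA), (CT) and (GC). So this pairwise comparison is not allowed.

    Each comparison can have only 2 of the combinations (AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC), i.e. all possible ACGT combinations, where order matters.

    In the given example, more than 2 such combinations are found.

    Hence Set2 v Set5 and Set4 v Set5 cannot be considered. The only pair that remains is: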

    Output
    Set2 CGCG
    Set4 AT-T
    
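  • In this pairwise comparison, remove every position that contains "-" together with the corresponding element in the other set of the pair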

    Output    
    Set2 CGG
    Set4 ATT
    
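  • Calculate the frequency of each element in Set2 and in Set4. Calculate the frequency of occurrence of each pair type across the two sets (the CA and GT pairs)

  • Calculate float(a) = (frequency of the pair) - (frequency of its element in Set2) * (frequency of its element in Set4) (any one of the pairs is sufficient)

  • Note: if the pair is AAAC and CCCA, the frequency of C would be 1/4, i.e. it is the frequency of the base within one member of the pair

  • Calculate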

    float (b) = float(a)/ (freq of C in CGG) * (freq G in CGG) * (freq A in ATT) * (ATT==> freq of T in ATT)
    
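  • Repeat this for all pairwise comparisons

  • For example, given

    Set2 CGCG
    Set4 AT-T
    Set6 GCGC

    the comparisons are Set2 v Set4, Set2 v Set6, Set4 v Set6

My half-finished code so far. **I would like all suggested code to be in standard loop format rather than comprehensions.**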

    #Step 1
    for i in sample: 
        for j in range(i):
            if j = j+1    #This needs to be corrected to if all elements in i identical to each other i.e. if all "j's" are the same
                            del i 
        #insert line of code where sample1 = new sample with deletions as above
    
    #Step 2
        for i,i+1 in enumerate(sample):
        #Step 3
        for j in range(i):
            for k in range (i+1):
            #insert line of code to say only two types of pairs can be included, if yes continue else skip
                #Step 4
                if j = "-" or k = "-":
                    #Delete j/k and the corresponding element in the other pair
                    #Step 5
                    count_dict = {}
                        square_dict = {}
                    for base in list(i):
                        if base in count_dict:
                                count_dict[base] += 1
                        else:
                                count_dict[base] = 1
                        for allele in count_dict:
                        freq = (count_dict[allele] / len(i)) #frequencies of individual alleles
                        #Calculate frequency of pairs 
                    #Step 6
                    No code yet
    

I think this is what you want:

    from collections import Counter
    
    # Remove elements where all nucleobases are the same.
    for index in range(len(sample) - 1, -1, -1):
        if sample[index][:1] * len(sample[index]) == sample[index]:
            del sample[index]
    
    for indexA, setA in enumerate(sample):
        for indexB, setB in enumerate(sample):
            # Don't compare samples with themselves nor compare same pair twice.
            if indexA <= indexB:
                continue
    
            # Calculate number of unique pairs
            pair_count = Counter()
            for pair in zip(setA, setB):
                if '-' not in pair:
                    pair_count[pair] += 1
    
            # Only analyse pairs of sets with 2 unique pairs.
            if len(pair_count) != 2:
                continue
    
            # Count individual bases.
            base_counter = Counter()
            for pair, count in pair_count.items():
                base_counter[pair[0]] += count
                base_counter[pair[1]] += count
    
            # Get the length of one of each item in the pair.
            sequence_length = sum(pair_count.values())
    
            # Convert counts to frequencies.
            base_freq = {}
            for base, count in base_counter.items():
                base_freq[base] = count / float(sequence_length)
    
            # Examine a pair from the two unique pairs to calculate float_a.
            pair = list(pair_count)[0]
            float_a = (pair_count[pair] / float(sequence_length)) - base_freq[pair[0]] * base_freq[pair[1]]
    
            # Step 7!
            float_b = float_a / float(base_freq.get('A', 0) * base_freq.get('T', 0) * base_freq.get('C', 0) * base_freq.get('G', 0))
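
As a sanity check on the arithmetic, the question's Set2 v Set4 example can be worked through with exact fractions. This is only a sketch: the gap-stripped strings `CGG` and `ATT` come from the question, the variable names are made up for illustration, and the grouping of the denominator follows the answer's reading of the formula (float_a divided by the product of all four frequencies).

```python
from fractions import Fraction

# Gap-stripped sequences from the question (Set2 and Set4 after step 4).
set2 = "CGG"
set4 = "ATT"
length = len(set2)

# Frequency of the (C, A) pair: count the positions where it occurs.
ca_count = 0
for a, b in zip(set2, set4):
    if (a, b) == ("C", "A"):
        ca_count += 1
pair_freq = Fraction(ca_count, length)

# Per-sequence base frequencies.
freq_c = Fraction(set2.count("C"), length)  # 1/3
freq_g = Fraction(set2.count("G"), length)  # 2/3
freq_a = Fraction(set4.count("A"), length)  # 1/3
freq_t = Fraction(set4.count("T"), length)  # 2/3

float_a = pair_freq - freq_c * freq_a
float_b = float_a / (freq_c * freq_g * freq_a * freq_t)

print(float_a)  # 2/9
print(float_b)  # 9/2
```

Using the (G, T) pair instead gives 2/3 - (2/3)(2/3) = 2/9, the same value, which matches the question's remark that any one pair is sufficient.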
    
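I don't understand step 3. How do CGCG and AT-T produce these pairs?

Each comparison can have only 2 of the combinations (AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC), i.e. all possible ACGT combinations, where order matters. In the given example, more than 2 such combinations are found. Hence Set2 v Set5 and Set4 v Set5 cannot be considered.

Please explain with an example what "frequency of C" means in the step for the pair AAAC and CCCA. Is it 1/4 or 1/2? That is, is it the frequency of the base within one member of the pair or across both? And what should base1 be?

Also, how do Set2 and Set4 match when they have no letters in common?

I have corrected the "base1" statement. Set2 and Set4 are considered a match because they satisfy the criterion of having only 2 unique combinations, CA and GT, whereas Set2 v Set5 has (CC), (GA), (CT) and (GC), i.e. more than 2 unique pairs.

(1) In "# Remove elements where all nucleobases are the same", you assume that every sample has the same 4-base structure. It can have n bases. I think it would be better to use this:

    sample1 = []
    for i in sample:
        if len(set(i)) > 1:
            sample1.append(i)

(2) I am very sorry for the confusion, but regarding your earlier question: with AAAC and CCCA it should be 1/4, i.e. the frequency within one member of the pair. Also, I get this error when running float_b = float_a / (base_freq.get('A', 0) * base_freq.get('T', 0) * base_freq.get('C', 0) * base_freq.get('G', 0)):

    ZeroDivisionError: integer division or modulo by zero

Comment 1: I'm pretty sure I have that right. The test checks whether the whole string consists of its first character (if there is one; the code also works for zero-length strings). Comment #3: I used this as the initializer for sample:

    sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']

Perhaps you have something different in mind?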
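
The step 1 filter discussed in these comments can be written as a plain loop, which is the style the question asks for. The sketch below assumes `sample` is a list of strings as in the question; `sample1` is the name proposed in the comment. `len(set(item)) > 1` keeps an item only if it contains at least two distinct characters, so it does not depend on the sequence length.

```python
sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']

# Keep only the items that contain more than one distinct character.
sample1 = []
for item in sample:
    if len(set(item)) > 1:
        sample1.append(item)

print(sample1)  # ['CGCG', 'AT-T', 'CATC']

# The slice-based test from the answer agrees with the set-based test,
# including for strings longer than four characters and the empty string.
for item in ['AAAA', 'CGCG', 'GGGGGGG', 'GGGGGGA', '']:
    all_same = item[:1] * len(item) == item
    assert all_same == (len(set(item)) <= 1)
```

Both tests treat "-" as an ordinary character, so a sequence such as 'AA-A' would be kept by either one.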
The same approach, written more compactly with comprehensions:

    from collections import Counter
    
    BASES = 'ATCG'
    
    # Remove elements where all nucleobases are the same.
    sample = [item for item in sample if item[:1] * len(item) != item]
    
    for indexA, setA in enumerate(sample):
        for indexB, setB in enumerate(sample):
            # Don't compare samples with themselves nor compare same pair twice.
            if indexA <= indexB:
                continue
    
            # Calculate number of unique pairs
            relevant_pairs = [(elA, elB) for (elA, elB) in zip(setA, setB) if elA != '-' and elB != '-']
            pair_count = Counter(relevant_pairs)
    
            # Only analyse pairs of sets with 2 unique pairs.
            if len(pair_count) != 2:
                continue
    
            # Sequences as tuples with pairs involving '-' removed. New names are
            # used so that the loop variables setA and setB are not overwritten.
            seqA, seqB = zip(*relevant_pairs)

            # Length of each sequence after the gap positions are removed.
            seq_length = len(seqA)

            # Convert counts to frequencies.
            base_freq = {base : count / float(seq_length) for (base, count) in (Counter(seqA) + Counter(seqB)).items()}

            # Examine a pair from the two unique pairs to calculate float_a.
            pair = list(pair_count)[0]
            float_a = (pair_count[pair] / float(seq_length)) - base_freq[pair[0]] * base_freq[pair[1]]

            # Step 7!
            denominator = 1
            for base in BASES:
                denominator *= base_freq.get(base, 0)

            # A base that is absent from both sequences makes the denominator
            # zero; skipping such comparisons avoids the ZeroDivisionError.
            if denominator == 0:
                continue

            float_b = float_a / denominator
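
Since the question asks for loop-style code and later clarifies that each base frequency is taken within a single sequence (1/4 for C in AAAC), here is a loop-only sketch under that reading. It is an interpretation, not the answer's method: the names `count_items` and `analyse_pairs`, the layout of the result, and the choice to multiply the frequencies of only those bases that actually occur in each gap-stripped sequence (which keeps the denominator non-zero) are all assumptions.

```python
def count_items(items):
    """Count occurrences with a plain loop (no Counter, no comprehension)."""
    counts = {}
    for item in items:
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
    return counts


def analyse_pairs(sample):
    """Return {(index_a, index_b): (float_a, float_b)} for each valid comparison."""
    # Step 1: drop sequences made of a single repeated character.
    kept = []
    for index, seq in enumerate(sample):
        if len(set(seq)) > 1:
            kept.append((index, seq))

    results = {}
    # Step 2: compare each pair of remaining sequences exactly once.
    for pos_a in range(len(kept)):
        for pos_b in range(pos_a + 1, len(kept)):
            index_a, seq_a = kept[pos_a]
            index_b, seq_b = kept[pos_b]

            # Step 4: drop positions where either sequence has a gap.
            pairs = []
            for base_a, base_b in zip(seq_a, seq_b):
                if base_a != '-' and base_b != '-':
                    pairs.append((base_a, base_b))

            # Step 3: keep the comparison only if exactly two pair types occur.
            pair_count = count_items(pairs)
            if len(pair_count) != 2:
                continue

            length = float(len(pairs))
            stripped_a = []
            stripped_b = []
            for base_a, base_b in pairs:
                stripped_a.append(base_a)
                stripped_b.append(base_b)
            count_a = count_items(stripped_a)
            count_b = count_items(stripped_b)

            # Steps 5-6: pair frequency minus the product of the per-sequence
            # frequencies of its two bases.
            first, second = pairs[0]
            float_a = (pair_count[(first, second)] / length
                       - (count_a[first] / length) * (count_b[second] / length))

            # Step 7 (assumed reading): divide by the product of the
            # frequencies of every base present in each sequence.
            denominator = 1.0
            for count in count_a.values():
                denominator *= count / length
            for count in count_b.values():
                denominator *= count / length

            results[(index_a, index_b)] = (float_a, float_a / denominator)
    return results


sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']
for key, (float_a, float_b) in analyse_pairs(sample).items():
    print(key, float_a, float_b)
```

For the question's sample only the CGCG v AT-T comparison (list indices 1 and 3) survives, with float_a close to 2/9 and float_b close to 4.5. The keys are zero-based list indices rather than the question's Set numbers.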