How do I calculate element frequencies for pairwise comparisons of lists in Python?

python list for-loop dictionary


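I have my samples stored in the following list:

 sample = [AAAA,CGCG,TTTT,AT-T,CATC]

To illustrate the problem, I have denoted them as "sets" below.

Step 1: Eliminate every set in which all elements are identical to one another.

Output: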

 Set2 CGCG
 Set4 AT-T
 Set5 CATC
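A minimal sketch of this elimination step, written with plain loops rather than comprehensions. The helper name drop_uniform is hypothetical, and the sequences are assumed to be plain strings.

```python
def drop_uniform(sequences):
    """Return the sequences that contain at least two distinct characters."""
    kept = []
    for seq in sequences:
        # A set collapses repeated characters, so a uniform sequence has size 1.
        if len(set(seq)) > 1:
            kept.append(seq)
    return kept


sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']
filtered = drop_uniform(sample)
print(filtered)  # ['CGCG', 'AT-T', 'CATC']
```

Building a new list avoids deleting from a list while iterating over it, and the check works for sequences of any length.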

Step 2: Perform pairwise comparisons between the sets (Set2 v Set4, Set2 v Set5, Set4 v Set5).
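One way to enumerate each comparison exactly once is a nested index loop in which the inner index starts after the outer one. This is only a sketch; index_pairs is a hypothetical helper name.

```python
def index_pairs(n):
    """Return every (i, j) with i < j < n, so each pairing appears once."""
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            pairs.append((i, j))
    return pairs


labels = ['Set2', 'Set4', 'Set5']
for i, j in index_pairs(len(labels)):
    print(labels[i], 'v', labels[j])
```

For three sets this visits Set2 v Set4, Set2 v Set5 and Set4 v Set5, and never compares a set with itself.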

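Step 3: Each pairwise comparison may contain only two types of combination; if it does not, that pairwise comparison is eliminated. For example: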

Set2    Set5
C       C
G       A
C       T 
G       C

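Here there are more than two types of pair: (CC), (GA), (CT) and (GC). So this pairwise comparison cannot be used.

Each comparison can contain only 2 of the following combinations: AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC. That is, all possible combinations of A, C, G and T, where order matters.

In the given example, more than 2 such combinations are found.

Hence Set2 v Set5 and Set4 v Set5 cannot be considered. The only pair that remains is: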

Output
Set2 CGCG
Set4 AT-T
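The two-pair-type rule can be sketched as follows. This assumes, as the answer further down does, that positions containing '-' are ignored before the pair types are counted; pair_types is a hypothetical helper name.

```python
def pair_types(seq_a, seq_b):
    """Collect the distinct ordered pairs, skipping positions with a gap."""
    seen = set()
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a == '-' or base_b == '-':
            continue
        seen.add((base_a, base_b))
    return seen


print(len(pair_types('CGCG', 'AT-T')))  # 2
print(len(pair_types('CGCG', 'CATC')))  # 4
print(len(pair_types('AT-T', 'CATC')))  # 3
```

Only the first comparison has exactly two pair types, (C, A) and (G, T), so only Set2 v Set4 survives.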

Step 4: In this pairwise comparison, remove each element that is "-" together with the corresponding element in the other set.

Output    
Set2 CGG
Set4 ATT
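A possible loop-based sketch of the gap removal, assuming both sequences have the same length. The helper name strip_gaps is hypothetical.

```python
def strip_gaps(seq_a, seq_b):
    """Drop every position where either sequence has a '-'."""
    kept_a = []
    kept_b = []
    for base_a, base_b in zip(seq_a, seq_b):
        if base_a != '-' and base_b != '-':
            kept_a.append(base_a)
            kept_b.append(base_b)
    return ''.join(kept_a), ''.join(kept_b)


print(strip_gaps('CGCG', 'AT-T'))  # ('CGG', 'ATT')
```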

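Step 5: Calculate the frequency of the elements in Set2 and in Set4, and the frequency of occurrence of each pair type across the sets (the CA and GT pairs).

Step 6: Calculate float(a) = (frequency of the pair) - (frequency of its first element in Set2) * (frequency of its second element in Set4). Any one of the pairs is sufficient.

Note: if the pair is AAAC and CCCA, the frequency of C would be 1/4, i.e. it is the frequency of the base within one member of the pair.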
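To make the arithmetic concrete, here is a small sketch that computes float(a) for the gap-free sequences CGG and ATT. It assumes each base frequency is taken within its own sequence, as the note about AAAC and CCCA states. The function names frequency and float_a are illustrative only.

```python
def frequency(seq, base):
    """Fraction of positions in seq that hold the given base."""
    count = 0
    for item in seq:
        if item == base:
            count += 1
    return count / float(len(seq))


def float_a(seq_a, seq_b, pair):
    """Pair frequency minus the product of the two single-base frequencies."""
    pair_hits = 0
    for base_a, base_b in zip(seq_a, seq_b):
        if (base_a, base_b) == pair:
            pair_hits += 1
    pair_freq = pair_hits / float(len(seq_a))
    return pair_freq - frequency(seq_a, pair[0]) * frequency(seq_b, pair[1])


print(frequency('AAAC', 'C'))             # 0.25
print(float_a('CGG', 'ATT', ('C', 'A')))  # 1/3 - (1/3)(1/3)
print(float_a('CGG', 'ATT', ('G', 'T')))  # 2/3 - (2/3)(2/3)
```

Both pairs give a value of about 2/9, which matches the remark that any one pair is sufficient.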

Step 7: Calculate

float(b) = float(a) / ((freq of C in CGG) * (freq of G in CGG) * (freq of A in ATT) * (freq of T in ATT))
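A sketch of this last calculation, under the assumption that float(a) is divided by the product of all four frequencies (the formula as written leaves the precedence open). Each frequency is taken within its own sequence, and the names frequency_product and float_b are hypothetical.

```python
def frequency_product(seq):
    """Multiply the within-sequence frequencies of each distinct base in seq."""
    counts = {}
    for base in seq:
        if base in counts:
            counts[base] += 1
        else:
            counts[base] = 1
    product = 1.0
    for base in counts:
        product *= counts[base] / float(len(seq))
    return product


def float_b(value_a, seq_a, seq_b):
    """Divide float(a) by the product of the base frequencies of both sequences."""
    denominator = frequency_product(seq_a) * frequency_product(seq_b)
    return value_a / denominator


print(float_b(2.0 / 9.0, 'CGG', 'ATT'))
```

For CGG and ATT the denominator is (1/3)(2/3)(1/3)(2/3) = 4/81, so the result is about 4.5. Because only bases that actually occur are multiplied in, the denominator cannot be zero.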

Repeat this for all pairwise comparisons.

For example, with

 Set2 CGCG
 Set4 AT-T
 Set6 GCGC

that is Set2 v Set4, Set2 v Set6, Set4 v Set6.
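Putting the steps together for several sets could look like the sketch below. It assumes uniform sets have already been removed, that gap positions are dropped before counting, and that base frequencies are taken per sequence. The function name analyse and the returned dictionary layout are illustrative choices, not part of the question.

```python
def analyse(sequences):
    """Return {(i, j): (float_a, float_b)} for every qualifying pair of sequences."""
    results = {}
    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            # Keep only the positions where neither sequence has a gap.
            pairs = []
            for base_a, base_b in zip(sequences[i], sequences[j]):
                if base_a != '-' and base_b != '-':
                    pairs.append((base_a, base_b))
            # Count each ordered pair type; exactly two types are required.
            pair_count = {}
            for pair in pairs:
                pair_count[pair] = pair_count.get(pair, 0) + 1
            if len(pair_count) != 2:
                continue
            length = float(len(pairs))
            # Count the bases separately for each sequence.
            count_a = {}
            count_b = {}
            for base_a, base_b in pairs:
                count_a[base_a] = count_a.get(base_a, 0) + 1
                count_b[base_b] = count_b.get(base_b, 0) + 1
            # float_a from one pair type (the first one encountered).
            pair = pairs[0]
            value_a = (pair_count[pair] / length
                       - (count_a[pair[0]] / length) * (count_b[pair[1]] / length))
            # float_b divides by the product of all per-sequence base frequencies.
            denominator = 1.0
            for base in count_a:
                denominator *= count_a[base] / length
            for base in count_b:
                denominator *= count_b[base] / length
            results[(i, j)] = (value_a, value_a / denominator)
    return results


sample = ['CGCG', 'AT-T', 'GCGC']  # Set2, Set4, Set6
results = analyse(sample)
for key in sorted(results):
    print(key, results[key])
```

With these three sets every comparison has exactly two pair types, so all three are reported.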

My half-finished code so far: **I would like all suggested code to be standard code in loop format, not comprehensions**

#Step 1
for i in sample: 
    for j in range(i):
        if j = j+1    #This needs to be corrected to if all elements in i identical to each other i.e. if all "j's" are the same
                        del i 
    #insert line of code where sample1 = new sample with deletions as above

#Step 2
    for i,i+1 in enumerate(sample):
    #Step 3
    for j in range(i):
        for k in range (i+1):
        #insert line of code to say only two types of pairs can be included, if yes continue else skip
            #Step 4
            if j = "-" or k = "-":
                #Delete j/k and the corresponding element in the other pair
                #Step 5
                count_dict = {}
                    square_dict = {}
                for base in list(i):
                    if base in count_dict:
                            count_dict[base] += 1
                    else:
                            count_dict[base] = 1
                    for allele in count_dict:
                    freq = (count_dict[allele] / len(i)) #frequencies of individual alleles
                    #Calculate frequency of pairs 
                #Step 6
                No code yet

I think this is what you want:

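from collections import Counter

# Initialiser used while testing, taken from the question.
sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']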

# Remove elements where all nucleobases are the same.
for index in range(len(sample) - 1, -1, -1):
    if sample[index][:1] * len(sample[index]) == sample[index]:
        del sample[index]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue

        # Calculate number of unique pairs
        pair_count = Counter()
        for pair in zip(setA, setB):
            if '-' not in pair:
                pair_count[pair] += 1

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

        # Count individual bases.
        base_counter = Counter()
        for pair, count in pair_count.items():
            base_counter[pair[0]] += count
            base_counter[pair[1]] += count

        # Length of each sequence after gap positions have been skipped.
        sequence_length = sum(pair_count.values())

        # Convert counts to frequencies.
        base_freq = {}
        for base, count in base_counter.items():
            base_freq[base] = count / float(sequence_length)

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(sequence_length)) - base_freq[pair[0]] * base_freq[pair[1]]

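        # Final step (float_b): divide by the product of the frequencies of the
        # bases that actually occur. Multiplying in a 0 for an absent base
        # would raise ZeroDivisionError.
        denominator = 1.0
        for freq in base_freq.values():
            denominator *= freq
        float_b = float_a / denominator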

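Comments:

I don't understand step 3. How do CGCG and AT-T produce those pairs?

Each comparison can have only two types of combination (AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC), that is, basically all possible combinations of A, C, G and T where order matters. In the given example, more than 2 such combinations are found. Hence Set2 v Set5 and Set4 v Set5 cannot be considered.

Could you explain what "frequency of C" means for the pair AAAC and CCCA in that step? Is it 1/4 or 1/2? That is, is it the frequency of the base within one member of the pair or across both? What is base1 supposed to be? Also, how do Set2 and Set4 match when they have no matching letters?

I have corrected the "base1" statement. Set2 and Set4 are considered a match because they satisfy the criterion of having only 2 unique combinations, CA and GT, whereas Set2 v Set5 has (CC), (GA), (CT) and (GC), which is more than 2 unique pairs.

(1) In "# Remove elements where all nucleobases are the same", you assume the samples all have the same structure of 4 bases. They can have n bases. I think it would be better to add this: sample1 = [], then for i in sample: if len(set(i)) > 1: sample1.append(i). (2) I am very sorry for the confusion, but regarding your earlier question, with AAAC and CCCA it should be 1/4, i.e. the frequency within one member of the pair. Also, I get this error when running float_b = float_a / float(base_freq.get('A', 0) * base_freq.get('T', 0) * base_freq.get('C', 0) * base_freq.get('G', 0)): ZeroDivisionError: integer division or modulo by zero

Comment #1: I am fairly sure mine is right. It checks whether the whole string consists of its first character (if there is one; the code also works for zero-length strings). Comment #3: I used this as the initialiser for sample: sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']. Perhaps you have something different in mind?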
# Compact variant of the answer above, using comprehensions.
from collections import Counter

BASES = 'ATCG'

# Remove elements where all nucleobases are the same.
sample = [item for item in sample if item[:1] * len(item) != item]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue

        # Calculate number of unique pairs
        relevant_pairs = [(elA, elB) for (elA, elB) in zip(setA, setB) if elA != '-' and elB != '-']
        pair_count = Counter(relevant_pairs)

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

        # Both sequences as tuples with pairs involving '-' removed.
        # New names are used because rebinding setA here would corrupt the
        # remaining iterations of the inner loop.
        trimmedA, trimmedB = zip(*relevant_pairs)

        # Length of each trimmed sequence.
        seq_length = len(trimmedA)

        # Convert counts to frequencies.
        base_freq = {base : count / float(seq_length) for (base, count) in (Counter(trimmedA) + Counter(trimmedB)).items()}

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(seq_length)) - base_freq[pair[0]] * base_freq[pair[1]]

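        # Final step (float_b): multiply the frequencies of the bases that are
        # present. An absent base would contribute 0 and cause ZeroDivisionError.
        denominator = 1
        for base in BASES:
            if base in base_freq:
                denominator *= base_freq[base]

        float_b = float_a / denominator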