How do I calculate element frequencies for pairwise comparisons of list items in Python?
Tags: python, list, for-loop, dictionary, frequency
I have samples stored in the following list:
sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']
To illustrate the problem, I refer to them as "sets" below.
Step 1: Eliminate every set whose elements are all identical to one another.
Output:
Set2 CGCG
Set4 AT-T
Set5 CATC
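Step 1 can be sketched with a set-based test: a string whose characters are all the same contains exactly one distinct character. This is only a minimal sketch; the name `non_uniform` is illustrative and does not come from the question.

```python
sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']

# Keep only the sequences that contain more than one distinct character.
non_uniform = [seq for seq in sample if len(set(seq)) > 1]

print(non_uniform)
```

Because the test looks at distinct characters rather than at a fixed width, it works for sequences of any length.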
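Step 2: Perform pairwise comparisons between the sets (Set2 v Set4, Set2 v Set5, Set4 v Set5).
Step 3: Each pairwise comparison may contain only two types of combinations; if it contains more, eliminate that comparison. For example:
Set2 Set5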
C C
G A
C T
G C
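Each comparison may contain only two distinct combinations out of AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC, i.e. all possible combinations of A, C, G and T where order matters. In the example above, more than two such combinations are found (CC, GA, CT, GC). Hence Set2 v Set5 and Set4 v Set5 cannot be considered, and only one pair remains: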
Output
Set2 CGCG
Set4 AT-T
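Steps 2 and 3 could be sketched as follows. The sketch assumes that positions containing "-" are ignored when counting combinations and that "only two types" means exactly two, which is what the Set2 v Set4 result implies. The helper name `unique_pairs` is hypothetical.

```python
from itertools import combinations

remaining = ['CGCG', 'AT-T', 'CATC']  # Set2, Set4, Set5 after step 1


def unique_pairs(seq_a, seq_b):
    """Return the set of ordered, position-wise combinations, skipping gaps."""
    return {(a, b) for a, b in zip(seq_a, seq_b) if a != '-' and b != '-'}


# Keep only the comparisons that show exactly two distinct combinations.
kept = [(a, b) for a, b in combinations(remaining, 2)
        if len(unique_pairs(a, b)) == 2]

print(kept)
```

`itertools.combinations` yields each unordered pair once, so no comparison is repeated and no set is compared with itself.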
Step 4: Delete any "-" together with the corresponding element in the other set.
Output
Set2 CGG
Set4 ATT
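The gap removal shown above can be sketched by zipping the two sequences and dropping every position where either one has a "-". The variable names are illustrative, and the sketch assumes at least one position survives the filter.

```python
set2, set4 = 'CGCG', 'AT-T'

# Keep only the positions where neither sequence has a gap.
kept_positions = [(a, b) for a, b in zip(set2, set4) if '-' not in (a, b)]

# Unzip the surviving positions back into two sequences.
trimmed2, trimmed4 = (''.join(chars) for chars in zip(*kept_positions))

print(trimmed2, trimmed4)
```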
float(b) = float(a) / ((freq of C in CGG) * (freq of G in CGG) * (freq of A in ATT) * (freq of T in ATT))
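As a worked check of the denominator, the four frequencies can be computed exactly with `fractions.Fraction`. This assumes the formula divides float(a) by the product of all four frequencies, and that each frequency is taken within its own sequence; `base_frequencies` is a hypothetical helper name.

```python
from collections import Counter
from fractions import Fraction


def base_frequencies(seq):
    """Map each base in seq to its exact frequency within that sequence."""
    counts = Counter(seq)
    return {base: Fraction(n, len(seq)) for base, n in counts.items()}


freq2 = base_frequencies('CGG')  # C: 1/3, G: 2/3
freq4 = base_frequencies('ATT')  # A: 1/3, T: 2/3

denominator = freq2['C'] * freq2['G'] * freq4['A'] * freq4['T']
print(denominator)  # 4/81
```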
#Step 1
for i in sample:
    for j in range(i):
        if j = j+1 #This needs to be corrected to if all elements in i identical to each other i.e. if all "j's" are the same
            del i
#insert line of code where sample1 = new sample with deletions as above

#Step 2
for i,i+1 in enumerate(sample):

    #Step 3
    for j in range(i):
        for k in range (i+1):
            #insert line of code to say only two types of pairs can be included, if yes continue else skip

            #Step 4
            if j = "-" or k = "-":
                #Delete j/k and the corresponding element in the other pair

    #Step 5
    count_dict = {}
    square_dict = {}
    for base in list(i):
        if base in count_dict:
            count_dict[base] += 1
        else:
            count_dict[base] = 1
    for allele in count_dict:
        freq = (count_dict[allele] / len(i)) #frequencies of individual alleles
    #Calculate frequency of pairs

    #Step 6
    No code yet
I think this is what you want:
from collections import Counter

# Remove elements where all nucleobases are the same.
for index in range(len(sample) - 1, -1, -1):
    if sample[index][:1] * len(sample[index]) == sample[index]:
        del sample[index]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue

        # Calculate number of unique pairs
        pair_count = Counter()
        for pair in zip(setA, setB):
            if '-' not in pair:
                pair_count[pair] += 1

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

        # Count individual bases.
        base_counter = Counter()
        for pair, count in pair_count.items():
            base_counter[pair[0]] += count
            base_counter[pair[1]] += count

        # Get the length of one of each item in the pair.
        sequence_length = sum(pair_count.values())

        # Convert counts to frequencies.
        base_freq = {}
        for base, count in base_counter.items():
            base_freq[base] = count / float(sequence_length)

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(sequence_length)) - base_freq[pair[0]] * base_freq[pair[1]]

        # Step 7!
        float_b = float_a / float(base_freq.get('A', 0) * base_freq.get('T', 0) * base_freq.get('C', 0) * base_freq.get('G', 0))
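To check this logic against the question's example, the same steps can be wrapped in a function that returns its results. This is a sketch for Python 3; the name `pair_stats`, the returned dictionary, and the choice to report `None` when the denominator is zero are assumptions made for illustration, not part of the answer.

```python
from collections import Counter


def pair_stats(sample):
    """Return {(earlier, later): (float_a, float_b)} for qualifying pairs."""
    # Drop sequences made of a single repeated character.
    seqs = [s for s in sample if s[:1] * len(s) != s]
    results = {}
    for index_a, set_a in enumerate(seqs):
        for set_b in seqs[:index_a]:  # each unordered pair exactly once
            # Count position-wise combinations, skipping gap positions.
            pair_count = Counter(p for p in zip(set_a, set_b) if '-' not in p)
            if len(pair_count) != 2:
                continue
            length = sum(pair_count.values())
            # Count bases over both sequences, as the answer does.
            base_counter = Counter()
            for (x, y), count in pair_count.items():
                base_counter[x] += count
                base_counter[y] += count
            base_freq = {b: c / length for b, c in base_counter.items()}
            pair = next(iter(pair_count))
            float_a = pair_count[pair] / length - base_freq[pair[0]] * base_freq[pair[1]]
            denominator = 1.0
            for base in 'ATCG':
                denominator *= base_freq.get(base, 0)
            # Assumption: report None instead of raising when a base is absent.
            float_b = float_a / denominator if denominator else None
            results[(set_b, set_a)] = (float_a, float_b)
    return results


print(pair_stats(['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']))
```

For the question's sample the only qualifying pair is CGCG with AT-T, giving float_a of about 0.222 and float_b of about 4.5.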
Comments:

I don't understand step 3. How do CGCG and AT-T produce these pairs?

Each comparison can only have two combinations (AA, GG, CC, TT, AT, TA, AC, CA, AG, GA, GC, CG, GT, TG, CT, TC), i.e. basically all possible ACGT combinations where order matters. In the given example, more than 2 such combinations are found. Hence Set2 v Set5 and Set4 v Set5 cannot be considered.

Could you give an example of what "frequency of C" means in that step for the pair AAAC and CCCA? Is it 1/4 or 1/2? That is, is it the base frequency over a single member of the pair or over both members? What is base1 supposed to be? Also, how do Set2 and Set4 match when they have no letters in common?

I've corrected the "base1" statement. Set2 and Set4 are considered a match because they satisfy the criterion of having only 2 unique combinations, CA and GT, whereas Set2 v Set5 has (CC), (GA), (CT) and (GC) (more than 2 unique pairs).

(1) In "# Remove elements where all nucleobases are the same", you assume every sample has the same structure of only 4 bases. It can have n bases. I think it would be better to add this: sample1 = [] for i in sample: if len(set(i)) > 1: sample1.append(i). (2) I'm very sorry for the confusion, but regarding your earlier question, for AAAC and CCCA it should be 1/4, i.e. the frequency over one member of the pair. (3) Also, I get this error when running float_b = float_a / (base_freq.get('A', 0) * base_freq.get('T', 0) * base_freq.get('C', 0) * base_freq.get('G', 0)): ZeroDivisionError: integer division or modulo by zero

Note #1: I'm pretty sure I've got that right. It checks whether the whole string consists of its first character (if there is one; the code also works for zero-length strings). Note #3: I used this as the initializer for sample: sample = ['AAAA', 'CGCG', 'TTTT', 'AT-T', 'CATC']. Maybe you have something different?
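The two tests discussed in the comments can be compared directly. The sketch below, with hypothetical function names, suggests they agree for sequences of any length, including the empty string.

```python
def all_same_by_repeat(seq):
    # True when seq is its first character repeated; also True for ''.
    return seq[:1] * len(seq) == seq


def all_same_by_set(seq):
    # True when seq has at most one distinct character; also True for ''.
    return len(set(seq)) <= 1


examples = ['', 'A', 'AAAA', 'TTTTTTTT', 'CGCG', 'AT-T', 'AAAAAC']
for seq in examples:
    assert all_same_by_repeat(seq) == all_same_by_set(seq)

print([all_same_by_set(seq) for seq in examples])
```

Neither test depends on the sequences having exactly 4 bases.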
Set2 CGCG
Set4 AT-T
Set6 GCGC
from collections import Counter

BASES = 'ATCG'

# Remove elements where all nucleobases are the same.
sample = [item for item in sample if item[:1] * len(item) != item]

for indexA, setA in enumerate(sample):
    for indexB, setB in enumerate(sample):
        # Don't compare samples with themselves nor compare same pair twice.
        if indexA <= indexB:
            continue

        # Calculate number of unique pairs
        relevant_pairs = [(elA, elB) for (elA, elB) in zip(setA, setB) if elA != '-' and elB != '-']
        pair_count = Counter(relevant_pairs)

        # Only analyse pairs of sets with 2 unique pairs.
        if len(pair_count) != 2:
            continue

        # Both sets as tuples with pairs involving '-' removed.
        # New names are used so that setA is not overwritten for later inner iterations.
        trimmedA, trimmedB = zip(*relevant_pairs)

        # Length of each set once the '-' positions are removed.
        seq_length = len(trimmedA)

        # Convert counts to frequencies.
        base_freq = {base : count / float(seq_length) for (base, count) in (Counter(trimmedA) + Counter(trimmedB)).items()}

        # Examine a pair from the two unique pairs to calculate float_a.
        pair = list(pair_count)[0]
        float_a = (pair_count[pair] / float(seq_length)) - base_freq[pair[0]] * base_freq[pair[1]]

        # Step 7!
        denominator = 1
        for base in BASES:
            denominator *= base_freq.get(base, 0)
        float_b = float_a / denominator
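The ZeroDivisionError mentioned in the comments arises whenever one of the four bases is absent from both sequences, because `base_freq.get(base, 0)` then contributes a zero factor. One possible reading of the question's formula, and it is only an assumption, is to multiply the frequency of each base within its own sequence, using only the bases that actually occur. The function name `denominator_per_sequence` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def denominator_per_sequence(seq_a, seq_b):
    """Multiply the within-sequence frequency of every base that occurs.

    Assumes both sequences are non-empty and already have gap positions removed.
    """
    product = Fraction(1)
    for seq in (seq_a, seq_b):
        for count in Counter(seq).values():
            product *= Fraction(count, len(seq))
    return product


print(denominator_per_sequence('CGG', 'ATT'))    # 4/81
print(denominator_per_sequence('CGCG', 'GCGC'))  # 1/16
```

Because every factor comes from a base that is present at least once, the product can never be zero. Whether this matches the intended statistic is for the asker to confirm.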