Python: how do I record the ID and count (number) of pattern occurrences in each sequence?

python python-3.x

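I'm very new to Python, so please bear with me. I have been asked to write a program that looks for occurrences of a pattern string (which the user must type in at the keyboard) in each sequence and, where the pattern occurs, records the ID of the sequence and the number of occurrences (the count).

The data looks like this: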

id  sequence
1   MVLSEGEWAAVLHVWAKVEADVAAGHGQDILIRLFKS
2   MNIFEMLRIAAGLRLKIYKDTEAAGYYTIGIGHLLTKSPSL
3   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSH
4   MNIFEMLRAAEGAALRLKIYKAADTEGYYTIGIGHLLTKS
5   MVLSAAEGEWQLVLHVWAKVEADVAGHGQDILIRLFK

id  count
1   2
2   2
4   3
5   1

Here the ID is a number and the sequence is the string next to each number. The file is a matrix (100437 x 2).

This is the code I have so far:

import re

def proteins_pattern_count(pattern):
    with open("proteins.csv", 'r') as proteins:
        proteins = proteins.read()
        items = re.findall(pattern, proteins)
    return len(items)

# Reading the pattern to look for and forcing the input to change the pattern to capital letters.
pattern = input("Please type in the pattern you would like to look for: ").upper()

count = proteins_pattern_count(pattern)

print('The pattern {} appears {} times within the proteins file'.format(pattern, count))

The output I get:

Please type in the pattern you would like to look for: AA
The pattern AA appears 173372 times within the proteins file

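But what I really want is this: if the pattern I am looking for is "AA", for example, I only want to see a table containing the IDs of the sequences that actually contain the pattern, together with the number of times it occurs in each sequence (the count), like this: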

id  sequence
1   MVLSEGEWAAVLHVWAKVEADVAAGHGQDILIRLFKS
2   MNIFEMLRIAAGLRLKIYKDTEAAGYYTIGIGHLLTKSPSL
3   MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSH
4   MNIFEMLRAAEGAALRLKIYKAADTEGYYTIGIGHLLTKS
5   MVLSAAEGEWQLVLHVWAKVEADVAGHGQDILIRLFK

id  count
1   2
2   2
4   3
5   1

I think this should be easy to do, but I'm completely new to Python.

Thanks for your support.

Something like this should work:

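res = {}
with open("proteins.csv") as proteins:
    for line in proteins:
        # Take the whole id field rather than line[0], which breaks for ids >= 10.
        # Assumes comma-separated "id,sequence" rows.
        seq_id, _, sequence = line.strip().partition(",")
        if pattern in sequence:
            res[seq_id] = sequence.count(pattern)

print(res)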

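First, I strongly recommend using a different variable name for the object returned by .readlines(), because you are currently overwriting the file pointer. It isn't strictly necessary for this code, but it is generally good practice.

Second, to make things simpler, you may want to use csv.reader to split your csv in a friendly, simple way.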
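As a rough illustration of what csv.reader gives you, here is a minimal sketch. It assumes a comma-separated file whose first row is an "id,sequence" header; the SAMPLE text and the read_rows helper are made up for this example and stand in for proteins.csv.

```python
import csv
import io

# Hypothetical sample standing in for proteins.csv; assumes comma-separated
# rows and an "id,sequence" header on the first line.
SAMPLE = """id,sequence
1,MVLSEGEWAAVLHVWAKVEADVAAGHGQDILIRLFKS
3,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSH
"""

def read_rows(fp):
    """Return (id, sequence) pairs from an open CSV file, skipping the header."""
    reader = csv.reader(fp, delimiter=",")
    next(reader, None)  # discard the header row; does nothing on an empty file
    return [(row[0], row[1]) for row in reader if row]

rows = read_rows(io.StringIO(SAMPLE))
print(rows)
```

Each row comes back as a list of strings, so the id is still text at this point; convert it with int() if you need to sort or compare numerically.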

Here are three snippets you can use:

import re, csv

def proteins_slow1(pattern):
    with open("proteins.csv", 'r') as fp:
        proteins = csv.reader(fp, delimiter=',')
        ids, counts = [], []
        for i,seq in proteins:
            count = len(re.findall(pattern, seq))
            if count != 0:
                ids.append(i)
                counts.append(count)
    return ids, counts

def proteins_slow2(pattern):
    with open("proteins.csv", 'r') as fp:
        proteins = csv.reader(fp, delimiter=',')
        struct = {}
        for i,seq in proteins:
            count = len(re.findall(pattern, seq))
            if count != 0:
                struct[i] = count
    return struct

def proteins_fast(pattern):
    with open("proteins.csv", 'r') as fp:
        proteins = csv.reader(fp, delimiter=',')
        struct = {}
        for i,seq in proteins:
            count = seq.count(pattern)
            if count != 0:
                struct[i] = count
    return struct

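proteins_slow1 builds two lists and appends to them whenever the count is nonzero. At the end, the function returns a tuple containing the ids list and the counts list.

This is just as fast as proteins_slow2, which builds a dictionary and adds each new entry as a key-value pair (the id as the key, the count as the value).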
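To get from the dictionary to the id/count table asked for in the question, something along these lines might do. The format_counts helper is a name invented for this sketch, and the example dictionary simply reuses the counts from the question's expected output for "AA".

```python
def format_counts(struct):
    """Format an {id: count} mapping as a two-column table with a header row."""
    lines = ["id\tcount"]
    for seq_id, count in struct.items():
        lines.append(f"{seq_id}\t{count}")
    return "\n".join(lines)

# Counts for the pattern "AA" on the five sample sequences from the question.
example = {"1": 2, "2": 2, "4": 3, "5": 1}
print(format_counts(example))
```

Dictionaries keep insertion order in Python 3.7 and later, so the rows come out in the order the file was read.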

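The fastest way is to not use re at all and to call the .count() method on the sequence string instead. This cuts the running time by roughly 30-40% (which becomes significant if you repeatedly go through 100,000+ rows).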
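One thing to keep in mind when swapping re.findall for .count(): the two only agree when the pattern is a plain literal. The sketch below compares them on sequence 4 from the question; the variable names are chosen for this example only.

```python
import re

seq = "MNIFEMLRAAEGAALRLKIYKAADTEGYYTIGIGHLLTKS"  # sequence 4 from the question

# For a plain literal pattern, both approaches give the same count.
literal_str = seq.count("AA")
literal_re = len(re.findall("AA", seq))

# Regex metacharacters change the meaning: "A." is an "A" followed by any
# character for re, but a literal "A." for str.count.
dot_str = seq.count("A.")
dot_re = len(re.findall("A.", seq))

# re.escape makes re treat user input literally, like str.count does.
dot_re_escaped = len(re.findall(re.escape("A."), seq))

# Both approaches count non-overlapping matches only.
nonoverlap_str = "AAA".count("AA")
nonoverlap_re = len(re.findall("AA", "AAA"))

# A zero-width lookahead counts overlapping occurrences, if those are wanted.
overlap_re = len(re.findall("(?=AA)", "AAA"))

print(literal_str, literal_re, dot_str, dot_re, dot_re_escaped)
print(nonoverlap_str, nonoverlap_re, overlap_re)
```

Since the pattern comes straight from keyboard input, .count() is also the safer choice if users are expected to type literal strings rather than regular expressions.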

(The timing tests were run on a csv file generated by the following function:)

Enjoy!

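After opening the file, you should use proteins.readlines() and apply the findall method to each element of the resulting list. You can use the list index, which corresponds to your id-1.

@Clément How do you know that all the ids in the real data increase by 1 in ascending order?

We can't. Otherwise, if the ids are not sorted, you can still read the id from the first character of each line you read.

@Clément That only works if all the ids are single digits, which seems unlikely, no? Also, the header row needs special handling. So the proper way to read the file is to use ...

Thanks everyone, the id numbers are in fact consecutive from 1 to 100436.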
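For what it's worth, the points raised in these comments (multi-digit ids, ids that may not be consecutive, and the header row) can all be sidestepped by reading the id from the row itself. Here is a minimal sketch using csv.DictReader; it assumes the header names the columns "id" and "sequence", and the SAMPLE data and count_by_id name are invented for illustration.

```python
import csv
import io

# Hypothetical sample; the ids are deliberately non-consecutive and multi-digit.
SAMPLE = """id,sequence
7,MVLSEGEWAAVLHVWAKVEADVAAGHGQDILIRLFKS
12,MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSH
105,MVLSAAEGEWQLVLHVWAKVEADVAGHGQDILIRLFK
"""

def count_by_id(fp, pattern):
    """Map each id to its pattern count, keeping only sequences that match."""
    counts = {}
    # DictReader consumes the header row and exposes the columns by name.
    for row in csv.DictReader(fp):
        n = row["sequence"].count(pattern)
        if n:
            counts[row["id"]] = n
    return counts

result = count_by_id(io.StringIO(SAMPLE), "AA")
print(result)
```

With the real file you would pass the object returned by open("proteins.csv", newline="") instead of the StringIO.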