Python: extracting the unique bracketed parts of strings, in order


I want to extract the data enclosed in brackets and print it to another text file.

My text file is:

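RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060] PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata] PVV21041.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata] PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365] PYH31415.1 phenol monooxygenase [Aspergillus neoniger CBS 115656] PUB86175.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata] PUB86141.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata] PUB86139.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata] PUB79626.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata] PUB79624.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata] PUB72973.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata] PUB72971.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata] PWY90296.1 phenol monooxygenase [Aspergillus sclerotioniger CBS 115572] PWY63616.1 phenol monooxygenase [Aspergillus eucalypticola CBS 122712]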

I used this program:

infile = open('out3.txt', 'r')
outfile = open('out5.txt', 'w')
for l in infile:
    outfile.write(l.split()[-1] + '\n')
infile.close()
outfile.close()

but it doesn't work.

You want to use a regular expression in your program. Regular expressions are very useful for extracting text. Example:

Output

   dataFindMe
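
A minimal sketch of the idea, assuming a made-up input string and a pattern chosen for illustration (neither comes from the answer itself). The hypothetical helper `first_bracketed` captures everything between `[` and the next `]`, so spaces inside the brackets are matched too:

```python
import re

def first_bracketed(text):
    """Return the text inside the first [...] pair, or None if there is none."""
    # \[ and \] match literal brackets; ([^\]]+) captures one or more
    # characters that are not a closing bracket (spaces included).
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None

# Hypothetical input, used only for illustration.
print(first_bracketed("some text [dataFindMe] more text"))  # prints: dataFindMe
```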

This should do exactly what you want:

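# Write the text between the first '[' and the last ']' of each line.
with open('out3.txt', 'r') as infile, open('out5.txt', 'w') as outfile:
    for line in infile:
        start = line.find('[')
        end = line.rfind(']')
        if start != -1 and end > start:  # skip lines without a bracket pair
            outfile.write(line[start + 1:end] + "\n")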

out3.txt:

RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]
PVV21043.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PVV21041.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PYH66749.1 phenol monooxygenase [Aspergillus vadensis CBS 113365]
PYH31415.1 phenol monooxygenase [Aspergillus neoniger CBS 115656]
PUB86175.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86141.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB86139.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79626.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB79624.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72973.1 phenol 2-monooxygenase [gamma proteobacterium symbiont of Ctena orbiculata]
PUB72971.1 phenol hydroxylase [gamma proteobacterium symbiont of Ctena orbiculata]
PWY90296.1 phenol monooxygenase [Aspergillus sclerotioniger CBS 115572]
PWY63616.1 phenol monooxygenase [Aspergillus eucalypticola CBS 122712]

out5.txt:

Aspergillus aculeatinus CBS 121060
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus vadensis CBS 113365
Aspergillus neoniger CBS 115656
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus sclerotioniger CBS 115572
Aspergillus eucalypticola CBS 122712

Edit

If you only want to print the unique lines, you can update the source code as follows:

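# Same extraction, but each bracketed value is written only the first time it appears.
with open('out3.txt', 'r') as infile, open('out5.txt', 'w') as outfile:
    seen = set()
    for line in infile:
        start = line.find('[')
        end = line.rfind(']')
        if start == -1 or end <= start:
            continue  # no bracket pair on this line
        value = line[start + 1:end]
        if value not in seen:
            seen.add(value)
            outfile.write(value + "\n")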

You will then get the following output (out5.txt):

Aspergillus aculeatinus CBS 121060
gamma proteobacterium symbiont of Ctena orbiculata
Aspergillus vadensis CBS 113365
Aspergillus neoniger CBS 115656
Aspergillus sclerotioniger CBS 115572
Aspergillus eucalypticola CBS 122712

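Here is a regular-expression solution that works and keeps the []. The regex:

r'(\[.+\])'

The leading r marks a raw string, which stops Python from treating \ as the start of an escape sequence.

The outer parentheses () form a capture group, whose match ends up in the tuple returned by m.groups().

\[ and \]: the brackets must be "escaped" because they are regex metacharacters.

.+ means one or more (+) of any character (.).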
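
To see how those pieces fit together, here is a small sketch that applies the pattern to one record from the sample data. The helper name `bracketed_part` is hypothetical. Note that `.+` is greedy, so on a line with several bracket pairs it would match from the first `[` to the last `]`; the sample records have only one pair per line.

```python
import re

PATTERN = re.compile(r'(\[.+\])')

def bracketed_part(line):
    """Return the bracketed text (brackets included), or None if absent."""
    m = PATTERN.search(line)
    if m is None:
        return None
    # m.groups() is a tuple with one entry per capture group.
    return m.groups()[0]

line = "RAH71880.1 phenol monooxygenase [Aspergillus aculeatinus CBS 121060]\n"
print(bracketed_part(line))  # prints: [Aspergillus aculeatinus CBS 121060]
```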

Edit: this version uses OrderedDict to remove duplicates and preserve order (a set would not):
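
A minimal sketch of such a version, assuming the regex described in this answer and one record per line in out3.txt. The function names `unique_bracketed` and `write_unique` and the file handling are illustrative choices, not necessarily the answer's own code.

```python
import re
from collections import OrderedDict

PATTERN = re.compile(r'(\[.+\])')

def unique_bracketed(lines):
    """Return bracketed parts in first-seen order, without duplicates."""
    found = OrderedDict()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            # OrderedDict keys keep insertion order; assigning to an
            # existing key does not move it.
            found[m.group(1)] = None
    return list(found)

def write_unique(in_path, out_path):
    """Read records from in_path and write the unique bracketed parts."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for item in unique_bracketed(infile):
            outfile.write(item + '\n')
```

Calling `write_unique('out3.txt', 'out5.txt')` would process the files from the question. On Python 3.7 and later a plain dict also preserves insertion order, so `dict.fromkeys` would serve the same purpose.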

This gives the following in out5.txt:

[Aspergillus aculeatinus CBS 121060]
[gamma proteobacterium symbiont of Ctena orbiculata]
[Aspergillus vadensis CBS 113365]
[Aspergillus neoniger CBS 115656]
[Aspergillus sclerotioniger CBS 115572]
[Aspergillus eucalypticola CBS 122712]

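Your pattern doesn't work for this data. \w does not include spaces.

Thank you, sir.

I was a bit confused by your data when writing my answer. Can you clarify where the newlines are?

The newline comes after the closing bracket. I also want to remove the duplicate data in brackets, since the bracketed data is repeated in the program's result.

@Hlokhande Then you can keep each unique data line in a list and write it to the output file only when it is not already in that list.

@Hlokhande Can you provide more information? Can you show the code you tried?

I only need each bracketed value on one line, because it also gives duplicates of the data in brackets.

What is the purpose of this? You don't want duplicates? Why didn't you say so in your question? Does the order matter?

Yes, I don't want duplicates, and yes, the order matters.

@Hlokhande: In future, please specify your exact requirements and explain how the code you tried is supposed to meet them. This is not a program-writing service.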