Python: How to calculate molecular fingerprints and similarity for a list of SMILES structures using RDKit?

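I am using RDKit to calculate molecular similarity, based on the Tanimoto coefficient, between two lists of molecules given as SMILES structures. I can already extract the SMILES structures from two separate csv files. I would like to know how to pass these structures to the fingerprint module in RDKit, and how to calculate the similarity pairwise between the two lists of molecules.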

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols

# build the molecules (the list is abbreviated here)
ms = [Chem.MolFromSmiles('CCOC'), Chem.MolFromSmiles('CCO'),  # ...
      Chem.MolFromSmiles('COC')]
fps = [FingerprintMols.FingerprintMol(x) for x in ms]
DataStructs.FingerprintSimilarity(fps[0], fps[1])
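I want to put all of my SMILES structures (more than 10,000) into the "ms" list and get their fingerprints. Then I will compare the similarity between each pair of molecules from the two lists; a for loop is probably needed here.

Thanks in advance!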


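I used a pandas dataframe to select and print the lists with my structures, and saved them as list_1 and list_2. When execution reaches the ms1 line, the following error occurs: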

TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string<wchar_t, 
std::char_traits<wchar_t>, std::allocator<wchar_t> > from this Python object of type float
For example, the SMILES for gadodiamide is "O=C1[O-][Gd+3]234567[O]=C(C[N]2(CC[N]3(CC[O-]4)=O)CC[N]5(CC([O]6)NC)CC(=O)[O-]7)C1)NC", and the error is as follows (when running the fps line):

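If the original csv files look like the following, how can I include the molecule names along with the similarity values in the output file?

name,smiles,value,value2
molecule1,CCOCN(C)(C),0.25,A
molecule2,CCO,1.12,B
molecule3,COC,2.25,C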

I added the following code to include the molecule names in the output file; it raises some array value errors on the names (especially for d2):


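Answer edited to capture all comments.

RDKit has a bulk similarity function, so you can compare one fingerprint against a list of fingerprints. Just loop over the list of fingerprints.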
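To make the "one fingerprint against a list" idea concrete, here is a minimal pure-Python sketch of what a bulk Tanimoto comparison computes when every molecule in one list is scored against every molecule in a second list. It does not call RDKit: the functions `tanimoto` and `bulk_tanimoto` and the bit sets are hypothetical stand-ins, with each fingerprint represented as a set of on-bit indices, and returning 0.0 for two empty fingerprints is an assumption made here.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    # Assumption: two empty fingerprints are scored as 0.0.
    return len(a & b) / union if union else 0.0


def bulk_tanimoto(query, targets):
    """Compare one fingerprint against a list, returning one score per target."""
    return [tanimoto(query, t) for t in targets]


# Hypothetical on-bit sets standing in for real fingerprints.
list_1 = [{1, 2, 3}, {2, 3}]
list_2 = [{2, 3, 4}, {1, 2, 3}, {7, 8}]

# One row of scores per molecule in list_1, one column per molecule in list_2.
matrix = [bulk_tanimoto(fp, list_2) for fp in list_1]
for row in matrix:
    print([round(v, 3) for v in row])
```

With real RDKit fingerprints, the inner call would presumably be `DataStructs.BulkTanimotoSimilarity(fp, fps_2)`, where `fps_2` is the fingerprint list for the second file.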

If the CSVs look like this

First csv, with one invalid SMILES

smiles,value,value2
CCOCN(C)(C)C,0.25,A
CCO,1.12,B
COC,2.25,C
Second csv, with correct SMILES

smiles,value,value2
CCOCC,0.55,D
CCCO,2.58,E
CCCCO,5.01,F
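This is how to read the SMILES, drop the invalid ones, compute the fingerprint similarity without duplicate pairs, and save the sorted values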

from rdkit import Chem
from rdkit import DataStructs
from rdkit.Chem.Fingerprints import FingerprintMols
import pandas as pd

# read and concatenate the CSV files
df_1 = pd.read_csv('first.csv')
df_2 = pd.read_csv('second.csv')
df_3 = pd.concat([df_1, df_2])

# validate the SMILES and make a list of the canonical forms
df_smiles = df_3['smiles']
c_smiles = []
for ds in df_smiles:
    try:
        cs = Chem.CanonSmiles(ds)
        c_smiles.append(cs)
    except Exception: # raised when the SMILES cannot be canonicalized
        print('Invalid SMILES:', ds)
print()

# make a list of mols
ms = [Chem.MolFromSmiles(x) for x in c_smiles]

# make a list of fingerprints (fp)
fps = [FingerprintMols.FingerprintMol(x) for x in ms]

# the list for the dataframe
qu, ta, sim = [], [], []

# compare all fp pairwise without duplicates
for n in range(len(fps)-1): # stop before the last fp, which has nothing left to compare with
    s = DataStructs.BulkTanimotoSimilarity(fps[n], fps[n+1:]) # compare fp n with every later fp
    print(c_smiles[n], c_smiles[n+1:]) # which mol is compared with which group
    # collect the SMILES and values
    for m in range(len(s)):
        qu.append(c_smiles[n])
        ta.append(c_smiles[n+1+m])
        sim.append(s[m])
print()

# build the dataframe and sort it
d = {'query':qu, 'target':ta, 'Similarity':sim}
df_final = pd.DataFrame(data=d)
df_final = df_final.sort_values('Similarity', ascending=False)
print(df_final)

# save as csv
df_final.to_csv('third.csv', index=False, sep=',')
Printed output:

Invalid SMILES: CCOCN(C)(C)C

CCO ['COC', 'CCOCC', 'CCCO', 'CCCCO']
COC ['CCOCC', 'CCCO', 'CCCCO']
CCOCC ['CCCO', 'CCCCO']
CCCO ['CCCCO']

   query target  Similarity
9   CCCO  CCCCO    0.769231
2    CCO   CCCO    0.600000
1    CCO  CCOCC    0.500000
7  CCOCC   CCCO    0.466667
3    CCO  CCCCO    0.461538
8  CCOCC  CCCCO    0.388889
4    COC  CCOCC    0.333333
5    COC   CCCO    0.272727
0    CCO    COC    0.250000
6    COC  CCCCO    0.214286
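The question also asks how to carry molecule names into the output. One way is to keep each name paired with its fingerprint and iterate over unordered pairs, so names can never drift out of step with the scores. The sketch below is pure Python under stated assumptions: `named_pairs` and `set_tanimoto` are hypothetical helpers, and the names and bit sets are made-up stand-ins for rows of a csv with a name column.

```python
from itertools import combinations


def named_pairs(records, similarity):
    """Yield (name_a, name_b, score) for every unordered pair of records.

    records: list of (name, fingerprint) tuples
    similarity: callable taking two fingerprints and returning a float
    """
    for (name_a, fp_a), (name_b, fp_b) in combinations(records, 2):
        yield name_a, name_b, similarity(fp_a, fp_b)


def set_tanimoto(a, b):
    # Stand-in similarity on sets of on-bit indices (not an RDKit call).
    return len(a & b) / len(a | b)


# Hypothetical names and bit sets; real fingerprints would come from RDKit.
records = [("mol_a", {1, 2, 3}), ("mol_b", {2, 3, 4}), ("mol_c", {1, 2, 3, 4})]

# Sort by score, highest first, as the answer does with sort_values.
rows = sorted(named_pairs(records, set_tanimoto), key=lambda r: r[2], reverse=True)
for name_a, name_b, score in rows:
    print(name_a, name_b, round(score, 3))
```

With RDKit fingerprints, `DataStructs.TanimotoSimilarity` could be passed as `similarity`. Building `records` only from rows whose SMILES parsed successfully keeps the names aligned after invalid structures are dropped.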

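Thanks for your answer! Your code works well. So how can I import my structures from the csv files into the two lists?

In your question you wrote that you were able to extract the SMILES from the csv files. Didn't you put them into lists? What did you do?

I used a pandas dataframe to select and print the lists with my structures, and saved them as list_1 and list_2. When execution reaches the ms1 line, it fails with: TypeError: No registered converter was able to produce a C++ rvalue of type std::__cxx11::basic_string.

I added a pandas/csv example to my answer.

Thanks. I tried it, and it returned the same error at the same point.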
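The thread does not resolve this TypeError. One plausible cause, not confirmed here, is that the SMILES column contains empty cells: pandas reads those as float NaN, and the error message says a float was passed where a string was expected. A minimal sketch of filtering such values before they reach `Chem.MolFromSmiles` follows; `usable_smiles` is a hypothetical helper and the sample values are made up.

```python
def usable_smiles(values):
    """Split raw column values into (kept strings, rejected entries)."""
    kept, rejected = [], []
    for v in values:
        # Keep only non-empty strings; floats such as NaN are rejected.
        if isinstance(v, str) and v.strip():
            kept.append(v.strip())
        else:
            rejected.append(v)
    return kept, rejected


# float('nan') mimics what pandas typically stores for an empty csv cell.
raw = ["CCO", float("nan"), "  COC ", "", "CCCO"]
kept, rejected = usable_smiles(raw)
print(kept)
print(len(rejected), "entries skipped")
```

In a pandas workflow, calling `dropna()` on the SMILES column before building the list would likely have a similar effect for missing cells.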