PySpark string comparison with == throws an error for an RDKit function


I have a PySpark UDF defined as follows:

from pyspark.sql.functions import udf
from rdkit import Chem

input_smile = 'CCOC(=O)c1cc2cc(ccc2[nH]1)C(=O)O'
# Canonical SMILES of the query molecule
converted_smile_in = Chem.MolToSmiles(Chem.MolFromSmiles(input_smile))

def convertSmile(smile):
    # Parse the SMILES string and write it back in canonical form
    return Chem.MolToSmiles(Chem.MolFromSmiles(smile))

applyconvertSmileUdf = udf(convertSmile)

data_converted = data_converted.withColumn("converted_smile", applyconvertSmileUdf(data_filtered.smiles))

if __name__ == "__main__":
    # using the new approach
    data_converted.filter(data_converted.converted_smile == converted_smile_in).select("id", "smiles").show()
else:
    print("Cannot convert!")

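The comparison between data_converted.converted_smile and converted_smile_in throws an error. I have printed about 20 of the converted SMILES values and they look fine. Can we not do a string comparison this way?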

ArgumentError: Python argument types in rdkit.Chem.rdmolfiles.MolToSmiles(NoneType) did not match C++ signature: MolToSmiles(RDKit::ROMol mol, bool isomericSmiles=True, bool kekuleSmiles=False, int rootedAtAtom=-1, bool canonical=True, bool allBondsExplicit=False, bool allHsExplicit=False, bool doRandom=False)

Replace

data_converted.filter(data_converted.converted_smile == converted_smile_in).select("id", "smiles").show()

with the following, which uses lit from pyspark.sql.functions:

from pyspark.sql.functions import lit

data_converted.filter(data_converted.converted_smile == lit(converted_smile_in)).select("id", "smiles").show()
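
The traceback in the question shows MolToSmiles being called with NoneType, and None is what Chem.MolFromSmiles returns for a string it cannot parse. If some rows outside the 20 printed ones hold such strings, a None-safe conversion function may help alongside the lit() change. The following is only a minimal sketch under that assumption. The helper name make_safe_converter and the name convert_smile are hypothetical and do not come from the question or from RDKit.

```python
def make_safe_converter(parse, write):
    """Wrap a parse/write pair so that unparsable input yields None.

    `parse` and `write` are injected so the guard logic can be checked
    without RDKit; this helper is hypothetical, not an RDKit API.
    """
    def convert(smile):
        if smile is None:          # null cell in the DataFrame column
            return None
        mol = parse(smile)
        if mol is None:            # parser could not read the string
            return None
        return write(mol)
    return convert


try:
    from rdkit import Chem
    # Assumption: RDKit is installed wherever this function runs.
    convert_smile = make_safe_converter(Chem.MolFromSmiles, Chem.MolToSmiles)
except ImportError:
    convert_smile = None  # RDKit not available in this environment
```

In the question's setup, convert_smile could be passed to udf() in place of convertSmile. Rows that fail to parse would then carry null in converted_smile instead of raising inside the UDF, and the equality filter would simply not match them.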