Python: compare two strings in a DataFrame and show the differences
For example, given these two columns:
target read
AATGGCATC AATGGCATG
AATGATATA AAGGATATA
AATGATGTA CATGATGTA
I would like to add this column:
target read differences
AATGGCATC AATGGCATG (C,G,8)
AATGATATA AAGGATATA (T,G,2)
AATGATGTA CATGATGTA (A,C,0)
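Let's split each string into its individual characters (dropping the empty strings that the split leaves behind) and build a stacked DataFrame. There we can number each character with a cumulative count, which gives its position in the string, and then discard every character that is identical at the same position in both columns before building the final tuples.
The key functions here are explode, str.split, stack, and duplicated (used to drop the duplicates).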
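import numpy as np
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "target": ["AATGGCATC", "AATGATATA", "AATGATGTA"],
    "read": ["AATGGCATG", "AAGGATATA", "CATGATGTA"],
})

# One row per character, indexed by (row label, column name).
s = (
    df.stack()
    .str.split("")
    .explode()
    .to_frame("words")
    .replace("", np.nan)  # splitting on "" leaves empty strings at both ends
    .dropna()
)
# Position of each character within its own string.
s["enum"] = s.groupby(level=[0, 1]).cumcount()

# Keep only the characters that differ between the two columns at the same position.
flat = s.reset_index(0)
df["diff"] = (
    flat[~flat.duplicated(subset=["level_0", "words", "enum"], keep=False)]
    .groupby("level_0")
    .agg(words=("words", ",".join), pos=("enum", "first"))
    .agg(tuple, axis=1)
)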
print(df)
      target       read      diff
0  AATGGCATC  AATGGCATG  (C,G, 8)
1  AATGATATA  AAGGATATA  (T,G, 2)
2  AATGATGTA  CATGATGTA  (A,C, 0)
print(s.reset_index(0)[
    ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)])
        level_0 words  enum
target        0     C     8
read          0     G     8
target        1     T     2
read          1     G     2
target        2     A     0
read          2     C     0
Nice answer, @DataNearound.
@ShubhamSharma Thank you. I just realized I had made a mistake and fixed it; strange that the OP accepted this answer!
I think this simple function might help you (keep in mind that this is not a vectorized approach):
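import difflib as dl

import pandas as pd

# Create a DataFrame (example data from the question).
df = pd.DataFrame({
    "target": ["AATGGCATC", "AATGATATA", "AATGATGTA"],
    "read": ["AATGGCATG", "AAGGATATA", "CATGATGTA"],
})


def differences(a, b):
    """Return one (target_char, read_char, position) tuple per row.

    Only the first substitution in each pair is reported.
    """
    result = []
    for x, y in zip(a, b):
        # ndiff yields entries such as '  A', '- C', '+ G', one per character.
        diff = list(dl.ndiff(x.strip(), y.strip()))
        temp = [d[2] for d in diff if d[0] != " "]
        # The index of the first changed entry equals its 0-based string position.
        for i, d in enumerate(diff):
            if d[0] in "-+":
                temp.append(i)
        result.append(tuple(temp[:3]))
    return result


# Pass the two columns as arguments to the function.
df["differences"] = differences(df["target"], df["read"])
print(df)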
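If a pair of strings can differ in more than one place, the difflib version above only reports the first substitution. As a rough sketch of another option (not taken from either answer), a plain zip/enumerate comparison lists every mismatched position. The helper name mismatches is hypothetical, and it assumes both strings have the same length, as in the question's data.

```python
def mismatches(target, read):
    """Return (target_char, read_char, position) for every differing position.

    Hypothetical helper for illustration; assumes equal-length strings.
    """
    target, read = target.strip(), read.strip()
    if len(target) != len(read):
        raise ValueError("strings must have the same length")
    # Walk both strings in parallel and keep the positions where they disagree.
    return [(t, r, i) for i, (t, r) in enumerate(zip(target, read)) if t != r]


# Example pairs from the question, plus one pair with two substitutions.
pairs = [
    ("AATGGCATC", "AATGGCATG"),
    ("AATGATATA", "AAGGATATA"),
    ("AATGATGTA", "CATGATGTA"),
    ("AATG", "CATC"),
]
results = [mismatches(t, r) for t, r in pairs]
for (t, r), diff in zip(pairs, results):
    print(t, r, diff)
```

With a DataFrame, the same helper could be applied row by row, for instance with a list comprehension over zip(df["target"], df["read"]) assigned to a new column. This is still not vectorized, so it trades speed for readability.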