Python: compare two strings in a DataFrame and show the differences


Say I have two columns:

target        read
AATGGCATC     AATGGCATG
AATGATATA     AAGGATATA
AATGATGTA     CATGATGTA
I want to add this column:

target        read       differences
AATGGCATC     AATGGCATG  (C,G,8)
AATGATATA     AAGGATATA  (T,G,3)
AATGATGTA     CATGATGTA  (A,G,0)
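Let's split every string into single characters (dropping the empty strings the split leaves behind) and build a stacked DataFrame. There we can number each character with a cumulative count and, when building the final tuples, discard every (row, character, position) combination that appears in both columns.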

The key functions here are:
explode
str.split
stack
duplicated (used to drop the repeated rows)

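import numpy as np
import pandas as pd

# df is assumed to already hold the string columns 'target' and 'read'
s = (
    df.stack()              # one entry per (row, column) pair
    .str.split("")          # split into characters; leaves "" at both ends
    .explode()              # one row per character
    .to_frame("words")
    .replace("", np.nan)    # mark the empty strings from the split ...
    .dropna()               # ... and drop them
)

# position of each character within its own (row, column) string
s['enum'] = s.groupby(level=[0,1]).cumcount()

# move the row label into a 'level_0' column so it can be used as a key
flat = s.reset_index(0)

df["diff"] = (
    # keep only (row, character, position) triples that occur once, i.e. the mismatches
    flat[~flat.duplicated(subset=["level_0", "words", "enum"], keep=False)]
    .groupby("level_0")
    .agg(words=("words", ",".join), pos=("enum", "first"))
    .agg(tuple, axis=1)
)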
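
To see what the pipeline does at each stage, it can help to trace a single row. The sketch below is only an illustration, not the answer's exact code: the one-row `df` is made up for the example, and `map(list)` is swapped in for `str.split("")` (an assumption on my part) so that no empty strings need cleaning up afterwards.

```python
import pandas as pd

# Hypothetical one-row frame, used only for this walkthrough
df = pd.DataFrame({"target": ["AATGGCATC"], "read": ["AATGGCATG"]})

# stack(): one entry per (row label, column name) pair
stacked = df.stack()

# map(list) stands in for str.split(""): each string becomes a list of characters,
# and explode() then gives every character its own row
chars = stacked.map(list).explode().to_frame("words")

# cumcount() numbers the characters inside each (row, column) group: 0, 1, 2, ...
chars["enum"] = chars.groupby(level=[0, 1]).cumcount()

# A (row, character, position) triple that occurs only once marks a mismatch
flat = chars.reset_index(0)
mismatches = flat[~flat.duplicated(subset=["level_0", "words", "enum"], keep=False)]
print(mismatches)
```

Only the two rows for position 8 should survive (`C` from `target`, `G` from `read`), because every other character occurs at the same position in both columns.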
                    



I think this simple function, `differences`, may help you (bear in mind that it is not a vectorized approach).


Nice answer, @DataNearound.
@ShubhamSharma thanks. I just realized I had made a mistake and fixed it; strangely, the OP had accepted the answer anyway!
print(df)

     target       read      diff
0  AATGGCATC  AATGGCATG  (C,G, 8)
1  AATGATATA  AAGGATATA  (T,G, 2)
2  AATGATGTA  CATGATGTA  (A,C, 0)
print(s.reset_index(0)[
          ~s.reset_index(0).duplicated(subset=["level_0", "words", "enum"], keep=False)])

        level_0 words  enum
target        0     C     8
read          0     G     8
target        1     T     2
read          1     G     2
target        2     A     0
read          2     C     0
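
The intermediate table above still has two rows per mismatch. As a rough sketch of how the final step might collapse it into one tuple per original row, the frame below is rebuilt by hand from the first two rows of that output (the variable names are mine), and `apply(tuple, axis=1)` is used in place of `.agg(tuple, axis=1)` as an assumed equivalent for this purpose.

```python
import pandas as pd

# Hand-built copy of part of the intermediate table (illustrative only)
mismatches = pd.DataFrame(
    {"level_0": [0, 0, 1, 1], "words": ["C", "G", "T", "G"], "enum": [8, 8, 2, 2]},
    index=["target", "read", "target", "read"],
)

# Named aggregation: join the differing characters and keep the first position
collapsed = mismatches.groupby("level_0").agg(
    words=("words", ",".join),
    pos=("enum", "first"),
)

# Turn each (words, pos) row into a tuple, one per original DataFrame row
result = collapsed.apply(tuple, axis=1)
print(result)
```

Because `result` is indexed by the original row labels, assigning it to `df["diff"]` lines each tuple up with the row it came from. Note that `pos` keeps only the first position, so rows with several mismatches would report all differing characters but a single index.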
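import pandas as pd
import difflib as dl

# df is assumed to be an existing DataFrame with string columns 'target' and 'read';
# pass the two columns as arguments to the function below

def differences(a, b):
    result = []
    # walk through both columns pairwise, independent of the index labels
    for s1, s2 in zip(a, b):
        # character-level diff: entries look like '  A', '- C', '+ G'
        diff = list(dl.ndiff(s1.strip(), s2.strip()))
        # the removed/added characters themselves
        temp = [x[2] for x in diff if x[0] != ' ']
        # positions of the changed entries within the diff listing
        temp.extend(i for i, x in enumerate(diff) if x[0] in '-+')
        # keep (removed char, added char, first position)
        result.append(tuple(temp[:3]))
    return result

df['differences'] = differences(df['target'], df['read'])
print(df)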
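
One caveat with the `difflib` version is that the position it reports is an index into the diff listing, which matches the string index only while nothing before it has changed. If the two strings always have the same length and differ by substitutions only (an assumption, though it holds for the sample data), a plain positional comparison may be enough. The helper name `first_mismatch` below is hypothetical, and the lists simply repeat the question's sample rows.

```python
def first_mismatch(target, read):
    """Return (target_char, read_char, index) for the first differing position,
    or None when the compared characters are all equal."""
    for i, (t, r) in enumerate(zip(target.strip(), read.strip())):
        if t != r:
            return (t, r, i)
    return None


# Sample rows from the question
targets = ["AATGGCATC", "AATGATATA", "AATGATGTA"]
reads = ["AATGGCATG", "AAGGATATA", "CATGATGTA"]

diffs = [first_mismatch(t, r) for t, r in zip(targets, reads)]
print(diffs)  # [('C', 'G', 8), ('T', 'G', 2), ('A', 'C', 0)]
```

With a DataFrame, the same helper could be applied as `df["diff"] = [first_mismatch(t, r) for t, r in zip(df["target"], df["read"])]`. Like the function above, this is a Python-level loop rather than a vectorized operation.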