Python: matching a sequence pattern between two DataFrames with similar strings, keeping index and order
I have two DataFrames, df and df1. I need to match a sequence of one or more strings from df1 against df and get the uniquely matched sequence of strings, together with its index numbers from df, as output.
Expected output:
idx  id_0    user  string
4    009124  17    today
5    000029  13    is
6    548751  21    a
7    479903  19    bright
8    897054  08    sunny
9    336588  7     day
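I tried several approaches (pd.merge, pd.concat, pd.join, and also isin), but I got the wrong index numbers. For example,
converting df1.string to a list and comparing it against df with apply and a lambda function: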
import pandas as pd

df = pd.DataFrame(
    [
        [0, "008457", "02", "hello"],
        [1, "990037", "05", "I"],
        [2, "774426", "10", "am"],
        [3, "564389", "08", "sleeping"],
        [4, "009124", "17", "today"],
        [5, "000029", "13", "is"],
        [6, "548751", "21", "a"],
        [7, "479903", "19", "bright"],
        [8, "897054", "08", "sunny"],
        [9, "336588", "7", "day"],
        [10, "294260", "16", "today"],
        [11, "908751", "29", "is"],
        [12, "558902", "81", "rainy"],
        [13, "097856", "19", "with"],
        [14, "110044", "24", "cold"],
        [15, "775098", "16", "today"],
        [16, "665490", "02", "is"],
        [17, "887099", "07", "sunday"],
        [18, "389011", "18", "ahhh"],
        [19, "675510", "11", "weekend"],
    ],
    columns=["idx", "id_0", "user", "string"],
)
df1 = pd.DataFrame(
    [
        [0, "today"],
        [1, "is"],
        [2, "a"],
        [3, "bright"],
        [4, "sunny"],
        [5, "day"],
    ],
    columns=["idx", "string"],
)
string_list = df1.string.tolist()
filt = df['string'].apply(lambda x: any([k in x for k in string_list]))
print(df[filt])
This returns:
    idx    id_0 user  string
2     2  774426   10      am
4     4  009124   17   today
5     5  000029   13      is
6     6  548751   21       a
7     7  479903   19  bright
8     8  897054   08   sunny
9     9  336588    7     day
10   10  294260   16   today
11   11  908751   29      is
12   12  558902   81   rainy
15   15  775098   16   today
16   16  665490   02      is
17   17  887099   07  sunday
18   18  389011   18    ahhh
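The extra rows (am, rainy, sunday, ahhh) appear to come from the fact that `k in x` on two strings is a substring test, so "a" matches "am" and "day" matches "sunday". A minimal sketch, using a shortened, hypothetical word list rather than the full df, suggests that switching to exact membership with isin removes the substring hits but still ignores the order and position of the words:

```python
import pandas as pd

# Shortened, hypothetical sample; not the full df from the question.
words = pd.Series(["hello", "am", "today", "is", "a", "sunday", "today", "is"])
pattern = ["today", "is", "a"]

# Substring test: "a" is contained in "am", "today" and "sunday".
substring_hits = words.apply(lambda x: any(k in x for k in pattern))

# Exact membership: only whole-word matches, but repeated words still match.
exact_hits = words.isin(pattern)

print(words[substring_hits].tolist())
print(words[exact_hits].tolist())
```

Neither filter checks that the words occur consecutively and in the order given by the pattern, which is why the second "today", "is" pair is still selected.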
One possible approach is as follows:
import pandas as pd

df = pd.DataFrame(
    [
        [0, "008457", "02", "hello"],
        [1, "990037", "05", "I"],
        [2, "774426", "10", "am"],
        [3, "564389", "08", "sleeping"],
        [4, "009124", "17", "today"],
        [5, "000029", "13", "is"],
        [6, "548751", "21", "a"],
        [7, "479903", "19", "bright"],
        [8, "897054", "08", "sunny"],
        [9, "336588", "7", "day"],
        [10, "294260", "16", "today"],
        [11, "908751", "29", "is"],
        [12, "558902", "81", "rainy"],
        [13, "097856", "19", "with"],
        [14, "110044", "24", "cold"],
        [15, "775098", "16", "today"],
        [16, "665490", "02", "is"],
        [17, "887099", "07", "sunday"],
        [18, "389011", "18", "ahhh"],
        [19, "675510", "11", "weekend"],
    ],
    columns=["idx", "id_0", "user", "string"],
)
# Use the idx column as the index so the original labels are kept in the result.
df = df.set_index('idx')
df1 = pd.DataFrame(
    [
        [0, "today"],
        [1, "is"],
        [2, "a"],
        [3, "bright"],
        [4, "sunny"],
        [5, "day"],
    ],
    columns=["idx", "string"],
)
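matching_indices = []
# Slide a window of len(df1) rows over df and compare it with df1.string.
for i in range(len(df)-len(df1)+1):
    if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
        # Record the positions of every row in the matching window.
        matching_indices += list(range(i,i+len(df1)))
df.iloc[matching_indices]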
Output:
       id_0 user  string
idx
4    009124   17   today
5    000029   13      is
6    548751   21       a
7    479903   19  bright
8    897054   08   sunny
9    336588    7     day
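The code above returns every matching subsequence with its correct index, not just the first occurrence.
If you want only the first match, break out of the loop the first time a match is found, as follows: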
matching_indices = []
for i in range(len(df)-len(df1)+1):
    if (df.string.iloc[i:i+len(df1)].values == df1.string.values).all():
        matching_indices += list(range(i,i+len(df1)))
        break
df.iloc[matching_indices]
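The sliding-window comparison used in the two loops above could also be wrapped in a reusable function. The following is only a sketch: the name match_sequence and its parameters column and first_only are hypothetical, and the small demo frame is made up so that its index labels differ from the row positions.

```python
import pandas as pd


def match_sequence(frame, pattern, column="string", first_only=False):
    """Return the rows of `frame` where `pattern` occurs as a consecutive run.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    values = frame[column].tolist()
    pattern = list(pattern)
    width = len(pattern)
    positions = []
    if width == 0:
        return frame.iloc[positions]
    for start in range(len(values) - width + 1):
        # Compare one window of `width` consecutive values with the pattern.
        if values[start:start + width] == pattern:
            positions.extend(range(start, start + width))
            if first_only:
                break
    # Positional selection keeps the original index labels of `frame`.
    return frame.iloc[positions]


# Small hypothetical frame whose index labels differ from the row positions.
demo = pd.DataFrame(
    {"string": ["is", "today", "is", "a", "day", "today", "is", "a"]},
    index=[10, 11, 12, 13, 14, 15, 16, 17],
)
all_matches = match_sequence(demo, ["today", "is", "a"])
first_match = match_sequence(demo, ["today", "is", "a"], first_only=True)
print(all_matches.index.tolist())
print(first_match.index.tolist())
```

With the data from the answer, the call would presumably look like match_sequence(df, df1.string). As in the loops above, overlapping matches would add the shared positions more than once.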
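"I tried several approaches (pd.merge, pd.concat, pd.join, and also isin), but I got the wrong index numbers." Please include those attempts in your question.
I see that some words in df are repeated; how do you know which index/user to keep? It seems you just kept the first one.
If you include a reproducible, copy-pasteable piece of code that creates your input DataFrames, it may be easier for others to answer your question.
@np8 df = pd.read_clipboard() works well on these examples.
@Dan I have to keep the index of df, not that of the small dataset df1.
This raises TypeError: 'in <string>' requires string as left operand, not float. It also returns more rows than the matched sequence of items.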
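The TypeError quoted in the last comment is what Python raises when the left operand of `in` is a float and the right operand is a string. One plausible cause, stated here as an assumption because the data that triggered it is not shown, is a missing value (NaN, which is a float) in the string column of df1. A minimal sketch with a hypothetical column:

```python
import numpy as np
import pandas as pd

# Hypothetical pattern column containing a missing value.
df1 = pd.DataFrame({"string": ["today", np.nan, "is"]})

words = df1["string"].tolist()
try:
    # NaN is a float, so `nan in "sunny"` raises TypeError.
    [k in "sunny" for k in words]
    raised = False
except TypeError:
    raised = True

# Dropping missing values first avoids the error.
clean_words = df1["string"].dropna().tolist()
flags = [k in "sunny" for k in clean_words]
print(raised, flags)
```

Dropping the missing values only removes the error; the substring test itself would still return more rows than the ordered sequence.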