Pandas: map a column in one df to another df where all words are present
I am trying to map a column in one DataFrame to another DataFrame, matching a row when all of the words in the target DataFrame's entry are present. Multiple matches are fine, since I can filter them out afterwards. Thanks in advance.
Tags: pandas, python-2.7, numpy
df1
ColA
this is a sentence
with some words
in a column
and another
for fun
df2
ColB       ColC
this a     123
in column  456
fun times  789
Some attempts
dfResult = df1.apply(lambda x: np.all([word in x.df1['ColA'].split(' ') for word in x.df2['ColB'].split(' ')]),axis = 1)
dfResult = df1.ColA.apply(lambda sentence: all(word in sentence for word in df2.ColB))
Desired output
dfResult
ColA                ColC
this is a sentence  123
with some words     NaN
in a column         456
and another         NaN
for fun             NaN
Convert to `set` and use NumPy broadcasting to find subsets
Disclaimer: no guarantee that this will be fast.
import numpy as np
import pandas as pd

# One set of words per row, held in object arrays.
A = df1.ColA.str.split().apply(set).to_numpy()  # If pandas version is < 0.24, use `.values`
B = df2.ColB.str.split().apply(set).to_numpy()  # instead of `.to_numpy()`
C = df2.ColC.to_numpy()
# When `dtype` is `object`, NumPy falls back on performing
# the operation on each pair of values. Since these are `set` objects,
# `<=` tests for subset.
i, j = np.where(B <= A[:, None])
out = pd.array([np.nan] * len(A), pd.Int64Dtype())  # Empty nullable integers
# Use `out = np.empty(len(A), dtype=object)` if pandas version is < 0.24
out[i] = C[j]
df1.assign(ColC=out)
ColA ColC
0 this is a sentence 123
1 with some words NaN
2 in a column 456
3 and another NaN
4 for fun NaN
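To make the broadcasting step easier to follow, here is a minimal sketch on a two-row miniature of the data. The arrays are filled by hand purely for illustration; the names `A`, `B`, and `mask` mirror the answer above but are otherwise assumptions of this sketch.

```python
import numpy as np

# Hypothetical miniature of the data above: each entry is a set of words.
A = np.empty(2, dtype=object)
A[0] = {"this", "is", "a", "sentence"}
A[1] = {"for", "fun"}

B = np.empty(2, dtype=object)
B[0] = {"this", "a"}
B[1] = {"fun", "times"}

# A[:, None] has shape (2, 1) and B has shape (2,), so the comparison
# broadcasts to a (2, 2) grid. mask[i, j] is True when B[j] <= A[i],
# i.e. when every word of B[j] also occurs in A[i].
mask = (B <= A[:, None]).astype(bool)

# Row and column positions of the True cells: i indexes A, j indexes B.
i, j = np.where(mask)
```

Only the pair (0, 0) matches here, because "times" is missing from "for fun". If one sentence matched several rows of `B`, `out[i] = C[j]` would keep whichever assignment is applied last.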
Using a loop and set.issubset
pd.DataFrame([[y if set(z.split()).issubset(set(x.split())) else np.nan for z,y in zip(df2.ColB,df2.ColC)] for x in df1.ColA ]).max(1)
Out[34]:
0 123.0
1 NaN
2 456.0
3 NaN
4 NaN
dtype: float64
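Note that `.max(1)` keeps only the largest `ColC` when a sentence matches several rows. Since the question says multiple matches are fine, a variant that keeps every match might look like the sketch below. The helper name `all_matches` and the `targets` list are hypothetical, introduced only for this illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"ColA": ["this is a sentence", "with some words",
                             "in a column", "and another", "for fun"]})
df2 = pd.DataFrame({"ColB": ["this a", "in column", "fun times"],
                    "ColC": [123, 456, 789]})

# Split ColB once up front so the inner loop only compares sets.
targets = [(set(b.split()), c) for b, c in zip(df2.ColB, df2.ColC)]

def all_matches(sentence):
    """Return every ColC whose ColB words all occur in `sentence`."""
    words = set(sentence.split())
    return [c for b_words, c in targets if b_words.issubset(words)]

# One list of matching ColC values per row of df1 (empty list = no match).
matches = df1.ColA.apply(all_matches)
```

The result is a Series of lists, so an empty list takes the place of NaN. It can be filtered afterwards, or expanded with `Series.explode` on pandas 0.25 or later.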