Python: select rows that contain any string from a list of strings
I am trying to select the rows whose "story" column contains any of the selected words in my list.
I have tried several options, including isin and str.contains, but I usually get either an error or an empty DataFrame.
df4 = pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing")
df4["story"] = df4["story"].astype(str)
selected_words = ['accept', 'believe', 'trust', 'accepted', 'accepts',
    'trusts', 'believes', 'acceptance', 'trusted', 'trusting', 'accepting', 'believes', 'believing', 'believed', 'normal', 'normalize', ' normalized', 'routine', 'belief', 'faith', 'confidence', 'adoption',
    'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves']
I don't know what to do next.
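Depending on what I try, I get either an empty DataFrame or an error message.

Try this. I was not able to load your DataFrame, so it is untested against your data.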
df4[df4["story"].isin(selected_words)]
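One possible reason for the empty result with isin: it tests whether each cell is exactly equal to one of the listed values, not whether the cell contains one of them. The following is a minimal sketch on made-up data (the sample sentences and the short word list are assumptions, not rows from the asker's file) that contrasts the two checks.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's "story" column.
df = pd.DataFrame({"story": ["I trust this coin", "trust", "prices fell today"]})
selected_words = ["accept", "believe", "trust"]

# isin: True only where the whole cell equals one of the words.
exact = df[df["story"].isin(selected_words)]

# str.contains: True where any of the words occurs somewhere inside the cell.
partial = df[df["story"].str.contains("|".join(selected_words))]

print(exact["story"].tolist())
print(partial["story"].tolist())
```

On this sample, only the cell that consists of the single word "trust" survives the isin filter, while the substring check also keeps the full sentence.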
Here is a solution. Basically, str.contains supports regular expressions, so you can join the words with the OR pipe (|):
df4[df4.story.str.contains('|'.join(selected_words))]
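Because the joined string is interpreted as a regular expression, words containing characters such as "(" or "." would change the pattern's meaning, and missing values in the column can produce NaN entries in the mask. A slightly more defensive sketch, assuming made-up sample data and a hypothetical word list, escapes each word with re.escape and passes na=False (case=False is optional and only added here to show case-insensitive matching):

```python
import re
import pandas as pd

# Hypothetical sample data, including a missing value.
df = pd.DataFrame(
    {"story": ["We ACCEPT bitcoin", None, "cost is $5 (approx.)", "nothing here"]}
)
selected_words = ["accept", "(approx.)"]

# Escape regex metacharacters so every word is matched literally.
pattern = "|".join(re.escape(word) for word in selected_words)

# na=False treats missing cells as "no match"; case=False ignores letter case.
mask = df["story"].str.contains(pattern, case=False, na=False)
result = df[mask]

print(result["story"].tolist())
```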
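I am learning more pandas myself at the moment, so I wanted to contribute an approach I just learned. You can build a boolean mask as a pandas Series and use it to filter the DataFrame.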
import pandas as pd
# This URL doesn't return CSV.
CSV_URL = "https://drive.google.com/open?id=1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW"
# Data file saved from within a browser to help with question.
# I stored the BitcoinData.csv data on my Minio server.
df = pd.read_csv("https://minio.apps.selfip.com/mymedia/csv/BitcoinData.csv")
selected_words = [
"accept",
"believe",
"trust",
"accepted",
"accepts",
"trusts",
"believes",
"acceptance",
"trusted",
"trusting",
"accepting",
"believes",
"believing",
"believed",
"normal",
"normalize",
" normalized",
"routine",
"belief",
"faith",
"confidence",
"adoption",
"adopt",
"adopted",
"embrace",
"approve",
"approval",
"approved",
"approves",
]
# %%timeit run in Jupyter notebook
mask = pd.Series(any(word in item for word in selected_words) for item in df["story"])
# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# %%timeit run in Jupyter notebook
df[mask]
# results: 955 µs ± 6.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# %%timeit run in Jupyter notebook
df[df.story.str.contains('|'.join(selected_words))]
# results 129 ms ± 738 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# True for all
df[mask] == df[df.story.str.contains('|'.join(selected_words))]
# It is possible to calculate the mask inside of the index operation though of course a time penalty is taken rather than using the calculated and stored mask.
# %%timeit run in Jupyter notebook
df[[any(word in item for word in selected_words) for item in df["story"]]]
# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# This is still faster than using the alternative `df.story.str.contains`
#
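Since the code above depends on a remote CSV file, here is a small self-contained sketch of the same mask idea on made-up data (the three sample sentences and the two-word list are assumptions). It also passes index=df.index when building the mask, because a boolean Series is aligned to the DataFrame by index, and it compares the two results with DataFrame.equals, which gives a single True or False rather than an element-wise table.

```python
import pandas as pd

# Hypothetical sample data in place of BitcoinData.csv.
df = pd.DataFrame(
    {"story": ["people believe in it", "no match", "wide adoption soon"]}
)
selected_words = ["believe", "adoption"]

# Plain-Python substring test per row, stored as a boolean Series
# that shares the DataFrame's index.
mask = pd.Series(
    [any(word in item for word in selected_words) for item in df["story"]],
    index=df.index,
)

by_mask = df[mask]
by_regex = df[df["story"].str.contains("|".join(selected_words))]

# One boolean answer: do both approaches select the same rows?
same = by_mask.equals(by_regex)
print(same)
```

Note that the two approaches only agree when the words contain no regex metacharacters, since the mask does literal substring tests while str.contains interprets the joined string as a pattern.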
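The mask approach is noticeably faster.

Could you show us what you tried, and share the empty df or the traceback?

Part of the problem may be this part of the code: df4 = pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing"). Although that URL displays the CSV data in a browser, it does not return CSV data when it is requested by an HTTP client.

Thanks for your answer. I tried it and got an empty DataFrame, just like before. Let me see how I can share the CSV differently, as suggested.

This gave me a list of booleans back, so I modified it slightly: words = df4[df4.story.str.contains('|'.join(selected_words))], and got the result I needed. Thank you very much!!!!!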
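On the point raised above about the Google Drive link: a share link returns an HTML viewer page rather than the file itself. A commonly used workaround is to rewrite the link into the "uc?export=download" form before passing it to pd.read_csv. The sketch below is only that, a sketch: the helper name drive_direct_url is hypothetical, and it assumes the file is shared publicly and is small enough that Google Drive serves it without an intermediate confirmation page.

```python
import re


def drive_direct_url(share_url: str) -> str:
    """Turn a Google Drive share link into a direct-download link.

    Assumption: the file ID appears either after "/d/" or as an "id="
    query parameter, as in the two URLs quoted in this thread.
    """
    match = re.search(r"/d/([\w-]+)|[?&]id=([\w-]+)", share_url)
    if match is None:
        raise ValueError(f"no Google Drive file ID found in {share_url!r}")
    file_id = match.group(1) or match.group(2)
    return f"https://drive.google.com/uc?export=download&id={file_id}"


url = drive_direct_url(
    "https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing"
)
print(url)
# The result could then be tried with pd.read_csv(url); whether that
# succeeds depends on the file's sharing settings.
```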