Python: select rows that contain any string from a list of strings

python pandas


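I am trying to select the rows whose "story" column contains any of the words in my list.

I have tried several options, including isin and str.contains, but I usually get either an error or an empty DataFrame.

df4 = pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing")
df4["story"] = df4["story"].astype(str)
selected_words = ['accept', 'believe', 'trust', 'accepted', 'accepts',
    'trusts', 'believes', 'acceptance', 'trusted', 'trusting', 'accepting',
    'believes', 'believing', 'believed', 'normal', 'normalize', 'normalized',
    'routine', 'belief', 'faith', 'confidence', 'adoption',
    'adopt', 'adopted', 'embrace', 'approve', 'approval', 'approved', 'approves']

Now I don't know what to do next.

Depending on what I try, I get back either an empty DataFrame or an error message.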

Try this. I couldn't load your DataFrame.

df4[df4["story"].isin(selected_words)]
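A note on why this can come back empty: isin compares each cell's whole value against the list, so a row matches only when the story is exactly one of the words. A minimal sketch with made-up data (the three sample stories are assumptions, not rows from the asker's file):

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's CSV.
df4 = pd.DataFrame({"story": ["I trust this coin", "trust", "prices fell again"]})
selected_words = ["accept", "believe", "trust"]

# isin() tests whole-cell equality, so only the row that is exactly "trust" matches.
exact = df4[df4["story"].isin(selected_words)]

# str.contains() tests for a regex match anywhere inside the cell.
partial = df4[df4["story"].str.contains("|".join(selected_words))]

print(exact["story"].tolist())
print(partial["story"].tolist())
```

On real story text, where each cell is a full sentence or paragraph, the isin filter would therefore be expected to select nothing.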

There is an existing solution for this.

Basically, str.contains supports regular expressions, so you can join the words with the OR pipe (|):

df4[df4.story.str.contains('|'.join(selected_words))]
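A slightly more defensive variant of the same idea, as a sketch: escaping each word with re.escape keeps regex metacharacters literal, na=False treats missing stories as non-matches, and case=False ignores case. The sample rows and the "c++" entry are invented for illustration and are not part of the asker's data or word list.

```python
import re
import pandas as pd

# Hypothetical sample data; None stands for a missing story.
df4 = pd.DataFrame(
    {"story": ["Banks ACCEPT bitcoin", None, "prices fell", "c++ adoption grows"]}
)
# Hypothetical word list; "c++" is included only to show why escaping matters.
selected_words = ["accept", "adoption", "c++"]

# Escape each word so characters such as "+" are matched literally,
# then join the alternatives with the regex OR operator.
pattern = "|".join(re.escape(word) for word in selected_words)

# na=False makes missing values count as "no match", so the mask is purely boolean.
mask = df4["story"].str.contains(pattern, case=False, na=False)
result = df4[mask]
print(result["story"].tolist())
```

Without re.escape, a word like "c++" would be rejected as an invalid regular expression, so escaping is worth the extra line whenever the word list is not fully under your control.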

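I am learning more pandas myself at the moment, so I wanted to contribute an approach I picked up recently.

You can build a mask as a pandas Series and use it to filter the DataFrame.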

import pandas as pd

# This URL doesn't return CSV.
CSV_URL = "https://drive.google.com/open?id=1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW"
# Data file saved from within a browser to help with question.

# I stored the BitcoinData.csv data on my Minio server.
df = pd.read_csv("https://minio.apps.selfip.com/mymedia/csv/BitcoinData.csv")


selected_words = [
    "accept",
    "believe",
    "trust",
    "accepted",
    "accepts",
    "trusts",
    "believes",
    "acceptance",
    "trusted",
    "trusting",
    "accepting",
    "believes",
    "believing",
    "believed",
    "normal",
    "normalize",
    "normalized",
    "routine",
    "belief",
    "faith",
    "confidence",
    "adoption",
    "adopt",
    "adopted",
    "embrace",
    "approve",
    "approval",
    "approved",
    "approves",
]

# %%timeit run in Jupyter notebook

mask = pd.Series(any(word in item for word in selected_words) for item in df["story"])

# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# %%timeit run in Jupyter notebook

df[mask]

# results: 955 µs ± 6.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# %%timeit run in Jupyter notebook

df[df.story.str.contains('|'.join(selected_words))]

# results 129 ms ± 738 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# True for all
df[mask] == df[df.story.str.contains('|'.join(selected_words))]

# It is possible to calculate the mask inside of the index operation though of course a time penalty is taken rather than using the calculated and stored mask.

# %%timeit run in Jupyter notebook

df[[any(word in item for word in selected_words) for item in df["story"]]]

# results 18.2 ms ± 94.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# This is still faster than using the alternative `df.story.str.contains`


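The mask approach is noticeably faster: in the timings above, building the mask took about 18 ms, against 129 ms for str.contains.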
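To check on a small scale that both approaches select the same rows, here is a self-contained sketch. The four sample stories are invented and the word list is shortened; it says nothing about speed, since the timings quoted above were measured on the answerer's own data.

```python
import pandas as pd

# Hypothetical sample data standing in for the real CSV.
df = pd.DataFrame(
    {
        "story": [
            "Merchants accept bitcoin",
            "The price dropped",
            "Investors believed the hype",
            "Nothing to report",
        ]
    }
)
selected_words = ["accept", "believe", "trust"]

# Plain-Python mask: True when any word is a substring of the story.
# Passing index=df.index keeps the mask aligned even if df has a custom index.
mask = pd.Series(
    [any(word in item for word in selected_words) for item in df["story"]],
    index=df.index,
)

by_mask = df[mask]
by_regex = df[df["story"].str.contains("|".join(selected_words))]

# equals() compares the two results as a whole and returns a single bool.
same = by_mask.equals(by_regex)
print(same)
```

Note that the plain-Python mask assumes every story is a string; a missing value would need to be converted first, for example with astype(str) as in the question.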

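Can you show us what you tried, and share the empty DataFrame or the traceback? Part of the problem may be this line: df4 = pd.read_csv("https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing"). That URL displays the CSV data in a browser, but it does not return CSV data when called from an HTTP client.

Thanks for your answer. I tried it and got an empty DataFrame, the same as before. Let me see how to link the CSV differently, as suggested.

This gave me a list of booleans, so I modified it slightly: words = df4[df4.story.str.contains('|'.join(selected_words))], and that produced the result I needed. Thank you very much!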
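On the loading problem raised in the comments: a Google Drive share link points at a viewer page, not at the raw file. One commonly used workaround, shown here as an unverified sketch, is to rewrite the link into a direct-download form before passing it to pd.read_csv. The helper name drive_share_to_download_url is hypothetical, and the uc?export=download&id=... URL shape is an assumption that may not hold for private or very large files.

```python
import re


def drive_share_to_download_url(share_url: str) -> str:
    """Turn a Drive share link into a direct-download link (assumed URL format)."""
    # Share links look like .../file/d/<id>/view or .../open?id=<id>.
    match = re.search(r"/d/([\w-]+)|[?&]id=([\w-]+)", share_url)
    if match is None:
        raise ValueError(f"no file id found in {share_url!r}")
    file_id = match.group(1) or match.group(2)
    return f"https://drive.google.com/uc?export=download&id={file_id}"


share_url = "https://drive.google.com/file/d/1rwg8c2GmtqLeGGv1xm9w6kS98iqgd6vW/view?usp=sharing"
download_url = drive_share_to_download_url(share_url)
print(download_url)

# Loading would then be (requires network access, so left commented out):
# import pandas as pd
# df4 = pd.read_csv(download_url)
```

If the rewritten link still returns HTML, downloading the file in a browser and reading the local copy, as the answerer did, remains the safe fallback.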