Python WordCloud not removing stop words

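I'm trying to build a word cloud that automatically pulls the words from a job description. If you pass stopwords=None, WordCloud is supposed to remove the words in its built-in stop-word list, but my program doesn't. I believe this may be related to how I'm pulling the job description with BeautifulSoup. I need help: either I should extract the words a different way with BeautifulSoup, or I'm not using stopwords correctly.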

import requests
# pip install bs4
from bs4 import BeautifulSoup
# pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Go to the job description
url = "https://career.benteler.jobs/job/Paderborn-Head-of-Finance-&-Controlling-North-America-NW/604307901/?locale=en_US"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')

# Read through all the words in the BeautifulSoup text
combinedWords = ''
for words in soup.find_all('span'):
    separatedWords = words.text.split(' ')
    combinedWords += " ".join(separatedWords) + ' '

# Create the word cloud
resumeCloud = WordCloud(stopwords=None, background_color='white', max_words=75, max_font_size=75, random_state=1).generate(combinedWords)
plt.figure(figsize=(8, 4))
plt.imshow(resumeCloud)
plt.axis("off")
plt.show()

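The main issue is that all the code is in one block. Try splitting the logic into functions and testing each piece separately. The request is also not checked for errors (for example, the server could be unavailable), although that shouldn't be the problem right now.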

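BeautifulSoup is extracting every span element on the page, which means the menu and footer text are included too. If you only want the job description, you probably need to select the span with the class name jobdescription. After that, you can call .text to strip out the HTML. I'm not sure whether other things, such as commas and full stops, also need to be removed.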
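To illustrate the difference between collecting every span and targeting one class, here is a minimal sketch. The HTML fragment is invented for the example (it is not the real job page), and it assumes the description lives in a single span with class jobdescription, as suggested above.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: a menu, the job description, and a footer.
html = """
<nav><span>Home</span> <span>Search Jobs</span></nav>
<span class="jobdescription">Lead the finance team. Own budgeting and reporting.</span>
<footer><span>Privacy Policy</span></footer>
"""

soup = BeautifulSoup(html, "html.parser")

# Collecting every span also picks up the menu and footer text.
all_spans = " ".join(span.get_text() for span in soup.find_all("span"))

# Targeting the class keeps only the description; guard against a missing element.
target = soup.find("span", class_="jobdescription")
description = target.get_text(" ", strip=True) if target is not None else ""

print(all_spans)
print(description)
```

The None check matters on a real page: if the class name differs, find returns None and calling .text on it would raise an AttributeError.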

I don't have any experience with WordCloud. However, the code below returns what looks like a reasonable result.

import requests
from bs4 import BeautifulSoup
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def get_job_html(url):
    response = requests.get(url)
    response.raise_for_status() # check for 4xx & 5xx errors
    return response.text

def extract_combined_words(html):
    soup = BeautifulSoup(html, 'html.parser')
    job_description = soup.find("span", {"class": "jobdescription"}).text.replace('\n', ' ') # Target span with class jobdescription. text will strip out html.
    print(job_description) # TODO - Check this is the results you expect?
    return job_description

def create_resume_cloud(combinedWords):
    return WordCloud(stopwords=None, background_color='white', max_words=75, max_font_size=75, random_state=1).generate(combinedWords)

def plot_resume_cloud(resumeCloud):
    plt.figure(figsize=(8, 4))
    plt.imshow(resumeCloud)
    plt.axis('off')
    plt.show()

def run(url):
    html = get_job_html(url)
    combinedWords = extract_combined_words(html)
    resumeCloud = create_resume_cloud(combinedWords)
    plot_resume_cloud(resumeCloud) # displays the figure; plot_resume_cloud returns nothing, so there is no result to pass on

if __name__ == '__main__':
    run("https://career.benteler.jobs/job/Paderborn-Head-of-Finance-&-Controlling-North-America-NW/604307901/?locale=en_US")

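Does this answer your question?

@barny's duplicates, the second one definitely helped. Setting collocations=False worked. Thanks.

This is exactly where I wanted to clean up the data. Also, someone else gave me the WordCloud solution. Thanks.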