Python: web scraping from a specific non-unique tag

Tags: python, web-scraping, beautifulsoup

My web page contains the following data:

Output:

<span class="label">Thesis Note: </span>
<span class="label">Bibliography/Index: </span>
<span class="label">Abstract: </span>

You can use .next_sibling

Ex:

html = """<span class="results_summary"><span class="label">Thesis Note: </span>For the degree of Executive Master in Consulting and Coaching for Change, XXXX, February 2018</span>    
<span class="results_summary"><span class="label">Bibliography/Index: </span>Includes bibliographical references</span><span class="results_summary"><span class="label">Abstract: </span>In today’s “attention economy”, self-awareness, ability to regulate one’s emotions, having the
    negative capability, improved focus and clarity of mind for better decision making stand out
    as crucial traits for effective leadership.
    Despite the scientific findings re-affirming the positive impact of the regular practice of
    mindfulness meditation on effectiveness, take-up rate of the concept for formal leadership &amp;
    talent development programs has been slow. What’s novel in this study is to experiment and
    explore possible underlying reasons for that and articulate on the viability of mindfulness rollout
    programs in leadership development context.
</span>"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
# Walk every label span and print the text node that follows the "Abstract:" label
for span in soup.find_all('span', {'class': 'label'}):
    if "Abstract:" in span.text:
        print(span.next_sibling)
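
If you need more than one field, the same .next_sibling idea can be wrapped in a small helper that collects every label into a dict. This is only a sketch: the function name labelled_fields is made up for illustration, the sample HTML is a shortened version of the one above, and it assumes each label span is directly followed by a plain text node.

```python
from bs4 import BeautifulSoup

# Shortened sample (assumption: same structure as the page in the question)
html = (
    '<span class="results_summary"><span class="label">Thesis Note: </span>'
    'For the degree of Executive Master in Consulting and Coaching for Change, '
    'XXXX, February 2018</span>\n'
    '<span class="results_summary"><span class="label">Bibliography/Index: </span>'
    'Includes bibliographical references</span>'
    '<span class="results_summary"><span class="label">Abstract: </span>'
    'In today\'s attention economy ...\n</span>'
)


def labelled_fields(markup):
    """Map each label (without its trailing colon) to the text that follows it."""
    soup = BeautifulSoup(markup, "html.parser")
    fields = {}
    for label in soup.find_all("span", class_="label"):
        key = label.get_text(strip=True).rstrip(":")
        value = label.next_sibling
        # next_sibling is None when nothing follows the label span
        fields[key] = str(value).strip() if value is not None else ""
    return fields


fields = labelled_fields(html)
print(fields["Abstract"])
```

Looking a value up by key (fields["Abstract"], fields["Thesis Note"]) then avoids repeating the loop for every label you care about.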

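You can use a regular expression instead of Beautiful Soup:

import re

# Non-greedy match: stop at the first closing </span> after the label
result = re.findall(r'<span class="label">Abstract: </span>([\s\S]*?)</span>', html_text)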

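This assumes that <span class="label">Abstract: </span> occurs only once in your html_text. If that is not the case, find a unique pattern to retrieve the data you need.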
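
A minimal sketch of how the regex approach could be parameterised by label, using only the standard library. The helper name text_after_label and the shortened sample string are assumptions for illustration. The capture is non-greedy, so it stops at the first closing </span>; that only works while the value itself contains no nested span.

```python
import html as html_lib
import re


def text_after_label(html_text, label):
    """Return the text following <span class="label">label</span>, or None."""
    pattern = (
        r'<span class="label">\s*' + re.escape(label) + r'\s*</span>'
        r'(.*?)</span>'  # non-greedy: stop at the first closing span
    )
    match = re.search(pattern, html_text, flags=re.DOTALL)
    if match is None:
        return None
    # Decode entities such as &amp; and trim surrounding whitespace
    return html_lib.unescape(match.group(1)).strip()


# Shortened sample (assumption: raw HTML string, not a BeautifulSoup object)
html_text = (
    '<span class="results_summary"><span class="label">Bibliography/Index: </span>'
    'Includes bibliographical references</span>'
    '<span class="results_summary"><span class="label">Abstract: </span>'
    'formal leadership &amp;\n    talent development programs has been slow.\n</span>'
)

abstract = text_after_label(html_text, "Abstract:")
print(abstract)
```

re.escape keeps characters like the slash in "Bibliography/Index:" from being read as regex syntax, so the same helper can be reused for the other labels.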

val = """<span class="results_summary"><span class="label">Thesis Note: </span>For the degree of Executive Master in Consulting and Coaching for Change, XXXX, February 2018</span>    \n<span class="results_summary"><span class="label">Bibliography/Index: </span>Includes bibliographical references</span><span class="results_summary"><span class="label">Abstract: </span>In todays attention economy, self-awareness, ability to regulate ones emotions, having the\n    negative capability, improved focus and clarity of mind for better decision making stand out\n    as crucial traits for effective leadership.\n    Despite the scientific findings re-affirming the positive impact of the regular practice of\n    mindfulness meditation on effectiveness, take-up rate of the concept for formal leadership &amp;\n    talent development programs has been slow. Whats novel in this study is to experiment and\n    explore possible underlying reasons for that and articulate on the viability of mindfulness rollout\n    programs in leadership development context.\n</span>"""
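from bs4 import BeautifulSoup  # the "html5lib" parser also needs the html5lib package installed

# Take the third results_summary span, i.e. the one holding the abstract
soup = BeautifulSoup(val, "html5lib").find_all('span', {'class': 'results_summary'})[2]

# Unwrap the nested label span so only its text remains
for span in soup.find_all('span'):
    span.unwrap()
print(soup.decode_contents())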

Your RegEx solution looks scalable, since it can be adapted to other tags. But when I run it I get an error: "TypeError: expected string or bytes-like object". I passed html_text as a BeautifulSoup type. Please help.

Here html_text is the raw HTML text retrieved from the crawl response. Also, re.findall() cannot parse a BeautifulSoup object, so there is no need to convert the raw HTML to a BeautifulSoup type.
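
To make the TypeError from the comments concrete, here is a small sketch. The one-line sample HTML is invented for illustration. It shows re.findall() failing on a BeautifulSoup object and working on the raw string, or on str(soup) if a soup object is all you have.

```python
import re

from bs4 import BeautifulSoup

# Invented one-line sample standing in for the raw crawl response
html_text = (
    '<span class="results_summary"><span class="label">Abstract: </span>'
    'Some abstract text</span>'
)
soup = BeautifulSoup(html_text, "html.parser")
pattern = r'<span class="label">Abstract: </span>([\s\S]*?)</span>'

error = None
try:
    re.findall(pattern, soup)  # a BeautifulSoup object is not str or bytes
except TypeError as exc:
    error = str(exc)
print(error)

from_raw = re.findall(pattern, html_text)   # raw HTML string: works
from_soup = re.findall(pattern, str(soup))  # serialise the soup back to a string first
print(from_raw, from_soup)
```

Note that str(soup) re-serialises the parsed tree, so on messy real-world pages its markup may differ slightly from the original response; matching against the raw text is the safer choice when it is available.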