Python: scraping text that is not contained in any element

Tags: python, beautifulsoup

I am scraping a poorly written website with Beautiful Soup 4. I have everything except the user's email address, which is not wrapped in any element that would distinguish it. Any idea how to scrape it? As I expected, the next_sibling of the strong element skips over it.

<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
  <div class="field-items">


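I'm not sure this is the best way, but you can get the parent element, then iterate over its children and look at the ones that are not tags: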

from bs4 import BeautifulSoup
import bs4

html='''
<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
  <div class="field-items">
'''


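def print_if_email(s):
    # Crude heuristic: treat any string containing '@' as an email address
    if '@' in s:
        print(s)

# Name the parser explicitly so the result doesn't depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')

# Iterate over all divs, you could narrow this down if you had more information
for div in soup.find_all('div'):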
    # Iterate over the children of each matching div
    for c in div.children:
        # If it wasn't parsed as a tag, it may be a NavigableString
        if isinstance(c, bs4.element.NavigableString):
            # Some heuristic to identify email addresses if other non-tags exist
            print_if_email(c.strip())

Of course, the inner for loop and the if statement can be combined as:

for c in filter(lambda c: isinstance(c, bs4.element.NavigableString), div.children):
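
For reference, here is the same idea as a minimal, self-contained sketch that returns the matches instead of printing them. The function name `loose_text_with_at` and the `SAMPLE_HTML` constant are made up for illustration, and the `'@' in text` check is only a crude heuristic, not real email validation.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# The snippet from the question (the inner divs are left unclosed, as in the original)
SAMPLE_HTML = '''
<div class="fieldset-wrapper">
 <strong>
  E-mail address:
 </strong>
 useremail@yahoo.com
 <div class="field field-name-ds-user-picture field-type-ds field-label-hidden">
  <div class="field-items">
'''


def loose_text_with_at(html):
    """Return stripped text nodes that sit directly inside a div and contain '@'."""
    soup = BeautifulSoup(html, 'html.parser')
    found = []
    for div in soup.find_all('div'):
        # .children yields direct children only: tags and bare text nodes
        for child in div.children:
            if isinstance(child, NavigableString):
                text = child.strip()
                if '@' in text:
                    found.append(text)
    return found


print(loose_text_with_at(SAMPLE_HTML))
```

If the real page has other loose text containing '@', you would want to narrow the `find_all('div')` call, for example to the `fieldset-wrapper` class.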

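I can't answer your question directly, since I have never used Beautiful Soup (so don't accept this answer!), but I just wanted to point out that if the pages are all very simple, another option might be to use .split()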

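This is pretty clunky, but worth considering if the pages are simple/predictable.

That is, if you know something about the overall layout of the page (e.g., the user's email is always the first email mentioned), you can write your own parser that looks for the bits before and after the @ sign.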

# html = the entire document as a string

# return the entire document up to the '@' sign
bit_before_at_sign = html.split('@')[0]
# only useful if you know first email is the one you care about

# you could then cut out everything before username with something like this
b = bit_before_at_sign
# a very long string, we just want the last bit right before the @ sign
username = b.split(' ')[-1].split('\n')[-1].split('\r')[-1].split(';')[-1]
# add more if required, depending on how the html looks to you 
# (I've just guessed some html elements that might precede the username)

# you could similarly parse the bit after the @ sign, 
# html.split('@')[1]  
# e.g., checking the first few characters of this
# against a known list of .tlds like '.com', '.co.uk', etc  
# (remember some TLDs have more than one period, so don't just parse by '.')
# and combine with the username you already know
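
Putting the pieces above together, a rough sketch of the whole .split() approach could look like this. The function name `first_email_by_split` and the separator lists are my own guesses, not anything from a library; it only looks at the first '@' in the document, so it assumes that first address is the one you want.

```python
def first_email_by_split(html):
    """Return the text around the first '@' as 'user@domain', or None."""
    if '@' not in html:
        return None

    # Split once, at the first '@' only
    before, after = html.split('@', 1)

    # Keep only what follows the last separator before the '@'
    username = before
    for sep in (' ', '\n', '\r', '\t', ';', '>'):
        username = username.split(sep)[-1]

    # Keep only what precedes the first separator after the '@'
    domain = after
    for sep in (' ', '\n', '\r', '\t', '<'):
        domain = domain.split(sep)[0]

    # An '@' with nothing usable on one side is not an address
    if not username or not domain:
        return None
    return username + '@' + domain


sample = '<strong>\n  E-mail address:\n </strong>\n useremail@yahoo.com\n <div class="field">'
print(first_email_by_split(sample))
```

The separators are just the characters I would expect next to an address in this kind of page; add more depending on how your HTML looks.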

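Also, if you want to narrow the document down, you have options like the following:

If you want to make sure the word "e-mail" is also in the string being parsed:

if 'email' in b.lower() or 'e-mail' in b.lower():
    # do something...

Check where the @ sign first appears in the document: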

html.index('@')
# e.g., if you want to see how near this '@' symbol is to some other element you know about 
# such as the word 'e-mail', or a particular div element or '</strong>'
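
As an illustration of that "how near" check, here is a small sketch that measures the gap between a label and the next '@'. The name `gap_after_label` is hypothetical, and it uses str.find() rather than str.index() so that a missing label gives None instead of raising ValueError.

```python
def gap_after_label(html, label='e-mail'):
    """Number of characters between the end of the first `label` and the next '@'.

    Returns None if either one is missing. The match is case-insensitive and
    assumes lowercasing does not change the string length (true for ASCII).
    """
    label_at = html.lower().find(label)
    if label_at == -1:
        return None
    label_end = label_at + len(label)

    # Only look for an '@' that comes after the label
    at = html.find('@', label_end)
    if at == -1:
        return None
    return at - label_end


print(gap_after_label('E-mail: a@b.com'))
```

A small gap suggests the '@' belongs to the labelled address; what counts as "small" depends on your pages.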

To limit your search for the email to the 300 characters before/after some other element you know about:

startfrom = html.index('</strong>')
html_i_will_search = html[startfrom:startfrom+300]
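
A sketch of how that window could then be searched, under the assumption that the address is a whitespace-delimited token after the marker. The function name `email_near_marker` and its defaults are made up for this example. Unlike the two lines above, it starts the window just past the marker and uses str.find(), so a missing marker returns None instead of raising.

```python
def email_near_marker(html, marker='</strong>', window=300):
    """Return the first token containing '@' within `window` chars after `marker`."""
    start = html.find(marker)
    if start == -1:
        return None

    # Begin right after the marker, and look no further than `window` characters
    start += len(marker)
    for token in html[start:start + window].split():
        if '@' in token:
            return token
    return None


page = '<strong>\n  E-mail address:\n </strong>\n useremail@yahoo.com\n <div class="field">'
print(email_near_marker(page))
```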

I imagine a few more minutes on Google would probably be useful; your task doesn't sound unusual :)

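Make sure you account for the case where there are multiple email addresses on the page (e.g., so you don't end up assigning support@site.com to every single user!)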
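
One crude way to handle that case is to collect every candidate on the page and drop the ones you know are site-wide. This is only a sketch: `candidate_emails` and the `ignore` list are names I made up, and the tokenizing is deliberately naive (it will be confused by things like mailto: links inside attributes).

```python
def candidate_emails(html, ignore=('support@site.com',)):
    """Return unique tokens containing '@', in page order, minus known site-wide ones."""
    # Treat tag brackets as whitespace so tags and text fall into separate tokens
    tokens = html.replace('<', ' ').replace('>', ' ').split()
    found = []
    for token in tokens:
        # Trim punctuation that tends to cling to addresses in running text
        token = token.strip('.,;:()"\'')
        if '@' in token and token.lower() not in ignore and token not in found:
            found.append(token)
    return found


page = ('<p>Contact support@site.com</p>'
        '<div>User: jane@example.org, or jane@example.org</div>')
print(candidate_emails(page))
```

If more than one candidate survives, that is a sign you need more context (a label, a nearby element) rather than a guess.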


If in doubt, whatever method you use, it's worth checking your answer with email.utils.parseaddr() or someone else's regex checker.
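
A minimal sketch of such a sanity check with the standard library. Note that parseaddr() only parses, it does not validate, so the extra conditions here (exactly one '@', a non-empty local part, a dot in the domain) are my own assumptions about what a plausible address looks like; `looks_like_email` is a hypothetical helper name.

```python
from email.utils import parseaddr


def looks_like_email(candidate):
    """Loose plausibility check for a scraped string; not full RFC validation."""
    # parseaddr returns a (display name, address) pair
    _name, addr = parseaddr(candidate)

    # The scraped string should be nothing but the address itself
    if addr != candidate or addr.count('@') != 1:
        return False

    local, _, domain = addr.rpartition('@')
    return bool(local) and '.' in domain


for scraped in ('useremail@yahoo.com', 'not an email', 'a@b'):
    print(scraped, looks_like_email(scraped))
```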

Please include the code you are using to scrape the HTML.

I don't know why, but suggesting regular expressions for parsing HTML is the fastest way to get the Stack Overflow crowd to attack your house with mechanized hornets. Beware!