Python duplicates in a web crawler


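I'm trying to build a web crawler that pulls specific values from a page. These values can be updated, and I don't want the previous values to show up in my output.

Here is a simplified example of my problem: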

html_example=''' 
<value> this is the updated value 
Keyword "previous" that tell me I don't want the next value. 
<valueIdontwant> this is the previous value
<value> this value has not been updated
<value> this is the updated value 
Keyword "previous" that tell me I don't want the next value. 
<valueIdontwant> this is the previous value
<value> this value has not been updated 
'''

The output I would like to get:

['value', 'value', 'value', 'value']

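The only way I can identify the values I want to ignore is the keyword "previous", not the values themselves, because the values differ from one entry to the next. Code that filters on the value itself therefore does not work in my case.

I'm fairly new to programming and not very good at it yet. I have tried various if statements without success. If you have any idea how to solve this, thank you in advance.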

The code below is convoluted and not very Pythonic, but if you want index-based access while looping over a list, look up enumerate.

import re

def get_values_ignore_current_line(content, keyword):
    # Drop every line that contains the keyword, then collect the tags
    # (angle brackets included, e.g. '<value>') from the remaining lines.
    content = '\n'.join([x for x in content.splitlines() if keyword not in x])
    tags = re.findall(r'<.*?>', content)
    return tags

def get_values_ignore_next_line(content, keyword):
    lines = content.splitlines()
    new_content = [lines[0]]
    for i, line in enumerate(lines):
        # Keep the line after line i unless line i is an untagged line
        # that contains the keyword.
        if (keyword not in line) or (re.match(r'<.*?>', line) is not None):
            if i < len(lines) - 1:
                new_content.append(lines[i+1])
    new_content = '\n'.join(new_content)
    return re.findall(r'<.*?>', new_content)
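
For comparison, here is a minimal sketch of the same idea written with a flag instead of index arithmetic. It is not the answerer's code: the function name values_skipping_after_keyword and the capture group that strips the angle brackets are assumptions made for illustration, and it assumes every tag sits at the start of its own line, as in the simplified example.

```python
import re

def values_skipping_after_keyword(content, keyword="previous"):
    """Return tag names, skipping the first tagged line after each keyword line."""
    values = []
    skip_next = False
    for line in content.splitlines():
        match = re.match(r"\s*<(.*?)>", line)
        if match is None:
            # Untagged line: if it holds the keyword, the next tagged line is stale.
            if keyword in line:
                skip_next = True
            continue
        if skip_next:
            # This tagged line follows a keyword line, so ignore it once.
            skip_next = False
            continue
        values.append(match.group(1))
    return values

html_example = '''
<value> this is the updated value
Keyword "previous" that tell me I don't want the next value.
<valueIdontwant> this is the previous value
<value> this value has not been updated
'''

print(values_skipping_after_keyword(html_example))  # ['value', 'value']
```

Because the flag is only set by untagged lines, a tagged line whose own text happens to contain the keyword is not mistaken for a marker.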

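dict and set are data structures that can help you here: they store unique entries (unique keys for a dict, unique values for a set), and the in operator has O(1) lookup cost on them on average. See the documentation.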
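
As a rough illustration of that suggestion, the sketch below uses a set to remember what has already been seen while keeping the first occurrence of each item in order. The answer does not say how it would be applied to the crawler, so the function name unique_in_order and the sample strings are hypothetical.

```python
def unique_in_order(values):
    """Return the items of values without repeats, preserving first-seen order."""
    seen = set()      # membership tests on a set are O(1) on average
    result = []
    for value in values:
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result

# Hypothetical scraped texts; the repeated entry is dropped.
scraped = ["price: 10", "price: 12", "price: 10", "stock: 3"]
print(unique_in_order(scraped))  # ['price: 10', 'price: 12', 'stock: 3']
```

Note that this removes repeated values only. It does not by itself tell an updated value from a previous one, which in the question is signalled by the keyword.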
