Regex works in pythex but doesn't work in Python


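I'm working on an assignment in which I need to scrape information from a live website.

For this I am using the page fetched in the code below, and need to scrape each game's title, price, and image source. The titles work, but the price and image patterns just return empty lists, even though pythex returns the correct matches for the same patterns.

Here is my code: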

from re import findall, finditer, MULTILINE, DOTALL
from urllib.request import urlopen

game_html_source = urlopen\
('https://www.nintendo.com/games/nintendo-switch-bestsellers').\
read().decode("UTF-8")

# game titles - working
game_title = findall(r'<h3 class="b3">([A-Z a-z:0-9]+)</h3>', game_html_source)
print(game_title)

# game prices - returning empty list
game_prices = findall(r'<p class="b3 row-price">(\$[.0-9]+)</p>', game_html_source)
print(game_prices)

# game images - returning empty list
game_images = findall(r'<img alt="[A-Z a-z:]+" src=("https://media.nintendo.com/nintendo/bin/[A-Za-z0-9-\/_]+.png")>',game_html_source)
print(game_images)


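Parsing HTML with regular expressions has too many pitfalls to be reliable. BeautifulSoup and other HTML parsers work by building a data structure for the whole document, which you then navigate to extract the parts you care about. This is thorough and comprehensive, but if there is malformed HTML anywhere in the source, even in a part you don't care about, it can break the parsing process. Pyparsing takes a middle road: you define mini-parsers that match only the bits you want and skip over everything else (which also simplifies navigating the results afterwards). To cope with some of the variation in HTML styles, pyparsing provides a function, makeHTMLTags, which returns a pair of pyparsing expressions for the opening and closing tags: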

foo_start, foo_end = pp.makeHTMLTags('foo')

foo_start will match:

<foo>
<foo/>
<foo class='bar'>
<foo href=something_not_in_quotes>
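To see this for yourself, here is a minimal sketch that feeds those four variants to the opening-tag expression and reads back the attributes. It assumes pyparsing is installed; the variable names variants and parsed are just for illustration.

```python
import pyparsing as pp

foo_start, foo_end = pp.makeHTMLTags("foo")

# The four opening-tag styles listed above
variants = [
    "<foo>",
    "<foo/>",
    "<foo class='bar'>",
    "<foo href=something_not_in_quotes>",
]

# parseString raises pp.ParseException if a variant does not match
parsed = [foo_start.parseString(text) for text in variants]

for text, result in zip(variants, parsed):
    # Attributes are available by name; .get() returns None when absent
    print(text, "-> class:", result.get("class"), "href:", result.get("href"))

# The closing expression matches the end tag
closing = foo_end.parseString("</foo>")
print(closing[0])
```

Attribute values come back with any surrounding quotes already removed, so the conditions in a scraper can compare against plain strings.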

For your Nintendo page scraper, see the annotated source below:

import pyparsing as pp

# define expressions to match opening and closing tags <h3>
h3, h3_end = pp.makeHTMLTags("h3")

# define a specific type of <h3> tag that has the desired 'class' attribute
h3_b3 = h3().addCondition(lambda t: t['class'] == "b3")

# similar for <p>
p, p_end = pp.makeHTMLTags("p")
p_b3_row_price = p().addCondition(lambda t: t['class'] == "b3 row-price")

# similar for <img>
img, _ = pp.makeHTMLTags("img")
img_expr = img().addCondition(lambda t: t.src.startswith("//media.nintendo.com/nintendo/bin"))

# define expressions to capture tag body for title and price - include negative lookahead for '<' so that
# tags with embedded tags are not matched
LT = pp.Literal('<')
title_expr = h3_b3 + ~LT + pp.SkipTo(h3_end)('title') + h3_end
price_expr = p_b3_row_price + ~LT + pp.SkipTo(p_end)('price') + p_end

# compose a scanner expression by '|'ing the 3 sub-expressions into one
scanner = title_expr | price_expr | img_expr

# not shown - read web page into variable 'html'

# use searchString to search through the retrieved HTML for matches
for match in scanner.searchString(html):
    if 'title' in match:
        print("Title:", match.title)
    elif 'price' in match:
        print("Price:", match.price)
    elif 'src' in match:
        print("Img src:", match.src)
    else:
        print("???", match.dump())
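If you want to try the scanner without hitting the network, something like the following should work. It is only a sketch: sample_html is made-up markup loosely modeled on the class names used above, not the real page, and the three result lists are names I chose for the example. I also use t.get("class") in the conditions here, on the assumption that some tags on a page may have no class attribute at all.

```python
import pyparsing as pp

# Made-up markup for illustration; the real page's structure may differ.
sample_html = """
<img alt="Example Game" src="//media.nintendo.com/nintendo/bin/abc123/example.png">
<h3 class="b3">Example Game: Deluxe</h3>
<p class="b3 row-price">$19.99</p>
<h3>Heading without a class</h3>
<p class="b3 row-price"><span>nested markup</span></p>
"""

h3, h3_end = pp.makeHTMLTags("h3")
p, p_end = pp.makeHTMLTags("p")
img, _ = pp.makeHTMLTags("img")

# .get() returns None for a missing attribute, so such tags just fail the condition
h3_b3 = h3().addCondition(lambda t: t.get("class") == "b3")
p_price = p().addCondition(lambda t: t.get("class") == "b3 row-price")
img_expr = img().addCondition(
    lambda t: t.src.startswith("//media.nintendo.com/nintendo/bin")
)

# Negative lookahead for '<' rejects bodies that begin with an embedded tag
LT = pp.Literal("<")
title_expr = h3_b3 + ~LT + pp.SkipTo(h3_end)("title") + h3_end
price_expr = p_price + ~LT + pp.SkipTo(p_end)("price") + p_end
scanner = title_expr | price_expr | img_expr

titles, prices, images = [], [], []
for match in scanner.searchString(sample_html):
    if "title" in match:
        titles.append(match.title)
    elif "price" in match:
        prices.append(match.price)
    elif "src" in match:
        images.append(match.src)

print(titles)
print(prices)
print(images)
```

The heading with no class and the price paragraph that starts with a nested tag are both skipped, which is the point of the conditions and the negative lookahead.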

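Comments:

"Works in python but doesn't work in python"? Is this some kind of Taoist metaphysical question?
Don't parse HTML with regex.
@AlexHall, you can't say that without citing it :)
pythex* sorry, the regex works in pythex but doesn't work in python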

Output of the scanner code above:

Img src: //media.nintendo.com/nintendo/bin/SF6LoN-xgX1iT617eWfBrNcWH6RQXnSh/I_IRYaBzJ61i-3hnYt_k7hVxHtqGmM_w.png
Title: Hyrule Warriors: Definitive Edition
Price: $59.99
Img src: //media.nintendo.com/nintendo/bin/wcfCyAd7t2N78FkGvEwCOGzVFBNQRbhy/AvG-_d4kEvEplp0mJoUew8IAg71YQveM.png
Title: Donkey Kong Country: Tropical Freeze
Price: $59.99
Img src: //media.nintendo.com/nintendo/bin/QKPpE587ZIA5fUhUL4nSbH3c_PpXYojl/J_Wd79pnFLX1NQISxouLGp636sdewhMS.png
Title: Wizard of Legend
Price: $15.99
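Note that the Img src values printed here start with //media.nintendo.com rather than https://media.nintendo.com. The image pattern in the question requires a literal https: prefix, so that alone would be enough for findall to return an empty list on the source that urlopen retrieves. Why pythex showed matches is a guess on my part; possibly the text pasted there was not identical to the downloaded source. Here is a small sketch on a made-up img tag (sample, strict and relaxed are illustrative names; strict is the question's pattern, relaxed is my variant with an optional scheme and an escaped dot):

```python
import re

# Hypothetical tag with a protocol-relative src, like the values printed above
sample = (
    '<img alt="Example Game" '
    'src="//media.nintendo.com/nintendo/bin/abc123/Ex-ample_1.png">'
)

# Pattern from the question: insists on "https:" before the slashes
strict = (
    r'<img alt="[A-Z a-z:]+" '
    r'src=("https://media.nintendo.com/nintendo/bin/[A-Za-z0-9-\/_]+.png")>'
)

# Variant: scheme is optional, dots are escaped, quotes are left out of the group
relaxed = (
    r'<img alt="[A-Z a-z:]+" '
    r'src="((?:https?:)?//media\.nintendo\.com/nintendo/bin/[A-Za-z0-9/_-]+\.png)"'
)

strict_hits = re.findall(strict, sample)
relaxed_hits = re.findall(relaxed, sample)

print(strict_hits)
print(relaxed_hits)
```

This still depends on the attribute order and quoting staying exactly as written, which is the fragility the pyparsing approach is meant to avoid.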