Python: trouble scraping a responsive Bootstrap table


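I am trying to scrape a table from a web page (the URL is in the code below), and it is proving to be a problem. When I request the site through requests or urllib, I only get the first 10 rows of the table, even though in the browser I see all rows by default.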

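The problem is that, because it is a Bootstrap table, the pagination does not show up in the URL. Has anyone managed to crack these tables? My code is as follows: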

Using urllib:

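from io import BytesIO
from urllib.request import Request, urlopen

import pandas as pd

# Send a browser-like User-Agent header with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
reg_url = "https://www.pepperscale.com/hot-pepper-list/"
req = Request(url=reg_url, headers=headers)
html = urlopen(req).read()

# Parse the tables in the static HTML; only the first 10 rows are present
pepscrap = pd.read_html(BytesIO(html))
print(pepscrap[0])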

Using BS4 (not finished, because I only ever saw 10 rows)


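You are better off using the same API that populates the Bootstrap table in the first place. I used Google Chrome's network logger and filtered it to show only the XHR (XmlHttpRequest) resources the browser requested. By inspecting the contents of the filtered resources, I determined that the browser fetches the table data from the API URL used in the code below.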

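For whatever reason, this one requires a POST (not a GET) request; every API is different. In the "Headers" tab, under the request body, you can see the form parameters the API expects, and they appear to matter. I did not try to work out which are critical and which are optional, so I simply built my POST request with the same form parameters the browser sends.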
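The per-column form fields follow a regular pattern, so they can be generated in a loop instead of typed out. The following is only a sketch: `build_datatables_payload` is a hypothetical helper (not part of requests or of the site's API), and the column names and nonce are the values from the captured browser request. It marks every column as searchable and orderable, which differs from what the browser sent for some columns; whether the server cares has not been tested.

```python
def build_datatables_payload(column_names, nonce, order_column=0, length=-1):
    """Build the form fields for a DataTables-style server-side request.

    Hypothetical helper for illustration only.
    """
    payload = {"draw": "1"}
    for index, name in enumerate(column_names):
        prefix = f"columns[{index}]"
        payload[f"{prefix}[data]"] = str(index)
        payload[f"{prefix}[name]"] = name
        # Assumption: uniform flags are acceptable; the browser's request
        # set some of these to "false".
        payload[f"{prefix}[searchable]"] = "true"
        payload[f"{prefix}[orderable]"] = "true"
        payload[f"{prefix}[search][value]"] = ""
        payload[f"{prefix}[search][regex]"] = "false"
    payload.update({
        "order[0][column]": str(order_column),
        "order[0][dir]": "asc",
        "start": "0",
        "length": str(length),  # -1 asks for all rows instead of one page
        "search[value]": "",
        "search[regex]": "false",
        "wdtNonce": nonce,
    })
    return payload


column_names = ["wdt_ID", "heat", "image", "hotpepper", "minshu", "maxshu",
                "formula_1", "formula_2", "jalrp", "type", "origin", "use",
                "flavor"]
payload = build_datatables_payload(column_names, nonce="2f82d8936d",
                                   order_column=5)
print(len(payload))  # 86 form fields
```

The resulting dictionary can be passed as `data=payload` to `requests.post`. If the server rejects it, falling back to the exact field values the browser sent is the safer choice.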

Sending that POST request yields a nice JSON response that is trivial to parse:

import requests

url = "https://www.pepperscale.com/wp-admin/admin-ajax.php?action=get_wdtable&table_id=5"

data = {
    "draw": "1",
    "columns[0][data]": "0",
    "columns[0][name]": "wdt_ID",
    "columns[0][searchable]": "true",
    "columns[0][orderable]": "true",
    "columns[0][search][value]": "",
    "columns[0][search][regex]": "false",
    "columns[1][data]": "1",
    "columns[1][name]": "heat",
    "columns[1][searchable]": "true",
    "columns[1][orderable]": "true",
    "columns[1][search][value]": "",
    "columns[1][search][regex]": "false",
    "columns[2][data]": "2",
    "columns[2][name]": "image",
    "columns[2][searchable]": "true",
    "columns[2][orderable]": "false",
    "columns[2][search][value]": "",
    "columns[2][search][regex]": "false",
    "columns[3][data]": "3",
    "columns[3][name]": "hotpepper",
    "columns[3][searchable]": "true",
    "columns[3][orderable]": "true",
    "columns[3][search][value]": "",
    "columns[3][search][regex]": "false",
    "columns[4][data]": "4",
    "columns[4][name]": "minshu",
    "columns[4][searchable]": "true",
    "columns[4][orderable]": "true",
    "columns[4][search][value]": "",
    "columns[4][search][regex]": "false",
    "columns[5][data]": "5",
    "columns[5][name]": "maxshu",
    "columns[5][searchable]": "true",
    "columns[5][orderable]": "false",
    "columns[5][search][value]": "",
    "columns[5][search][regex]": "false",
    "columns[6][data]": "6",
    "columns[6][name]": "formula_1",
    "columns[6][searchable]": "false",
    "columns[6][orderable]": "false",
    "columns[6][search][value]": "",
    "columns[6][search][regex]": "false",
    "columns[7][data]": "7",
    "columns[7][name]": "formula_2",
    "columns[7][searchable]": "false",
    "columns[7][orderable]": "false",
    "columns[7][search][value]": "",
    "columns[7][search][regex]": "false",
    "columns[8][data]": "8",
    "columns[8][name]": "jalrp",
    "columns[8][searchable]": "true",
    "columns[8][orderable]": "false",
    "columns[8][search][value]": "",
    "columns[8][search][regex]": "false",
    "columns[9][data]": "9",
    "columns[9][name]": "type",
    "columns[9][searchable]": "true",
    "columns[9][orderable]": "true",
    "columns[9][search][value]": "",
    "columns[9][search][regex]": "false",
    "columns[10][data]": "10",
    "columns[10][name]": "origin",
    "columns[10][searchable]": "true",
    "columns[10][orderable]": "false",
    "columns[10][search][value]": "",
    "columns[10][search][regex]": "false",
    "columns[11][data]": "11",
    "columns[11][name]": "use",
    "columns[11][searchable]": "true",
    "columns[11][orderable]": "false",
    "columns[11][search][value]": "",
    "columns[11][search][regex]": "false",
    "columns[12][data]": "12",
    "columns[12][name]": "flavor",
    "columns[12][searchable]": "true",
    "columns[12][orderable]": "false",
    "columns[12][search][value]": "",
    "columns[12][search][regex]": "false",
    "order[0][column]": "5",
    "order[0][dir]": "asc",
    "start": "0",
    "length": "-1",
    "search[value]": "",
    "search[regex]": "false",
    "wdtNonce": "2f82d8936d"
}

response = requests.post(url, data=data)
response.raise_for_status()

peppers = response.json()["data"]

# print out the first pepper information
print(peppers[0])

Output:

['1', 'Mild', "<a href='https://www.pepperscale.com/wp-content/uploads/2015/04/Tasty-Colorbell-Pepper-4-Plants-GreenYellowPurpleRed-0.jpg' target='_blank' rel='lightbox[-1]'><img src='https://www.pepperscale.com/wp-content/uploads/2015/04/Tasty-Colorbell-Pepper-4-Plants-GreenYellowPurpleRed-0-75x75.jpg' /></a>", "<a data-content='Bell Pepper' href='https://www.pepperscale.com/bell-pepper' target='_blank'>Bell Pepper</a>", '0', '0', '0', '0.00', '-8,000 to -2,500', 'annuum', 'Mexico', 'Culinary', 'Bright, Sweet']
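Some cells in each row contain HTML markup rather than plain text. Here is one possible way to clean a row up, using only the standard library. It assumes the values arrive in the same order as the `columns[i][name]` fields of the request; `strip_tags`, `row_to_record` and `COLUMN_NAMES` are names made up for this sketch.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(cell):
    """Return the visible text of a cell, dropping any HTML markup."""
    parser = _TextExtractor()
    parser.feed(cell)
    parser.close()
    return "".join(parser.parts).strip()


# Assumption: row values follow the order of the columns[i][name] fields.
COLUMN_NAMES = ["wdt_ID", "heat", "image", "hotpepper", "minshu", "maxshu",
                "formula_1", "formula_2", "jalrp", "type", "origin", "use",
                "flavor"]


def row_to_record(row):
    """Pair each cell with its column name and strip the markup."""
    return {name: strip_tags(value) for name, value in zip(COLUMN_NAMES, row)}


# The first row from the output above
sample_row = [
    '1', 'Mild',
    "<a href='https://www.pepperscale.com/wp-content/uploads/2015/04/"
    "Tasty-Colorbell-Pepper-4-Plants-GreenYellowPurpleRed-0.jpg' "
    "target='_blank' rel='lightbox[-1]'><img src='https://www.pepperscale.com/"
    "wp-content/uploads/2015/04/"
    "Tasty-Colorbell-Pepper-4-Plants-GreenYellowPurpleRed-0-75x75.jpg' /></a>",
    "<a data-content='Bell Pepper' href='https://www.pepperscale.com/"
    "bell-pepper' target='_blank'>Bell Pepper</a>",
    '0', '0', '0', '0.00', '-8,000 to -2,500', 'annuum', 'Mexico',
    'Culinary', 'Bright, Sweet',
]

record = row_to_record(sample_row)
print(record["hotpepper"])  # Bell Pepper
print(record["heat"])       # Mild
```

A list of such records, one per row of `peppers`, can then be handed to `pandas.DataFrame`. The image column ends up empty because its cell holds only tags and no text.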

The rest of the table is loaded via JavaScript. If you use something like Selenium, you can wait for it to load and then scrape it.