Python 3.x: How do I extract data from data-* attributes?

Tags: python-3.x, regex, web-scraping, beautifulsoup, custom-data-attribute

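I want to write a scraper that pulls a magnet link out of any custom data attribute on any HTML tag. For example, on kickassto.cc the magnet link is not assigned to the href attribute of an anchor tag but to the data-sc-params attribute of a div tag, like this: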

<a data-download rel="nofollow" class="kaGiantButton siteButton iconButton" title="Download verified torrent file" target="_blank" 
href="/torrents/Download Something in the Woods 2016 HDRip XviD AC3-EVO Torrent">
<i class="ka ka-verify"></i>
<span>Download torrent</span></a>
<div data-sc-replace data-sc-slot="_b6f619f42a2411c6688f2273fa3f628a" class="inlineblock" 
data-sc-params="{ 'magnet': 'magnet:?xt=urn:btih:CC75C59E9FE0E8689DFD21558C02E9C9F92AE714&dn=something+in+the+woods+2016+hdrip+xvid+ac3+evo&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce', 'extension': 'avi', 'stream': '' }"></div>

To get the magnet link, I wrote the following code:

import requests
from bs4 import BeautifulSoup
import re

# Collect the magnet links on a page, starting with its <a> tags:

url = input("What is the address of the web page in question? ")
# Here you would enter: https://kickassto.cc/something-in-the-woods-2016-hdrip-xvid-ac3-evo-t12972573.html

response = requests.get(url)

soup = BeautifulSoup(response.text, 'html.parser')

# RE patterns:
magnet1 = re.compile(r"^magnet:\?xt=urn:btih:")
magnet2 = re.compile(r"magnet:\?xt=urn:btih:")
whateverTagOrAttribute = re.compile(r".{1,40}") #That has no more than forty characters
kickass = "data-sc-params"
dataAttribute = re.compile(r"data.{1,30}") # to match "data-whatever..", this whatever is unlikely to be longer than 30 characters in a name of an attribute.

links = soup.find_all("a", attrs={"href": magnet1})

if links == []:
    links = soup.find_all("a", attrs={"href": magnet2}) # ? is a special character, therefore has to be escaped

if links == []:
    links = soup.find_all("div", attrs={"data-sc-params": magnet2}) #kickassto.cc webpages do not place their magnets in a tags, but hide them in divs.
    #links = soup.find_all(whateverTagOrAttribute, attrs={whateverTagOrAttribute: magnet2}) 

if links == []:
    #the following works
    links = soup.find_all(whateverTagOrAttribute, attrs={"data-sc-params": magnet2})
   
if links == []:
    #the following does not work
    links = soup.find_all(whateverTagOrAttribute, attrs={dataAttribute: magnet2})
    
if links != []:
    print(f"The magnet links that we managed to scrape: {links}")

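Ideally the scraper would pick up magnet links from any custom attribute. Unfortunately, I cannot get them by using re.compile(r"data.{1,30}") as the attribute name, and I don't know why. Where am I going wrong?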
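
One likely reason the last attempt returns nothing: as far as I can tell, BeautifulSoup looks up each key of the attrs dictionary as a literal attribute name, so a compiled pattern used as a key is never compared against the real attribute names. Only the values may be patterns. A minimal sketch of a workaround, assuming only data-* attributes are of interest, is to pass a function to find_all and test the attribute names yourself. The helper name has_magnet_in_data_attr and the shortened sample HTML are made up for illustration.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical sample modelled on the markup in the question.
html = '''
<a href="/torrents/some-file">Download torrent</a>
<div data-sc-params="{ 'magnet': 'magnet:?xt=urn:btih:ABC123&dn=example', 'extension': 'avi' }"></div>
<span title="magnet:?xt=urn:btih:NOT-A-DATA-ATTRIBUTE"></span>
'''

MAGNET = re.compile(r"magnet:\?xt=urn:btih:")

def has_magnet_in_data_attr(tag):
    """Return True if any data-* attribute of the tag contains a magnet link."""
    for name, value in tag.attrs.items():
        # Multi-valued attributes such as class come back as lists; skip them.
        if name.startswith("data-") and isinstance(value, str) and MAGNET.search(value):
            return True
    return False

soup = BeautifulSoup(html, "html.parser")

# find_all calls the function once per tag and keeps the tags it accepts.
matches = soup.find_all(has_magnet_in_data_attr)
for tag in matches:
    print(tag.name, sorted(tag.attrs))
```

Here the span is skipped because its magnet link sits in title rather than in a data-* attribute; drop the startswith check to search every attribute.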

You can use this script to extract magnet links from any HTML attribute:

import re
from bs4 import BeautifulSoup

txt = '''
<a data-download rel="nofollow" class="kaGiantButton siteButton iconButton" title="Download verified torrent file" target="_blank"
href="/torrents/Download Something in the Woods 2016 HDRip XviD AC3-EVO Torrent">
<i class="ka ka-verify"></i>
<span>Download torrent</span></a>
<div data-sc-replace data-sc-slot="_b6f619f42a2411c6688f2273fa3f628a" class="inlineblock"
data-sc-params="{ 'magnet': 'magnet:?xt=urn:btih:CC75C59E9FE0E8689DFD21558C02E9C9F92AE714&dn=something+in+the+woods+2016+hdrip+xvid+ac3+evo&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce', 'extension': 'avi', 'stream': '' }"></div>

<div some-attribute="magnet:?xt=urn:btih:THIS IS OTHER LINK">
</div>
'''

soup = BeautifulSoup(txt, 'html.parser')

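# Capture a magnet link up to the next quote character.
r = re.compile(r'(magnet:\?xt=urn:btih:[^\'"]+)')

def find_magnet_link(t):
    """Return every magnet link found in the attribute values of tag t."""
    rv = []
    for k in t.attrs:
        # Multi-valued attributes such as class are returned as lists; skip them.
        if isinstance(t[k], list):
            continue
        m = r.search(t[k])
        if m:
            rv.append(m.group(1))
    return rv

# find_all accepts a function; a non-empty list is truthy, so the tag is kept.
for tag in soup.find_all(find_magnet_link):
    for link in find_magnet_link(tag):
        print(link)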

This prints:

magnet:?xt=urn:btih:CC75C59E9FE0E8689DFD21558C02E9C9F92AE714&dn=something+in+the+woods+2016+hdrip+xvid+ac3+evo&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce
magnet:?xt=urn:btih:THIS IS OTHER LINK
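
Once the attribute value is in hand, the standard library can take it apart without a regex. The sketch below assumes the data-sc-params value always has the shape shown in the question, a dict-like literal with single-quoted strings, which is why ast.literal_eval is used rather than a JSON parser. The names raw_params and describe_magnet are hypothetical, and the tracker list is shortened.

```python
import ast
from urllib.parse import urlsplit, parse_qs

# Shortened copy of the data-sc-params value from the question.
raw_params = (
    "{ 'magnet': 'magnet:?xt=urn:btih:CC75C59E9FE0E8689DFD21558C02E9C9F92AE714"
    "&dn=something+in+the+woods+2016+hdrip+xvid+ac3+evo"
    "&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80%2Fannounce"
    "&tr=udp%3A%2F%2Fglotorrents.pw%3A6969%2Fannounce', "
    "'extension': 'avi', 'stream': '' }"
)

def describe_magnet(magnet):
    """Split a magnet URI into info hash, display name and tracker list."""
    # parse_qs decodes percent-escapes and turns '+' into spaces.
    query = parse_qs(urlsplit(magnet).query)
    return {
        "info_hash": query["xt"][0].rsplit(":", 1)[-1],
        "name": query.get("dn", [""])[0],
        "trackers": query.get("tr", []),
    }

# literal_eval only evaluates literals, so it is safe on untrusted text,
# but it raises an exception if the value is not a valid literal.
params = ast.literal_eval(raw_params)
info = describe_magnet(params["magnet"])

print(info["info_hash"])
print(info["name"])
print(info["trackers"])
```

If a site stores the value as real JSON (double quotes), json.loads would be the matching parser; the regex approach above stays the more forgiving option when the format is unknown.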