Scraping Reelgood.com with BeautifulSoup in Python

python web-scraping

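I am trying to build a scraper (in Python) for the website Reelgood.com.

When I open a specific movie on Reelgood, the page shows a play button.

If I click that button, it redirects me to another site, for example Netflix. Now I want to scrape that specific URL, so I wrote a small Python script that collects all the links the page contains.

This is what I came up with: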

from bs4 import BeautifulSoup
import requests

URL = "https://reelgood.com/movie/the-intouchables-2011"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for a_href in soup.find_all("a", href=True):
    print(a_href["href"])

This does print all the links on the page, but none of them is the URL I am redirected to.

Does anyone know how to extract the Netflix.com URL?

To keep only the links that contain https://www.netflix.com/, you can use the CSS selector a[href*="https://www.netflix.com/"], which selects every <a> tag whose href attribute contains https://www.netflix.com/:

from bs4 import BeautifulSoup
import requests

URL = "https://reelgood.com/movie/the-intouchables-2011"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

for a_href in soup.select('a[href*="https://www.netflix.com/"]'):
    print(a_href["href"])

Output:

https://www.netflix.com/watch/70232180
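
The `href*=` selector matches the substring anywhere in the attribute, so it would also match a link that merely carries the Netflix address in a query string. A stricter variant, sketched below with only the standard library, parses each href and compares its hostname. The helper name `filter_by_host` is hypothetical, and the sample list is made up apart from the URL taken from the output above.

```python
from urllib.parse import urlparse


def filter_by_host(hrefs, host):
    """Return the hrefs whose hostname is exactly `host` (hypothetical helper)."""
    return [h for h in hrefs if urlparse(h).hostname == host]


# Sample data: the first link is the one from the output above,
# the other two are made up for illustration.
links = [
    "https://www.netflix.com/watch/70232180",
    "/movie/the-intouchables-2011",  # relative link, has no hostname
    "https://example.com/?next=https://www.netflix.com/",  # substring only
]

print(filter_by_host(links, "www.netflix.com"))
```

With the scraper above, the input list could be built as `[a["href"] for a in soup.find_all("a", href=True)]`.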

I think you can adapt my code to your needs:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
import time
from geopy.geocoders import Nominatim
from pprint import pprint

# Instantiate a new Nominatim client
app = Nominatim(user_agent="tutorial")

def getLocation():
    # Let the browser access the current location by default.
    # When Chrome opens a site such as https://mycurrentlocation.net, it asks
    # for permission to access the location. The --use-fake-ui-for-media-stream
    # flag grants the location, microphone, etc. permissions automatically.
    options = Options()
    options.add_argument("--use-fake-ui-for-media-stream")
    # Open https://mycurrentlocation.net/ and allow up to 20 seconds for the page to load.
    timeout = 20
    # chromedriver must be the same version as the installed Google Chrome.
    # Note: executable_path, chrome_options and find_elements_by_xpath belong to the
    # Selenium 3 API; Selenium 4 uses Service(...), options= and find_elements(By.XPATH, ...).
    driver = webdriver.Chrome(executable_path='./chromedriver.exe', chrome_options=options)
    driver.get("https://mycurrentlocation.net/")
    wait = WebDriverWait(driver, timeout)
    time.sleep(3)
    # Find the XPath of each element to read on the page (here selected by id),
    # then take the text of the first match.
    neighborhood = driver.find_elements_by_xpath('//*[@id="neighborhood"]')
    neighborhood = [x.text for x in neighborhood]
    neighborhood = str(neighborhood[0])

    regionname = driver.find_elements_by_xpath('//*[@id="regionname"]')
    regionname = [x.text for x in regionname]
    regionname = str(regionname[0])

    placename = driver.find_elements_by_xpath('//*[@id="placename"]')
    placename = [x.text for x in placename]
    placename = str(placename[0])

    driver.quit()
    return (neighborhood, regionname, placename)

neighborhood, regionname, placename = getLocation()

print("The result is: \n ",
    neighborhood, regionname, placename)

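Thanks for your help, but if you run the script, it does not find the Netflix URL, although it has to be in there somewhere.

@JohnDoe It works for me. Does the URL show up when you print(soup.prettify())?

My bad, you are right. For this example it works.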
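
The check suggested in the comments, looking for the URL in the fetched page itself, can also be done on the raw text. This helps when the address sits somewhere other than an `<a>` tag, such as an embedded script. The following is only a sketch: the pattern, the helper name `find_netflix_urls` and the sample HTML are assumptions made for illustration and are not taken from Reelgood's actual markup.

```python
import re

# Matches a Netflix URL up to the first quote, whitespace, angle bracket or backslash.
NETFLIX_URL = re.compile(r"https://www\.netflix\.com/[^\s\"'<>\\]+")


def find_netflix_urls(html):
    """Return the unique Netflix URLs found anywhere in the raw HTML text, in order."""
    seen = []
    for url in NETFLIX_URL.findall(html):
        if url not in seen:
            seen.append(url)
    return seen


# Made-up sample: the URL appears once in an <a> tag and once inside a <script> blob.
sample = (
    '<a href="https://www.netflix.com/watch/70232180">Play</a>'
    '<script>{"url": "https://www.netflix.com/watch/70232180"}</script>'
)
print(find_netflix_urls(sample))
```

In the scraper from the question, the same function could be called as `find_netflix_urls(page.text)`. An empty result would suggest that the link is not present in the HTML that `requests` received.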