Python: getting all the HTML code from a website that uses React (tags: python, web-scraping)

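I'm trying to scrape pages, more specifically the pages that display a "thing", such as the one whose response is shown here. The problem is that when I make a GET request (using Python's urllib or the requests package), the response is an essentially empty HTML file: it contains a lot of head metadata, some scripts, and an empty react-app div: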

<!doctype html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">

<head>
    <title>PCB Feet/Standoffs for M3 by scruss - Thingiverse</title>
    <meta http-equiv="Content-type" content="text/html; charset=utf-8">
    <meta charset="utf-8">
    <meta http-equiv="Content-Language" content="EN">
    <meta http-equiv="imagetoolbar" content="no">
    <meta name="keywords"
        content="things, digital design, physical objects, rapid prototyping, 3D objects, 3D printing, reprap, fabrication, laser cutter, laser, thingaverse, thingyverse">
    <meta name="abstract" content="Share your digital designs for physical objects.">
    <meta name="author" content="Thingiverse.com">
    <meta name="distribution" content="Global">
    <meta name="revisit-after" content="1 days">
    <meta name="robots" content="follow,index">
    <meta name="description"
        content="Download files and build them with your 3D printer, laser cutter, or CNC. Thingiverse is a universe of things.">
    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
    <meta name="theme-color" content="#248bfb">

    <meta property="og:type" content="website">
    <meta property="og:title" content="PCB Feet/Standoffs for M3 by scruss">
    <meta property="og:description"
        content="Basic &quot;I don&#039;t want my protoboard shorting or gouging holes in my desk&quot; feet/standoffs for M3-drilled 1.6 mm thick PCBs.">
    <meta property="og:image" content="https://cdn.thingiverse.com/assets/d5/9e/0e/1c/f3/featured_preview_pcb_feet.png">
    <meta property="twitter:card" content="summary">
    <meta property="twitter:site" content="@thingiverse">
    <meta property="og:url" content="https://www.thingiverse.com/thing:4796603">
    <meta property="twitter:creator" content="@scruss">
    <link rel="apple-touch-icon" sizes="57x57"
        href="https://cdn.thingiverse.com/site/img/favicons/apple-touch-icon-57x57.png">
    <link rel="apple-touch-icon" sizes="114x114"
        href="https://cdn.thingiverse.com/site/img/favicons/apple-touch-icon-114x114.png">
    <link rel="apple-touch-icon" sizes="72x72"
        href="https://cdn.thingiverse.com/site/img/favicons/apple-touch-icon-72x72.png">
    <link rel="apple-touch-icon" sizes="144x144"
        href="https://cdn.thingiverse.com/site/img/favicons/apple-touch-icon-144x144.png">
    <link rel="apple-touch-icon" sizes="60x60"
        href="https://cdn.thingiverse.com/site/img/favicons/apple-touch-icon-60x60.png">
    <link rel="apple-touch-icon" sizes="120x120"
        href="https://cdn.thingiverse.com/site/img/favicons/apple-touch-icon-120x120.png">
    <link rel="apple-touch-icon" sizes="76x76"
        href="https://cdn.thingiverse.com/site/img/favicons/apple-touch-icon-76x76.png">
    <link rel="apple-touch-icon" sizes="152x152"
        href="https://cdn.thingiverse.com/site/img/favicons/apple-touch-icon-152x152.png">
    <link rel="icon" type="image/png" href="https://cdn.thingiverse.com/site/img/favicons/favicon-192x192.png"
        sizes="192x192">
    <link rel="icon" type="image/png" href="https://cdn.thingiverse.com/site/img/favicons/favicon-160x160.png"
        sizes="160x160">
    <link rel="icon" type="image/png" href="https://cdn.thingiverse.com/site/img/favicons/favicon-96x96.png"
        sizes="96x96">
    <link rel="icon" type="image/png" href="https://cdn.thingiverse.com/site/img/favicons/favicon-16x16.png"
        sizes="16x16">
    <link rel="icon" type="image/png" href="https://cdn.thingiverse.com/site/img/favicons/favicon-32x32.png"
        sizes="32x32">
    <meta name="msapplication-TileColor" content="#ffffff">
    <meta name="msapplication-TileImage" content="https://cdn.thingiverse.com/site/img/favicons/mstile-144x144.png">

    <link rel="alternate" type="application/rss+xml" title="Thingiverse - PCB Feet/Standoffs for M3 Comments"
        href="https://rss.thingiverse.com/thing:4796603">


    <script type="text/javascript" src="https://www.datadoghq-browser-agent.com/datadog-logs-us.js"></script>
    <script>
        const ddClientToken = "pub24a00142f6aa558abe1827e911e11e58";
            const ddEnv = "production";
            const ddVersion = "2.11.0";

            DD_LOGS.init({
                clientToken: ddClientToken,
                forwardErrorsToLogs: true,
                service: "thingiverse-client",
                env: ddEnv,
                version: ddVersion,
                sampleRate: 20
            });

            
            const ddIsTvNext = true;
            const ddBuildTime = "1617625667";

            DD_LOGS.addLoggerGlobalContext("is_thingiverse_next", ddIsTvNext);
            DD_LOGS.addLoggerGlobalContext("build_time", ddBuildTime);
    </script>

    <script>
        var scripts     = ["https://cdn.thingiverse.com/site/js/thingiverse/build/lib-afbc32d766.js","https://cdn.thingiverse.com/site/js/thingiverse/build/header-aa33d7b171.js","https://cdn.thingiverse.com/site/js/thingiverse/build/footer-df22f3acb4.js","https://cdn.thingiverse.com/site/js/thingiverse/build/things-d4ffa805ef.js","https://cdn.thingiverse.com/site/js/thingiverse/build/orders-e1ac5a6395.js","https://cdn.thingiverse.com/site/js/thingiverse/build/gallery-7fc215e644.js"];
            var stylesheets = [];
            var build_time  = 1617625667;
    </script>
</head>


<script src="https://cdn.thingiverse.com/site/js/three.min.bundle.js?1617625667"></script>
<div class="react-app" id="react-app"></div>
<script src="https://cdn.thingiverse.com/site/js/app.bundle.js?1617625667"></script>
<script>
    (function(w,d,s){w._uptime_rum={};w._uptime_rum.uuid='AVO7-994EF0DD9662F23C';w._uptime_rum.url='https://rum.uptime.com/rum/record-data';s=document.createElement('script');s.async=1;s.src='https://uptime.com/static/rum/compiled/rum.js';d.getElementsByTagName('head')[0].appendChild(s);})(window,document);
</script>

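Unfortunately, this is not the HTML you see when you inspect the page in the browser. I'm guessing React inserts its HTML later, which is why the div is empty. Is there a way to get around this and receive the actual HTML code that you see in the browser?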
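
A quick way to confirm that diagnosis from Python is to check whether the react-app div in the response has any children at all. The following is a minimal sketch using only the standard library; `MountPointProbe` and `is_unrendered` are illustrative names, not part of any library, and the sample string is a shortened stand-in for the response shown above:

```python
from html.parser import HTMLParser


class MountPointProbe(HTMLParser):
    """Records whether the element with a given id has any child tag or text."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0            # > 0 while the parser is inside the target element
        self.found = False        # True once the target element has been seen
        self.has_content = False  # True once a child tag or non-blank text is seen

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.has_content = True
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.has_content = True


def is_unrendered(html, mount_id="react-app"):
    """Return True if the mount point exists but nothing has been rendered into it."""
    probe = MountPointProbe(mount_id)
    probe.feed(html)
    return probe.found and not probe.has_content


# Shortened stand-in for the static response: the mount point is present but empty.
static_html = (
    '<html><body><div class="react-app" id="react-app"></div>'
    '<script src="app.bundle.js"></script></body></html>'
)
print(is_unrendered(static_html))  # True
```

In practice the `html` argument would be the text of the urllib or requests response. A result of `True` means the content only appears after the page's JavaScript has run.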

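You need a browser to execute the JavaScript, and then you extract the rendered HTML. Try Selenium. It lets you drive a browser from Python code and interact with the elements of a web page.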

Install Selenium:

pip install selenium

Then extract the HTML like this:

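from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Download chromedriver and replace this path (Selenium 4 takes the path via Service)
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get("your url")
# Wait for some element to be rendered, or just do a blind sleep
print(driver.page_source)  # this gives you the complete rendered HTML
driver.quit()  # close the browser and end the driver session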
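
The "wait for some element" step in the snippet above can be made explicit with Selenium's `WebDriverWait` instead of a blind sleep. The sketch below is one possible way to do it, not the answerer's code. The function name `fetch_rendered_html`, the default selector `#react-app > *` (any child of the mount div from the question), the 15-second timeout, and headless mode are all assumptions chosen for illustration:

```python
def fetch_rendered_html(url, css_selector="#react-app > *", timeout=15):
    """Load `url` in headless Chrome and return the HTML once `css_selector` matches."""
    # Imported here so that merely defining the function does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window
    # Assumption: Selenium 4.6+ is installed, so a matching driver is located automatically.
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Block until React has put at least one child into the mount point;
        # raises selenium.common.exceptions.TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always shut the browser down, even on timeout


# Example call (needs Chrome and network access):
# html = fetch_rendered_html("https://www.thingiverse.com/thing:4796603")
```

Waiting on a selector returns as soon as the content is present and fails loudly if it never appears, whereas a fixed sleep is either too long or occasionally too short. A more specific selector for the element you actually need is usually a better choice than the generic default.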

You can use a tool called […] to make sure the page has loaded and then scrape it. The answers to […] should give you some pointers.