Python BeautifulSoup decompose()


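I am trying to use BeautifulSoup to remove the <script> tag and the content inside it. I looked at the documentation, and it seems to be a really simple function. More information about the function is in the documentation. This is the content of the HTML page I have parsed so far: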

<body class="pb-theme-normal pb-full-fluid">
    <div class="pub_300x250 pub_300x250m pub_728x90 text-ad textAd text_ad text_ads text-ads text-ad-links" id="wp-adb-c" style="width: 1px !important;
    height: 1px !important;
    position: absolute !important;
    left: -10000px !important;
    top: -1000px !important;
    ">
</div>
<div id="pb-f-a">
</div>
    <div class="" id="pb-root">
    <script>
    (function(a){
        TWP=window.TWP||{};
        TWP.Features=TWP.Features||{};
        TWP.Features.Page=TWP.Features.Page||{};
        TWP.Features.Page.PostRecommends={};
        TWP.Features.Page.PostRecommends.url="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/hybrid.json?callback\x3d?";
        TWP.Features.Page.PostRecommends.trackUrl="https://recommendation-hybrid.wpdigital.net/hybrid/hybrid-filter/tracker.json?callback\x3d?";
        TWP.Features.Page.PostRecommends.profileUrl="https://usersegment.wpdigital.net/usersegments";
        TWP.Features.Page.PostRecommends.canonicalUrl=""
    })(jQuery);

    </script>
    </div>
</body>

soup.script.decompose()

This will only remove a single script element (the first one) from the "Soup". Instead, I think you meant to decompose all of them:

for script in soup("script"):
    script.decompose()
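
The difference between the two calls can be checked on a small document. The sketch below assumes the bs4 package is installed and uses the built-in html.parser; the sample markup and the helper name strip_scripts are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two <script> elements.
SAMPLE_HTML = """
<body>
  <script>var a = 1;</script>
  <p>Hello</p>
  <script>var b = 2;</script>
</body>
"""


def strip_scripts(html):
    """Return a soup with every <script> element removed."""
    soup = BeautifulSoup(html, "html.parser")
    # soup("script") is shorthand for soup.find_all("script")
    for script in soup("script"):
        script.decompose()
    return soup


# soup.script is the first <script> only, so one element survives.
first_only = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_only.script.decompose()
remaining_after_single_call = len(first_only.find_all("script"))

# Looping over all matches leaves none behind.
remaining_after_loop = len(strip_scripts(SAMPLE_HTML).find_all("script"))
```

Here remaining_after_single_call counts the script elements left by the single call, and remaining_after_loop counts those left by the loop.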

To elaborate on the answer provided by alecxe, here is a complete script for anyone's reference:

selects = soup.findAll('select')
for match in selects:
    match.decompose()

soup.script.decompose() only removes it from the soup variable... not from the html_body variable. You would have to remove it from the html_body variable as well. (I think.)
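
Whether a separate variable needs updating depends on how it was produced. A minimal sketch, assuming html_body is a plain string that was fed to BeautifulSoup (the variable names and markup here are illustrative, not taken from the asker's code): decompose() changes the parsed tree in place, the original string stays as it was, and serializing the soup again gives the cleaned markup.

```python
from bs4 import BeautifulSoup

# Assumed: html_body is an ordinary str, parsed once into a tree.
html_body = "<div id='pb-root'><script>var x = 1;</script><p>Text</p></div>"
soup = BeautifulSoup(html_body, "html.parser")

soup.script.decompose()  # modifies the tree held by soup, in place

# The source string is untouched; re-serialize the tree to get cleaned HTML.
cleaned_html = str(soup)
```

If html_body were instead a Tag taken from the same soup, it would share the tree and already reflect the removal.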

I solved the problem with the following code:

import re

scripts = soup.findAll(['script', 'style'])
for match in scripts:
    match.decompose()
# Extract the text once, after all script/style tags are gone
file_content = soup.get_text()
# Replace non-ASCII characters with spaces
content = re.sub(r'[^\x00-\x7f]', r' ', file_content)
# Create the '.txt' file
with open(my_params['q'] + '_' + str(count) + '.txt', 'w+') as webpage_out:
    webpage_out.write(content)
    print('The file ' + my_params['q'] + '_' + str(count) + '.txt ' + 'has been created successfully.')
    count += 1

The mistake was that the with open(... block was part of the for match in scripts: loop.
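
The intended flow can be sketched as a small function: remove the unwanted tags in the loop, then extract and write the text once, after the loop has finished. The function name page_text, the sample markup, and the temporary output path are assumptions for illustration.

```python
import os
import re
import tempfile

from bs4 import BeautifulSoup


def page_text(html):
    """Return the ASCII-only text of html, without script/style content."""
    soup = BeautifulSoup(html, "html.parser")
    for match in soup.find_all(["script", "style"]):
        match.decompose()
    # Runs once, after every unwanted tag is gone.
    text = soup.get_text()
    # Replace each non-ASCII character with a space.
    return re.sub(r"[^\x00-\x7f]", " ", text)


sample = "<body><style>p {color: red}</style><p>caf\u00e9</p><script>var a;</script></body>"
content = page_text(sample)

# Write the result a single time (hypothetical output location).
out_path = os.path.join(tempfile.mkdtemp(), "page_0.txt")
with open(out_path, "w") as webpage_out:
    webpage_out.write(content)
```

Keeping the file write outside the loop means one file per page, regardless of how many script or style tags the page contains.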

The code that did not work:

scripts = soup.findAll(['script', 'style'])
for match in scripts:
    match.decompose()
    file_content = soup.get_text()
    # Striping 'ascii' code
    content = re.sub(r'[^\x00-\x7f]', r' ', file_content)
    # Creating 'txt' files
    with open(my_params['q'] + '_' + str(count) + '.txt', 'w+') as webpage_out:
        webpage_out.write(content)
        print('The file ' + my_params['q'] + '_' + str(count) + '.txt ' + 'has been created successfully.')
        count += 1

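Paste the actual code you are running. When I tested the steps you describe, everything worked. Also, as edited, you are missing a closing div, but that is no problem for BS.

For some reason, decompose() stopped working. Now there is script code in my .txt files. I looked at the documentation, and the instructions are basically the same as before.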
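# Removing JS and CSS for script in soup('script', 'style'): script.decompose() with open(my_params['q'] + '_' + str(count) + '.txt', 'w') as webpage_out: webpage_out.write(soup.get_text()) print('The file ' + my_params['q'] + '_' + str(count) + '.txt' + ' has been created successfully') count += 1 except: pass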
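
One possible reason the snippet in the comment above leaves script code in the output, offered as an assumption since the full code is not shown: in Beautiful Soup, a second positional string passed to find_all() (and therefore to soup(...)) is treated as a CSS class filter, not as a second tag name. Passing the tag names as a list matches both. The sample markup below is made up.

```python
from bs4 import BeautifulSoup

html = "<body><style>p {}</style><script>var a;</script><p>Hi</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Matches only <script class="style"> elements, so nothing is found here.
by_class = soup("script", "style")

# A list of names matches <script> and <style> elements alike.
by_names = soup(["script", "style"])

for tag in by_names:
    tag.decompose()
text = soup.get_text()
```

With the list form, both elements are removed before get_text() is called.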
Thanks for sharing the simplest yet very effective way to get rid of all the useless stuff. I have benefited from this method for long enough. :)