Python lxml not showing all content

python html parsing web-scraping lxml


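I am trying to scrape information from a specific section of a web page and, ultimately, calculate word frequencies. But I am finding it difficult to get the entire text. As far as I can tell from the HTML source, my script omits the parts of that section that sit between line breaks but are not wrapped in a tag of their own. My code: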

from lxml import html as LH
import requests

scripturl = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"

# Download the page and parse the raw bytes into an element tree
scripthtml = requests.get(scripturl)
tree = LH.fromstring(scripthtml.content)

# text() selects the text nodes that are direct children of the div
script = tree.xpath('//div[@class="scrolling-script-container"]/text()')
print(script)
print(type(script))

This is the output:

['\n\n\n\n\t\t\t(radio clicks,\r music plays)\r\r Disc jockey:\r New York classic rock\r q104.', '3.', '\r\r Good morning.', "\r I'm Jim Kerr.", '\r\r Coming up\r'

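When I iterate over the result, only the phrases that come right after a \r and are followed by a comma or a double comma are printed:

for res in script:
    print(res)

The output is:

q104. 3. Good morning. I'm Jim Kerr.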

I am not tied to lxml, but since I am fairly new to this, I am not very familiar with other approaches.

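lxml elements have both a text and a tail method (strictly speaking, these are attributes rather than methods). You are searching for text, but if the element has HTML elements embedded in it (for example, a br), your search only goes as deep as the first piece of text the parser stores in the element's text, that is, the text before the first child element.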
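To make the text/tail distinction concrete, here is a minimal sketch. The HTML fragment is made up for illustration and only assumes that the real page has the same shape (a div whose lines are separated by br elements); it is not taken from the site.

```python
from lxml import html as LH

# Hypothetical fragment standing in for the real page.
fragment = (
    '<div class="scrolling-script-container">'
    'first line<br>second line<br>third line'
    '</div>'
)
div = LH.fromstring(fragment)

# .text holds only the text before the first child element (the first <br>).
print(repr(div.text))

# Text that follows a child element is stored on that child as .tail.
br_tails = [br.tail for br in div.findall('br')]
print(br_tails)

# itertext() yields every text and tail value in document order.
all_text = list(div.itertext())
print(all_text)
```

Reading only the text of the div would therefore miss everything after the first br, while itertext() collects all three lines.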

Try:


This was bugging me, so I wrote up a solution:

import requests
from lxml import etree
from io import StringIO

parser = etree.HTMLParser()
base_url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
resp = requests.get(base_url)
root = etree.parse(StringIO(resp.text), parser)

# xpath() returns a list of matching elements, not a single element
script = root.xpath('//div[@class="scrolling-script-container"]')
text_list = []

for elem in script:
    print(elem.attrib)
    # text and tail are attributes holding a string or None
    if elem.text is not None:
        text_list.append(elem.text)
    if elem.tail is not None:
        text_list.append(elem.tail)

# only gets the first block of text before
# it encounters a br tag
for text in text_list:
    print(text)

# prints everything: text and tail of the div and of each descendant
for elem in script:
    for sib in elem.iter():
        print(sib.attrib)
        if sib.text is not None:
            print(sib.text)
        if sib.tail is not None:
            print(sib.tail)
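Since the stated goal is a word-frequency count, the following sketch joins all the text under the div and counts words. The sample markup, the helper names full_text and word_counts, and the word pattern are assumptions made for illustration. The whitespace cleanup also removes the \r characters seen in the question's output; a terminal may render a carriage return by jumping back to the start of the line, which could make printed text look as if it were missing, though the thread does not confirm that this is the cause.

```python
import re
from collections import Counter
from lxml import html as LH

# Hypothetical sample; the real page is assumed to have a similar shape.
sample = (
    '<div class="scrolling-script-container">\n\t(radio clicks,\r music plays)\r'
    '<br>Good morning.\r<br>I\'m Jim Kerr.<br>Good morning again.</div>'
)
div = LH.fromstring(sample)


def full_text(element):
    """Join every text node under element and collapse runs of whitespace."""
    joined = " ".join(element.itertext())
    return re.sub(r"\s+", " ", joined).strip()


def word_counts(text):
    """Count lowercase words; apostrophes are kept inside words."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


text = full_text(div)
counts = word_counts(text)
print(text)
print(counts.most_common(3))
```

On the real page, the div would come from the same xpath query used above, for example the first item of the returned list.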


First off, the XPath expression is invalid, join is not called that way, and text and tail are not methods.

OK, that makes sense. But the print does not work, because "script" is a list, not a tree. Is there a different solution for this?

Sorry, script[0].text and script[0].tail should work.

Thanks for this. Because tree.xpath returns a list, the join raises an error saying it is not defined. I think I need to simplify this from the start: I need to get the part of the HTML inside the div class as a tree.

Thanks for the effort! It still ignores all the text between two br tags and prints only some of it. Could it be a problem with the HTML itself? It looks odd and is not properly aligned. Maybe the easiest way is to remove all the br tags?

BeautifulSoup may be more forgiving. I admit this is an unusual case; I usually don't have many problems with etree.
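The idea raised in the comments of removing every br can be sketched with etree.strip_tags from lxml, which drops the named elements but keeps their text and tail. The fragment below is made up for illustration, and inserting a newline before each tail is an assumption of this sketch so that neighbouring lines do not run together.

```python
from lxml import etree, html as LH

# Hypothetical fragment standing in for the real page.
fragment = '<div class="scrolling-script-container">one<br>two<br/>three</div>'
div = LH.fromstring(fragment)

# Put a newline in front of each tail so the lines stay separated
# once the <br> elements are gone.
for br in div.iter('br'):
    br.tail = "\n" + (br.tail or "")

# strip_tags removes the <br> elements and merges their tail text
# into the surrounding content.
etree.strip_tags(div, 'br')

print(len(div))        # child elements remaining in the div
print(repr(div.text))  # all text now lives in div.text
```

After this step the whole script is available as a single string on the text of the div, so the original approach of reading one attribute works again.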