Python BeautifulSoup doesn't show all the elements


python parsing


I am trying to parse the Taobao website, using BeautifulSoup.find to get information about goods (photo, text, and link), but it does not find all the classes.

import requests
from bs4 import BeautifulSoup

url = 'https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'

def get_html(url):
    # Fetch the HTML exactly as the server returns it (no JavaScript is executed).
    r = requests.get(url)
    return r.text

html = get_html(url)
soup = BeautifulSoup(html, 'lxml')
z = soup.find("div", {"class": "J_TItems"})

z is empty (None). But, for example:

z=soup.find("div",{"class":"skin-box-bd"})
len(z)
Out[196]: 3

works fine.

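Why doesn't this approach work? What should I do to get all the information about a good? I am using Python 2.7.

It looks like the items you want to parse are built dynamically by JavaScript, which is why

soup.text.find("J_TItems")

returns -1, i.e. the string "J_TItems" does not appear in the text at all. One option is to use selenium with a browser that executes JavaScript; for headless browsing you can use PhantomJS like this: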

from bs4 import BeautifulSoup
from selenium import webdriver

url='https://xuanniwen.world.tmall.com/category-1268767539.htm?search=y&catName=%BC%D0%BF%CB#bd&view_op=citations_histogram'

browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source

soup = BeautifulSoup(html, 'html5lib') # I'd also recommend using html5lib
JTitems = soup.find("div", attrs={"class":"J_TItems"})
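As a side note, PhantomJS support was deprecated in later Selenium 3 releases and removed in Selenium 4, so the snippet above may not run on a current install. Here is a minimal sketch of the same idea with headless Chrome. It assumes Chrome and a matching driver are available locally; the helper names make_headless_chrome and get_rendered_html are made up for this example and are not part of selenium.

```python
def make_headless_chrome():
    """Create a headless Chrome driver.

    Assumption: selenium, Chrome, and a matching driver are installed.
    selenium is imported here so the rest of the module loads without it.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")  # run without opening a window
    return webdriver.Chrome(options=options)


def get_rendered_html(browser, url):
    """Load url in the given browser and return the HTML after JavaScript ran.

    The browser is always closed afterwards, even if loading fails.
    """
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()
```

Usage would be html = get_rendered_html(make_headless_chrome(), url), after which html can be passed to BeautifulSoup exactly as before. If the items are still missing, the page may need more time to finish loading, and an explicit wait (selenium's WebDriverWait) is the usual next step.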

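Note that the items you need are inside each of the rows, of which there are 5. You may only need the first three, because the other two are not part of the main search; filtering them out should not be hard, a simple rows = rows[2:] would do.

Now note that each "good" you mentioned in your question is inside a dl element with class item, so you need to collect them all in a for loop: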
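# rows is the list of row elements found inside JTitems
Goods = []
for row in rows:
    # every <dl class="item"> in a row is one good
    for item in row.findAll("dl", attrs={"class": "item"}):
        Goods.append(item)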

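All that is left is to get the "photo, text, and link" you mentioned, which is easy once you access each item in the Goods list. By inspecting the page you can work out how to get each piece of information; for example, for the picture URL a simple one-liner is: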
>>> Goods[0].find("dt", class_='photo').a.img["src"]
'//img.alicdn.com/bao/uploaded/i3/TB19Fl1SpXXXXbsaXXXXXXXXXXX_!!0-item_pic.jpg_180x180.jpg'
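To collect all three pieces at once, something like the sketch below might work. Only the dl.item > dt.photo > a > img path comes from the page as shown above; SAMPLE_HTML, its URLs, and the extract_good helper are made up for illustration, and taking the link from the same a tag and the text from the item's visible text are assumptions you should verify against the real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure described above.
SAMPLE_HTML = """
<dl class="item">
  <dt class="photo">
    <a href="//example.invalid/item?id=1"><img src="//example.invalid/1.jpg"></a>
  </dt>
  <dd class="detail"><a href="//example.invalid/item?id=1">Jacket</a></dd>
</dl>
"""


def extract_good(item):
    """Return photo URL, link, and visible text for one <dl class="item">."""
    link_tag = item.find("dt", class_="photo").a
    return {
        "photo": link_tag.img["src"],
        "link": link_tag["href"],  # assumption: the photo links to the item page
        "text": item.get_text(" ", strip=True),  # all visible text in the item
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
goods = [extract_good(item) for item in soup.find_all("dl", class_="item")]
```

The src values are protocol-relative (they start with //), so prefix them with https: before downloading the images.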

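Try soup.text.find("J_TItems") and you will see that there is no J_TItems in soup at all. I think what is happening is that the content you want to parse is not in the HTML; it is actually built dynamically by JavaScript. You should look at Python's selenium module.

Thank you very much for your help. Maybe you can tell me how I can work out which method to use for hidden content (JavaScript, Ajax, etc.) when I try to parse a site.

@egorkh I'm glad to help! I would say that going with selenium and loading the JavaScript is always the best option, but you can tell whether you need it by checking the HTML source of the page you want to scrape: if what you want is not there but does show up in the inspect-element window, you will need JavaScript rendering!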
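The "check the HTML source" test can also be automated. The sketch below uses only the standard library and reports whether any tag in the raw HTML (what requests.get(url).text returns) carries a given class. ClassFinder and class_in_static_html are names made up for this example. Unlike searching soup.text, which only covers text nodes, it looks at the class attributes themselves.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class ClassFinder(HTMLParser):
    """Remember whether any start tag carries the given class name."""

    def __init__(self, class_name):
        HTMLParser.__init__(self)
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.found = True


def class_in_static_html(html, class_name):
    """Return True if some tag in html has class_name among its classes."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    return finder.found
```

If class_in_static_html(html, "J_TItems") is False for the downloaded page while the element is visible in the browser's inspector, the element is being added by JavaScript and a rendering tool such as selenium is needed.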