Python: getting a div inside a div with BeautifulSoup

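I have seen many questions similar to mine, but I have not found a good answer. My web page is structured as follows:

What I want is to get the ids, such as thread-XXXXXXXXX. Here is my code: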

from bs4 import BeautifulSoup
import urllib.request

req = urllib.request.Request("http://boards.4chan.org/g/catalog", headers={'User-Agent' : "Magic Browser"})
soup = BeautifulSoup(urllib.request.urlopen(req), "html.parser")
data2 = soup.find_all("div", attrs={"id": "threads"})
print (data2)

It prints:

[]

Alright, but how can I do that?

This does not work:

data3 = soup.find_all("div", attrs={"class": "thread"})

I mean, it just prints:

[]

Use the children attribute:

data2[0].children
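As a rough illustration of what the children attribute gives you when the markup really is present in the HTML, here is a minimal sketch. The html string and its two thread ids are made up for the example; they are not taken from the real page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical static markup standing in for a page whose threads are
# already present in the HTML (not the real catalog page).
html = """
<div id="threads">
  <div id="thread-111" class="thread">first</div>
  <div id="thread-222" class="thread">second</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find_all("div", attrs={"id": "threads"})[0]

# .children is a lazy iterator over direct children. It also yields the
# whitespace text nodes between tags, so keep only Tag objects.
thread_ids = [child["id"] for child in container.children if isinstance(child, Tag)]
print(thread_ids)
```

Printing container.children directly only shows an iterator object, which is why it has to be wrapped in list() or looped over.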

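The content is rendered dynamically, so [] clearly has no children in the HTML you download. As an alternative, you can parse the JSON in the page source that holds the thread data: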

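from bs4 import BeautifulSoup
import requests
import re
import json

# Raw string so \s and \{ reach the regex engine unchanged; the group captures the object literal.
patt = re.compile(r"var catalog\s+=\s+(\{.*?\});")
soup = BeautifulSoup(requests.get("http://boards.4chan.org/g/catalog").content, "html.parser")

# Find the script tag that defines the catalog variable
# (string= is the current name of the text= argument, Beautiful Soup 4.4.0 and later).
data2 = soup.find("script", string=re.compile("var catalog ="))
# Parse the captured object literal as JSON.
threads_js = json.loads(patt.search(data2.text).group(1))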

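This gives you a dict containing all the dynamic content; what you need is under the threads key. There is too much data to post here, but everything you need should be in there. It looks like: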

 {u'57205979': {u'b': 69, u'sub': u'', u'author': u'Anonymous', u'i': 5, u'tn_w': 250, u'teaser': u'Gotta love that hanging.', u'r': 17, u'lr': {u'date': 1477253272,

Each outer key is a thread id.
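To show how the ids could be pulled out of that dict, here is a small offline sketch using only the standard library. The script_text string is a hypothetical, heavily shortened stand-in for the script tag's contents (the count field and the second thread are invented), and the thread- prefix is assumed from the id format in the question.

```python
import json
import re

# Hypothetical, shortened stand-in for the text of the script tag.
script_text = (
    'var catalog = {"threads": {"57205979": {"author": "Anonymous", "r": 17}, '
    '"57205980": {"author": "Anonymous", "r": 3}}, "count": 2};'
    'var other = 1;'
)

# Same pattern as in the answer: capture the object literal assigned to catalog.
patt = re.compile(r"var catalog\s+=\s+(\{.*?\});")
catalog = json.loads(patt.search(script_text).group(1))

# The keys under "threads" are the numeric ids; prepend the prefix used in the page.
thread_ids = ["thread-" + key for key in catalog["threads"]]
print(thread_ids)
```

Note that the non-greedy pattern stops at the first closing brace followed by a semicolon, so it could cut the object short if that sequence ever appears inside a string value.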

On another note, when you are looking for a single tag you should use find, and you can pass keyword arguments instead of using attrs:

 data2 = soup.find("div", id="threads")
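The difference between the two calls can be sketched on a tiny, made-up document (the html string below is an assumption for illustration only). find returns a single tag or None, while find_all returns a list-like result, and class needs the class_ keyword because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, not the real page.
html = '<div id="threads"><div id="thread-111" class="thread"></div></div>'
soup = BeautifulSoup(html, "html.parser")

single = soup.find("div", id="threads")        # one Tag, or None if absent
many = soup.find_all("div", class_="thread")   # list-like, possibly empty
missing = soup.find("div", id="no-such-id")    # None, so no [0] indexing needed

print(single["id"], [tag["id"] for tag in many], missing)
```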

print(data2[0].children)

@FrynioS print(list(...)) or iterate over it.

print(list(data2[0].children))
The content is loaded dynamically using JS.

@PadraicCunningham So there is nothing I can do about it?

See the answer below.

I definitely think Selenium is the better approach, especially as a skill to learn.

@AlexHall, Selenium is slow. I have used this approach in production code with Scrapy, and it has worked well in many cases.