Python 3.x

The code below works fine when extracting data from a single row, in my case row[0]. I would like to know how to adapt it to extract data from multiple rows. I would also like to be able to specify which divTag class to use for a particular column (see the code below).

Something like, for rows [1, 2] use:
divTag = soup.find("div", {"class": "productsPicture"})
and for rows [4, 5] use:
divTag = soup.find("div", {"class": "product_content"})
if that makes sense.
from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]
        print(url)

        # fetch content from server
        try:
            html = requests.get(url).content
        except requests.exceptions.ConnectionError as e:
            writer.writerow([url, '', 'bad url'])
            continue

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        divTag = soup.find("div", {"class": "productsPicture"})

        if divTag:
            # Return all 'a' tags that contain an href
            for a in divTag.find_all("a", href=True):
                url_sub = a['href']

                # Test that link is valid
                try:
                    r = requests.get(url_sub)
                    writer.writerow([url, url_sub, 'ok'])
                except requests.exceptions.ConnectionError as e:
                    writer.writerow([url, url_sub, 'bad link'])
        else:
            writer.writerow([url, '', 'no results'])
urls.csv example:
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093;
Example classes to search for: productsPicture and product_content.

To add per-column find parameters, you could create a dictionary that maps the column index number to the required find parameters, as follows:
from bs4 import BeautifulSoup
import requests
import csv

class_1 = {"class": "productsPicture"}
class_2 = {"class": "product_content"}
class_3 = {"class": "id-fix"}

# map a column number to the required find parameters
class_to_find = {
    0 : class_3,    # Not defined in question
    1 : class_1,
    2 : class_1,
    3 : class_3,    # Not defined in question
    4 : class_2,
    5 : class_2}

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile)
    writer = csv.writer(results)

    for row in reader:
        # get the url
        output_row = []

        for index, url in enumerate(row):
            url = url.strip()

            # Skip any empty URLs
            if len(url):
                #print('col: {}\nurl: {}\nclass: {}\n\n'.format(index, url, class_to_find[index]))

                # fetch content from server
                try:
                    html = requests.get(url).content
                except requests.exceptions.ConnectionError as e:
                    output_row.extend([url, '', 'bad url'])
                    continue
                except requests.exceptions.MissingSchema as e:
                    output_row.extend([url, '', 'missing http...'])
                    continue

                # soup fetched content
                soup = BeautifulSoup(html, 'html.parser')

                divTag = soup.find("div", class_to_find[index])

                if divTag:
                    # Return all 'a' tags that contain an href
                    for a in divTag.find_all("a", href=True):
                        url_sub = a['href']

                        # Test that link is valid
                        try:
                            r = requests.get(url_sub)
                            output_row.extend([url, url_sub, 'ok'])
                        except requests.exceptions.ConnectionError as e:
                            output_row.extend([url, url_sub, 'bad link'])
                else:
                    output_row.extend([url, '', 'no results'])

        writer.writerow(output_row)
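Two of the entries in the dictionary above (class_3 for columns 0 and 3) are marked as not defined in the question. As a hedged sketch of one way to handle columns with no configured class, dict.get() returns None for a missing key instead of raising a KeyError, so unmapped columns can simply be skipped (treating columns 0 and 3 as unmapped here is an assumption for illustration, not part of the answer's code):

```python
# Hypothetical guard: look up the find parameters with .get() so a
# column with no mapping yields None rather than raising a KeyError.
class_1 = {"class": "productsPicture"}
class_2 = {"class": "product_content"}

class_to_find = {
    1: class_1,
    2: class_1,
    4: class_2,
    5: class_2,
}

searched = []
for index in range(6):
    params = class_to_find.get(index)   # None for columns 0 and 3
    if params is None:
        continue                        # skip columns with no configured class
    searched.append(index)              # soup.find("div", params) would go here
```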
The enumerate() function returns a counter whilst iterating over a list. So for the first URL, index will be 0, and for the next URL it will be 1. This can then be used with the class_to_find dictionary to get the required parameters for the search.
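The counting behaviour can be seen with a toy row (the URLs here are placeholders):

```python
# enumerate() pairs each column value with its index, starting at 0.
row = ["https://example.com/a", "https://example.com/b"]
pairs = list(enumerate(row))
print(pairs)  # [(0, 'https://example.com/a'), (1, 'https://example.com/b')]
```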
Each URL will result in 3 columns being created: the URL, the successful sub-URL, and the result. You can remove these if they are not needed.

Can someone help me solve this @Martin Evans? Guys, look here and help me solve this puzzle. Yes! Yes! You are the man! BUUUUUT it is doing it per row, not per column. If we can figure out how to do it per column, that would be faaaantastic! Thanks again for your time!!! Tried re-copying it, there is a bug. OK, @Martin Evans, I tested it and it looks like you still can't do two classes at once, right? Say, class productsPicture for column 2 and class product_content for column 3? You are so close!!!! Please don't give up!

Currently class_1 is used for columns 0 and 3 (they are numbered starting from 0) and class_2 is used for columns 1 and 4. Add enumerate(), e.g. for row_number, row in enumerate(reader, start=2):. You could then have an if statement for when row_number == 2
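As a minimal sketch of that last suggestion (the in-memory CSV here is a stand-in for urls.csv): enumerate(reader, start=2) numbers the rows beginning at 2, so an if statement can single out one particular row for different handling:

```python
import csv
import io

# Stand-in for open('urls.csv'): two rows of data.
csv_file = io.StringIO("first_row\nsecond_row\n")
reader = csv.reader(csv_file)

matched = []
for row_number, row in enumerate(reader, start=2):
    if row_number == 2:
        # handle this specific row differently
        matched.append(row[0])

print(matched)  # ['first_row']
```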