Python: scraping a website to get a table, but I get an empty table
I am trying to scrape the data from this table, but I get an empty CSV file containing only the header row. I tried the code below. I don't understand what is happening in my code or why it returns an empty table. My code:
import csv
import requests
from bs4 import BeautifulSoup

url = "http://www.peaklist.org/WWlists/WorldTop50.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
scripts = soup.find_all("script")

file_name = open("table.csv", "w", newline="")
writer = csv.writer(file_name)
list_to_write = []
list_to_write.append(["Summit Name", "Country", "Lat.", "Long.", "Elevation mtrs.", "Prom. mtrs.", "Saddle mtrs.", "Saddle Location", "Elevation ft.", "Prom. ft.", "Notes", "Aerial Photo"])

for script in scripts:
    text = script.text
    start = 0
    end = 0
    if len(text) > 10000:
        while start > -1:
            start = text.find('"Summit Name":"', start)
            if start == -1:
                break
            start += len('"Summit Name":"')
            end = text.find('"', start)
            summit_name = text[start:end]

            start = text.find('"Country":"', start)
            start += len('"Country":"')
            end = text.find('"', start)
            country = text[start:end]

            start = text.find('"Lat.":"', start)
            start += len('"Lat.":"')
            end = text.find('"', start)
            lat = text[start:end]

            start = text.find('"Long.":"', start)
            start += len('"Long.":"')
            end = text.find('"', start)
            long = text[start:end]

            start = text.find('"Elevation mtrs.":"', start)
            start += len('"Elevation mtrs.":"')
            end = text.find('"', start)
            elevation = text[start:end]

            start = text.find('"Prom. mtrs.":"', start)
            start += len('"Prom. mtrs.":"')
            end = text.find('"', start)
            prom = text[start:end]

            start = text.find('"Saddle mtrs.":"', start)
            start += len('"Saddle mtrs.":"')
            end = text.find('"', start)
            saddle = text[start:end]

            start = text.find('"Saddle Location":"', start)
            start += len('"Saddle Location":"')
            end = text.find('"', start)
            saddle_loc = text[start:end]

            start = text.find('"Elevation ft.":"', start)
            start += len('"Elevation ft.":"')
            end = text.find('"', start)
            elevation_ft = text[start:end]

            start = text.find('"Prom. ft.":"', start)
            start += len('"Prom. ft.":"')
            end = text.find('"', start)
            prom_ft = text[start:end]

            start = text.find('"Notes":"', start)
            start += len('"Notes":"')
            end = text.find('"', start)
            notes = text[start:end]

            start = text.find('"Aerial Photo":"', start)
            start += len('"Aerial Photo":"')
            end = text.find('"', start)
            aerial = text[start:end]

            list_to_write.append([summit_name, country, lat, long, elevation, prom, saddle, saddle_loc, elevation_ft, prom_ft, notes, aerial])

writer.writerows(list_to_write)
file_name.close()
I get no error message from this code, just an empty table, so I guess this method may not be recognizing the table data on the site.
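A quick way to check why the script-tag search finds nothing is to look at what the parser actually sees before extracting anything. The sketch below runs on a small stand-in page (the HTML string here is made up for illustration, not fetched from peaklist.org): when a table is rendered server-side, its rows live in `<table>` elements, and no `<script>` block contains the row data your string search is looking for.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document mimicking a server-rendered page:
# the rows live in a <table>, and the <script> holds no table data.
html = """
<html><body>
  <script>var x = 1;</script>
  <table><tr><td>Everest</td><td>Nepal/China</td></tr></table>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")
# Scripts that actually embed the field your search relies on:
scripts = [s for s in soup.find_all("script") if "Summit Name" in s.text]

print(len(tables))   # the table data is here
print(len(scripts))  # no script contains '"Summit Name":"'
```

If the second count is zero, the `while` loop over script text can never find `"Summit Name":"`, so nothing is ever appended after the header row, which matches the empty CSV you are seeing.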
Thanks.

Your problem lies in the way you iterate over the data. In the future, when writing a program, try to reason about it logically: use the correct tag name in find_all, and iterate over the data you are scraping in the most fault-tolerant way.

This script isn't perfect, but I think it can guide you toward a better understanding of how to scrape. Check the comments to see what the code does.
import requests
from bs4 import BeautifulSoup

url = "http://www.peaklist.org/WWlists/WorldTop50.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
to_csv = [["Summit Name", "Country", "Lat.", "Long.", "Elevation mtrs.", "Prom. mtrs.", "Saddle mtrs.", "Saddle Location", "Elevation ft.", "Prom. ft.", "Notes", "Aerial Photo"]]
table = soup.find_all('table')[1]  # Choose the second table on the page
rows = table.find_all('tr')        # Get all table rows from our table element
del rows[0]                        # Remove the first row, the table heading (we already have it)
for row in rows:
    tmp = []                       # We're going to collect the column data in this list
    columns = row.find_all('td')   # Find all columns in the row
    for column in columns:
        tmp.append(column.text.strip())  # strip() removes extra whitespace from the text
    to_csv.append(tmp)             # Append this row's list to the main list
with open('output.csv', 'w') as csvfile:
    for row in to_csv:             # For each list in the main list
        line = ','.join(row)       # Join the column data of the row
        csvfile.write(line + '\n')  # Write each row on its own line
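One caveat with the manual `','.join(row)` approach above: any cell that itself contains a comma (the "Notes" column on this page is a likely candidate) silently splits into extra columns on re-reading. The `csv` module quotes such fields automatically. A minimal, self-contained sketch with made-up example rows (the file name `output_quoted.csv` is just for illustration):

```python
import csv

rows = [
    ["Summit Name", "Country", "Notes"],
    ["Everest", "Nepal/China", "Highest peak, first climbed 1953"],  # note the embedded comma
]

# csv.writer wraps fields containing commas in quotes, so the
# column structure survives a round trip through the file.
with open("output_quoted.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(rows)

with open("output_quoted.csv", newline="") as csvfile:
    back = list(csv.reader(csvfile))

print(back[1][2])  # the comma stays inside one field
```

Passing `newline=""` to `open()` is the documented way to use the `csv` module on all platforms; it prevents doubled line endings on Windows.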
Thanks everyone for the tips. The following code works well:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "http://www.peaklist.org/WWlists/WorldTop50.html"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.find_all("tr")
print("Number of rows on site: ", len(gdp))
body_rows = gdp[2:]
all_rows = []
for row_num in range(len(body_rows)):
    row = []
    for row_item in body_rows[row_num].find_all("td"):
        aa = re.sub("(\xa0)|(\n)|,", "", row_item.text)
        row.append(aa)
    all_rows.append(row)
df = pd.DataFrame(data=all_rows)
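The DataFrame above is built without column labels, so saving it would produce numeric headers (0, 1, 2, ...). A small follow-up sketch showing how to attach the header row from the question and write the CSV; the two data rows here are made-up stand-ins for the scraped `all_rows`, and only a subset of the real columns is shown:

```python
import pandas as pd

# Hypothetical subset of scraped rows (the real ones come from all_rows).
all_rows = [
    ["Mount Everest", "Nepal/China", "8848", "8848"],
    ["K2", "Pakistan/China", "8611", "4017"],
]
headers = ["Summit Name", "Country", "Elevation mtrs.", "Prom. mtrs."]

df = pd.DataFrame(data=all_rows, columns=headers)
df.to_csv("table.csv", index=False)  # index=False keeps the row index out of the file

print(df.shape)
```

`columns=` labels the DataFrame at construction time; alternatively, `df.columns = headers` after the fact does the same thing.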
Before this, I inspected the HTML code; that is why gdp starts from row 2.

I think the problem is in your logic. What is the expected output? Post some rows of the CSV you expect.

@Jarvis the CSV file should contain the following columns: summit_name, country, lat, long, elevation, prom, saddle, saddle_loc, elevation_ft, prom_ft, notes, aerial

Thanks, I tried this but still get an empty CSV file... trying the new solution