Python: confused about HTML splits and reading an HTML table from a website
I have been trying to read user account data from a website called Socialbakers, which collates social media account data. I have been following help from another source, but I can never get the full list of 50 users; I only get 10 of the 50. I tried changing how the table contents are added to the list, but it still does not work properly and only retrieves the first 10.

The table I am scraping is located under Twitter profile statistics on the site.

The code I am using for the crawl:
try:
    print('getting html content for url: %s' % url)
    page = requests.get(url)
    tree = html.fromstring(page.text)
    table = tree.xpath('//table[@class="brand-table-list"]')[0]
    data = [[text(td) for td in tr.xpath('td')] for tr in table.xpath('//tr')]
    ids = table.xpath('//a[@class="acc-placeholder-img"]')
    print(data)
    name_id = {}
    for uid in ids:
        name_id[uid.attrib['href'].split('/')[-1].split('-')[1]] = uid.attrib['href'].split('/')[-1].split('-')[0]
    for row in data:
        print(row)
        followings = None
        followers = None
        name = None
        uid = None
        user_country = None
        if len(row) == 4:
            '''obtaining name and id'''
            name = row[1].split()[-1][2:-1]
            #print(name)
            name = name.decode("utf-8")
            #print(name)
            uid = name_id[name.lower()]
            #print(uid)
            '''Obtaining user country'''
            status_code = 0
            while status_code != 200:
                new_url = 'http://www.socialbakers.com/statistics/twitter/profiles/detail/' + uid + '-' + name
                while True:
                    try:
                        print("hnere")
                        new_page = requests.get(new_url)
                        break
                    except Exception:
                        logging.info('sleeping for 5 seconds...')
                        time.sleep(5)
                        continue
                #print("hnere")
                status_code = new_page.status_code
                if status_code != 200:
                    logging.error(status_code)
                    delay = np.random.rand() * 5
                    time.sleep(delay)
            new_tree = html.fromstring(new_page.text)
            tag_list = new_tree.xpath('//div[@class="account-tag-list"]')
            tags = tag_list[0].text_content().split()
            for tag in tags:
                if 'GLOBAL' in tags:
                    user_country = 'GLOBAL'
                elif len(tag) == 2:
                    user_country = tag
            '''obtaining number of followers and friends'''
            try:
                followings = int(row[2][10:])
            except ValueError:
                try:
                    followings = int(row[2][20:])
                except Exception:
                    print('error occured for: %s' % row)
            try:
                followers = int(row[3][9:])
            except ValueError:
                try:
                    followers = int(row[3][18:])
                except Exception:
                    print('error occured for: %s' % row)
            '''writing to file'''
            print('followings:')
            print(followings)
            print('followers:')
            print(followers)
            print('name:')
            print(name)
            print('user_country:')
            print(user_country)
            if (followings is not None) and (followers is not None) and (name is not None) and (
                    user_country is not None):
                fp.write(uid + '\t' + str(name) + '\t' + str(followings) + '\t' + str(followers) + '\t' + str(
                    user_country) + '\n')
        delay = np.random.rand()
except Exception:
    print('error occured for a url, sleeping for 10 seconds...')
    delay = np.random.rand() * 10
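
The fixed-offset slices in the code above (`row[2][10:]`, falling back to `row[2][20:]`) depend on whether the label text appears once or twice in the cell. A minimal sketch of one alternative, assuming each cell is a bytes or str value that ends with the digits of the count; `extract_count` is a hypothetical helper and is not part of the script above:

```python
import re


def extract_count(cell):
    """Return the integer at the end of a table cell, or None if there is none."""
    # The cells printed by the script are bytes, so decode them first.
    if isinstance(cell, bytes):
        cell = cell.decode("utf-8", errors="replace")
    # Match the run of digits at the end of the cell, ignoring trailing spaces.
    match = re.search(r"(\d+)\s*$", cell)
    return int(match.group(1)) if match else None


print(extract_count(b"Followings754"))
print(extract_count(b"FollowersFollowers19190696"))
```

With this approach the label can be repeated or not without a second `try` block, at the cost of silently returning `None` for cells that contain no trailing number.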
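When I print the list of data retrieved from the table, it ends at Android rather than @NetflixLat, which is in 50th place. Is there a mistake in how I am parsing the HTML page? I am new here, so any help is welcome, and if I have been vague I will edit accordingly.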
Data list retrieved from the table:
[[b'1', b'1PlayStation (@PlayStation)', b'Followings754', b'FollowersFollowers19190696'],
 [b'2', b'2Xbox (@Xbox)', b'Followings16649', b'Followers14339548'],
 [b'3', b'3CHANEL (@CHANEL)', b'Followings1', b'Followers13110884'],
 [b'4', b'4SpaceX (@SpaceX)', b'Followings96', b'Followers126948'],
 [b'5', b'5SamsungMobile (@SamsungMobile)', b'Followers12150120'],
 [b'6', b'6RockstarGames (@RockstarGames)', b'Followings1373', b'Followers11048983'],
 [b'7', b'7Starbucks Coffee (@Starbucks)', b'Followings93694', b'Followers11019118'],
 [b'8', b'8Victoria Secret (@VictoriasSecret)', b'Followings1203', b'Followers10821167'],
 [b'9', b'9Nintendo of America (@NintendoAmerica)', b'Followings227', b'Followers10718269'],
 [b'10', b'10Android (@Android)', b'Followings68', b'Followers10502286'],
 [b'Show more brand Twitter profiles']]
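
One possible reading of this output: after the tenth profile, the list ends with a single one-cell row whose text is a "show more" link. That would suggest the HTML returned by `requests.get` contains only the first ten profiles and that the remaining rows are loaded when the link is clicked, rather than the XPath parsing dropping rows. This is an assumption drawn from the printed output only and has not been confirmed against the site. A minimal sketch for checking it, where `split_rows` is a hypothetical helper and the sample `data` is abbreviated and tidied from the output above:

```python
def split_rows(data, expected_cells=4):
    """Separate full profile rows from any other rows (e.g. a 'show more' row)."""
    profiles = [row for row in data if len(row) == expected_cells]
    others = [row for row in data if len(row) != expected_cells]
    return profiles, others


# Abbreviated sample in the same shape as the printed list (assumed, not live data).
data = [
    [b'1', b'1PlayStation (@PlayStation)', b'Followings754', b'FollowersFollowers19190696'],
    [b'10', b'10Android (@Android)', b'Followings68', b'Followers10502286'],
    [b'Show more brand Twitter profiles'],
]

profiles, others = split_rows(data)
print(len(profiles), "profile rows;", len(others), "other rows:", others)
```

If the count of profile rows matches the number of rows visible in the raw page source, the parsing itself is probably fine and the missing rows are simply absent from the downloaded HTML.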