Python: Confused about splitting HTML and reading an HTML table from a website


python html twitter web-crawler


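I have been trying to read user account data from a website called Socialbakers, which aggregates social media account data. I have been following some help I found, but I can never get the full list of 50 users; I only get 10 of the 50. I tried changing how the table contents are appended to the list, but it still does not work properly and only retrieves the first 10.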

The page I am scraping is on Socialbakers; the table is located under Twitter profile statistics.

The code I am using for the crawl:

        try:
            print('getting html content for url: %s' % url)
            page = requests.get(url)
            tree = html.fromstring(page.text)
            table = tree.xpath('//table[@class="brand-table-list"]')[0]
            data = [[text(td) for td in tr.xpath('td')] for tr in table.xpath('//tr')]
            ids = table.xpath('//a[@class="acc-placeholder-img"]')
            print(data) 
            name_id = {}
        
            for uid in ids:
                name_id[uid.attrib['href'].split('/')[-1].split('-')[1]] = uid.attrib['href'].split('/')[-1].split('-')[0]
            for row in data:
                print(row)
                followings = None
                followers = None
                name = None
                uid = None
                user_country = None
                
                if len(row) == 4: 
                    '''obtaining name and id'''
                    name = row[1].split()[-1][2:-1]
                    #print(name)
                    name = name.decode("utf-8")
                    #print(name)
                    uid = name_id[name.lower()]
                    #print(uid)
                    '''Obtaining user country'''
                    status_code = 0
                    while status_code != 200:
                        new_url = 'http://www.socialbakers.com/statistics/twitter/profiles/detail/' + uid + '-' + name
                        while True:
                            try:
                                print("hnere")
                                new_page = requests.get(new_url)
                                break
                            except Exception:
                                logging.info('sleeping for 5 seconds...')
                                time.sleep(5)
                                continue
                        #print("hnere")
                        status_code = new_page.status_code
                        if status_code != 200:
                            logging.error(status_code)
                            delay = np.random.rand() * 5
                            time.sleep(delay)
                    new_tree = html.fromstring(new_page.text)
                    tag_list = new_tree.xpath('//div[@class="account-tag-list"]')
                    tags = tag_list[0].text_content().split()
                    for tag in tags:
                        if 'GLOBAL' in tags:
                            user_country = 'GLOBAL'
                        elif len(tag) == 2:
                            user_country = tag
                    '''obtaining number of followers and friends'''
                    try:
                        followings = int(row[2][10:])
                    except ValueError:
                        try:
                            followings = int(row[2][20:])
                        except Exception:
                            print('error occurred for: %s' % row)
                    try:
                        followers = int(row[3][9:])
                    except ValueError:
                        try:
                            followers = int(row[3][18:])
                        except Exception:
                            print('error occurred for: %s' % row)

                '''writing to file'''
                print('followings:')   
                print(followings)   
                print('followers:')    
                print(followers)    
                print('name:')  
                print(name)   
                print('user_country:')   
                print(user_country)    
                if (followings is not None) and (followers is not None) and (name is not None) and (
                    user_country is not None):
                    fp.write(uid + '\t' + str(name) + '\t' + str(followings) + '\t' + str(followers) + '\t' + str(
                        user_country) + '\n')

                delay = np.random.rand()
        except Exception: 
            print('error occurred for a url, sleeping for 10 seconds...')
            delay = np.random.rand() * 10
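
As a side note on the follower and following parsing above: the fixed slices (`row[2][10:]`, `row[3][18:]` and so on) depend on the exact length of the label in each cell. A minimal sketch of an alternative, assuming each cell ends with the number and that the label may be repeated, is to pull out the trailing digits with a regular expression. The helper name `parse_count` is hypothetical and is not part of the code in the question.

```python
import re

# Matches a run of digits (optionally with thousands separators) at the
# very end of the cell text, ignoring trailing whitespace.
_TRAILING_NUMBER = re.compile(r'(\d[\d,]*)\s*$')


def parse_count(cell):
    """Return the integer at the end of a table cell, or None if there is none.

    Hypothetical helper: accepts either bytes or str, since the printed
    rows in the question contain bytes objects.
    """
    if isinstance(cell, bytes):
        cell = cell.decode('utf-8')
    match = _TRAILING_NUMBER.search(cell)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))


# Cell values taken from the printed list in the question.
print(parse_count(b'followings754'))
print(parse_count(b'FollowersFollowers19190696'))
print(parse_count(b'followers 11048983'))
```

With this approach a repeated label or an extra space no longer needs its own `try`/`except` branch, although it still assumes the count is the last thing in the cell.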

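Judging from the list I printed of the data retrieved from the table, it ends with Android rather than @NetflixLat, which is in 50th place. Is there a mistake in how I am parsing the HTML page? I am new to this, so any help is welcome, and if I have been vague I will edit the question accordingly.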

The list of data retrieved from the table:

[[b'1', b'1PlayStation (@PlayStation)', b'followings754', b'FollowersFollowers19190696'],
 [b'2', b'2Xbox (@Xbox)', b'followings16649', b'Followers14339548'],
 [b'3', b'3CHANEL (@CHANEL)', b'followings1', b'Followers13110884'],
 [b'4SpaceX (@SpaceX)', b'followings96', b'followers126948'],
 [b'5', b'5SamsungMobile (@SamsungMobile)', b'followers 12150120'],
 [b'6', b'6RockstarGames (@RockstarGames)', b'followings 1373', b'followers 11048983'],
 [b'7', b'7Starbucks Coffee (@Starbucks)', b'followings 93694', b'followers 11019118'],
 [b'8', b'Victoria Secret (@VictoriasSecret)', b'followings 1203', b'followers 10821167'],
 [b'9', b'9Nintendo of America (@Nintendo America)', b'followings 227', b'followers 10718269'],
 [b'10', b'10Android (@Android)', b'followings 68', b'followers 10502286'],
 [b'Show more brands Twitter profiles']]
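
The last entry in the printed list is a "show more" row rather than a profile. That hints, although nothing in the question confirms it, that the HTML returned by `requests.get` may contain only the first 10 rows, with the rest loaded by the browser afterwards. A minimal sketch for checking this is to count the rows in the raw HTML before doing any further parsing. It uses only the standard library `html.parser`, and the `sample` markup is invented for illustration; with the real page one would pass `page.text` instead. The names `RowCounter` and `count_rows` are hypothetical.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Records how many <td> cells each <tr> in the document contains."""

    def __init__(self):
        super().__init__()
        self.in_row = False
        self.cells_in_row = 0
        self.rows = []  # one entry per <tr>: its number of <td> cells

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.in_row = True
            self.cells_in_row = 0
        elif tag == 'td' and self.in_row:
            self.cells_in_row += 1

    def handle_endtag(self, tag):
        if tag == 'tr' and self.in_row:
            self.rows.append(self.cells_in_row)
            self.in_row = False


def count_rows(html_text, cells_per_profile=4):
    """Return (profile_rows, other_rows) for the given HTML text.

    A row counts as a profile row when it has exactly cells_per_profile
    cells, mirroring the `len(row) == 4` check in the question's code.
    """
    parser = RowCounter()
    parser.feed(html_text)
    parser.close()
    profile_rows = sum(1 for n in parser.rows if n == cells_per_profile)
    other_rows = len(parser.rows) - profile_rows
    return profile_rows, other_rows


# Invented sample standing in for page.text.
sample = """
<table class="brand-table-list">
  <tr><td>1</td><td>PlayStation</td><td>754</td><td>19190696</td></tr>
  <tr><td>2</td><td>Xbox</td><td>16649</td><td>14339548</td></tr>
  <tr><td colspan="4">Show more</td></tr>
</table>
"""
print(count_rows(sample))  # (2, 1)
```

If the real page reports 10 profile rows here, the missing 40 are not in the downloaded HTML at all, and no change to the XPath expressions would recover them.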