Python: Confused about splitting HTML and reading an HTML table from a website

Tags: python, html, twitter, web-crawler


I have been trying to read user account data from a website called Socialbakers, which aggregates social media account statistics. I have been following the help from , but I can never seem to get the full list of 50 users; I only get 10 of the 50. I have tried modifying the step that appends the table contents to a list, but it still does not work correctly and only retrieves the first 10.

The website I am scraping is: (the table is under the Twitter profile statistics section).

The code I am using for crawling:

# Assumes url, fp (an open output file) and text() (a helper that extracts
# a table cell's content) are defined earlier in the script.
import logging
import time

import numpy as np
import requests
from lxml import html

try:
    print('getting html content for url: %s' % url)
    page = requests.get(url)
    tree = html.fromstring(page.text)
    table = tree.xpath('//table[@class="brand-table-list"]')[0]
    # Note: a leading '//' searches the whole document; './/' restricts
    # the search to descendants of this table.
    data = [[text(td) for td in tr.xpath('td')] for tr in table.xpath('.//tr')]
    ids = table.xpath('.//a[@class="acc-placeholder-img"]')
    print(data)
    name_id = {}

    for uid in ids:
        slug = uid.attrib['href'].split('/')[-1]          # e.g. '123-somebrand'
        name_id[slug.split('-')[1]] = slug.split('-')[0]  # name -> numeric id

    for row in data:
        print(row)
        followings = None
        followers = None
        name = None
        uid = None
        user_country = None

        if len(row) == 4:
            # obtaining name and id
            name = row[1].split()[-1][2:-1]
            name = name.decode('utf-8')  # cells are bytes; decode to str
            uid = name_id[name.lower()]

            # obtaining user country
            status_code = 0
            while status_code != 200:
                new_url = ('http://www.socialbakers.com/statistics/twitter/'
                           'profiles/detail/' + uid + '-' + name)
                while True:
                    try:
                        new_page = requests.get(new_url)
                        break
                    except Exception:
                        logging.info('sleeping for 5 seconds...')
                        time.sleep(5)
                status_code = new_page.status_code
                if status_code != 200:
                    logging.error(status_code)
                    time.sleep(np.random.rand() * 5)

            new_tree = html.fromstring(new_page.text)
            tag_list = new_tree.xpath('//div[@class="account-tag-list"]')
            tags = tag_list[0].text_content().split()
            for tag in tags:
                if tag == 'GLOBAL':
                    user_country = 'GLOBAL'
                elif len(tag) == 2:  # two-letter country code
                    user_country = tag

            # obtaining number of followings and followers
            try:
                followings = int(row[2][10:])
            except ValueError:
                try:
                    followings = int(row[2][20:])
                except Exception:
                    print('error occurred for: %s' % row)
            try:
                followers = int(row[3][9:])
            except ValueError:
                try:
                    followers = int(row[3][18:])
                except Exception:
                    print('error occurred for: %s' % row)

        # writing to file
        print('followings:')
        print(followings)
        print('followers:')
        print(followers)
        print('name:')
        print(name)
        print('user_country:')
        print(user_country)
        if (followings is not None and followers is not None
                and name is not None and user_country is not None):
            fp.write(uid + '\t' + str(name) + '\t' + str(followings) + '\t'
                     + str(followers) + '\t' + str(user_country) + '\n')

        time.sleep(np.random.rand())
except Exception:
    print('error occurred for a url, sleeping for 10 seconds...')
    time.sleep(np.random.rand() * 10)
Judging from the printed list of data retrieved from the table, it ends with Android rather than @NetflixLat, which is in 50th place. Is there an error in how I am parsing the HTML page? I am new to this, so any help is welcome; if I am being vague I will edit accordingly.
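One parsing pitfall worth checking in code like this: with lxml, calling `table.xpath('//tr')` with a leading `//` restarts the search at the document root, not inside `table`; the relative form `.//tr` stays within the element. A minimal self-contained sketch (the two-table HTML here is invented for illustration) shows the difference:

```python
from lxml import html

# Hypothetical page with two tables, to show the scoping difference.
doc = html.fromstring("""
<html><body>
  <table class="brand-table-list">
    <tr><td>row in target table</td></tr>
  </table>
  <table class="other">
    <tr><td>row A</td></tr>
    <tr><td>row B</td></tr>
  </table>
</body></html>
""")

table = doc.xpath('//table[@class="brand-table-list"]')[0]
absolute = table.xpath('//tr')    # '//' restarts at the document root
relative = table.xpath('.//tr')   # '.' anchors the search to this table

print(len(absolute))  # 3 -- picks up rows from BOTH tables
print(len(relative))  # 1 -- only rows inside the target table
```

With only one matching table on the page the two forms happen to return the same rows, which is why this kind of bug can stay hidden until the page layout changes.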

The data list retrieved from the table:
[[b'1', b'1 PlayStation (@PlayStation)', b'Followings754', b'Followers19190696'], [b'2', b'2 Xbox (@Xbox)', b'Followings16649', b'Followers14339548'], [b'3', b'3 CHANEL (@CHANEL)', b'Followings1', b'Followers13110884'], [b'4', b'4 SpaceX (@SpaceX)', b'Followings96', b'Followers126948'], [b'5', b'5 SamsungMobile (@SamsungMobile)', b'Followers 12150120'], [b'6', b'6 RockstarGames (@RockstarGames)', b'Followings 1373', b'Followers 11048983'], [b'7', b'7 Starbucks Coffee (@Starbucks)', b'Followings 93694', b'Followers 11019118'], [b'8', b'Victoria Secret (@VictoriasSecret)', b'Followings 1203', b'Followers 10821167'], [b'9', b'9 Nintendo of America (@NintendoAmerica)', b'Followings 227', b'Followers 10718269'], [b'10', b'10 Android (@Android)', b'Followings 68', b'Followers 10502286'], [b'Show More Brand Twitter Profiles']]
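A side note on the count parsing: the fixed-offset slicing (`row[2][10:]`, then `row[2][20:]` as a fallback) breaks whenever the label length or whitespace changes, which matches the cells above where the number sometimes follows a space. A more forgiving sketch (`cell_to_int` is a hypothetical helper, with sample cells mirroring the printed output) pulls out the first run of digits instead:

```python
import re

def cell_to_int(cell):
    """Extract the first run of digits from a bytes/str table cell,
    e.g. b'Followings754' -> 754; returns None if no digits are found."""
    if isinstance(cell, bytes):
        cell = cell.decode('utf-8', errors='replace')
    match = re.search(r'\d+', cell)
    return int(match.group()) if match else None

print(cell_to_int(b'Followings754'))       # 754
print(cell_to_int(b'Followers 19190696'))  # 19190696
print(cell_to_int(b'no digits here'))      # None
```

This would not fix the missing 40 rows by itself, but it removes one source of the ValueError fallbacks in the original loop.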