Pandas and BeautifulSoup scraper creates new duplicate rows


I built a scraper for the PolitiFact website, a non-profit organization that rates the truthfulness of US news. I have collected 18066 URLs to news ratings, from which I want to extract some information. Somehow, after the code finished running, I ended up with a total of 41666 news ratings stored in a .csv file, with some duplicates among them. Here is my code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

#Get links from CSV file
urls = []
with open(f'{PATH}ULR_list.csv', 'r') as f:
    urls = f.read().split(',')

num_urls = len(urls)
print(f'There is total of {num_urls} URLS')

#Final DataFrame
df  = pd.read_csv(f'{PATH}politifactDataset.csv')
count = 0

for u in urls:
    dic = {}

    #Construct Link
    link = 'https://www.politifact.com'+u.replace('"','')

    #Request
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'html.parser')

    #Get Title
    title_ = soup.select('.m-statement__content .m-statement__quote')
    title = title_[0].text.replace('\n','')
    dic['Title'] = title

    #Get Tags
    tags_ = soup.select('.m-list.m-list--horizontal .m-list__item')
    tags = ''
    for t in tags_:
        tags += t.select_one('span').text + ','
    dic['Tags'] = tags

    #Get Author
    author_ = soup.select(".m-statement__author .m-statement__name")
    author = author_[0].text.replace('\n','')
    dic['Author'] = author

    #Get Rating
    rating_ = soup.select(".m-statement__body .m-statement__meter [alt]")
    rating = rating_[0]['alt']
    dic['Rating'] = rating

    #Save into DataFrame
    df = df.append(dic, ignore_index=True)

    
    count += 1
    if count%100 == 0:
        #time.sleep(3)
        percentage = (count/num_urls)*100
        print(f'{percentage}%')
        df.to_csv('politifactDataset.csv', index = False)

df.to_csv('politifactDataset.csv', index = False)
There is total of 18066 URLS

And if I open the .csv file:

df = pd.read_csv('politifactDataset.csv')
length = len(df)
print(f'There is a total of {length} rows')
There is a total of 41666 rows
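To see how many of those 41666 rows are exact copies, pandas can count them directly. A small sketch with made-up data (the real frame would use the script's Title/Tags/Author/Rating columns):

```python
import pandas as pd

# Hypothetical stand-in for the scraped dataset, with one repeated row.
df = pd.DataFrame({'Title': ['a', 'a', 'b'],
                   'Rating': ['True', 'True', 'False']})

print(df.duplicated().sum())      # number of rows that repeat an earlier row
print(len(df.drop_duplicates()))  # number of unique rows
```

`duplicated()` marks every occurrence after the first, so its sum tells you exactly how many rows `drop_duplicates()` would remove.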


I know I could use df.drop_duplicates, but some duplicates would still remain. My question is: where are the duplicate rows coming from?

Here is the final code. I solved the problem by saving the DataFrame only once, after all the URLs had been scraped:

import pandas as pd
import requests
from bs4 import BeautifulSoup

#Get links from CSV file
urls = []
with open(f'{PATH}ULR_list.csv', 'r') as f:
    urls = f.read().split(',')

num_urls = len(urls)
print(f'There is total of {num_urls} URLS')

#Final DataFrame
df  = pd.DataFrame(columns=['Title','Tags','Author','Rating'])

for u in urls:
    dic = {}

    #Construct Link
    link = 'https://www.politifact.com'+u.replace('"','')

    #Request
    r = requests.get(link)
    soup = BeautifulSoup(r.content, 'html.parser')

    #Get Title
    title_ = soup.select('.m-statement__content .m-statement__quote')
    title = title_[0].text.replace('\n','')
    dic['Title'] = title

    #Get Tags
    tags_ = soup.select('.m-list.m-list--horizontal .m-list__item')
    tags = ''
    for t in tags_:
        tags += t.select_one('span').text + ','
    dic['Tags'] = tags

    #Get Author
    author_ = soup.select(".m-statement__author .m-statement__name")
    author = author_[0].text.replace('\n','')
    dic['Author'] = author

    #Get Rating
    rating_ = soup.select(".m-statement__body .m-statement__meter [alt]")
    rating = rating_[0]['alt']
    dic['Rating'] = rating

    #Save into DataFrame
    df = df.append(dic, ignore_index=True)

    
    if len(df)%1000 == 0:
        percentage = (len(df)/num_urls)*100
        print(f'{percentage}%')

df.to_csv('politifactDataset.csv', index = False)
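As a side note, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, and appending row by row copies the frame on every iteration. A common alternative is to collect the dicts in a list and build the DataFrame once at the end. A sketch of that pattern (the loop body below is a hypothetical stand-in for the scraping code above):

```python
import pandas as pd

# Collect each scraped record as a plain dict...
records = []
for i in range(3):  # stand-in for `for u in urls:`
    dic = {'Title': f'claim {i}', 'Tags': 'economy,',
           'Author': 'Someone', 'Rating': 'True'}
    records.append(dic)

# ...and build the DataFrame in one shot at the end.
df = pd.DataFrame(records, columns=['Title', 'Tags', 'Author', 'Rating'])
print(df.shape)  # (3, 4)
```

Building the frame once is both faster and avoids any interaction between partial saves and appends.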

Can you update your question with the contents of ULR_list.csv?

I found the solution. Thank you for your attention!