Unable to store information in CSV (Python web scraping)

python csv web-scraping


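My code is not storing the results correctly in the CSV file I created.

I need to extract the number, sponsor and party of each bill from the database.

When I run the code in the interpreter, it works properly and gives me the results I need. However, in the CSV file I created, I get one of the following problems: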

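1. The same sponsor for every bill (the bill numbers are correct, but all bills share the same sponsor).

   Interestingly, the name I find (Grijalva, Raul) corresponds to bill 7302.

2. Correct sponsors, but only every 100th bill, i.e. for every 100 sponsors I get 7402, 7302 and so on.
   As above, the sponsors and parties differ, but the bill number changes only once per 100 sponsor/party pairs and moves in steps of 100 (7402 for the first 100 pairs, 7302 for the second, and so on).

3. Correct sponsors but no bills, which is what happens with the following code.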

EDIT: If I put
Congress=[-]+[-]+[-]+[-]
in the first one named

    with open('115congress.csv', 'w') as f:
        fwriter=csv.writer(f, delimiter=';')
        fwriter.writerow(['SPONS', 'PARTY', 'NBILL'])
        BillN=[]
        Spons=[]
        Party=[]
        for j in range(1, 114):
            hrurl='https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page='+str(j)
            hrpage=requests.get(hrurl, headers=headers)
            data=hrpage.text
            soup=BeautifulSoup(data, 'lxml')
            for q in soup.findAll('span', {'class':'result-item'}):
                for a in q.findAll('a', href=True, text=True, target='_blank'):
                    secondindex=secondindex+1
                    if (secondindex/2).is_integer():
                        continue
                    Spons=a.text
                    print(Spons)
                    SPONS=Spons
                    if 'R' in Spons:
                        Party='Republican'
                    if 'D' in Spons:
                        Party='Democratic'
                    print(Party)
                    PARTY=Party
                    Congress115=[SPONS]+[PARTY]
                    fwriter.writerow(Congress115)
            for r in soup.findAll('span', {'class':'result-heading'}):
                index=index+1
                if (index/2).is_integer():
                    continue
                Bill=r.findNext('a')
                BillN=Bill.text
                print(BillN)
                NBILL=BillN
                Congress115= [SPONS]+[PARTY]+[NBILL]
                fwriter.writerow(Congress115)
    f.close()

How can I fix the code that writes to the CSV so that these problems go away?
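I don't understand all of the problems you describe, because I can't reproduce your errors. However, I think your code has some issues, and I'd like to show you another possible approach.

I think one of your main mistakes is writing the variables to the CSV file multiple times. In addition, you will get many wrong party entries if you only look for a single character in a string that contains both the party abbreviation and the name.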
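
The single-character pitfall can be illustrated with a short sketch. The sponsor label format ("Rep. Ellison, Keith [D-MN-5]") is assumed from the sample output shown later in this thread, and `party_of` and `PARTY_NAMES` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical lookup table; only the two parties used in this thread.
PARTY_NAMES = {"R": "Republican", "D": "Democratic"}

def party_of(sponsor: str) -> str:
    """Read the party letter from the bracketed suffix, e.g. "[D-MN-5]" -> "D"."""
    match = re.search(r"\[([A-Z]+)-", sponsor)
    if match is None:
        return "Unknown"
    return PARTY_NAMES.get(match.group(1), "Unknown")

# Assumed label format, taken from the sample output in this thread.
sponsor = "Rep. Ellison, Keith [D-MN-5]"

# A bare character test matches the "R" in "Rep.", so it is True for a Democrat.
print("R" in sponsor)

# Looking only inside the brackets picks the intended party.
print(party_of(sponsor))
```

Anything without a recognisable bracketed suffix falls back to "Unknown" instead of silently reusing the party of the previous row.
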
Assuming you want to extract BILL_NR, SPONS and PARTY from every entry, you could do the following (see the comments in the code):

    import csv
    import requests
    from bs4 import BeautifulSoup

    for j in range(1,114):
        hrurl=f'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page={j}'
        hrpage=requests.get(hrurl)
        data=hrpage.text
        soup=BeautifulSoup(data, 'html5lib')
        # get the main div, that contains all entries on the page
        main_div = soup.find('div', {'id':'main'})
        # every entry is within a <li> element
        all_li = main_div.findAll('li', {'class':'expanded'})
        # iterate over <li>-elements
        for li in all_li:
            # get BILL_NR
            bill_nr_raw = li.find('span', {'class':'result-heading'}).text
            # I assume only the first part is the Nr, so you could extract it with the following
            bill_nr = bill_nr_raw.split()[0]
            # get SPONS
            spons_raw = li.find('span', {'class':'result-item'})
            spons = spons_raw.find('a').text
            # get PARTY
            # check the bracketed affiliation (e.g. "[D-MN-5]"); sponsor names start
            # with a title such as "Rep.", so testing the start of the string would
            # not identify the party
            if '[R' in spons:
                party = 'Republican'
            elif '[D' in spons:
                party = 'Democratic'
            else:
                party = 'Unknown'
            # put all the information you extracted from this single entry (=<li>-element) into a list and write that list (=one row) to the csv file
            entry = [bill_nr, spons, party]
            with open('output.csv', 'a', newline='') as out_file:
                out = csv.writer(out_file)
                out.writerow(entry)

Note that the f-string used for the URL at the start of the main loop is only supported in Python 3.6 and later.

A better approach is to loop over a different element (for example <li>) and then find the elements you need inside it.

To get the cosponsors, you first need to test whether there are any by checking the number. If it is not 0, get the link to the subpage first. Request this subpage and parse it with a separate BeautifulSoup object. You can then parse the table containing the cosponsors and add every cosponsor to a list. You could add extra processing here if needed. The list is then joined into a single string so that it can be saved in a single column of the CSV file.

    from bs4 import BeautifulSoup
    import csv
    import requests
    import string

    headers = None

    with open('115congress.csv', 'w', newline='') as f:
        fwriter = csv.writer(f, delimiter=';')
        fwriter.writerow(['SPONS', 'PARTY', 'NBILL', 'TITLE', 'COSPONSORS'])

        for j in range(1, 3): #114):
            print(f'Getting page {j}')

            hrurl = 'https://www.congress.gov/search?q=%7B%22source%22%3A%22legislation%22%2C%22congress%22%3A%22115%22%2C%22type%22%3A%22bills%22%7D&page='+str(j)
            hrpage = requests.get(hrurl, headers=headers)
            soup = BeautifulSoup(hrpage.content, 'lxml')

            for li in soup.find_all('li', class_='expanded'):
                bill_or_law = li.span.text
                sponsor = li.find('span', class_='result-item').a.text
                title = li.find('span', class_='result-title').text
                nbill = li.find('a').text.strip(string.ascii_uppercase + ' .')

                if '[R' in sponsor:
                    party = 'Republican'
                elif '[D' in sponsor:
                    party = 'Democratic'
                else:
                    party = 'Unknown'

                # Any cosponsors?
                cosponsor_link = li.find_all('a')[2]

                if cosponsor_link.text == '0':
                    cosponsors = "No cosponsors"
                else:
                    print(f'Getting cosponsors for {sponsor}')

                    # Get the subpage containing the cosponsors
                    hr_cosponsors = requests.get(cosponsor_link['href'], headers=headers)
                    soup_cosponsors = BeautifulSoup(hr_cosponsors.content, 'lxml')
                    table = soup_cosponsors.find('table', class_="item_table")

                    # Create a list of the cosponsors
                    cosponsor_list = []

                    for tr in table.tbody.find_all('tr'):
                        cosponsor_list.append(tr.td.a.text)

                    # Join them together into a single string
                    cosponsors = ' - '.join(cosponsor_list)

                fwriter.writerow([sponsor, party, nbill, f'{bill_or_law} - {title}', cosponsors])

Giving you an output CSV file starting:

    SPONS;PARTY;NBILL;TITLE;COSPONSORS
    Rep. Ellison, Keith [D-MN-5];Democratic;7401;BILL - Strengthening Refugee Resettlement Act;No cosponsors
    Rep. Wild, Susan [D-PA-15];Democratic;7400;BILL - Making continuing appropriations for the Coast Guard.;No cosponsors
    Rep. Scanlon, Mary Gay [D-PA-7];Democratic;7399;BILL - Inaugural Fund Integrity Act;No cosponsors
    Rep. Foster, Bill [D-IL-11];Democratic;7398;BILL - SPA Act;No cosponsors
    Rep. Hoyer, Steny H. [D-MD-5];Democratic;7397;BILL - Making further continuing appropriations for fiscal year 2019, and for other purposes.;No cosponsors
    Rep. Torres, Norma J. [D-CA-35];Democratic;7396;BILL - Border Security and Child Safety Act;Rep. Vargas, Juan [D-CA-51]* - Rep. McGovern, James P. [D-MA-2]*
    Rep. Meadows, Mark [R-NC-11];Republican;7395;BILL - To direct the Secretary of Health and Human Services to allow delivery of medical supplies by unmanned aerial systems, and for other purposes.;No cosponsors
    Rep. Luetkemeyer, Blaine [R-MO-3];Republican;7394;"BILL - To prohibit the Federal financial regulators from requiring compliance with the accounting standards update of the Financial Accounting Standards Board related to current expected credit loss (""CECL""), to require the Securities and Exchange Commission to take into consideration certain impacts of a proposed accounting principle before accepting the principle, and for other purposes.";Rep. Budd, Ted [R-NC-13]*
    Rep. Faso, John J. [R-NY-19];Republican;7393;BILL - Medicaid Quality Care Act;No cosponsors
    Rep. Babin, Brian [R-TX-36];Republican;7392;BILL - TRACK Act;No cosponsors
    Rep. Arrington, Jodey C. [R-TX-19];Republican;7391;BILL - Rural Hospital Freedom and Flexibility Act of 2018;No cosponsors
    Rep. Jackson Lee, Sheila [D-TX-18];Democratic;7390;BILL - Violence Against Children
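
The post-processing steps (extracting the bill number, joining the cosponsors, writing one complete row per entry) can be tried without any network access. The following is a minimal sketch under stated assumptions: the label and names are hypothetical stand-ins for scraped values, `bill_number` is a helper invented for this example, and `io.StringIO` replaces the real output file. Note that `strip(string.ascii_uppercase + ' .')` only works while the label around the number consists of uppercase letters, dots and spaces; a label containing lowercase letters, such as a hypothetical "H.Res.1180", would come out as "es.1180", which is why the sketch uses a regular expression instead.

```python
import csv
import io
import re

def bill_number(label: str) -> str:
    """Return the trailing run of digits in a bill label, or '' if there is none."""
    match = re.search(r"(\d+)\s*$", label)
    return match.group(1) if match else ""

# Hypothetical values standing in for what the scraper extracts from one entry.
label = "H.R.7396"
sponsor = "Rep. Torres, Norma J. [D-CA-35]"
cosponsor_list = ["Rep. Vargas, Juan [D-CA-51]*", "Rep. McGovern, James P. [D-MA-2]*"]

# Join the cosponsors into one string so that they occupy a single column.
cosponsors = " - ".join(cosponsor_list) if cosponsor_list else "No cosponsors"

# Write exactly one complete row per entry; StringIO stands in for the real file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";")
writer.writerow([sponsor, "Democratic", bill_number(label), cosponsors])
print(buffer.getvalue())
```

Because every value for an entry is collected before the single `writerow` call, a sponsor cannot end up paired with the bill number of a different entry.
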