
How do I scrape data correctly with Python and BS4?


This is the desired output, a CSV file with two rows:

1639, 06/05/17, 08,09,16,26,37,50
1639, 06/05/17, 13,28,32,33,37,38
Today I only get this, and I use VBA Excel code to clean/organize the data:

08,09,16,26,37,50
13,28,32,33,37,38
The "1639, 06/05/17" in the first row comes from "Resultado Concurso 1639 06/05/2017", and "08,09,16,26,37,50" comes from the tags shown below:

<ul class="numbers dupla-sena">
<h6>1º sorteio</h6>
<li>08</li><li>09</li><li>16</li><li>26</li><li>37</li><li>50</li>    
</ul>
Using the command below, I think I can get everything I want, but I don't know how to extract the data the way I need:

ltr.findAll("div",{"class":"content-section section-text with-box no-margin-bottom"})
So I tried another way to get the values from the "1º sorteio da dupla sena":

print('-----------------dupla-sena 1º sorteio-----------------------------')
d1 = ltr.findAll("ul",{"class":"numbers dupla-sena"})[0].text.strip()
print(ltr.findAll("ul",{"class":"numbers dupla-sena"})[0].text.strip())
Output 1:

1º sorteio
080916263750
To split that into two-digit numbers:

d1 = '0' + d1 if len(d1) % 2 else d1    # pad to an even length, just in case
gi = [iter(d1)] * 2                     # two references to one iterator
r = [''.join(dz1) for dz1 in zip(*gi)]  # zip pulls two characters at a time
d3 = ",".join(r)
Result:

08,09,16,26,37,50
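As an aside, the same pairing idiom can be wrapped in a small reusable helper (a sketch; chunk_pairs is a hypothetical name, not from the original code):

def chunk_pairs(s):
    """Split '080916263750' into ['08', '09', '16', '26', '37', '50']."""
    s = '0' + s if len(s) % 2 else s   # pad to an even length
    it = [iter(s)] * 2                 # two references to one iterator
    return [''.join(pair) for pair in zip(*it)]

print(",".join(chunk_pairs('080916263750')))  # 08,09,16,26,37,50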
The same goes for the second extraction:

print('-----------------dupla-sena 2º sorteio-----------------------------')
dd1 = ltr.findAll("ul",{"class":"numbers dupla-sena"})[1].text.strip()
print(ltr.findAll("ul",{"class":"numbers dupla-sena"})[1].text.strip())
Output 2:

2º sorteio
132832333738
To split that into two-digit numbers:

dd1 = '0' + dd1 if len(dd1) % 2 else dd1    # pad to an even length, just in case
gi = [iter(dd1)] * 2
r1 = [''.join(ddz1) for ddz1 in zip(*gi)]
dd3 = ",".join(r1)
Then we have:

13,28,32,33,37,38
Saving the data to a CSV file:

f = open('output.csv', 'w')   # the open() call is missing in the original; 'output.csv' is a placeholder name
f.write(d3 + ',' + dd3 + '\n')
f.close()
Output, the CSV file in the current directory:

01,º ,so,rt,ei,o
,08,09,16,26,37,50,02,º ,so,rt,ei,o
,13,28,32,33,37,38
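The header text is mangled into the digits because .text.strip() also captures the "1º sorteio" heading, so the two-character chunking splits it along with the numbers (and the odd total length triggers the leading '0' pad). For the file writing itself, the csv module would avoid the hand-joined strings; a minimal sketch, assuming d3 and dd3 hold the comma-joined number strings ('output.csv' again a placeholder name):

import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(d3.split(','))   # one draw per row
    writer.writerow(dd3.split(','))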
I could work with the method/output above, but then I'd still have to use VBA in Excel to handle this messy data, and I'm trying to avoid VBA code. I'm actually more interested in learning Python and making more and more use of this powerful tool. With this solution I've only achieved one part of what I want, namely:

08,09,16,26,37,50
13,28,32,33,37,38
But, as noted above, the desired output is:

1639, 06/05/17, 08,09,16,26,37,50
1639, 06/05/17, 13,28,32,33,37,38
I'm using Python 3.6.1 on Mac OS X Yosemite 10.10.5, in a Jupyter notebook.

How can I achieve this? I don't know how to extract "1639, 06/05/17" and put it into the CSV file. And is there a better way to extract the six numbers 08,09,16,26,37,50 and 13,28,32,33,37,38, without the code below and without VBA?

To split into two-digit numbers:

d1 = '0'+ d1 if len(d1)%2 else d1
gi = [iter(d1)]*2
r = [''.join(dz1) for dz1 in zip(*gi)]
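For the record, one way to avoid the chunking entirely is to read each <li> on its own instead of splitting the joined text; the accepted approach below does exactly this (a sketch, assuming ltr as above):

ul = ltr.findAll("ul", {"class": "numbers dupla-sena"})[0]
print(",".join(li.text for li in ul.findAll("li")))  # 08,09,16,26,37,50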
UPDATED QUESTION / UPDATE 2: The 2º sorteio row is still missing. I know I'm doing something wrong, because both the 1º sorteio and the 2º sorteio are present in rows:

print(rows[0]) --> {'sena': '1643', 'data': '16/05/2017', 'numero1': 1, 'numero2': 21, 'numero3': 22, 'numero4': 43, 'numero5': 47, 'numero6': 50}

print(rows[1]) --> {'sena': '1643', 'data': '16/05/2017', 'numero1': 3, 'numero2': 4, 'numero3': 9, 'numero4': 19, 'numero5': 21, 'numero6': 26}
But when I try to store them in the CSV, only one of the rows shows up. I'm trying to figure out how to include the missing one. Maybe I'm wrong, but I think both the 1º sorteio and the 2º sorteio should end up in the file; the code doesn't confirm that, though, as we can see here:

print(row_dict)
{'sena': '1643', 'data': '16/05/2017', 'numero1': 3, 'numero2': 4, 'numero3': 9, 'numero4': 19, 'numero5': 21, 'numero6': 26}

I can't see what I'm doing wrong. I know this answer is taking a lot of your time, but I've learned a lot from you in the process, and I've already used some of the tools I picked up from you, such as re, comprehensions, dict, and zip.

Disclaimer: I'm not too familiar with Beautiful Soup; I usually use lxml. That said:

soup = BeautifulSoup(response.text)  # <-- edit showing how i assigned soup
pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)
if match:
    concurso, data = match.groups()
    nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
    numeros = []
    for i in nums:
        numeros.append(','.join(j.text for j in i.findAll('li')))
    rows = []
    for n in numeros:
        rows.append(','.join([concurso, data, n]))

print(rows)
['1639,06/05/2017,08,09,16,26,37,50', '1639,06/05/2017,13,28,32,33,37,38']
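To see what the regular expression captures, here is a standalone check (a sketch; the sample string assumes the page heading reads "Resultado Concurso 1639 (06/05/2017)" with the date in parentheses, which is what the pattern's \( ... (?=\)) pair expects):

import re

pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
m = pat.search('Resultado Concurso 1639 (06/05/2017)')
print(m.group('concurso'), m.group('data'))  # 1639 06/05/2017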
This part is a bit convoluted:

rows = [
    dict(zip(
        field_names,
        [concurso, data, *[int(num.text) for num in group.findAll('li')]]
    ))
    for group in nums]
Let me write it another way:

rows = []
# then add the numbers
# nums is all the `ul` list elements contains the drawing numbers
for group in nums:
    # start each row with the shared concurso, data elements
    row = [concurso, data]
    # for each `ul` get all the `li` elements containing the individual number
    for num in group.findAll('li'):
        # add each number
        row.append(int(num.text))
    # get [('sena', '1234'), ('data', '12/13/2017'), ...]
    row_title_value_pairs = zip(field_names, row)
    # turn into dict {'sena': '1234', 'data': '12/13/2017', ...}
    row_dict = dict(row_title_value_pairs)
    rows.append(row_dict)
    # or just write the csv here instead of appending to rows and re-looping over the values
    ...
UPDATE 2: One thing I hope you take away from this is to use print statements while you are learning, so that you can understand what the code is doing. I won't make the corrections, but I'll point them out and add a print statement at every place where something significant changes:

match = pat.search(concurso_e_data)


# everything should be indented under this block since 
# if there is no match then none of the below code should run
if match:  
    concurso, data = match.groups()
    nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
    num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')  # i.e. ['numero1', ..., 'numero6']
    field_names = ['sena', 'data', *num_headers]

# PROBLEM 1
# all this should be indented into the `if match:` block above
# none of this should run if there is no match
# you cannot build the rows without the match for sena and data
# Let's add some print statements to see whats going on
rows = []
for group in nums:
    # here each group is a full `sena` row from the site
    print('Pulling sena: ', group.text)
    row = [concurso, data]
    print('Adding concurso + data to row: ', row)
    for num in group.findAll('li'):
        row.append(int(num.text))
        print('Adding {} to row.'.format(num))
    print('Row complete: ', row)
    row_title_value_pairs = zip(field_names, row)
    print('Transform row to header, value pairs: ', row_title_value_pairs)  # prints a zip object in Python 3
    row_dict = dict(row_title_value_pairs)
    print('Row dictionary: ', row_dict)
    rows.append(row_dict)
    print('Rows: ', rows)

    # PROBLEM 2
    # It would seem that you've confused this section when switching
    # out the original list comprehension with the more explicit 
    # for loop in building the rows.
# The below block should be indented to this level:
# still under the `if match:`, but out of the
# `for group in nums:` loop above

    # the below block loops over rows, but you are still building
    # the rows in the for loop
    # you are effectively double looping over the values in `row`
    with open('file_v5.csv', 'w', encoding='utf-8') as csvfile:
        csv_writer = csv.DictWriter(
            csvfile,
            fieldnames=field_names,
            dialect='excel',
            extrasaction='ignore', # drop extra fields not in field_names; not necessary, but just in case
            quoting=csv.QUOTE_NONNUMERIC  # quote anything thats not a number, again just in case
        )
        csv_writer.writeheader()
        # this is where you are looping extra because this block is in the `for` loop mentioned in my above notes
        for row in rows:
            print('Adding row to CSV: ', row)
            csv_writer.writerow(row_dict)  # <- note: this writes `row_dict`, not the loop's `row`
Run this and look at what the print statements show you. But read the comments too, because if there is no sena/data match it will cause an error.

Tip: fix the indentation, then add an else: print('no sena, data match!') clause at the very end of the if match: block, as sketched below... but run it first and check what it prints.
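A minimal sketch of that tip:

if match:
    concurso, data = match.groups()
    # ... build and write the rows here ...
else:
    print('no sena, data match!')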

Posted on behalf of the OP:

With your help we made it! The output is exactly the way I need it! I changed the output to:

sena;data;numero1;numero2;numero3;numero4;numero5;numero6
1644;18/05/2017;4;6;31;39;47;49
1644;18/05/2017;20;37;44;45;46;50
So, because of how the ',' separator behaves in Excel, I decided to change it to ';', and now the file opens into columns in Excel without any problem:

import requests
from bs4 import BeautifulSoup  
import re
import csv

url = 'http://loterias.caixa.gov.br/wps/portal/loterias/landing/duplasena/'

r = requests.get(url)

soup = BeautifulSoup(r.text, "lxml")  ## "lxml" to avoid the warning

pat = re.compile(r'(?i)(?<=concurso)\s*(?P<concurso>\d+)\s*\((?P<data>.+?)(?=\))')
concurso_e_data = soup.find(id='resultados').h2.span.text
match = pat.search(concurso_e_data)

if match:  
    concurso, data = match.groups()
    nums = soup.find_all("ul", {"class": "numbers dupla-sena"})
    num_headers = (','.join(['numero%d']*6) % tuple(range(1,7))).split(',')
    field_names = ['sena', 'data', *num_headers]

    rows = []
    for group in nums:
        row = [concurso, data]
        for num in group.findAll('li'):
            row.append(int(num.text))
        row_title_value_pairs = zip(field_names, row)
        row_dict = dict(row_title_value_pairs)
        rows.append(row_dict)

    with open('ds_v10.csv', 'w', encoding='utf-8') as csvfile:
        csv_writer = csv.DictWriter(
            csvfile,
            fieldnames=field_names,
            dialect='excel',
            delimiter=';',  # to handle the column issue in Excel!
        )
        csv_writer.writeheader()
        csv_writer.writerow(rows[0])
        csv_writer.writerow(rows[1])
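One small follow-up: DictWriter also has a writerows() method, so the two writerow() calls could be collapsed into one and would keep working if the page ever carried more than two draws:

csv_writer.writerows(rows)  # writes every row dict in one call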

Hi, thanks for the help! I tried it and got this message: TypeError: find() missing 1 required positional argument: 'self'

@fabio You can't just give me the error without the traceback. But I suspect the problem: my soup variable and yours are defined differently. I've added an edit above showing how I assigned soup!

This is certainly a great solution. But with my approach the CSV data goes into a file laid out as: concurso in 1 column, data do sorteio in 1 column, and the result of the draw (6 dezenas) in 6 columns: file = 's_stack.csv'; with open(file, 'a') as f: writer = csv.writer(f); writer.writerows(...). The fault was mine in stating the problem; maybe I wasn't clear that I need the six numbers in six columns. But I couldn't concatenate them. See the traceback: filename = 'ds_stack_2.csv'; f = open(filename, 'w'); f.write(rows + '\n'); f.close() --> TypeError: can only concatenate list (not "str") to list

@fabio You don't need the '\n': f.write(rows + '\n') -> f.write(rows). The problem is that rows is of type list and '\n' is of type string.

Added a solution with explanations. The problem here is that you copied and pasted different parts of the code without taking the time to understand what they do. I tried to write on each line what it does, but if you don't take the time to read, these errors happen. Check my second update.
You've both done well here, persevering to get the required result; good work. However, this post probably has too many questions crammed into it, and a Stack Exchange principle is to keep questions simpler than this. Firstly, Fabio, answering follow-ups in comments is very friendly, but most answerers would rightly ask you to post a new question; complex problems can be broken down so that new readers don't have to wade through all the earlier stages to understand the current one. Multi-stage questions in a single post also tend to be very specific to the asker, and so may not have much future applicability. Thanks @halfer for the help! Much better now!