
Python: Converting an HTML table to a CSV file


How can I convert a table like this into a CSV file using Python and BeautifulSoup?

I want the first header, the one that reads Rk, Gcar, Gtm, etc., and none of the other headers in the table (there is one for every month of the season).

Here is the code I have so far:

from bs4 import BeautifulSoup
from urllib2 import urlopen
import csv

def stir_the_soup():
    player_links = open('player_links.txt', 'r')
    player_ID_nums = open('player_ID_nums.txt', 'r')
    id_nums = [x.rstrip('\n') for x in player_ID_nums]
    idx = 0
    for url in player_links:
        print url
        soup = BeautifulSoup(urlopen(url), "lxml")
        p_type = ""
        if url[-12] == 'p':
            p_type = "pitching"
        elif url[-12] == 'b':
            p_type = "batting" 
        table = soup.find(lambda tag: tag.name=='table' and tag.has_attr('id') and tag['id']== (p_type + "_gamelogs"))
        header = [[val.text.encode('utf8') for val in table.find_all('thead')]]
        rows = []
        for row in table.find_all('tr'):
            rows.append([val.text.encode('utf8') for val in row.find_all('th')])
            rows.append([val.text.encode('utf8') for val in row.find_all('td')])
        with open("%s.csv" % id_nums[idx], 'wb') as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(row for row in rows if row)
        idx += 1
    player_links.close()

if __name__ == "__main__":
    stir_the_soup()
The id_nums list holds all of the ID numbers for each player, which are used as the names of the separate CSV files.

For each row, the leftmost cell is a th tag and the rest of the row is td tags. How do I get those into a single row, apart from the headers?
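One way to sketch the combined-row idea: ask find_all for both tag names at once, so the leading th cell and the td cells of each tr come back as one list in document order. The HTML below is a small stand-in for the game-log table, not the real page:

```python
from bs4 import BeautifulSoup

# Minimal sketch: merge the leading <th> cell with the <td> cells so
# each <tr> becomes a single CSV-ready row.
html = """
<table id="batting_gamelogs">
  <tr><th>Rk</th><th>Gcar</th><th>Gtm</th></tr>
  <tr><th>1</th><td>66</td><td>2 (1)</td></tr>
  <tr><th>2</th><td>67</td><td>3</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find("table").find_all("tr"):
    # find_all with a list of names returns th and td cells in document order
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

print(rows)
```

Here the first row is the header and each later row mixes the th label with the td stats, which avoids the two separate rows.append() calls in the original loop.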

import pandas as pd
from bs4 import BeautifulSoup
import urllib2

url = 'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010'
html=urllib2.urlopen(url)

bs = BeautifulSoup(html,'lxml')

table = str(bs.find('table',{'id':'batting_gamelogs'}))

dfs = pd.read_html(table)
This uses pandas, which is very useful for things like this. It also puts it into a very reasonable format to do other operations on.
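The question also asked to keep only the inner Rk/Gcar/Gtm header and drop the per-month grouping headers. When read_html() parses a table with grouped headers it can produce a two-level column index, which can be flattened by keeping the innermost level. A minimal sketch on a hand-built frame (the column names are illustrative; the real read_html result may differ):

```python
import pandas as pd

# Hypothetical two-level columns mimicking a grouped game-log header
df = pd.DataFrame(
    [[1, 66, "2 (1)"], [2, 67, "3"]],
    columns=pd.MultiIndex.from_tuples(
        [("April", "Rk"), ("April", "Gcar"), ("April", "Gtm")]
    ),
)
# Keep only the innermost header row (Rk, Gcar, Gtm)
df.columns = df.columns.get_level_values(-1)
print(list(df.columns))
```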


This code gives you a large table of stats, which I believe is the one you want.
Make sure you have
lxml
beautifulsoup4
pandas
installed.

df = pd.read_html(r'https://www.baseball-reference.com/players/gl.fcgi?id=abreuto01&t=b&year=2010')
print(df[4])
Here is the output of the first 5 rows. You may need to clean it up a bit, as I don't know what your end goal is:

df[4].head(5)
    Rk  Gcar    Gtm Date    Tm  Unnamed: 5  Opp Rslt    Inngs   PA  ... CS  BA  OBP SLG OPS BOP aLI WPA RE24    Pos
0   1   66  2 (1)   Apr 6   ARI NaN SDP L,3-6   7-8 1   ... 0   1.000   1.000   1.000   2.000   9   .94 0.041   0.51    PH
1   2   67  3   Apr 7   ARI NaN SDP W,5-3   7-8 1   ... 0   .500    .500    .500    1.000   9   1.16    -0.062  -0.79   PH
2   3   68  4   Apr 9   ARI NaN PIT W,9-1   8-GF    1   ... 0   .667    .667    .667    1.333   2   .00 0.000   0.13    PH SS
3   4   69  5   Apr 10  ARI NaN PIT L,3-6   CG  4   ... 0   .500    .429    .500    .929    2   1.30    -0.040  -0.37   SS
4   5   70  7 (1)   Apr 13  ARI @   LAD L,5-9   6-6 1   ... 0   .429    .375    .429    .804    9   1.52    -0.034  -0.46   PH
To select certain columns within this dataframe:
df[4]['COLUMN_NAME_HERE'].head(5)

Example:
df[4]['Gcar']


Also, if typing
df[4]
becomes annoying, you can always switch it to another dataframe:
df2 = df[4]

Pandas has a nice function,
read_html
, which may be just what you're looking for. After this, how would you convert it to its own CSV file, since read_html() converts it into a list of dataframes?
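On that follow-up: read_html() returns a list of DataFrames, and each element has a to_csv() method, so writing one CSV per table is a short loop. Sketched here with a hand-built frame standing in for the parsed tables (the filename pattern is illustrative):

```python
import pandas as pd

# Stand-in for the list that pd.read_html(url) would return
tables = [pd.DataFrame({"Rk": [1, 2], "Gcar": [66, 67]})]

# One CSV file per parsed table; index=False drops the row index column
for i, table in enumerate(tables):
    table.to_csv("table_%d.csv" % i, index=False)
```

For the original question, the per-player loop would write to "%s.csv" % id_nums[idx] instead of the numbered names used here.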