Parsing: converting a list into a DataFrame after parsing it from the web

Tags: parsing, dataframe, web-scraping, beautifulsoup

I'm new to Python. I'm trying to recreate the table from the CME page below, but I can't convert the lists I've built into a DataFrame. Any help is much appreciated! Thanks in advance.

url = "http://www.cmegroup.com/trading/energy/crude-oil/light-sweet-crude_product_calendar_futures.html"
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = { 'User-Agent' : user_agent }

req = urllib2.Request(url, headers=headers)

response = urllib2.urlopen(req)

soup = BeautifulSoup(response)

header = soup.findAll('th',limit = 8)
column_header = []
for j in header:
    column_header.append(j.getText())




data_rows = soup.findAll('tr')[2:]
dates = []
for i in range(len(data_rows)):
    for td in data_rows[i].findAll('td'):
        dates.append(td.getText())
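For reference, the step the question is missing is a one-liner once those two lists exist. This is a minimal sketch, assuming pandas is available and that every data row contributes exactly as many <td> cells as there are headers (that chunk size, n_cols, is an assumption about the page layout, not something the question confirms):

import pandas as pd

# Sketch only: chunk the flat `dates` list into one sublist per table row,
# then build the DataFrame. Assumes each row yielded len(column_header) cells.
n_cols = len(column_header)
rows = [dates[i:i + n_cols] for i in range(0, len(dates), n_cols)]
df = pd.DataFrame(rows, columns=column_header)
print(df.head())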

  • Use .thead and .tbody to narrow the scope instead of limit
  • Use a list comprehension instead of repeated append calls

For example:

from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.cmegroup.com/trading/energy/crude-oil/light-sweet-crude_product_calendar_futures.html")

soup = BeautifulSoup(r.content, "lxml")

headers = [th.text for th in soup.thead.find_all('th')]  # use thead to narrow the scope
print(headers)
for tr in soup.tbody.find_all('tr'):
    row = [i.get_text(strip=True) for i in tr(['th', 'td'])]  # tbody + list comprehension, no append
    print(row)

Output:

['Contract Month', 'Product Code', 'First TradeLast Trade', 'Settlement', 'First HoldingLast Holding', 'First PositionLast Position', 'First NoticeLast Notice', 'First DeliveryLast Delivery']
['Feb 2017', 'CLG17', '21 Nov 201120 Jan 2017', '20 Jan 2017', '--', '23 Jan 201723 Jan 2017', '24 Jan 201724 Jan 2017', '01 Feb 201728 Feb 2017']
['Mar 2017', 'CLH17', '21 Nov 201121 Feb 2017', '21 Feb 2017', '--', '22 Feb 201722 Feb 2017', '23 Feb 201723 Feb 2017', '01 Mar 201731 Mar 2017']
['Apr 2017', 'CLJ17', '21 Nov 201121 Mar 2017', '21 Mar 2017', '--', '22 Mar 201722 Mar 2017', '23 Mar 201723 Mar 2017', '01 Apr 201730 Apr 2017']
['May 2017', 'CLK17', '21 Nov 201120 Apr 2017', '20 Apr 2017', '--', '21 Apr 201721 Apr 2017', '24 Apr 201724 Apr 2017', '01 May 201731 May 2017']

  • Thanks a lot! (Y) I did find a way in the end: instead of bs4, I parsed the page with read_html, passing the headers, because access was forbidden otherwise.
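The commenter's exact read_html call isn't shown, so here is a minimal sketch of that route under stated assumptions: the page is fetched with requests and a browser-like User-Agent (since the default one is forbidden), the raw HTML string is handed to pandas.read_html, and the product calendar is taken to be the first table on the page.

import pandas as pd
import requests

url = "http://www.cmegroup.com/trading/energy/crude-oil/light-sweet-crude_product_calendar_futures.html"
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

# Fetch with an explicit User-Agent first; requesting the URL without one is refused.
html = requests.get(url, headers=headers).text

# read_html returns one DataFrame per <table> element it finds in the HTML.
tables = pd.read_html(html)
df = tables[0]  # assumption: the product calendar is the first table on the page
print(df.head())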