Extracting multiple 'td' classes from a table


html pandas web-scraping


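I have a website with many tables. I loop over the tables, collect the data from each one, and export it to a CSV file. My problem is that each table has several 'td' classes and I only want a few of them, but I don't know how to narrow them down to the ones I want. When I run my current code, I also get parts of the table I don't want, which makes my CSV file very messy. Below is a snippet of the page's HTML that contains the information I want.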


607 | 136.5 | 28 | 32 | 60 | Cover: +1.5
608 « | -2.5 | 24 | 37 | 61 | Under: 121

You need the following:

var blacks = document.getElementsByClassName('black');

blacks will be an array-like collection (an HTMLCollection), and you can access each element as follows:

for (var i = 0; i < blacks.length; i++) {
    var black = blacks[i]; // the i-th element with class "black"
}
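Since the question is about a Python script rather than JavaScript running in the browser, the same class lookup can be sketched with BeautifulSoup. This is only a minimal sketch, assuming the bs4 package is installed; the HTML string is a trimmed stand-in for the page's table markup, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed sample markup (an assumption; the real page has more rows and cells).
html = """
<table class="sportPicksBorder">
 <tr class="table_title">
  <td class="sportPicksBorderL2 tanBg fourleft regular">
    607 <a class="black" href="#">OAKLND</a>
  </td>
 </tr>
 <tr class="table_title">
  <td class="sportPicksBorderL2 tanBg fourleft regular">
    608 <a class="black" href="#">YOUNG</a> <span class="sub_title_red">«</span>
  </td>
 </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all(class_="black") returns every tag whose class list contains "black",
# much like document.getElementsByClassName('black') does in the browser.
blacks = soup.find_all(class_="black")

# Collect the visible text of each matching tag.
team_names = [tag.get_text(strip=True) for tag in blacks]
print(team_names)
```

The result is an ordinary Python list of tags, so it can be indexed or looped over directly.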


Another solution:
from simplified_scrapy.spider import SimplifiedDoc 
html='''
<table class="sportPicksBorder">
 <tr class="table_title">
  <td class="sportPicksBorderL2 tanBg fourleft regular" nowrap="nowrap">
    607 <a class="black" href="/college-basketball/teams/team-page.cfm/team/oakland">OAKLND</a>
  </td>
  <td class="sportPicksBorderL2 tanBg zerocenter regular" nowrap="nowrap">&nbsp;136.5&nbsp;</td>
  <td class="sportPicksBorderL2 tanBg zerocenter regular" nowrap="nowrap">28</td>
  <td class="sportPicksBorderL2 tanBg zerocenter regular" nowrap="nowrap">32</td>
  <td class="sportPicksBorderL2 tanBg zerocenter sub_title_red" nowrap="nowrap">60</td>
  <td class="sportPicksBorderR2 tanBg zerocenter regular" nowrap="nowrap" width="100">Cover: +1.5&nbsp;</td>
</tr>
<tr class="table_title">
  <td class="sportPicksBorderL2 tanBg fourleft regular" nowrap="nowrap">
    608 <a class="black" href="/college-basketball/teams/team-page.cfm/team/youngstown-state">YOUNG</a> <span class="sub_title_red">«</span>
  </td>
  <td class="sportPicksBorderL2 tanBg zerocenter regular" nowrap="nowrap">&nbsp;-2.5&nbsp;</td>
  <td class="sportPicksBorderL2 tanBg zerocenter regular" nowrap="nowrap">24</td>
  <td class="sportPicksBorderL2 tanBg zerocenter regular" nowrap="nowrap">37</td>
  <td class="sportPicksBorderL2 tanBg zerocenter sub_title_red" nowrap="nowrap">61</td>
  <td class="sportPicksBorderR2 tanBg zerocenter regular" nowrap="nowrap" width="100">Under: 121&nbsp;</td>
</tr>
</table>
'''
doc = SimplifiedDoc(html) # create doc
tables = doc.getElementsByClass('sportPicksBorder')
# To keep only the first 'td' (the one containing the 'a' link) and drop the trailing arrow:
for table in tables:
    rows = table.trs
    for row in rows:
        column = row.td  # get the first td in the row
        column.removeElement('span')  # delete the arrow marker
        print(column.text)
        # Alternatively, read the leading number and the link text separately
        print(column.firstText(), column.a.text)

More examples are available here:

I found that, inside my columns loop, I needed to write the code this way to find only the specific 'td' cells I wanted:
columns = row.find_all(class_=["black", "sportPicksBorderL2 tanBg zerocenter regular", "sportPicksBorderL2 tanBg zerocenter sub_title_red", "sportPicksBorderR2 tanBg zerocenter regular"])

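I call find_all and pass the classes I need as a comma-separated list through class_; a tag is returned if it matches any entry in the list. I also listed the classes in the order I want them to appear in the CSV file, although find_all returns matches in document order regardless of how the list is ordered. That was all I needed to do.

I think I'm a bit confused by this answer. The Python script I'm using works fine; it just returns more data than I want because of how the tables on the page are designed. Each table has 4 tr rows, and I only want the td cells from 2 of them. The td cells in each row have their own classes, and as my code stands it captures every td in all 4 rows. How can I restrict my current code so that it only captures the data in those 2 rows?

Sorry, I'm not familiar with BeautifulSoup, so I don't know how to filter and extract the data you want with it.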
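The follow-up about keeping only 2 of the 4 tr rows can be approached by filtering on the row class first and only then reading the cells. The sketch below is not the asker's script. It assumes the wanted rows all carry the class table_title (as in the HTML snippet in this thread) and the unwanted rows do not; the "header" and "footer" rows, the extract_rows helper, and the in-memory CSV buffer are all hypothetical, added only for illustration.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup. The "header" and "footer" rows and their class names are
# hypothetical stand-ins for the two unwanted rows; only the "table_title"
# rows are modeled on the snippet from this thread (trimmed).
html = """
<table class="sportPicksBorder">
 <tr class="header"><td class="tanBg">Team</td><td class="tanBg">Line</td></tr>
 <tr class="table_title">
  <td class="sportPicksBorderL2 tanBg fourleft regular" nowrap="nowrap">
    607 <a class="black" href="#">OAKLND</a>
  </td>
  <td class="sportPicksBorderL2 tanBg zerocenter regular" nowrap="nowrap">&nbsp;136.5&nbsp;</td>
  <td class="sportPicksBorderL2 tanBg zerocenter sub_title_red" nowrap="nowrap">60</td>
  <td class="sportPicksBorderR2 tanBg zerocenter regular" nowrap="nowrap">Cover: +1.5&nbsp;</td>
 </tr>
 <tr class="table_title">
  <td class="sportPicksBorderL2 tanBg fourleft regular" nowrap="nowrap">
    608 <a class="black" href="#">YOUNG</a> <span class="sub_title_red">«</span>
  </td>
  <td class="sportPicksBorderL2 tanBg zerocenter regular" nowrap="nowrap">&nbsp;-2.5&nbsp;</td>
  <td class="sportPicksBorderL2 tanBg zerocenter sub_title_red" nowrap="nowrap">61</td>
  <td class="sportPicksBorderR2 tanBg zerocenter regular" nowrap="nowrap">Under: 121&nbsp;</td>
 </tr>
 <tr class="footer"><td class="tanBg">Notes</td></tr>
</table>
"""


def extract_rows(markup):
    """Return the cell text of every <tr class="table_title"> row."""
    soup = BeautifulSoup(markup, "html.parser")
    rows = []
    # Filtering on the row class skips the unwanted rows entirely.
    for tr in soup.find_all("tr", class_="table_title"):
        cells = []
        for td in tr.find_all("td"):
            # Remove the arrow marker before reading the text.
            for span in td.find_all("span"):
                span.decompose()
            # Collapse whitespace, including the non-breaking spaces from &nbsp;.
            cells.append(" ".join(td.get_text().split()))
        rows.append(cells)
    return rows


rows = extract_rows(html)

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

If the unwanted rows turn out to share the same tr class on the real page, the filter would have to rely on something else, such as the td classes passed to find_all earlier in this thread.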