R: Web scraping Yahoo Finance after the 2019 change


r web-scraping rvest yahoo-finance

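For a long time I have been happily scraping yahoo.finance pages with code borrowed largely from other Stack Overflow answers, and it worked well. In the past few weeks, however, Yahoo has changed its tables to be collapsible/expandable. This broke the code, and despite several days of effort I have not been able to fix it.

Below is an example of the code others have used for years (the result then gets parsed and processed in different ways by different people).

For an example of the expected result, we can try another Yahoo page that has not changed, such as: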

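library(rvest)
library(tidyverse)

# Create a URL string
myURL2 <- "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"

# Read every <table> on the page and stack them into a single tibble
df2 <- myURL2 %>%
  read_html() %>%
  html_table(header = FALSE) %>%
  map_df(bind_cols) %>%
  as_tibble()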


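If you look at df2, you get 59 observations of two variables, which together make up the main table on that page, starting with:

Market Cap (intraday) 5 [value here]
Enterprise Value 3 [value here]

and so on...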

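This may look a little roundabout, but I wanted to avoid relying on parts of the page that I suspect are more dynamic (for example, many of the class names) and to offer something that may have a somewhat longer shelf life.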

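Your code is failing, in part, because there is no table element housing that data. Instead, you can gather the "rows" of the desired output table using the more stable-looking fi-row class attribute. Within each row, you can then gather the columns by matching, relative to the parent row node, on elements that have either a title attribute or data-test='fin-col'.

I use a regex to match the dates (as these change over time) and combine them with the two static headers to give the final dataframe headers for output. I limit the regex to the text of a single node, which I know should contain pattern matches for only the required dates.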

R:
library(rvest)
library(stringr)
library(magrittr)

page <- read_html('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')

# Each .fi-row element is one "row" of the financials table
nodes <- page %>% html_nodes(".fi-row")
df <- NULL

for (i in nodes) {
  # Cells are the elements carrying a title attribute or data-test='fin-col'
  r <- i %>% html_nodes("[title],[data-test='fin-col']") %>% html_text()
  df <- rbind(df, as.data.frame(matrix(r, ncol = length(r), byrow = TRUE), stringsAsFactors = FALSE))
}

# Pull the date column headers from the text of a single container node
matches <- str_match_all(page %>% html_node('#Col1-3-Financials-Proxy') %>% html_text(), '\\d{1,2}/\\d{1,2}/\\d{4}')
headers <- c('Breakdown', 'TTM', matches[[1]][, 1])
names(df) <- headers
View(df)

Python:
import requests, re
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://finance.yahoo.com/quote/AAPL/financials?p=AAPL')
soup = bs(r.content, 'lxml')
results = []

for row in soup.select('.fi-row'):
    results.append([i.text for i in row.select('[title],[data-test="fin-col"]')])

p = re.compile(r'\d{1,2}/\d{1,2}/\d{4}')
headers = ['Breakdown','TTM']
headers.extend(p.findall(soup.select_one('#Col1-3-Financials-Proxy').text))
df = pd.DataFrame(results, columns = headers)
print(df)
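The header step used in both versions above can be tried offline. The following is only a minimal sketch: `build_headers` is a hypothetical helper name, and `sample_text` is made-up text standing in for what the header node might contain, not actual output from the page.

```python
import re

# Same date pattern as in the answer: m/d/yyyy with 1-2 digit month and day
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def build_headers(node_text):
    """Combine the two static headers with every date found in node_text."""
    return ["Breakdown", "TTM"] + DATE_PATTERN.findall(node_text)


# Assumed sample: the node's text with no separators between the cells
sample_text = "BreakdownTTM9/30/20199/30/20189/30/20179/30/2016"
headers = build_headers(sample_text)
print(headers)
```

Because the pattern ends in exactly four digits, adjacent dates are still split correctly even when the node text has no whitespace between them.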

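As mentioned in the comments above, here is an alternative that tries to handle the different table sizes that get published. I have been working on this for a while and had help from a friend.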
library(rvest)
library(tidyverse)

url <- "https://finance.yahoo.com/quote/AAPL/financials?p=AAPL"

# Download the data
raw_table <- read_html(url) %>% html_nodes("div.D\\(tbr\\)")

# The number of spans in the first row gives the column count
number_of_columns <- raw_table[1] %>% html_nodes("span") %>% length()

if (number_of_columns > 1) {
  # Create an empty data frame with the required dimensions
  df <- data.frame(matrix(ncol = number_of_columns, nrow = length(raw_table)),
                   stringsAsFactors = FALSE)

  # Fill the table by looping through the rows
  for (i in seq_along(raw_table)) {
    # Find the row name and set it
    df[i, 1] <- raw_table[i] %>% html_nodes("div.Ta\\(start\\)") %>% html_text()
    # Now grab the values
    row_values <- raw_table[i] %>% html_nodes("div.Ta\\(end\\)")
    for (j in 1:(number_of_columns - 1)) {
      df[i, j + 1] <- row_values[j] %>% html_text()
    }
  }
  view(df)
}
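The idea of sizing the table from its first row and then filling it row by row can also be sketched in Python. This is an illustration under assumptions, not a port of the code above: `rows_to_table` is a hypothetical helper, the row values are made up, and padding short rows with `None` is one possible way to treat rows that carry fewer cells than the header.

```python
def rows_to_table(rows, n_columns, fill=None):
    """Pad or trim each row so that every row has exactly n_columns cells."""
    table = []
    for row in rows:
        cells = list(row[:n_columns])                 # drop any extra cells
        cells += [fill] * (n_columns - len(cells))    # pad any missing cells
        table.append(cells)
    return table


# Made-up rows for illustration only; real scraped rows will differ
rows = [
    ["Breakdown", "ttm", "9/30/2019", "9/30/2018"],
    ["Total Revenue", "100", "90", "80"],
    ["Operating Expenses"],  # a section row without values
]

n_columns = len(rows[0])  # column count taken from the first row
table = rows_to_table(rows, n_columns)
for line in table:
    print(line)
```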

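Could you say more clearly what the expected output is versus what actually happens?
Sure. I will edit the question above with more detail. :-)
How would you build a loop around your script to run the scrape for multiple tickers and then bind them together?
Great work! Thank you very much. I really like your approach. I have also been working on this (for an unhealthy length of time) and used a similar approach of going through it row by row; I also added going through it column by column. I will post my code in an answer below.
You don't have many options here if you still want to produce something that may last a while.
The Python code you wrote is great, short and sweet. You will see my R version here; it is not as concise as yours. Nice work, and thank you very much.
You're welcome. Remember that you can accept your own answer in two days. Doing so helps show people what worked.
str_match_all(page %>% html_node('#Col1-1-Financials-Proxy') %>% html_text(), '\\d{1,2}/\\d{1,2}/\\d{4}')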
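One of the comments above asks how to loop the scrape over several tickers and bind the results together. A minimal sketch of one way to do it follows. It assumes the single-ticker scrape has been wrapped in a function that returns a list of rows. `scrape_many` and `fake_scrape` are hypothetical names, and `fake_scrape` returns made-up rows instead of fetching a page.

```python
def scrape_many(tickers, scrape_one):
    """Call scrape_one(ticker) for each ticker and stack all rows,
    prefixing each row with its ticker so its source stays identifiable."""
    combined = []
    for ticker in tickers:
        for row in scrape_one(ticker):
            combined.append([ticker] + list(row))
    return combined


def fake_scrape(ticker):
    # Stand-in for a real scraper: made-up rows, no network access
    return [["Total Revenue", "1"], ["Net Income", "2"]]


rows = scrape_many(["AAPL", "MSFT"], fake_scrape)
for row in rows:
    print(row)
```

If the per-ticker function returns a pandas DataFrame instead, adding a ticker column to each frame and combining them with `pd.concat` would be the equivalent step.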