Web scraping with the rvest package returns an empty result

r parsing web-scraping

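The link below contains a table of tax rates by country, which I would like to scrape into a data frame with Country and Tax columns.

I tried to use the rvest package to get the Country column, as shown below, but the list I get back is empty and I don't understand why.

I would appreciate any advice on how to solve this.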

library(rvest)
d1 <- read_html(
  "http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates"
  )
TaxCountry <- d1 %>%
  html_nodes('.countryNameQC') %>%
  html_text()

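When the page is viewed in a browser, JavaScript runs, the data is loaded dynamically and the DOM changes. None of this happens with rvest, which only parses the HTML that the server returns.

In the browser, the following selectors isolate your nodes: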

.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(1) .countryYear 
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryNameQC
.twoCountryWrapper .countryNameAndYearQC:nth-child(2) .countryYear

However, these classes do not even appear in the document that rvest returns.
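One way to confirm this kind of problem is to check which class names are present in the static markup. The following is a minimal sketch in Python (standard library only), not the page's real HTML: the markup, the `hiddenData` class and the script body are hypothetical stand-ins for a page that only creates `countryNameQC` elements once a browser runs its JavaScript.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Record every class name that appears on a tag in the static markup."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


# Hypothetical static response: the class is only assigned by script code,
# which a plain HTML parser never executes.
static_html = """
<div id="dspQCLinks1" class="hiddenData">raw data</div>
<script>
  var cell = document.createElement('span');
  cell.className = 'countryNameQC';
</script>
"""

collector = ClassCollector()
collector.feed(static_html)
found = "countryNameQC" in collector.classes
print(found)
```

If the class is missing from the static markup, a selector based on it matches nothing, which is consistent with the empty result in the question.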

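The data of interest is actually stored in several nodes, all of which have ids that start with the common prefix dspQCLinks. Their text holds the table contents as one delimited string.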

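So you can gather all of these nodes with a CSS attribute = value selector and the starts-with operator (^):

html_nodes(page, "[id^=dspQCLinks]")

Then extract the text and combine it into a single string: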

paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = '')
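To illustrate what the starts-with selector and the concatenation do, here is a rough sketch in Python using only the standard library. The `PrefixIdTextCollector` class and the sample markup are hypothetical; the real page's element types are not shown in this post, and the sketch assumes the matching elements contain no void tags such as `<br>`.

```python
from html.parser import HTMLParser


class PrefixIdTextCollector(HTMLParser):
    """Collect the text inside every element whose id starts with a prefix,
    roughly what the CSS selector [id^=prefix] followed by text extraction does."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.depth = 0    # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1   # nested tag inside a matching element
        elif (dict(attrs).get("id") or "").startswith(self.prefix):
            self.depth = 1    # entering a matching element

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical markup; only the Albania row is taken from the post.
sample_html = """
<div id="dspQCLinks1">Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15!,</div>
<div id="other">ignored</div>
<div id="dspQCLinks2">Exampleland@/hypothetical/path@20!,</div>
"""

collector = PrefixIdTextCollector("dspQCLinks")
collector.feed(sample_html)
combined = "".join(collector.chunks)   # same idea as paste(..., collapse = '')
print(combined)
```

The element with id `other` is skipped because its id does not begin with the prefix, so only the two data nodes contribute to the combined string.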

Each row of the table is now terminated by the two-character delimiter "!,", so we can split on it to produce the rows:
info = strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''),"!,")[[1]]

A sample row then looks like this:
"Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15"

If we split each row (here i, one element of info) on @, the data we need is at indices 1 and 3:
arr = strsplit(i, '@')[[1]]
country <- arr[1]
tax <- arr[3]

Then remove the empty rows at the bottom of df:
 df <- df[df$Country != "",] 
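The splitting steps above can be tried offline on a small string. This is only a sketch in Python: the first row is the sample row from this answer, while the `Exampleland` row is made up for illustration.

```python
# Combined text as it would look after concatenating the dspQCLinks nodes.
# The second row is hypothetical.
raw = (
    "Albania@/uk/taxsummaries/wwts.nsf/ID/Albania-Corporate-Taxes-on-corporate-income@15!,"
    "Exampleland@/hypothetical/path@20!,"
)

records = []
for row in raw.split("!,"):        # rows are separated by "!,"
    if not row:                    # the trailing delimiter leaves an empty row
        continue
    fields = row.split("@", 2)     # country, link, tax
    records.append((fields[0], fields[2]))

print(records)
```

Note that Python indexes from 0, so positions 0 and 2 here correspond to indices 1 and 3 in the R code.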


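How many columns do you want in the output? There are 4 columns in Source2, see my post.

Sorry. I understand now.

I hope you can improve the above by using a form of apply.

You don't need apply or a loop. Given a character vector, str_split_fixed(info, "@", 3) gives you a character matrix that can be coerced directly to a data frame, and any unwanted rows can then be filtered out.

Thanks @Brian. I have updated my answer according to what I think you meant. I noticed I was left with two "empty" rows at the bottom, so I used apply to remove them, but the answer looks cleaner and I suspect it is more efficient. Many thanks.

@QHarr and Brian: thank you for the detailed explanation!

Thank you very much @Brian. So I can simply say df <- data.frame(str_split_fixed(info, "@", 3))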

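# R version
library(rvest)
library(stringr)
library(magrittr)

# Read the static HTML; the table data sits in nodes whose id starts with dspQCLinks
page <- read_html('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
# Concatenate the text of those nodes, then split it into rows on the "!," delimiter
info <- strsplit(paste(html_text(html_nodes(page, "[id^=dspQCLinks]")), collapse = ''), "!,")[[1]]
# Split every row into exactly three fields on "@"
df <- data.frame(str_split_fixed(info, "@", 3))
colnames(df) <- c("Country", "Link", "Tax")
df <- subset(df, select = c("Country", "Tax"))
# Drop the empty rows at the bottom
df <- df[df$Country != "", ]
View(df)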

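# Python version
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('http://taxsummaries.pwc.com/ID/Corporate-income-tax-(CIT)-rates')
soup = bs(r.content, 'lxml')

# Concatenate the text of every node whose id starts with dspQCLinks
text = ''.join(i.text for i in soup.select('[id^=dspQCLinks]'))

# Rows are separated by "!,"; fields within a row are separated by "@"
rows = text.split('!,')
countries = []
tax_info = []

for row in rows:
    if row:  # skip the empty trailing rows
        items = row.split('@')
        countries.append(items[0])  # country name
        tax_info.append(items[2])   # tax rate

df = pd.DataFrame(list(zip(countries, tax_info)), columns=['Country', 'Tax'])
print(df)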