How to extract tables from partially unstructured txt files in R?


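I have a list of URLs that point to txt files. Each file is structured so that some parts are plain text and other parts are tables. I want to extract the tables and export them to a data frame. Example URLs are shown in the table below.

Within each file, every table starts with <TABLE> and ends with </TABLE>. I want to combine all of the tables. I tried read.delim, but I don't know how to apply it to the tables only. An example of the expected output is shown below. I would appreciate any guidance on how to proceed with my project.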

Example of current df:
+----+--------------------------------------------------------------------------+
| ID |                                   URL                                    |
+----+--------------------------------------------------------------------------+
|  1 | https://www.sec.gov/Archives/edgar/data/1000097/0000919574-13-001835.txt |
|  2 | https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt |
|  3 | https://www.sec.gov/Archives/edgar/data/1000742/0000898432-13-000218.txt |
+----+--------------------------------------------------------------------------+

Example of txt file from url:

text text text
text text text
text text text

<TABLE>
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
|   NAME OF ISSUER    | TITLE OF CLASS |   CUSIP   | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ABBVIE INC          | COM            | 00287Y109 |        1,547 |          45,300 | SHS    |          | Shared-Defined        | 1/2/3       |           45,300 |
| ABERCROMBIE & FITCH | CL A           | 002896207 |        4,797 |         100,000 | SHS    |          | Shared-Defined        | 1/2/3       |          100,000 |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
</TABLE>

<TABLE>
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
|   NAME OF ISSUER    | TITLE OF CLASS |   CUSIP   | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ABBVIE INC          | COM            | 00287Y109 |        1,547 |          45,300 | SHS    |          | Shared-Defined        | 1/2/3       |           45,300 |
| ABERCROMBIE & FITCH | CL A           | 002896207 |        4,797 |         100,000 | SHS    |          | Shared-Defined        | 1/2/3       |          100,000 |
+---------------------+----------------+-----------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
</TABLE>



Expected output:
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
| ID | NAME OF ISSUER | TITLE OF CLASS | CUSIP | VALUE (x1000 | SHRS OR PRN AMT | SH/PRN | PUT/CALL | INVESTMENT DISCRETION | OTHER MNGRS | VOTING AUTHORITY |
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+
|  1 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  1 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  1 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  2 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  2 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
|  2 | x              | x              | x     | x            | x               | x      | x        | x                     | x           | x                |
+----+----------------+----------------+-------+--------------+-----------------+--------+----------+-----------------------+-------------+------------------+


Here is a rough solution:

# Load dplyr for bind_rows() and the %>% pipe
library(dplyr)

# Read the text file from the web
fileContents <- readr::read_file("https://www.sec.gov/Archives/edgar/data/1000275/0001140361-13-007449.txt")

# Extract the tables.  The regex isn't quite right, as it includes the starting <TABLE>
# and ending </TABLE> tags, but more complicated regexes failed.  Regex isn't my
# strong point, and I can handle the extra work
tables <- stringr::str_extract_all(
            fileContents,
            stringr::regex("<TABLE>(.*?)</TABLE>", dotall=TRUE)
          )

# Function to process a single table
toTibble <- function(y) {
  lines <- stringr::str_split(y, "\n")[[1]]
  # Scroll through to the end of the table header: the first line that
  # holds single-letter type markers (matched by "<\\w>")
  for (i in seq_len(length(lines)-1)) { # Final line is '</TABLE>' because of initial regex
    if (stringr::str_detect(lines[i], "<\\w>")) {
      # Define column widths based on locations of type markers.  THIS IS AN ASSUMPTION
      colStarts <- stringr::str_locate_all(lines[i], "<\\w>")[[1]][,"start"]
      # Each column ends just before the next one starts; the last column
      # runs to the end of the marker line
      colEnds <- c(colStarts[-1] - 1, stringr::str_length(lines[i]))
      # Nothing to parse if no data lines sit between the markers and '</TABLE>'
      if (i + 1 > length(lines) - 1) return(NULL)
      # Keep only the data lines: after the marker line, before '</TABLE>'
      lines <- lines[(i+1):(length(lines)-1)]
      data <- dplyr::bind_rows(
                lapply(
                  lines,                   # For each data line
                  function(line)
                    tibble::enframe(       # Split into columns and convert to a tibble of name/value pairs
                      stringr::str_trim(
                        stringr::str_sub(
                          line,
                          colStarts,
                          colEnds
                        )
                      )
                    ) %>%                  # Convert from name/value pairs to columns
                    tidyr::pivot_wider(
                      values_from="value",
                      names_from="name",
                      names_prefix="Column"
                    )
                  )
                )
      # Finished
      return(data)
    }
  }
}
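
The function above assumes fixed-width columns located by type markers. The example in the question instead shows pipe-delimited rows between the <TABLE> tags, so here is a minimal base-R sketch for that layout. The helper names extract_table_bodies and parse_pipe_table and the sample text are my own, made up for illustration. The lookaround regex is one possible way to leave the tags out of the match; I have only tried the idea on small inputs like the one shown.

```r
# Return the text between each <TABLE> ... </TABLE> pair, without the tags.
# (Hypothetical helper; uses PCRE lookarounds via perl = TRUE.)
extract_table_bodies <- function(txt) {
  m <- gregexpr("(?s)(?<=<TABLE>).*?(?=</TABLE>)", txt, perl = TRUE)
  regmatches(txt, m)[[1]]
}

# Parse one pipe-delimited block into a data frame of character columns.
# Assumes the first "| ... |" row is the header and "+---+" lines are rules.
parse_pipe_table <- function(block) {
  lines <- strsplit(block, "\r?\n")[[1]]
  rows <- lines[grepl("^\\s*\\|", lines)]        # keep only "| ... |" rows
  if (length(rows) < 2) return(NULL)             # need a header plus data
  cells <- lapply(rows, function(r) {
    r <- sub("^\\s*\\|", "", sub("\\|\\s*$", "", r))  # drop the outer pipes
    trimws(strsplit(r, "|", fixed = TRUE)[[1]])
  })
  header <- cells[[1]]
  # Pad or truncate each data row to the header width
  body <- lapply(cells[-1], function(x) { length(x) <- length(header); x })
  out <- as.data.frame(do.call(rbind, body), stringsAsFactors = FALSE)
  names(out) <- header
  out
}

# Made-up sample in the shape of the question's example
sample_txt <- paste(
  "text text text",
  "<TABLE>",
  "+----------------+-----------+----------+",
  "| NAME OF ISSUER |   CUSIP   | PUT/CALL |",
  "+----------------+-----------+----------+",
  "| ABBVIE INC     | 00287Y109 |          |",
  "| ABERCROMBIE    | 002896207 |          |",
  "+----------------+-----------+----------+",
  "</TABLE>",
  "more text",
  "<TABLE>",
  "| NAME OF ISSUER |   CUSIP   | PUT/CALL |",
  "| ABBVIE INC     | 00287Y109 |          |",
  "</TABLE>",
  sep = "\n"
)

blocks <- extract_table_bodies(sample_txt)
parsed <- do.call(rbind, lapply(blocks, parse_pipe_table))
```

All columns come back as character, so values such as "1,547" would still need the commas removed before converting to numbers.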

There are just under 300 tables in the file (length(tables[[1]]) returns 299), so bind them all into a single tibble:

alldata <- dplyr::bind_rows(lapply(tables[[1]], toTibble))

alldata
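
The expected output in the question also carries an ID column taken from the URL list, which the code above does not add because it handles a single file. A rough sketch of that last step follows, in base R. The name collect_tables and its fetch and parse arguments are hypothetical, introduced here so the idea can be tried without network access. In real use, fetch could be readr::read_file and parse could be a wrapper that extracts the <TABLE> blocks and applies toTibble to each.

```r
# Hypothetical helper: fetch each URL, parse it, and tag the rows with the ID.
# `fetch` maps a URL to the file contents; `parse` maps the contents to a
# data frame (or NULL when the file has no tables).
collect_tables <- function(urls_df, fetch, parse) {
  per_file <- lapply(seq_len(nrow(urls_df)), function(k) {
    parsed <- parse(fetch(urls_df$URL[k]))
    if (is.null(parsed) || nrow(parsed) == 0) return(NULL)
    # check.names = FALSE keeps headers such as "NAME OF ISSUER" unchanged
    data.frame(ID = urls_df$ID[k], parsed,
               check.names = FALSE, stringsAsFactors = FALSE)
  })
  # Drop files that produced nothing, then stack the rest
  do.call(rbind, Filter(Negate(is.null), per_file))
}

# Stand-ins so the sketch runs offline (made-up names and data)
urls_df <- data.frame(ID = 1:2,
                      URL = c("file-a.txt", "file-b.txt"),
                      stringsAsFactors = FALSE)
fake_fetch <- function(url) paste("contents of", url)
fake_parse <- function(txt) {
  data.frame(`NAME OF ISSUER` = c("x", "x"),
             CUSIP = c("x", "x"),
             check.names = FALSE, stringsAsFactors = FALSE)
}

combined <- collect_tables(urls_df, fake_fetch, fake_parse)
```

rbind requires every file to yield the same columns. If the files differ, dplyr::bind_rows is more forgiving because it fills missing columns with NA.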

Well, the first step is to locate the blocks of text between <TABLE> and </TABLE>. How have you done that? Then you need to parse the cell definitions within each block. Give us something to work with!

Unfortunately, I am stuck on this part too. I have searched online and tried several approaches, including fread, read.pattern, and readLines, but I could not get them to work as expected.