How to vectorize a function in R

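I have a function that takes some free text and splits it into columns according to a list of words. It works fine, but it has been suggested to me that it would be better if it were vectorized.

The function is called Extractor: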

Extractor <- function(x, y, stra, strb, t) {
  x <- data.frame(x)
  t <- gsub("[^[:alnum:],]", " ", t)
  t <- gsub(" ", "", t, fixed = TRUE)
  x[, t] <-
    stringr::str_extract(x[, y], stringr::regex(paste(stra,
                                                      "(.*)", strb, sep = ""), 
                                                dotall = TRUE))
  x[, t] <- gsub("\\\\.*", "", x[, t])

  names(x[, t]) <- gsub(".", "", names(x[, t]), fixed = TRUE)
  x[, t] <- gsub("       ", "", x[, t])
  x[, t] <- gsub(stra, "", x[, t], fixed = TRUE)
  if (strb != "") {
    x[, t] <- gsub(strb, "", x[, t], fixed = TRUE)
  }
  x[, t] <- gsub("       ", "", x[, t])
  x[, t] <- ColumnCleanUp(x, t)
  return(x)
}

ColumnCleanUp <- function(x, y) {
  x <- (data.frame(x))
  x[, y] <- gsub("^\\.\n", "", x[, y])
  x[, y] <- gsub("^:", "", x[, y])
  x[, y] <- gsub(".", "\n", x[, y], fixed = TRUE)
  x[, y] <- gsub("\\s{5}", "", x[, y])
  x[, y] <- gsub("^\\.", "", x[, y])
  x[, y] <- gsub("$\\.", "", x[, y])
  return(x[, y])
}

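Basically, I call the function repeatedly in a loop (only one example is shown here; the real data frame has more than 2000 rows).

Is apply() a way of applying a function in a vectorized fashion? If not, could I have a pointer on how to vectorize this so that I can avoid the loop? I understand that vectorizing a function means applying it to a vector as a whole rather than looping, and that I need to convert the input list to a character vector, but beyond that I am stuck.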
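
As a small illustration of the distinction the question asks about, the sketch below contrasts an element-by-element call with a single vectorized call. The vector texts is made up for the example; the point is only that gsub() already accepts a whole character vector, whereas the apply family still calls the function once per element.

```r
# Hypothetical input, not taken from the question's data
texts <- c("a.b", "c.d", "e.f")

# Element by element: sapply() calls the anonymous function once per string
looped <- sapply(texts, function(s) gsub(".", "", s, fixed = TRUE),
                 USE.NAMES = FALSE)

# Vectorized: one gsub() call processes the whole character vector
vectorized <- gsub(".", "", texts, fixed = TRUE)

# Both approaches give the same character vector
same_result <- identical(looped, vectorized)
```

Here same_result is TRUE; the vectorized form simply hands the looping to gsub() itself.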

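I think I would try to simplify the various regular expressions rather than vectorize the existing function. I can see what you are doing: you have a data.frame containing raw pathology data in a nasty-looking string, for example:

Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely

I believe you are taking a good approach in using the headers ("Hospital Number", "Patient Name:", ...) to extract the data ("233456", "Jonny Begood", ...). However, I think there is a simpler way to do this with regular expressions, using the headers as markers. In the string above, the data for the hospital number is everything between "Hospital Number" and "Patient Name:" with the surrounding whitespace removed, i.e. "233456". The same principle can be applied to extract each subsequent piece of data. A few more lines of code then put the individual pieces of data into their own columns in the data.frame.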
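
The "headers as markers" idea can be sketched for a single field as follows. This is a minimal illustration, assuming stringr is installed; the second report (998877, Jane Doe) is invented for the example to show that one call handles every row at once.

```r
library(stringr)

# Two reports in the same layout; the second one is hypothetical
reports <- c(
  "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77",
  "Hospital Number 998877 Patient Name: Jane Doe DOB: 02/03/80"
)

# Lookbehind for the left header, lookahead for the right header:
# the match is whatever sits between the two markers
pattern <- "(?<=Hospital Number).*(?=Patient Name:)"

# str_extract() is vectorized over the strings; str_trim() drops the padding
hospital_number <- str_trim(str_extract(reports, pattern))
```

Because str_extract() works on the whole character vector, no loop over rows is needed for this step.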

Here is the code to create the test data.frame:

Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<-"PathReportWhole"

> glimpse(out)
Observations: 1
Variables: 10
$ PathReportWhole        <fctr> Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and re...
$ HospitalNumber         <chr> "233456"
$ PatientName            <chr> "Jonny Begood"
$ DOB                    <chr> "13/01/77"
$ GeneralPractitioner    <chr> "Dr De'ath"
$ DateofProcedure        <chr> "13/01/99"
$ ClinicalDetails        <chr> "Dyaphagia and reflux"
$ Macroscopicdescription <chr> "3 pieces of oesophagus, all good biopsies."
$ Histology              <chr> "These show chronic reflux and other bits n bobs."
$ Diagnosis              <chr> "Acid reflux likely"

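Here I am using the tidyverse package ecosystem, namely dplyr and stringr. The function loops over each header, generates the appropriate regular expression, and then applies those regular expressions to create the new columns of data.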

Call the function like this:

out <- extractPath(Mypath, "PathReportWhole", x)
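
The per-header patterns described above can also be built in one step with paste0(), which is itself vectorized. The following is only a sketch of that variation, not the answer's own code: the shortened headers vector, the reports vector, and the second (invented) report are assumptions made for the example.

```r
library(stringr)

headers <- c("Hospital Number", "Patient Name:", "DOB:")

# Right-hand side of each pattern: a lookahead for the next header,
# or the end of the string for the last header
right <- c(paste0("(?=", headers[-1], ")"), "$")
patterns <- paste0("(?<=", headers, ").*", right)

# Column names: leading letters and spaces only, then spaces removed
col_names <- str_replace_all(str_extract(headers, "[[:alpha:] ]*"), " ", "")

# Hypothetical input; the second report is invented
reports <- c(
  "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77",
  "Hospital Number 998877 Patient Name: Jane Doe DOB: 02/03/80"
)

# One vectorized str_extract() call per header, each covering every row
cols <- lapply(patterns, function(p) str_trim(str_extract(reports, p)))
parsed <- as.data.frame(setNames(cols, col_names), stringsAsFactors = FALSE)
```

There is still one iteration per header, but the number of headers is small and fixed, while the row dimension (the 2000+ reports) is handled inside each str_extract() call.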

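What is the function ColumnCleanUp()? We need it to run your code.

@StuartAllen The ColumnCleanUp function has been added above. Thanks.

There is a lot that could be improved here. First, why does the Extractor function take the same argument twice? In the loop, the function call appears to pass as.character(historltree[i]) for both the stra and t arguments. I would rewrite the function so that you supply only the data frame and a vector of strings (the contents of historltree), and work with that vector alone. Also, why are there two lines of code that reformat the argument t, which you supply yourself? Why not supply it in the right form when you call the function?

It seems you are trying to parse data passed as key-value pairs. If you put some honest-to-Christmas delimiters in to separate the key-value pairs, life would be much simpler. There are quite a few questions and answers with code for handling this kind of input.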
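
To illustrate the last comment, here is a minimal base R sketch of how much simpler parsing becomes once explicit delimiters exist. The format is an assumption made for the example (each pair ends with ";" and each key ends with ":"); the reports in the question do not actually look like this.

```r
# Hypothetical delimited report, not the format used in the question
report <- "Hospital Number: 233456; Patient Name: Jonny Begood; DOB: 13/01/77"

# Split into key-value pairs at the semicolons
pairs <- strsplit(report, ";", fixed = TRUE)[[1]]

# Key is everything before the first colon, value is everything after it
keys <- trimws(sub(":.*$", "", pairs))
values <- trimws(sub("^[^:]*:", "", pairs))

# Named character vector: look values up by header
parsed_pairs <- setNames(values, keys)
```

With this layout no header list is needed up front, since the keys are read from the text itself.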

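For reference, the header vector x and the extractPath function used in the answer above:

library(dplyr)    # provides %>% and glimpse()
library(stringr)  # provides str_extract(), str_replace_all(), str_trim()

x <- c("Hospital Number", "Patient Name:", "DOB:", "General Practitioner:",
       "Date of Procedure:", "Clinical Details:", "Macroscopic description:",
       "Histology:", "Diagnosis:")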

extractPath <- function(df, colName, headers) {
  # df: data.frame containing raw path data
  # colName: name of column containing data
  # headers: character vector of headers (delimiters in raw path data)

  for (i in seq_along(headers)) {

    # left delimiter
    delimLeft <- headers[i]

    # right delimiter, not relevant if at end of headers
    if (i < length(headers)) {
      delimRight <- headers[i+1]
      # regex to match everything between delimiting headers
      regex <- paste0("(?<=", delimLeft, ").*(?=", delimRight, ")")
    } else {
      # regex to match everything to right of final delimiting header
      regex <- paste0("(?<=", delimLeft, ").*$")
    }

    # generate column name for new column
    # use alpha characters only (i.e. ignore colon), and remove spaces
    columnName <- str_extract(delimLeft, "[[:alpha:] ]*") %>% str_replace_all(" ", "")

    # create new column of data, and trim whitespace
    df[[columnName]] <- str_extract(df[[colName]], regex) %>% str_trim()
  }

  # return output data.frame
  df
}

out <- extractPath(Mypath, "PathReportWhole", x)