如何计算R中的R1(词汇丰富度指数)?

如何计算R中的R1(词汇丰富度指数)?,r,list,function,text-mining,quanteda,R,List,Function,Text Mining,Quanteda,嗨,我需要编写一个函数来计算R1,其定义如下: R1=1-(F(h)-h*h/2N)) 其中N是令牌的数量,h是赫希点,F(h)是该点之前的累积相对频率。我使用quanteda软件包计算了赫希点 a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters ha

嗨,我需要编写一个函数来计算R1,其定义如下:

R1=1-(F(h)-h*h/2N))

其中N是令牌的数量,h是赫希点,F(h)是该点之前的累积相对频率。我使用
quanteda
软件包计算了赫希点

 a <- c("The truck driver whose runaway vehicle rolled into the path of an express train and caused one of Taiwan’s worst ever rail disasters has made a tearful public apology.", "The United States is committed to advancing prosperity, security, and freedom for both Israelis and Palestinians in tangible ways in the immediate term, which is important in its own right, but also as a means to advance towards a negotiated two-state solution.")
a1 <- c("The 49-year-old is part of a team who inspects the east coast rail line for landslides and other risks.", "We believe that this UN agency for so-called refugees should not exist in its current format.")
a2 <- c("His statement comes amid an ongoing investigation into the crash, with authorities saying the train driver likely had as little as 10 seconds to react to the obstruction.", " The US president accused Palestinians of lacking “appreciation or respect.", "To create my data I had to chunk each text in an increasing manner.", "Therefore, the input is a list of chunked texts within another list.")
a3 <- c("We plan to restart US economic, development, and humanitarian assistance for the Palestinian people,” the secretary of state, Antony Blinken, said in a statement.", "The cuts were decried as catastrophic for Palestinians’ ability to provide basic healthcare, schooling, and sanitation, including by prominent Israeli establishment figures.","After Donald Trump’s row with the Palestinian leadership, President Joe Biden has sought to restart Washington’s flailing efforts to push for a two-state resolution for the Israel-Palestinian crisis, and restoring the aid is part of that.")
txt <-list(a,a1,a2,a3)

    
library(quanteda)
DFMs <- lapply(txt, dfm)
txt_freq <- function(x) textstat_frequency(x, groups = docnames(x), ties_method = "first")
Fs <- lapply(DFMs, txt_freq)

get_h_point <- function(DATA) {
  fn_interp <- approxfun(DATA$rank, DATA$frequency)
  fn_root <- function(x) fn_interp(x) - x
  uniroot(fn_root, range(DATA$rank))$root
}

s_p <- function(x){split(x,x$group)}  
tstat_by <- lapply(Fs, s_p)
h_values <-lapply(tstat_by, vapply, get_h_point, double(1))
如您所见,分组是相同的-原始字符向量的每个块的docname是相同的(text1、text2、text3等)。我的问题是如何为fh_txt(s)编写一个函数,以便可以选择使用lappy来计算R1的F(h)


请注意,我们的目标是编写一个函数来计算R1,我在这里介绍的是在这方面所做的工作。

我简化了下面的输入,并使用了
textstat\u frequency()
中的
groups
参数,而不是您创建dfm对象列表的方法


tnx表示您在这方面花费的时间。您为文本添加了分组参数,这很有帮助。然而,我需要计算F(h)-累积相对频率-直到每个文本的h点。顺便说一下,我的实际数据是一个文本列表,我需要一个函数。我修改了密码。看一看
fh_txt1 <- tail(prop.table(cumsum(tstat_by[[1]][["text1"]]$rank:h_values[[1]][["text1"]])), n=1)
fh_txt2 <-tail(prop.table(cumsum(tstat_by[[1]][["text2"]]$rank:h_values[[1]][["text2"]])), n=1)
...

tail(prop.table(cumsum(tstat_by[[4]][["text2"]]$rank:h_values[[4]][["text2"]])), n=1)
[1] 1
tail(prop.table(cumsum(tstat_by[[4]][["text3"]]$rank:h_values[[4]][["text3"]])), n=1)
[1] 0.75