Getting the word count of doc/docx files in R

python r ms-word

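I have a set of doc/docx documents and need the word count of each one.

So far, the process has been to open each document by hand and note down the word count that MS Word itself reports. I am now trying to automate this in R.

This is what I have tried: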

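library(textreadr)
library(stringr)

# read the document as a character vector, one element per paragraph
myDocx = read_docx(myDocxFile)
# join all paragraphs into a single string
docText = str_c(myDocx, collapse = " ")
# count runs of whitespace; words = separators + 1
wordCount = str_count(docText, "\\s+") + 1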

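Unfortunately, wordCount does not match what MS Word reports.

For example, I noticed that MS Word counts the numbers in numbered lists, whereas textreadr does not even import them.

Is there a workaround? I don't mind trying something in Python either, although I have less experience with it.

Any help would be greatly appreciated.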

This should be possible using the tidytext package in R.

library(textreadr)
library(tidytext)
library(dplyr)

# read in a Word file without password protection
x <- read_docx(myDocxFile)
# convert the character vector to a data frame, one row per paragraph
text_df <- tibble(line = seq_along(x), text = x)
# tokenize the data frame to isolate separate words
words_df <- text_df %>%
  unnest_tokens(word, text)
# calculate the number of words in the passage
word_count <- nrow(words_df)
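
Since the question also allows Python, here is a rough equivalent that uses only the standard library. It is a minimal sketch, not a definitive implementation: the function names `paragraphs_from_docx` and `count_words` are made up for illustration, it handles .docx only (not legacy .doc), and it splits on whitespace rather than reproducing the tokenization of `unnest_tokens`. Like the R code, it only sees text stored in the document body, so automatically generated list numbers are not included.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx archive.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_from_docx(path):
    """Return the text of each paragraph (w:p) in the main document part.

    Hypothetical helper for illustration. Headers, footers, footnotes and
    comments live in other parts of the archive and are not read here.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # Join the text runs (w:t) that belong to this paragraph.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


def count_words(path):
    """Count whitespace-separated tokens across all paragraphs."""
    return sum(len(p.split()) for p in paragraphs_from_docx(path))
```

Calling `count_words("report.docx")` (the file name is a placeholder) returns an integer. Expect it to differ from the figure MS Word shows, for the same reason the R result does.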

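I'm afraid I can't agree with you. If you compare the proposed code with what I tried, you will see that identical(x, myDocx) returns TRUE. The proposed libraries read the text in exactly the same way. You do suggest a different way of counting words, but that will never agree with MS Word either, because the input itself is different.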
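
One way around the differing input is to skip text extraction altogether. A .docx file is a zip archive, and MS Word normally writes document statistics, including a `Words` element, into `docProps/app.xml` when it saves. The Python sketch below reads that stored value. The function name `stored_word_count` is an illustrative assumption, and there are caveats: the number is whatever the last saving application wrote, so it can be stale or missing for files produced by other tools. It does not apply to legacy .doc files. Whether it matches the status-bar count for a given document (for example with footnotes or text boxes) has not been verified here.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml) in a .docx.
EP_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def stored_word_count(path):
    """Return the word count saved in docProps/app.xml, or None if absent.

    Hypothetical helper for illustration: it reports the statistic stored
    in the file and does not recount the text.
    """
    with zipfile.ZipFile(path) as zf:
        if "docProps/app.xml" not in zf.namelist():
            return None
        root = ET.fromstring(zf.read("docProps/app.xml"))
    words = root.find(f"{{{EP_NS}}}Words")
    if words is None or not (words.text or "").strip():
        return None
    return int(words.text)
```

For a folder of documents this could be applied per file, for example `{name: stored_word_count(name) for name in docx_paths}`, where `docx_paths` is a list you supply. Legacy .doc files would first need to be converted to .docx.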