Counting words between markers in R

Tags: r, quanteda

I have several text files that I import into a corpus. Each text has several sections, supposedly written on different days, each marked with #. Weeks are marked with $. Within each text, how can I count how many words were written on each day and in each week? Text T1 has several days, each ending with a #, and I need the word count per day. Weeks are delimited by $, and I also need the weekly word count; the same goes for texts T2, T3, …, Tn. How can I do this in R with quanteda?

<T1>
 (25.02.2009) This chapter thoroughly describes the idea of analyzing text “as data” with a social science focus. It traces a brief history of this approach and distinguishes it from alternative approaches to text. It identifies the key research designs and methods for various ways that scholars in political science and international relations have used text, with references to fields such as natural language processing and computational linguistics from which some of the key methods are influenced or inherited. It surveys the varieties of ways that textual data is used and analyzed, covering key methods and pointing to applications of each. It also identifies the key stages of a research design using text as data, and critically discusses the practical and epistemological challenges at each stage.                                                        

# (26.02.2009) Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dail confidence vote. To solve the Dail scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dail debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.                       

# (28.02.2009) Borrowing from automated “text as data” approaches, we show how statistical scaling models can be applied to hand-coded content analysis to improve estimates of political parties’ left-right policy positions. We apply a Bayesian item-response theory (IRT) model to category counts from coded party manifestos, treating the categories as “items” and policy positions as a latent variable. This approach also produces direct estimates of how each policy category relates to left-right ideology, without having to decide these relationships in advance based on out of sample fitting, political theory, assertion, or guesswork. This approach not only prevents the misspecification endemic to a fixed-index approach, but also works well even with items that are not specifically designed to measure ideological positioning.              
# (02.03.2009) This chapter thoroughly describes the idea of analyzing text “as data” with a social science focus. It traces a brief history of this approach and distinguishes it from alternative approaches to text. It identifies the key research designs and methods for various ways that scholars in political science and international relations have used text, with references to fields such as natural language processing and computational linguistics from which some of the key methods are influenced or inherited. It surveys the varieties of ways that textual data is used and analyzed, covering key methods and pointing to applications of each. It also identifies the key stages of a research design using text as data, and critically discusses the practical and epistemological challenges at each stage.

# (03.03.2009) Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dail confidence vote. To solve the Dail scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dail debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.                                    

#
($)
 (04.03.2009) Borrowing from automated “text as data” approaches, we show how statistical scaling models can be applied to hand-coded content analysis to improve estimates of political parties’ left-right policy positions. We apply a Bayesian item-response theory (IRT) model to category counts from coded party manifestos, treating the categories as “items” and policy positions as a latent variable. This approach also produces direct estimates of how each policy category relates to left-right ideology, without having to decide these relationships in advance based on out of sample fitting, political theory, assertion, or guesswork. This approach not only prevents the misspecification endemic to a fixed-index approach, but also works well even with items that are not specifically designed to measure ideological positioning.                                      

# (05.03.2009) Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dail confidence vote. To solve the Dail scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dail debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.  
# (06.03.2009)  This chapter thoroughly describes the idea of analyzing text “as data” with a social science focus. It traces a brief history of this approach and distinguishes it from alternative approaches to text. It identifies the key research designs and methods for various ways that scholars in political science and international relations have used text, with references to fields such as natural language processing and computational linguistics from which some of the key methods are influenced or inherited. It surveys the varieties of ways that textual data is used and analyzed, covering key methods and pointing to applications of each. It also identifies the key stages of a research design using text as data, and critically discusses the practical and epistemological challenges at each stage. 

# (07.03.2009)  This chapter thoroughly describes the idea of analyzing text “as data” with a social science focus. It traces a brief history of this approach and distinguishes it from alternative approaches to text. It identifies the key research designs and methods for various ways that scholars in political science and international relations have used text, with references to fields such as natural language processing and computational linguistics from which some of the key methods are influenced or inherited. It surveys the varieties of ways that textual data is used and analyzed, covering key methods and pointing to applications of each. It also identifies the key stages of a research design using text as data, and critically discusses the practical and epistemological challenges at each stage. 

# (08.03.2009) Probabilistic methods for classifying text form a rich tradition in machine learning and natural language processing. For many important problems, however, class prediction is uninteresting because the class is known, and instead the focus shifts to estimating latent quantities related to the text, such as affect or ideology. We focus on one such problem of interest, estimating the ideological positions of 55 Irish legislators in the 1991 Dail confidence vote. To solve the Dail scaling problem and others like it, we develop a text modeling framework that allows actors to take latent positions on a “gray” spectrum between “black” and “white” polar opposites. We are able to validate results from this model by measuring the influences exhibited by individual words, and we are able to quantify the uncertainty in the scaling estimates by using a sentence-level block bootstrap. Applying our method to the Dail debate, we are able to scale the legislators between extreme pro-government and pro-opposition in a way that reveals nuances in their speeches not captured by their votes or party affiliations.                    

# (09.03.2009) Borrowing from automated “text as data” approaches, we show how statistical scaling models can be applied to hand-coded content analysis to improve estimates of political parties’ left-right policy positions. We apply a Bayesian item-response theory (IRT) model to category counts from coded party manifestos, treating the categories as “items” and policy positions as a latent variable. This approach also produces direct estimates of how each policy category relates to left-right ideology, without having to decide these relationships in advance based on out of sample fitting, political theory, assertion, or guesswork. This approach not only prevents the misspecification endemic to a fixed-index approach, but also works well even with items that are not specifically designed to measure ideological positioning.                          

# (10.03.2009) This chapter thoroughly describes the idea of analyzing text “as data” with a social science focus. It traces a brief history of this approach and distinguishes it from alternative approaches to text. It identifies the key research designs and methods for various ways that scholars in political science and international relations have used text, with references to fields such as natural language processing and computational linguistics from which some of the key methods are influenced or inherited. It surveys the varieties of ways that textual data is used and analyzed, covering key methods and pointing to applications of each. It also identifies the key stages of a research design using text as data, and critically discusses the practical and epistemological challenges at each stage.                             

#
($)
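One way to approach this is to split each text into weeks on the "$" marker and each week into days on the "#" marker, then count tokens per piece with quanteda. Below is a minimal sketch using a short inline sample in the same layout as T1; the file name in the comment is hypothetical, so substitute your own files.

```r
library(quanteda)

# Inline sample in the same layout as T1: days end with "#", weeks with "($)".
# For real files, build txt with e.g. paste(readLines("T1.txt"), collapse = "\n").
txt <- "(25.02.2009) First day of text. # (26.02.2009) Second day of text. # ($)
(27.02.2009) A new week begins here. # ($)"

# Split into weeks at "$", then each week into day sections at "#"
weeks <- unlist(strsplit(txt, "$", fixed = TRUE))
for (w in seq_along(weeks)) {
  days <- unlist(strsplit(weeks[w], "#", fixed = TRUE))
  days <- days[grepl("[[:alnum:]]", days)]  # drop marker remnants like "(" or ")"
  if (length(days) == 0) next
  # ntoken() returns the number of tokens in each day's section
  day_counts <- ntoken(tokens(days, remove_punct = TRUE))
  cat(sprintf("Week %d: %d words across %d days\n",
              w, sum(day_counts), length(days)))
  print(unname(day_counts))  # per-day word counts within this week
}
```

Weekly counts are then just the sum of the daily counts within each "$"-delimited segment, and the same loop can be run over each of T1 … Tn. As an alternative, quanteda's `corpus_segment()` can split a corpus on a tag pattern such as `"#"`, after which `ntoken()` on the resulting segment documents gives the same per-section counts.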
