Python: How can I apply histwords to my own text corpus?
I recently came across this paper () and have been reading through the GitHub repo (), but I'm still not quite sure how to apply it to my own data. My data is in the following format:
#### 2008
import pandas as pd

text_2008 = pd.DataFrame({'dat1': ["I love machine learning in 2008. Its awesome.",
"I love coding in Python in 2008",
"I love building chatbots in 2008",
"they chat amagingly well"]})
ID_2008 = pd.DataFrame({'dat2': [1,2,3,4]})
my_actual_data_format_2008 = text_2008.join(ID_2008)
#### 2009
text_2009 = pd.DataFrame({'dat1': ["I love machine learning. Its awesome.",
"I love coding in Python",
"I love building chatbots",
"they chat amagingly well"]})
ID_2009 = pd.DataFrame({'dat2': [1,2,3,4]})
my_actual_data_format_2009 = text_2009.join(ID_2009)
#### 2010
text_2010 = pd.DataFrame({'dat1': ["I love machine learning more in 2010. Its awesome.",
"I love coding in Python in 2010",
"I love building chatbots in 2010",
"they chat amagingly well"]})
ID_2010 = pd.DataFrame({'dat2': [1,2,3,4]})
my_actual_data_format_2010 = text_2010.join(ID_2010)
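For context, a minimal sketch of turning each year's DataFrame into a plain-text corpus file: the file names (`corpus_2008.txt` etc.) and the one-document-per-line layout here are my assumptions about what a raw-corpus input could look like, not something the repo mandates.

```python
import pandas as pd

def frame_to_corpus(df, path):
    """Write the 'dat1' text column to a .txt file, one document per line.
    Assumed layout: downstream scripts read whitespace-tokenized lines."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in df["dat1"]:
            f.write(doc.lower() + "\n")

# One raw-corpus file per time slice, e.g. corpus_2008.txt, corpus_2009.txt, ...
text_2008 = pd.DataFrame({'dat1': ["I love machine learning in 2008. Its awesome.",
                                   "I love coding in Python in 2008"]})
frame_to_corpus(text_2008, "corpus_2008.txt")
```

The same call would then be repeated for the 2009 and 2010 frames so each time slice gets its own file.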
So I have multiple pandas DataFrames, one per year, each row containing an ID column and a text column.
As far as I can tell, sgns takes .txt files as input rather than DataFrames. ()
The main page says: "If you want to learn historical embeddings for new data, using the code in the sgns directory is recommended."
If someone could point me in the right direction, that would be great! Should I save the pandas "text" rows as .txt files? Looking at the pipeline mentioned in the README:
**DATA:** raw corpus => corpus => pairs => counts => vocab
**TRADITIONAL:** counts + vocab => pmi => svd
**EMBEDDINGS:** pairs + vocab => sgns
**raw corpus => corpus**
- *scripts/clean_corpus.sh*
- Eliminates non-alphanumeric tokens from the original corpus.
**corpus => pairs**
- *corpus2pairs.py*
- Extracts a collection of word-context pairs from the corpus.
**pairs => counts**
- *scripts/pairs2counts.sh*
- Aggregates identical word-context pairs.
**counts => vocab**
- *counts2vocab.py*
- Creates vocabularies with the words' and contexts' unigram distributions.
**counts + vocab => pmi**
- *counts2pmi.py*
- Creates a PMI matrix (*scipy.sparse.csr_matrix*) from the counts.
**pmi => svd**
- *pmi2svd.py*
- Factorizes the PMI matrix using SVD. Saves the result as three dense numpy matrices.
**pairs + vocab => sgns**
- *word2vecf/word2vecf*
- An external program for creating embeddings with SGNS. For more information, see:
**"Dependency-Based Word Embeddings". Omer Levy and Yoav Goldberg. ACL 2014.**
An example pipeline is demonstrated in: *example_test.sh*
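The first step above (raw corpus => corpus) can be approximated in plain Python as a sanity check before running *scripts/clean_corpus.sh*. This is a rough sketch of the same idea, lowercasing and keeping only alphanumeric tokens; it is my approximation of the described behavior, not the script's exact output.

```python
import re

def clean_line(line):
    """Lowercase and keep only alphanumeric tokens, roughly what
    scripts/clean_corpus.sh is described as doing."""
    tokens = re.findall(r"[A-Za-z0-9]+", line.lower())
    return " ".join(tokens)

print(clean_line("I love machine learning in 2008. Its awesome."))
# -> i love machine learning in 2008 its awesome
```

Applying this to every line of each per-year .txt file would yield the cleaned corpus that the corpus => pairs step consumes.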