Python: How can I apply histwords to my own text corpus?
I recently came across this paper () and have been reading through the GitHub repo (), but I'm still not quite sure how to apply it to my own data. My data is in the following format:
#### 2008
import pandas as pd

text_2008 = pd.DataFrame({'dat1': ["I love machine learning in 2008. Its awesome.",
"I love coding in Python in 2008",
"I love building chatbots in 2008",
"they chat amagingly well"]})
ID_2008 = pd.DataFrame({'dat2': [1,2,3,4]})
my_actual_data_format_2008 = text_2008.join(ID_2008)
#### 2009
text_2009 = pd.DataFrame({'dat1': ["I love machine learning. Its awesome.",
"I love coding in Python",
"I love building chatbots",
"they chat amagingly well"]})
ID_2009 = pd.DataFrame({'dat2': [1,2,3,4]})
my_actual_data_format_2009 = text_2009.join(ID_2009)
#### 2010
text_2010 = pd.DataFrame({'dat1': ["I love machine learning more in 2010. Its awesome.",
"I love coding in Python in 2010",
"I love building chatbots in 2010",
"they chat amagingly well"]})
ID_2010 = pd.DataFrame({'dat2': [1,2,3,4]})
my_actual_data_format_2010 = text_2010.join(ID_2010)
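For context, a minimal sketch of turning each year's DataFrame into a plain-text corpus file: the file names (`corpus_2008.txt` etc.) and the one-document-per-line layout here are my assumptions about what a raw-corpus input could look like, not something the repo mandates.

```python
import pandas as pd

def frame_to_corpus(df, path):
    """Write the 'dat1' text column to a .txt file, one document per line.
    Assumed layout: downstream scripts read whitespace-tokenized lines."""
    with open(path, "w", encoding="utf-8") as f:
        for doc in df["dat1"]:
            f.write(doc.lower() + "\n")

# One raw-corpus file per time slice, e.g. corpus_2008.txt, corpus_2009.txt, ...
text_2008 = pd.DataFrame({'dat1': ["I love machine learning in 2008. Its awesome.",
                                   "I love coding in Python in 2008"]})
frame_to_corpus(text_2008, "corpus_2008.txt")
```

The same call would then be repeated for the 2009 and 2010 frames so each time slice gets its own file.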
So I have multiple pandas DataFrames, one per year, each row containing an ID column and a text column.
As far as I can tell, sgns takes .txt files as input rather than DataFrames. ()
The main page says: "If you want to learn historical embeddings for new data, using the code in the sgns directory is recommended."
If someone could point me in the right direction, that would be great! Should I save the pandas "text" rows as .txt files? Looking at the pipeline mentioned in the README:
**DATA:** raw corpus => corpus => pairs => counts => vocab
**TRADITIONAL:** counts + vocab => pmi => svd
**EMBEDDINGS:** pairs + vocab => sgns
**raw corpus => corpus**
- *scripts/clean_corpus.sh*
- Eliminates non-alphanumeric tokens from the original corpus.
**corpus => pairs**
- *corpus2pairs.py*
- Extracts a collection of word-context pairs from the corpus.
**pairs => counts**
- *scripts/pairs2counts.sh*
- Aggregates identical word-context pairs.
**counts => vocab**
- *counts2vocab.py*
- Creates vocabularies with the words' and contexts' unigram distributions.
**counts + vocab => pmi**
- *counts2pmi.py*
- Creates a PMI matrix (*scipy.sparse.csr_matrix*) from the counts.
**pmi => svd**
- *pmi2svd.py*
- Factorizes the PMI matrix using SVD. Saves the result as three dense numpy matrices.
**pairs + vocab => sgns**
- *word2vecf/word2vecf*
- An external program for creating embeddings with SGNS. For more information, see:
**"Dependency-Based Word Embeddings". Omer Levy and Yoav Goldberg. ACL 2014.**
An example pipeline is demonstrated in: *example_test.sh*
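The first step above (raw corpus => corpus) can be approximated in plain Python as a sanity check before running *scripts/clean_corpus.sh*. This is a rough sketch of the same idea, lowercasing and keeping only alphanumeric tokens; it is my approximation of the described behavior, not the script's exact output.

```python
import re

def clean_line(line):
    """Lowercase and keep only alphanumeric tokens, roughly what
    scripts/clean_corpus.sh is described as doing."""
    tokens = re.findall(r"[A-Za-z0-9]+", line.lower())
    return " ".join(tokens)

print(clean_line("I love machine learning in 2008. Its awesome."))
# -> i love machine learning in 2008 its awesome
```

Applying this to every line of each per-year .txt file would yield the cleaned corpus that the corpus => pairs step consumes.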