The Complete Beginner’s Guide to the Gensim Library

by Alex

Gensim is an open-source Python library written by Radim Řehůřek for unsupervised topic modeling and natural language processing (NLP). It is designed to extract semantic topics from documents, and it can handle large text collections by streaming data instead of loading everything into memory, which distinguishes it from machine learning libraries that only target in-memory processing. Gensim also provides efficient multi-core implementations of several algorithms to increase processing speed, and it offers more convenient text processing facilities than alternatives such as Scikit-learn or R. This tutorial will cover the following concepts:

  1. Creating a corpus from a given dataset.
  2. Creating a TF-IDF matrix in Gensim.
  3. Creating bigrams and trigrams with Gensim.
  4. Creating Word2Vec models with Gensim.
  5. Creating Doc2Vec models with Gensim.
  6. Creating topic models with LDA.
  7. Creating topic models with LSI.

Before moving on, let’s understand what the following terms mean:

  • Corpus: a collection of text documents.
  • Vector: a form of text representation.
  • Model: the algorithm used to generate a representation of data.
  • Topic modeling: a text mining technique used to extract semantic themes from documents.
  • Topic: a recurring group of words that frequently occur together.

For example, suppose you have a document consisting of words such as: bat, car, racquet, score, glass, drive, cup, keys, water, game, steering, liquid. You can group them into different topics:

Topic 1      Topic 2      Topic 3
glass        bat          car
cup          racquet      drive
water        score        keys
liquid       game         steering

Some of the methods of topic modeling:

  • Latent Semantic Indexing (LSI)
  • Latent Dirichlet Allocation (LDA)

Now that we have a basic understanding of the terminology, let’s move on to using the Gensim package. First, install the library using the following commands:

pip install gensim
# or
conda install gensim

Step 1: Create a corpus from a given dataset

You will need to follow these steps to create your corpus:

  1. Load the selected dataset.
  2. Pre-process your dataset.
  3. Create a dictionary.
  4. Create a Bag of Words.

1.1 Load the selected dataset:

You can use a .txt file as a dataset, or you can download the required datasets using the Gensim downloader API.


import os
# read the text file as an object
doc = open('sample_data.txt', encoding ='utf-8')

The Gensim downloader API is a module in the Gensim library for downloading datasets and pre-trained models and for querying information about them.


import gensim.downloader as api
# check for available models and datasets
info_datasets = api.info()
print(info_datasets)
# information about a specific dataset
dataset_info = api.info("text8")
# dataset "text8" is loaded
dataset = api.load("text8")
# load pre-trained model
word2vec_model = api.load('word2vec-google-news-300')

Here we will use a text file as the raw dataset: the text of a Wikipedia page saved as nlp-wiki.txt.

1.2 Pre-processing the dataset

In NLP, text preprocessing refers to the process of cleaning and preparing text data. To do this, we will use simple_preprocess(), which returns a list of tokens after tokenization and normalization.

import gensim
import os
from gensim.utils import simple_preprocess

# read the text file as an object
doc = open('nlp-wiki.txt', encoding='utf-8')

# preprocess the file to get a list of tokens per sentence
tokenized = []
for sentence in doc.read().split('.'):
    # simple_preprocess returns a lowercased, tokenized word list for each sentence
    tokenized.append(simple_preprocess(sentence, deacc=True))

print(tokenized)
doc.close()

Tokenized output:

[['the', 'history', 'of', 'natural', 'language', 'processing', 'generally', 'started', 'in', 'the', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods'], ['in', 'alan', 'turing', 'published', 'an', 'article', 'titled', 'intelligence', 'which', 'proposed', 'what', 'is', 'now', 'called', 'the', 'turing', 'test', 'as', 'criterion', 'of', 'intelligence'], ['the', 'georgetown', 'experiment', 'in', 'involved', 'fully', 'automatic', 'translation', 'of', 'more', 'than', 'sixty', 'russian', 'sentences', 'into', 'english'], ['the', 'authors', 'claimed', 'that', 'within', 'three', 'or', 'five', 'years', 'machine', 'translation', 'would', 'be', 'solved', 'problem'],
...

1.3 Creating a Dictionary

We now have pre-processed data that can be converted into a dictionary using corpora.Dictionary(). This dictionary is a collection of unique tokens.


from gensim import corpora
# save the tokens we have extracted into the dictionary
my_dictionary = corpora.Dictionary(tokenized)
print(my_dictionary)
Dictionary(410 unique tokens: ['although', 'be', 'can', 'earlier', 'found']...)
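
Each unique token is assigned an integer id, which will reappear when we build the Bag of Words corpus. If you want to inspect the token-to-id mapping (an optional check), you can print token2id:

# mapping from each token to its integer id
print(my_dictionary.token2id)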

1.3.1 Saving the dictionary

You can save (or load) your dictionary to disk directly or as a text file, as shown below:


# save the dictionary to disk
my_dictionary.save('my_dictionary.dict')
# load it back
load_dict = corpora.Dictionary.load('my_dictionary.dict')
# save the dictionary in a text file
from gensim.test.utils import get_tmpfile
tmp_fname = get_tmpfile('dictionary')
my_dictionary.save_as_text(tmp_fname)
# load text file with your dictionary
load_dict = corpora.Dictionary.load_from_text(tmp_fname)

1.4 Creating a Bag of Words

Once we have a dictionary, we can create a Bag of Words corpus using doc2bow(). This function counts the number of occurrences of each word, maps each word to its integer id, and returns the result as a sparse vector.


# convert the tokenized documents into a Bag of Words corpus
bow_corpus = [my_dictionary.doc2bow(doc, allow_update=True) for doc in tokenized]
print(bow_corpus)
[[(0, 1), (1, 1), (2, 1),
...
(407, 1), (408, 1), (409, 1)], []]

1.4.1 Saving the corpus to disk

Code for saving and loading your corpus:


from gensim.corpora import MmCorpus
from gensim.test.utils import get_tmpfile
output_fname = get_tmpfile("BoW_corpus.mm")
# save the corpus to disk
MmCorpus.serialize(output_fname, bow_corpus)
# load the corpus
load_corpus = MmCorpus(output_fname)

Step 2: Create a TF-IDF matrix in Gensim

TF-IDF (Term Frequency – Inverse Document Frequency) is a widely used model in natural language processing that helps you identify the most important words for each document in a corpus. Some words are not stop words yet still occur frequently across documents while carrying little meaning, so their importance should be reduced. The TF-IDF model re-weights the Bag of Words counts so that the most common words in the entire corpus do not dominate as keywords. You can build a TF-IDF model with Gensim using the corpus you created earlier as follows:


from gensim import models
import numpy as np
# word weights in the Bag of Words corpus
word_weight = []
for doc in bow_corpus:
    for id, freq in doc:
        word_weight.append([my_dictionary[id], freq])
print(word_weight)

Word weights before applying TF-IDF:

[['although', 1], ['be', 1], ['can', 1], ['earlier', 1],
...
['steps', 1], ['term', 1], ['transformations', 1]]

Code (TF-IDF model application):


# create the TF-IDF model
tfIdf = models.TfidfModel(bow_corpus, smartirs='ntc')

# TF-IDF word weights
weight_tfidf = []
for doc in tfIdf[bow_corpus]:
    for id, freq in doc:
        weight_tfidf.append([my_dictionary[id], np.around(freq, decimals=3)])
print(weight_tfidf)

Word weights after applying TF-IDF:

[['although', 0.339], ['be', 0.19], ['can', 0.237], ['earlier', 0.339],
...
['steps', 0.191], ['term', 0.191], ['transformations', 0.191]]

You can see that words occurring frequently across the documents are now assigned lower weights.

Step 3: Create bigrams and trigrams with Gensim

Many words appear together in text, and such combinations carry a different meaning than their component words do individually. For example, in "beatboxing" the words "beat" and "boxing" have their own meanings, but together they mean something quite different. A bigram is a group of two words, and a trigram is a group of three. Here we will use the text8 dataset, which can be downloaded with the Gensim downloader API. Code for building bigrams and trigrams:


import gensim.downloader as api
from gensim.models.phrases import Phrases

# download the "text8" dataset
dataset = api.load("text8")

# extract the word lists from the dataset
data = []
for word in dataset:
    data.append(word)

# build a bigram model with the Phrases model
bigram_model = Phrases(data, min_count=3, threshold=10)
print(bigram_model[data[0]])
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working_class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans_culottes', 'of', 'the', 'french_revolution', 'whilst', 'the', 'term', 'is', 'still' ...

To create trigrams, we simply pass the bigram model obtained above to the same function.


# trigrams using the Phrases model
trigram_model = Phrases(bigram_model[data], threshold=10)
# print the trigrams for the first document
print(trigram_model[bigram_model[data[0]]])
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early' ...
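
Once you are satisfied with the detected phrases, the Phrases model can be exported into a smaller, faster object before applying it repeatedly to large corpora. A minimal sketch, assuming Gensim 4.x (older versions use the separate Phraser class instead):

# freeze the bigram model into a lighter object that only supports phrase detection
frozen_bigram_model = bigram_model.freeze()
print(frozen_bigram_model[data[0]])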

Step 4: Create a Word2Vec model with Gensim

ML/DL algorithms cannot work with text directly, so we need a numerical representation that they can process. Simple approaches such as CountVectorizer and TF-IDF do not preserve the relationships between words. Word2Vec is a method for creating vector representations (word embeddings) that map every word in the vocabulary into a vector space of a given dimensionality. We can perform mathematical operations on these vectors that reflect the relationships between words, for example: king – man + woman ≈ queen. Ready-made vector-semantic models such as word2vec, GloVe and fastText can be loaded with the Gensim downloader API. Sometimes vector representations of certain words in your documents are absent from these packages, but you can solve this problem by training your own model.
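
The analogy above can be reproduced with most_similar(), which accepts lists of positive and negative words. A minimal sketch, assuming you load the pre-trained 'word2vec-google-news-300' vectors from the downloader API (a large download; any sufficiently large trained Word2Vec model will also work):

import gensim.downloader as api

# load pre-trained word vectors (returns a KeyedVectors object)
word_vectors = api.load('word2vec-google-news-300')

# vector arithmetic: king - man + woman should land near "queen"
print(word_vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))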

4.1) Training the model


import gensim.downloader as api
from multiprocessing import cpu_count
from gensim.models.word2vec import Word2Vec

# download the "text8" dataset
dataset = api.load("text8")

# extract the word lists from the dataset
data = []
for word in dataset:
    data.append(word)

# divide the data into two parts
data_1 = data[:1200]  # used to train the model
data_2 = data[1200:]  # used to update the model

# train the Word2Vec model
w2v_model = Word2Vec(data_1, min_count=0, workers=cpu_count())

# word vector for "time"
print(w2v_model.wv['time'])

Vector for the word “time”:

[-0.04681756 -0.08213229 1.0628034 -1.0186515 1.0779341 -0.89710116
0.6538859 -0.81849015 -0.29984367 0.55887854 2.138567 -0.93843514
...
-1.4128548 -1.3084044 0.94601256 0.27390406 0.6346426 -0.46116787
0.91097695 -3.597664 0.6901859 1.0902803 ]

You can also use most_similar() to find words similar to the one passed in.


# words similar to "time"
print(w2v_model.wv.most_similar('time'))
# save and load the model
w2v_model.save('Word2VecModel')
model = Word2Vec.load('Word2VecModel')

Words most similar to “time”:

[('moment', 0.6137239933013916), ('period', 0.5904807448387146), ('stage', 0.5393826961517334), ('decade', 0.51670902967453), ('lifetime', 0.4878680109977722), ('once', 0.4843854010105133), ('distance', 0.4821343719959259), ('breteuil', 0.4815649390220642), ('preestablished', 0.47662678360939026), ('point', 0.4757876396179199)]

4.2) Updating the model


# add the new data to the model's vocabulary
w2v_model.build_vocab(data_2, update=True)
# continue training the word vectors on the new data
w2v_model.train(data_2, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs)
print(w2v_model.wv['time'])

The output will give you new weights for the words.
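
To see how the update changed the model, you can also re-run most_similar() and compare the neighbours of a word before and after the update:

# words similar to "time" after updating the model
print(w2v_model.wv.most_similar('time', topn=5))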

Step 5: Creating a Doc2Vec model with Gensim

Unlike the Word2Vec model, the Doc2Vec model generates a vector representation for an entire document or group of words. With this model we can find relationships between different documents: for example, if we train it on literature such as "Through the Looking-Glass", we can say that "Through the Looking-Glass" is similar to "Alice in Wonderland".

5.1) Preparing the training data


import gensim
import gensim.downloader as api
from gensim.models import doc2vec

# get the dataset
dataset = api.load("text8")
data = []
for w in dataset:
    data.append(w)

# to train the model we need a list of tagged documents
def tagged_document(list_of_ListOfWords):
    for x, ListOfWords in enumerate(list_of_ListOfWords):
        yield doc2vec.TaggedDocument(ListOfWords, [x])

# training data
data_train = list(tagged_document(data))

# output a sample of the training data
print(data_train[:1])

The output is a list of TaggedDocument objects that will be used for training.

5.2) Training the model

# initialize the model
d2v_model = doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)
# build the vocabulary
d2v_model.build_vocab(data_train)
# train the Doc2Vec model
d2v_model.train(data_train, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)
# infer a vector for a new list of words
analyze = d2v_model.infer_vector(['violent', 'means', 'to', 'destroy'])
print(analyze)

The inferred vector:

[-3.79053354e-02 -1.03341974e-01 -2.85615563e-01 1.37473553e-01
1.79868549e-01 3.42468806e-02 -1.68495290e-02 -1.86038092e-01
...
-1.20517321e-01 -1.48323074e-01 -5.70210926e-02 -2.15077385e-01]
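
To find out which training documents are closest to an inferred vector, you can query the document-vector index. A minimal sketch, assuming Gensim 4.x (where the document vectors live in the dv attribute; older versions use docvecs):

# training documents whose vectors are most similar to the inferred vector
similar_docs = d2v_model.dv.most_similar([analyze], topn=3)
print(similar_docs)  # list of (document tag, cosine similarity) pairs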

Step 6: Create a Topic Model with LDA

LDA is a popular topic modeling method in which each document is treated as a mixture of topics in certain proportions. The goal is to obtain useful topics, i.e. topics that are well separated and meaningful. The quality of the topics depends on:

  1. the quality of text preprocessing,
  2. finding the optimal number of topics,
  3. tuning the algorithm's parameters.

Take the following steps to create a model.

6.1 Data preparation

This is done by removing stop words and then lemmatizing the data. To perform lemmatization we first need to install the pattern package and download the NLTK stop words and WordNet data.

pip install pattern
# in the python console
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('wordnet')

import gensim
from gensim import corpora
from gensim.models import LdaModel, LdaMulticore
import gensim.downloader as api
from gensim.utils import simple_preprocess
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
# nltk.download('stopwords')
from nltk.corpus import stopwords
import re
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

# load the stop words
stop_words = stopwords.words('english')
# add extra stop words
stop_words = stop_words + ['subject', 'com', 'are', 'edu', 'would', 'could']
lemmatizer = WordNetLemmatizer()

# load the dataset
dataset = api.load("text8")
data = [w for w in dataset]

# data preparation
processed_data = []
for x, doc in enumerate(data[:100]):
    doc_out = []
    for word in doc:
        if word not in stop_words:  # remove stop words
            lemmatized_word = lemmatizer.lemmatize(word)  # lemmatize
            if lemmatized_word:
                doc_out.append(lemmatized_word)
        else:
            continue
    processed_data.append(doc_out)  # processed_data is a list of word lists

# sample output
print(processed_data[0][:10])
['anarchism', 'originated', 'term', 'abuse', 'first', 'used', 'early', 'working', 'class', 'radical']

6.2 Dictionary and corpus creation

The processed data will now be used to create the dictionary and corpus.


dictionary = corpora.Dictionary(processed_data)
corpus = [dictionary.doc2bow(l) for l in processed_data]

6.3 Training the LDA model

We will train an LDA model with 5 topics using the dictionary and corpus created earlier. LdaModel() is used here, but you can also use LdaMulticore(), which allows parallel processing (a sketch is shown after the training code below).


# Training
LDA_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5)
# save model
LDA_model.save('LDA_model.model')
# show the topics
print(LDA_model.print_topics(-1))
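
As mentioned above, LdaMulticore() can spread training across CPU cores. A minimal sketch of the equivalent call (the workers value is an assumption; a common choice is the number of physical cores minus one):

from gensim.models import LdaMulticore

# train the same 5-topic model using several worker processes
LDA_model_mc = LdaMulticore(corpus=corpus, id2word=dictionary, num_topics=5, workers=3)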

Words that occur in more than one topic and have little value can be added to the list of stop words.
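
For example, you could extend the stop word list and re-run the preparation and training steps (the added words here are purely illustrative):

# hypothetical low-value words that showed up in several topics
stop_words = stop_words + ['also', 'one', 'may', 'new']
# ...then repeat steps 6.1-6.3 with the extended list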

6.4 Output Interpretation

The LDA model basically gives us information in three ways:

  1. The topics in the document.
  2. The topic each word belongs to.
  3. The phi values.

The phi value is the probability that a given word belongs to a particular topic. For a given word, the sum of its phi values over all topics gives the number of times that word occurs in the document.


# topics most relevant to the word "fire", with their probabilities
print(LDA_model.get_term_topics('fire'))

bow_list = ['time', 'space', 'car']
# first convert the word list into a bag of words
bow = LDA_model.id2word.doc2bow(bow_list)

# per-document topics, per-word topics and phi values
doc_topics, word_topics, phi_values = LDA_model.get_document_topics(bow, per_word_topics=True)
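
The three return values can then be inspected directly (the exact numbers depend on your trained model):

# [(topic id, probability), ...] for the whole bag of words
print(doc_topics)
# [(word id, [topic ids]), ...] - the most likely topics for each word
print(word_topics)
# [(word id, [(topic id, phi value), ...]), ...]
print(phi_values)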

Step 7: Create a Topic Model with LSI

To create a model with LSI, simply follow the same steps as with LDA; just use LsiModel() instead of LdaModel() or LdaMulticore() for training.


from gensim.models import LsiModel
# LSI model training
LSI_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=7, decay=0.5)
# show the topics
print(LSI_model.print_topics(-1))

Conclusion

These are just some of the features of the Gensim library. They’re very handy to use, especially when you’re doing NLP. You are, of course, free to apply them as you see fit.
