Gensim is an open-source Python library written by Radim Rehurek for unsupervised topic modeling and natural language processing (NLP). It is designed to extract semantic topics from documents and can handle large text collections via streaming, which distinguishes it from machine learning libraries that focus on in-memory processing. Gensim also provides efficient multi-core implementations of several algorithms to increase processing speed, and it offers more convenient text-processing facilities than alternatives such as Scikit-learn or R. This tutorial will cover the following concepts:
- Creating a corpus from a given dataset.
- Creating a TF-IDF matrix in Gensim.
- Creating bigrams and trigrams with Gensim.
- Creating a Word2Vec model with Gensim.
- Creating a Doc2Vec model with Gensim.
- Creating a topic model with LDA.
- Creating a topic model with LSI.
Before moving on, let’s understand what the following terms mean:
- Corpus: a collection of text documents.
- Vector: a numerical representation of text.
- Model: the algorithm used to generate a representation of the data.
- Topic modeling: a text-mining technique for extracting semantic topics from documents.
- Topic: a group of words that frequently occur together.
For example: you have a document consisting of the words bat, car, racquet, score, glass, drive, cup, keys, water, game, steering, liquid. You can group them into different topics:

| Topic 1 | Topic 2 | Topic 3 |
|---------|---------|---------|
| glass | bat | car |
| cup | racquet | drive |
| water | score | keys |
| liquid | game | steering |

Some of the methods of topic modeling:
- Latent Semantic Indexing (LSI)
- Latent Dirichlet Allocation (LDA)
Now that we have a basic understanding of the terminology, let’s move on to using the Gensim package. First, install the library using the following commands:
pip install gensim # or conda install gensim
Step 1: Create a corpus from a given dataset
You will need to follow these steps to create your dataset:
- Load the selected dataset.
- Pre-process your dataset.
- Create a dictionary.
- Create a Bag of Words.
1.1 Load the selected dataset:
You can have a .txt file as a dataset or you can also download the required datasets using the Gensim Downloader API.
import os

# read the text file as an object
doc = open('sample_data.txt', encoding='utf-8')
The Gensim downloader API is a module in the Gensim library for downloading datasets and pre-trained models and getting information about them.
import gensim.downloader as api

# check the available models and datasets
info_datasets = api.info()
print(info_datasets)

# information about a specific dataset
dataset_info = api.info("text8")

# load the "text8" dataset
dataset = api.load("text8")

# load a pre-trained model
word2vec_model = api.load('word2vec-google-news-300')
Here we will use a text file as the raw dataset: the text from a Wikipedia page.
1.2 Pre-processing the dataset
In NLP, text preprocessing refers to cleaning and preparing the text data. To do this, we will use simple_preprocess(), which returns a list of tokens after tokenization and normalization.
import gensim
import os
from gensim.utils import simple_preprocess

# read the text file as an object
doc = open('nlp-wiki.txt', encoding='utf-8')

# preprocess the file to get a list of tokens
tokenized = []
for sentence in doc.read().split('.'):
    # simple_preprocess returns the list of words in each sentence
    tokenized.append(simple_preprocess(sentence, deacc=True))

print(tokenized)
doc.close()
[['the', 'history', 'of', 'natural', 'language', 'processing', 'generally', 'started', 'in', 'the', 'although', 'work', 'can', 'be', 'found', 'from', 'earlier', 'periods'], ['in', 'alan', 'turing', 'published', 'an', 'article', 'titled', 'intelligence', 'which', 'proposed', 'what', 'is', 'now', 'called', 'the', 'turing', 'test', 'as', 'criterion', 'of', 'intelligence'], ['the', 'georgetown', 'experiment', 'in', 'involved', 'fully', 'automatic', 'translation', 'of', 'more', 'than', 'sixty', 'russian', 'sentences', 'into', 'english'], ['the', 'authors', 'claimed', 'that', 'within', 'three', 'or', 'five', 'years', 'machine', 'translation', 'would', 'be', 'solved', 'problem'], ...
1.3 Creating a Dictionary
We now have preprocessed data, which can be converted into a dictionary using corpora.Dictionary(). This dictionary is a collection of the unique tokens.
from gensim import corpora

# store the extracted tokens in the dictionary
my_dictionary = corpora.Dictionary(tokenized)
print(my_dictionary)
Dictionary(410 unique tokens: ['although', 'be', 'can', 'earlier', 'found']...)
1.3.1 Saving the dictionary
You can save (or load) your dictionary to disk directly or as a text file, as shown below:
# save the dictionary to disk
my_dictionary.save('my_dictionary.dict')

# load it back
load_dict = corpora.Dictionary.load('my_dictionary.dict')

# save the dictionary as a text file
from gensim.test.utils import get_tmpfile
tmp_fname = get_tmpfile('dictionary')
my_dictionary.save_as_text(tmp_fname)

# load the text file with your dictionary
load_dict = corpora.Dictionary.load_from_text(tmp_fname)
1.4 Creating a Bag of Words
Once we have a dictionary, we can create a Bag of Words corpus using doc2bow(). This function assigns an integer identifier to each word and counts its occurrences, returning the result as a sparse vector.
# convert the tokenized documents into bag-of-words vectors
bow_corpus = [my_dictionary.doc2bow(doc, allow_update=True) for doc in tokenized]
print(bow_corpus)
[[(0, 1), (1, 1), (2, 1), ... (407, 1), (408, 1), (409, 1)], ]
1.4.1 Saving the corpus to disk
Code for saving/loading your corpus:
from gensim.corpora import MmCorpus
from gensim.test.utils import get_tmpfile

output_fname = get_tmpfile("BoW_corpus.mm")

# save the corpus to disk
MmCorpus.serialize(output_fname, bow_corpus)

# load the corpus back
load_corpus = MmCorpus(output_fname)
Step 2: Create a TF-IDF matrix in Gensim
TF-IDF (Term Frequency – Inverse Document Frequency) is a widely used natural language processing model that helps you identify the most important words for each document in a corpus. Some words are not stop words but still occur frequently across documents while carrying little significance; such words should be removed or have their weight reduced. The TF-IDF model takes text in a single language and ensures that the words most common across the entire corpus do not show up as keywords. You can build a TF-IDF model using Gensim and the corpus you developed earlier as follows:
from gensim import models
import numpy as np

# word weights in the Bag of Words corpus
word_weight = []
for doc in bow_corpus:
    for id, freq in doc:
        word_weight.append([my_dictionary[id], freq])
print(word_weight)
Word weights before applying TF-IDF:
[['although', 1], ['be', 1], ['can', 1], ['earlier', 1], ... ['steps', 1], ['term', 1], ['transformations', 1]]
Applying the TF-IDF model:
# create the TF-IDF model
tfIdf = models.TfidfModel(bow_corpus, smartirs='ntc')

# TF-IDF word weights
weight_tfidf = []
for doc in tfIdf[bow_corpus]:
    for id, freq in doc:
        weight_tfidf.append([my_dictionary[id], np.around(freq, decimals=3)])
print(weight_tfidf)
Word weights after applying TF-IDF:
[['although', 0.339], ['be', 0.19], ['can', 0.237], ['earlier', 0.339], ... ['steps', 0.191], ['term', 0.191], ['transformations', 0.191]]
You can see that the words found frequently across documents are now assigned lower weights.
Step 3: Create bigrams and trigrams with Gensim
Many words appear together in text, and such combinations can mean something different from the individual words that make them up. For example, in beatboxing the words beat and boxing have their own meanings, but together they represent something else entirely. A bigram is a group of two words; a trigram is a group of three. Here we will use the text8 dataset, which can be downloaded with the Gensim downloader API. Code for building bigrams and trigrams:
import gensim.downloader as api
from gensim.models.phrases import Phrases

# download the "text8" dataset
dataset = api.load("text8")

# extract the word lists from the dataset
data = []
for word in dataset:
    data.append(word)

# bigrams using the Phrases model
bigram_model = Phrases(data, min_count=3, threshold=10)
print(bigram_model[data[0]])
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early', 'working_class', 'radicals', 'including', 'the', 'diggers', 'of', 'the', 'english', 'revolution', 'and', 'the', 'sans_culottes', 'of', 'the', 'french_revolution', 'whilst', 'the', 'term', 'is', 'still' ...
To create trigrams, we simply pass the bigram model obtained above to the same function.
# trigrams using the Phrases model
trigram_model = Phrases(bigram_model[data], threshold=10)

# trigram
print(trigram_model[bigram_model[data[0]]])
['anarchism', 'originated', 'as', 'a', 'term', 'of', 'abuse', 'first', 'used', 'against', 'early' ...
Step 4: Create a Word2Vec model with Gensim
ML/DL algorithms cannot use text directly, so we need a numerical representation for them to process the data. Simple machine learning applications use CountVectorizer and TF-IDF, which do not preserve the relationships between words. Word2Vec is a method for creating vector representations (word embeddings) that maps all the words in a language into a vector space of a given dimensionality. We can perform mathematical operations on these vectors that preserve the relationships between words, for example: king – man + woman ≈ queen. Ready-made vector-semantic models such as word2vec, GloVe, fastText and others can be loaded using the Gensim downloader API. Sometimes vector representations for certain words in your document are missing from these packages, but you can solve this problem by training your own model.
4.1) Training the model
import gensim.downloader as api
from multiprocessing import cpu_count
from gensim.models.word2vec import Word2Vec

# download the "text8" dataset
dataset = api.load("text8")

# extract the word lists from the dataset
data = []
for word in dataset:
    data.append(word)

# split the data into two parts
data_1 = data[:1200]   # used to train the model
data_2 = data[1200:]   # used to update the model

# train the Word2Vec model
w2v_model = Word2Vec(data_1, min_count=0, workers=cpu_count())

# word vector for "time"
print(w2v_model.wv['time'])
Vector for the word “time”:
[-0.04681756 -0.08213229 1.0628034 -1.0186515 1.0779341 -0.89710116 0.6538859 -0.81849015 -0.29984367 0.55887854 2.138567 -0.93843514 ... -1.4128548 -1.3084044 0.94601256 0.27390406 0.6346426 -0.46116787 0.91097695 -3.597664 0.6901859 1.0902803 ]
You can also use most_similar() to find words similar to the one passed in.
# words similar to "time"
print(w2v_model.wv.most_similar('time'))

# save and load the model
w2v_model.save('Word2VecModel')
model = Word2Vec.load('Word2VecModel')
Words most similar to “time”:
[('moment', 0.6137239933013916), ('period', 0.5904807448387146), ('stage', 0.5393826961517334), ('decade', 0.51670902967453), ('lifetime', 0.4878680109977722), ('once', 0.4843854010105133), ('distance', 0.4821343719959259), ('breteuil', 0.4815649390220642), ('preestablished', 0.47662678360939026), ('point', 0.4757876396179199)]
4.2) Updating the model
# build the vocabulary from the new sequence of sentences
w2v_model.build_vocab(data_2, update=True)

# train the word vectors on the new data
w2v_model.train(data_2, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs)
print(w2v_model.wv['time'])
The output will give you new weights for the words.
Step 5: Creating a Doc2Vec model with Gensim
Unlike the Word2Vec model, the Doc2Vec model generates a vector representation for an entire document or group of words. With this model we can find relationships between different documents: for example, if we train it on literature such as Through the Looking-Glass, we can find that it is close to Alice's Adventures in Wonderland.
5.1) Preparing the training data
import gensim
import gensim.downloader as api
from gensim.models import doc2vec

# get the dataset
dataset = api.load("text8")
data = []
for w in dataset:
    data.append(w)

# to train the model we need a list of tagged documents
def tagged_document(list_of_ListOfWords):
    for x, ListOfWords in enumerate(list_of_ListOfWords):
        yield doc2vec.TaggedDocument(ListOfWords, [x])

# training data
data_train = list(tagged_document(data))

# print a sample of the training data
print(data_train[:1])
The output is a sample of the tagged training data.
5.2) Training the model
# initialize the model
d2v_model = doc2vec.Doc2Vec(vector_size=40, min_count=2, epochs=30)

# build the vocabulary
d2v_model.build_vocab(data_train)

# train the Doc2Vec model
d2v_model.train(data_train, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)

# infer a vector for a new document
analyze = d2v_model.infer_vector(['violent', 'means', 'to', 'destroy'])
print(analyze)
The inferred document vector:
[-3.79053354e-02 -1.03341974e-01 -2.85615563e-01 1.37473553e-01 1.79868549e-01 3.42468806e-02 -1.68495290e-02 -1.86038092e-01 ... -1.20517321e-01 -1.48323074e-01 -5.70210926e-02 -2.15077385e-01]
Step 6: Create a Topic Model with LDA
LDA is a popular topic modeling method in which each document is treated as a mixture of topics in certain proportions. We need to assess useful qualities of the topics, such as how well separated and meaningful they are. Good-quality topics depend on:
- quality of text processing,
- finding the optimal number of topics,
- algorithm parameters tuning.
Take the following steps to create a model.
6.1 Data preparation
This is done by removing stop words and then lemmatizing your data. For lemmatization with NLTK's WordNetLemmatizer, we first need to download the stop words and the WordNet data (the pattern package is used by Gensim's own lemmatization utilities).
pip install pattern

# in the python console
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('wordnet')
import gensim
from gensim import corpora
from gensim.models import LdaModel, LdaMulticore
import gensim.downloader as api
from gensim.utils import simple_preprocess
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import re
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)

# load the stop words
stop_words = stopwords.words('english')

# add extra stop words
stop_words = stop_words + ['subject', 'com', 'are', 'edu', 'would', 'could']

lemmatizer = WordNetLemmatizer()

# load the dataset
dataset = api.load("text8")
data = [w for w in dataset]

# data preparation
processed_data = []
for x, doc in enumerate(data[:100]):
    doc_out = []
    for word in doc:
        if word not in stop_words:
            # lemmatize the remaining words
            lemmatized_word = lemmatizer.lemmatize(word)
            if lemmatized_word:
                doc_out.append(lemmatized_word)
        else:
            continue
    processed_data.append(doc_out)

# processed_data is a list of word lists; sample output
print(processed_data[:10])
['anarchism', 'originated', 'term', 'abuse', 'first', 'used', 'early', 'working', 'class', 'radical']
6.2 Dictionary and corpus creation
The processed data will now be used to create the dictionary and corpus.
dictionary = corpora.Dictionary(processed_data)
corpus = [dictionary.doc2bow(l) for l in processed_data]
6.3 Training the LDA model
We will train an LDA model with 5 topics using the dictionary and corpus created earlier. LdaModel() is used here, but you can also use LdaMulticore(), which allows parallel processing.
# train the model
LDA_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5)

# save the model
LDA_model.save('LDA_model.model')

# show the topics
print(LDA_model.print_topics(-1))
Words that occur in more than one topic and carry little meaning can be added to the stop-word list.
6.4 Output Interpretation
The LDA model gives us information about the data in three ways:
- the topics in the document;
- the topic each word belongs to;
- the phi values.
The phi value is the probability that a word belongs to a particular topic. For a given word, the sum of the phi values over all topics gives the number of times it occurs in the document.
# probability of the word belonging to each topic
LDA_model.get_term_topics('fire')

bow_list = ['time', 'space', 'car']

# first convert the list into a bag of words
bow = LDA_model.id2word.doc2bow(bow_list)

# interpret the data
doc_topics, word_topics, phi_values = LDA_model.get_document_topics(bow, per_word_topics=True)
Step 7: Create a Topic Model with LSI
To create a model with LSI, simply follow the same steps as for LDA; just use LsiModel() instead of LdaModel() for training.
from gensim.models import LsiModel

# train the LSI model
LSI_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=7, decay=0.5)

# topics
print(LSI_model.print_topics(-1))
These are just some of the features of the Gensim library. They are very handy, especially when you are doing NLP, and you are, of course, free to apply them as you see fit.