7 minute read


While browsing the Internet, you have probably seen a picture of a cloud filled with words of varying sizes that reflect the frequency of each word within a given text. This is referred to as a Tag Cloud or a Word Cloud. In this tutorial (see the notebook here), we will learn how to make Word Clouds in Python. This tool is useful for a visual exploration of text data.

We will use legal texts for the purpose of this tutorial, namely the Tunisian Constitution and the Tunisian Hydrocarbons Code.


As usual, we start by importing the different libraries used.
The NumPy library is used for handling large, multi-dimensional arrays and matrices.
For visualization, matplotlib is a comprehensive plotting library. It enables other libraries, such as seaborn and wordcloud, to run on its base.
The pillow library adds support for opening, manipulating, and saving many different image file formats.

from PIL import Image
from wordcloud import WordCloud, STOPWORDS
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns

Generating Word Clouds

Then, we read the Hydrocarbons Code stored in a .txt file. I chose this legal document, because I have worked on it and analysed it for a thesis report I wrote. It is a complex legal text that pertains to a sensitive and important topic, that is natural resources.

We have to make some necessary text processing. First, we convert the entire text to lower case. This is important since Python strings are case sensitive. Afterwards, we use the magic of regular expressions to deal with apostrophes and other characters to be removed.
Whenever in doubt regarding regular expressions syntax, you can use this website or the official Python documentation to have the desired outcome.

hydrocarbons_code = open("hydrocarbons_code.txt", encoding='latin_1').read().replace("\n"," ")
hydrocarbons_code = hydrocarbons_code.lower()
hydrocarbons_code = re.sub(".'|[^\w ]", " ", hydrocarbons_code)

Since the text is in French, we have to make our own list (technically it’s a Python set with curly brackets here) of the words to remove from the given text. These words would not show in the Word Cloud.

If you do not speak French, most of these words are the equivalent to English articles, pronouns and conjunctions. They would not help us much in understanding the text through visuals.

stop = {'de','du','des','et','est','la','le','les','en','ou','par','au','aux','dans','une','un','pour','sur','ce','ces',
        'ne','qui','que','son','ses','sa','il','ci','a'}

In order to create an image form for the Word Cloud, we need to use a PNG file as a mask. Here, I use the map of Tunisia, just for fun.

The mask argument in the WordCloud function takes an N dimensional array (ndarray). We use Image module to open the PNG file, and we transform it to the numpy array form.
According to WordCloud documentation, all white entries will be considerd masked out, while other entries will be free to draw on. In the NumPy array, all white parts of the mask have a value of 255, whereas values of 1 are black.

map_mask = np.array(Image.open("map.png"))

Now, we have a proper mask and we can make a cloud with the desired shape.
WordCloud takes several parameters, and you can create a personalised result by changing the optional arguments. Some of these are fairly self-explanatory. For the rest, you can always consult the relevant documentation. Or, you can check out the docstring of the function and see the required and optional arguments, by typing and running ?WordCloud.

hydrocarbons_cloud = WordCloud(max_words=1000, mask=map_mask, stopwords=stop , min_word_length = 3, min_font_size = 8,
                               margin=5, random_state=1, background_color="white", include_numbers=True).generate(hydrocarbons_code)

Finally, we can output the result and have an insightful and beautiful visualizations.
Naturally, the most frequent words are hydrocarbons, code, article and holder, as expected in a legal document about hydrocarbons!

%matplotlib inline
plt.figure(figsize=(18,18))
ax = plt.gca()
ax.set_title("Tunisian Hydrocarbons Code Cloud",fontsize=22)
plt.imshow(hydrocarbons_cloud.recolor(colormap=sns.color_palette(palette='blend:red,brown',as_cmap=True), random_state=3),
           interpolation="bilinear")
plt.axis("off")
plt.show()
#hydrocarbons_cloud.to_file("Hydrocarbons code cloud.png") #comment out to save the figure in a PNG format

png showing Hydrocarbons code cloud as the Tunisian map


Now, let’s take a look at the Tunisian Constitution. We will use an English translation from constitute project.

As before, we start by constructing a mask. For this example, it will be the Tunisian flag. For the rest, the only difference is that we use the default built-in STOPWORDS list.

flag_mask = np.array(Image.open("Flag_of_Tunisia.png"))

tunisian_constitution = open("constitution.txt").read().replace("\n"," ").lower()
tunisian_constitution = re.sub("[^\w ]", " ", tunisian_constitution)

constitution_cloud = WordCloud(max_words=1000, mask=flag_mask, stopwords=STOPWORDS , min_word_length = 3, min_font_size = 8,
                               margin=5, random_state=1, background_color="white", include_numbers=True).generate(tunisian_constitution)

Our exquisite result looks good, and presents us with useful visual anchor.

%matplotlib inline
plt.figure(figsize=(20,10))
ax = plt.gca()
ax.set_title("Tunisian Constitution Cloud",fontsize=22, y=1.04)
plt.imshow(constitution_cloud.recolor(colormap=sns.color_palette(palette='blend:red,brown',as_cmap=True)),
           interpolation="bilinear")
plt.axis("off")
plt.show()
#constitution_cloud.to_file("Tunisian Constitution Cloud.png") #comment out to save the figure in a PNG format

png showing Tunisian Constitution Cloud as the Tunisian flag


Counting words

However, text analysis goes beyond visualisations. We can count words using the Counter collection. If we are only interested in the most common words, we can use .most_common() with or without an argument specifiying the number of words.

In the code below, we use list comprehension to store words appearing more than 30 times in the constitution and having at least 4 letters.

from collections import Counter
tunisian_words = [x for x in Counter(tunisian_constitution.split()).most_common(50) if x[1]>30 and len(x[0])>3 ]
tunisian_words
[('article', 173),
 ('shall', 173),
 ('assembly', 149),
 ('president', 110),
 ('people', 108),
 ('government', 100),
 ('representatives', 97),
 ('with', 97),
 ('republic', 96),
 ('state', 67),
 ('members', 62),
 ('court', 62),
 ('their', 50),
 ('draft', 48),
 ('head', 47),
 ('constitutional', 45),
 ('that', 43),
 ('laws', 42),
 ('within', 40),
 ('from', 39),
 ('right', 38),
 ('council', 37),
 ('judicial', 37),
 ('authorities', 34),
 ('local', 33),
 ('national', 32),
 ('after', 32),
 ('rights', 31),
 ('constitution', 31)]

Comparing different documents

We can compare the occurences of words in two documents as well. We will use the French constitution for comparison, despite the differences between the governance systems of the two countries as implemented in their respective constitutions.

We read and process the .txt file as we did before. Then, we store the words of each document in a list (tunisian_words and french_words).

french_constitution = open("french constitution.txt").read().replace("\n"," ").lower()
french_constitution = re.sub("[^\w ]", " ", french_constitution)

tunisian_words = tunisian_constitution.split(" ")
french_words = french_constitution.split(" ")

Afterwards, we construct a DataFrame by passing in the Counter collections as data entries, and setting the column labels.
In this example, we chose to remove missing values or drop them by using .dropna() with a 0 argument. That means, we drop rows which contain missing values. That would give us a DataFrame of only the words that exist in both documents!

The result below shows the number of occurrences of each word, the total, and a percentage value indicating the prevalence of that word in the Tunisian constitution with reference to its total use in both documents.

df = pd.DataFrame({
    'Tunisian_constitution': Counter(tunisian_words),
    'French_constitution': Counter(french_words)
}).dropna(0)

df['Total'] = df.Tunisian_constitution + df.French_constitution
df['Tunisian_percentage'] = (df.Tunisian_constitution / df.Total) * 100

df.head(10)

picture showing DataFrame of occurrences and total for each word in both documents


All of these common words between the two documents were used differently in each. We can check how many times they appeared by returning the sum of the values over the desired axis.

df.sum(axis = 0)
Tunisian_constitution    11950.000000
French_constitution      11769.000000
Total                    23719.000000
Tunisian_percentage      40472.171694
dtype: float64

Now, let’s look at words used ten or more times in the Tunisian constitution. We sort them by descending value.
The for loop is used to go through df.index and remove the words having three characters or less.

for ind in df.index:
    if len(ind)<4:
        df.drop(ind, inplace = True)

df[df.Tunisian_constitution >= 10].sort_values(by='Tunisian_constitution', ascending=False)

picture showing DataFrame of occurrences and total for word that appeared at least 10 times in both documents


We can use the df.sum() again to compute the frequency of those words.

df.sum(axis = 0)
Tunisian_constitution     5109.000000
French_constitution       5125.000000
Total                    10234.000000
Tunisian_percentage      34301.487972
dtype: float64

In this notebook (using python 3.7 pandas 1.2.1 and matplotlib 3.3.2), we have learned how to draw a Word Cloud that would be helpful for visualization of any text. Besides, we used Counter to count words in documents. The tool worked well with pandas DataFrames, allowing us to make simple comparisons.

This might have been naive text analysis, but it is an important first step towards a more comprehensive and elaborate text analysis.

Leave a comment