Embedding Words for Natural Language Processing: A Tutorial

Natural Language Processing (NLP) utilises word embedding as a crucial technique to convert words into real-valued vectors. This innovative approach has resulted in enhanced comprehension of written content by machines, marking it as a game-changing development in the field of NLP that has been powered by Deep Learning.

Word and document representation through numerical vectors allows for similar words to have comparable vector representations. This data is then incorporated as input for a machine learning model designed to process text data while preserving its grammatical and semantic structure. Once the data has been transformed, it can be effortlessly processed by natural language processing algorithms, which accurately interpret these learned representations.

The numerous benefits of word embedding have contributed to the increasing popularity of machine-learning-based NLP among software engineers.

By the end of this tutorial, you should have a basic understanding of word embeddings, their techniques, and their applications.

Definition of Word Embedding:

Word embedding, which produces what is popularly known as a word vector, is a technique that transforms textual data into a numerical representation. Similar words are assigned similar vectors, so that each vector approximates the meaning of its word in a lower-dimensional space. Compared to manually constructed resources like WordNet, these representations are much faster to produce.

A word embedding with around 50 dimensions can represent roughly 50 distinct characteristics of a word, which makes embeddings a powerful tool. Due to their efficiency in capturing vital information and dimensionality reduction, pre-trained Word Embedding models like Flair, fastText, SpaCy, and others are becoming increasingly popular. These models have a wide range of applications in natural language processing, including machine translation and sentiment analysis.

In this article, we will delve further into the topic. Let’s begin with a brief overview and an example to help you understand better.

The Problem:

Supervised machine learning can be used to distinguish tweets about real disasters from tweets that are not. In this context, the text of each tweet is the independent variable, while a binary label (1: Real Disaster, 0: Not Real Disaster) is the dependent variable.

Machine Learning and Deep Learning algorithms accept only numeric inputs. Hence, the task at hand is to convert tweets into numerical data that these algorithms can process. Word embedding is one way to address this challenge.

The Solution:

In natural language processing (NLP), word embedding is a technique that assigns real-valued vectors to words in a reduced-dimensional space. Because each word is represented by a dense vector of only tens or hundreds of dimensions, this approach uses space efficiently. Applications in the field of NLP benefit significantly from this technique, leading to more precise and comprehensive language data analysis.

TF-IDF: Term Frequency-Inverse Document Frequency

One way to represent text numerically for machine learning is the term frequency-inverse document frequency (TF-IDF) technique. It combines two factors: term frequency (TF) and inverse document frequency (IDF). TF measures how often a given term occurs in a document, whereas IDF measures how rare that term is across a collection of documents. Combining the two gives a measure of the significance of a particular term within a document.

By using a statistical measure, this technique is designed to identify which words are most relevant to the text, whether it is an individual document or a group of documents (a corpus).

The term frequency (TF) score measures how frequently a specific term appears within a document.

The Inverse Document Frequency (IDF) score reflects how rare a term is across the corpus of documents. Unlike Term Frequency (TF), which favours commonly used terms, the IDF score prioritises terms that appear in few documents. Therefore, when it comes to information retrieval, terms with a high IDF score are given a higher weighting.
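The two scores can be written down compactly. The formulation below is one common textbook variant, given here as an assumption rather than as the exact formula any particular library uses: f_{t,d} is the number of times term t occurs in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Under this variant, a term that occurs in every document has idf(t) = log 1 = 0, so it contributes nothing to the score.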

The TF-IDF algorithm is a widely accepted and dependable solution for basic text analysis tasks, such as keyword extraction, stop word removal and information retrieval. However, it has its limitations in capturing the context and sequential semantic meaning of words accurately, which is crucial for particular applications.

The Output of TfidfVectorizer

Dividing the vocabulary into columns and the documents into rows enables the calculation of term frequency–inverse document frequency (tf-idf) values for each respective cell (i,j). The resulting matrix and associated target variable can be used to train a machine learning or deep learning algorithm.
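To make the matrix layout concrete, here is a minimal hand-rolled sketch in plain Python. It is not scikit-learn's implementation: the function name `tfidf_matrix`, the three toy documents, whitespace tokenization, and the unsmoothed idf = log(N / df) are all assumptions made for illustration. scikit-learn's TfidfVectorizer applies idf smoothing and L2 normalisation by default, so its numbers will differ.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the tf-idf of term j in document i.

    Assumes every document contains at least one word.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(tok for doc in tokenized for tok in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = []
        for term in vocab:
            tf = counts[term] / len(doc)        # share of the document taken by this term
            idf = math.log(n_docs / df[term])   # 0 when the term occurs in every document
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

# Hypothetical toy corpus, not taken from the tweet dataset.
docs = ["fire in the forest", "the car caught fire", "the weather is nice"]
vocab, matrix = tfidf_matrix(docs)
```

Each row of `matrix` corresponds to one document and each column to one vocabulary term. Because "the" occurs in all three documents, its column is zero everywhere, while a term such as "forest", which occurs in only one document, gets a positive weight there.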

We will now explore two different strategies for constructing word embeddings and examine their practical applications.

Bag of Words (BOW)

Bag of words (BOW) is a widely used technique in which each vector element denotes the number of times a vocabulary word occurs in a particular sentence or document. This method extracts features from the text, and it is also referred to as “vectorization”.

Below is an elementary overview of the necessary steps to create a BOW:

  • The first step is tokenization: separate the text into distinct sentences.
  • Then split each tokenized sentence into words.
  • Eliminate any unneeded punctuation, such as commas.
  • Ensure that all words are converted to lowercase.
  • The final stage is to count the frequency of each term.
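The steps above can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: the helper name `bag_of_words` is hypothetical, sentences are split naively on ".", "!" and "?", and the sample text is made up.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count word frequencies after sentence splitting, cleaning and lowercasing."""
    # Split the text into sentences (a deliberately naive rule).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter()
    for sentence in sentences:
        # Lowercase the sentence and strip punctuation such as commas.
        cleaned = re.sub(r"[^\w\s]", "", sentence.lower())
        # Split the sentence into words and add them to the running counts.
        counts.update(cleaned.split())
    return counts

# Hypothetical example text.
text = "The car caught fire. Sadly, the car was new!"
counts = bag_of_words(text)
```

Here `counts["car"]` and `counts["the"]` are both 2, and "Sadly," is counted as the lowercase word "sadly" with its comma removed.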

An example of BOW in practice, built with Scikit-Learn’s CountVectorizer, appears later in this article.

Word to Vector (Word2Vec)

The Word2Vec technique was devised by Google in 2013, and it is a distributional hypothesis-driven approach that is extensively employed to tackle intricate issues in Natural Language Processing (NLP). This method is explicitly created to train word embeddings.

Word2Vec is implemented using one of two architectures: skip-gram or continuous bag of words (CBOW).

Both are “shallow” neural networks built from an input layer, a projection layer, and an output layer. The network reconstructs the linguistic context of words by considering both the words that precede and the words that follow them.

Repetitive iterations are used to identify semantic connections among words in a text corpus. This approach is based on the assumption that words in text that are in proximity to one another have similar characteristics. By converting related words into embedding vectors, which are located close to one another, this technique aids in a better understanding of the semantic meanings of words.

To determine the similarity between two vector representations of words or texts, this technique employs the cosine similarity measure. The cosine similarity value is derived by calculating the cosine of the angle between two vectors.

  • If the cosine of the angle is one, the vectors point in the same direction and the words overlap in meaning.
  • When the angle is exactly 90 degrees (a cosine of zero), the words have no semantic association with each other.

To summarise, the closer the cosine similarity is to one, the more similar the two vector representations are.
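As a small illustration of the measure, cosine similarity can be computed directly from its definition: the dot product of the two vectors divided by the product of their lengths. The function name and the toy two-dimensional vectors below are assumptions for the example; real word vectors have far more dimensions, and the sketch assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])

# Vectors at a 90 degree angle: similarity of 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
```

Note that the measure depends only on direction, not on length: doubling a vector leaves its cosine similarity to any other vector unchanged.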

Word2Vec comes in two variants, both based on neural networks: Continuous Bag of Words (CBOW) and Skip-gram.

1. CBOW: The Continuous Bag of Words (CBOW) model takes several context words as input and predicts a target word from them. It is an efficient method that yields good numerical representations for frequently used words. To grasp the idea of the context and the current word in CBOW, consider the following.

In the CBOW model, we first define the size of a window. The word in the center of the window is the current word, while the words on each side provide the context. The model predicts the current word from this surrounding context. The CBOW neural network receives the context words encoded with One Hot Encoding over a predetermined vocabulary.

The hidden layer is a standard fully connected (dense) layer. Its output is passed to the output layer, which computes the probabilities of the target word over the vocabulary.
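To make the window idea concrete, the following sketch builds the (context, target) training pairs that a CBOW model would see, along with a one-hot encoder. It covers only the data preparation, not the neural network itself, and the function names, window size and sample sentence are assumptions chosen for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before the target
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after the target
        pairs.append((left + right, target))
    return pairs

def one_hot(word, vocab):
    """Encode a word as a vector with a single 1 at the word's vocabulary index."""
    return [1 if w == word else 0 for w in vocab]

# Hypothetical example sentence.
tokens = "the car caught fire".split()
pairs = cbow_pairs(tokens, window=1)
vocab = sorted(set(tokens))
fire_vector = one_hot("fire", vocab)
```

With a window of 1, the pair for the target "car" has the context ["the", "caught"]. Words at the edges of the sentence simply get a shorter context.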

Returning to Bag of Words (BOW), also referred to as vectorization, we will now illustrate the concept with an example.

Let us consider these four tweets as an illustration of BOW in action:

(Filled with sadness) “kind truth”

the “profanity mix” created a worldwide uproar,

authentic car wreck with expletives,

Regrettably, the car ignited.

To build a Bag of Words (BOW), Scikit-Learn’s CountVectorizer was utilised. This approach divides a set of text documents into tokens, generates a list of unique words, and thereafter encodes new text using this vocabulary.

Output of CountVectorizer

In this instance, each of the four documents is depicted in a distinct row. The columns in the table correspond to the unique vocabulary terms found across all four documents, and the figures denote the number of times each vocabulary term appears in the corresponding document.

The CountVectorizer can be applied to all 11,370 documents in the training dataset to produce a matrix that can be valuable while training a machine learning or deep learning model with the corresponding target variable.
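The fit-then-encode behaviour described above can be imitated in plain Python. The sketch below is not scikit-learn's CountVectorizer: the helper names `build_vocabulary` and `encode`, the whitespace tokenization and the toy documents are all assumptions for illustration.

```python
def build_vocabulary(docs):
    """Collect the sorted list of unique lowercase words across all documents."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def encode(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    index = {term: j for j, term in enumerate(vocab)}
    row = [0] * len(vocab)
    for tok in doc.lower().split():
        if tok in index:  # words outside the vocabulary are ignored
            row[index[tok]] += 1
    return row

# Hypothetical toy documents, one row per document in the resulting matrix.
docs = ["car fire", "car crash car", "sunny day"]
vocab = build_vocabulary(docs)
matrix = [encode(doc, vocab) for doc in docs]

# New text is encoded with the vocabulary learned above.
new_row = encode("car on fire", vocab)
```

The vocabulary here is ["car", "crash", "day", "fire", "sunny"], so the second document becomes [2, 1, 0, 0, 0]. The word "on" in the new text is dropped because it was never seen when the vocabulary was built.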

2. Skip-gram: Unlike Continuous Bag of Words (CBOW), Skip-gram does not predict the current word from its context. Instead, it feeds the current word into a log-linear classifier with a continuous projection layer, and predicts the words within a certain range before and after the current word.

This variant needs only one word as input to predict the related words within a given context window. As a result, it can represent even infrequent words well.
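The direction of prediction is easiest to see in the training pairs. The following sketch generates (current word, context word) pairs for Skip-gram. As with any such illustration, the function name, window size and sample sentence are assumptions, and the classifier that would be trained on these pairs is left out.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center_word, context_word) pairs within the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example sentence.
tokens = "the car caught fire".split()
pairs = skipgram_pairs(tokens, window=1)
```

Each pair has a single input word and a single word to predict, for example ("car", "the") and ("car", "caught"). This is the reverse of CBOW, where several context words are combined to predict one target.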

The ultimate objective of Word2Vec (in both its forms) is to learn the hidden layer weights, which are then used as the word embeddings.

Issues with TF-IDF and Bags of Words

Now, let us discuss the problems we have confronted thus far with the two text vectorization approaches we have examined.

The size of the Bag of Words (BOW) vocabulary determines the length of the vector used to represent each document. A BOW vector whose values are mostly zero is said to be sparse. Sparse representations are more challenging to model due to computational constraints and limited data availability.

Also, Bag-of-Words (BOW) fails to consider the word order and does not capture significant word relationships. Another challenge in employing this representation is its inability to consider the contextual meaning of words.

  • Weight Overload:

    While training a neural network, longer input vectors mean more weights to learn.
  • Lack of Significant Connections or Word Order Importance:

    Words are simply combined in a bag, so their order and context are disregarded.
  • Computational Demands:

    The additional weights necessitate more processing time for training and prediction.

The Term Frequency-Inverse Document Frequency (TF-IDF) model improves on the Bag of Words (BOW) model by offering insight into the relative significance of words: it assigns lower weights to words that occur in many documents of the corpus and higher weights to words that are distinctive to a document. Nonetheless, like BOW, it disregards semantic similarities between words, and it does not tackle the challenges of high dimensionality and sparsity.

Word Representation Using Global Vectors (GloVe)

Pennington et al. developed the GloVe method of word embedding for Natural Language Processing (NLP) while they were at Stanford University. This approach is known as “Global Vectors” because it makes direct use of the global statistics of the corpus. GloVe has shown strong results on word analogy and named entity recognition tasks.

This method trains a model with a simple least-squares cost function, producing improved word embeddings with fewer computational resources. It combines the strengths of two families of techniques: local context window methods, like Mikolov’s skip-gram model, and global matrix factorization methods.
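For readers who want to see the cost function itself, the weighted least-squares objective given in the GloVe paper has the form below. The notation is the paper's rather than this article's: X_{ij} is the number of times word j occurs in the context of word i, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, V is the vocabulary size, and f is a weighting function that limits the influence of very frequent word pairs.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

In words, the model tries to make the dot product of two word vectors (plus biases) match the logarithm of how often the two words occur together.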

Latent Semantic Analysis (LSA), a global matrix factorization technique, makes efficient use of statistical information but performs relatively poorly on word analogy tasks, which suggests a sub-optimal vector space structure.

The skip-gram model, by comparison, performs well on analogy tasks. Nonetheless, because it trains on separate local context windows rather than on global co-occurrence counts, it does not make optimal use of the statistics available in the corpus.

Therefore, GloVe has an edge over Word2Vec because it uses global co-occurrence statistics in addition to local context when creating word embeddings. A co-occurrence matrix is used to capture the meaning of words in the GloVe embeddings.

Consider these two instances:

In my opinion, I am a data science enthusiast.

I am seeking a position as a data scientist.

For the above-mentioned sentences, the GloVe co-occurrence matrix could appear like this:

This co-occurrence matrix tells us how often each pair of words (one from the row, one from the column) appears together. It’s worth noting that the matrix is built from global counts: the total number of times each pair of words appears together within a given context window across the whole corpus. If there are 1,000,000 unique words in a text corpus, the co-occurrence matrix will be 1,000,000 by 1,000,000. GloVe’s fundamental concept is that the model can “learn” word representations by analyzing their co-occurrence in the data.
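A co-occurrence count of this kind can be sketched in plain Python. The example below uses lightly simplified, lowercased versions of the two sentences above with punctuation removed, and assumes a context window of one word on each side. The function name and the window size are illustrative choices, not details taken from GloVe itself.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=1):
    """Count how often each ordered pair of words appears within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

# Simplified versions of the two example sentences (an assumption for this sketch).
sentences = [
    "i am a data science enthusiast",
    "i am seeking a position as a data scientist",
]
counts = cooccurrence_counts(sentences)
```

Here `counts[("i", "am")]` is 2 because the pair is adjacent in both sentences, while `counts[("data", "science")]` is 1. The counts are symmetric, so the pair ("am", "i") also has a count of 2. Storing only the pairs that actually occur, as this dictionary does, avoids materialising the full vocabulary-by-vocabulary matrix.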

To gain a more profound understanding of the co-occurrence probability ratios utilized in the GloVe model, let us examine an example from the Stanford study. Let’s look at the likelihood of the words ‘ice’ and ‘steam’ appearing in combination with other words within a 6 billion-word corpus. The actual probability, as determined by the study, is as follows:

Let ‘k’ be a probe word, and let Pik and Pjk be the probabilities that k appears in the context of ice (i) and steam (j) respectively. We expect the Pik/Pjk ratio for probe words like ‘solid’ (related to ice but not steam) to be substantially greater than one. By contrast, the ratio for probe words like ‘gas’ (related to steam but not ice) is expected to be small. For words like water or fashion, which are related to both ice and steam or to neither, we anticipate the ratio to be roughly one.

Using ratios of co-occurrence probabilities instead of the actual probabilities themselves is more effective in differentiating useful terms like “solid” and “gas” from less meaningful ones, like “fashion” and “water.” This enhances word separation capabilities. Consequently, the GloVe word vector learning approach starts with ratios of co-occurrence probabilities rather than the probabilities themselves.
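The ratio can be computed from co-occurrence counts in a few lines. The counts below are hypothetical toy numbers invented for this sketch, not the figures from the Stanford paper, and the function names are likewise illustrative.

```python
def cooccurrence_probability(counts, word, context):
    """P(context | word): the word's count with `context` divided by its row total."""
    row = counts[word]
    return row[context] / sum(row.values())

def probability_ratio(counts, i, j, k):
    """Ratio P_ik / P_jk for probe word k and the two words i and j."""
    return cooccurrence_probability(counts, i, k) / cooccurrence_probability(counts, j, k)

# Hypothetical toy counts, NOT the values reported in the GloVe paper.
counts = {
    "ice":   {"solid": 8, "gas": 1, "water": 10, "fashion": 1},
    "steam": {"solid": 1, "gas": 8, "water": 10, "fashion": 1},
}

ratios = {k: probability_ratio(counts, "ice", "steam", k)
          for k in ["solid", "gas", "water", "fashion"]}
```

With these toy numbers the ratio is well above one for "solid", well below one for "gas", and equal to one for both "water" and "fashion", which mirrors the pattern described above.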

BERT (Bidirectional Encoder Representations from Transformers)

BERT is an NLP model built on transformers, a neural network architecture based on attention. BERT-Base has 110 million parameters, while BERT-Large has 340 million parameters.

The attention mechanism used in this model enables the creation of word embeddings that are highly context-dependent. The embedding produced by one BERT layer is fed as input to the next, allowing that layer’s attention mechanism to capture the association between each word and the words to both its left and right.

This approach is more advanced than previous ones because it produces more precise word embeddings. This is primarily due to the use of a large word corpus and a pre-trained model that employs data from Wikipedia. To make this method more effective for task-specific datasets, the embeddings can be fine-tuned.

It is highly beneficial for cross-lingual translation.


Recent deep learning models, like Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Transformers, have proved useful in Natural Language Processing (NLP) tasks, including sentiment classification, named entity recognition, and speech recognition. These models can be trained using word embeddings as their input.

After much consideration, I have compiled this summary checklist.

  • Bag of words: extracts features from the text
  • TF-IDF: keyword extraction and information retrieval
  • Word2Vec: semantic analysis tasks
  • GloVe: word analogy and named entity recognition tasks
  • BERT: question-answering and translation

In this post, we performed a comparative evaluation of two commonly-used Natural Language Processing (NLP) vectorization techniques: Bag of Words (BoW) and Term Frequency–Inverse Document Frequency (TF-IDF). In doing so, we pointed out the limitations of each approach and how they can be addressed through word-embedding methods such as GloVe and Word2Vec, which employ dimensionality reduction and contextual similarity. Following a discussion of the potential of word embeddings, it is apparent that they can have tremendous practical value in everyday life.
