Embedding Words for Natural Language Processing: A Tutorial

Word embedding is a significant technique employed in Natural Language Processing (NLP) to represent words as real-valued vectors. This approach has enabled computers to better comprehend written material, and is considered to be one of the most significant breakthroughs in NLP powered by Deep Learning.

This method of representing words and documents numerically as vectors enables comparable words to have similar vector representations. This collected data is then utilised as an input in a machine learning model designed to process textual data while preserving its grammatical and semantic structure. After this data has been converted, it can be processed by natural language processing algorithms that can easily interpret these learned representations.

Because of the many advantages it provides, ML NLP is quickly becoming a popular option for software engineers.

You should now be ready for a ground-up introduction to word embeddings, their methods, and their uses.

What is word embedding?

Word embedding, also referred to as a word vector, is a technique used to create a numerical representation of textual data. Words with comparable meanings are mapped to similar numerical vectors, so that an approximation of the meaning of a word is represented in a low-dimensional space. Such embeddings can be trained far faster than hand-built models that use graph embeddings, such as WordNet.

Each dimension of an embedding can capture a different characteristic of a word; an embedding with 50 values, for example, can represent 50 distinct features. Pre-trained word embedding models, such as Flair, fastText, SpaCy, and others, are increasingly popular due to their ability to capture meaningful information about words and their context while reducing dimensionality. These models have a wide range of applications in natural language processing, including sentiment analysis and machine translation.

We’ll go into further detail later in the article. First, let’s look at a quick overview and an example.

The issue

Supervised machine learning can be used to identify which tweets are about real disasters and which are not. In this scenario, the tweets (text) are the independent variable, while the binary label (1: Real Disaster, 0: Not real Disaster) is the dependent variable.

Machine Learning and Deep Learning algorithms accept only numerical input. The challenge, therefore, is to convert tweets into numbers that these algorithms can process. Word embedding offers one solution to this problem.

Here’s the fix

Word embedding is a method of natural language processing (NLP) in which words are represented as real-valued vectors in a lower-dimensional space. Each word is assigned a real-valued vector of tens or hundreds of dimensions, allowing the words to be represented in a more efficient, space-saving manner. This technique is invaluable in the field of NLP, allowing for more accurate and comprehensive analysis of language data.

Term frequency-inverse document frequency (TF-IDF)

Term frequency-inverse document frequency (TF-IDF) is a machine learning technique for representing text numerically. It combines two factors: term frequency (TF) and inverse document frequency (IDF). TF quantifies how often a given term appears in a document, while IDF measures how rare that term is across a corpus of documents. Combining the two gives a measure of how important a given term is to a particular document.

This method utilises a statistical measure to determine which words are most pertinent to the text, be it an individual document or a set of documents (a corpus).

The TF score indicates how often a term appears within a single document.

The Inverse Document Frequency (IDF) score measures how rare a term is across the corpus. Whereas Term Frequency (TF) gives greater weight to terms that appear often, IDF gives greater weight to terms that are used rarely across the corpus of documents. Rare terms therefore score higher, reflecting the fact that they carry more information.
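The two scores can be computed by hand on a toy corpus. The sketch below assumes the plain textbook definitions (TF as a share of the document's tokens, IDF as log(N / document frequency)); the corpus and the function names are hypothetical, and libraries often use smoothed variants of these formulas.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already tokenized.
corpus = [
    ["forest", "fire", "near", "the", "town"],
    ["the", "concert", "was", "fire"],
    ["the", "town", "hall", "meeting"],
]

def term_frequency(term, document):
    # Share of the document's tokens that are `term`.
    return Counter(document)[term] / len(document)

def inverse_document_frequency(term, documents):
    # log(N / df): the fewer documents contain the term, the larger the score.
    # Assumes the term occurs in at least one document.
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / containing)

def tf_idf(term, document, documents):
    return term_frequency(term, document) * inverse_document_frequency(term, documents)

print(tf_idf("the", corpus[0], corpus))     # 0.0: "the" occurs in every document
print(tf_idf("forest", corpus[0], corpus))  # positive: "forest" occurs in one document only
```

A term that appears in every document gets an IDF of zero under this definition, which is why common words contribute nothing to the score.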

When it comes to basic text analysis tasks such as keyword extraction, stop word removal, and information retrieval, the TF-IDF algorithm is a widely used and reliable solution. However, it is not capable of accurately capturing the context and sequential semantic meaning of words, which can be of essential importance in certain applications.

TfidfVectorizer’s Output

By splitting the vocabulary into columns and the documents into rows, it is possible to compute the term frequency–inverse document frequency (tf-idf) values for each corresponding cell (i,j). The resulting matrix and the associated target variable can then be utilised to train a machine learning or deep learning algorithm.
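As a rough illustration of such a matrix, the sketch below runs scikit-learn's TfidfVectorizer with its default settings on three made-up documents. The documents are hypothetical stand-ins, not the tweets from the dataset discussed here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the tweets; not the article's dataset.
documents = [
    "forest fire near the town",
    "the concert was fire",
    "the town hall meeting",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)  # sparse matrix: documents x vocabulary

# vocabulary_ maps each term to its column index j.
vocabulary = vectorizer.vocabulary_
print(tfidf_matrix.shape)

# Row 0 holds the weights for the first document.
row = tfidf_matrix.toarray()[0]
print(row[vocabulary["forest"]], row[vocabulary["the"]])
```

Within the first document, "forest" receives a larger weight than "the", because "the" appears in all three documents.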

Let’s talk about two distinct strategies for constructing word embeddings. The practical application will be analysed as well!

Bag of words (BOW)

The “bag of words” approach is a common word embedding technique in which each element of the vector represents the number of times a particular word occurs in a given sentence or document. It is a way of extracting features from text, and is also referred to as “vectorization”.

Here is a basic outline of the steps required to make BOW.

  • The first stage of tokenization is to split the text into individual sentences.
  • The sentences from the first step are then split further into words.
  • Remove commas and other unnecessary punctuation.
  • Convert every word to lowercase.
  • The last step is to count how often each word appears, for example as a frequency distribution chart.
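The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under simplifying assumptions: the sentence splitter is deliberately naive, and the function name `bag_of_words` and the sample text are made up for the example.

```python
import re
import string
from collections import Counter

def bag_of_words(text):
    # 1. Split the text into sentences on ., ! or ? (a deliberately naive rule).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter()
    for sentence in sentences:
        # 2. Split each sentence into words.
        for word in sentence.split():
            # 3. Strip surrounding punctuation, then 4. lowercase the word.
            token = word.strip(string.punctuation).lower()
            if token:
                counts[token] += 1
    # 5. The counter now holds the frequency of each word.
    return counts

counts = bag_of_words("The car caught fire. Sadly, the car was new!")
print(counts.most_common(2))
```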

We will work through BOW with an example in the continuous bag of words section below.


Word2Vec

In 2013, Google developed the Word2Vec technique, a method based on the distributional hypothesis that is now widely used to address complex challenges in Natural Language Processing (NLP). This technique is specifically designed to train word embeddings.

It uses one of two architectures: skip-gram or continuous bag of words (CBOW).

Word2Vec is a “shallow” neural network made up of an input layer, a projection layer, and an output layer. It reconstructs the linguistic context of words by taking into account both the words that precede and the words that follow the current word.

The model learns semantic relationships between words by iterating repeatedly over a text corpus. It operates on the premise that words that appear close to each other in a text have similar meanings. By mapping related words to embedding vectors that lie close to each other, this technique captures the semantic meaning of words.

In order to establish a measure of similarity between two vector representations of words or texts, this approach utilises the cosine similarity metric. The cosine similarity metric is a measure of similarity between two vectors which is calculated by taking the cosine of the angle between them.

  • If the cosine is equal to one (an angle of zero degrees), the word vectors overlap.
  • If the angle is 90 degrees (a cosine of zero), the words have no semantic relationship to one another.

In a nutshell, words with similar meanings receive similar vector representations, and cosine similarity measures how close those representations are.
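Cosine similarity is short enough to write out directly. The vectors below are hypothetical two-dimensional "embeddings" chosen only to show the two cases in the list; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional vectors, chosen only to illustrate the two cases.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction: approximately 1
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # perpendicular: 0.0
```

Because the measure depends only on the angle, scaling a vector does not change its similarity to another vector.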

Word2Vec comes in two flavours.

There are two versions of Word2Vec that make use of neural networks: Continuous Bag of Words (CBOW) and Skip-gram.

1. CBOW: The Continuous Bag of Words (CBOW) variant of the neural network model takes several context words as input and predicts the target word from them. It is a fast and efficient way of obtaining a numerical representation for frequently used words. To understand what context and current word mean in CBOW, consider the following.

In the Continuous Bag of Words (CBOW) model, the size of the window is specified. The word in the centre of this window is the current word, with the words on either side providing contextual information. This allows the model to make predictions about the current word based on the surrounding context. The CBOW neural network is fed a sequence of words which have been encoded with One Hot Encoding, using the defined vocabulary.

The hidden layer is a standard fully connected (dense) layer. Its output is passed to the output layer, which computes the probability of each word in the vocabulary being the target word.
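To make the window concrete, the sketch below builds (context, target) examples from one sentence and one-hot encodes the context words. The sentence, the window size of 1, and the helper names are assumptions made for illustration.

```python
def one_hot(index, size):
    # Vector of zeros with a single 1 at the word's vocabulary index.
    vector = [0] * size
    vector[index] = 1
    return vector

def cbow_examples(tokens, window):
    # Yield (context words, target word) for every position in the sentence.
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        yield context, target

tokens = "the car caught fire".split()
vocabulary = sorted(set(tokens))  # ['car', 'caught', 'fire', 'the']
word_to_index = {word: i for i, word in enumerate(vocabulary)}

for context, target in cbow_examples(tokens, window=1):
    encoded = [one_hot(word_to_index[w], len(vocabulary)) for w in context]
    print(context, "->", target, encoded)
```

Words at the edges of the sentence simply get a shorter context, which is one of several possible ways to handle boundaries.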
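A forward pass through such a network can be sketched in plain Python. This is an untrained toy under stated assumptions: the weights are random, the embedding size of 3 is arbitrary, and averaging the context vectors is one common choice. It is not the reference Word2Vec implementation.

```python
import math
import random

random.seed(0)
vocabulary = ["car", "caught", "fire", "the"]
embedding_dim = 3  # hypothetical size, kept tiny for readability

# Input-to-hidden weights (one row per word) and hidden-to-output weights.
W_in = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocabulary]
W_out = [[random.uniform(-1, 1) for _ in vocabulary] for _ in range(embedding_dim)]

def cbow_forward(context_indices):
    # Multiplying a one-hot vector by W_in selects a row, so the hidden layer
    # is the average of the rows belonging to the context words.
    hidden = [sum(W_in[i][d] for i in context_indices) / len(context_indices)
              for d in range(embedding_dim)]
    scores = [sum(hidden[d] * W_out[d][j] for d in range(embedding_dim))
              for j in range(len(vocabulary))]
    # Softmax turns the scores into a probability for each vocabulary word.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = cbow_forward([3, 1])  # context: "the", "caught"
print(dict(zip(vocabulary, probabilities)))
```

Training would adjust `W_in` and `W_out` so that the true target word receives a high probability; the rows of `W_in` are what end up being used as word vectors.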

After introducing the bag of words (BOW), also known as a vectorizer, we will now use an example to illustrate this concept.

Take these four tweets as an example of BOW in action:

(With much sadness) “gentle truth”

the “swear jam” lit the globe on fire,

swear real automobile crash,

Unfortunately, the automobile caught fire.

Scikit-Learn’s CountVectorizer was employed to construct a Bag of Words (BOW); this method tokenizes a collection of text documents, creates a list of distinct words, and then encodes new text using this vocabulary.

CountVectorizer’s Output

In this example, each of the four documents is represented by a separate row. The columns represent the vocabulary terms found across the four documents, and each value is the number of times that term appears in the corresponding document.
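The same kind of table can be produced with a few lines of scikit-learn. The four documents below are hypothetical stand-ins rather than the tweets above, and CountVectorizer is used with its default settings, which lowercase the text and drop single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents standing in for the tweets.
documents = [
    "the car caught fire and the car burned",
    "a real car crash",
    "the world is on fire",
    "sad but true",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents).toarray()  # rows: documents, columns: terms

vocabulary = vectorizer.vocabulary_  # term -> column index
print(counts.shape)
print(counts[0][vocabulary["car"]])  # occurrences of "car" in the first document

# New text is encoded with the vocabulary learned above; unseen words are ignored.
new_row = vectorizer.transform(["the car exploded"]).toarray()[0]
print(new_row[vocabulary["car"]], new_row.sum())
```

Most cells in each row are zero, which is the sparsity discussed later in the article.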

It is possible to use the CountVectorizer on all 11,370 documents in the training dataset to create a matrix that will be useful when training a machine learning or deep learning model in combination with the associated target variable.

2. Skip-gram: Unlike Continuous Bag of Words (CBOW), Skip-gram does not predict the current word from its context. Instead, the current word is fed as input to a log-linear classifier with a continuous projection layer, which predicts the words that precede and follow the current word.

This variation only requires a single word as an input in order to make accurate estimations about the associated words within a particular context. Consequently, even rare words can be accurately represented.
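The training examples for this variant are (current word, neighbouring word) pairs. The sketch below generates them for one sentence; the sentence, the window size, and the function name are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window):
    # For each current word, emit one (input, output) pair per neighbour
    # within `window` positions before or after it.
    pairs = []
    for i, current in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((current, tokens[j]))
    return pairs

print(skipgram_pairs("the car caught fire".split(), window=1))
```

Each occurrence of a word yields several training pairs of its own, which fits the point above about rare words.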

The ultimate goal of Word2Vec (in both its forms) is to learn the weights of the hidden layer, which are then used as the word embeddings.

Problems with TF-IDF and bags of words

Let’s talk about the difficulties we’ve encountered so far with the two text vectorization methods we’ve covered.

The size of the Bag of Words (BOW) vocabulary dictates the length of the vector representation for each document. When most of the values in the vector are 0, the resulting BOW is sparse. Sparse representations are harder to model, both for computational reasons and because each vector carries little information.

Moreover, Bag-of-Words (BOW) does not consider the order of words, nor does it capture meaningful relationships between words. An additional difficulty in utilising this representation is that it does not take into account the context of words.

  • Large number of weights: Larger input vectors mean more weights for the neural network to learn.
  • No meaningful relationships or regard for word order: Words in a text or phrase are thrown into a bag regardless of their context.
  • Computationally intensive: More weights mean more processing time for training and prediction.

The Term Frequency-Inverse Document Frequency (TF-IDF) model improves on the Bag of Words (BOW) model by conveying the relative importance of words: words that occur across many documents in the corpus receive lower weights, and rarer words receive higher ones. However, it still does not capture semantic similarity between words, and it fails to address the issues of high dimensionality and sparsity present in the BOW model.

Global Vectors for Word Representation (GloVe)

Pennington et al. developed the GloVe technique of word embedding for Natural Language Processing (NLP) at Stanford University. The method is called “Global Vectors” because the model directly captures the statistics of the entire corpus. GloVe has demonstrated strong results on word analogy and named entity recognition tasks.

GloVe trains its model with a simpler least-squares cost (error) function, producing distinctive and improved word embeddings at reduced computational cost. It combines the strengths of the two main families of methods for learning low-dimensional word representations: local context window methods, such as the skip-gram model of Mikolov, and global matrix factorization methods.

Latent Semantic Analysis (LSA) is a global matrix factorization technique. It makes efficient use of statistical information but performs poorly on word analogy tasks, which indicates a sub-optimal vector space structure.

The skip-gram approach, by contrast, performs particularly well on the analogy task. However, because it is not trained on global co-occurrence counts, it does not make the most effective use of the statistics of the corpus.

Consequently, GloVe has an advantage over Word2Vec because it utilises global context in the construction of its word embeddings rather than only local context. GloVe learns the meaning of words from a co-occurrence matrix.
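For reference, the least-squares cost can be written out as it appears in the GloVe paper. The notation is the paper's rather than this article's: X_ij is the number of times word j occurs in the context of word i, w_i and w̃_j are the word and context vectors, b_i and b̃_j are bias terms, V is the vocabulary size, and f is a weighting function that limits the influence of very frequent pairs.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2},
\qquad
f(x) =
\begin{cases}
(x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
1 & \text{otherwise.}
\end{cases}
```

The paper reports using x_max = 100 and α = 3/4.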

Take these two examples:

I consider myself to be a data science aficionado.

I am in the market for a data scientist position.

For the aforementioned sentences, the GloVe co-occurrence matrix might look like this:

Each value in this co-occurrence matrix is the number of times a pair of words appears together. The matrix is built from global word-word co-occurrence counts within a fixed context window. If there are 1,000,000 unique words in a text corpus, the co-occurrence matrix will be of size 1,000,000 by 1,000,000. The fundamental principle behind GloVe is that the model can “learn” the word representation based on the co-occurrence of the words in the data.
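A co-occurrence count for the two example sentences can be sketched as follows. The context window of one word on each side is an assumption made for this illustration, as is the function name; the result is stored as nested dictionaries rather than a full matrix.

```python
from collections import defaultdict

sentences = [
    "I consider myself to be a data science aficionado",
    "I am in the market for a data scientist position",
]

def cooccurrence_counts(sentences, window=1):
    # counts[a][b] = number of times b appears within `window` words of a.
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

counts = cooccurrence_counts(sentences)
print(counts["data"]["a"])        # "a data" occurs in both sentences
print(counts["data"]["science"])  # "data science" occurs only in the first
```

Storing only the non-zero pairs matters in practice, since a dense matrix over a large vocabulary would be mostly zeros.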

In order to gain an even deeper comprehension of the co-occurrence probability ratios utilised in the GloVe model, let us take a look at an example from the Stanford research. Consider the likelihood of the words ‘ice’ and ‘steam’ appearing in conjunction with other words within a 6 billion word corpus. The actual probability, as determined by the study, is as follows:

Let k be a probe word, and let Pik/Pjk be the ratio of the probability that k co-occurs with word i (here, ice) to the probability that k co-occurs with word j (here, steam). We expect this ratio to be high for words related to ice but not to steam, such as ‘solid’, and low for words related to steam but not to ice, such as ‘gas’. We expect the ratio to be close to one for words such as water or fashion, which are related to both ice and steam or to neither.

The utilisation of ratios of co-occurrence probabilities instead of the raw probabilities themselves has been shown to be more effective in distinguishing useful terms such as “solid” and “gas” from those which are less meaningful, such as “fashion” and “water”. This has enhanced the ability to separate words. As a result, the GloVe method of word vector learning commences with ratios of co-occurrence probabilities, rather than the probabilities themselves.
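The behaviour of the ratio can be reproduced with a few invented numbers. The counts below are hypothetical and chosen only to show the pattern; they are not the figures from the Stanford study.

```python
# Hypothetical co-occurrence counts, invented for illustration only;
# they are not the figures reported for the 6 billion word corpus.
cooccurrence = {
    "ice":   {"solid": 80, "gas": 4,  "water": 300, "fashion": 2},
    "steam": {"solid": 5,  "gas": 90, "water": 280, "fashion": 2},
}
totals = {"ice": 10_000, "steam": 10_000}  # total co-occurrences of each word

def probability(word, probe):
    # P(probe | word): share of the word's co-occurrences that involve `probe`.
    return cooccurrence[word][probe] / totals[word]

for probe in ["solid", "gas", "water", "fashion"]:
    ratio = probability("ice", probe) / probability("steam", probe)
    print(f"{probe:8s} ratio = {ratio:.2f}")
```

With these numbers the ratio is far above one for "solid", far below one for "gas", and close to one for "water" and "fashion", even though the raw probabilities for "water" are the largest of all.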

BERT (Bidirectional Encoder Representations from Transformers)

BERT is a natural language processing (NLP) model built on the transformer, an attention-based architecture. BERT-Base consists of 110 million parameters, whereas BERT-Large has 340 million.

The attention mechanism used by this model produces word embeddings that are highly contextualised. During training, the embeddings are passed as input from one BERT layer to the next, and each layer’s attention mechanism relates every word to the words on its left and right.

This method is more sophisticated than the previous ones and produces better word embeddings, largely because the model is pre-trained on a large corpus that includes data from Wikipedia. The embeddings can then be fine-tuned on task-specific datasets to improve results further.

It is very useful for translating across languages.


Conclusion

Recent deep learning models, such as Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), and Transformers, have demonstrated their utility in Natural Language Processing (NLP) tasks, including sentiment classification, named entity recognition, and speech recognition. Word embeddings can be used to train these models.

Here is a summary checklist of when to use each technique.

  • Bag of words: extracts features from the text
  • TF-IDF: information retrieval and keyword extraction
  • Word2Vec: semantic analysis tasks
  • GloVe: word analogy and named entity recognition tasks
  • BERT: language translation and question answering

In this article, we conducted a comparative analysis of two widely-used Natural Language Processing (NLP) vectorization techniques: Bag of Words (BoW) and Term Frequency–Inverse Document Frequency (TF-IDF). By doing so, we highlighted the shortcomings of each method and how they can be overcome through word-embedding methods such as GloVe and Word2Vec, which rely on dimensionality reduction and contextual similarity. Having discussed the potential of word embeddings, it is now evident that these can be of great use in daily life.
