A survival guide to Natural Language Processing and ChatGPT

Have you noticed that the suggestions to finish your sentences when you type something on Google are getting better and better? Sometimes they’re so good that you wonder whether your phone’s microphone is being tapped! And since ChatGPT came out, it looks more than ever like magic. For my PhD, I investigated how these tools are being applied and I’d like to share five lessons that are crucial to understand the basics of the technology responsible for these developments.

Lesson #1: You’re more predictable than you think

When I started my research, I was expecting to confirm all the tin-foil theories. But the answer I found was much simpler: you’re just more predictable than you would like to think. Ever since the time of Markov, people have known that there are certain regularities in the distribution of characters and words across a “corpus” (that is, a large collection of texts). Indeed, Markov chains were first applied to predict the next words in Russian poetry – in 1913! One of the key insights is that the chance that you will use certain words probably won’t differ much from that of your peers, so it’s just a matter of hard work to find those patterns. And the field in charge of predicting what you are going to say next is known as “Natural Language Processing” (NLP).
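To make this concrete, here is a minimal sketch in Python of a bigram Markov model – the kind of word statistics Markov pioneered. The toy corpus and function names are made up for illustration; real predictive keyboards use far larger corpora and more sophisticated models, but the core idea of counting which word tends to follow which is the same.

```python
from collections import defaultdict

# Toy corpus: count which word follows which (a bigram Markov model).
corpus = "the cat sat on the mat and the cat slept on the sofa".split()

follow_counts = defaultdict(lambda: defaultdict(int))
for current_word, next_word in zip(corpus, corpus[1:]):
    follow_counts[current_word][next_word] += 1

def predict_next(word):
    """Return the word that most often followed `word` in the corpus."""
    candidates = follow_counts[word]
    return max(candidates, key=candidates.get) if candidates else None

print(predict_next("the"))   # 'cat' – it followed "the" most often
print(predict_next("cat"))   # 'sat' (ties are broken by whichever came first)
```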

Lesson #2: Natural Language Processing is extremely valuable

But the most exciting thing I found was that completing sentences and predicting next words are only two of the many things this field is concerned with and capable of! A better definition of the goal of the whole enterprise would be “to analyze and interpret fragments of written text”. If you have a lot of written data, it is extremely valuable to aggregate it in some useful form. For a bank in particular, this has a number of applications: automatic extraction of financials, faster decision-making on which clients to accept, crawling the web for relevant news about your clients, and efficient assessment of client feedback.

Lesson #3: Meanings are in the eyes of the reader

How in the world are we supposed to assign any meaning to a word, a phrase, or a sentence, in a way that a computer can understand? Could a computer understand this very sentence, like you do? I invite you to pause here for a moment and think about what the meaning of this sentence could be.

Well, the truth is that there is more than one answer to this question. It often boils down to the sense in which you want to extract meaning, or what exactly about a particular written text you are trying to understand. These goals are organized into tasks, and different tasks, with their own specific challenges, may call for different definitions of meaning.

Examples of tasks are:

  • Translation (correctly concluding that “challenge” in English is “desafio” in Portuguese)
  • Named entity recognition (correctly assigning the tag of “country” to Barbados)
  • Summarization (creating a sentence that summarizes an entire paragraph)
  • Clustering (aggregating words according to some common feature, like “animals”)
  • Topic modelling (extracting the most relevant words used in a large body of text)
  • And many more, including the up-and-coming-robots-are-conscious task of text generation, the basis for ChatGPT.

Lesson #4: Meanings are quantified by vectors

Let’s look at one of the simplest and oldest tasks, whose solution happens to provide the starting point for tackling the toughest ones: quantifying the similarity between two words. Introduced by Salton in the 80’s, the goal was to effectively match a natural language query with the correct database item. So, for example, if I type “yellow fruit”, I want to get “banana” with a high probability. To achieve that, I need the computer to recognize that these two items are very similar.

The baseline idea uses the distributional principle: “You shall know a word by the company it keeps.” That is, similar words or expressions are used in similar contexts. This principle leads to an ingenious solution: if I count how many times every other word appears next to my target word – say, in a window of two words to the right and two words to the left – and store these counts as the entries of a vector, then I have a vector that quantifies the contexts of my word. And if a word is defined by its context, then I have a vector representing my word!

Image 1 shows the distributional principle in natural language processing. Each word is represented by a vector describing how many times other words appear next to it. In this case, "political" appears more often close to "democracy" than to "lion", while "dangerous" appears more often close to "lion" than to "democracy". The words "dangerous" and "political" are part of the context in which the other words appear.
Image 1. Distributional principle.

For example, as in the figure above, if “dangerous” is a word that appears very often next to “lion” and “dragon”, I will count how many times it appears next to them, but also next to “democracy” and “political”. Then I can store the corresponding count in the vector entry corresponding to “dangerous”, and I can normalize with respect to the number of possible words. Naturally, the count will be higher for “dragon” and “lion”, but it gives a basis for comparison. By doing this for all possible context words, I can build vectors that represent words as a function of their contexts. In this way, if two words share the same contexts, their context vectors will be similar – hence, by measuring the angle between the two vectors (aka “cosine similarity”), I can assess how similar the words are! This is how, by building a vector for “yellow fruit” and another one for “banana”, we can match the query with the database item.
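To make this idea tangible, here is a minimal sketch in Python of how such context vectors could be built and compared. The toy corpus, the window size of two and the variable names are all made up for illustration; they are not taken from any particular library.

```python
import numpy as np

# Toy corpus, naively lower-cased and tokenised.
corpus = [
    "the dangerous lion hunts at night",
    "a dangerous dragon guards the castle",
    "political debate is central to democracy",
    "democracy needs political participation",
]
tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sent in tokens for word in sent})
index = {word: i for i, word in enumerate(vocab)}

# Count co-occurrences in a window of two words to the left and right.
window = 2
counts = np.zeros((len(vocab), len(vocab)))
for sent in tokens:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[index[word], index[sent[j]]] += 1

def cosine(u, v):
    """Cosine similarity: the angle-based comparison mentioned above."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

# "lion" and "dragon" share context words such as "dangerous", so their
# vectors are more similar to each other than either is to "democracy".
print(cosine(counts[index["lion"]], counts[index["dragon"]]))
print(cosine(counts[index["lion"]], counts[index["democracy"]]))
```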

So in this sense, for this task, the meaning of a word is its context vector. What is so nice about these vectors is that this extraction process can be automated, producing vectors of numbers that computers can understand and perform operations on. This idea that words can be represented as vectors is the backbone of modern Natural Language Processing, because any task can take these representations as input and use them as the fundamental modelling clay for more advanced work.

Lesson #5: There are two main ways to build vectors of meaning

Currently, two main ways of extracting these vectors are widely used in the industry: TF-IDF and the embeddings approach. Let me explain both of them.

The TF-IDF approach

The TF-IDF approach looks at how frequently a word appears in a document, weighted down by how common that word is across the rest of the corpus. This is one of the simplest and most reliable methods, closely related to the context vectors described above, since it relies only on word counts. The downside is that it does not preserve any information about word order or grammatical structure, so it is more appropriate for tasks where we mostly want to find the most relevant words, such as clustering, topic modelling and similarity.
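As an illustration, here is a minimal sketch using scikit-learn’s TfidfVectorizer (assuming a recent version of scikit-learn is installed); the example documents are made up. Frequent-but-distinctive words receive high weights, while words that appear in every document, like “the”, are pushed down by the inverse-document-frequency term.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the client reported strong quarterly financials",
    "quarterly financials show a strong balance sheet",
    "the lion is a dangerous animal",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)         # one TF-IDF vector per document

print(vectorizer.get_feature_names_out())  # the vocabulary behind the vector entries
print(cosine_similarity(X))                # the two financial documents come out most similar
```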

The embeddings approach

The embeddings approach uses a machine learning model that takes as input a random vector for each word in a text and optimizes these vectors on the task of predicting the vector of the next word – a task also called language modelling. Eventually, this optimization converges, and the vectors obtained (the embeddings) become our word representations. An example can be found in the figure below: in very broad strokes, a vector for “cat” is used as input, and the output is the vector corresponding to the next word, “purr” (not “purrs”, because words are usually reduced to their simplest form before this step). The output vector is then used in a similar way as input to predict the word coming after it, and so on, until the vectors stop changing (if you’re interested in digging deeper, just look up “word2vec”). And as you remember from the beginning, predicting the next word(s) is the task we started with from Markov! This way of doing things can be seen as stochastic, because the next word is predicted probabilistically from a certain window of words around it.

This image shows word embeddings in natural language processing. It shows how a machine learning algorithm takes a vector as input with the goal of predicting the vector of the next word. In this case, if the input is the vector of the word "cat", the goal is for the algorithm to predict the vector of the word "purr". In turn, the vector of the word "purr" becomes the input for predicting the vector of the word after that, until convergence is reached.
Image 2. Word embeddings.
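If you want to try this at home, below is a minimal sketch using gensim’s Word2Vec implementation (assuming gensim 4.x is installed). The toy sentences, already reduced to their simplest word forms, are made up for the example; a real corpus would of course be vastly larger.

```python
from gensim.models import Word2Vec

# Toy corpus of tokenised, normalised sentences.
sentences = [
    ["the", "cat", "purr", "softly"],
    ["the", "cat", "sleep", "all", "day"],
    ["the", "dog", "bark", "loudly"],
    ["the", "dog", "sleep", "at", "night"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=50,   # dimensionality of the embedding vectors
    window=2,         # context window: two words to the left and right
    min_count=1,      # keep every word, even rare ones (it is a toy corpus)
    sg=1,             # skip-gram flavour of the prediction task
    epochs=200,       # many passes, since the corpus is tiny
)

print(model.wv["cat"][:5])                # first entries of the learned vector for "cat"
print(model.wv.similarity("cat", "dog"))  # similarity between two learned vectors
```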

Starting from any of these vector representations, another process can then take place: fine-tuning, which is when you improve your representations with another, more specific task in mind. These representations are usually more expensive to obtain than TF-IDF, and are most useful when the structure of a phrase is relevant, as in translation, language generation and summarization.

In recent years, GPUs have boosted this method, because running code on GPUs makes parallel computing faster and cheaper. More specifically, with parallel computing we can look at the different word vectors of a sentence simultaneously, using what is called attention, and use this to improve our estimate of the next word’s vector. ChatGPT is nothing but such an algorithm trained on an (absurdly!) large corpus, and very efficiently implemented.
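For the curious, here is a minimal sketch of (scaled dot-product) self-attention in plain NumPy; the dimensions and the random word vectors are made up for illustration, and real models add learned projection matrices, multiple attention heads and many stacked layers on top of this.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    # Compare every word vector with every other one in parallel; the weights
    # say how much each word should "look at" the others.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output is a weighted mix of all the word vectors in the sentence.
    return weights @ X

rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(4, 8))   # 4 words in a sentence, 8-dimensional vectors
contextualized = self_attention(word_vectors)
print(contextualized.shape)              # (4, 8): one context-aware vector per word
```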

Conclusion

As I have shown, efforts to extract information from written text have a long tradition and rely on certain assumptions, like the distributional principle and stochastic prediction. Therefore, allow me to finish with three final take-aways. The first is that most language analysis is based on describing written text as vectors; while this is a very handy way of dealing with text computationally, vectors do not exhaust the ways to describe communication between people, and much work certainly lies ahead. The second is that the fanciest language model is not always the most appropriate for a task; as we saw, sometimes simple TF-IDF can yield a lot of value for a small investment, with the added benefit of more interpretable outputs. The third is that not all that glitters is gold; large language models are just elaborate stochastic prediction machines. Like ChatGPT, they will surely revolutionize the use of these tools for the wider public, but in a corporate context it is important to understand how they are built and on what data they were trained, so that the maximum value can be extracted.

It is tremendously valuable to have a good grasp of the underlying assumptions in these models so that we can demystify the developments ahead, spot better business applications and avoid a number of serious pitfalls. So maybe your microphone is listening in on your conversations – never let go of a healthy scepticism! – but more likely than not, it is just really good at predicting what your next interest will be.

About the author

Adriana Correia
Credit Model Validator

Adriana joined Rabobank with a vision: to raise awareness of the value of Machine Learning (ML) in general, and Natural Language Processing (NLP) tools in particular, for the bank. She is currently working within Credit Model Validation (CMV) to update the processes that will enable the bank to move forward into the future of modelling.
