The amazing journey of text classification

I remember working on my first text classification problem over 10 years ago. I had recently taken Andrew Ng's famous machine learning course and was keen on doing some "machine learning" 😄. I came up with the idea of a dashboard that could be presented to banks, reporting their reputation score based on social media interactions. The use case was trivial and so was my approach: count positive and negative words using open-source lexicons! I'm not sure how accurate it was, but for a dashboard meant to validate the idea with stakeholders, it worked well. More scientific approaches would have offered better accuracy, but I was amazed at how easy it was to build a solution that provided a valuable signal to stakeholders.

I was reminded of this last week while preparing for my course on "Building Products with OpenAI", where I use a large language model (LLM) like gpt-3.5-turbo for the same task of sentiment classification, but this time with just a single-line prompt. It barely takes any time and is probably more accurate than a robust machine learning-based solution. The cost (time and effort) of building a solution to check whether it is valuable is now so low that everyone should be building products, iterating on them, and discarding them when they're not useful. This has been one of my driving principles over the last year and is also what I emphasize in my course.
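To make that concrete, here is a minimal sketch of what such a one-line-prompt classifier can look like with the openai Python client. The prompt wording, the helper name, and the example review are my own illustrative choices, not something from the course or the paper.

```python
# Minimal one-prompt sentiment classifier using the OpenAI chat API.
# Assumes the `openai` package (v1+ client) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

def classify_sentiment(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the output as deterministic as possible for classification
        messages=[{
            "role": "user",
            "content": f"Classify the sentiment of this review as Positive or Negative: {text}",
        }],
    )
    return response.choices[0].message.content.strip()

if __name__ == "__main__":
    print(classify_sentiment("The new mobile app makes banking with them a pleasure."))
```

That is the entire "solution": one instruction, one API call, and a label back.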

But just like before, there will be ways to push the limits and find more sophisticated, robust ways of doing the same task to achieve state-of-the-art results. I came across a paper last week, 'Text Classification via Large Language Models', that designs prompts and adds supporting techniques to improve classification accuracy when using LLMs for this task. Initially, I dismissed it as a fancy prompt-pretending-to-be-a-paper, but reading through it I discovered some interesting ideas and processes that I would like to highlight.

The basic premise of the paper is, of course, the prompt, which the authors call Clue And Reasoning Prompting, abbreviated as CARP. The prompt looks as follows:

This is an overall sentiment classifier for movie reviews.
First, list CLUES (i.e. keywords, phrases, contextual information, semantic relations, semantic meaning, tones, references) that support the sentiment determination of input.
Second, deduce the diagnostic REASONING process from premises (i.e. clues, input) that supports the INPUT sentiment determination (Limit the number of words to 130).
Third, based on clues, reasoning, and input, determine the overall SENTIMENT of INPUT as Positive or Negative.
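As a rough illustration of how this prompt could be used end to end, here is a sketch that wraps the CARP instructions around an input review and pulls out the final label. The zero-shot setup, the `carp_classify` helper, and the naive label-extraction heuristic are mine, not the paper's.

```python
# Sketch: wrapping the CARP instructions around a review and extracting the sentiment.
# Assumes the `openai` package (v1+ client) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

CARP_INSTRUCTIONS = """This is an overall sentiment classifier for movie reviews.
First, list CLUES (i.e. keywords, phrases, contextual information, semantic relations, semantic meaning, tones, references) that support the sentiment determination of input.
Second, deduce the diagnostic REASONING process from premises (i.e. clues, input) that supports the INPUT sentiment determination (Limit the number of words to 130).
Third, based on clues, reasoning, and input, determine the overall SENTIMENT of INPUT as Positive or Negative."""

def carp_classify(review: str) -> str:
    prompt = f"{CARP_INSTRUCTIONS}\n\nINPUT: {review}"
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    output = response.choices[0].message.content
    # Naive extraction: whichever label appears last in the response wins,
    # since the SENTIMENT step comes at the end of the model's answer.
    return "Positive" if output.rfind("Positive") > output.rfind("Negative") else "Negative"

if __name__ == "__main__":
    print(carp_classify("A beautifully shot film let down by a meandering script."))
```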

I think the approach is similar to the Chain-of-Thought (CoT) prompting paradigm, but this one adds hints, the clues, that give the model more information to reason with. In fact, the authors compare their accuracy against CoT-based prompts. I found the ablation studies to be the most interesting part of the paper; here are the findings that stood out to me:

  • In the few-shot scenario, the order in which the examples are organized makes a difference to prediction accuracy. If examples that are more similar to the target text (the example to be classified) are placed toward the end of the prompt, closer to the target text, the LLM does a better job. The authors refer to this as the effect of the demonstration order (see the first sketch after this list).
  • Based on my reading of the results, it did not matter much which set of demonstration examples was chosen. The authors tried both randomly picked examples and examples selected by similarity to the target text. While accuracy did improve with similarity-based selection, the uplift wasn't large enough to justify spending effort on the selection.
  • There has been a lot of discussion recently about how LLMs can be their own judges, and this is exactly what the authors tried as well. They used the LLM to generate the reasoning data for the training examples, which can then be used in the prompt as demonstration examples. They also report additional metrics, covering reliability, fluency, and logical faithfulness, to confirm that the LLM-generated reasoning is sound.
  • Finally, the authors make five calls to the LLM for each observation and then perform majority voting or weighted voting over the answers (the second sketch after this list shows the majority-voting idea). This was the first paper I came across that used this technique, but I find it is becoming common practice when working with LLMs.
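Here is a small sketch of the demonstration-order idea from the first bullet: sort the few-shot examples so that the most similar one sits right before the target text. The word-overlap similarity and the demo reviews are stand-ins I chose for brevity; an embedding-based similarity would be closer to what the paper describes.

```python
# Sketch of the demonstration-order effect: order few-shot examples by
# similarity to the target so the most similar example appears last,
# i.e. closest to where the target text is placed in the prompt.

def jaccard(a: str, b: str) -> float:
    # Crude word-overlap similarity; a real setup would use sentence embeddings.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def build_few_shot_prompt(demos: list[tuple[str, str]], target: str) -> str:
    # Ascending similarity: least similar first, most similar right before the target.
    ordered = sorted(demos, key=lambda d: jaccard(d[0], target))
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in ordered]
    lines.append(f"Review: {target}\nSentiment:")
    return "\n".join(lines)

demos = [
    ("A dull, plodding drama.", "Negative"),
    ("An absolute joy from start to finish.", "Positive"),
    ("The acting was wooden and the plot predictable.", "Negative"),
]
print(build_few_shot_prompt(demos, "The plot was predictable but the acting saved it."))
```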
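And a sketch of the voting idea from the last bullet: sample the same classification prompt several times at a non-zero temperature and take the majority label. The prompt and the `classify_once` helper are mine; the weighted variant mentioned in the paper would assign a weight to each vote rather than counting them equally.

```python
# Sketch of majority voting over repeated LLM calls for one observation.
# Assumes the `openai` package (v1+ client) and an OPENAI_API_KEY in the environment.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def classify_once(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.7,  # non-zero so the repeated samples can actually differ
        messages=[{
            "role": "user",
            "content": f"Classify the sentiment of this review as Positive or Negative: {text}",
        }],
    )
    return "Positive" if "positive" in response.choices[0].message.content.lower() else "Negative"

def classify_with_voting(text: str, n_votes: int = 5) -> str:
    votes = Counter(classify_once(text) for _ in range(n_votes))
    return votes.most_common(1)[0][0]  # majority label across the samples

if __name__ == "__main__":
    print(classify_with_voting("The film drags, but the final act is genuinely moving."))
```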

I enjoyed reading the paper as it gave me a sense of how one could make prompts more sophisticated and refine the technique to maximize performance. A simple one-line prompt would also do the job, just like my lexicon-based approach, but if you have already received feedback that your product is feasible, then you should iterate and push for higher accuracy through robust prompts and control mechanisms.
