Extracting Information from Natural Language Using Generative AI | by Oren Matar | May, 2024

Extracting and structuring text elements with high accuracy using small models

Towards Data Science
Image generated by an AI by the author

In this post, I’ll introduce a paradigm recently developed at Anaplan for extracting temporal information from natural language text, as part of an NLQ (natural language query) project. While I will focus on time extraction, the paradigm is versatile and applicable for parsing various unstructured texts and extracting diverse patterns of information. This includes named entity recognition, text-to-SQL conversion, quantity extraction, and more.

The core of the paradigm is a flexible pipeline that makes it easy to fine-tune a model to extract the meaning of virtually any expression in the language. It is based on a deep learning model (a transformer), yet for us it achieved 99.98% accuracy, which is rare for ML methods. It also does not rely on LLMs (large language models); in fact, it requires only a minimal transformer model. The result is a compact, adaptable ML model with the precision of a rule-based system.

For those seeking time, numerical value, or phone number extraction, Facebook’s Duckling package offers a rule-based solution. However, if Duckling falls short of your requirements or you’re eager to explore a new ML paradigm, read on.

Can LLMs capture the meaning?

LLMs, despite their capabilities, struggle to parse time expressions and extract their full meaning. Consider the expression “the first 15 weeks of last year.” Converting this to a date range requires the model to determine the current year, subtract one, and calculate where the 15th week ends while accounting for leap years. Language models were not built for this kind of computation.
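For contrast, the same calculation takes only a few lines of deterministic code. The following is a minimal sketch, not code from the project: the function name is hypothetical, and it assumes weeks are counted as consecutive 7-day blocks starting on 1 January, since the post does not specify a week convention.

```python
from datetime import date, timedelta

def first_n_weeks_of_last_year(n_weeks, today):
    """Return the inclusive (start, end) dates of the first n_weeks of the previous year.

    Assumption: week 1 starts on 1 January and every week is a 7-day block.
    """
    start = date(today.year - 1, 1, 1)
    # Date arithmetic handles month lengths and leap years for us.
    end = start + timedelta(weeks=n_weeks) - timedelta(days=1)
    return start, end

start, end = first_n_weeks_of_last_year(15, today=date(2024, 5, 1))
print(start, end)  # 2023-01-01 2023-04-15
```

Under a different week convention (ISO weeks, for example) the boundaries would shift, but the calculation would stay equally mechanical.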

In my experience, LLMs output the correct date range around 90–95% of the time but struggle with the remaining 5–10%, regardless of the prompting technique. LLMs are also resource-intensive and slow.

Thankfully, by following three principles, compact transformers can successfully accomplish the task:

  1. Separate information extraction from logical deduction.
  2. Auto-generate a dataset using structured patterns.
  3. Constrain the generative AI to the required structure.

In this post, I will cover the first two; I covered the third in a previous post.

Separate information extraction from logical deduction

The first principle is to ensure that the language model’s role is to extract information from free text, rather than to make any logical deduction: logical deductions can easily be implemented in code.

Consider the phrase: “How many movies came out two years ago?” The language model’s task should be to identify that the relevant year is: this_year - 2, without calculating the actual year (which means it doesn’t need to know the current year). Its focus is parsing the meaning and structuring unstructured language. Once that formula is extracted, we can implement its calculation in code.

For this to work, we introduce a Structured Time Language (STL) capable of expressing time elements. For instance, “in 2020” translates to “TIME.year==2020”, and “three months from now” becomes “NOW.month==3”. While the full STL isn’t detailed here, it should be relatively intuitive: you can reference attributes such as year, quarter, and month, either for an absolute time (TIME) or relative to the present (NOW). The translation of “the last 12 weeks of last year” is “NOW.year==-1 AND TIME.week>=-12”.

By removing any logical deduction or calculation from the task, we take a huge burden off the language model and allow it to focus on information extraction. This division of labor will improve its accuracy significantly. After the translation process is complete, it is straightforward to develop code for a parser that reads the structured language and retrieves the necessary date range.
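To illustrate how small such a parser can be, here is a minimal sketch that resolves year-only STL strings to a date range. It is not the parser used in the project: the function name, the regular expression, and the supported subset (only `year`, with `==`, `>=`, and `<=`) are assumptions made for illustration, and `NOW` values are assumed to be offsets from the current year, as in the “NOW.year==-1” example above.

```python
import re
from datetime import date

# Hypothetical subset of STL: "<TIME|NOW>.year<op><int>" clauses joined by AND.
CLAUSE = re.compile(r"^(TIME|NOW)\.year\s*(==|>=|<=)\s*(-?\d+)$")

def stl_to_range(stl, today):
    """Resolve a year-only STL string to an inclusive (start, end) date range.

    A bound of None means the range is open on that side.
    """
    start, end = None, None
    for clause in stl.split(" AND "):
        match = CLAUSE.match(clause.strip())
        if match is None:
            raise ValueError(f"unsupported STL clause: {clause!r}")
        scope, op, value = match.group(1), match.group(2), int(match.group(3))
        # NOW.* values are offsets from the current year; TIME.* values are absolute.
        year = today.year + value if scope == "NOW" else value
        if op in ("==", ">="):
            lower = date(year, 1, 1)
            start = lower if start is None else max(start, lower)
        if op in ("==", "<="):
            upper = date(year, 12, 31)
            end = upper if end is None else min(end, upper)
    return start, end

today = date(2024, 5, 1)
print(stl_to_range("NOW.year==-2", today))       # "two years ago": all of 2022
print(stl_to_range("TIME.year >= 2020", today))  # "since 2020": open-ended range
```

The current date enters only here, as an argument to the parser, so the language model never needs to know it.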

Since this is a translation task, from natural language to STL, we used an encoder-decoder transformer: the BART model from Hugging Face, which can easily be fine-tuned for this task.

But how do we get the data for training the model?

Auto-generate a dataset using structured patterns

Since a training dataset for this translation task does not exist, we must generate it ourselves. This was done by following these steps:

Step one: Write functions to map datetime objects to both “natural language” and STL formats:

def since_year(dt):
    # e.g. "since 2020" -> "TIME.year >= 2020"
    free_text = f"since {dt.year}"
    answer = f"TIME.year >= {dt.year}"
    return free_text, answer

def half_literal(dt):
    # Build the day number directly; the "%-d" strftime code is not portable.
    free_text = f"{dt.day}, {dt:%B %Y}"
    answer = f"TIME.date >= {dt:%Y/%m/%d}"
    return free_text, answer

def until_quarter_year(dt):
    # Months 1-3 map to Q1, 4-6 to Q2, 7-9 to Q3, 10-12 to Q4.
    q = (dt.month - 1) // 3 + 1
    free_text = f"until Q{q}-{dt.year}"
    answer = f"TIME.year=={dt.year} AND TIME.quarter=={q}"
    return free_text, answer

Given a datetime object, these functions return a tuple of free text and its corresponding STL, for instance: “since 2020”, “TIME.year >= 2020”.

Step two: Sample a random function, and sample a random date within a specified range:

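date = pd.Timestamp(np.random.choice(pd.date_range('1970/1/1', '2040/12/31')))

Then pass the sampled date to the sampled function. Here np and pd are NumPy and pandas; wrapping the sample in pd.Timestamp keeps attributes such as .year available, since np.random.choice hands back a NumPy datetime64 value rather than a Timestamp.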

Step three: Append the free text to a random question. Questions can easily be generated at random or drawn from an existing question dataset; their quality and meaning are not very important.

With this pipeline, we can quickly generate thousands of text-STL pairs, for example:

  • “What was the GDP growth in Q2-2019?”, “TIME.quarter==2 AND TIME.year==2019”
  • “Since 2017, who won the most Oscars?”, “TIME.year>=2017”
  • “Who was the president on 3 May 2020?”, “TIME.date==2020/05/03”
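The three steps can be combined into a short generator. This is a minimal sketch under stated assumptions: it uses only the standard library (`random` and `datetime`) instead of NumPy and pandas so that it runs without dependencies, and `in_quarter_year`, `PATTERNS`, `QUESTIONS`, and `generate_pair` are hypothetical names introduced for illustration.

```python
import random
from datetime import date, timedelta

def since_year(d):
    return f"since {d.year}", f"TIME.year>={d.year}"

def in_quarter_year(d):
    q = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, ..., 10-12 -> Q4
    return f"in Q{q}-{d.year}", f"TIME.quarter=={q} AND TIME.year=={d.year}"

PATTERNS = [since_year, in_quarter_year]
# Hypothetical question templates; "{time}" marks where the time expression goes.
QUESTIONS = ["What was the GDP growth {time}?", "Who won the most Oscars {time}?"]

FIRST_DAY = date(1970, 1, 1)
N_DAYS = (date(2040, 12, 31) - FIRST_DAY).days

def generate_pair(rng):
    d = FIRST_DAY + timedelta(days=rng.randint(0, N_DAYS))   # step two: random date
    free_text, stl = rng.choice(PATTERNS)(d)                 # step two: random pattern
    question = rng.choice(QUESTIONS).format(time=free_text)  # step three: random question
    return question, stl

rng = random.Random(0)  # seeded so that the dataset is reproducible
dataset = [generate_pair(rng) for _ in range(1000)]
print(dataset[0])
```

Adding a new pattern then means writing one more function and appending it to `PATTERNS`.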

This approach makes it easy to add new patterns. If you find a time expression that is not covered by one of these functions (e.g. “In N years”), you can write a function that will generate examples for this pattern within seconds.

In practice, we can make the code more efficient still. Rather than writing separate functions for each pattern, such as “since 2020” and “until 2020,” we can randomly sample connective words like “since,” “until,” “on,” etc. The initial batch of functions may take some time to develop, but you can quickly scale to hundreds of patterns. After that, addressing any missing expression becomes trivial, as the pipeline is already established. With a few iterations, nearly all relevant expressions can be covered.
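Sampling the connective word might look like the sketch below. The `CONNECTIVES` table and the function name are hypothetical, and the mapping from each word to an STL comparison operator (in particular “until” to `<=`) is an assumption, since the full STL is not specified in this post.

```python
import random
from datetime import date

# Assumed mapping from connective word to STL comparison operator.
CONNECTIVES = {"since": ">=", "until": "<=", "in": "=="}

def connective_year(d, rng):
    """Generate one (free text, STL) pair with a randomly sampled connective."""
    word, op = rng.choice(sorted(CONNECTIVES.items()))
    return f"{word} {d.year}", f"TIME.year{op}{d.year}"

rng = random.Random(0)
for _ in range(3):
    print(connective_year(date(2020, 6, 1), rng))
```

One function now covers several patterns, and supporting a new connective is a one-line change to the table.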

Moreover, we don’t need to cover all the expressions: Since the transformer model we used is pre-trained on a huge corpus of text, it will generalize from the provided patterns to new ones.

Finally, we can use an LLM to generate more examples. Simply ask an LLM:

Hey, what's another way to write "What was the revenue until Aug 23"

And it may return:

"How much did we make before August 2023".

This data augmentation process can be automated too, by sending numerous examples to an LLM and thus adding variety to our dataset. Because the LLM is used only for dataset creation, its cost and speed are of little concern.
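One way to automate this is sketched below. The `augment` function and its `paraphrase` callable are hypothetical: `paraphrase` stands in for whatever LLM client you use, and it is replaced here by a canned stub so the example runs offline. The sketch assumes each paraphrase preserves the meaning of the time expression, so the original STL label can be reused; in practice that is worth spot-checking.

```python
def augment(pairs, paraphrase, n_variants=2):
    """Return the original pairs plus paraphrased questions that reuse the same STL label."""
    augmented = list(pairs)
    for question, stl in pairs:
        for variant in paraphrase(question, n_variants):
            if variant != question:  # skip paraphrases that changed nothing
                augmented.append((variant, stl))
    return augmented

def fake_paraphrase(question, n):
    # Stand-in for a real LLM call; the canned answer is illustrative only.
    canned = {
        "Since 2017, who won the most Oscars?": [
            "Who has won the most Oscars from 2017 onward?",
        ],
    }
    return canned.get(question, [])[:n]

pairs = [("Since 2017, who won the most Oscars?", "TIME.year>=2017")]
for question, stl in augment(pairs, fake_paraphrase):
    print(question, "->", stl)
```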

Combining the flexibility of adding new patterns, the generalization of the pre-trained model, and data augmentation using an LLM, we can effectively cover almost any expression.

The final principle of this paradigm is to constrain the generative AI to produce only STL queries, ensuring adherence to the required structure. The method to achieve this, as well as a method for optimizing the tokenization process, was discussed in a previous post.

By adhering to these three principles, we achieved an accuracy of 99.98% on our test dataset. Moreover, this paradigm gave us the flexibility to address new, unsupported time expressions quickly.

Summary

Large Language Models (LLMs) aren’t always the optimal solution for language tasks. With the right approach, shallower transformer models can efficiently extract information from natural language with high accuracy and flexibility, in less time and at lower cost.

The key principles to remember are:

  1. Focusing the model only on information extraction, avoiding complex logical deductions. This may require generating a mediating language and implementing a parser and logical deduction in code.
  2. Establishing a pipeline for generating a dataset and training a model, so that adding new functionality (new language patterns) is straightforward and fast. This pipeline can include the use of an LLM, adding more variety to the dataset.
  3. Confining the model generation to the constraints of a structured language.

While this post focused on extracting time elements, the paradigm applies to extracting any information from free text and structuring it into various formats. With this paradigm, you can achieve the accuracy of a rule-based engine, with the flexibility of a machine learning model.
