Financial Data Extraction

Introduction

In this project, we aimed to extract financial information from webpages. The information we wanted to extract consisted of the monetary values (in yuan) of the loans mentioned on these pages: the principal, interest, fees and so on, as well as the total value. These webpages were only available in Standard Chinese.

To do this, we used NLP models to automatically extract this information from the scraped content of the webpages. After training these models, we deployed them in a mixed API/forms container app hosted on Azure.

The training phase was an opportunity to compare different types of approaches in terms of their accuracy, speed and cost. One of the challenges was the small amount of data available for training.

The training data came from a selection of manually curated scraped URLs and was formatted as (sentence, extracted values) pairs. Since the reference values had to be extracted by hand, only a relatively small dataset could be assembled.

Training

For hyperparameter tuning we used the Weights & Biases platform. This allowed us to set up sweep parameters easily and inspect the results using parallel coordinates plots. Our main tools for training models were SpaCy, GPT-3 and PyTorch (for BERT).
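As an illustration, a sweep like ours could be configured roughly as follows. This is only a sketch: the parameter names, the project name and the train() entry point are placeholders rather than the project's actual code, and the swept values shown are the epoch and dropout ranges described later for the SpaCy model.

    import wandb

    # Minimal sketch of a Weights & Biases sweep definition; names and values
    # are illustrative placeholders.
    sweep_config = {
        "method": "grid",  # exhaustive search over the listed values
        "metric": {"name": "loss", "goal": "minimize"},
        "parameters": {
            "epochs": {"values": [10, 50, 100, 400]},
            "dropout": {"values": [0.0, 0.1]},
        },
    }

    def train():
        """Hypothetical training entry point: reads wandb.config, trains, logs the loss."""
        run = wandb.init()
        cfg = run.config
        # ... fine-tune a model using cfg.epochs and cfg.dropout ...
        wandb.log({"loss": 0.0})  # placeholder value

    sweep_id = wandb.sweep(sweep_config, project="financial-extraction")
    wandb.agent(sweep_id, function=train, count=8)

The resulting runs can then be inspected in the W&B dashboard, where the parallel coordinates plot shows how each hyperparameter combination relates to the final loss.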

Scoring And Evaluation

During hyperparameter optimisation we minimised the model's loss: for SpaCy, the built-in NER engine loss; for GPT-3, the built-in text completion loss; and for BERT, the cross-entropy loss over next-token prediction.

In order to measure each model's performance on its intended task, we defined several metrics relating directly to the task objective:

  • The counts of correctly extracted values (“true positives”), missed values (“false negatives”) and falsely extracted values (“false positives”).
  • The precision & sensitivity derived from these counts.

The SpaCy and GPT-3 models were much faster to train and use than BERT. In an attempt to improve BERT's speed, we also compared the original fine-tuned BERT model with a dynamically quantised version of it.

SpaCy Model

SpaCy is a widely used library dedicated to natural language processing. It provides pretrained models in many different languages together with convenient pipelines (such as NER engines) and functionalities, including the ability to fine-tune a preloaded model.

We fine-tuned the NER engine constructed from the zh_core_web_lg NLP model dedicated to Standard Chinese. In the hyperparameter optimisation phase, we searched through different values for the number of epochs (from 10 to 400) and the dropout rate (0 or 0.1).
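A rough sketch of such a fine-tuning loop is shown below. The training sentence, the character offsets and the PRINCIPAL label are made up for illustration, and the real pipeline (data loading, evaluation, W&B logging) is omitted.

    import spacy
    from spacy.training import Example

    # Load the pretrained Chinese pipeline; only the NER component will be updated.
    nlp = spacy.load("zh_core_web_lg")

    # Illustrative training pair: (sentence, character-offset entity annotations).
    TRAIN_DATA = [
        ("贷款本金为人民币100000元。", {"entities": [(8, 14, "PRINCIPAL")]}),  # offsets of "100000"
    ]

    optimizer = nlp.resume_training()
    with nlp.select_pipes(enable=["ner"]):
        for epoch in range(100):  # the number of epochs was swept between 10 and 400
            losses = {}
            for text, annotations in TRAIN_DATA:
                example = Example.from_dict(nlp.make_doc(text), annotations)
                nlp.update([example], drop=0.1, sgd=optimizer, losses=losses)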

The underlying architecture of this model consists of deep neural networks and uses the attention mechanism.

The best model had scores of 45% precision and 52% sensitivity. This class of models seemed to be able to locate and mostly extract the right information, although the extraction was imperfect. For example, the model sometimes extracted values but ignored anything beyond the decimal point. It seems quite possible that with more training data this model would perform much better.

GPT-3

GPT-3 is OpenAI's most recent class of language models. Training and inference must be done through the OpenAI API. These models are known for their success on few-shot learning tasks.

The available models come in four tiers, from least to most capable: Ada, Babbage, Curie and DaVinci. The cost of using a model is determined by the number of tokens that pass through it. Ada is the cheapest of these models to use and DaVinci the most expensive, significantly more so than the other three.

We fine-tuned GPT-3 models using the OpenAI library. The hyperparameter search space consisted of the base model (Ada, Babbage or Curie), the number of training epochs (1, 2 or 3; 1-2 being the range recommended by OpenAI) and the learning rate multiplier (0.1 or 0.02).
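At the time, fine-tunes were created through OpenAI's (since deprecated) fine-tunes endpoint. A minimal sketch using the legacy Python client is given below; the file name is a placeholder and the values shown are one point of the search space above.

    import openai

    # Upload the JSONL training file (legacy openai-python < 1.0 interface).
    training_file = openai.File.create(
        file=open("train_prompts.jsonl", "rb"),  # placeholder file name
        purpose="fine-tune",
    )

    # Launch a fine-tune of the Ada base model.
    fine_tune = openai.FineTune.create(
        training_file=training_file.id,
        model="ada",                   # also swept: "babbage", "curie"
        n_epochs=2,                    # swept over 1, 2, 3
        learning_rate_multiplier=0.1,  # swept over 0.1, 0.02
    )
    print(fine_tune.id)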

According to the OpenAI documentation, the recommended amount of data for fine-tuning a GPT-3 model is a few hundred examples. Our data barely reached that threshold, but this was still enough to obtain promising results.

A GPT-3 model is designed to continue the text it is given as a prompt. In order to use such a model as an NER engine, one needs to create prompt/completion training examples that lead the model to generate text containing the relevant information in an easily extractable form. This practice of devising prompts that make a large language model perform an NLP task is called “prompt engineering”.
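As an illustration, the (sentence, extracted values) pairs can be serialised into OpenAI's JSONL prompt/completion format along the following lines. The separator, stop token and output layout shown here are examples of such prompt engineering, not the exact scheme used in the project.

    import json

    def to_jsonl_record(sentence: str, values: dict) -> str:
        """Format one training pair: a fixed separator ends the prompt and
        a fixed stop token ends the completion, making the output easy to parse."""
        prompt = f"{sentence}\n\n###\n\n"
        completion = " " + json.dumps(values, ensure_ascii=False) + " END"
        return json.dumps({"prompt": prompt, "completion": completion}, ensure_ascii=False)

    record = to_jsonl_record("贷款本金为人民币100000元，利息5000元。",
                             {"principal": 100000, "interest": 5000})
    print(record)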

Despite the small amount of training data, the fine-tuned models often reliably extracted the relevant information from the input. Once we fine-tuned the Ada base model, we found that it was good enough for the task at hand.

The best models reached a precision and sensitivity of at least 90% (for both train and test slices). Even the smallest models (based on Ada) reached these levels, which made them suitable for use, especially considering their reduced cost and computation time compared to the larger models.

The inference cost of an Ada model is $0.0004 per 1,000 tokens, and a sentence generally consists of fewer than 100 tokens, so a single extraction costs on the order of $0.00004.

BERT

We wanted to use a text completion model with training samples formatted in a similar way as for GPT-3, in order to replicate the methods used with GPT-3 on a personal computer.

A natural choice of model would have been GPT-2, since it has a decoder architecture like GPT-3. Unfortunately, fine-tuning GPT-2 required more memory than was available on the computer. Despite being an autoencoding (masked language) model, BERT can still be used for text completion, and being much smaller than GPT-2, it can be fine-tuned on a personal computer.

The base model was the bert-base-chinese model available on Hugging Face. Despite its relatively small size, training and inference were rather slow. In an attempt to address this issue, we applied dynamic quantisation to the fine-tuned model and evaluated its effect on the model's performance.
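A minimal sketch of loading the base model for this completion-style setup is given below, assuming the next-token-prediction framing described above (via the transformers BertLMHeadModel configured as a decoder); the actual fine-tuning loop is omitted.

    from transformers import BertTokenizer, BertLMHeadModel

    # bert-base-chinese used as a left-to-right language model for completion.
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertLMHeadModel.from_pretrained("bert-base-chinese", is_decoder=True)

    # Passing the input ids as labels yields the cross entropy loss over next
    # token prediction, which is the quantity the fine-tuning minimised.
    inputs = tokenizer("贷款本金为人民币", return_tensors="pt")
    outputs = model(**inputs, labels=inputs["input_ids"])
    print(float(outputs.loss))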

The search space for hyperparameter optimisation consisted of the number of epochs (5-8) and the learning rate (5×10⁻⁶, 10⁻⁵, 5×10⁻⁵, 10⁻⁴).

The scores of these models were generally very low. Even though a model sometimes appeared to capture the expected information, this was not enough to compete with the other types of models. Producing a functional NER engine based on this type of model would require finding more appropriate training configurations, and much more training data.

This training procedure was of course not the optimal way of using a model such as BERT. A more standard procedure would use BERT's internal representations as embeddings fed into a classifier.
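For reference, such a setup could look roughly like the sketch below, where BERT's per-token representations are fed into a token classification head; the label set is hypothetical and this was not implemented in the project.

    from transformers import BertTokenizerFast, BertForTokenClassification

    labels = ["O", "B-AMOUNT", "I-AMOUNT"]  # hypothetical BIO label set

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
    model = BertForTokenClassification.from_pretrained(
        "bert-base-chinese",
        num_labels=len(labels),
    )

    # Each token is classified into one of the labels; training would minimise
    # the cross entropy between these per-token logits and gold BIO tags.
    inputs = tokenizer("贷款本金为人民币100000元。", return_tensors="pt")
    logits = model(**inputs).logits  # shape: (1, sequence_length, len(labels))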

Quantised BERT

Quantisation consists, roughly, in replacing costly floating-point computations with integer operations. This procedure reduced the memory footprint of the BERT model by ~60% (from ~410 MB to ~170 MB) and the computation time of the full inference pipeline (not just the quantised part) by ~35%.
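In PyTorch, dynamic quantisation amounts to swapping the model's Linear layers for int8 versions, roughly as follows (here the base checkpoint stands in for the fine-tuned model):

    import torch
    from transformers import BertLMHeadModel

    model = BertLMHeadModel.from_pretrained("bert-base-chinese", is_decoder=True)

    # Weights of the Linear layers are stored as int8 and de-quantised on the
    # fly during matrix multiplication, shrinking the model and speeding up
    # CPU inference.
    quantized_model = torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )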

These results were encouraging but still insufficient to make these models practical. Another limitation of dynamic quantisation is that the original model is required in order to load the quantised one. The approach could potentially be improved with static quantisation or distillation (for example, DistilBERT).