Developing a Modern ML Solution for a Packaging Manufacturer

Highlights

Modern ML-Powered Tool: delivered a machine learning solution that automates invoice processing and integrated it into the client’s CRM via API.
Data Aggregation in Hours, Not Weeks: cut the time needed to aggregate information on the main costs of repair work and the availability of spare parts from weeks to hours.
Operating with 450 Vendors: the tool automatically sorts invoice information from 450 vendors across the client’s database into 33 classes, a task that previously required substantial manual effort.

Client

The client is a respected manufacturer of liquid packaging board and market pulp with over 70 years of experience. The company is known for its commitment to safety, reliability, quality, and sustainability, serving customers across North America, Asia, and beyond. The client has a mission to produce the highest-quality paperboard and market pulp, provide unsurpassed customer service, and deliver technical, professional, and operational excellence in everything they do.

Country

USA

Industry

Manufacturing

Team Size

2 specialists (a Project Manager and a Machine Learning Engineer)

Product

The product is a Machine Learning-powered tool for internal use only. It is designed to automate paperwork and invoice processing for the repair of heavy machinery and equipment. The tool predicts the master service code of each invoice item from its textual description, with a document parsing pipeline as its main component.

Goals and objectives

Automate the Process of Data Collection: develop algorithms that automatically extract relevant data from machinery repair invoices regardless of layout, since these invoices come in various formats and structures. Work out a solution that includes Natural Language Processing (NLP) algorithms to identify and extract key information such as invoice number, date, items, quantities, prices, and totals.
Automate Data Processing: develop and tune algorithms that clean and standardize the extracted data, eliminating inconsistencies and errors so that the output is uniform and accurate.
Reduce Human Effort: reduce the time and effort required from human operators by automating data collection and processing, freeing employees to focus on higher-value business tasks.
Make the Solution Scalable: design the system architecture to scale horizontally, adding computational resources as the volume of invoices and data sources grows.
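To illustrate the cleaning and standardization goal above, here is a minimal sketch of how extracted values could be normalized. It is not the delivered implementation: the function names, the list of date layouts, and the US-style amount format (comma as thousands separator) are all assumptions made for the example.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation
from typing import Optional

# Date layouts tried in order. This list is an assumption for the example,
# not something taken from the project.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%b %d, %Y", "%B %d, %Y")


def normalize_description(text: str) -> str:
    """Collapse whitespace, lowercase, and trim stray punctuation at the edges."""
    text = re.sub(r"\s+", " ", text).strip().lower()
    return text.strip(" .,;:-")


def parse_amount(raw: str) -> Optional[Decimal]:
    """Parse US-style amounts such as '$1,234.50' or '(75.00)' (accounting negative)."""
    cleaned = raw.strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    # Keep only digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^\d.\-]", "", cleaned)
    if not cleaned:
        return None
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    return -value if negative else value


def parse_date(raw: str) -> Optional[date]:
    """Return the first successful parse among DATE_FORMATS, or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None
```

Returning None instead of raising keeps unparseable values visible, so they can be routed to manual review rather than silently dropped.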

Project challenge

No Standardized Requirements for the Format of Invoices: with no standardized invoice format to rely on, the ML models had to be robust enough to handle diverse data layouts and structures.
Lack of Labeled Data and the Presence of Highly Imbalanced Data: only a limited amount of labeled training data was available for invoice processing tasks, and it was highly imbalanced, with certain classes significantly more prevalent than others.
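One common way to soften class imbalance is to weight each training sample inversely to the frequency of its class. The sketch below shows that idea only; the case study does not state which balancing method was used, and the function names are hypothetical. The formula is the same "balanced" heuristic that scikit-learn uses: n_samples / (n_classes * class_count).

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def sample_weights(labels: Sequence[Hashable]) -> List[float]:
    """Expand the per-class weights into one weight per training sample."""
    weights = balanced_class_weights(labels)
    return [weights[label] for label in labels]
```

With these weights every class contributes the same total weight to the training loss, so rare service codes are not drowned out by frequent ones. Many estimators accept such a list through a sample_weight argument when fitting.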

Solution

To implement the project, we assembled a team of a Project Manager and a Machine Learning Engineer with deep Python development expertise.

We started with a Proof of Concept (PoC) phase to determine whether our approach was viable. The PoC showed that combining gradient boosting classifiers with transformer-based text embedding models was the right choice, and development began.

We developed a Python pipeline that classifies item lines in invoices. The tool first uses AWS Textract to extract text and tabular data from PDF documents. The pipeline then categorizes each item line using Natural Language Processing (NLP) techniques, namely text embeddings and text classification. Working from both the raw text and the table structure helps keep the extracted data accurate and complete.
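As an illustration of the extraction step, the sketch below rebuilds table rows from a Textract response. It relies on the documented response shape (a "Blocks" list in which TABLE blocks point to CELL blocks, and CELL blocks point to WORD blocks, through CHILD relationships). The helper names are hypothetical, and this is not the code used in the project.

```python
from typing import Any, Dict, List


def _child_ids(block: Dict[str, Any]) -> List[str]:
    """Collect the ids referenced by a block's CHILD relationships."""
    return [
        child_id
        for rel in block.get("Relationships", [])
        if rel["Type"] == "CHILD"
        for child_id in rel["Ids"]
    ]


def tables_from_textract(response: Dict[str, Any]) -> List[List[List[str]]]:
    """Rebuild each TABLE block as a list of rows, each a list of cell strings."""
    blocks = {b["Id"]: b for b in response["Blocks"]}
    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in _child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[w]["Text"]
                for w in _child_ids(cell)
                if blocks[w]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in Textract responses.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


def analyze_pdf(pdf_bytes: bytes) -> Dict[str, Any]:
    """Call Textract synchronously (needs boto3 and AWS credentials).

    Assumption: the synchronous call is enough for single-page documents;
    multi-page PDFs would go through the asynchronous document analysis API.
    """
    import boto3  # imported lazily so the parser above works without it

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": pdf_bytes}, FeatureTypes=["TABLES"]
    )
```

Keeping the parser separate from the AWS call means it can be tested against saved responses without network access.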

To train and improve the model, we used labeled item lines of invoices as a data source.
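The sketch below shows one way that text embeddings and a gradient boosting classifier can be wired together for training and prediction. It is a simplified illustration, not the project code. The embedder and classifier are passed in as parameters so the flow can be tried with stand-ins; the two factory functions show how the real libraries would plug in. The model name "all-MiniLM-L6-v2" and all function names are assumptions.

```python
from typing import Any, Callable, List, Protocol, Sequence

# An embedder turns a batch of texts into one numeric vector per text.
Embedder = Callable[[Sequence[str]], List[List[float]]]


class Classifier(Protocol):
    """Minimal fit/predict interface, as offered by scikit-learn estimators."""

    def fit(self, X: Any, y: Any) -> Any: ...
    def predict(self, X: Any) -> Any: ...


def train_item_line_classifier(
    texts: Sequence[str],
    labels: Sequence[str],
    embed: Embedder,
    classifier: Classifier,
) -> Classifier:
    """Embed labeled item lines and fit the classifier on the vectors."""
    classifier.fit(embed(texts), list(labels))
    return classifier


def predict_service_codes(
    texts: Sequence[str], embed: Embedder, classifier: Classifier
) -> List[Any]:
    """Predict one service code per item line."""
    return list(classifier.predict(embed(texts)))


def make_sentence_embedder(model_name: str = "all-MiniLM-L6-v2") -> Embedder:
    """Embedder backed by Sentence Transformers (model name is an assumption)."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return lambda texts: model.encode(list(texts)).tolist()


def make_gradient_boosting() -> Classifier:
    """Gradient boosting classifier from scikit-learn with default settings."""
    from sklearn.ensemble import GradientBoostingClassifier

    return GradientBoostingClassifier()
```

With both libraries installed, training would look like train_item_line_classifier(texts, labels, make_sentence_embedder(), make_gradient_boosting()).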

To ensure the accuracy of the text classification functionality, we:

Collected data from reliable sources and cleaned it to remove noise and inconsistencies.
Ensured accurate labeling for each text sample to build a reliable training dataset.
Balanced the dataset to prevent class imbalance issues that can affect model performance.
Split the dataset into training, validation, and test sets to evaluate the model’s performance effectively.
Used metrics such as precision, recall, and F1-score to measure the quality and reliability of the classification results.
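The splitting and evaluation steps above can be sketched in plain Python as follows. The 70/15/15 split fractions, the fixed seed, and the function names are assumptions for the example. Macro-averaged F1 is shown because it weights every class equally, which matters when some classes are rare; the case study does not say which averaging was used.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable],
    fractions: Tuple[float, float, float] = (0.7, 0.15, 0.15),
    seed: int = 0,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (train, validation, test) index lists with similar class proportions."""
    rng = random.Random(seed)
    by_class: Dict[Hashable, List[int]] = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def per_class_scores(y_true: Sequence, y_pred: Sequence) -> Dict[Hashable, Dict[str, float]]:
    """Precision, recall, and F1 for every class seen in y_true or y_pred."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


def macro_f1(y_true: Sequence, y_pred: Sequence) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = per_class_scores(y_true, y_pred)
    return sum(s["f1"] for s in scores.values()) / len(scores)
```

Note that with very small classes the rounding can leave the validation or test portion of a class empty, so such classes deserve a separate check.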

Finally, we integrated the model with the client’s custom CRM system via API.
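To give a rough idea of what such an integration point can look like, the sketch below validates a JSON request and returns predicted service codes. The payload fields ("item_lines", "predictions", "service_code") and the function name are hypothetical. The actual API contract with the client's CRM is not described in this case study.

```python
import json
from typing import Callable, List, Sequence, Tuple


def handle_classify_request(
    body: bytes, predict: Callable[[Sequence[str]], List[str]]
) -> Tuple[int, str]:
    """Validate a JSON request and return (HTTP status, JSON response body)."""
    try:
        payload = json.loads(body)
    except ValueError:  # covers malformed JSON and undecodable bytes
        return 400, json.dumps({"error": "body must be valid JSON"})

    items = payload.get("item_lines") if isinstance(payload, dict) else None
    if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
        return 400, json.dumps({"error": "'item_lines' must be a list of strings"})

    codes = predict(items)
    predictions = [
        {"item_line": text, "service_code": str(code)}
        for text, code in zip(items, codes)
    ]
    return 200, json.dumps({"predictions": predictions})
```

Because the handler takes the prediction function as a parameter, it can sit behind any web framework and be tested without loading a model.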

Tech Stack

Python
Sentence Transformers

Our results

It took around 6 months to deliver the full scope of the planned functionality and achieve our client’s business goals.

Complete Automation of Invoice Processing: delivered an ML-powered tool, available through an API, that predicts the master service code of an invoice item from its textual description.
Additional Time Savings and Reduction of Human Error: automating invoice parsing reduces processing time and the number of errors compared to manual work, improving overall business efficiency and compliance with regulations.