Leveraging ML and OpenAI for Automated Data Collection and Processing in B2B Intelligence Services 

Highlights

  • 5x Cost Reduction: Thanks to the automation of data collection and processing, the product offers faster task completion contributing to enhanced staff efficiency.
  • Advanced Technology Integration: Our solution integrates NLP and the YOLO model to accurately extract tabular data from PDFs, with Camelot or AWS Textract for streamlined extraction and GPT ensuring data accuracy for downstream analysis.
  • Enhanced Scalability: By efficiently handling a larger volume of financial reports, our client experiences improved productivity and scalability, positioning them for sustained success in meeting evolving business needs. 

Client

Our client is a leading B2B company specializing in data and intelligence services for financial firms. They offer exclusive access to a proprietary database, deliver curated news and analysis on key financial sectors, and organize bespoke events tailored to industry needs. Established through the strategic integration of several financial publication sources and databases by a parent company, they provide comprehensive solutions to empower informed decision-making within the financial sector.

Country
Industry
Team Size:

Product

The client’s product is designed to streamline the collection of tabular data from diverse sources, notably the financial reports of leading insurance companies publicly accessible in PDF format.

Tailored for the analytics team within the organization, the product automates the search for specific tabular data in PDF files, extracts the identified tables, performs post-processing, and consolidates them into a singular Excel file (xlsx). We established this process by implementing a document parsing pipeline, which analyzes and extracts pertinent data based on predefined rules or patterns. 

Goals and objectives

In addressing the challenges posed by diverse PDF structures and the need for streamlined data processing, our goals were centered around revolutionizing the efficiency and scalability of financial data extraction and analysis.

  • Automate PDF Data Collection: Streamline the collection of data from PDFs of varying structures by implementing an automated process.
  • Automate Data Processing: Enhance data clarity and consistency through automated processing, ensuring that all extracted data maintains a uniform structure and cleanliness.
  • Reduce Human Time on Collection and Processing: Minimize human involvement in data collection and processing tasks, thereby optimizing resource allocation and accelerating workflow efficiency.
  • Scale Solution for Increased Data Sources: Enable the scalability of the solution to handle a larger volume of data sources, facilitating the processing of financial reports from new insurers’ PDFs effortlessly. 

Project challenge

  1. Non-Standardized PDF Formats:
    Addressing the absence of standardized requirements for financial public reports among insurance companies, resulting in diverse structures, formats, and table layouts in PDF files.
  2. Selective Data Extraction:
    Developing automation mechanisms for extracting specific contract types from comprehensive financial reports, despite containing a wide array of operational information, thereby necessitating precise data extraction without human involvement. 

Solution

Drawing upon the expertise of two dedicated machine learning engineers, our team has crafted a robust Python pipeline tailored to meet the unique requirements of our client seamlessly. Our solution promises efficiency and precision thanks to a fusion of cutting-edge technologies such as:

  • Natural Language Processing (NLP) Integration: Harnessing the power of NLP, the tool identifies the required data in PDF documents, ensuring targeted extraction.
  • YOLO Model Implementation: Leveraging the YOLO (You Only Look Once) model, the pipeline excels in table detection, swiftly and accurately pinpointing tables scattered across each PDF page, thereby laying the foundation for streamlined data extraction.
  • Extraction Phase: To extract the identified tables from PDF pages, our solution seamlessly integrates Camelot, a Python library used for extracting tables from PDF files, or AWS Textract, leveraging their robust capabilities to ensure comprehensive and precise data retrieval.
  • Post-Processing with GPT: Following extraction, the extracted tables undergo meticulous post-processing facilitated by the capabilities of GPT. This crucial step guarantees data accuracy and completeness, preparing the extracted data for seamless integration into downstream analysis and decision-making processes.

Furthermore, the solution is engineered with scalability in mind, equipped to effortlessly handle an expanding volume of PDF documents and adapt to evolving client needs. This scalability ensures the sustained effectiveness and relevance of our solution over time, enabling the organization to seamlessly navigate the challenges of data extraction and analysis with ease.

Tech Stack

  • AWS Textract AWS Textract
  • Python Python
  • OpenAI (GPT4) OpenAI (GPT4)

Process

  1. Proof of Concept (POC) stage was aimed to assess the feasibility of automating data post processing.
  2. Minimum Viable Product (MVP) stage focused on determining the feasibility of sourcing data from 25 disparate structures for a single contract type.
  3. Scaling stage involved expanding the solution to encompass 50 sources and a broader range of contract types.

Our results

The delivered product revolutionizes the process of data extraction from PDF documents, delivering unparalleled efficiency, precision, and scalability. By automating the entire workflow from data identification to post-processing, manual intervention is significantly reduced, empowering the team to allocate their time and resources towards higher-value tasks. 

Leveraging state-of-the-art machine learning models and algorithms, the final product ensures the precise extraction and processing of tabular data, minimizing errors, upholding data integrity with utmost accuracy, and offers unparalleled business improvements, such as:

  1. 3x Faster Document Processing: Automating data collection and processing allowed the client’s staff to handle three times more documents within the original timeframe. 
  2. 5x Cost Reduction: Increased operational speed enabled the business to process more documents, resulting in improved efficiency with the same manpower.
  3. Enhanced Productivity and Scalability: With the capability to handle an increased number of financial reports, the company has experienced improved productivity and scalability. 

In the future, we are planning to scale the solution to process a larger volume of financial reports from multiple insurers and contract types that have yielded significant improvements across key metrics.