Developing a Web Crawler Application for One of the Premier U.S. Asset Management Companies

# AWS infrastructure # Data Analytics # Fintech # Web development

Highlights

  • Automating Core Business Lines: created an application for financial data crawling and retrieval from multiple websites.
  • Empowering Complex Business Logic: developed sophisticated AI/ML models (Doc2Vec and Word2Vec) that filter out irrelevant data and correctly handle plans whose reservation clauses alter their execution logic.
  • 10x Data Storage Cost Reduction: optimized the solution’s data storage architecture, cutting the client’s storage costs by a factor of 10.

Client

Our client is one of the premier U.S. asset management companies. It employs more than 12,000 people, operates in 32 countries, and manages an investment portfolio in excess of $200 billion. In 2018, the company’s revenue exceeded $1 billion.

As one of their core activities, our client provides investors with company information, forecasts, and related proprietary analytics to enable them to evaluate the investment potential of various businesses.


Product

A Web crawler application that uses sets of keywords to search for health savings plan data across a large number of websites. The application retrieves this data in the form of text excerpts and makes it available for further analysis. The Web crawler also classifies the retrieved data by category and prioritizes it by relevance.

To gather only relevant data, the Web crawler uses an advanced AI/ML model.
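The case study does not disclose how the excerpt-retrieval step is implemented; as a rough illustration only, it can be thought of as scanning a page’s text for the configured keywords and keeping the surrounding context. A minimal sketch, in which the class name, keyword, and context-window size are illustrative assumptions rather than the client’s actual setup:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only: scan page text for keywords and keep surrounding context.
public class ExcerptExtractor {

    // Number of characters of context kept around each keyword hit (assumed value).
    private static final int CONTEXT_CHARS = 200;

    public static List<String> extractExcerpts(String pageText, List<String> keywords) {
        List<String> excerpts = new ArrayList<>();
        String lowerText = pageText.toLowerCase(Locale.ROOT);
        for (String keyword : keywords) {
            String needle = keyword.toLowerCase(Locale.ROOT);
            int from = 0;
            int idx;
            while ((idx = lowerText.indexOf(needle, from)) >= 0) {
                int start = Math.max(0, idx - CONTEXT_CHARS);
                int end = Math.min(pageText.length(), idx + needle.length() + CONTEXT_CHARS);
                excerpts.add(pageText.substring(start, end).trim());
                from = idx + needle.length();
            }
        }
        return excerpts;
    }

    public static void main(String[] args) {
        String page = "Eligible employees may open a health savings account alongside "
                + "the high-deductible plan; employer contributions vest immediately.";
        extractExcerpts(page, List.of("health savings")).forEach(System.out::println);
    }
}
```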

Goals and objectives

  • Automate Data Collection Process:

Deliver an AI/ML-powered solution that automates the highly time- and labor-intensive process of manually collecting relevant financial data from hundreds of websites.

  • Performance Optimization:

Integrate the solution into the client’s ecosystem and make it as efficient as possible to gain maximum business value.

Project challenge

  1. Crawling Difficulties:
    Deal with the fact that a significant number of the websites to be crawled had poor structure and design. Additionally, many of the target health savings plans included reservation clauses, stated in the body text of the plan, that altered how the plan applied.
  2. Retrieving Relevant Data:
    Solve the challenge of the large amount of irrelevant information returned by multi-keyword Web crawling, which made it difficult to use the relevant data that had been retrieved. The search results had to be limited to relevant health savings plan data only.
  3. Massive Amount of Information:
    Work out a way to prioritize an overwhelming amount of collected data to be analyzed by the client’s employees every week.
  4. Overcoming Several Technical Hurdles:
    Solve data crawling challenges such as IP detection, CAPTCHAs, and more.

Solution

 

Our project team proposed using AI/ML to solve the main project challenges.

To deal with irrelevant information that contained the target keywords, we built sophisticated AI/ML models (Doc2Vec and Word2Vec). These models analyze the wider context around a keyword, making it possible to cut off most of the irrelevant data. They also allowed us to correctly process health savings plans with reservation clauses (for example, ones that have to do with discounts) that alter the plans’ execution logic.
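The actual Doc2Vec/Word2Vec models are proprietary, but the filtering idea can be sketched as comparing an excerpt’s embedding against a centroid built from known-relevant excerpts. In the sketch below, `inferVector` is a toy hashed bag-of-words stand-in for a trained Doc2Vec model, and the threshold is an assumed value, not the client’s:

```java
// Sketch of context-based relevance filtering: keep an excerpt only if its document
// embedding is close enough to a centroid built from known-relevant excerpts.
// inferVector() is a toy stand-in; a real system would call the trained Doc2Vec model.
public class RelevanceFilter {

    private static final double RELEVANCE_THRESHOLD = 0.65; // assumed cutoff

    private final double[] relevantCentroid; // mean vector of hand-labeled relevant excerpts

    public RelevanceFilter(double[] relevantCentroid) {
        this.relevantCentroid = relevantCentroid;
    }

    public boolean isRelevant(String excerpt) {
        return cosineSimilarity(inferVector(excerpt), relevantCentroid) >= RELEVANCE_THRESHOLD;
    }

    // Toy stand-in: hashed bag-of-words vector instead of a trained Doc2Vec inference step.
    static double[] inferVector(String text) {
        double[] v = new double[64];
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                v[Math.floorMod(token.hashCode(), v.length)] += 1.0;
            }
        }
        return v;
    }

    static double cosineSimilarity(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB) + 1e-12);
    }

    public static void main(String[] args) {
        RelevanceFilter filter = new RelevanceFilter(
                inferVector("health savings account contribution limits for the plan year"));
        System.out.println(filter.isRelevant("annual health savings plan contribution limits"));
        System.out.println(filter.isRelevant("quarterly earnings call transcript"));
    }
}
```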

Our project team also employed AI/ML algorithms to reduce the amount of data the client’s employees need to process each week. The application clearly shows the sites where no data has changed since the previous crawl, and it prioritizes all retrieved results, indicating a relevance percentage for each. While reviewing the results, the client’s employees can skip those whose relevance they consider too low.
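The exact change-detection and ranking logic is not public; one plausible way to express it is to compare each site’s current content hash with the hash stored after the previous weekly crawl and then sort the remaining results by relevance score. The `CrawlResult` record and the hash-storage map below are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;

// Sketch: drop sites whose content hash is unchanged since the previous weekly crawl,
// then sort the remaining results by descending relevance score.
public class ResultPrioritizer {

    // Hypothetical result shape: site URL, extracted excerpt, and a 0..1 relevance score.
    record CrawlResult(String siteUrl, String excerpt, double relevance) {}

    static String contentHash(String content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(content.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }

    // previousHashes maps site URL -> content hash stored after the previous crawl (assumed storage).
    static List<CrawlResult> prioritize(List<CrawlResult> results,
                                        Map<String, String> previousHashes) throws Exception {
        List<CrawlResult> changed = new ArrayList<>();
        for (CrawlResult result : results) {
            String hash = contentHash(result.excerpt());
            if (!hash.equals(previousHashes.get(result.siteUrl()))) {
                changed.add(result); // content changed since the last crawl, keep it for review
            }
        }
        changed.sort(Comparator.comparingDouble(CrawlResult::relevance).reversed());
        return changed;
    }
}
```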

To overcome the hurdles that prevent site crawling, we applied various techniques, including proxy servers, multiple IP addresses, custom user agents, and slower request rates.
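The project’s specific proxies, user-agent strings, and pacing are not disclosed; the sketch below only shows the general pattern using Java’s built-in HttpClient, with a placeholder proxy address, user agent, and delay:

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch: fetch pages through a proxy, with a browser-like User-Agent and a delay
// between requests so the crawler is less likely to trip anti-bot protections.
public class PoliteFetcher {

    // Placeholder proxy endpoint and user agent; real values would be rotated.
    private static final InetSocketAddress PROXY = new InetSocketAddress("proxy.example.com", 8080);
    private static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36";
    private static final Duration DELAY_BETWEEN_REQUESTS = Duration.ofSeconds(5); // assumed pace

    private final HttpClient client = HttpClient.newBuilder()
            .proxy(ProxySelector.of(PROXY))
            .connectTimeout(Duration.ofSeconds(15))
            .build();

    public String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        Thread.sleep(DELAY_BETWEEN_REQUESTS.toMillis()); // throttle between requests
        return response.body();
    }
}
```

In practice, several proxy endpoints would be rotated so that requests arrive from multiple IP addresses, which is what the “multiple IPs” technique above refers to.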

In addition, to optimize system performance and cut costs, SPD Technology’s project team proactively optimized the solution’s client-provided data storage architecture, which resulted in very significant cost savings for the client. Our backend developers also implemented the front end of the Web crawler application, creating further savings.

Tech Stack

Web
  • Java
  • Vue.js
  • Spring Boot
Infrastructure
  • AWS

Our results

We have successfully developed a cutting-edge solution for our client that leverages modern technologies and improves business operations.

  1. Powerful Tool for Data Crawling: developed an application that automates one of our client’s core business lines and collects data from 1,500+ websites.
  2. Freeing Up the Hands of 30 Employees: implemented a solution that allows the company to perform a mission-critical task time- and cost-efficiently.
  3. 10x Data Storage Cost Reduction: found optimal solutions to the project challenges and reduced the client’s data storage costs by a factor of 10.