Developing an AI-Enabled Web Crawler Tool for a Premier U.S. Asset Management Company

Highlights

  • Automating Core Business Lines: built an application that crawls and retrieves financial data from multiple websites.
  • Empowering Complex Business Logic: developed sophisticated AI/ML models capable of handling the client’s intricate business logic.
  • 10x Data Storage Cost Reduction and Days-to-Hours Processing Time: optimized the data storage architecture, cutting storage costs tenfold, and reduced the time needed for manual data analysis from 2-3 days to 2-3 hours.

Client

Our client is one of the premier U.S. asset management companies that employs more than 12,000 people, operates in 32 countries, and manages an investment portfolio in excess of $200 billion. In 2018, the company’s revenue exceeded $1 billion.

As one of their core activities, our client provides investors with company information, forecasts, and related proprietary analytics to enable them to evaluate the investment potential of various businesses.

Product

A Web crawler application that uses sets of keywords to search for health savings plan data across a large number of websites. The application retrieves this data in the form of text excerpts and makes it available for further analysis. The Web crawler also classifies the retrieved data by category and prioritizes it by relevance.

To gather only relevant data, the Web crawler uses an advanced AI/ML model.
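
As a rough illustration of the retrieval step described above, the sketch below fetches a page, flattens it to plain text, and cuts short excerpts around each keyword hit. It assumes the jsoup HTML parser, which is not listed in the case study’s tech stack, and the class name, URL, and keywords are purely illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the keyword-driven retrieval step: fetch a page, flatten it
// to plain text, and keep short excerpts around each keyword hit. Names and
// values are illustrative, not taken from the client solution.
public class ExcerptExtractor {

    private static final int EXCERPT_RADIUS = 200; // characters kept on each side of a match

    public static List<String> extractExcerpts(String url, List<String> keywords) throws Exception {
        Document page = Jsoup.connect(url).get();
        String text = page.body().text();
        String lower = text.toLowerCase();

        List<String> excerpts = new ArrayList<>();
        for (String keyword : keywords) {
            int from = 0;
            int hit;
            while ((hit = lower.indexOf(keyword.toLowerCase(), from)) >= 0) {
                int start = Math.max(0, hit - EXCERPT_RADIUS);
                int end = Math.min(text.length(), hit + keyword.length() + EXCERPT_RADIUS);
                excerpts.add(text.substring(start, end));
                from = hit + keyword.length();
            }
        }
        return excerpts;
    }

    public static void main(String[] args) throws Exception {
        List<String> excerpts = extractExcerpts(
                "https://example.com/benefits",                     // placeholder URL
                List.of("health savings", "HSA contribution"));     // placeholder keyword set
        excerpts.forEach(System.out::println);
    }
}
```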

Goals and objectives

  • Automate the Data Collection Process: Deliver an AI/ML-powered solution that automates the highly time- and effort-consuming process of manually collecting relevant financial data from hundreds of websites.
  • Enable Data Extraction: Ensure the web crawling mechanism extracts relevant data for further manual processing and review.
  • Provide Data Point Comparison: Develop a tool that identifies content changes in the collected data.
  • Optimize the Web Crawler’s Performance: Integrate the solution into the client’s ecosystem and make it as efficient as possible to gain maximum business value.
  • Activate Data Streaming: Create a tool that transfers the collected data to the client’s overseas team for further processing and business value extraction.

Project challenge

  1. Crawling Difficulties: deal with the fact that a significant number of the websites to be crawled were poorly structured and contained many broken links.
  2. Retrieving Relevant Data: solve the problem of the large amount of irrelevant information returned by multi-keyword Web crawling, which made it hard to use the relevant data that had been retrieved; the search results had to be limited to relevant health savings plan data only.
  3. Prioritizing Data Entities by Relevance: work out a way to prioritize the overwhelming amount of collected data that the client’s employees had to analyze every week.
  4. Overcoming Several Technical Hurdles: solve data crawling challenges such as IP detection, CAPTCHAs, and more.

Solution

Our project team proposed to use AI/ML to solve the main project challenges.

We started by preparing a data set for model training. To deal with the irrelevant information that contained the target keywords, we built sophisticated AI/ML (Doc2Vec and Word2Vec) models. These models analyze the broader context around a keyword, which makes it possible to filter out most of the irrelevant data. The use of these algorithms has also allowed us to correctly process health savings plans with additional reservations (for example, ones that have to do with discounts) that alter these plans’ execution logic.
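
The case study does not include the model code, but a minimal Doc2Vec sketch of this context-based filtering could look as follows. It assumes the Deeplearning4j NLP library (ParagraphVectors), which is not named in the tech stack; the training file, labels, and example texts are placeholders, and cosine similarity against a reference excerpt merely stands in for whatever relevance scoring the production model actually used.

```java
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors;
import org.deeplearning4j.text.documentiterator.LabelsSource;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.ops.transforms.Transforms;

import java.io.File;

public class RelevanceModelSketch {

    public static void main(String[] args) throws Exception {
        // One crawled excerpt per line; "relevant-excerpts.txt" is a placeholder file name.
        SentenceIterator iterator = new BasicLineIterator(new File("relevant-excerpts.txt"));
        TokenizerFactory tokenizer = new DefaultTokenizerFactory();
        tokenizer.setTokenPreProcessor(new CommonPreprocessor());

        // Doc2Vec (paragraph vectors): each excerpt gets an embedding that captures
        // the context around the keywords, not just the keywords themselves.
        ParagraphVectors doc2vec = new ParagraphVectors.Builder()
                .minWordFrequency(2)
                .layerSize(100)
                .epochs(20)
                .windowSize(5)
                .labelsSource(new LabelsSource("EXCERPT_"))
                .iterate(iterator)
                .tokenizerFactory(tokenizer)
                .build();
        doc2vec.fit();

        // Embed a reference description of a relevant excerpt and a new candidate,
        // then use cosine similarity as a crude relevance score.
        INDArray reference = doc2vec.inferVector(
                "employer sponsored health savings account annual contribution limits and eligibility");
        INDArray candidate = doc2vec.inferVector(
                "the plan offers a discounted premium when the health savings account is funded monthly");

        double relevance = Transforms.cosineSim(reference, candidate);
        System.out.printf("relevance score: %.3f%n", relevance);
    }
}
```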

Our project team also employed AI and ML algorithms to help reduce the amount of data processed by the client’s employees on a weekly basis. In a user-friendly manner, the application shows the sites where no changes in data have taken place since the previous crawl. It also prioritizes all the retrieved results, indicating a relevance percentage for each one. While reviewing the results, the client’s employees can skip those they consider insufficiently relevant. While the custom model was being prepared, we used AWS Comprehend to score the relevance of the crawled data.
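
AWS Comprehend is the scoring component the tech stack does name. Below is a hedged sketch of how key-phrase detection could be turned into an interim relevance percentage using the AWS SDK for Java v2; the confidence-weighted heuristic is illustrative only, since the case study does not spell out the actual scoring formula.

```java
import software.amazon.awssdk.services.comprehend.ComprehendClient;
import software.amazon.awssdk.services.comprehend.model.DetectKeyPhrasesRequest;
import software.amazon.awssdk.services.comprehend.model.DetectKeyPhrasesResponse;
import software.amazon.awssdk.services.comprehend.model.KeyPhrase;

import java.util.List;
import java.util.Locale;

public class ComprehendRelevanceScorer {

    private final ComprehendClient comprehend = ComprehendClient.create();

    // Returns a 0..1 score: the confidence-weighted share of detected key phrases
    // that mention one of the target terms. This is an illustrative heuristic,
    // not the scoring formula used on the project.
    public double score(String excerpt, List<String> targetTerms) {
        DetectKeyPhrasesResponse response = comprehend.detectKeyPhrases(
                DetectKeyPhrasesRequest.builder()
                        .text(excerpt)
                        .languageCode("en")
                        .build());

        double matched = 0;
        double total = 0;
        for (KeyPhrase phrase : response.keyPhrases()) {
            total += phrase.score();
            String text = phrase.text().toLowerCase(Locale.ROOT);
            if (targetTerms.stream().anyMatch(term -> text.contains(term.toLowerCase(Locale.ROOT)))) {
                matched += phrase.score();
            }
        }
        return total == 0 ? 0 : matched / total;
    }

    public static void main(String[] args) {
        ComprehendRelevanceScorer scorer = new ComprehendRelevanceScorer();
        double relevance = scorer.score(
                "Employees may fund a health savings account alongside the high-deductible plan.",
                List.of("health savings", "hsa"));
        System.out.printf("relevance: %.0f%%%n", relevance * 100);
    }
}
```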

To deal with the hurdles that block site crawling, we applied various techniques, including routing requests through a proxy server, using multiple IP addresses, setting a browser-like user agent, and slowing down the request rate.
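
A sketch of those measures combined in a single fetch routine is shown below. It again assumes jsoup; the proxy host, port, delay values, and user-agent strings are placeholders rather than the project’s real configuration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.List;
import java.util.Random;

// Illustrative fetch routine combining the anti-blocking measures mentioned above:
// a proxy, a rotated browser-like user agent, and a pause between requests.
public class PoliteFetcher {

    private static final List<String> USER_AGENTS = List.of(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36");

    private final Random random = new Random();

    public Document fetch(String url) throws Exception {
        Document page = Jsoup.connect(url)
                .proxy("proxy.internal.example", 8080)   // placeholder proxy; rotate proxies for multiple IPs
                .userAgent(USER_AGENTS.get(random.nextInt(USER_AGENTS.size())))
                .timeout(15_000)
                .get();

        // Slow down between requests so the crawler looks less like a bot.
        Thread.sleep(2_000 + random.nextInt(3_000));
        return page;
    }
}
```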

In addition, to optimize system performance and cut costs, SPD Technology’s project team proactively optimized the client-provided data storage architecture of the solution, which resulted in significant cost savings for the client. Our backend developers also implemented the front end of the Web crawler application, creating further savings.

Tech Stack

  • Java
  • Vue.js
  • Spring Boot
  • AWS Comprehend
  • Vuex

Our results

We have successfully developed a cutting-edge solution for our client that leverages modern technologies and improves business operations.

  1. Powerful Tool for Data Crawling: developed an application that automates one of our client’s core business lines and collects data from 500+ websites.
  2. Freeing Up the Hands of 20 Employees: implemented a solution that lets the company perform a mission-critical task in a time- and cost-efficient way. With the mechanism in place, the data analysis tasks can be performed by 2 people instead of 20, allowing the company’s employees to focus on more strategic tasks.
  3. Reducing Data Processing Time from 2-3 Days to 2-3 Hours: thanks to the implemented solution, the manual analysis of 1,000 relevant data points requires 2-3 hours of an employee’s time instead of the 2-3 days it took previously.
  4. 10x Data Storage Cost Reduction: found optimal solutions to the project challenges and reduced the client’s data storage costs by a factor of 10.