Entity Matching For A Financial Consultancy
Challenges and Industry
A financial consulting agency reached out to GFAIVE while working on a project that required matching financial entities across a large document set (over 25,000 documents of 200 pages each). The agency had realized that it did not have enough manpower to complete the project on time while reaching the required 90% accuracy.
To speed up development and improve the accuracy of the results, the client asked for two data scientists to be provided as soon as possible. Their task was to automate the extraction of data fields (entities) relating to municipal bonds from a range of document types associated with the issuance of municipal debt, and to apply various business rules to the extracted values.
Once the GFAIVE team received the request for the data scientists and their required skills, it promptly selected four candidates deemed qualified and forwarded their CVs to the client. The client chose to interview and test three of the four candidates and picked those considered best suited. Once the selection was made, GFAIVE helped onboard the data scientists. The entire process, from receiving the request to the data scientists starting their work, took 5 working days.
In terms of project management, the GFAIVE team followed an agile development process, working in sprints and delivering commitments in stages. The client was updated via weekly phone calls and group Slack messages.
For the project, the first-line methods were a rule-based approach and the spaCy NER library. The second-line methods were neural networks: long short-term memory (LSTM) networks, a type of recurrent neural network (RNN), and convolutional neural networks (CNNs).
The main steps included:
1. Conversion of source document data to text from image using optical character recognition (OCR).
2. Named entity recognition (NER) tagging, or “marking”: a process in which the “correct answers” were marked in the source documents used for training, to ease training of the algorithms.
3. Training of first line models.
4. Assessment of the accuracy of first line models.
5. Determination of fields for which first line model accuracy was insufficient.
6. Development of models for these fields using secondary model methodologies and mixtures of models.
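Steps 3 to 5 can be illustrated with a minimal Python sketch. It is not the project's actual code: the field names, regular expressions, sample texts and the 0.90 threshold are hypothetical, and plain regular expressions stand in for the rule-based first line. The idea is to score each field separately and flag the ones that would be handed over to second-line models.

```python
import re

# Hypothetical first-line rules: one regular expression per field.
RULES = {
    "cusip": re.compile(r"\b[0-9A-Z]{9}\b"),
    "maturity_date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "coupon_rate": re.compile(r"\b\d+(?:\.\d+)?%"),
}


def extract_fields(text):
    """Return the first match of each rule, or None when a rule finds nothing."""
    result = {}
    for field, pattern in RULES.items():
        match = pattern.search(text)
        result[field] = match.group(0) if match else None
    return result


def field_accuracy(documents):
    """Share of documents in which each field was extracted exactly right.

    `documents` is a list of (text, expected_fields) pairs, where the
    expected fields come from the manually marked training documents.
    """
    correct = {field: 0 for field in RULES}
    for text, expected in documents:
        predicted = extract_fields(text)
        for field in RULES:
            if predicted[field] == expected[field]:
                correct[field] += 1
    return {field: correct[field] / len(documents) for field in RULES}


def fields_needing_second_line(accuracy, threshold=0.90):
    """Fields whose first-line accuracy falls below the required threshold."""
    return sorted(field for field, value in accuracy.items() if value < threshold)


# Made-up sample data, for illustration only.
documents = [
    ("CUSIP 123456AB7 matures 06/01/2030 with a coupon of 4.25%",
     {"cusip": "123456AB7", "maturity_date": "06/01/2030", "coupon_rate": "4.25%"}),
    ("Bond 987654XY1, due 12/15/2041, interest rate five percent",
     {"cusip": "987654XY1", "maturity_date": "12/15/2041", "coupon_rate": "5%"}),
]

accuracy = field_accuracy(documents)
print(accuracy)
print(fields_needing_second_line(accuracy))
```

In this toy data the coupon rate written out in words defeats the regular expression, so that field alone is flagged. This is the kind of case where a learned model might do better than fixed rules.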
The quality metrics (recall, precision, F score) for the final results were above 90%, which satisfied the client’s requirements. Moreover, by gaining access to qualified data scientists quickly, the client was able to finish the project twice as fast and at a lower cost than if they had searched for and hired in-house team members.
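For readers unfamiliar with these metrics, the short Python sketch below shows how precision, recall and F score are commonly computed for extracted entities. It assumes the F1 variant of the F score and exact matching of (document, field, value) triples. The case study does not state which variant or matching rule was used, and the sample values are made up.

```python
def precision_recall_f1(predicted, expected):
    """Compute precision, recall and F1 over sets of extracted entities.

    Each entity is a hashable item, here a (document, field, value) tuple.
    An extraction counts as correct only if it matches an expected entity exactly.
    """
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example: 4 extracted entities, 5 expected, 3 of them in common.
predicted = {
    ("doc1", "cusip", "123456AB7"),
    ("doc1", "coupon_rate", "4.25%"),
    ("doc2", "cusip", "987654XY1"),
    ("doc2", "coupon_rate", "5.5%"),   # wrong value
}
expected = {
    ("doc1", "cusip", "123456AB7"),
    ("doc1", "coupon_rate", "4.25%"),
    ("doc2", "cusip", "987654XY1"),
    ("doc2", "coupon_rate", "5%"),
    ("doc2", "maturity_date", "12/15/2041"),  # missed entirely
}

precision, recall, f1 = precision_recall_f1(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision penalizes wrong extractions, recall penalizes missed ones, and the F score balances the two, which is why all three are usually reported together.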
During the feedback stage, the client informed GFAIVE that what they appreciated most was the provision of qualified candidates who were experienced in working remotely and remained professional throughout the project.