Introduction
AI and machine learning tooling, such as libraries, pre-trained models, open source datasets and data annotation platforms, has advanced so much in recent years that useful systems automating business processes can now be built with relatively little time and effort. The fact that large companies expose the results of their research and development to the public, as OpenAI does with GPT-3, even if only as a paid API, accelerates the creation of end-user products even further. Such APIs are not cheap, but considering the cost of researching and developing such models from scratch, the price is more than justified. For example, a single from-scratch training run of GPT-3 alone is estimated at around 12 million dollars of computation time.
So if one needs the impressive results GPT-3 can deliver, they should be ready to pay for existing APIs.
The choice, however, is not black and white: it is not just very expensive but excellent quality versus free but poor quality. It is also possible to achieve results that are good enough to solve real business problems at limited expense, by combining open source solutions with an investment in creating proprietary datasets and by being more flexible with implementation timelines.
Now let’s discuss a practical example of a project from our experience with one of our customers.
We were approached by a client whose platform receives thousands of candidates' resumes every day, with a request to automate some tasks using machine learning (ML) methods. One of the tasks was to automatically determine the seniority level of a person's position (job title). Since several thousand resumes pass through the client's platform every day, and each resume contains 2–3 positions on average, the total volume of job titles that needs to be analysed daily is large. The use of automation tools is therefore fully justified, since solving such a problem manually would be very time-consuming and expensive.
Below are the steps we took to solve the problem.
Primary data analysis and the Annotation Guideline creation
We extracted 50,000 job titles from the client's database (we selected this number based on our past experience, the client's budget and timeline, and the task's complexity). This is what the raw data looked like:
- Software Engineer at Google
- President @ Amazon
- VP of Engineering at Facebook
The client’s requirements indicated the need to divide job titles into the following categories:
- C-Suite
- VP
- Director
- Manager
- Other
Then an analyst manually went through the sample to select typical examples for each category and, more importantly, boundary examples.
The next step was the development of the “Annotation Guideline” document. Such a document is an important part of the data annotation process: it increases the quality and consistency of the dataset, which is essential for building accurate models.
A typical Annotation Guideline structure looks like this:
- Description of the task and subject area
- Description of the semantics of each class
- Typical and “boundary” examples for each class
- Description of how to work in the selected data annotation platform
Data Annotation in the Doccano system
The next step was the preparation of a system for collaborative annotation of text data called Doccano. The Doccano system allows you to annotate data for a variety of ML tasks, including:
- Text classification
- Token-classification (NER)
- Sequence to Sequence (for example, “Text to SQL” task)
Some of the advantages of Doccano are:
- Open source code (which makes it possible to modify it with a little effort if necessary and deploy it on-premise)
- Collaborative data annotation
- Multi-language support
- Simple and user-friendly UI
- Mobile-friendliness
- Ability to deploy a two-tier annotation process (annotation → validation).
- Ease of deployment
For simplicity, we deployed Doccano using Docker Compose, backed by an AWS RDS database with automatic backups every couple of hours.
We also uploaded our Annotation Guideline to Doccano and created accounts for all members of the annotation team.
In total, 50,000 job titles were annotated by a team of 11 people. The work took 2 weeks, at an average pace of about 40 seconds per job title.
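As a quick back-of-the-envelope check of these numbers (a rough sketch; the working schedule assumed below is only illustrative):

# Rough throughput estimate for the annotation effort (illustrative only)
titles = 50_000
seconds_per_title = 40
annotators = 11

total_person_hours = titles * seconds_per_title / 3600  # ~556 person-hours in total
hours_per_annotator = total_person_hours / annotators    # ~50 hours per annotator
print(total_person_hours, hours_per_annotator)           # roughly 5 hours a day over 2 working weeks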
In order to get started with Doccano, check their documentation.
Building a machine learning model with FastText
Our next step was the development of the ML model. For rapid prototyping, we used Facebook's FastText framework. With its automatic hyperparameter tuning mechanism, you can quickly build strong, high-quality baselines for text classification problems with very little effort.
To train a text classification model using FastText, you simply need to follow the steps described below:
1. Divide data into train, test, valid sets
from sklearn.model_selection import train_test_split

# Hold out 15% of the data as a test set, stratified by category
train, test = train_test_split(data, test_size=0.15,
                               stratify=data['category'].values)
# Split the remaining data again to get a validation set
train, valid = train_test_split(train, test_size=0.15,
                                stratify=train['category'].values)
2. Save the train and valid sets to disk in FastText format. The format looks like this:
Data Scientist @ Medium __label__Other
VP of Sales at Microsoft __label__VP
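For illustration, here is a minimal sketch of how such files could be written, assuming the annotated data lives in pandas DataFrames with the title and category columns used above (the helper name and file names are just illustrative):

# Minimal sketch: write each row as "<text> __label__<category>", one example per line
def save_fasttext_format(df, path):
    with open(path, 'w', encoding='utf-8') as f:
        for title, category in zip(df['title'], df['category']):
            f.write(f'{title} __label__{category}\n')

save_fasttext_format(train, 'job_level.train')
save_fasttext_format(valid, 'job_level.valid')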
3. Train a FastText model on train and valid sets using hyperparameter auto-tuning mechanism:
import fasttext

# Auto-tune hyperparameters on the validation set, constraining the final model size to ~2 MB
model = fasttext.train_supervised(input='job_level.train',
                                  autotuneValidationFile='job_level.valid',
                                  autotuneModelSize='2M')
4. Assess the quality of the resulting model by calculating metrics using the test part of the data:
import numpy as np
from sklearn.metrics import classification_report

labels, probs = model.predict(test['title'].values.tolist())
# Strip the '__label__' prefix so predictions match the raw category names
y_pred = np.array([label[0].replace('__label__', '') for label in labels])
print(classification_report(test['category'].values, y_pred))
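If the model needs to be reused later, it can also be persisted with FastText's save_model; the file name below is just an assumption:

# Persist the trained model to disk for later reuse (file name is illustrative)
model.save_model('job_level_classifier.ftz')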
Model deployment using BentoML
The logical next step is the “deployment” of the trained model to the customer’s environment. We used the open source BentoML framework.
BentoML supports many ML frameworks, including, but not limited to:
- Scikit-Learn
- PyTorch
- Tensorflow 2
- FastText
- H2O
- Spacy
- Transformers
For the complete list, check the frameworks section of their documentation.
BentoML also makes it possible to go from a trained model to a deployable, scalable service in very little time, and it supports deployment to a wide range of environments.
For more information about the deployment environments supported by BentoML, check their deployment guides.
The process for packaging and deploying a FastText model as a Docker container is the following:
1. Implement an inference service. The simplest example is shown below:
from bentoml import env, artifacts, BentoService, api
from bentoml.frameworks.fasttext import FasttextModelArtifact
from bentoml.adapters import JsonInput

@env(infer_pip_packages=True)
@artifacts([FasttextModelArtifact('model')])
class FasttextClassification(BentoService):

    @api(input=JsonInput(), batch=True)
    def predict(self, json_list):
        texts = [i['text'] for i in json_list]
        result = self.artifacts.model.predict(texts)
        # return top result
        prediction_result = [i[0].replace('__label__', '') for i in result[0]]
        return prediction_result
2. Save the BentoService that contains the model:
from text_classification import FasttextClassification

svc = FasttextClassification()
svc.pack('model', model)
saved_path = svc.save()
3. Containerize the model server with Docker:
bentoml containerize FasttextClassification:latest
Once packaging is finished, we have a Docker image with our model wrapped in a REST API, which we can run locally or push to a Docker registry for further deployment to ECS, Kubernetes, etc.
To test the API, you can run it locally and check out the Swagger docs:
bentoml serve {%saved_path%}
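Once the server is running, you can also send a request to the prediction endpoint directly. Below is a minimal sketch using the requests library, assuming BentoML's default local port (5000) and the /predict route derived from the API name above:

import requests

# Call the locally running model server (port and route are assumptions based on BentoML defaults)
response = requests.post(
    'http://127.0.0.1:5000/predict',
    json={'text': 'VP of Engineering at Facebook'},
)
print(response.json())  # should print the predicted seniority level, e.g. "VP"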
Conclusion
You don’t need to have a huge budget or be a big company to be able to develop ML models that satisfy your business needs. The tools described above proved once again to be very effective in solving real business problems.
Shapeion’s team wishes you happy holidays. We believe 2022 will bring even more amazing technologies, and we will be happy to help you use them for solving your problems. More articles to come!
About Shapeion
Shapeion is a team of experienced AI and software development specialists focused on solving the problems companies face by providing high-quality software, AI and business development services. Our clients are diverse, from startups that want to develop an MVP to corporations that want to improve their products. Got a problem? Let’s talk!
Feel free to contact us via info@shapeion.com