12 October 2021
Before you start
Note: Before using this script, you need to export your Google credentials.
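A quick pre-flight check can catch a missing credentials export before any upload is attempted. The sketch below assumes that "export Google credentials" means setting the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of a JSON key file, which is the usual convention for Google Cloud client libraries; this guide does not say which mechanism script.py actually relies on, and check_credentials is a hypothetical helper name.

```python
import json
import os


def check_credentials(env=None):
    """Return None if the credentials look usable, else a short problem message.

    Assumption: credentials are exported through GOOGLE_APPLICATION_CREDENTIALS
    as a path to a JSON key file. script.py may use a different mechanism.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "GOOGLE_APPLICATION_CREDENTIALS is not set"
    if not os.path.isfile(path):
        return f"credentials file not found: {path}"
    try:
        with open(path, encoding="utf-8") as handle:
            json.load(handle)  # only checks that the file parses as JSON
    except ValueError:
        return f"credentials file is not valid JSON: {path}"
    return None
```

Calling check_credentials() with no arguments inspects the current environment; a non-None result describes what to fix before running the upload commands.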
Note: The dataset files should be split in the following ratio:
Training Dataset (80%)
Test Dataset (10%)
Validation Dataset (10%)
For example, if you have 100 files, 80 should belong to the training dataset, 10 to the validation dataset, and 10 to the test dataset.
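The 80/10/10 split can be done by hand, but a small helper makes it repeatable. The following is a minimal sketch, not part of script.py: split_dataset and copy_split are hypothetical names, the shuffle seed is arbitrary, and the mapping to dir1, dir2 and dir3 follows the directory names used in the annotation steps of this guide.

```python
import random
import shutil
from pathlib import Path

# Directory names follow this guide: dir1 = train, dir2 = validation, dir3 = test.
SPLIT_DIRS = {"train": "dir1", "validation": "dir2", "test": "dir3"}


def split_dataset(files, train_pct=80, validation_pct=10, seed=0):
    """Shuffle the files reproducibly and split them 80/10/10 by default.

    Integer percentages avoid floating-point rounding surprises; whatever is
    left after the train and validation slices goes to the test set.
    """
    files = sorted(files)
    random.Random(seed).shuffle(files)
    n_train = len(files) * train_pct // 100
    n_validation = len(files) * validation_pct // 100
    return {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_validation],
        "test": files[n_train + n_validation:],
    }


def copy_split(splits, split_dirs=SPLIT_DIRS):
    """Copy each file into the directory that belongs to its split."""
    for split, files in splits.items():
        target = Path(split_dirs[split])
        target.mkdir(parents=True, exist_ok=True)
        for file in files:
            shutil.copy2(file, target / Path(file).name)
```

For instance, split_dataset(sorted(Path("all_resumes").glob("*.pdf"))) followed by copy_split(...) would fill the three directories, where all_resumes is a placeholder for wherever the source files live.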
Manual Annotation Process
Step 1: Create a directory, say dir1, and add all the resumes that you want to use for the training dataset.
Step 2: Create two more directories, dir2 and dir3, for the validation and test datasets respectively.
Step 3: Install the Google Cloud SDK and authenticate with your email ID. Once the installation is complete, run the commands below one by one to upload the training, validation, and test datasets to Google Cloud Storage.
Train:
python2 script.py -t gs://match_making/tenant1/hr/documents/train train,dir1/*.pdf
Validation:
python2 script.py -t gs://match_making/tenant1/hr/documents/validation validation,dir2/*.pdf
Test:
python2 script.py -t gs://match_making/tenant1/hr/documents/test test,dir3/*.pdf
Step 4: To verify that the upload completed successfully, navigate to the path that you gave in the script and check for your uploaded resumes/JDs in Google Cloud Storage.
Step 5: After the upload is done, import the data by creating a new dataset in AutoML, or use an existing dataset.
Step 6: Navigate to the dataset.csv file in Google Cloud Storage, select the file, and click Import.
Step 7: After you click Import, wait 5-10 minutes for all the resumes/JDs to be imported into the AutoML platform.
Step 8: Once the import is done, you will see all the PDF files and can start the annotation process.
Step 9: After completing the annotation, start training, which usually takes about 3 hours.
Step 10: Once training is finished, you can test the model.
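The three upload commands in Step 3 differ only in the split name and the local directory, so they can be generated rather than typed. The sketch below only builds the argument lists and does not run anything; upload_commands is a hypothetical helper, and the flags and argument order are copied from the commands in Step 3 without knowing how script.py interprets them.

```python
import shlex

# Split name -> local directory, as used in Steps 1 and 2.
SPLIT_DIRS = {"train": "dir1", "validation": "dir2", "test": "dir3"}


def upload_commands(base_uri, split_dirs=SPLIT_DIRS, pattern="*.pdf"):
    """Build one upload command (as an argument list) per dataset split.

    The layout mirrors the Step 3 commands:
        python2 script.py -t <base_uri>/<split> <split>,<dir>/<pattern>
    """
    commands = []
    for split, directory in split_dirs.items():
        commands.append([
            "python2", "script.py",
            "-t", f"{base_uri}/{split}",
            f"{split},{directory}/{pattern}",
        ])
    return commands


def format_commands(commands):
    """Render the commands as shell-quoted lines for review before running."""
    return [shlex.join(command) for command in commands]
```

Printing format_commands(upload_commands("gs://match_making/tenant1/hr/documents")) gives three lines to review; note that shlex.join quotes the glob pattern, so it reaches the script unexpanded.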
Auto Annotation Process
Step 1: To upload TXT files and use the auto-annotation feature, you first need to convert each PDF into a TXT file.
Step 2: Keep these files in separate directories, as you did for the PDF files, and use the commands below to upload them to Google Cloud Storage.
Note: The dict.csv file contains all the labels that you want to auto-annotate.
Train:
python2 script.py -d dict.csv -s train,dir1/*.txt gs://match_making/tenant1/hr/documents/train
python2 script.py -d dict_banking.csv -s train,dir1/*.txt gs://match_making/tenant1/banking/documents/train
Validation:
python2 script.py -d dict.csv -s validation,dir2/*.txt gs://match_making/tenant1/hr/documents/validate
python2 script.py -d dict_banking.csv -s validation,dir2/*.txt gs://match_making/tenant1/banking/documents/validate
Test:
python2 script.py -d dict.csv -s test,dir3/*.txt gs://match_making/tenant1/hr/documents/test
python2 script.py -d dict_banking.csv -s test,dir3/*.txt gs://match_making/tenant1/banking/documents/test
Step 3: To verify that the upload completed successfully, navigate to the path that you gave in the script and check for your uploaded resumes/JDs in Google Cloud Storage.
Step 4: After the upload is done, import the data by creating a new dataset in AutoML, or use an existing dataset.
Step 5: Navigate to the dataset.csv file in Google Cloud Storage, select the file, and click Import.
Step 6: After you click Import, wait 5-10 minutes for all the resumes/JDs to be imported into the AutoML platform.
Step 7: Once the import is done, you will see all the documents and can start the annotation process.
Step 8: After completing the annotation, start training, which usually takes about 3 hours.
Step 9: Once training is finished, you can test the model.
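To give a feel for what dictionary-based auto annotation involves, here is a minimal sketch that looks up dictionary terms in a TXT document and records where each one occurs. It is an illustration only, not the logic of script.py: the two-column "term,label" layout of dict.csv is an assumption, load_dictionary and annotate are hypothetical names, and the output record is shaped like the text_content/annotations JSONL used for AutoML entity extraction imports, which should be checked against the current AutoML documentation (including whether end_offset is meant to be exclusive).

```python
import csv
import io
import json
import re


def load_dictionary(csv_text):
    """Parse rows of 'term,label' (assumed layout) into (term, label) pairs."""
    pairs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2 and row[0].strip():
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs


def annotate(text, dictionary):
    """Find every whole-word, case-insensitive occurrence of each term."""
    annotations = []
    for term, label in dictionary:
        # Lookarounds instead of \b so terms ending in symbols still match.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append({
                "text_extraction": {
                    "text_segment": {
                        "start_offset": match.start(),
                        "end_offset": match.end(),  # Python's exclusive end
                    }
                },
                "display_name": label,
            })
    annotations.sort(
        key=lambda a: a["text_extraction"]["text_segment"]["start_offset"]
    )
    return {"text_content": text, "annotations": annotations}


def to_jsonl_line(record):
    """Serialize one annotated document as a single JSONL line."""
    return json.dumps(record, ensure_ascii=False)
```

A dictionary approach like this only labels exact term matches, so the annotations are best treated as a starting point to review in the AutoML annotation UI before training.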