Text extraction is one of the first and most important steps in any analysis of a document. To get the proper context out of a document, the text needs to be extracted properly. If the extracted data is not in the proper order, we lose the format and the context of the data present in the PDF.
Extracting data from a normal PDF is very easy, but what if the input document is a resume? How will you extract the text when a resume contains multiple instances of text on the same line? We need some kind of intelligence to extract the content in a proper way.
There are many tools and libraries available to extract the data present in a PDF. In this post, we will see how we can use Tesseract, an open-source OCR library, to extract the data present in a PDF. We will take one example document, run it through Tesseract, and explore the pros and cons in a detailed way.
Let's first look at the example.
* I have converted the PDF document to an image, since Tesseract is an OCR engine that works well on image data.
From this image, it is easy to extract data, but extracting the data in a meaningful way is difficult from this unstructured format, as it contains multiple instances of text that belong to different groups.
We need to extract data like contact information, education, skills, and more as separate sections. Let's see how we can use Tesseract to extract data in this format.
To get the code for this, visit: https://sesame.optisolbusiness.com/
The output looks like this
To achieve this, we have used a text detection model, EAST, which can be run through OpenCV, with custom-tuned parameters to work on this template.
EAST Text detection
EAST (An Efficient and Accurate Scene Text Detector) is a deep learning model based on a novel architecture and training pattern. Link to the paper; from the paper we can understand the model architecture in a better way. To download the model file, click here.
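The EAST model is usually run through OpenCV's dnn module. Below is only a minimal sketch of loading the downloaded model and reading its two output layers (the file name, image path, and input size are assumptions); in this post, the same idea is wrapped, with custom-tuned parameters, inside the textseg module used later.
import cv2
# Load the downloaded EAST model (file name assumed)
net = cv2.dnn.readNet('frozen_east_text_detection.pb')
image = cv2.imread('resume_page.jpg')
# EAST expects input dimensions that are multiples of 32
blob = cv2.dnn.blobFromImage(image, 1.0, (1280, 1280),
                             (123.68, 116.78, 103.94),
                             swapRB=True, crop=False)
net.setInput(blob)
# The Sigmoid layer gives text confidence scores, concat_3 gives box geometry
scores, geometry = net.forward(['feature_fusion/Conv_7/Sigmoid',
                                'feature_fusion/concat_3'])
# These maps are then decoded and filtered (e.g. with non-maximum
# suppression) to obtain the final text boxes.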
Let's see how to achieve this using Python.
First, let's see which libraries we need to install to get the result:
- PyPDF2
- pdf2image
- OpenCV (opencv-python)
- Pandas
- NumPy
- glob (standard library)
- pytesseract
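These can be installed with pip, for example pip install PyPDF2 pdf2image opencv-python pandas numpy pytesseract. Note that pdf2image also needs the Poppler utilities and pytesseract needs the Tesseract engine installed on the system.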
The process flow will be like this:
- Read the PDF
- Convert the PDF to images; each page is saved as an individual image
- Run the custom-tuned EAST text detection model, which saves each instance of text as a separate image
- Read the images saved in the earlier step one by one and pass them to Tesseract
- With the help of Tesseract, we can extract the text present in each image (a small helper for this step is sketched below)
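One note before the code: the snippet below calls a helper named text_from_tesseract, which is not defined in the snippet itself. A minimal sketch of such a helper, assuming it simply wraps pytesseract.image_to_string and returns the extracted lines, could look like this:
import pytesseract

def text_from_tesseract(image):
    # Run Tesseract OCR on the image (a numpy array read with OpenCV)
    # and return the recognised text as a list of lines
    text = pytesseract.image_to_string(image)
    return text.splitlines()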
# Necessary Imports
import textseg as ts
from PyPDF2 import PdfFileReader
from pdf2image import convert_from_path
import cv2
import json
import pandas as pd
import glob
import os

data = pd.DataFrame()
final_name_list = []
final_text_opencv = []
final_text_tessaract = []

# Path of all Resume files (the folder paths below are assumed)
resumes = glob.glob('resumes/*.pdf')
path_to_write = 'output/'

for pdf_path in resumes:
    pdf = PdfFileReader(open(pdf_path, 'rb'))
    # Get the file name of each PDF file
    fname = pdf_path.split('/')[-1]
    # Check how many pages each PDF contains (PyPDF2 1.x/2.x API)
    print(pdf.getNumPages())
    # Convert the PDF to a list of page images
    images = convert_from_path(pdf_path)
    resumes_img = []
    # Append all page images to a list so we can pass them through the model
    for j in range(len(images)):
        # Save each page of the PDF as an image
        images[j].save(path_to_write + fname.split('.')[0] + '_' + str(j) + '.jpg', 'JPEG')
        resumes_img.append(path_to_write + fname.split('.')[0] + '_' + str(j) + '.jpg')
    name_list = fname.split('.')[0] + '_' + '.jpg'
    text_opencv = []
    text_tessaract = []
    for img_path in resumes_img:
        # Read the image using OpenCV
        frame = cv2.imread(img_path)
        os.remove(img_path)
        img = os.path.basename(img_path)
        # Pass the image to the model to get the text instances present in the image
        output_img, label, dilate, c_dict, df1, split_img = ts.get_text_seg(frame, img)
        cv2.imwrite(path_to_write + img.split('.')[0] + '.png', output_img)
        for k in range(len(split_img)):
            # This loop saves each instance of text as an individual image
            cv2.imwrite(path_to_write + img.split('.')[0] + str(k) + '.png', split_img[k])
        text_opencv.append(c_dict)
        text_tessaract += text_from_tesseract(output_img)
    tesseract_str = ''.join(text_tessaract)
    final_name_list.append(name_list)
    final_text_opencv.append(text_opencv)
    final_text_tessaract.append(tesseract_str)

# We select index 0 since we passed a single one-page PDF as input
print(final_text_opencv[0])
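The snippet above creates an empty DataFrame but never fills it; one possible way to collect the results per resume (the column names below are assumptions, not part of the original code) is:
# Collect the per-resume results into the DataFrame created earlier
data = pd.DataFrame({
    'file_name': final_name_list,
    'text_opencv': final_text_opencv,
    'text_tesseract': final_text_tessaract,
})
data.to_csv('extracted_resumes.csv', index=False)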
Pros
- It is free to use since it is available as an open-source library
- We can change the parameters based on our needs
Cons
- The major problem with this approach is that there is a chance of spelling mistakes in the result
- The parameters are tuned to fit one template of data; to fit another template, we need to change them again
Thanks for reading the post. Feel free to reach out to us for further queries. In the next post, we will explore other libraries to extract text in a proper format.