In today’s data-driven world, extracting valuable insights from a multitude of PDF documents is a common challenge. Fortunately, with the power of Python and AI, you can automate the process of summarizing PDFs using ChatGPT. In this blog, we’ll walk you through the steps to achieve this task efficiently.
How can I use ChatGPT to create a summary of a PDF document?
Please make sure to install the following dependencies: Flask, azure-cognitiveservices-vision-computervision, PyMuPDF, langchain, and openai version 0.28.1.
Step 1: Uploading PDFs via Flask
We begin by setting up a Python application using Flask to create an API for PDF upload. Users can conveniently send their PDF documents through this interface, making the process user-friendly.
from flask import Flask, request

app = Flask(__name__)

@app.route('/upload_pdf', methods=['POST'])
def main():
    file = request.files['pdf']
    return convert_pdf_to_jpg(file, file.filename)

if __name__ == '__main__':
    app.run(debug=True)
This Flask app exposes a ‘/upload_pdf’ route for POST requests. It accepts PDF uploads and hands each one to the convert_pdf_to_jpg function, which converts its pages to JPG images.
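Before handing the upload off, it is worth rejecting anything that is not a PDF and stripping any directory components from the client-supplied filename. A minimal standard-library sketch (the helpers is_pdf_upload and safe_name are our own additions, not part of the app above):

```python
import os

def is_pdf_upload(filename: str) -> bool:
    # Accept only files whose extension is .pdf (case-insensitive)
    return os.path.splitext(filename)[1].lower() == ".pdf"

def safe_name(filename: str) -> str:
    # Drop any directory components a client might smuggle into the name
    return os.path.basename(filename.replace("\\", "/"))
```

In production you would typically use werkzeug's secure_filename for this instead.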
Step 2: Converting PDF Pages to JPG
To work with the content of PDFs, we utilize the fitz library (PyMuPDF) to render each page of the PDF as a JPG image, a format the OCR service in the next step can process.
import os
import fitz  # PyMuPDF

def convert_pdf_to_jpg(pdf_file, name):
    name_without_extension = os.path.splitext(name)[0]
    # Open the uploaded PDF from its in-memory stream
    pdf_document = fitz.open(stream=pdf_file.read(), filetype="pdf")
    pdf_file.seek(0)  # rewind so the upload can still be saved later
    # Create a directory to save the images
    output_dir = "Extraction_images/"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    # Loop through the pages and convert to images
    for page_num in range(pdf_document.page_count):
        page = pdf_document.load_page(page_num)
        # Render at 300 DPI; adjust the resolution as needed
        image = page.get_pixmap(matrix=fitz.Matrix(300 / 72, 300 / 72))
        image.save(f"{output_dir}{name_without_extension}_{page_num + 1}.jpg")
    # Close the PDF document
    pdf_document.close()
    return Extract_text_from_jpg(pdf_file)
The code converts a PDF into JPG images, creating an output directory for them. It loops through the PDF pages, renders each one at the chosen resolution, and saves it as an image. Finally, it calls Extract_text_from_jpg(pdf_file) to start the OCR step.
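The matrix passed to get_pixmap controls the render resolution: PDF pages are laid out at 72 points per inch, so a zoom factor of 300/72 renders at roughly 300 DPI. A small sketch of the arithmetic (the helper names are ours, for illustration only):

```python
def zoom_for_dpi(dpi: int, base_dpi: int = 72) -> float:
    # PDF coordinates use 72 points per inch, so the zoom factor
    # passed to fitz.Matrix is simply target_dpi / 72.
    return dpi / base_dpi

def pixel_size(page_width_pt: float, page_height_pt: float, dpi: int) -> tuple:
    # Resulting image dimensions in pixels for a page of the given size
    zoom = zoom_for_dpi(dpi)
    return round(page_width_pt * zoom), round(page_height_pt * zoom)
```

For a US Letter page (612 x 792 points) at 300 DPI this gives a 2550 x 3300 pixel image, which is comfortably sharp for OCR.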
Step 3: Optical Character Recognition (OCR)
With our PDF pages in image format, we employ Azure OCR Cognitive Services to extract text from each JPG file. This text is then compiled and organized into a single text file.
import glob
import os
import time

from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import OperationStatusCodes
from msrest.authentication import CognitiveServicesCredentials
from natsort import natsorted

def Azure_Client():
    subscription_key = ""
    endpoint = ""
    return ComputerVisionClient(
        endpoint, CognitiveServicesCredentials(subscription_key)
    )

def txt_to_file(file_path, string_to_write):
    try:
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(string_to_write)
        print("OCR Completed")
    except IOError:
        print("An error occurred while writing to the file.")

def ocr_single_file(computervision_client, image_path):
    with open(image_path, "rb") as file:
        read_response = computervision_client.read_in_stream(file, raw=True)
    # Get the operation location (URL with an ID at the end) from the response
    read_operation_location = read_response.headers["Operation-Location"]
    # Grab the ID from the URL
    operation_id = read_operation_location.split("/")[-1]
    # Call the "GET" API and wait for it to retrieve the results
    while True:
        read_result = computervision_client.get_read_result(operation_id)
        if read_result.status not in ["notStarted", "running"]:
            break
        time.sleep(6)
    text = ""
    # Collect the detected text, line by line
    if read_result.status == OperationStatusCodes.succeeded:
        for text_result in read_result.analyze_result.read_results:
            for line in text_result.lines:
                text += "\n" + line.text
            text += "\n\n"
    return text

def create_folder(image_path):
    # Extract the file name from the path
    file_name = os.path.basename(image_path)
    output_dir = os.path.splitext(file_name)[0]
    # Create an output folder for this page's OCR text
    output_dir = "Extraction_text/" + output_dir
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    output_file = output_dir + "/ocr.txt"
    return output_file

def Extract_text_from_jpg(file):
    file.save('./' + file.filename)
    folder_path = "./Extraction_images"
    # List all JPG files in the folder, in natural page order
    jpg_files = natsorted(glob.glob(os.path.join(folder_path, "*.jpg")))
    computervision_client = Azure_Client()
    for jpg_file in jpg_files:
        extracted_text = ocr_single_file(computervision_client, jpg_file)
        output_file = create_folder(jpg_file)
        txt_to_file(output_file, extracted_text)
The code defines five functions:
- Azure_Client: sets up the Azure Cognitive Services client using the subscription key and endpoint.
- txt_to_file: writes the extracted text to the file given by file_path.
- ocr_single_file: performs OCR on a single JPG image. It submits the image to Azure Cognitive Services, waits for the read operation to complete, and returns the extracted text.
- create_folder: creates an output folder based on the image file’s name to store the OCR results in a text file.
- Extract_text_from_jpg: handles the uploaded file and OCR conversion. It saves the upload, processes each JPG image rendered from the PDF in page order, extracts the text with Azure Cognitive Services, and saves it to a corresponding output folder as a text file.
Overall, the code takes the page images produced from the PDF, extracts the text from each, and stores it in individual text files in output folders.
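A plain lexicographic sort would place page_10.jpg before page_2.jpg, scrambling the document's page order, which is why the code above sorts with natsorted. A standard-library sketch of the same idea (natural_key is our illustrative helper, not the natsort API):

```python
import re

def natural_key(name):
    # Split "doc_10.jpg" into ["doc_", 10, ".jpg"] so the embedded
    # numbers compare numerically rather than character by character.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]

pages = ["doc_10.jpg", "doc_2.jpg", "doc_1.jpg"]
ordered = sorted(pages, key=natural_key)
# → ['doc_1.jpg', 'doc_2.jpg', 'doc_10.jpg']
```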
Step 4: Text Chunking with langchain
To make the text more manageable, we use the langchain library’s RecursiveCharacterTextSplitter. This lets us divide the text into smaller, more digestible chunks. The chunk_size and separators parameters let you customize the splitting process to suit your needs.
import glob
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter

def text_spliting():
    folder_path = "./"
    # Get the text file
    latest_text_path = glob.glob(os.path.join(folder_path, "*.txt"))
    with open(latest_text_path[0], "r", encoding="utf-8") as t:
        prompt_text = t.read()
    prompt_text = prompt_text.replace('\r', '')
    chunk_size = 10000
    separators = ['\n\n\n']
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=0,
        separators=separators
    )
    docs = text_splitter.split_text(prompt_text)
    print(len(docs), "after chunk")
    with open('./chunk_output.txt', 'a', encoding="utf-8") as f:
        for chunk in docs:
            f.write(chunk)
            f.write("\n\n\n\n\n")
            f.write("Next chunk")
    return {"chunk_doc": "chunk_output.txt"}
The code utilizes the Langchain library for text splitting. It reads a text file, divides it into smaller chunks based on specified separators, and then saves the resulting chunks in a separate file. The code returns information about the processed chunks in a dictionary.
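langchain’s RecursiveCharacterTextSplitter measures chunks in tokens (via tiktoken) and falls back through its separator list when a piece is too large. As a rough mental model, here is a character-based sketch (split_text is our simplified stand-in, not the library API):

```python
def split_text(text, chunk_size, separator="\n\n\n"):
    # Split on the separator, then greedily pack pieces into chunks
    # no longer than chunk_size characters. The real splitter counts
    # tokens and recurses through multiple separators.
    pieces = text.split(separator)
    chunks, current = [], ""
    for piece in pieces:
        if not current:
            current = piece
        elif len(current) + len(separator) + len(piece) <= chunk_size:
            current = current + separator + piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

A piece larger than chunk_size is emitted whole here; the library version would keep splitting it on finer separators.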
Step 5: Summarization with ChatGPT
As ChatGPT processes each text chunk, it generates corresponding summaries. These summaries are collected and assembled into a final text file. This consolidated document provides a concise yet comprehensive overview of the original PDF content.
import os
import openai
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.getenv('OPENAI_API_KEY')

def get_completion(prompt, model="gpt-3.5-turbo"):
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,  # the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

def summarize_prompt():
    res = text_spliting()
    # Read the chunk file back and split it on the marker written between chunks
    with open(res['chunk_doc'], 'r', encoding="utf-8") as f:
        chunks = [c.strip() for c in f.read().split("Next chunk") if c.strip()]
    with open('./summarize_output.txt', 'a', encoding="utf-8") as summarize_file:
        for text in chunks:
            prompt = f"""
            Your task is to generate a short summary for the "Example" domain.
            Summarize the review below, delimited by triple backticks.
            Produce results that encompass both concise summaries
            and bullet-pointed insights.
            Review: ```{text}```
            """
            summary = get_completion(prompt=prompt, model="gpt-3.5-turbo")
            summarize_file.write(summary)
            summarize_file.write("\n\n\n")
The code uses the OpenAI API to generate summaries for the text chunks. It loads the OpenAI API key from a local .env file, defines a get_completion function to retrieve chat completions, and a summarize_prompt function that splits the text, generates a summary for each chunk, and writes the summaries to an output file. The code is designed for summarizing text data related to the “Example” domain.
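One caveat worth checking: gpt-3.5-turbo’s default context window is 4,096 tokens, so a chunk_size of 10,000 tokens can exceed it; a smaller chunk_size or the 16k variant of the model may be needed. A rough standard-library check, using the common ≈4-characters-per-token heuristic (the helper names and the reserved-reply budget are our assumptions):

```python
def approx_token_count(text: str) -> int:
    # Rough rule of thumb: ~4 characters per token for English text.
    # For exact counts, use tiktoken, as langchain's splitter does.
    return max(1, len(text) // 4)

MAX_CONTEXT = 4096  # gpt-3.5-turbo's default context window, in tokens

def fits_in_context(chunk: str, reserved_for_reply: int = 512) -> bool:
    # Leave room for the model's answer as well as the prompt itself
    return approx_token_count(chunk) + reserved_for_reply <= MAX_CONTEXT
```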
Step 6: Delivering the Summarized Text
The final text file, containing all the summarized information, is ready to be delivered to the client. This step ensures that the extracted insights are readily accessible and easy to understand.
By following these steps, you can streamline the process of extracting valuable information from PDF documents using Python and ChatGPT via the OpenAI API. This automated approach not only saves time but also ensures accuracy and consistency in your summarization tasks.
Conclusion:
With Python, Flask, Azure OCR, ChatGPT, and helpful libraries like langchain, you can transform PDFs into concise, actionable insights. By automating the summarization process, you save time and enhance your document-handling efficiency. Embrace the power of AI and take your PDF summarization to the next level.
Thank you for reading our blog! We hope you found it helpful. We’d love to hear your feedback. Please feel free to share your thoughts and suggestions on how we can improve or any other topics you’d like us to cover in the future. Your input is valuable to us.