Chat With User PDF Files Using ChatGPT

Welcome to BtechCodeCracker. In this project we combine the power of ChatGPT from OpenAI with the capabilities of LangChain to enable interactive conversations with, and information extraction from, PDF files. You can ask questions, seek explanations, and obtain valuable insights from your PDF documents. Let's explore how ChatGPT and LangChain work together to enhance your PDF file interactions.

Follow these steps to execute the code on Google Colab:

Installation of Required Packages:

!pip install langchain

!pip install openai

!pip install PyPDF2

!pip install faiss-cpu

!pip install tiktoken

Paste Code In Cell -> Click on Run

The code begins by installing several Python packages using the !pip install command. These packages include langchain, openai, PyPDF2, faiss-cpu, and tiktoken. These commands ensure that the required dependencies are installed in the environment.
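Before moving on, it can help to confirm that the installs succeeded. The following is a minimal sketch, not part of the original notebook: the helper name missing_modules is hypothetical, and it relies on the fact that the faiss-cpu package is imported under the name faiss.

```python
from importlib.util import find_spec

# Import names for the packages installed above. The faiss-cpu
# distribution is imported as "faiss", not "faiss-cpu".
REQUIRED_MODULES = ["langchain", "openai", "PyPDF2", "faiss", "tiktoken"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Still missing:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, re-run the matching !pip install cell before continuing.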

Importing Required Modules:

from PyPDF2 import PdfReader

from langchain.embeddings.openai import OpenAIEmbeddings

from langchain.text_splitter import CharacterTextSplitter

from langchain.vectorstores import FAISS

import os

from langchain.chains.question_answering import load_qa_chain

from langchain.llms import OpenAI

Paste Code In New Cell -> Click on Run

The necessary modules are imported using import statements. These modules include PdfReader from PyPDF2, various modules from the langchain library, and os.

Setting OpenAI API Key:

os.environ["OPENAI_API_KEY"] = "Your ChatGPT API Key"

Paste Code In New Cell -> Click on Run

The OpenAI API key is set as an environment variable using os.environ["OPENAI_API_KEY"] = "Your ChatGPT API Key". You should replace "Your ChatGPT API Key" with your actual OpenAI API key.
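If you would rather not leave the key visible in the notebook, one option is to prompt for it at run time. This is only a sketch under that assumption: set_openai_key is a hypothetical helper, not something provided by openai or langchain.

```python
import os


def set_openai_key(key: str) -> None:
    """Store the key in the environment after a basic sanity check."""
    key = key.strip()
    if not key or key == "Your ChatGPT API Key":
        raise ValueError("Replace the placeholder with your real OpenAI API key")
    os.environ["OPENAI_API_KEY"] = key


# Example use in a Colab cell (prompts without echoing the key):
#   from getpass import getpass
#   set_openai_key(getpass("OpenAI API key: "))
```

The check only catches an empty value or the unchanged placeholder; it does not verify that the key is valid.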

To find your API key, open this link -> https://platform.openai.com/account/api-keys

Log in to your account -> API Keys -> Create new secret key -> give it any name and click Create -> copy the key and click Done.

Loading the PDF File:

pdfreader = PdfReader('New12.pdf')

Paste Code In New Cell -> Click on Run

Make sure you upload your PDF file to Colab and replace "New12.pdf" with the name of your file.


Reading Text from PDF

# read text from pdf

raw_text = ''

for page in pdfreader.pages:

    content = page.extract_text()

    if content:

        raw_text += content

Paste Code In New Cell -> Click on Run

The code reads text from a PDF file named 'New12.pdf' using the PdfReader class from PyPDF2. It extracts the text from each page of the PDF and concatenates it into the raw_text string variable.
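The if content: check matters because a page may yield no text at all, for example a scanned page that contains only an image. The sketch below shows the same loop running without a real PDF; FakePage is a hypothetical stand-in for a PyPDF2 page and collect_text is an illustrative helper name.

```python
def collect_text(pages):
    """Concatenate the extracted text of every page that yields any."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:  # skips None and empty strings
            parts.append(content)
    return "".join(parts)


class FakePage:
    """Hypothetical stand-in for a PyPDF2 page, for illustration only."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("First page. "), FakePage(""), FakePage(None), FakePage("Third page.")]
print(collect_text(pages))
```

With a real file, the same function could be called as collect_text(pdfreader.pages).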

raw_text

Splitting Text into Chunks

text_splitter = CharacterTextSplitter(

    separator = "\n",

    chunk_size = 800,

    chunk_overlap  = 200,

    length_function = len,

)

texts = text_splitter.split_text(raw_text)

Paste Code In New Cell -> Click on Run

The raw_text is split into smaller chunks using the CharacterTextSplitter class from langchain. The splitter is configured with a separator ("\n"), chunk size (800 characters), chunk overlap (200 characters), and a length function (len). The resulting chunks are stored in the texts list variable.
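To see what chunk size and chunk overlap mean in practice, here is a simplified pure-Python sketch of the idea. It is not langchain's implementation: split_with_overlap is a hypothetical function that only approximates how CharacterTextSplitter merges pieces, and the small sizes are chosen so the result is easy to read.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Split on the separator, then merge pieces into overlapping chunks."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep only trailing pieces that fit in the overlap budget
            # and still leave room for the incoming piece.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


sample = "\n".join(f"line {i:03d}" for i in range(50))
chunks = split_with_overlap(sample, chunk_size=40, chunk_overlap=10)
print(chunks[0])
print("---")
print(chunks[1])
```

Each chunk begins with the tail of the previous one, so a sentence that straddles a chunk boundary is still seen whole in at least one chunk.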

Generating Embeddings

embeddings = OpenAIEmbeddings()

Paste Code In New Cell -> Click on Run

The OpenAIEmbeddings class from langchain wraps OpenAI's embedding model. Creating the object does not compute anything yet; the embeddings for the text chunks are computed when the search index is built in the next step.

Building Document Search Index

document_search = FAISS.from_texts(texts, embeddings)

chain = load_qa_chain(OpenAI(), chain_type="stuff")

Paste Code In New Cell -> Click on Run

The FAISS class from langchain is used to build a document search index from the text chunks and their corresponding embeddings. This index enables efficient similarity searches based on the provided query.
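Conceptually, a similarity search compares the embedding of the query with the embedding of every chunk and returns the closest ones. The sketch below does this by brute force with Euclidean distance, which is assumed here as the metric. The three-dimensional vectors and sample sentences are made up for illustration (real embeddings have many more dimensions), and nearest_chunks is a hypothetical helper, not a FAISS or langchain call.

```python
import math


def nearest_chunks(query_vector, indexed, k=2):
    """Return the k texts whose vectors are closest to query_vector."""
    ranked = sorted(indexed, key=lambda item: math.dist(query_vector, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy (text, vector) pairs standing in for embedded chunks.
toy_index = [
    ("We accept credit cards and PayPal.", (0.9, 0.1, 0.0)),
    ("Shipping takes three to five days.", (0.1, 0.9, 0.1)),
    ("Refunds are issued to the original payment method.", (0.7, 0.2, 0.3)),
]

query_vector = (0.8, 0.1, 0.1)  # pretend embedding of a payment question
print(nearest_chunks(query_vector, toy_index, k=2))
```

FAISS performs the same kind of lookup, but with data structures designed to stay fast when there are many vectors.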

Loading Question-Answering Chain

query = "Is there any payment options in Ecommerce?"

docs = document_search.similarity_search(query)

chain.run(input_documents=docs, question=query)

Paste Code In New Cell -> Click on Run

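The load_qa_chain function from langchain builds a question-answering chain around the language model passed to it, here the OpenAI() wrapper. It is not a separately trained model. With chain_type="stuff", the retrieved chunks are placed together into a single prompt along with the question, and the OpenAI model generates the answer from that prompt.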

Running the Question-Answering Chain

A query, "Is there any payment options in Ecommerce?" is defined.

The document search index is used to find relevant documents based on the query.

The loaded question-answering chain is then applied to the retrieved documents, providing the query as input.

The chain processes the input and attempts to generate relevant answers.
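The "stuff" approach can be pictured as plain string assembly: the retrieved chunks are joined into one context block and the question is appended. The sketch below illustrates that idea only. The prompt wording is invented for this example and is not langchain's actual template, and build_stuff_prompt is a hypothetical helper.

```python
def build_stuff_prompt(chunks, question):
    """Place all retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "We accept credit cards and PayPal.",
    "Refunds are issued to the original payment method.",
]
prompt = build_stuff_prompt(retrieved, "Is there any payment options in Ecommerce?")
print(prompt)
```

Because everything goes into one prompt, this approach works best when the retrieved chunks are few and short enough to fit within the model's context limit.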

Overall, the code extracts text from a PDF, splits it into smaller chunks, generates embeddings for the chunks, builds a document search index, and uses a question-answering chain to answer a given query from the most relevant chunks.

For more details on running the code, refer to the Google Colab notebook Chatgpt-PdfReader.ipynb.

If you encounter any issues or have further questions, feel free to ask.