PyPDF2: The Ultimate Python Library for PDF Manipulation

Name: Naomi Clarkson

Published on 8/17/2023

PyPDF2 is a powerful, free, and open-source library designed for manipulating PDFs in Python. It's a versatile tool that allows you to split, merge, crop, transform, encrypt, and decrypt PDF files with ease. PyPDF2 supports PDF versions 1.4 to 1.7 and requires no external dependencies other than the Python standard library, making it an accessible and convenient choice for Python developers working with PDFs.

Beyond page manipulation, PyPDF2 can add passwords to PDFs to help keep their contents confidential, and it can retrieve text and metadata from them. In this article, we will delve into the capabilities of PyPDF2, providing detailed explanations, definitions, and examples to help you get the most out of this library.


What is PyPDF2?

PyPDF2 is a pure-Python library capable of splitting, merging, cropping, and transforming the pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. PyPDF2 can retrieve text and metadata from PDFs as well, making it a comprehensive tool for PDF manipulation.

The library is open-source, meaning it's freely available for anyone to use, modify, and distribute. This makes it a popular choice among developers who need to work with PDFs in Python. PyPDF2 is also platform-independent, so you can use it regardless of whether you're working on a Windows, Mac, or Linux machine.

Installation and Usage of PyPDF2

Installing PyPDF2 is straightforward and can be done using pip, the package installer for Python. PyPDF2 requires Python 3.6 or higher to run. Here's how you can install PyPDF2 using pip:

pip install PyPDF2

You can also use pip to install the latest development version directly from the project's GitHub repository. Here's how:

pip install git+https://github.com/py-pdf/PyPDF2.git

Once installed, you can import the PyPDF2 library into your Python script like so:

import PyPDF2

To check the version of PyPDF2 you're using, you can use the __version__ attribute:

PyPDF2.__version__
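The version number matters in practice, because PyPDF2 3.0 removed the older camelCase names such as PdfFileReader in favor of PdfReader, PdfWriter, and PdfMerger. Here's a minimal sketch of one way to check which naming scheme a given release expects. The helper names parse_version and legacy_names_removed are illustrative only and are not part of PyPDF2:

```python
from itertools import takewhile


def parse_version(version):
    """Convert a version string such as '3.0.1' into a tuple of ints."""
    numbers = []
    for piece in version.split("."):
        # Keep only the leading digits, so a piece like '1rc1' becomes 1.
        leading_digits = "".join(takewhile(str.isdigit, piece))
        numbers.append(int(leading_digits) if leading_digits else 0)
    return tuple(numbers)


def legacy_names_removed(version):
    """Return True for releases (3.0 and later) that dropped PdfFileReader and friends."""
    return parse_version(version) >= (3,)
```

With PyPDF2 imported, calling legacy_names_removed(PyPDF2.__version__) tells you whether to use the newer class names.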

Working with PDFs using PyPDF2

Once you've installed PyPDF2, you can start working with PDFs. Let's go through some common operations you might need to perform.

Reading a PDF

To read a PDF, first open the file in read-binary mode ('rb'), then create a PdfReader object. (Releases before PyPDF2 3.0 called this class PdfFileReader; that name was removed in 3.0.)

inputFile = "path_to_your_pdf_file.pdf"
pdf = open(inputFile, "rb")
pdf_reader = PyPDF2.PdfReader(pdf)

You can check the number of pages in the PDF by taking the length of the reader's pages attribute:

totalPages = len(pdf_reader.pages)
print(totalPages)

Extracting Text from a PDF

To extract text from a PDF, you can use the extract_text() method of the PageObject class (named extractText() in releases before PyPDF2 3.0). First, you need to get a PageObject representing a specific page in the PDF:

page = pdf_reader.pages[0]  ## Get the first page

Then, you can extract the text from this page:

print(page.extract_text())

This will print the text content of the first page of the PDF to the console. Note that extract_text() may not always work perfectly, depending on the complexity of the PDF and the encoding of its text.
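Extracting one page at a time gets repetitive for longer documents. Here's a minimal sketch that collects the text of every page, assuming the PyPDF2 3.x names (PdfReader, pages, extract_text). The helper names extract_all_text and read_pdf_text are made up for illustration:

```python
def extract_all_text(reader):
    """Join the text of every page of a reader-like object, one line break between pages."""
    page_texts = []
    for page in reader.pages:
        # Fall back to an empty string if a page yields no text at all.
        page_texts.append(page.extract_text() or "")
    return "\n".join(page_texts)


def read_pdf_text(path):
    """Open the PDF at `path` and return the text of all its pages."""
    import PyPDF2  # imported lazily so extract_all_text can be used without it

    with open(path, "rb") as pdf_file:
        return extract_all_text(PyPDF2.PdfReader(pdf_file))
```

Calling read_pdf_text("path_to_your_pdf_file.pdf") returns a single string. Pages that contain only scanned images will typically contribute little or no text.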

Splitting PDF Pages

One of the powerful features of PyPDF2 is the ability to split PDF pages. This can be done by picking pages from the pages attribute of a PdfReader object, which is indexed by page number starting at 0, and adding them to a PdfWriter object. (Before PyPDF2 3.0 these were PdfFileReader, getPage(), PdfFileWriter, and addPage().) Here's an example of how to split the first page from a PDF:

## Open the PDF
with open('path_to_your_pdf_file.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    writer = PyPDF2.PdfWriter()
 
    ## Get the first page
    first_page = reader.pages[0]
 
    ## Add the page to the PdfWriter object
    writer.add_page(first_page)
 
    ## Write the page to a new file
    with open('output.pdf', 'wb') as output_pdf:
        writer.write(output_pdf)

In this example, output.pdf will be a new PDF file containing only the first page of the original PDF.
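The same idea extends to splitting a whole document into several smaller files. Here's a rough sketch, assuming the PyPDF2 3.x names (PdfReader, PdfWriter, add_page). The helper names chunk_page_ranges and split_pdf, and the output naming scheme, are hypothetical choices for this example:

```python
def chunk_page_ranges(total_pages, chunk_size):
    """Return (start, stop) page-index pairs that cover total_pages in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(path, chunk_size, prefix="part"):
    """Write each chunk of pages to its own file and return the file names."""
    import PyPDF2  # imported lazily; assumes the PyPDF2 3.x class names

    output_names = []
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        ranges = chunk_page_ranges(len(reader.pages), chunk_size)
        for number, (start, stop) in enumerate(ranges, start=1):
            writer = PyPDF2.PdfWriter()
            for index in range(start, stop):
                writer.add_page(reader.pages[index])
            output_name = f"{prefix}_{number}.pdf"
            with open(output_name, "wb") as output_pdf:
                writer.write(output_pdf)
            output_names.append(output_name)
    return output_names
```

Keeping the range arithmetic in its own function makes it easy to check without touching any files. The stop index is exclusive, so the last chunk is simply shorter when the page count does not divide evenly.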

Merging PDFs

PyPDF2 also allows you to merge multiple PDFs into one. This can be done using the PdfMerger class (named PdfFileMerger before PyPDF2 3.0). Here's an example:

merger = PyPDF2.PdfMerger()
 
## List of PDFs to merge
pdfs = ['file1.pdf', 'file2.pdf', 'file3.pdf']
 
for pdf in pdfs:
    merger.append(pdf)
 
merger.write("merged.pdf")
merger.close()

In this example, merged.pdf will be a new PDF file that contains all the pages from file1.pdf, file2.pdf, and file3.pdf, in that order.
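When the list of input files comes from user input or a directory scan, it can help to confirm that every file exists before starting the merge. Here's a small sketch of that check, assuming the PyPDF2 3.x class name PdfMerger. The helper names find_missing_files and merge_pdfs are illustrative and not part of PyPDF2:

```python
import os


def find_missing_files(paths):
    """Return the paths that do not point to an existing file, in input order."""
    return [path for path in paths if not os.path.isfile(path)]


def merge_pdfs(paths, output_path):
    """Merge the PDFs in `paths` into `output_path`, in the order given."""
    missing = find_missing_files(paths)
    if missing:
        raise FileNotFoundError(f"Cannot merge, missing files: {missing}")

    import PyPDF2  # imported lazily; assumes the PyPDF2 3.x class names

    merger = PyPDF2.PdfMerger()
    try:
        for path in paths:
            merger.append(path)
        merger.write(output_path)
    finally:
        # Release the merger's resources even if a file fails to load.
        merger.close()
```

Checking up front means a typo in the third file name is reported before any work is done, rather than partway through the merge.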

Adding Passwords to PDFs

PyPDF2 provides a simple way to add passwords to your PDF files for added security. This can be done using the encrypt() method of the PdfWriter object (PdfFileWriter before PyPDF2 3.0). Here's an example:

## Open the PDF
with open('path_to_your_pdf_file.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    writer = PyPDF2.PdfWriter()
 
    ## Copy all pages from the original PDF to the new one
    for page in reader.pages:
        writer.add_page(page)
 
    ## Encrypt the new PDF
    writer.encrypt('your_password')
 
    ## Write the encrypted PDF to a new file
    with open('encrypted.pdf', 'wb') as output_pdf:
        writer.write(output_pdf)

In this example, encrypted.pdf will be a new PDF file that is a copy of the original PDF, but encrypted with the password 'your_password'.
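A hard-coded password like the one above is fine for a demo, but for real files you will probably want something less guessable. Here's a minimal sketch that generates a random password with Python's standard secrets module and passes it to encrypt(), assuming the PyPDF2 3.x names. The helper names generate_password and encrypt_pdf are hypothetical:

```python
import secrets
import string


def generate_password(length=20):
    """Return a random password of ASCII letters and digits."""
    if length < 1:
        raise ValueError("length must be at least 1")
    alphabet = string.ascii_letters + string.digits
    # secrets.choice draws from a cryptographically strong random source.
    return "".join(secrets.choice(alphabet) for _ in range(length))


def encrypt_pdf(source_path, output_path, password):
    """Copy every page of source_path into a password-protected output_path."""
    import PyPDF2  # imported lazily; assumes the PyPDF2 3.x class names

    with open(source_path, "rb") as source_pdf:
        reader = PyPDF2.PdfReader(source_pdf)
        writer = PyPDF2.PdfWriter()
        for page in reader.pages:
            writer.add_page(page)
        writer.encrypt(password)
        with open(output_path, "wb") as output_pdf:
            writer.write(output_pdf)
```

Remember to store the generated password somewhere safe, since the encrypted file cannot be opened without it.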

Converting PDFs to Images

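PyPDF2 doesn't directly support converting PDFs to images, but a separate library such as pdf2image can handle this step alongside it. Note that pdf2image must be installed separately and relies on the Poppler utilities being available on your system. Here's an example: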

from pdf2image import convert_from_path
 
## Convert the PDF to a list of images
images = convert_from_path('path_to_your_pdf_file.pdf')
 
## Save the images to files
for i, image in enumerate(images):
    image.save(f'output{i}.png', 'PNG')

In this example, each page of the PDF is converted to a PNG image and saved to a separate file.

FAQs

What versions of PDF does PyPDF2 support?

PyPDF2 supports PDF versions 1.4 to 1.7. This covers a wide range of PDF files, making PyPDF2 a versatile choice for PDF manipulation in Python.

Does PyPDF2 have any dependencies?

No, PyPDF2 does not have any dependencies other than the Python standard library. This makes it easy to install and use on any system that has Python installed.

What Python version is required to run PyPDF2?

PyPDF2 requires Python 3.6 or higher to run. This ensures compatibility with modern Python features and improves the overall performance and security of the library.
