

I hope this article is fruitful to you. ‘Keep Learning, Keep Coding’.

There might be PDF files in which lines are separated by ‘\n’, so you can use this as the parameter for the ‘split()’ function. Below is our Python program to read the PDF file line by line:

# Importing required modules
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

As you can see, the content of each page is shown in the console.
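The whole approach can be sketched roughly as follows. This is only a minimal sketch, not the article's original program: the function names `extract_page_texts`, `lines_from_pages` and `print_pdf_lines` are hypothetical, and it assumes a PyPDF2 release older than 3.0 (for example, installed with `pip install "PyPDF2<3.0"`), where `PdfFileReader`, `numPages`, `getPage()` and `extractText()` are still available.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of every page as a list of strings.

    Assumption: PyPDF2 older than 3.0, which still provides
    PdfFileReader, numPages, getPage() and extractText().
    """
    import PyPDF2  # imported here so lines_from_pages() works without PyPDF2

    # The file must stay open while the pages are being read.
    with open(pdf_path, "rb") as pdf_file_obj:
        pdf_reader = PyPDF2.PdfFileReader(pdf_file_obj)
        return [
            pdf_reader.getPage(page_number).extractText()
            for page_number in range(pdf_reader.numPages)
        ]


def lines_from_pages(page_texts, separator="  "):
    """Yield (page_number, line) pairs, splitting each page on `separator`."""
    for page_number, text in enumerate(page_texts, start=1):
        for piece in text.split(separator):
            line = piece.strip()
            if line:  # skip the empty pieces left by repeated separators
                yield page_number, line


def print_pdf_lines(pdf_path, separator="  "):
    """Print every line of the PDF, prefixed with its page number."""
    pages = extract_page_texts(pdf_path)
    for page_number, line in lines_from_pages(pages, separator):
        print(f"page {page_number}: {line}")
```

Calling `print_pdf_lines("mypdf.pdf")` would then print the lines page by page (the file name is an assumption based on the sample file mentioned in this article). For a PDF whose lines are separated by newlines, pass `separator="\n"` instead.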

We may need to work with PDF files to perform various Natural Language Processing tasks or for any other purpose. By default, Python does not come with any built-in library that can help us read and write PDF files. Therefore, we need to use an external library known as ‘PyPDF’ (a newer package, PyPDF4, also exists, but we will be using PyPDF2). PyPDF is a completely independent library. That means it runs on every Python platform without depending on any other external library. PyPDF is capable of extracting document information, splitting documents, merging documents, cropping pages, encrypting and decrypting PDFs, etc.

Reading PDF File Line by Line

Before we get into the code, one important thing to mention is that here we are dealing with text-based PDFs (the PDFs generated using word processing), because image-based PDFs need to be handled with a different library known as ‘pyTesseract’. This doesn’t mean that they can’t be handled with PyPDF, but the disadvantage is that we would need to change the encoding and convert the file into a text-based PDF, which would result in loss of data. Instead, we will cover image-based PDFs in some other article.

So, let’s get started. Our first task is to install the PyPDF library. Now it’s time for the actual code. But one important thing to understand is that there is no direct method in the PyPDF library to read a PDF file line by line. It always reads a page as a whole (using the ‘extractText()’ function), but the good thing to know is that it always returns a string as output. So we need to find some regularity in how each line is separated from the next throughout the PDF document. Here I used a sample PDF file (mypdf) in which each line is separated by a run of blank spaces, so I split the lines (using the ‘split()’ function) with two blank spaces as the parameter.
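Since the extracted text is just a string, the splitting step can be tried out without any PDF at all. The snippet below is a minimal sketch of that idea. The helper name `split_into_lines` and the sample text are hypothetical, made up for illustration only.

```python
def split_into_lines(text, separator="  "):
    """Split extracted text on `separator` and drop empty pieces."""
    pieces = text.split(separator)
    # Longer runs of spaces produce empty pieces, so filter those out.
    return [piece.strip() for piece in pieces if piece.strip()]


# Made-up stand-in for the string returned by extractText().
sample_text = "First line of the page  Second line    Third line"

for line in split_into_lines(sample_text):
    print(line)
```

The separator is a parameter because it depends on the document. Two blank spaces suited the sample file used here, while other PDFs may need `"\n"` or something else.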
You may have gone through various examples of text file handling, in which you wrote text into a file or extracted it from the file as a whole (using the ‘read()’ function) or line by line (using the ‘readline()’ or ‘readlines()’ functions). For this, we do not need to import any external library, as it is built into Python.

But working with PDF files is a bit different.
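As a reminder of the built-in text file handling mentioned above, here is a minimal sketch of the three reading functions. The temporary file and its contents are made up for the example so that it runs on its own.

```python
import os
import tempfile

# Create a small throwaway text file so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

# read(): the whole file as one string.
with open(path, encoding="utf-8") as f:
    whole_text = f.read()

# readline(): one line per call, including its trailing newline.
with open(path, encoding="utf-8") as f:
    first_line = f.readline()

# readlines(): a list containing every line of the file.
with open(path, encoding="utf-8") as f:
    all_lines = f.readlines()

os.remove(path)  # clean up the temporary file

print(repr(whole_text))
print(repr(first_line))
print(all_lines)
```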
