We will read it as follows:
The National Transportation Safety Board said a 12-member team....
We definitely do not read it like this:
The National Transpor- House Speaker John tation Safety Board said a Boehner said ...
From this it is clear that text is by its nature a sequential stream of information. To get correct search results, we must preserve this sequence.
OK, as a human being I find it quite easy to recognize the columns in the text. My recognition system has been practicing this for a long time (although I still need practice at applying different image filters directly in my brain). But what about machines? How does software recognize the layout structure of text within a PDF file?
Most PDF documents do not store this layout information. If you look inside the document you can barely distinguish columns, paragraphs, sentences or even words. You can however extract runs of characters and their coordinates within a page.
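To make the idea of "runs of characters and their coordinates" concrete, here is a minimal Python sketch of how glyphs might be grouped into words. It is not PDFKit.NET code: the `Glyph` and `Word` types, the `group_glyphs_into_words` function and the `gap_factor` and `line_tolerance` parameters are all hypothetical names chosen for illustration, and the sketch assumes that each glyph comes with a left edge, a baseline and a width.

```python
from typing import List, NamedTuple


class Glyph(NamedTuple):
    """Hypothetical glyph record: a character plus its position on the page."""
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline of the glyph
    width: float  # horizontal advance of the glyph


class Word(NamedTuple):
    """Hypothetical word record: the text and the position of its first glyph."""
    text: str
    x: float
    y: float


def _make_word(glyphs: List[Glyph]) -> Word:
    first = glyphs[0]
    return Word("".join(g.char for g in glyphs), first.x, first.y)


def group_glyphs_into_words(glyphs: List[Glyph],
                            gap_factor: float = 0.3,
                            line_tolerance: float = 2.0) -> List[Word]:
    """Group glyphs into words using horizontal gaps (an assumed heuristic)."""
    def line_key(g: Glyph) -> int:
        # Glyphs whose baselines fall into the same bucket count as one line.
        return round(g.y / line_tolerance)

    # Drop explicit whitespace glyphs, then order by line and by X position.
    ordered = sorted((g for g in glyphs if not g.char.isspace()),
                     key=lambda g: (line_key(g), g.x))
    words: List[Word] = []
    current: List[Glyph] = []
    for g in ordered:
        if current:
            prev = current[-1]
            gap = g.x - (prev.x + prev.width)
            # Start a new word on a new line or after a wide enough gap.
            if line_key(prev) != line_key(g) or gap > gap_factor * prev.width:
                words.append(_make_word(current))
                current = []
        current.append(g)
    if current:
        words.append(_make_word(current))
    return words
```

The gap threshold is a guess and would need tuning per document; real PDFs with kerning or justified text may require something more robust.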
So how do we obtain information about the text structure? In this article we are going to build word position histograms in order to recognize columns.
Using histograms as a first guess at finding columns
How can we make the initial guess? Let’s take a look at where the words begin. All the words at the left edge of a column have the same X coordinate. Therefore it should be interesting to build a histogram that shows the frequency of word starts along the horizontal axis.
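As a rough illustration of such a histogram, the following Python sketch counts how many words start in each horizontal bin. It assumes the X coordinates of the word starts have already been extracted; the function name `word_start_histogram` and the `bin_width` parameter are hypothetical and not taken from the downloadable project.

```python
from collections import Counter
from typing import Dict, Iterable


def word_start_histogram(word_starts: Iterable[float],
                         bin_width: float = 1.0) -> Dict[int, int]:
    """Count word starts per horizontal bin.

    Keys are bin indices (bin i covers [i * bin_width, (i + 1) * bin_width)),
    values are the number of words that start inside that bin.
    """
    histogram: Counter = Counter()
    for x in word_starts:
        histogram[int(x // bin_width)] += 1
    return dict(histogram)


# Example with made-up coordinates: most words start near X = 72 or X = 310.
starts = [72.0, 72.2, 72.4, 95.0, 130.5, 310.1, 310.3]
histogram = word_start_histogram(starts)
for bin_index in sorted(histogram):
    print(f"{bin_index:4d} {'#' * histogram[bin_index]}")
```

A wider `bin_width` makes the histogram more tolerant of small differences in the X coordinate, at the cost of less precise column edges.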
To build the histogram I use PDFKit.NET, as it allows me to extract all the glyphs on a page. The glyphs give the position of each word, and from that data I can construct the histograms. The picture below shows how the histogram corresponds to the page. The histogram clearly shows two peaks. These peaks are probably the left edges of the columns.
The histogram of the following page has six big peaks and we conclude that the page has six columns.
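Reading the big peaks off a histogram by eye can be approximated in code. The Python sketch below is one possible heuristic, not the method used in the project: it keeps bins whose count reaches a fraction of the tallest bin and merges neighbouring bins into one peak. The function name `find_column_edges` and the `min_fraction` and `merge_distance` parameters are assumptions made for this example.

```python
from typing import Dict, List


def find_column_edges(histogram: Dict[int, int],
                      min_fraction: float = 0.5,
                      merge_distance: int = 2) -> List[int]:
    """Return the bin indices of the big peaks, one per candidate column edge."""
    if not histogram:
        return []
    # A bin is a candidate if it is at least min_fraction of the tallest bin.
    threshold = max(histogram.values()) * min_fraction
    candidates = sorted(b for b, count in histogram.items() if count >= threshold)

    # Merge candidates that lie close together into a single cluster.
    clusters: List[List[int]] = []
    for b in candidates:
        if clusters and b - clusters[-1][-1] <= merge_distance:
            clusters[-1].append(b)
        else:
            clusters.append([b])

    # Represent each cluster by its tallest bin.
    return [max(cluster, key=lambda b: histogram[b]) for cluster in clusters]


# Made-up histogram with two dominant peaks, as on a two-column page.
histogram = {72: 40, 73: 5, 90: 3, 120: 4, 310: 38, 311: 20, 400: 2}
edges = find_column_edges(histogram)
print(len(edges), "candidate columns at bins", edges)
```

The number of returned peaks is only a first guess at the column count: indented paragraphs, tables or lists can produce additional peaks that are not column edges.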
Download this Visual Studio .NET project to construct histograms yourself.
Author: Sergey Zavoloka